HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating...

Bibliographic Details
Main authors: Wharrie, Sophie, Yang, Zhiyu, Raj, Vishnu, Monti, Remo, Gupta, Rahul, Wang, Ying, Martin, Alicia, O’Connor, Luke J, Kaski, Samuel, Marttinen, Pekka, Palamara, Pier Francesco, Lippert, Christoph, Ganna, Andrea
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10493177/
https://www.ncbi.nlm.nih.gov/pubmed/37647640
http://dx.doi.org/10.1093/bioinformatics/btad535
author Wharrie, Sophie
Yang, Zhiyu
Raj, Vishnu
Monti, Remo
Gupta, Rahul
Wang, Ying
Martin, Alicia
O’Connor, Luke J
Kaski, Samuel
Marttinen, Pekka
Palamara, Pier Francesco
Lippert, Christoph
Ganna, Andrea
author_sort Wharrie, Sophie
collection PubMed
description MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods to generate polygenic risk scoring across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
format Online
Article
Text
id pubmed-10493177
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10493177 2023-09-11 HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics. Original Paper.
Oxford University Press 2023-08-30 /pmc/articles/PMC10493177/ /pubmed/37647640 http://dx.doi.org/10.1093/bioinformatics/btad535 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes
topic Original Paper