EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV,...

Bibliographic Details
Main Authors: Zhang, Zhongyang; Cheng, Haoxiang; Hong, Xiumei; Di Narzo, Antonio F; Franzen, Oscar; Peng, Shouneng; Ruusalepp, Arno; Kovacic, Jason C; Bjorkegren, Johan L M; Wang, Xiaobin; Hao, Ke
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6468244/
https://www.ncbi.nlm.nih.gov/pubmed/30722045
http://dx.doi.org/10.1093/nar/gkz068
author Zhang, Zhongyang
Cheng, Haoxiang
Hong, Xiumei
Di Narzo, Antonio F
Franzen, Oscar
Peng, Shouneng
Ruusalepp, Arno
Kovacic, Jason C
Bjorkegren, Johan L M
Wang, Xiaobin
Hao, Ke
collection PubMed
description The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at the raw data level; (b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) using a heuristic algorithm; (c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries using the local correlation structure in copy number intensities; and (e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to those of SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
format Online
Article
Text
id pubmed-6468244
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6468244 2019-04-22 Nucleic Acids Res Methods Online Oxford University Press 2019-04-23 2019-02-05 /pmc/articles/PMC6468244/ /pubmed/30722045 http://dx.doi.org/10.1093/nar/gkz068 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6468244/
https://www.ncbi.nlm.nih.gov/pubmed/30722045
http://dx.doi.org/10.1093/nar/gkz068
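
Step (b) of the description field, assembling calls from several callers into CNV regions (CNVRs), can be pictured with a small sketch. This is not the authors' heuristic, which the record does not spell out; it is a plain overlap merge, used here only as a stand-in. The `Call` type, the caller names, the sample names, and the coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    """One CNV call from one caller (hypothetical record layout)."""
    caller: str
    sample: str
    chrom: str
    start: int
    end: int


def merge_into_regions(calls):
    """Group calls that overlap on the same chromosome into regions.

    Assumption: simple interval merging, standing in for the paper's
    heuristic algorithm, which is not described in this record.
    """
    by_chrom = defaultdict(list)
    for call in calls:
        by_chrom[call.chrom].append(call)

    regions = []
    for chrom in sorted(by_chrom):
        current = None  # region being extended on this chromosome
        for call in sorted(by_chrom[chrom], key=lambda c: (c.start, c.end)):
            if current is not None and call.start <= current["end"]:
                # Overlaps the open region: extend it and record the call.
                current["end"] = max(current["end"], call.end)
                current["calls"].append(call)
            else:
                current = {"chrom": chrom, "start": call.start,
                           "end": call.end, "calls": [call]}
                regions.append(current)
    return regions


# Made-up calls from two hypothetical callers.
calls = [
    Call("callerA", "s1", "chr1", 100, 500),
    Call("callerB", "s1", "chr1", 450, 900),
    Call("callerA", "s2", "chr1", 2000, 2600),
    Call("callerB", "s2", "chr2", 100, 300),
]
regions = merge_into_regions(calls)
for region in regions:
    callers = sorted({c.caller for c in region["calls"]})
    print(region["chrom"], region["start"], region["end"], callers)
```

Each merged region keeps the calls that support it, so a later step could re-genotype every sample within the region, as step (c) of the description does with its likelihood model.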