A random forest-based framework for genotyping and accuracy assessment of copy number variations

Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a fr...

Bibliographic Details
Main authors: Zhuang, Xuehan, Ye, Rui, So, Man-Ting, Lam, Wai-Yee, Karim, Anwarul, Yu, Michelle, Ngo, Ngoc Diem, Cherny, Stacey S, Tam, Paul Kwong-Hang, Garcia-Barcelo, Maria-Mercè, Tang, Clara Sze-man, Sham, Pak Chung
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671382/
https://www.ncbi.nlm.nih.gov/pubmed/33575619
http://dx.doi.org/10.1093/nargab/lqaa071
author Zhuang, Xuehan
Ye, Rui
So, Man-Ting
Lam, Wai-Yee
Karim, Anwarul
Yu, Michelle
Ngo, Ngoc Diem
Cherny, Stacey S
Tam, Paul Kwong-Hang
Garcia-Barcelo, Maria-Mercè
Tang, Clara Sze-man
Sham, Pak Chung
author_sort Zhuang, Xuehan
collection PubMed
description Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a framework, CNV-JACG, for Judging the Accuracy of CNVs and Genotyping using paired-end WGS data. CNV-JACG is based on a random forest model trained on 21 distinctive features characterizing the CNV region and its breakpoints. Using the data from the 1000 Genomes Project, Genome in a Bottle Consortium, the Human Genome Structural Variation Consortium and in-house technical replicates, we show that CNV-JACG has superior sensitivity over the latest genotyping method, SV², particularly for the small CNVs (≤1 kb). We also demonstrate that CNV-JACG outperforms SV² in terms of Mendelian inconsistency in trios and concordance between technical replicates. Our study suggests that CNV-JACG would be a useful tool in assessing the accuracy of CNVs to meet the ever-growing needs for uncovering the missing heritability linked to CNVs.
format Online
Article
Text
id pubmed-7671382
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7671382 2021-02-10 A random forest-based framework for genotyping and accuracy assessment of copy number variations Zhuang, Xuehan Ye, Rui So, Man-Ting Lam, Wai-Yee Karim, Anwarul Yu, Michelle Ngo, Ngoc Diem Cherny, Stacey S Tam, Paul Kwong-Hang Garcia-Barcelo, Maria-Mercè Tang, Clara Sze-man Sham, Pak Chung NAR Genom Bioinform Methart Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a framework, CNV-JACG, for Judging the Accuracy of CNVs and Genotyping using paired-end WGS data. CNV-JACG is based on a random forest model trained on 21 distinctive features characterizing the CNV region and its breakpoints. Using the data from the 1000 Genomes Project, Genome in a Bottle Consortium, the Human Genome Structural Variation Consortium and in-house technical replicates, we show that CNV-JACG has superior sensitivity over the latest genotyping method, SV², particularly for the small CNVs (≤1 kb). We also demonstrate that CNV-JACG outperforms SV² in terms of Mendelian inconsistency in trios and concordance between technical replicates. Our study suggests that CNV-JACG would be a useful tool in assessing the accuracy of CNVs to meet the ever-growing needs for uncovering the missing heritability linked to CNVs. Oxford University Press 2020-09-22 /pmc/articles/PMC7671382/ /pubmed/33575619 http://dx.doi.org/10.1093/nargab/lqaa071 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title A random forest-based framework for genotyping and accuracy assessment of copy number variations
topic Methart
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671382/
https://www.ncbi.nlm.nih.gov/pubmed/33575619
http://dx.doi.org/10.1093/nargab/lqaa071