A random forest-based framework for genotyping and accuracy assessment of copy number variations
Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a fr...
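Main authors: | Zhuang, Xuehan; Ye, Rui; So, Man-Ting; Lam, Wai-Yee; Karim, Anwarul; Yu, Michelle; Ngo, Ngoc Diem; Cherny, Stacey S; Tam, Paul Kwong-Hang; Garcia-Barcelo, Maria-Mercè; Tang, Clara Sze-man; Sham, Pak Chung |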
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2020 |
Subjects: | Methart |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671382/ https://www.ncbi.nlm.nih.gov/pubmed/33575619 http://dx.doi.org/10.1093/nargab/lqaa071 |
_version_ | 1783610918853148672 |
---|---|
author | Zhuang, Xuehan; Ye, Rui; So, Man-Ting; Lam, Wai-Yee; Karim, Anwarul; Yu, Michelle; Ngo, Ngoc Diem; Cherny, Stacey S; Tam, Paul Kwong-Hang; Garcia-Barcelo, Maria-Mercè; Tang, Clara Sze-man; Sham, Pak Chung |
author_sort | Zhuang, Xuehan |
collection | PubMed |
description | Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a framework, CNV-JACG, for Judging the Accuracy of CNVs and Genotyping using paired-end WGS data. CNV-JACG is based on a random forest model trained on 21 distinctive features characterizing the CNV region and its breakpoints. Using the data from the 1000 Genomes Project, Genome in a Bottle Consortium, the Human Genome Structural Variation Consortium and in-house technical replicates, we show that CNV-JACG has superior sensitivity over the latest genotyping method, SV², particularly for the small CNVs (≤1 kb). We also demonstrate that CNV-JACG outperforms SV² in terms of Mendelian inconsistency in trios and concordance between technical replicates. Our study suggests that CNV-JACG would be a useful tool in assessing the accuracy of CNVs to meet the ever-growing needs for uncovering the missing heritability linked to CNVs. |
format | Online Article Text |
id | pubmed-7671382 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-7671382 2021-02-10 A random forest-based framework for genotyping and accuracy assessment of copy number variations Zhuang, Xuehan; Ye, Rui; So, Man-Ting; Lam, Wai-Yee; Karim, Anwarul; Yu, Michelle; Ngo, Ngoc Diem; Cherny, Stacey S; Tam, Paul Kwong-Hang; Garcia-Barcelo, Maria-Mercè; Tang, Clara Sze-man; Sham, Pak Chung NAR Genom Bioinform Methart Oxford University Press 2020-09-22 /pmc/articles/PMC7671382/ /pubmed/33575619 http://dx.doi.org/10.1093/nargab/lqaa071 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | A random forest-based framework for genotyping and accuracy assessment of copy number variations |
topic | Methart |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671382/ https://www.ncbi.nlm.nih.gov/pubmed/33575619 http://dx.doi.org/10.1093/nargab/lqaa071 |