ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest

Bibliographic Details
Main Authors: Li, Jiajin; Jew, Brandon; Zhan, Lingyu; Hwang, Sungoo; Coppola, Giovanni; Freimer, Nelson B.; Sul, Jae Hoon
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6938691/
https://www.ncbi.nlm.nih.gov/pubmed/31851693
http://dx.doi.org/10.1371/journal.pcbi.1007556
author Li, Jiajin
Jew, Brandon
Zhan, Lingyu
Hwang, Sungoo
Coppola, Giovanni
Freimer, Nelson B.
Sul, Jae Hoon
collection PubMed
description Next-generation sequencing technology (NGS) enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses the information on sequencing quality, such as sequencing depth, genotyping quality, and GC contents, to predict whether a particular variant is likely to be false-positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical approach to perform quality control on genetic variants from sequencing data.
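The combination described in the abstract, a filtering step followed by a classifier trained on sequencing-quality features, can be illustrated with a minimal sketch. The abstract does not give the filters, thresholds, or data layout, so everything below is an assumption made for illustration: the three-way split into good, bad, and undetermined variants, the `missing_rate` filter and its cutoffs, the toy variants, and all function names. To stay within the Python standard library, the sketch also swaps in a bagged ensemble of one-split decision stumps for a full random forest.

```python
import random
from collections import Counter

# Toy variants. "missing_rate" drives the hypothetical hard filter; "features"
# is (mean depth, mean genotype quality, GC content). All numbers are invented.
VARIANTS = [
    {"id": "v1", "missing_rate": 0.00, "features": (32.0, 95.0, 0.41)},
    {"id": "v2", "missing_rate": 0.00, "features": (30.0, 92.0, 0.45)},
    {"id": "v3", "missing_rate": 0.01, "features": (35.0, 97.0, 0.48)},
    {"id": "v4", "missing_rate": 0.005, "features": (29.0, 90.0, 0.43)},
    {"id": "v5", "missing_rate": 0.15, "features": (8.0, 35.0, 0.68)},
    {"id": "v6", "missing_rate": 0.20, "features": (11.0, 42.0, 0.72)},
    {"id": "v7", "missing_rate": 0.12, "features": (6.0, 30.0, 0.66)},
    {"id": "v8", "missing_rate": 0.30, "features": (12.0, 45.0, 0.75)},
    {"id": "v9", "missing_rate": 0.04, "features": (36.0, 98.0, 0.40)},
    {"id": "v10", "missing_rate": 0.05, "features": (5.0, 25.0, 0.78)},
]


def filter_label(variant, good_max=0.01, bad_min=0.10):
    """Filtering step (assumed cutoffs): good, bad, or undetermined."""
    rate = variant["missing_rate"]
    if rate <= good_max:
        return "good"
    if rate >= bad_min:
        return "bad"
    return "undetermined"


def fit_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample using one random feature."""
    n = len(rows)
    while True:  # redraw until the bootstrap sample contains both classes
        idx = [rng.randrange(n) for _ in range(n)]
        if len({labels[i] for i in idx}) == 2:
            break
    feature = rng.randrange(len(rows[0]))
    best = None
    for i in idx:
        threshold = rows[i][feature]
        # Try both orientations: values >= threshold are "good", or are "bad".
        for high, low in (("good", "bad"), ("bad", "good")):
            errors = sum(
                (high if rows[j][feature] >= threshold else low) != labels[j]
                for j in idx
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, high, low)
    return best[1:]


def classify(ensemble, row):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(high if row[f] >= t else low for f, t, high, low in ensemble)
    return votes.most_common(1)[0][0]


def run_qc(variants, n_stumps=25, seed=0):
    """Label variants by filter, then classify the undetermined ones."""
    rng = random.Random(seed)
    labels = {v["id"]: filter_label(v) for v in variants}
    train = [v for v in variants if labels[v["id"]] != "undetermined"]
    rows = [v["features"] for v in train]
    y = [labels[v["id"]] for v in train]
    ensemble = [fit_stump(rows, y, rng) for _ in range(n_stumps)]
    for v in variants:
        if labels[v["id"]] == "undetermined":
            labels[v["id"]] = classify(ensemble, v["features"])
    return labels


if __name__ == "__main__":
    for variant_id, label in run_qc(VARIANTS).items():
        print(variant_id, label)
```

In this sketch the filter supplies the training labels, so no external truth set is needed; only the variants the filter cannot decide are passed to the classifier. An odd number of stumps avoids tied votes between the two classes.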
format Online
Article
Text
id pubmed-6938691
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6938691 2020-01-07 ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest Li, Jiajin Jew, Brandon Zhan, Lingyu Hwang, Sungoo Coppola, Giovanni Freimer, Nelson B. Sul, Jae Hoon PLoS Comput Biol Research Article
Public Library of Science 2019-12-18 /pmc/articles/PMC6938691/ /pubmed/31851693 http://dx.doi.org/10.1371/journal.pcbi.1007556 Text en © 2019 Li et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6938691/
https://www.ncbi.nlm.nih.gov/pubmed/31851693
http://dx.doi.org/10.1371/journal.pcbi.1007556