Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments

BACKGROUND: Next-generation sequencing enables massively parallel processing, allowing lower cost than the other sequencing technologies. In the subsequent analysis with the NGS data, one of the major concerns is the reliability of variant calls. Although researchers can utilize raw quality scores o...

Bibliographic Details
Main Authors:	Cosgun, Erdal, Oh, Min
Format:	Online Article Text
Language:	English
Published:	Hindawi 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7061114/ https://www.ncbi.nlm.nih.gov/pubmed/32219145 http://dx.doi.org/10.1155/2020/8531502

_version_	1783504346177077248
author	Cosgun, Erdal Oh, Min
author_facet	Cosgun, Erdal Oh, Min
author_sort	Cosgun, Erdal
collection	PubMed
description	BACKGROUND: Next-generation sequencing enables massively parallel processing, allowing lower cost than the other sequencing technologies. In the subsequent analysis with the NGS data, one of the major concerns is the reliability of variant calls. Although researchers can utilize raw quality scores of variant calling, they are forced to start the further analysis without any preevaluation of the quality scores. METHOD: We presented a machine learning approach for estimating quality scores of variant calls derived from BWA+GATK. We analyzed correlations between the quality score and these annotations, specifying informative annotations which were used as features to predict variant quality scores. To test the predictive models, we simulated 24 paired-end Illumina sequencing reads with 30x coverage base. Also, twenty-four human genome sequencing reads resulting from Illumina paired-end sequencing with at least 30x coverage were secured from the Sequence Read Archive. RESULTS: Using BWA+GATK, VCFs were derived from simulated and real sequencing reads. We observed that the prediction models learned by RFR outperformed other algorithms in both simulated and real data. The quality scores of variant calls were highly predictable from informative features of GATK Annotation Modules in the simulated human genome VCF data (R2: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The robustness of the proposed data-driven models was consistently maintained in the real human genome VCF data (R2: 97.8% and 96.5% for RFR and MLR, respectively).
format	Online Article Text
id	pubmed-7061114
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Hindawi
record_format	MEDLINE/PubMed
spelling	pubmed-70611142020-03-26 Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments Cosgun, Erdal Oh, Min Biomed Res Int Research Article BACKGROUND: Next-generation sequencing enables massively parallel processing, allowing lower cost than the other sequencing technologies. In the subsequent analysis with the NGS data, one of the major concerns is the reliability of variant calls. Although researchers can utilize raw quality scores of variant calling, they are forced to start the further analysis without any preevaluation of the quality scores. METHOD: We presented a machine learning approach for estimating quality scores of variant calls derived from BWA+GATK. We analyzed correlations between the quality score and these annotations, specifying informative annotations which were used as features to predict variant quality scores. To test the predictive models, we simulated 24 paired-end Illumina sequencing reads with 30x coverage base. Also, twenty-four human genome sequencing reads resulting from Illumina paired-end sequencing with at least 30x coverage were secured from the Sequence Read Archive. RESULTS: Using BWA+GATK, VCFs were derived from simulated and real sequencing reads. We observed that the prediction models learned by RFR outperformed other algorithms in both simulated and real data. The quality scores of variant calls were highly predictable from informative features of GATK Annotation Modules in the simulated human genome VCF data (R2: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The robustness of the proposed data-driven models was consistently maintained in the real human genome VCF data (R2: 97.8% and 96.5% for RFR and MLR, respectively). Hindawi 2020-02-25 /pmc/articles/PMC7061114/ /pubmed/32219145 http://dx.doi.org/10.1155/2020/8531502 Text en Copyright © 2020 Erdal Cosgun and Min Oh. 
http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Cosgun, Erdal Oh, Min Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments
title	Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments
title_full	Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments
title_fullStr	Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments
title_full_unstemmed	Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments
title_short	Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments
title_sort	exploring the consistency of the quality scores with machine learning for next-generation sequencing experiments
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7061114/ https://www.ncbi.nlm.nih.gov/pubmed/32219145 http://dx.doi.org/10.1155/2020/8531502
work_keys_str_mv	AT cosgunerdal exploringtheconsistencyofthequalityscoreswithmachinelearningfornextgenerationsequencingexperiments AT ohmin exploringtheconsistencyofthequalityscoreswithmachinelearningfornextgenerationsequencingexperiments

Similar Items