
A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis

BACKGROUND: Advances in high-throughput next-generation sequencing (NGS) have stimulated innumerable opportunities for new genomic research. However, the abundance of NGS data makes it difficult to distinguish true biological variants from sequencing errors during...


Bibliographic Details
Main authors:	Akogwu, Isaac; Wang, Nan; Zhang, Chaoyang; Gong, Ping
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965716/ https://www.ncbi.nlm.nih.gov/pubmed/27461106 http://dx.doi.org/10.1186/s40246-016-0068-0

author	Akogwu, Isaac; Wang, Nan; Zhang, Chaoyang; Gong, Ping
author_facet	Akogwu, Isaac; Wang, Nan; Zhang, Chaoyang; Gong, Ping
author_sort	Akogwu, Isaac
collection	PubMed
description	BACKGROUND: Advances in high-throughput next-generation sequencing (NGS) have stimulated innumerable opportunities for new genomic research. However, the abundance of NGS data makes it difficult to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but an independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and provides recommendations for users selecting suitable methods for specific NGS datasets. METHODS: Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method. RESULTS: Computational experiments indicate that Musket had the best overall performance across the range of examined variables reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. CONCLUSIONS: This study demonstrates that various factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.
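The abstract names the evaluation metrics but this record does not give their formulas. As a rough sketch, the derived metrics can be computed from the three base counts as shown here. The definitions are the conventional ones and are an assumption, not taken from this record: recall = TP/(TP+FN), precision = TP/(TP+FP), gain = (TP-FP)/(TP+FN), and F-score as the harmonic mean of precision and recall. The names `CorrectionCounts` and `correction_metrics` are hypothetical and are not part of ECET.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionCounts:
    """Base-level counts for one error correction run (hypothetical container)."""
    tp: int  # erroneous bases that were correctly fixed
    fp: int  # correct bases that were wrongly changed
    fn: int  # erroneous bases left uncorrected or fixed to the wrong base


def correction_metrics(c: CorrectionCounts) -> dict:
    """Derive recall, precision, gain, and F-score from TP/FP/FN.

    Assumed conventional definitions; a zero denominator yields 0.0.
    """
    errors = c.tp + c.fn   # all erroneous bases in the input
    changes = c.tp + c.fp  # all bases the corrector modified

    recall = c.tp / errors if errors else 0.0
    precision = c.tp / changes if changes else 0.0
    # Gain is negative when a method introduces more errors than it removes.
    gain = (c.tp - c.fp) / errors if errors else 0.0
    pr_sum = precision + recall
    f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0

    return {
        "recall": recall,
        "precision": precision,
        "gain": gain,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration (not results from the study).
    print(correction_metrics(CorrectionCounts(tp=80, fp=10, fn=20)))
```

Under these definitions, gain penalizes false positives directly, so a method can have high recall and still show low or negative gain if it miscorrects many correct bases.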
format	Online Article Text
id	pubmed-4965716
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4965716 2016-08-02 A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis Akogwu, Isaac; Wang, Nan; Zhang, Chaoyang; Gong, Ping Hum Genomics Research BACKGROUND: Advances in high-throughput next-generation sequencing (NGS) have stimulated innumerable opportunities for new genomic research. However, the abundance of NGS data makes it difficult to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but an independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and provides recommendations for users selecting suitable methods for specific NGS datasets. METHODS: Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method. RESULTS: Computational experiments indicate that Musket had the best overall performance across the range of examined variables reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. CONCLUSIONS: This study demonstrates that various factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods. BioMed Central 2016-07-25 /pmc/articles/PMC4965716/ /pubmed/27461106 http://dx.doi.org/10.1186/s40246-016-0068-0 Text en © Akogwu et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965716/ https://www.ncbi.nlm.nih.gov/pubmed/27461106 http://dx.doi.org/10.1186/s40246-016-0068-0


Similar items