
Mining statistically-solid k-mers for accurate NGS error correction


Bibliographic Details
Main authors: Zhao, Liang; Xie, Jin; Bai, Lin; Chen, Wen; Wang, Mingju; Zhang, Zhonglei; Wang, Yiqi; Zhao, Zhe; Li, Jinyan
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311904/
https://www.ncbi.nlm.nih.gov/pubmed/30598110
http://dx.doi.org/10.1186/s12864-018-5272-y
author Zhao, Liang
Xie, Jin
Bai, Lin
Chen, Wen
Wang, Mingju
Zhang, Zhonglei
Wang, Yiqi
Zhao, Zhe
Li, Jinyan
collection PubMed
description BACKGROUND: NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f(0) to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance. RESULTS: We propose using a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model those of correct k-mers, and combining the two to determine f(0). To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods. CONCLUSION: The z-score is adequate for distinguishing solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers of very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
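The z-score step mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which mean and standard deviation the z-score is taken against, so this sketch assumes the global mean and population standard deviation of all k-mer counts. The function names (count_kmers, kmer_z_scores, select_solid) and the z_cutoff parameter are hypothetical and are not taken from the article. The Gamma/Gaussian mixture fit for f(0) and the Bloom filter are not shown.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_z_scores(counts):
    """Map each k-mer to its z-score.

    Assumption: the reference mean and standard deviation are computed
    over the counts of all observed k-mers (not the article's definition).
    """
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All k-mers have the same count, so none deviates from the mean.
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}


def select_solid(counts, z_cutoff):
    """Return the k-mers whose z-score is at least z_cutoff (hypothetical rule)."""
    scores = kmer_z_scores(counts)
    return {kmer for kmer, z in scores.items() if z >= z_cutoff}


if __name__ == "__main__":
    # Toy data: five identical reads plus one read with a changed last base.
    reads = ["ACGT"] * 5 + ["ACGA"]
    counts = count_kmers(reads, k=3)
    print(sorted(select_solid(counts, z_cutoff=0.0)))
```

In the toy data, the 3-mer introduced by the single divergent read has a count well below the mean and is therefore left out of the selected set, while the two frequent 3-mers are kept.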
format Online
Article
Text
id pubmed-6311904
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6311904 2019-01-07 Mining statistically-solid k-mers for accurate NGS error correction Zhao, Liang; Xie, Jin; Bai, Lin; Chen, Wen; Wang, Mingju; Zhang, Zhonglei; Wang, Yiqi; Zhao, Zhe; Li, Jinyan BMC Genomics Research
BioMed Central 2018-12-31 /pmc/articles/PMC6311904/ /pubmed/30598110 http://dx.doi.org/10.1186/s12864-018-5272-y Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Mining statistically-solid k-mers for accurate NGS error correction
topic Research