Mining statistically-solid k-mers for accurate NGS error correction
BACKGROUND: NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain er...
Main authors: | Zhao, Liang; Xie, Jin; Bai, Lin; Chen, Wen; Wang, Mingju; Zhang, Zhonglei; Wang, Yiqi; Zhao, Zhe; Li, Jinyan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311904/ https://www.ncbi.nlm.nih.gov/pubmed/30598110 http://dx.doi.org/10.1186/s12864-018-5272-y |
_version_ | 1783383697809997824 |
---|---|
author | Zhao, Liang; Xie, Jin; Bai, Lin; Chen, Wen; Wang, Mingju; Zhang, Zhonglei; Wang, Yiqi; Zhao, Zhe; Li, Jinyan |
author_facet | Zhao, Liang; Xie, Jin; Bai, Lin; Chen, Wen; Wang, Mingju; Zhang, Zhonglei; Wang, Yiqi; Zhao, Zhe; Li, Jinyan |
author_sort | Zhao, Liang |
collection | PubMed |
description | BACKGROUND: NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f(0) to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors into the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance. RESULTS: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f(0). To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures the number of standard deviations its frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-the-art methods, tested on both real and synthetic NGS data sets. CONCLUSION: The z-score is adequate to distinguish solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers with very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy. |
format | Online Article Text |
id | pubmed-6311904 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6311904 2019-01-07 Mining statistically-solid k-mers for accurate NGS error correction Zhao, Liang; Xie, Jin; Bai, Lin; Chen, Wen; Wang, Mingju; Zhang, Zhonglei; Wang, Yiqi; Zhao, Zhe; Li, Jinyan BMC Genomics Research BioMed Central 2018-12-31 /pmc/articles/PMC6311904/ /pubmed/30598110 http://dx.doi.org/10.1186/s12864-018-5272-y Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Mining statistically-solid k-mers for accurate NGS error correction |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311904/ https://www.ncbi.nlm.nih.gov/pubmed/30598110 http://dx.doi.org/10.1186/s12864-018-5272-y |
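
The description in this record mentions a frequency cutoff that separates solid from weak k-mers, and a z-score that measures how many standard deviations a k-mer's frequency lies from the mean. The sketch below is only an illustration of those two ideas, not the authors' method: the function names, the toy reads, the cutoff value, and the choice to compute the mean and standard deviation over all observed k-mer frequencies are assumptions made here. The Gamma and Gaussian mixture fitting used to choose f(0), and the Bloom filter, are not reproduced.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, cutoff):
    """Call a k-mer solid if its count reaches the cutoff, weak otherwise."""
    solid = {kmer for kmer, c in counts.items() if c >= cutoff}
    weak = set(counts) - solid
    return solid, weak


def z_scores(counts):
    """z-score of each k-mer frequency.

    Assumption: the reference mean and standard deviation are taken over
    all observed k-mer frequencies; the paper may define them differently.
    """
    freqs = list(counts.values())
    mu = mean(freqs)
    sigma = pstdev(freqs)
    if sigma == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}


# Toy input (hypothetical): the second read ends in A instead of T.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, cutoff=3)  # cutoff chosen by hand here
scores = z_scores(counts)

for kmer in sorted(counts):
    label = "solid" if kmer in solid else "weak"
    print(f"{kmer}  count={counts[kmer]}  z={scores[kmer]:+.2f}  {label}")
```

In this toy input the only k-mer unique to the altered read, `ACGA`, falls below the cutoff and receives a negative z-score. A real pipeline would derive the cutoff from the fitted distributions rather than fixing it by hand.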