
Finding optimal threshold for correction error reads in DNA assembling

BACKGROUND: DNA assembling is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. Experimental reads may contain errors that degrade the performance of DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct...


Bibliographic Details
Main authors:	Chin, Francis YL, Leung, Henry CM, Li, Wei-Lin, Yiu, Siu-Ming
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648749/ https://www.ncbi.nlm.nih.gov/pubmed/19208114 http://dx.doi.org/10.1186/1471-2105-10-S1-S15

_version_	1782164979183517696
author	Chin, Francis YL Leung, Henry CM Li, Wei-Lin Yiu, Siu-Ming
author_facet	Chin, Francis YL Leung, Henry CM Li, Wei-Lin Yiu, Siu-Ming
author_sort	Chin, Francis YL
collection	PubMed
description	BACKGROUND: DNA assembling is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. Experimental reads may contain errors that degrade the performance of DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct erroneous reads by counting how many times each length-k substring of the reads appears in the input. They treat length-k substrings that appear at least M times as correct and correct the erroneous reads based on these substrings. However, since the threshold M is chosen without solid theoretical analysis, these algorithms cannot guarantee their error-correction performance. RESULTS: In this paper, we propose a method to calculate the probabilities of false positives and false negatives when determining whether a length-k substring is correct using threshold M. Based on this calculation, we find the optimal threshold M that minimizes the total errors (false positives plus false negatives). Experimental results on both real and simulated data showed that our calculation is correct and that we can reduce the total number of error substrings by 77.6% and 65.1% compared to ECINDEL and SRCorr, respectively. CONCLUSION: We introduced a method to calculate the probabilities of false positives and false negatives for length-k substrings under different thresholds. Based on this calculation, we found the optimal threshold that minimizes the total error (false positives plus false negatives).
format	Text
id	pubmed-2648749
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2648749 2009-03-03 Finding optimal threshold for correction error reads in DNA assembling Chin, Francis YL Leung, Henry CM Li, Wei-Lin Yiu, Siu-Ming BMC Bioinformatics Research BioMed Central 2009-01-30 /pmc/articles/PMC2648749/ /pubmed/19208114 http://dx.doi.org/10.1186/1471-2105-10-S1-S15 Text en Copyright © 2009 Chin et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Chin, Francis YL Leung, Henry CM Li, Wei-Lin Yiu, Siu-Ming Finding optimal threshold for correction error reads in DNA assembling
title	Finding optimal threshold for correction error reads in DNA assembling
title_full	Finding optimal threshold for correction error reads in DNA assembling
title_fullStr	Finding optimal threshold for correction error reads in DNA assembling
title_full_unstemmed	Finding optimal threshold for correction error reads in DNA assembling
title_short	Finding optimal threshold for correction error reads in DNA assembling
title_sort	finding optimal threshold for correction error reads in dna assembling
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648749/ https://www.ncbi.nlm.nih.gov/pubmed/19208114 http://dx.doi.org/10.1186/1471-2105-10-S1-S15
work_keys_str_mv	AT chinfrancisyl findingoptimalthresholdforcorrectionerrorreadsindnaassembling AT leunghenrycm findingoptimalthresholdforcorrectionerrorreadsindnaassembling AT liweilin findingoptimalthresholdforcorrectionerrorreadsindnaassembling AT yiusiuming findingoptimalthresholdforcorrectionerrorreadsindnaassembling
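
Illustrative sketch (not part of the record): the description refers to counting length-k substrings and treating those seen at least M times as correct. The Python below shows that counting step and a brute-force search for the threshold with the fewest false positives plus false negatives. It assumes the set of true substrings is known, which only holds for simulated data; the paper instead calculates the error probabilities analytically, and that calculation is not reproduced here. All function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, m):
    """Return the k-mers seen at least m times (treated as correct)."""
    return {kmer for kmer, c in counts.items() if c >= m}


def total_error(counts, true_kmers, m):
    """False positives plus false negatives for threshold m."""
    accepted = solid_kmers(counts, m)
    false_pos = len(accepted - true_kmers)  # erroneous k-mers accepted
    false_neg = len(true_kmers - accepted)  # genuine k-mers rejected
    return false_pos + false_neg


def best_threshold(counts, true_kmers, max_m):
    """Smallest m in 1..max_m with the lowest empirical total error."""
    return min(range(1, max_m + 1),
               key=lambda m: total_error(counts, true_kmers, m))


if __name__ == "__main__":
    # Toy data (assumed): three error-free reads and one read with a substitution.
    reads = ["ACGTTGCA"] * 3 + ["ACGATGCA"]
    counts = count_kmers(reads, 4)
    true_kmers = set(count_kmers(["ACGTTGCA"], 4))
    print(best_threshold(counts, true_kmers, 4))
```

On this toy input a threshold of 1 accepts the four erroneous substrings, while a threshold of 4 rejects four genuine ones, so the search settles on an intermediate value.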


Similar items