Decreasing the number of false positives in sequence classification

BACKGROUND: A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the seque...

Bibliographic Details
Main authors:	Machado-Lima, Ariane; Kashiwabara, André Yoshiaki; Durham, Alan Mitchell
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3045793/ https://www.ncbi.nlm.nih.gov/pubmed/21210966 http://dx.doi.org/10.1186/1471-2164-11-S5-S10

author	Machado-Lima, Ariane; Kashiwabara, André Yoshiaki; Durham, Alan Mitchell
author_sort	Machado-Lima, Ariane
collection	PubMed
description	BACKGROUND: A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, in which the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can serve as the alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions, which include the uniform distribution, the genomic distribution, the family-specific distribution and the target sequence distribution. This paper presents a study evaluating the impact of the choice of null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification, a crucial issue for reducing the costs of biological validation. RESULTS: In all tests, the target null model produced the lowest number of false positives when random sequences were used as the test set. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results. CONCLUSIONS: Of the evaluated models, the best suited for classification are the uniform model and the target model. However, the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases the target model is more dependable for biological validation due to its higher specificity.
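The comparison described in the abstract can be illustrated with a minimal sketch, under strong simplifying assumptions. The family model here is a hypothetical position-independent GC-rich distribution (the tools named in the abstract use richer models such as HMMs), the decision threshold of 0 bits is assumed, and the names `FAMILY`, `target_null`, `log_odds` and `classify` are illustrative only, not taken from the paper. The sketch scores one GC-only sequence against the uniform null model and against a target null model estimated from the scored sequence itself.

```python
import math
from collections import Counter

# Hypothetical GC-rich family model (position-independent, illustration only).
FAMILY = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

# Uniform null model: every base is equally likely.
UNIFORM = {base: 0.25 for base in "ACGT"}


def target_null(seq):
    """Null model estimated from the residue frequencies of the scored sequence."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in counts}


def log_odds(seq, model, null):
    """Log-odds score in bits: log2 P(seq | model) - log2 P(seq | null)."""
    return sum(math.log2(model[base] / null[base]) for base in seq)


def classify(seq, model, null, threshold=0.0):
    """Accept the sequence when its log-odds score exceeds the (assumed) threshold."""
    return log_odds(seq, model, null) > threshold


# A sequence with extreme compositional bias (G and C only).
gc_rich = "GCGGCCGCGGCGCCGG"

score_uniform = log_odds(gc_rich, FAMILY, UNIFORM)
score_target = log_odds(gc_rich, FAMILY, target_null(gc_rich))

print(f"uniform null: {score_uniform:.2f} bits, accepted={classify(gc_rich, FAMILY, UNIFORM)}")
print(f"target null:  {score_target:.2f} bits, accepted={classify(gc_rich, FAMILY, target_null(gc_rich))}")
```

In this toy setting the uniform null accepts the GC-only sequence on composition alone, while the target null does not. With a position-independent family model the target-null score can never be positive, so the sketch only shows the direction of the GC bias and does not reproduce the study's experiments.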
format	Text
id	pubmed-3045793
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3045793 2011-03-01 Decreasing the number of false positives in sequence classification Machado-Lima, Ariane; Kashiwabara, André Yoshiaki; Durham, Alan Mitchell BMC Genomics Proceedings BioMed Central 2010-12-22 /pmc/articles/PMC3045793/ /pubmed/21210966 http://dx.doi.org/10.1186/1471-2164-11-S5-S10 Text en Copyright ©2010 Durham et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Decreasing the number of false positives in sequence classification
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3045793/ https://www.ncbi.nlm.nih.gov/pubmed/21210966 http://dx.doi.org/10.1186/1471-2164-11-S5-S10