
Computational evaluation of TIS annotation for prokaryotic genomes

BACKGROUND: Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks. RESULTS: Based on a...


Bibliographic Details
Main authors: Hu, Gang-Qing; Zheng, Xiaobin; Ju, Li-Ning; Zhu, Huaiqiu; She, Zhen-Su
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2362131/
https://www.ncbi.nlm.nih.gov/pubmed/18366730
http://dx.doi.org/10.1186/1471-2105-9-160
_version_ 1782153385014722560
author Hu, Gang-Qing
Zheng, Xiaobin
Ju, Li-Ning
Zhu, Huaiqiu
She, Zhen-Su
author_facet Hu, Gang-Qing
Zheng, Xiaobin
Ju, Li-Ning
Zhu, Huaiqiu
She, Zhen-Su
author_sort Hu, Gang-Qing
collection PubMed
description BACKGROUND: Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks. RESULTS: Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for large-scale quantitative assessment of the reliability of TIS annotations for any prokaryotic genome. The method models a positional weight matrix (PWM) of aligned sequences around predicted TISs as a linear combination of three elementary PWMs, one for true TISs and two for false TISs. The three elementary PWMs are obtained using a reference set with highly reliable TIS predictions. A generalized least squares estimator determines the weighting of the true TIS in the observed PWM, from which the accuracy of the prediction is derived. The validity of the method and the limitations of its assumptions are explicitly addressed by testing on experimentally verified TISs with reference sets of variable accuracy. The method is applied to estimate the accuracy of TIS annotations provided by public databases such as RefSeq and ProTISA and by programs such as EasyGene, GeneMarkS, Glimmer 3 and TiCo. It is shown that RefSeq's TIS prediction is significantly less accurate than two recent predictors, TiCo and ProTISA. With convincing evidence, we show two general preferential biases in the RefSeq annotation, i.e. over-annotating the longest open reading frame (LORF) and under-annotating the ATG start codon. Finally, we have established a new TIS database, SupTISA, based on the best prediction of all the predictors; SupTISA has achieved an average accuracy of 92% over all 532 complete genomes.
CONCLUSION: Large-scale computational evaluation of TIS annotation has been achieved. A new TIS database much better than RefSeq has been constructed, and it provides a valuable resource for further TIS studies.
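The decomposition described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the elementary PWMs are random synthetic matrices, the names `estimate_mixture_weights`, `pwm_false_a` and `pwm_false_b` are hypothetical, and the noise covariance defaults to the identity, in which case the generalized least squares estimate reduces to ordinary least squares. How the published method builds the covariance and converts the weight into an accuracy figure is not stated in this record, so the sketch only reports the recovered weight on the true-TIS component.

```python
import numpy as np


def estimate_mixture_weights(observed, components, cov=None):
    """Generalized least squares estimate of weights w such that
    observed ~= sum_k w[k] * components[k].

    `cov` is an assumed noise covariance over the flattened PWM entries;
    when omitted, the identity is used and the estimate equals ordinary
    least squares.
    """
    y = np.asarray(observed, dtype=float).ravel()
    # Each flattened elementary PWM becomes one column of the design matrix.
    X = np.stack([np.asarray(c, dtype=float).ravel() for c in components], axis=1)
    if cov is None:
        cov = np.eye(y.size)
    cov_inv = np.linalg.inv(cov)
    # Normal equations of GLS: (X^T C^-1 X) w = X^T C^-1 y
    lhs = X.T @ cov_inv @ X
    rhs = X.T @ cov_inv @ y
    return np.linalg.solve(lhs, rhs)


rng = np.random.default_rng(0)


def random_pwm(length=20):
    # Synthetic PWM: each row is a probability distribution over A, C, G, T.
    return rng.dirichlet(np.ones(4), size=length)


# Hypothetical elementary PWMs: one for true TISs, two for false TISs.
pwm_true, pwm_false_a, pwm_false_b = (random_pwm() for _ in range(3))
components = [pwm_true, pwm_false_a, pwm_false_b]

# Build an "observed" PWM as a known mixture, then try to recover the mixture.
known_weights = np.array([0.80, 0.15, 0.05])
observed = sum(w * p for w, p in zip(known_weights, components))

weights = estimate_mixture_weights(observed, components)
true_tis_weight = weights[0]
print("estimated weights:", np.round(weights, 4))
```

Because the synthetic observation is an exact, noise-free mixture, the estimator should recover the known weights up to floating-point error. With real annotation data the observed PWM would be noisy, and the choice of covariance would matter.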
format Text
id pubmed-2362131
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2362131 2008-05-01 Computational evaluation of TIS annotation for prokaryotic genomes Hu, Gang-Qing Zheng, Xiaobin Ju, Li-Ning Zhu, Huaiqiu She, Zhen-Su BMC Bioinformatics Methodology Article BioMed Central 2008-03-25 /pmc/articles/PMC2362131/ /pubmed/18366730 http://dx.doi.org/10.1186/1471-2105-9-160 Text en Copyright © 2008 Hu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Hu, Gang-Qing
Zheng, Xiaobin
Ju, Li-Ning
Zhu, Huaiqiu
She, Zhen-Su
Computational evaluation of TIS annotation for prokaryotic genomes
title Computational evaluation of TIS annotation for prokaryotic genomes
title_full Computational evaluation of TIS annotation for prokaryotic genomes
title_fullStr Computational evaluation of TIS annotation for prokaryotic genomes
title_full_unstemmed Computational evaluation of TIS annotation for prokaryotic genomes
title_short Computational evaluation of TIS annotation for prokaryotic genomes
title_sort computational evaluation of tis annotation for prokaryotic genomes
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2362131/
https://www.ncbi.nlm.nih.gov/pubmed/18366730
http://dx.doi.org/10.1186/1471-2105-9-160
work_keys_str_mv AT hugangqing computationalevaluationoftisannotationforprokaryoticgenomes
AT zhengxiaobin computationalevaluationoftisannotationforprokaryoticgenomes
AT julining computationalevaluationoftisannotationforprokaryoticgenomes
AT zhuhuaiqiu computationalevaluationoftisannotationforprokaryoticgenomes
AT shezhensu computationalevaluationoftisannotationforprokaryoticgenomes