
Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets

Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can b...


Bibliographic Details
Main authors: Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M. Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J.
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4131829/
https://www.ncbi.nlm.nih.gov/pubmed/24501397
http://dx.doi.org/10.1093/dnares/dsu001
collection PubMed
description Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.
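The three-module design described in the abstract can be pictured as a fallback cascade. The following is a minimal sketch, not the Sma3s implementation: it assumes each module either returns annotation terms for a query or returns nothing, and that a query annotated by an earlier module is not passed on to later ones (the record does not state this). The names `annotate_progressively`, `very_similar`, `orthologues`, and `enriched_terms`, and the toy terms, are hypothetical placeholders.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical type: a module maps a query ID to annotation terms,
# or returns None when it cannot annotate that query.
Annotator = Callable[[str], Optional[List[str]]]


def annotate_progressively(
    query_ids: Iterable[str], modules: List[Annotator]
) -> Dict[str, List[str]]:
    """Try each module in order; keep the terms from the first that succeeds."""
    annotations: Dict[str, List[str]] = {}
    for query_id in query_ids:
        for module in modules:
            terms = module(query_id)
            if terms:
                annotations[query_id] = terms
                break  # assumption: later modules only see unannotated queries
    return annotations


# Toy stand-ins for the three modules (lookup tables instead of real searches).
very_similar = {"seq1": ["term_A"]}.get
orthologues = {"seq2": ["term_B"]}.get
enriched_terms = {"seq1": ["term_X"], "seq3": ["term_C"]}.get

result = annotate_progressively(
    ["seq1", "seq2", "seq3", "seq4"],
    [very_similar, orthologues, enriched_terms],
)
print(result)
```

In this toy run, `seq1` keeps the terms from the first module even though the third module also has an entry for it, and `seq4` stays unannotated because no module recognises it.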
format Online
Article
Text
id pubmed-4131829
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4131829 2014-08-18 Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets Muñoz-Mérida, Antonio Viguera, Enrique Claros, M. Gonzalo Trelles, Oswaldo Pérez-Pulido, Antonio J. DNA Res Full Papers Oxford University Press 2014-08 2014-02-05 /pmc/articles/PMC4131829/ /pubmed/24501397 http://dx.doi.org/10.1093/dnares/dsu001 Text en © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.