
The High Throughput Sequence Annotation Service (HT-SAS) – the shortcut from sequence to true Medline words

BACKGROUND: Advances in high-throughput technologies available to modern biology have created an increasing flood of experimentally determined facts. Ordering, managing and describing these raw results is the first step which allows facts to become knowledge. Currently there are limited ways to auto...


Bibliographic Details
Main Authors: Kaczanowski, Szymon; Siedlecki, Pawel; Zielenkiewicz, Piotr
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2694793/
https://www.ncbi.nlm.nih.gov/pubmed/19445703
http://dx.doi.org/10.1186/1471-2105-10-148
author Kaczanowski, Szymon
Siedlecki, Pawel
Zielenkiewicz, Piotr
collection PubMed
description BACKGROUND: Advances in high-throughput technologies available to modern biology have created an increasing flood of experimentally determined facts. Ordering, managing and describing these raw results is the first step that allows facts to become knowledge. Currently there are limited ways to automatically annotate such data, especially by using information deposited in published literature. RESULTS: To aid researchers in describing results from high-throughput experiments, we developed HT-SAS, a web service for automatic annotation of proteins using general English words. For each protein, a pool of Medline abstracts connected to homologous proteins is gathered using the UniProt-Medline link. Overrepresented words are detected using a binomial statistics approximation. We tested this automatic approach on a protein test set from SGD to determine its accuracy and usefulness. We also applied the automatic annotation service to improve annotations of proteins from Plasmodium berghei expressed exclusively during the blood stage. CONCLUSION: Using HT-SAS we created new annotations, or enriched already established ones, for over 20% of the proteins from Plasmodium berghei expressed in the blood stage and deposited in PlasmoDB. Our tests show that this approach to information extraction provides highly specific keywords, often even when the number of abstracts is limited. Our service should be useful for manual curators, as a complement to manually curated information sources, and for researchers working with protein datasets, especially from poorly characterized organisms.
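The abstract says only that overrepresented words are detected with a binomial statistics approximation; it does not give the formula. The sketch below is one plausible reading, not the authors' implementation. It assumes whitespace tokenization, per-abstract (document) word frequencies, a smoothed background probability, and a normal approximation to the binomial expressed as a z-score. The function name `overrepresented_words` and the `z_cutoff` parameter are hypothetical.

```python
import math
from collections import Counter


def overrepresented_words(protein_abstracts, background_abstracts, z_cutoff=3.0):
    """Return (word, z) pairs for words overrepresented in protein_abstracts.

    Assumption: each abstract either contains a word or not, so the number
    of abstracts containing it is treated as a binomial count and compared
    with the background rate via a normal approximation.
    """

    def doc_freq(abstracts):
        # Count in how many abstracts each word occurs (not total occurrences).
        counts = Counter()
        for text in abstracts:
            counts.update(set(text.lower().split()))
        return counts

    fg = doc_freq(protein_abstracts)
    bg = doc_freq(background_abstracts)
    n = len(protein_abstracts)
    total = len(background_abstracts)

    hits = []
    for word, k in fg.items():
        # Smoothed background probability, kept strictly between 0 and 1.
        p = (bg.get(word, 0) + 1) / (total + 2)
        mean = n * p
        sd = math.sqrt(n * p * (1 - p))
        z = (k - mean) / sd
        if z >= z_cutoff:
            hits.append((word, z))
    # Most strongly overrepresented words first.
    return sorted(hits, key=lambda h: -h[1])


if __name__ == "__main__":
    protein = [
        "kinase phosphorylates the substrate",
        "the kinase binds atp",
        "kinase activity in the cell",
    ]
    background = ["the cell grows"] * 18 + ["the kinase assay", "a membrane protein"]
    for word, z in overrepresented_words(protein, background):
        print(word, round(z, 2))
```

In this toy input, a word common everywhere (such as "the") scores near zero, while a word that is frequent in the protein's abstracts but rare in the background stands out. A real pipeline would also need stop-word handling and multiple-testing control, which the record does not describe.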
format Text
id pubmed-2694793
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2694793 2009-06-11 The High Throughput Sequence Annotation Service (HT-SAS) – the shortcut from sequence to true Medline words Kaczanowski, Szymon Siedlecki, Pawel Zielenkiewicz, Piotr BMC Bioinformatics Software
BioMed Central 2009-05-16 /pmc/articles/PMC2694793/ /pubmed/19445703 http://dx.doi.org/10.1186/1471-2105-10-148 Text en Copyright © 2009 Kaczanowski et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The High Throughput Sequence Annotation Service (HT-SAS) – the shortcut from sequence to true Medline words
topic Software