Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text

BACKGROUND: The BioCreative text mining evaluation investigated the application of text mining methods to the task of automatically extracting information from text in biomedical research articles. We participated in Task 2 of the evaluation. For this task, we built a system to automatically annotat...

Bibliographic Details
Main Authors:	Ray, Soumya, Craven, Mark
Format:	Text
Language:	English
Published:	BioMed Central 2005
Subjects:	Report
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869010/ https://www.ncbi.nlm.nih.gov/pubmed/15960830 http://dx.doi.org/10.1186/1471-2105-6-S1-S18

_version_	1782133426634096640
author	Ray, Soumya Craven, Mark
author_facet	Ray, Soumya Craven, Mark
author_sort	Ray, Soumya
collection	PubMed
description	BACKGROUND: The BioCreative text mining evaluation investigated the application of text mining methods to the task of automatically extracting information from text in biomedical research articles. We participated in Task 2 of the evaluation. For this task, we built a system to automatically annotate a given protein with codes from the Gene Ontology (GO) using the text of an article from the biomedical literature as evidence. METHODS: Our system relies on simple statistical analyses of the full text article provided. We learn n-gram models for each GO code using statistical methods and use these models to hypothesize annotations. We also learn a set of Naïve Bayes models that identify textual clues of possible connections between the given protein and a hypothesized annotation. These models are used to filter and rank the predictions of the n-gram models. RESULTS: We report experiments evaluating the utility of various components of our system on a set of data held out during development, and experiments evaluating the utility of external data sources that we used to learn our models. Finally, we report our evaluation results from the BioCreative organizers. CONCLUSION: We observe that, on the test data, our system performs quite well relative to the other systems submitted to the evaluation. From other experiments on the held-out data, we observe that (i) the Naïve Bayes models were effective in filtering and ranking the initially hypothesized annotations, and (ii) our learned models were significantly more accurate when external data sources were used during learning.
format	Text
id	pubmed-1869010
institution	National Center for Biotechnology Information
language	English
publishDate	2005
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1869010 2007-05-18 Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text Ray, Soumya Craven, Mark BMC Bioinformatics Report
BioMed Central 2005-05-24 /pmc/articles/PMC1869010/ /pubmed/15960830 http://dx.doi.org/10.1186/1471-2105-6-S1-S18 Text en Copyright © 2005 Ray and Craven; licensee BioMed Central Ltd http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Report Ray, Soumya Craven, Mark Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text
title	Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text
title_sort	learning statistical models for annotating proteins with function information using biomedical text
topic	Report
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869010/ https://www.ncbi.nlm.nih.gov/pubmed/15960830 http://dx.doi.org/10.1186/1471-2105-6-S1-S18
work_keys_str_mv	AT raysoumya learningstatisticalmodelsforannotatingproteinswithfunctioninformationusingbiomedicaltext AT cravenmark learningstatisticalmodelsforannotatingproteinswithfunctioninformationusingbiomedicaltext
