Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

BACKGROUND: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate inform...

Bibliographic Details
Main Authors: Van Auken, Kimberly, Jaffery, Joshua, Chan, Juancarlos, Müller, Hans-Michael, Sternberg, Paul W
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2719631/
https://www.ncbi.nlm.nih.gov/pubmed/19622167
http://dx.doi.org/10.1186/1471-2105-10-228
author Van Auken, Kimberly
Jaffery, Joshua
Chan, Juancarlos
Müller, Hans-Michael
Sternberg, Paul W
collection PubMed
description BACKGROUND: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. RESULTS: We employ the Textpresso category-based information retrieval and extraction system, developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%).
Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed. CONCLUSION: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
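The F-scores quoted in the description can be related to the reported recall and precision figures with a short calculation. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the record does not state which variant the authors used, and the function name `f_score` is purely illustrative. Because the inputs are the rounded percentages from the abstract, the recomputed values may differ from the reported ones in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (same units as inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, F-score % as reported in the abstract)
reported = {
    "curatable papers": (61.8, 79.1, 69.5),
    "relevant sentences": (80.1, 30.3, 44.0),
    "GO annotations": (97.3, 66.2, 78.8),
}

for task, (precision, recall, f_reported) in reported.items():
    recomputed = f_score(precision, recall)
    print(f"{task}: recomputed {recomputed:.1f}%, reported {f_reported}%")
```

The harmonic mean penalizes an imbalance between the two rates, which is why the sentence-level result, with high precision but low recall, has the lowest F-score of the three.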
format Text
id pubmed-2719631
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2719631 2009-08-01 Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation Van Auken, Kimberly Jaffery, Joshua Chan, Juancarlos Müller, Hans-Michael Sternberg, Paul W BMC Bioinformatics Methodology Article BACKGROUND: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. RESULTS: We employ the Textpresso category-based information retrieval and extraction system, developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%).
From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed. CONCLUSION: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation. BioMed Central 2009-07-21 /pmc/articles/PMC2719631/ /pubmed/19622167 http://dx.doi.org/10.1186/1471-2105-10-228 Text en Copyright © 2009 Van Auken et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
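The query described in the abstract (sentences that contain a term from each of the three categories plus a protein name) can be illustrated with a small filter. This is only a sketch of the idea, not Textpresso's implementation: the term lists, the function name `matches_query`, and the protein name `ABC-1` are hypothetical placeholders, and the real categories were built from a training set of sentences and are far larger.

```python
import re

# Hypothetical, tiny stand-ins for the three curation task-specific categories.
CATEGORIES = {
    "cellular_component": {"nucleus", "cytoplasm", "plasma membrane"},
    "assay_term": {"gfp", "antibody", "immunofluorescence"},
    "verb": {"localizes", "localized", "expressed", "detected"},
}


def contains_term(text, term):
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def matches_query(sentence, protein):
    """True if the sentence names the protein and has a term from every category."""
    if not contains_term(sentence, protein):
        return False
    return all(
        any(contains_term(sentence, term) for term in terms)
        for terms in CATEGORIES.values()
    )


sentences = [
    "ABC-1::GFP localizes to the nucleus.",   # protein + all three categories
    "ABC-1 is required for development.",     # protein only
    "GFP localizes to the nucleus.",          # all categories, no protein
]
hits = [s for s in sentences if matches_query(s, "ABC-1")]
print(hits)
```

Requiring all four elements in a single sentence is a strict condition, which fits the pattern reported in the abstract: high precision but comparatively low recall at the sentence level.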
title Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation
topic Methodology Article