Integrating text mining into the MGI biocuration workflow

A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as tho...

Bibliographic Details
Main authors: Dowell, K.G., McAndrews-Hill, M.S., Hill, D.P., Drabkin, H.J., Blake, J.A.
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2797454/
https://www.ncbi.nlm.nih.gov/pubmed/20157492
http://dx.doi.org/10.1093/database/bap019
author Dowell, K.G.
McAndrews-Hill, M.S.
Hill, D.P.
Drabkin, H.J.
Blake, J.A.
collection PubMed
description A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in those formats preferred by scientific journals. In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen ∼1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database. Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. 
In doing so, we prove the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature.
format Text
id pubmed-2797454
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-2797454 2009-12-23 Integrating text mining into the MGI biocuration workflow Dowell, K.G. McAndrews-Hill, M.S. Hill, D.P. Drabkin, H.J. Blake, J.A. Database (Oxford) Original Article Oxford University Press 2009 2009-11-21 /pmc/articles/PMC2797454/ /pubmed/20157492 http://dx.doi.org/10.1093/database/bap019 Text en © The Author(s) 2009. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.5/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Integrating text mining into the MGI biocuration workflow
topic Original Article
work_keys_str_mv AT dowellkg integratingtextminingintothemgibiocurationworkflow
AT mcandrewshillms integratingtextminingintothemgibiocurationworkflow
AT hilldp integratingtextminingintothemgibiocurationworkflow
AT drabkinhj integratingtextminingintothemgibiocurationworkflow
AT blakeja integratingtextminingintothemgibiocurationworkflow