Towards classifying species in systems biology papers using text mining

BACKGROUND: In recent years high-throughput methods have led to a massive expansion in the free text literature on molecular biology. Automated text mining has developed as an application technology for formalizing this wealth of published results into structured database entries. However, database...

Bibliographic Details
Main Authors: Wei, Qi; Collier, Nigel
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3045319/
https://www.ncbi.nlm.nih.gov/pubmed/21294879
http://dx.doi.org/10.1186/1756-0500-4-32
author Wei, Qi
Collier, Nigel
author_facet Wei, Qi
Collier, Nigel
author_sort Wei, Qi
collection PubMed
description BACKGROUND: In recent years high-throughput methods have led to a massive expansion in the free text literature on molecular biology. Automated text mining has developed as an application technology for formalizing this wealth of published results into structured database entries. However, database curation as a task is still largely done by hand, and although there have been many studies on automated approaches, problems remain in how to classify documents into top-level categories based on the type of organism being investigated. Here we present a comparative analysis of state-of-the-art supervised models that are used to classify both abstracts and full text articles for three model organisms. RESULTS: Ablation experiments were conducted on a large gold standard corpus of 10,000 abstracts and full papers containing data on three model organisms (fly, mouse and yeast). Among the eight learner models tested, the best model achieved an F-score of 97.1% for fly, 88.6% for mouse and 85.5% for yeast using a variety of features that included gene name, organism frequency, MeSH headings and term-species associations. We noted that term-species associations were particularly effective in improving classification performance. The benefit of using full text articles over abstracts was consistently observed across all three organisms. CONCLUSIONS: By comparing various learner algorithms and features we presented an optimized system that automatically detects the major focus organism in full text articles for fly, mouse and yeast. We believe the method will be extensible to other organism types.
format Text
id pubmed-3045319
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3045319 2011-02-26 Towards classifying species in systems biology papers using text mining Wei, Qi Collier, Nigel BMC Res Notes Research Article BioMed Central 2011-02-04 /pmc/articles/PMC3045319/ /pubmed/21294879 http://dx.doi.org/10.1186/1756-0500-4-32 Text en Copyright ©2011 Collier et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Wei, Qi
Collier, Nigel
Towards classifying species in systems biology papers using text mining
title Towards classifying species in systems biology papers using text mining
title_full Towards classifying species in systems biology papers using text mining
title_fullStr Towards classifying species in systems biology papers using text mining
title_full_unstemmed Towards classifying species in systems biology papers using text mining
title_short Towards classifying species in systems biology papers using text mining
title_sort towards classifying species in systems biology papers using text mining
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3045319/
https://www.ncbi.nlm.nih.gov/pubmed/21294879
http://dx.doi.org/10.1186/1756-0500-4-32
work_keys_str_mv AT weiqi towardsclassifyingspeciesinsystemsbiologypapersusingtextmining
AT colliernigel towardsclassifyingspeciesinsystemsbiologypapersusingtextmining