
Automatic categorization of diverse experimental information in the bioscience literature

BACKGROUND: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain r...


Bibliographic Details
Main Authors: Fang, Ruihua, Schindelman, Gary, Auken, Kimberly Van, Fernandes, Jolene, Chen, Wen, Wang, Xiaodong, Davis, Paul, Tuli, Mary Ann, Marygold, Steven J, Millburn, Gillian, Matthews, Beverley, Zhang, Haiyan, Brown, Nick, Gelbart, William M, Sternberg, Paul W
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305665/
https://www.ncbi.nlm.nih.gov/pubmed/22280404
http://dx.doi.org/10.1186/1471-2105-13-16
author Fang, Ruihua
Schindelman, Gary
Auken, Kimberly Van
Fernandes, Jolene
Chen, Wen
Wang, Xiaodong
Davis, Paul
Tuli, Mary Ann
Marygold, Steven J
Millburn, Gillian
Matthews, Beverley
Zhang, Haiyan
Brown, Nick
Gelbart, William M
Sternberg, Paul W
author_sort Fang, Ruihua
collection PubMed
description BACKGROUND: Curation of information from the bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and is thus usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been used in production for automatic categorization of ten different experimental data types in the biocuration process at WormBase for the past two years, and it is being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure that uses training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach is significant because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
RESULTS: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
CONCLUSIONS: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and thus can be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
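As an illustration only, and not the authors' implementation, the approach summarized in the abstract (a linear SVM that flags papers for one data type, trained on labelled papers pooled from more than one corpus) might be sketched as follows. The toy sentences, the function names, the bag-of-words features, and the Pegasos-style training loop are all assumptions made for this sketch; the record itself does not specify the features, kernel, or training procedure used.

```python
import random
from collections import Counter


def featurize(text):
    """Hypothetical feature extractor: L2-normalised bag-of-words counts."""
    counts = Counter(text.lower().split())
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {tok: v / norm for tok, v in counts.items()}


def train_linear_svm(examples, lam=0.01, epochs=50, seed=0):
    """Pegasos-style stochastic subgradient descent on the hinge loss.

    `examples` is a list of (text, label) pairs with labels +1 or -1.
    Returns a sparse weight vector as a dict (no bias term).
    """
    rng = random.Random(seed)
    data = [(featurize(text), label) for text, label in examples]
    w = {}
    t = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = data[i]
            margin = y * sum(w.get(k, 0.0) * v for k, v in x.items())
            # Regularisation step: shrink every weight.
            shrink = 1.0 - eta * lam
            for k in w:
                w[k] *= shrink
            # Hinge-loss step: move towards examples inside the margin.
            if margin < 1.0:
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w


def score(w, text):
    """Signed decision value; positive means 'contains the data type'."""
    return sum(w.get(k, 0.0) * v for k, v in featurize(text).items())


# Made-up training sentences standing in for full papers from two corpora.
# Label +1 marks a paper reporting RNAi results, -1 marks one that does not.
worm_train = [
    ("RNAi knockdown caused twitching", 1),
    ("dsRNA injection knockdown phenotype", 1),
    ("genome assembly sequencing statistics", -1),
    ("protein crystal structure resolved", -1),
]
fly_train = [
    ("RNAi feeding reduced growth", 1),
    ("phylogenetic survey across species", -1),
]

# Pooling the corpora gives the classifier more examples per data type.
weights = train_linear_svm(worm_train + fly_train)

for paper in ("RNAi knockdown phenotype in mutants",
              "phylogenetic survey of nematode species"):
    label = "RNAi" if score(weights, paper) > 0 else "not RNAi"
    print(paper, "->", label)
```

One such binary classifier per data type would be trained independently in this sketch, so adding a new data type only requires a new labelled training set.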
format Online
Article
Text
id pubmed-3305665
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-33056652012-03-16 Automatic categorization of diverse experimental information in the bioscience literature Fang, Ruihua Schindelman, Gary Auken, Kimberly Van Fernandes, Jolene Chen, Wen Wang, Xiaodong Davis, Paul Tuli, Mary Ann Marygold, Steven J Millburn, Gillian Matthews, Beverley Zhang, Haiyan Brown, Nick Gelbart, William M Sternberg, Paul W BMC Bioinformatics Methodology Article BioMed Central 2012-01-26 /pmc/articles/PMC3305665/ /pubmed/22280404 http://dx.doi.org/10.1186/1471-2105-13-16 Text en Copyright ©2012 Fang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Automatic categorization of diverse experimental information in the bioscience literature
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305665/
https://www.ncbi.nlm.nih.gov/pubmed/22280404
http://dx.doi.org/10.1186/1471-2105-13-16