
Automatic recognition of conceptualization zones in scientific articles and two life science applications

Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In p...


Bibliographic Details
Main authors: Liakata, Maria; Saha, Shyamasree; Dobnik, Simon; Batchelor, Colin; Rebholz-Schuhmann, Dietrich
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315721/
https://www.ncbi.nlm.nih.gov/pubmed/22321698
http://dx.doi.org/10.1093/bioinformatics/bts071
author Liakata, Maria
Saha, Shyamasree
Dobnik, Simon
Batchelor, Colin
Rebholz-Schuhmann, Dietrich
collection PubMed
description Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. Results: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with ‘Experiment’, ‘Background’ and ‘Model’ being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. 
We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress. Availability: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software. http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data. Contact: liakata@ebi.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3315721
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-33157212012-03-30 Automatic recognition of conceptualization zones in scientific articles and two life science applications Liakata, Maria Saha, Shyamasree Dobnik, Simon Batchelor, Colin Rebholz-Schuhmann, Dietrich Bioinformatics Original Papers Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. Results: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with ‘Experiment’, ‘Background’ and ‘Model’ being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. 
The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress. Availability: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data. Contact: liakata@ebi.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2012-04-01 2012-02-08 /pmc/articles/PMC3315721/ /pubmed/22321698 http://dx.doi.org/10.1093/bioinformatics/bts071 Text en © The Author(s) 2012. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Automatic recognition of conceptualization zones in scientific articles and two life science applications
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315721/
https://www.ncbi.nlm.nih.gov/pubmed/22321698
http://dx.doi.org/10.1093/bioinformatics/bts071