
Automated detection of discourse segment and experimental types from the text of cancer pathway results sections

Automated machine-reading biocuration systems typically use sentence-by-sentence information extraction to construct meaning representations for use by curators. This does not directly reflect the typical discourse structure used by scientists to construct an argument from the experimental data avai...


Bibliographic Details
Main authors: Burns, Gully A.P.C., Dasigi, Pradeep, de Waard, Anita, Hovy, Eduard H.
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5006090/
https://www.ncbi.nlm.nih.gov/pubmed/27580922
http://dx.doi.org/10.1093/database/baw122
author Burns, Gully A.P.C.
Dasigi, Pradeep
de Waard, Anita
Hovy, Eduard H.
collection PubMed
description Automated machine-reading biocuration systems typically use sentence-by-sentence information extraction to construct meaning representations for use by curators. This does not directly reflect the typical discourse structure used by scientists to construct an argument from the experimental data available within an article, and is therefore less likely to correspond to representations typically used in biomedical informatics systems (let alone to the mental models that scientists have). In this study, we develop Natural Language Processing methods to locate, extract, and classify the individual passages of text from articles’ Results sections that refer to experimental data. In our domain of interest (molecular biology studies of cancer signal transduction pathways), individual articles may contain as many as 30 small-scale individual experiments describing a variety of findings, upon which authors base their overall research conclusions. Our system automatically classifies discourse segments in these texts into seven categories (fact, hypothesis, problem, goal, method, result, implication) with an F-score of 0.68. These segments describe the essential building blocks of scientific discourse to (i) provide context for each experiment, (ii) report experimental details and (iii) explain the data’s meaning in context. We evaluate our system on text passages from articles that were curated in molecular biology databases (the Pathway Logic Datum repository, the Molecular Interaction MINT and INTACT databases) linking individual experiments in articles to the type of assay used (coprecipitation, phosphorylation, translocation, etc.). We use supervised machine learning techniques on text passages containing unambiguous references to experiments to obtain baseline F1 scores of 0.59 for MINT, 0.71 for INTACT and 0.63 for Pathway Logic. Although preliminary, these results support the notion that targeting information extraction methods to experimental results could provide accurate, automated methods for biocuration. We also suggest the need for finer-grained curation of experimental methods used when constructing molecular biology databases.
format Online
Article
Text
id pubmed-5006090
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5006090 2016-09-06 Automated detection of discourse segment and experimental types from the text of cancer pathway results sections Burns, Gully A.P.C. Dasigi, Pradeep de Waard, Anita Hovy, Eduard H. Database (Oxford) Original Article Oxford University Press 2016-08-31 /pmc/articles/PMC5006090/ /pubmed/27580922 http://dx.doi.org/10.1093/database/baw122 Text en © The Author(s) 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Automated detection of discourse segment and experimental types from the text of cancer pathway results sections
topic Original Article