
Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles

BACKGROUND: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the p...


Bibliographic Details
Main authors: Cohen, K. Bretonnel, Lanfranchi, Arrick, Choi, Miji Joo-young, Bada, Michael, Baumgartner, William A., Panteleyeva, Natalya, Verspoor, Karin, Palmer, Martha, Hunter, Lawrence E.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5561560/
https://www.ncbi.nlm.nih.gov/pubmed/28818042
http://dx.doi.org/10.1186/s12859-017-1775-9
author Cohen, K. Bretonnel
Lanfranchi, Arrick
Choi, Miji Joo-young
Bada, Michael
Baumgartner, William A.
Panteleyeva, Natalya
Verspoor, Karin
Palmer, Martha
Hunter, Lawrence E.
collection PubMed
description BACKGROUND: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations. RESULTS: The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, for an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus. CONCLUSIONS: The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature.
The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain because they refer to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and also suggest that the baseline performance difference is quite large.
format Online
Article
Text
id pubmed-5561560
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5561560 2017-08-18
Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles
BMC Bioinformatics
Research Article
BioMed Central 2017-08-17
/pmc/articles/PMC5561560/
/pubmed/28818042
http://dx.doi.org/10.1186/s12859-017-1775-9
Text en
© The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles
topic Research Article