
Concept annotation in the CRAFT corpus



Bibliographic Details
Main authors: Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3476437/
https://www.ncbi.nlm.nih.gov/pubmed/22776079
http://dx.doi.org/10.1186/1471-2105-13-161
_version_ 1782247099905081344
author Bada, Michael
Eckert, Miriam
Evans, Donald
Garcia, Kristin
Shipley, Krista
Sitnikov, Dmitry
Baumgartner, William A
Cohen, K Bretonnel
Verspoor, Karin
Blake, Judith A
Hunter, Lawrence E
author_sort Bada, Michael
collection PubMed
description BACKGROUND: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. RESULTS: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. CONCLUSIONS: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
format Online
Article
Text
id pubmed-3476437
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3476437 2012-10-20
Concept annotation in the CRAFT corpus
Bada, Michael
Eckert, Miriam
Evans, Donald
Garcia, Kristin
Shipley, Krista
Sitnikov, Dmitry
Baumgartner, William A
Cohen, K Bretonnel
Verspoor, Karin
Blake, Judith A
Hunter, Lawrence E
BMC Bioinformatics
Research Article
BioMed Central 2012-07-09
/pmc/articles/PMC3476437/
/pubmed/22776079
http://dx.doi.org/10.1186/1471-2105-13-161
Text en
Copyright ©2012 Bada et al.; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Concept annotation in the CRAFT corpus
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3476437/
https://www.ncbi.nlm.nih.gov/pubmed/22776079
http://dx.doi.org/10.1186/1471-2105-13-161