
BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation...


Bibliographic Details
Main Authors: Li, Jiao, Sun, Yueping, Johnson, Robin J., Sciaky, Daniela, Wei, Chih-Hsuan, Leaman, Robert, Davis, Allan Peter, Mattingly, Carolyn J., Wiegers, Thomas C., Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/
https://www.ncbi.nlm.nih.gov/pubmed/27161011
http://dx.doi.org/10.1093/database/baw068
author Li, Jiao
Sun, Yueping
Johnson, Robin J.
Sciaky, Daniela
Wei, Chih-Hsuan
Leaman, Robert
Davis, Allan Peter
Mattingly, Carolyn J.
Wiegers, Thomas C.
Lu, Zhiyong
author_sort Li, Jiao
collection PubMed
description Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for the disease and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
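The description reports inter-annotator agreement according to the Jaccard similarity coefficient. The sketch below is a minimal illustration of that measure, not the authors' evaluation script. It assumes each annotator's output for an article is a set of (start, end, MeSH ID) tuples and that per-article scores are averaged; the function names, the tuple layout and the sample annotations are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|.

    Assumption: two empty annotation sets count as full agreement (1.0).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def average_iaa(pairs) -> float:
    """Mean Jaccard score over (annotator_1, annotator_2) set pairs."""
    scores = [jaccard(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# Hypothetical annotations for one article: (start, end, MeSH ID).
annotator_1 = {(0, 8, "D009270"), (25, 36, "D006973"), (50, 58, "D003000")}
annotator_2 = {(0, 8, "D009270"), (25, 36, "D006973"), (60, 70, "D007022")}

# Two shared annotations out of four distinct ones.
score = jaccard(annotator_1, annotator_2)
print(f"Jaccard agreement: {score:.2%}")
```

Under this exact-match assumption, an annotation counts as shared only when both the span and the concept identifier agree; a looser matching rule would give higher scores.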
format Online
Article
Text
id pubmed-4860626
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4860626 2016-05-10 Database (Oxford) Original Article Oxford University Press 2016-05-08 /pmc/articles/PMC4860626/ /pubmed/27161011 http://dx.doi.org/10.1093/database/baw068 Text en Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.
title BioCreative V CDR task corpus: a resource for chemical disease relation extraction
topic Original Article