Developing a Disease Outbreak Event Corpus

BACKGROUND: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. OBJ...

Bibliographic Details
Main Authors: Conway, Mike; Kawazoe, Ai; Chanlekha, Hutchatai; Collier, Nigel
Format: Text
Language: English
Published: Gunther Eysenbach 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956322/
https://www.ncbi.nlm.nih.gov/pubmed/20876049
http://dx.doi.org/10.2196/jmir.1323
_version_ 1782188135670611968
author Conway, Mike
Kawazoe, Ai
Chanlekha, Hutchatai
Collier, Nigel
author_facet Conway, Mike
Kawazoe, Ai
Chanlekha, Hutchatai
Collier, Nigel
author_sort Conway, Mike
collection PubMed
description BACKGROUND: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. OBJECTIVE: This study seeks to create a “gold standard” data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area. METHODS: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event, in the context of our annotation scheme, consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain-salient concepts (eg, international travel, species, and food contamination). RESULTS: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration. CONCLUSION: In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind as well as the wider need for further evaluation resources in this growing research area.
format Text
id pubmed-2956322
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher Gunther Eysenbach
record_format MEDLINE/PubMed
spelling pubmed-29563222010-10-18 Developing a Disease Outbreak Event Corpus Conway, Mike Kawazoe, Ai Chanlekha, Hutchatai Collier, Nigel J Med Internet Res Original Paper BACKGROUND: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. OBJECTIVE: This study seeks to create a “gold standard” data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area. METHODS: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event, in the context of our annotation scheme, consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain-salient concepts (eg, international travel, species, and food contamination). RESULTS: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration. CONCLUSION: In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms.
The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind as well as the wider need for further evaluation resources in this growing research area. Gunther Eysenbach 2010-09-28 /pmc/articles/PMC2956322/ /pubmed/20876049 http://dx.doi.org/10.2196/jmir.1323 Text en ©Mike Conway, Ai Kawazoe, Hutchatai Chanlekha, Nigel Collier. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 28.09.2010   http://creativecommons.org/licenses/by/2.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Conway, Mike
Kawazoe, Ai
Chanlekha, Hutchatai
Collier, Nigel
Developing a Disease Outbreak Event Corpus
title Developing a Disease Outbreak Event Corpus
title_full Developing a Disease Outbreak Event Corpus
title_fullStr Developing a Disease Outbreak Event Corpus
title_full_unstemmed Developing a Disease Outbreak Event Corpus
title_short Developing a Disease Outbreak Event Corpus
title_sort developing a disease outbreak event corpus
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956322/
https://www.ncbi.nlm.nih.gov/pubmed/20876049
http://dx.doi.org/10.2196/jmir.1323
work_keys_str_mv AT conwaymike developingadiseaseoutbreakeventcorpus
AT kawazoeai developingadiseaseoutbreakeventcorpus
AT chanlekhahutchatai developingadiseaseoutbreakeventcorpus
AT colliernigel developingadiseaseoutbreakeventcorpus