Corpus annotation for mining biomedical events from literature

BACKGROUND: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensa...

Bibliographic Details
Main authors: Kim, Jin-Dong; Ohta, Tomoko; Tsujii, Jun'ichi
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2267702/
https://www.ncbi.nlm.nih.gov/pubmed/18182099
http://dx.doi.org/10.1186/1471-2105-9-10
author Kim, Jin-Dong
Ohta, Tomoko
Tsujii, Jun'ichi
author_facet Kim, Jin-Dong
Ohta, Tomoko
Tsujii, Jun'ichi
author_sort Kim, Jin-Dong
collection PubMed
description BACKGROUND: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. RESULTS: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflect biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. CONCLUSION: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.
format Text
id pubmed-2267702
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-22677022008-03-15 Corpus annotation for mining biomedical events from literature Kim, Jin-Dong Ohta, Tomoko Tsujii, Jun'ichi BMC Bioinformatics Research Article BACKGROUND: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. RESULTS: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflect biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. CONCLUSION: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain. BioMed Central 2008-01-08 /pmc/articles/PMC2267702/ /pubmed/18182099 http://dx.doi.org/10.1186/1471-2105-9-10 Text en Copyright © 2008 Kim et al; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Kim, Jin-Dong
Ohta, Tomoko
Tsujii, Jun'ichi
Corpus annotation for mining biomedical events from literature
title Corpus annotation for mining biomedical events from literature
title_full Corpus annotation for mining biomedical events from literature
title_fullStr Corpus annotation for mining biomedical events from literature
title_full_unstemmed Corpus annotation for mining biomedical events from literature
title_short Corpus annotation for mining biomedical events from literature
title_sort corpus annotation for mining biomedical events from literature
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2267702/
https://www.ncbi.nlm.nih.gov/pubmed/18182099
http://dx.doi.org/10.1186/1471-2105-9-10