
Construction of an annotated corpus to support biomedical information extraction

BACKGROUND: Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic an...


Bibliographic Details
Main authors: Thompson, Paul; Iqbal, Syed A; McNaught, John; Ananiadou, Sophia
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2774701/
https://www.ncbi.nlm.nih.gov/pubmed/19852798
http://dx.doi.org/10.1186/1471-2105-10-349
author Thompson, Paul
Iqbal, Syed A
McNaught, John
Ananiadou, Sophia
collection PubMed
description BACKGROUND: Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.
RESULTS: We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%.
CONCLUSION: The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes.
format Text
id pubmed-2774701
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2774701 2009-11-10 Construction of an annotated corpus to support biomedical information extraction Thompson, Paul Iqbal, Syed A McNaught, John Ananiadou, Sophia BMC Bioinformatics Research Article BioMed Central 2009-10-23 /pmc/articles/PMC2774701/ /pubmed/19852798 http://dx.doi.org/10.1186/1471-2105-10-349 Text en Copyright © 2009 Thompson et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Construction of an annotated corpus to support biomedical information extraction
topic Research Article