
New directions in biomedical text annotation: definitions, guidelines and corpus construction

BACKGROUND: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of info...


Bibliographic Details
Main authors: Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559725/
https://www.ncbi.nlm.nih.gov/pubmed/16867190
http://dx.doi.org/10.1186/1471-2105-7-356
_version_ 1782129456748429312
author Wilbur, W John
Rzhetsky, Andrey
Shatkay, Hagit
author_facet Wilbur, W John
Rzhetsky, Andrey
Shatkay, Hagit
author_sort Wilbur, W John
collection PubMed
description BACKGROUND: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. RESULTS: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. CONCLUSION: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. 
We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.
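The description above reports 70–80% inter-annotator agreement among twelve annotators on 101 sentences, but does not say how that figure was computed. As a rough illustration only, the sketch below computes one common measure, the mean pairwise percent agreement on a single dimension. The function name, annotator names, labels, and toy data are all hypothetical and are not taken from the article.

```python
from itertools import combinations

# Hypothetical toy data, NOT from the article: for one dimension (say,
# polarity), labels[a][s] is the label annotator a assigned to sentence s.
labels = {
    "ann1": ["P", "P", "N", "P"],
    "ann2": ["P", "N", "N", "P"],
    "ann3": ["P", "P", "N", "N"],
}


def pairwise_agreement(labels):
    """Mean fraction of sentences on which two annotators agree,
    averaged over all pairs of annotators.

    This is an assumed measure; the article may have used another.
    """
    rows = list(labels.values())
    if len(rows) < 2:
        raise ValueError("need at least two annotators")
    n = len(rows[0])
    if n == 0 or any(len(r) != n for r in rows):
        raise ValueError("annotators must label the same non-empty sentence set")
    # Fraction of matching labels for each unordered annotator pair.
    per_pair = [
        sum(x == y for x, y in zip(a, b)) / n
        for a, b in combinations(rows, 2)
    ]
    return sum(per_pair) / len(per_pair)


if __name__ == "__main__":
    print(f"mean pairwise agreement: {pairwise_agreement(labels):.3f}")
```

Raw percent agreement does not correct for agreement expected by chance. A chance-corrected statistic such as Cohen's or Fleiss' kappa is a frequent alternative when label distributions are skewed.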
format Text
id pubmed-1559725
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-15597252006-09-05 New directions in biomedical text annotation: definitions, guidelines and corpus construction Wilbur, W John Rzhetsky, Andrey Shatkay, Hagit BMC Bioinformatics Methodology Article BACKGROUND: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. RESULTS: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. CONCLUSION: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. 
These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. BioMed Central 2006-07-25 /pmc/articles/PMC1559725/ /pubmed/16867190 http://dx.doi.org/10.1186/1471-2105-7-356 Text en Copyright © 2006 Wilbur et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Wilbur, W John
Rzhetsky, Andrey
Shatkay, Hagit
New directions in biomedical text annotation: definitions, guidelines and corpus construction
title New directions in biomedical text annotation: definitions, guidelines and corpus construction
title_full New directions in biomedical text annotation: definitions, guidelines and corpus construction
title_fullStr New directions in biomedical text annotation: definitions, guidelines and corpus construction
title_full_unstemmed New directions in biomedical text annotation: definitions, guidelines and corpus construction
title_short New directions in biomedical text annotation: definitions, guidelines and corpus construction
title_sort new directions in biomedical text annotation: definitions, guidelines and corpus construction
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559725/
https://www.ncbi.nlm.nih.gov/pubmed/16867190
http://dx.doi.org/10.1186/1471-2105-7-356
work_keys_str_mv AT wilburwjohn newdirectionsinbiomedicaltextannotationdefinitionsguidelinesandcorpusconstruction
AT rzhetskyandrey newdirectionsinbiomedicaltextannotationdefinitionsguidelinesandcorpusconstruction
AT shatkayhagit newdirectionsinbiomedicaltextannotationdefinitionsguidelinesandcorpusconstruction