New directions in biomedical text annotation: definitions, guidelines and corpus construction
BACKGROUND: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of info...
Main authors: | Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559725/ https://www.ncbi.nlm.nih.gov/pubmed/16867190 http://dx.doi.org/10.1186/1471-2105-7-356 |
_version_ | 1782129456748429312 |
---|---|
author | Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit |
collection | PubMed |
description | BACKGROUND: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. RESULTS: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. CONCLUSION: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. |
format | Text |
id | pubmed-1559725 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1559725 2006-09-05 New directions in biomedical text annotation: definitions, guidelines and corpus construction Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit BMC Bioinformatics Methodology Article BioMed Central 2006-07-25 /pmc/articles/PMC1559725/ /pubmed/16867190 http://dx.doi.org/10.1186/1471-2105-7-356 Text en Copyright © 2006 Wilbur et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | New directions in biomedical text annotation: definitions, guidelines and corpus construction |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559725/ https://www.ncbi.nlm.nih.gov/pubmed/16867190 http://dx.doi.org/10.1186/1471-2105-7-356 |