
Abbreviation definition identification based on automatic precision estimates

BACKGROUND: The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction...


Bibliographic Details
Main authors:	Sohn, Sunghwan; Comeau, Donald C; Kim, Won; Wilbur, W John
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576267/ https://www.ncbi.nlm.nih.gov/pubmed/18817555 http://dx.doi.org/10.1186/1471-2105-9-402

_version_	1782160381307781120
author	Sohn, Sunghwan; Comeau, Donald C; Kim, Won; Wilbur, W John
author_facet	Sohn, Sunghwan; Comeau, Donald C; Kim, Won; Wilbur, W John
author_sort	Sohn, Sunghwan
collection	PubMed
description	BACKGROUND: The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. RESULTS: On the Medstract corpus our algorithm produced 97% precision and 85% recall, which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well-known Schwartz and Hearst algorithm. CONCLUSION: We developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.
format	Text
id	pubmed-2576267
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2576267 2008-10-31 Abbreviation definition identification based on automatic precision estimates. Sohn, Sunghwan; Comeau, Donald C; Kim, Won; Wilbur, W John. BMC Bioinformatics. Research Article.
BioMed Central 2008-09-25 /pmc/articles/PMC2576267/ /pubmed/18817555 http://dx.doi.org/10.1186/1471-2105-9-402 Text en Copyright © 2008 Sohn et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article; Sohn, Sunghwan; Comeau, Donald C; Kim, Won; Wilbur, W John; Abbreviation definition identification based on automatic precision estimates
title	Abbreviation definition identification based on automatic precision estimates
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576267/ https://www.ncbi.nlm.nih.gov/pubmed/18817555 http://dx.doi.org/10.1186/1471-2105-9-402
work_keys_str_mv	AT sohnsunghwan abbreviationdefinitionidentificationbasedonautomaticprecisionestimates AT comeaudonaldc abbreviationdefinitionidentificationbasedonautomaticprecisionestimates AT kimwon abbreviationdefinitionidentificationbasedonautomaticprecisionestimates AT wilburwjohn abbreviationdefinitionidentificationbasedonautomaticprecisionestimates
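
The description field above says that the pseudo-precisions determine the order in which the algorithm applies its strategies. A minimal sketch of that ordering idea follows, in Python. Everything in it is an assumption made for illustration: the two toy strategies (first_letters, first_letter_anchor), the scores 0.95 and 0.60, and the function find_definition are hypothetical and are not the strategies, estimates, or code from the paper by Sohn et al.

```python
from typing import Callable, List, Optional, Tuple

# A strategy takes (abbreviation, text preceding it) and returns a
# candidate definition, or None if it does not apply.
Strategy = Callable[[str, str], Optional[str]]


def first_letters(abbr: str, text: str) -> Optional[str]:
    """Hypothetical strict strategy: each abbreviation letter must be the
    initial of one of the last len(abbr) words, in order."""
    words = text.split()
    if not abbr or len(words) < len(abbr):
        return None
    candidate = words[-len(abbr):]
    if all(w[0].lower() == c.lower() for w, c in zip(candidate, abbr)):
        return " ".join(candidate)
    return None


def first_letter_anchor(abbr: str, text: str) -> Optional[str]:
    """Hypothetical looser strategy: take everything from the nearest word
    that starts with the abbreviation's first letter."""
    words = text.split()
    if not abbr:
        return None
    for i in range(len(words) - 1, -1, -1):
        if words[i][0].lower() == abbr[0].lower():
            return " ".join(words[i:])
    return None


# Made-up pseudo-precision scores, for illustration only.
STRATEGIES: List[Tuple[float, Strategy]] = [
    (0.60, first_letter_anchor),
    (0.95, first_letters),
]


def find_definition(
    abbr: str, text: str, strategies: List[Tuple[float, Strategy]]
) -> Optional[Tuple[str, float]]:
    """Try strategies from highest to lowest score and return the first
    definition found together with the score of the strategy that found it."""
    for score, strategy in sorted(strategies, key=lambda p: p[0], reverse=True):
        definition = strategy(abbr, text)
        if definition is not None:
            return definition, score
    return None


if __name__ == "__main__":
    print(find_definition("HMM", "we trained a hidden Markov model", STRATEGIES))
    print(find_definition("Ab", "binding of the antibody", STRATEGIES))
```

Returning the score with the definition mirrors the idea in the abstract that each extracted pair carries an accuracy estimate; how the real pseudo-precisions are computed is not stated in this record and is not attempted here.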
