
BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature

BACKGROUND: To automatically process large quantities of biological literature for knowledge discovery and information curation, text mining tools are becoming essential. Abbreviation recognition is related to NER and can be considered as a pair recognition task of a terminology and its correspondin...


Bibliographic Details
Main Authors: Kuo, Cheng-Ju, Ling, Maurice HT, Lin, Kuan-Ting, Hsu, Chun-Nan
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/
https://www.ncbi.nlm.nih.gov/pubmed/19958517
http://dx.doi.org/10.1186/1471-2105-10-S15-S7
author Kuo, Cheng-Ju
Ling, Maurice HT
Lin, Kuan-Ting
Hsu, Chun-Nan
collection PubMed
description BACKGROUND: To automatically process large quantities of biological literature for knowledge discovery and information curation, text mining tools are becoming essential. Abbreviation recognition is related to NER and can be considered a pair recognition task: identifying a term and its corresponding abbreviation in free text. The successful identification of an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases to retrieve articles of related interest, but also a building block for improving existing gene mention tagging and gene normalization tools. RESULTS: Our approach to abbreviation recognition (AR) is based on machine learning, which exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system demonstrated an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the result achieved by the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On our annotated corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems. CONCLUSION: By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide offline AR software in the download section at http://bioagent.iis.sinica.edu.tw/BIOADI/.
format Text
id pubmed-2788358
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2788358 2009-12-04 BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature Kuo, Cheng-Ju Ling, Maurice HT Lin, Kuan-Ting Hsu, Chun-Nan BMC Bioinformatics Proceedings BioMed Central 2009-12-03 /pmc/articles/PMC2788358/ /pubmed/19958517 http://dx.doi.org/10.1186/1471-2105-10-S15-S7 Text en Copyright © 2009 Kuo et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/
https://www.ncbi.nlm.nih.gov/pubmed/19958517
http://dx.doi.org/10.1186/1471-2105-10-S15-S7
work_keys_str_mv AT kuochengju bioadiamachinelearningapproachtoidentifyingabbreviationsanddefinitionsinbiologicalliterature
AT lingmauriceht bioadiamachinelearningapproachtoidentifyingabbreviationsanddefinitionsinbiologicalliterature
AT linkuanting bioadiamachinelearningapproachtoidentifyingabbreviationsanddefinitionsinbiologicalliterature
AT hsuchunnan bioadiamachinelearningapproachtoidentifyingabbreviationsanddefinitionsinbiologicalliterature
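The F-scores quoted in the abstract can be checked against the reported precision and recall. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean) of precision and recall.

    Assumption: the abstract's "F-score" is F1. Inputs and output
    share the same units (here, percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
reported = {
    "AB3P corpus": (95.86, 84.64),
    "BioCreative II-derived corpus": (93.52, 79.95),
}

for corpus, (p, r) in reported.items():
    print(f"{corpus}: F = {f_score(p, r):.2f}%")
# AB3P corpus: F = 89.90%
# BioCreative II-derived corpus: F = 86.20%
```

Under that assumption, both computed values agree with the figures given in the abstract (89.90% and 86.20%).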