
Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora

BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomed...


Bibliographic Details
Main authors: Islamaj Doğan, Rezarta; Comeau, Donald C.; Yeganova, Lana; Wilbur, W. John
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4051513/
https://www.ncbi.nlm.nih.gov/pubmed/24914232
http://dx.doi.org/10.1093/database/bau044
author Islamaj Doğan, Rezarta
Comeau, Donald C.
Yeganova, Lana
Wilbur, W. John
collection PubMed
description BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information—that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed(®) citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators and their consistency and quality levels have been improved. We converted them to BioC-format and described the representation of the annotations. These corpora are used to measure the three abbreviation-finding algorithms and the results are given. The BioC-compatible modules, when compared with their original form, have no difference in their efficiency, running time or any other comparable aspects. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.
format Online
Article
Text
id pubmed-4051513
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4051513 2014-06-13 Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora Islamaj Doğan, Rezarta Comeau, Donald C. Yeganova, Lana Wilbur, W. John Database (Oxford) Original Article
Oxford University Press 2014-06-09 /pmc/articles/PMC4051513/ /pubmed/24914232 http://dx.doi.org/10.1093/database/bau044 Text en Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
title Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4051513/
https://www.ncbi.nlm.nih.gov/pubmed/24914232
http://dx.doi.org/10.1093/database/bau044