tmBioC: improving interoperability of text-mining tools with BioC

The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogenei...

Bibliographic Details
Main authors: Khare, Ritu, Wei, Chih-Hsuan, Mao, Yuqing, Leaman, Robert, Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4110697/
https://www.ncbi.nlm.nih.gov/pubmed/25062914
http://dx.doi.org/10.1093/database/bau073
author Khare, Ritu
Wei, Chih-Hsuan
Mao, Yuqing
Leaman, Robert
Lu, Zhiyong
collection PubMed
description The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. 
The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/
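As an illustration of the BioC layout the description refers to (a collection of documents, each split into passages that carry text and stand-off annotations), the following is a minimal sketch in Python. It uses only the standard library rather than the BioC read/write functions mentioned above; the sample XML, the helper name `read_annotations`, and the entity types shown are assumptions made for this example and are not taken from the tmBioC corpus or tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the general BioC layout
# (collection > document > passage > annotation); not real tmBioC output.
SAMPLE = """<collection>
  <source>example</source>
  <date>2014-07-25</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 mutations are linked to breast cancer.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="30" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, annotation type, covered text) tuples.

    The covered text is recovered from the passage text using the
    annotation's offset and length, so it can be checked against the
    annotation's own <text> element.
    """
    root = ET.fromstring(xml_text)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            # Annotation offsets are document-level; subtract the passage offset.
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text") or ""
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                start = int(location.get("offset")) - passage_offset
                length = int(location.get("length"))
                infons = {i.get("key"): i.text for i in annotation.findall("infon")}
                covered = passage_text[start:start + length]
                results.append((doc_id, infons.get("type"), covered))
    return results


if __name__ == "__main__":
    for row in read_annotations(SAMPLE):
        print(row)
```

Because annotations point into the text by offset and length instead of being embedded inline, several tools can annotate the same passage independently, which is the property the description relies on for combining tools.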
format Online
Article
Text
id pubmed-4110697
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4110697 2014-09-04 tmBioC: improving interoperability of text-mining tools with BioC Khare, Ritu Wei, Chih-Hsuan Mao, Yuqing Leaman, Robert Lu, Zhiyong Database (Oxford) Original Article Oxford University Press 2014-07-25 /pmc/articles/PMC4110697/ /pubmed/25062914 http://dx.doi.org/10.1093/database/bau073 Text en Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
title tmBioC: improving interoperability of text-mining tools with BioC
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4110697/
https://www.ncbi.nlm.nih.gov/pubmed/25062914
http://dx.doi.org/10.1093/database/bau073