tmBioC: improving interoperability of text-mining tools with BioC
The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogenei...
Main authors: | Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2014 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4110697/ https://www.ncbi.nlm.nih.gov/pubmed/25062914 http://dx.doi.org/10.1093/database/bau073 |
_version_ | 1782328018864177152 |
---|---|
author | Khare, Ritu Wei, Chih-Hsuan Mao, Yuqing Leaman, Robert Lu, Zhiyong |
author_sort | Khare, Ritu |
collection | PubMed |
description | The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. 
The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/ |
format | Online Article Text |
id | pubmed-4110697 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-41106972014-09-04 tmBioC: improving interoperability of text-mining tools with BioC Khare, Ritu Wei, Chih-Hsuan Mao, Yuqing Leaman, Robert Lu, Zhiyong Database (Oxford) Original Article The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. 
Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/ Oxford University Press 2014-07-25 /pmc/articles/PMC4110697/ /pubmed/25062914 http://dx.doi.org/10.1093/database/bau073 Text en Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US. |
title | tmBioC: improving interoperability of text-mining tools with BioC |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4110697/ https://www.ncbi.nlm.nih.gov/pubmed/25062914 http://dx.doi.org/10.1093/database/bau073 |