Ambiguity and variability of database and software names in bioinformatics

BACKGROUND: There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and softwa...

Bibliographic Details
Main authors: Duck, Geraint; Kovacevic, Aleksandar; Robertson, David L.; Stevens, Robert; Nenadic, Goran
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4485340/
https://www.ncbi.nlm.nih.gov/pubmed/26131352
http://dx.doi.org/10.1186/s13326-015-0026-0
author Duck, Geraint
Kovacevic, Aleksandar
Robertson, David L.
Stevens, Robert
Nenadic, Goran
collection PubMed
description BACKGROUND: There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification. RESULTS: Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature. CONCLUSIONS: Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
format Online
Article
Text
id pubmed-4485340
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4485340 2015-07-01 J Biomed Semantics Research Article BioMed Central 2015-06-29 /pmc/articles/PMC4485340/ /pubmed/26131352 http://dx.doi.org/10.1186/s13326-015-0026-0 Text en © Duck et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Ambiguity and variability of database and software names in bioinformatics
topic Research Article