“gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar

BACKGROUND: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, missp...

Bibliographic Details
Main Authors: Mozzherin, Dmitry Y., Myltsev, Alexander A., Patterson, David J.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5446698/
https://www.ncbi.nlm.nih.gov/pubmed/28549446
http://dx.doi.org/10.1186/s12859-017-1663-3
author Mozzherin, Dmitry Y.
Myltsev, Alexander A.
Patterson, David J.
collection PubMed
description BACKGROUND: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, misspellings, etc. Authorship is a part of a scientific name and may also differ significantly. To match all possible variations of a name we need to divide them into their elements and classify each element according to its role. We refer to this as ‘parsing’ the name. Parsing categorizes a name’s elements into those that are stable and those that are prone to change. Names are matched first by combining them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for automatic data exchange within the context of “Big Data” in biology.
RESULTS: We introduce Global Names Parser (gnparser). It is a Java tool written in the Scala language (a language for the Java Virtual Machine) to parse scientific names. It is based on a Parsing Expression Grammar. The parser can be applied to scientific names of any complexity. It assigns a semantic meaning (such as genus name, species epithet, rank, year of publication, authorship, annotations, etc.) to all elements of a name. It is able to work with nested structures, as in the names of hybrids. gnparser performs with ≈99% accuracy and processes 30 million name-strings/hour per CPU thread. The gnparser library is compatible with Scala, Java, R, Jython, and JRuby. The parser can be used as a command line application, as a socket server, as a web-app, or as a RESTful HTTP service. It is released under an open source MIT license.
CONCLUSIONS: Global Names Parser (gnparser) is a fast, high-precision tool for biodiversity informaticians and biologists working with large numbers of scientific names. It can replace expensive and error-prone manual parsing and standardization of scientific names in many situations, and can quickly enhance the interoperability of distributed biological information.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1663-3) contains supplementary material, which is available to authorized users.
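The two-stage matching described in the abstract (group names by their stable elements, then refine by their varying elements) can be illustrated with a small Python sketch. This is not gnparser and does not use a Parsing Expression Grammar: the regular expressions, the function names (split_name, group_by_canonical, same_year), and the sample name-strings are all assumptions made for illustration, and the toy pattern only handles simple binomials.

```python
import re
from collections import defaultdict

# Toy patterns (an assumption, NOT gnparser's grammar):
# a capitalized genus, a lowercase species epithet, then free-form authorship.
NAME_RE = re.compile(r"^\s*([A-Z][a-z]+)\s+([a-z-]+)\s*(.*)$")
YEAR_RE = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def split_name(name_string):
    """Split a simple binomial into stable elements (genus + epithet)
    and varying elements (authorship text, year). Returns None if the
    string does not look like a binomial."""
    m = NAME_RE.match(name_string)
    if m is None:
        return None
    genus, epithet, rest = m.groups()
    year = YEAR_RE.search(rest)
    return {
        "canonical": f"{genus} {epithet}",   # stable part
        "authorship": rest.strip(" ()"),     # varying part
        "year": year.group(1) if year else None,
    }


def group_by_canonical(name_strings):
    """Stage 1: bucket name-strings that share the same stable elements."""
    groups = defaultdict(list)
    for s in name_strings:
        parsed = split_name(s)
        if parsed is not None:
            groups[parsed["canonical"]].append((s, parsed))
    return dict(groups)


def same_year(a, b):
    """Stage 2 (one possible refinement): two parsed names are compatible
    if either lacks a year or both years are equal."""
    return a["year"] is None or b["year"] is None or a["year"] == b["year"]


if __name__ == "__main__":
    names = [
        "Homo sapiens Linnaeus, 1758",
        "Homo sapiens L.",
        "Homo sapiens (Linn. 1758)",
        "Pica pica (Linnaeus, 1758)",
    ]
    for canonical, members in group_by_canonical(names).items():
        print(canonical, [s for s, _ in members])
```

A real parser has to cope with hybrids, ranks, annotations, and nested authorship, which is where a grammar-based approach such as the one the abstract describes becomes necessary; the sketch only shows why separating stable from varying elements makes matching easier.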
format Online
Article
Text
id pubmed-5446698
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5446698 2017-05-30 “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar Mozzherin, Dmitry Y. Myltsev, Alexander A. Patterson, David J. BMC Bioinformatics Software BioMed Central 2017-05-26 /pmc/articles/PMC5446698/ /pubmed/28549446 http://dx.doi.org/10.1186/s12859-017-1663-3 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar
topic Software