“gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar
BACKGROUND: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, missp...
Main authors: | Mozzherin, Dmitry Y.; Myltsev, Alexander A.; Patterson, David J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5446698/ https://www.ncbi.nlm.nih.gov/pubmed/28549446 http://dx.doi.org/10.1186/s12859-017-1663-3 |
_version_ | 1783239138853519360 |
---|---|
author | Mozzherin, Dmitry Y. Myltsev, Alexander A. Patterson, David J. |
author_sort | Mozzherin, Dmitry Y. |
collection | PubMed |
description | BACKGROUND: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, misspellings, etc. Authorship is a part of a scientific name and may also differ significantly. To match all possible variations of a name, we need to divide them into their elements and classify each element according to its role. We refer to this as ‘parsing’ the name. Parsing categorizes a name’s elements into those that are stable and those that are prone to change. Names are matched first by combining them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for automatic data exchange within the context of “Big Data” in biology. RESULTS: We introduce Global Names Parser (gnparser). It is a Java tool written in the Scala language (a language for the Java Virtual Machine) to parse scientific names. It is based on a Parsing Expression Grammar. The parser can be applied to scientific names of any complexity. It assigns a semantic meaning (such as genus name, species epithet, rank, year of publication, authorship, annotations, etc.) to all elements of a name. It is able to work with nested structures, as in the names of hybrids. gnparser performs with ≈99% accuracy and processes 30 million name-strings/hour per CPU thread. The gnparser library is compatible with Scala, Java, R, Jython, and JRuby. The parser can be used as a command-line application, as a socket server, as a web app, or as a RESTful HTTP service. It is released under an open source MIT license. CONCLUSIONS: Global Names Parser (gnparser) is a fast, high-precision tool for biodiversity informaticians and biologists working with large numbers of scientific names. It can replace expensive and error-prone manual parsing and standardization of scientific names in many situations, and can quickly enhance the interoperability of distributed biological information. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1663-3) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5446698 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5446698 2017-05-30 “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar Mozzherin, Dmitry Y. Myltsev, Alexander A. Patterson, David J. BMC Bioinformatics Software BioMed Central 2017-05-26 /pmc/articles/PMC5446698/ /pubmed/28549446 http://dx.doi.org/10.1186/s12859-017-1663-3 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Mozzherin, Dmitry Y. Myltsev, Alexander A. Patterson, David J. “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar |
title | “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5446698/ https://www.ncbi.nlm.nih.gov/pubmed/28549446 http://dx.doi.org/10.1186/s12859-017-1663-3 |
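
The two-stage matching described in the abstract (group names by their stable elements, then refine by their varying elements) can be illustrated with a minimal sketch. This is not gnparser and does not use a Parsing Expression Grammar: it is a toy regular-expression parser that handles only simple binomials. The names `ParsedName`, `parse_name`, `group_by_canonical`, and `refine_by_year` are hypothetical, the year-based refinement rule is an assumption made for illustration, and the input strings are illustrative (one authorship is deliberately made up).

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class ParsedName(NamedTuple):
    """Elements of a simple binomial name (toy model, not gnparser's output)."""
    genus: str
    epithet: str
    authorship: Optional[str]
    year: Optional[str]

    @property
    def canonical(self) -> str:
        # Stable elements only: genus plus species epithet.
        return f"{self.genus} {self.epithet}"


# Hypothetical toy patterns; gnparser's real grammar covers far more cases.
_BINOMIAL = re.compile(r"\s*([A-Z][a-z]+)\s+([a-z]+)(?:\s+(.*))?\s*$")
_YEAR = re.compile(r"\b\d{4}\b")


def parse_name(name_string: str) -> Optional[ParsedName]:
    """Split a name-string into genus, epithet, authorship and year."""
    m = _BINOMIAL.match(name_string)
    if m is None:
        return None  # not a simple binomial this sketch understands
    genus, epithet, rest = m.groups()
    # Drop surrounding parentheses, e.g. "(Linnaeus, 1771)".
    rest = (rest or "").strip().strip("()").strip()
    year_match = _YEAR.search(rest)
    year = year_match.group(0) if year_match else None
    # Whatever remains after removing the year is treated as authorship.
    authorship = _YEAR.sub("", rest).strip(" ,") or None
    return ParsedName(genus, epithet, authorship, year)


def group_by_canonical(
    name_strings: List[str],
) -> Dict[str, List[Tuple[str, ParsedName]]]:
    """Stage 1: group name-strings that share the same stable elements."""
    groups: Dict[str, List[Tuple[str, ParsedName]]] = defaultdict(list)
    for s in name_strings:
        parsed = parse_name(s)
        if parsed is not None:
            groups[parsed.canonical].append((s, parsed))
    return dict(groups)


def refine_by_year(
    candidates: List[Tuple[str, ParsedName]], year: str
) -> List[str]:
    """Stage 2 (assumed rule): keep candidates whose year agrees or is absent."""
    return [s for s, p in candidates if p.year is None or p.year == year]


if __name__ == "__main__":
    names = [
        "Homo sapiens Linnaeus, 1758",
        "Homo sapiens L.",
        "Homo sapiens",
        "Puma concolor (Linnaeus, 1771)",
        "Puma concolor Exampleauthor, 1900",  # deliberately made-up authorship
    ]
    groups = group_by_canonical(names)
    for canonical, members in groups.items():
        print(canonical, [s for s, _ in members])
    print(refine_by_year(groups["Puma concolor"], "1771"))
```

Grouping on the canonical form first keeps the comparison of authorship and year confined to small candidate sets, which is the point of separating stable from varying elements.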