Solr-Plant: efficient extraction of plant names from text

BACKGROUND: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomi...

Bibliographic Details
Main Authors: Sharma, Vivekanand; Restrepo, Maria Isabel; Sarkar, Indra Neil
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530169/
https://www.ncbi.nlm.nih.gov/pubmed/31117932
http://dx.doi.org/10.1186/s12859-019-2874-6
author Sharma, Vivekanand
Restrepo, Maria Isabel
Sarkar, Indra Neil
collection PubMed
description BACKGROUND: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomic names. RESULTS: An Apache Solr-based fuzzy matching system enhanced with the Smith-Waterman alignment algorithm (“Solr-Plant”) was developed for mapping and resolution to a plant name and synonym thesaurus. Evaluation of Solr-Plant suggests promising results in terms of both accuracy and processing efficiency on misspelled species names from two benchmark datasets: (1) SALVIAS and (2) National Center for Biotechnology Information (NCBI) Taxonomy. Additional evaluation using S800 text corpus also reflects high precision and recall. The latest version of the source code is available at https://github.com/bcbi/SolrPlantAPI. A REST-compliant web interface and service for Solr-Plant is hosted at http://bcbi.brown.edu/solrplant. CONCLUSION: Automated techniques are needed for efficient and accurate identification of knowledge linked with biological scientific names. Solr-Plant complements the current state-of-the-art in terms of both efficiency and accuracy in identification of names restricted at species level. The approach can be extended to identify broader groups of organisms at different taxonomic levels. The results reflect potential utility of Solr-Plant as a data mining tool for extracting and correcting plant species names.
format Online
Article
Text
id pubmed-6530169
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6530169 2019-05-28 Solr-Plant: efficient extraction of plant names from text Sharma, Vivekanand Restrepo, Maria Isabel Sarkar, Indra Neil BMC Bioinformatics Software BioMed Central 2019-05-22 /pmc/articles/PMC6530169/ /pubmed/31117932 http://dx.doi.org/10.1186/s12859-019-2874-6 Text en © The Author(s). 2019. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Solr-Plant: efficient extraction of plant names from text
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530169/
https://www.ncbi.nlm.nih.gov/pubmed/31117932
http://dx.doi.org/10.1186/s12859-019-2874-6
work_keys_str_mv AT sharmavivekanand solrplantefficientextractionofplantnamesfromtext
AT restrepomariaisabel solrplantefficientextractionofplantnamesfromtext
AT sarkarindraneil solrplantefficientextractionofplantnamesfromtext
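
The description above mentions fuzzy matching enhanced with the Smith-Waterman alignment algorithm. As a rough illustration of that idea only, the sketch below scores a misspelled name against a few candidate names using a plain Smith-Waterman local alignment and picks the highest-scoring one. It is not the Solr-Plant implementation: the scoring values (match, mismatch, gap), the function names, and the three-name candidate list are assumptions made for this example, and no Solr retrieval step is included.

```python
from typing import Iterable


def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_match(query: str, candidates: Iterable[str]) -> str:
    """Return the candidate with the highest alignment score (case-insensitive)."""
    q = query.lower()
    return max(candidates, key=lambda name: smith_waterman(q, name.lower()))


if __name__ == "__main__":
    # Hypothetical candidate list standing in for a name thesaurus.
    names = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
    print(best_match("Quercus albba", names))
```

Raw alignment scores favor longer candidates, so a real system would likely normalize the score or first narrow the candidate set with a search index before aligning.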