Solr-Plant: efficient extraction of plant names from text
BACKGROUND: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomi...
Main authors: | Sharma, Vivekanand; Restrepo, Maria Isabel; Sarkar, Indra Neil |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2019 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530169/ https://www.ncbi.nlm.nih.gov/pubmed/31117932 http://dx.doi.org/10.1186/s12859-019-2874-6 |
_version_ | 1783420571813412864 |
---|---|
author | Sharma, Vivekanand Restrepo, Maria Isabel Sarkar, Indra Neil |
author_facet | Sharma, Vivekanand Restrepo, Maria Isabel Sarkar, Indra Neil |
author_sort | Sharma, Vivekanand |
collection | PubMed |
description | BACKGROUND: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomic names. RESULTS: An Apache Solr-based fuzzy matching system enhanced with the Smith-Waterman alignment algorithm (“Solr-Plant”) was developed for mapping and resolution to a plant name and synonym thesaurus. Evaluation of Solr-Plant suggests promising results in terms of both accuracy and processing efficiency on misspelled species names from two benchmark datasets: (1) SALVIAS and (2) National Center for Biotechnology Information (NCBI) Taxonomy. Additional evaluation using S800 text corpus also reflects high precision and recall. The latest version of the source code is available at https://github.com/bcbi/SolrPlantAPI. A REST-compliant web interface and service for Solr-Plant is hosted at http://bcbi.brown.edu/solrplant. CONCLUSION: Automated techniques are needed for efficient and accurate identification of knowledge linked with biological scientific names. Solr-Plant complements the current state-of-the-art in terms of both efficiency and accuracy in identification of names restricted at species level. The approach can be extended to identify broader groups of organisms at different taxonomic levels. The results reflect potential utility of Solr-Plant as a data mining tool for extracting and correcting plant species names. |
format | Online Article Text |
id | pubmed-6530169 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-65301692019-05-28 Solr-Plant: efficient extraction of plant names from text Sharma, Vivekanand Restrepo, Maria Isabel Sarkar, Indra Neil BMC Bioinformatics Software BACKGROUND: The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomic names. RESULTS: An Apache Solr-based fuzzy matching system enhanced with the Smith-Waterman alignment algorithm (“Solr-Plant”) was developed for mapping and resolution to a plant name and synonym thesaurus. Evaluation of Solr-Plant suggests promising results in terms of both accuracy and processing efficiency on misspelled species names from two benchmark datasets: (1) SALVIAS and (2) National Center for Biotechnology Information (NCBI) Taxonomy. Additional evaluation using S800 text corpus also reflects high precision and recall. The latest version of the source code is available at https://github.com/bcbi/SolrPlantAPI. A REST-compliant web interface and service for Solr-Plant is hosted at http://bcbi.brown.edu/solrplant. CONCLUSION: Automated techniques are needed for efficient and accurate identification of knowledge linked with biological scientific names. Solr-Plant complements the current state-of-the-art in terms of both efficiency and accuracy in identification of names restricted at species level. The approach can be extended to identify broader groups of organisms at different taxonomic levels. The results reflect potential utility of Solr-Plant as a data mining tool for extracting and correcting plant species names. BioMed Central 2019-05-22 /pmc/articles/PMC6530169/ /pubmed/31117932 http://dx.doi.org/10.1186/s12859-019-2874-6 Text en © The Author(s). 
2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Sharma, Vivekanand Restrepo, Maria Isabel Sarkar, Indra Neil Solr-Plant: efficient extraction of plant names from text |
title | Solr-Plant: efficient extraction of plant names from text |
title_full | Solr-Plant: efficient extraction of plant names from text |
title_fullStr | Solr-Plant: efficient extraction of plant names from text |
title_full_unstemmed | Solr-Plant: efficient extraction of plant names from text |
title_short | Solr-Plant: efficient extraction of plant names from text |
title_sort | solr-plant: efficient extraction of plant names from text |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530169/ https://www.ncbi.nlm.nih.gov/pubmed/31117932 http://dx.doi.org/10.1186/s12859-019-2874-6 |
work_keys_str_mv | AT sharmavivekanand solrplantefficientextractionofplantnamesfromtext AT restrepomariaisabel solrplantefficientextractionofplantnamesfromtext AT sarkarindraneil solrplantefficientextractionofplantnamesfromtext |
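
The description above mentions fuzzy matching enhanced with the Smith-Waterman alignment algorithm. As a rough illustration of that idea only, the sketch below scores a misspelled name against a handful of candidate names with a basic Smith-Waterman local alignment and keeps the highest-scoring one. It is not taken from the Solr-Plant source code: the function names, the scoring values (match 2, mismatch -1, gap -1), and the candidate list are all assumptions chosen for the example, and the candidates are assumed to have been returned already by some earlier fuzzy search step.

```python
from __future__ import annotations


def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_match(query: str, candidates: list[str]) -> str:
    """Pick the candidate with the highest alignment score (hypothetical helper)."""
    q = query.lower()
    return max(candidates, key=lambda name: smith_waterman(q, name.lower()))


if __name__ == "__main__":
    # Hypothetical candidate list standing in for fuzzy search results
    names = ["Quercus rubra", "Quercus robur", "Quercus alba"]
    print(best_match("Quercus robar", names))
```

Keeping only two rows of the scoring matrix is enough here because the sketch needs the best score, not the alignment itself; recovering the aligned substrings would require the full matrix and a traceback.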