
NetiNeti: discovery of scientific names from text using machine learning methods

Bibliographic Details
Main Authors: Akella, Lakshmi Manohar; Norton, Catherine N; Miller, Holly
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3542245/
https://www.ncbi.nlm.nih.gov/pubmed/22913485
http://dx.doi.org/10.1186/1471-2105-13-211
author Akella, Lakshmi Manohar
Norton, Catherine N
Miller, Holly
collection PubMed
description BACKGROUND: A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. RESULTS: We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central’s full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. CONCLUSIONS: We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
format Online
Article
Text
id pubmed-3542245
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3542245 2013-01-11 BMC Bioinformatics Research Article BioMed Central 2012-08-22 /pmc/articles/PMC3542245/ /pubmed/22913485 http://dx.doi.org/10.1186/1471-2105-13-211 Text en Copyright ©2012 Akella et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title NetiNeti: discovery of scientific names from text using machine learning methods
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3542245/
https://www.ncbi.nlm.nih.gov/pubmed/22913485
http://dx.doi.org/10.1186/1471-2105-13-211
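
The description in this record outlines a two-stage approach: rules generate candidate names, and a probabilistic classifier then accepts or rejects each candidate using structural features of the name and features from its context. The following is a minimal sketch of the candidate-generation and feature-extraction steps only. It assumes a deliberately simple regular-expression rule and a hypothetical feature set; `CANDIDATE_RE`, `LATIN_ENDINGS`, `generate_candidates` and `extract_features` are illustrative names and are not taken from NetiNeti itself.

```python
import re

# Illustrative rule (an assumption, not NetiNeti's actual rule set):
# a capitalized genus-like token followed by a lowercase epithet-like token.
CANDIDATE_RE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{2,})\b")

# Hypothetical list of endings that often appear on Latinized epithets.
LATIN_ENDINGS = ("us", "a", "um", "is", "ae", "ii", "ensis")


def generate_candidates(text):
    """Return (candidate, start, end) for every non-overlapping rule match."""
    return [(m.group(0), m.start(), m.end()) for m in CANDIDATE_RE.finditer(text)]


def extract_features(text, start, end, window=2):
    """Build structural and contextual features for one candidate span."""
    genus, epithet = text[start:end].split()
    before = text[:start].split()[-window:]  # tokens just before the candidate
    after = text[end:].split()[:window]      # tokens just after the candidate
    return {
        # Structural features of the candidate itself.
        "genus_len": len(genus),
        "epithet_len": len(epithet),
        "epithet_latin_ending": epithet.endswith(LATIN_ENDINGS),
        # Contextual features from the surrounding tokens.
        "prev_tokens": tuple(t.lower() for t in before),
        "next_tokens": tuple(t.lower() for t in after),
    }
```

On the sentence "The frog Rana temporaria was observed.", this rule proposes both "The frog" and "Rana temporaria". The over-generation is intentional in the sketch: separating the two is the job of the classifier stage, which would be trained on feature dictionaries like the ones returned by `extract_features`.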