A Standardized Reference Data Set for Vertebrate Taxon Name Resolution

Bibliographic Details
Main authors:	Zermoglio, Paula F.; Guralnick, Robert P.; Wieczorek, John R.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4711887/ https://www.ncbi.nlm.nih.gov/pubmed/26760296 http://dx.doi.org/10.1371/journal.pone.0146894

collection	PubMed
description	Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
id	pubmed-4711887
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	PLoS One
published	2016-01-13
license	© 2016 Zermoglio et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.