
Improved standardization of transcribed digital specimen data

There are more than 1.2 billion biological specimens in the world’s museums and herbaria. These objects are particularly important forms of biological sample and observation. They underpin biological taxonomy but the data they contain have many other uses in the biological and environmental sciences...

Bibliographic Details
Main Authors: Groom, Quentin; Dillen, Mathias; Hardy, Helen; Phillips, Sarah; Willemse, Luc; Wu, Zhengzhe
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6901386/
https://www.ncbi.nlm.nih.gov/pubmed/31819990
http://dx.doi.org/10.1093/database/baz129
_version_ 1783477487512059904
author Groom, Quentin
Dillen, Mathias
Hardy, Helen
Phillips, Sarah
Willemse, Luc
Wu, Zhengzhe
author_facet Groom, Quentin
Dillen, Mathias
Hardy, Helen
Phillips, Sarah
Willemse, Luc
Wu, Zhengzhe
author_sort Groom, Quentin
collection PubMed
description There are more than 1.2 billion biological specimens in the world’s museums and herbaria. These objects are particularly important forms of biological sample and observation. They underpin biological taxonomy but the data they contain have many other uses in the biological and environmental sciences. Nevertheless, from their conception they are almost entirely documented on paper, either as labels attached to the specimens or in catalogues linked with catalogue numbers. In order to make the best use of these data and to improve the findability of these specimens, these data must be transcribed digitally and made to conform to standards, so that these data are also interoperable and reusable. Through various digitization projects, the authors have experimented with transcription by volunteers, expert technicians, scientists, commercial transcription services and automated systems. We have also been consumers of specimen data for taxonomical, biogeographical and ecological research. In this paper, we draw from our experiences to make specific recommendations to improve transcription data. The paper is split into two sections. We first address issues related to database implementation with relevance to data transcription, namely versioning, annotation, unknown and incomplete data and issues related to language. We then focus on particular data types that are relevant to biological collection specimens, namely nomenclature, dates, geography, collector numbers and uniquely identifying people. We make recommendations to standards organizations, software developers, data scientists and transcribers to improve these data with the specific aim of improving interoperability between collection datasets.
format Online
Article
Text
id pubmed-6901386
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6901386 2019-12-16 Improved standardization of transcribed digital specimen data Groom, Quentin Dillen, Mathias Hardy, Helen Phillips, Sarah Willemse, Luc Wu, Zhengzhe Database (Oxford) Original Article There are more than 1.2 billion biological specimens in the world’s museums and herbaria. These objects are particularly important forms of biological sample and observation. They underpin biological taxonomy but the data they contain have many other uses in the biological and environmental sciences. Nevertheless, from their conception they are almost entirely documented on paper, either as labels attached to the specimens or in catalogues linked with catalogue numbers. In order to make the best use of these data and to improve the findability of these specimens, these data must be transcribed digitally and made to conform to standards, so that these data are also interoperable and reusable. Through various digitization projects, the authors have experimented with transcription by volunteers, expert technicians, scientists, commercial transcription services and automated systems. We have also been consumers of specimen data for taxonomical, biogeographical and ecological research. In this paper, we draw from our experiences to make specific recommendations to improve transcription data. The paper is split into two sections. We first address issues related to database implementation with relevance to data transcription, namely versioning, annotation, unknown and incomplete data and issues related to language. We then focus on particular data types that are relevant to biological collection specimens, namely nomenclature, dates, geography, collector numbers and uniquely identifying people. We make recommendations to standards organizations, software developers, data scientists and transcribers to improve these data with the specific aim of improving interoperability between collection datasets.
Oxford University Press 2019-12-09 /pmc/articles/PMC6901386/ /pubmed/31819990 http://dx.doi.org/10.1093/database/baz129 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Groom, Quentin
Dillen, Mathias
Hardy, Helen
Phillips, Sarah
Willemse, Luc
Wu, Zhengzhe
Improved standardization of transcribed digital specimen data
title Improved standardization of transcribed digital specimen data
title_full Improved standardization of transcribed digital specimen data
title_fullStr Improved standardization of transcribed digital specimen data
title_full_unstemmed Improved standardization of transcribed digital specimen data
title_short Improved standardization of transcribed digital specimen data
title_sort improved standardization of transcribed digital specimen data
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6901386/
https://www.ncbi.nlm.nih.gov/pubmed/31819990
http://dx.doi.org/10.1093/database/baz129
work_keys_str_mv AT groomquentin improvedstandardizationoftranscribeddigitalspecimendata
AT dillenmathias improvedstandardizationoftranscribeddigitalspecimendata
AT hardyhelen improvedstandardizationoftranscribeddigitalspecimendata
AT phillipssarah improvedstandardizationoftranscribeddigitalspecimendata
AT willemseluc improvedstandardizationoftranscribeddigitalspecimendata
AT wuzhengzhe improvedstandardizationoftranscribeddigitalspecimendata