Repositories for Taxonomic Data: Where We Are and What is Missing

Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of o...

Bibliographic Details
Main Authors: Miralles, Aurélien, Bruy, Teddy, Wolcott, Katherine, Scherz, Mark D, Begerow, Dominik, Beszteri, Bank, Bonkowski, Michael, Felden, Janine, Gemeinholzer, Birgit, Glaw, Frank, Glöckner, Frank Oliver, Hawlitschek, Oliver, Kostadinov, Ivaylo, Nattkemper, Tim W, Printzen, Christian, Renz, Jasmin, Rybalka, Nataliya, Stadler, Marc, Weibulat, Tanja, Wilke, Thomas, Renner, Susanne S, Vences, Miguel
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7584136/
https://www.ncbi.nlm.nih.gov/pubmed/32298457
http://dx.doi.org/10.1093/sysbio/syaa026
author Miralles, Aurélien
Bruy, Teddy
Wolcott, Katherine
Scherz, Mark D
Begerow, Dominik
Beszteri, Bank
Bonkowski, Michael
Felden, Janine
Gemeinholzer, Birgit
Glaw, Frank
Glöckner, Frank Oliver
Hawlitschek, Oliver
Kostadinov, Ivaylo
Nattkemper, Tim W
Printzen, Christian
Renz, Jasmin
Rybalka, Nataliya
Stadler, Marc
Weibulat, Tanja
Wilke, Thomas
Renner, Susanne S
Vences, Miguel
collection PubMed
description Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term—ideally perpetual—data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. 
For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach—linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated [Formula: see text] 2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
format Online
Article
Text
id pubmed-7584136
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7584136 2020-10-29 Syst Biol Point of View Oxford University Press 2020-04-16 /pmc/articles/PMC7584136/ /pubmed/32298457 http://dx.doi.org/10.1093/sysbio/syaa026 Text en © The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Repositories for Taxonomic Data: Where We Are and What is Missing
topic Point of View
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7584136/
https://www.ncbi.nlm.nih.gov/pubmed/32298457
http://dx.doi.org/10.1093/sysbio/syaa026