
The variable quality of metadata about biological samples used in biomedical experiments

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed...


Bibliographic Details
Main Authors: Gonçalves, Rafael S., Musen, Mark A.
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6380228/
https://www.ncbi.nlm.nih.gov/pubmed/30778255
http://dx.doi.org/10.1038/sdata.2019.21
_version_ 1783396280996724736
author Gonçalves, Rafael S.
Musen, Mark A.
author_facet Gonçalves, Rafael S.
Musen, Mark A.
author_sort Gonçalves, Rafael S.
collection PubMed
description We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
format Online
Article
Text
id pubmed-6380228
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-6380228 2019-02-21 The variable quality of metadata about biological samples used in biomedical experiments Gonçalves, Rafael S. Musen, Mark A. Sci Data Analysis Nature Publishing Group 2019-02-19 /pmc/articles/PMC6380228/ /pubmed/30778255 http://dx.doi.org/10.1038/sdata.2019.21 Text en Copyright © 2019, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
spellingShingle Analysis
Gonçalves, Rafael S.
Musen, Mark A.
The variable quality of metadata about biological samples used in biomedical experiments
title The variable quality of metadata about biological samples used in biomedical experiments
title_sort variable quality of metadata about biological samples used in biomedical experiments
topic Analysis
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6380228/
https://www.ncbi.nlm.nih.gov/pubmed/30778255
http://dx.doi.org/10.1038/sdata.2019.21