The variable quality of metadata about biological samples used in biomedical experiments
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed...
Main Authors: | Gonçalves, Rafael S., Musen, Mark A. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group 2019 |
Subjects: | Analysis |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6380228/ https://www.ncbi.nlm.nih.gov/pubmed/30778255 http://dx.doi.org/10.1038/sdata.2019.21 |
_version_ | 1783396280996724736 |
---|---|
author | Gonçalves, Rafael S. Musen, Mark A. |
author_facet | Gonçalves, Rafael S. Musen, Mark A. |
author_sort | Gonçalves, Rafael S. |
collection | PubMed |
description | We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets. |
format | Online Article Text |
id | pubmed-6380228 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-63802282019-02-21 The variable quality of metadata about biological samples used in biomedical experiments Gonçalves, Rafael S. Musen, Mark A. Sci Data Analysis We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets. Nature Publishing Group 2019-02-19 /pmc/articles/PMC6380228/ /pubmed/30778255 http://dx.doi.org/10.1038/sdata.2019.21 Text en Copyright © 2019, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
spellingShingle | Analysis Gonçalves, Rafael S. Musen, Mark A. The variable quality of metadata about biological samples used in biomedical experiments |
title | The variable quality of metadata about biological samples used in biomedical experiments |
title_full | The variable quality of metadata about biological samples used in biomedical experiments |
title_fullStr | The variable quality of metadata about biological samples used in biomedical experiments |
title_full_unstemmed | The variable quality of metadata about biological samples used in biomedical experiments |
title_short | The variable quality of metadata about biological samples used in biomedical experiments |
title_sort | variable quality of metadata about biological samples used in biomedical experiments |
topic | Analysis |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6380228/ https://www.ncbi.nlm.nih.gov/pubmed/30778255 http://dx.doi.org/10.1038/sdata.2019.21 |
work_keys_str_mv | AT goncalvesrafaels thevariablequalityofmetadataaboutbiologicalsamplesusedinbiomedicalexperiments AT musenmarka thevariablequalityofmetadataaboutbiologicalsamplesusedinbiomedicalexperiments AT goncalvesrafaels variablequalityofmetadataaboutbiologicalsamplesusedinbiomedicalexperiments AT musenmarka variablequalityofmetadataaboutbiologicalsamplesusedinbiomedicalexperiments |