Non-negligible Occurrence of Errors in Gender Description in Public Data Sets

Bibliographic Details
Main Authors: Kim, Jong Hwan; Park, Jong-Luyl; Kim, Seon-Young
Format: Online Article Text
Language: English
Published: Korea Genome Organization 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4838528/
https://www.ncbi.nlm.nih.gov/pubmed/27103889
http://dx.doi.org/10.5808/GI.2016.14.1.34
_version_ 1782427990187048960
author Kim, Jong Hwan
Park, Jong-Luyl
Kim, Seon-Young
author_facet Kim, Jong Hwan
Park, Jong-Luyl
Kim, Seon-Young
author_sort Kim, Jong Hwan
collection PubMed
description Due to advances in omics technologies, numerous genome-wide studies on human samples have been published, and most of the omics data with the associated clinical information are available in public repositories, such as Gene Expression Omnibus and ArrayExpress. While analyzing several public datasets, we observed that errors in gender information occur quite often in public datasets. When we analyzed the gender description and the methylation patterns of gender-specific probes (glucose-6-phosphate dehydrogenase [G6PD], ephrin-B1 [EFNB1], and testis specific protein, Y-linked 2 [TSPY2]) in 5,611 samples produced using Infinium 450K HumanMethylation arrays, we found that 19 samples from 7 datasets were erroneously described. We also analyzed 1,819 samples produced using the Affymetrix U133Plus2 array using several gender-specific genes (X (inactive)-specific transcript [XIST], eukaryotic translation initiation factor 1A, Y-linked [EIF1AY], and DEAD [Asp-Glu-Ala-Asp] box polypeptide 3, Y-linked [DDX3Y]) and found that 40 samples from 3 datasets were erroneously described. We suggest that the users of public datasets should not expect that the data are error-free and, whenever possible, that they should check the consistency of the data.
format Online
Article
Text
id pubmed-4838528
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Korea Genome Organization
record_format MEDLINE/PubMed
spelling pubmed-4838528 2016-04-21 Non-negligible Occurrence of Errors in Gender Description in Public Data Sets Kim, Jong Hwan Park, Jong-Luyl Kim, Seon-Young Genomics Inform Original Article Due to advances in omics technologies, numerous genome-wide studies on human samples have been published, and most of the omics data with the associated clinical information are available in public repositories, such as Gene Expression Omnibus and ArrayExpress. While analyzing several public datasets, we observed that errors in gender information occur quite often in public datasets. When we analyzed the gender description and the methylation patterns of gender-specific probes (glucose-6-phosphate dehydrogenase [G6PD], ephrin-B1 [EFNB1], and testis specific protein, Y-linked 2 [TSPY2]) in 5,611 samples produced using Infinium 450K HumanMethylation arrays, we found that 19 samples from 7 datasets were erroneously described. We also analyzed 1,819 samples produced using the Affymetrix U133Plus2 array using several gender-specific genes (X (inactive)-specific transcript [XIST], eukaryotic translation initiation factor 1A, Y-linked [EIF1AY], and DEAD [Asp-Glu-Ala-Asp] box polypeptide 3, Y-linked [DDX3Y]) and found that 40 samples from 3 datasets were erroneously described. We suggest that the users of public datasets should not expect that the data are error-free and, whenever possible, that they should check the consistency of the data. Korea Genome Organization 2016-03 2016-03-31 /pmc/articles/PMC4838528/ /pubmed/27103889 http://dx.doi.org/10.5808/GI.2016.14.1.34 Text en Copyright © 2016 by the Korea Genome Organization http://creativecommons.org/licenses/by-nc/4.0/ It is identical to the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).
spellingShingle Original Article
Kim, Jong Hwan
Park, Jong-Luyl
Kim, Seon-Young
Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
title Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
title_full Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
title_fullStr Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
title_full_unstemmed Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
title_short Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
title_sort non-negligible occurrence of errors in gender description in public data sets
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4838528/
https://www.ncbi.nlm.nih.gov/pubmed/27103889
http://dx.doi.org/10.5808/GI.2016.14.1.34
work_keys_str_mv AT kimjonghwan nonnegligibleoccurrenceoferrorsingenderdescriptioninpublicdatasets
AT parkjongluyl nonnegligibleoccurrenceoferrorsingenderdescriptioninpublicdatasets
AT kimseonyoung nonnegligibleoccurrenceoferrorsingenderdescriptioninpublicdatasets