Identification of sample annotation errors in gene expression datasets

Bibliographic Details
Main Authors: Lohr, Miriam; Hellwig, Birte; Edlund, Karolina; Mattsson, Johanna S. M.; Botling, Johan; Schmidt, Marcus; Hengstler, Jan G.; Micke, Patrick; Rahnenführer, Jörg
Format: Online Article Text
Language: English
Published: Springer Berlin Heidelberg 2015
Subjects: Toxicogenomics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673097/
https://www.ncbi.nlm.nih.gov/pubmed/26608184
http://dx.doi.org/10.1007/s00204-015-1632-4
author Lohr, Miriam
Hellwig, Birte
Edlund, Karolina
Mattsson, Johanna S. M.
Botling, Johan
Schmidt, Marcus
Hengstler, Jan G.
Micke, Patrick
Rahnenführer, Jörg
collection PubMed
description The comprehensive transcriptomic analysis of clinically annotated human tissue has found widespread use in oncology, cell biology, immunology, and toxicology. In cancer research, microarray-based gene expression profiling has successfully been applied to subclassify disease entities, predict therapy response, and identify cellular mechanisms. Public accessibility of raw data, together with corresponding information on clinicopathological parameters, offers the opportunity to reuse previously analyzed data and to gain statistical power by combining multiple datasets. However, results and conclusions obviously depend on the reliability of the available information. Here, we propose gene expression-based methods for identifying sample misannotations in public transcriptomic datasets. Sample mix-up can be detected by a classifier that differentiates between samples from male and female patients. Correlation analysis identifies multiple measurements of material from the same sample. The analysis of 45 datasets (including 4913 patients) revealed that erroneous sample annotation, affecting 40 % of the analyzed datasets, may be a more widespread phenomenon than previously thought. Removal of erroneously labelled samples may influence the results of the statistical evaluation in some datasets. Our methods may help to identify individual datasets that contain numerous discrepancies and could be routinely included into the statistical analysis of clinical gene expression data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s00204-015-1632-4) contains supplementary material, which is available to authorized users.
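The two checks named in the description (a male/female classifier to detect sample mix-ups, and correlation analysis to detect repeated measurements of the same sample) can be illustrated with a minimal sketch. This is not the authors' classifier. It assumes a log-scale expression matrix with genes in rows and samples in columns, uses the female-expressed gene XIST and the Y-linked gene RPS4Y1 as illustrative markers, and applies a hypothetical correlation cutoff of 0.99. The function names, the marker choice, and both thresholds are assumptions made for this example only.

```python
import numpy as np


def predict_sex(expr, genes, female_gene="XIST", male_gene="RPS4Y1"):
    """Call each sample 'female' or 'male' from two marker genes.

    expr  : array of shape (n_genes, n_samples), log-scale expression.
    genes : list of gene names, one per row of expr.
    The marker genes and the decision rule (female marker higher than
    male marker) are illustrative assumptions, not the published method.
    """
    female_signal = expr[genes.index(female_gene)]
    male_signal = expr[genes.index(male_gene)]
    return np.where(female_signal > male_signal, "female", "male")


def flag_sex_mismatches(expr, genes, annotated_sex):
    """Return indices of samples whose predicted sex contradicts the annotation."""
    predicted = predict_sex(expr, genes)
    return np.flatnonzero(predicted != np.asarray(annotated_sex))


def find_duplicate_pairs(expr, threshold=0.99):
    """Return sample index pairs whose Pearson correlation is >= threshold.

    The threshold is a hypothetical cutoff; a real analysis would choose
    it from the distribution of pairwise correlations in the dataset.
    """
    corr = np.corrcoef(expr.T)  # sample-by-sample correlation matrix
    rows, cols = np.triu_indices_from(corr, k=1)  # each pair once, no diagonal
    keep = corr[rows, cols] >= threshold
    return [(int(a), int(b)) for a, b in zip(rows[keep], cols[keep])]
```

Samples returned by `flag_sex_mismatches` are candidates for a mix-up, and pairs returned by `find_duplicate_pairs` are candidates for repeated measurements of the same material. Both lists are starting points for manual review of the annotation rather than proof of an error.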
format Online
Article
Text
id pubmed-4673097
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Springer Berlin Heidelberg
record_format MEDLINE/PubMed
spelling pubmed-4673097 2015-12-16 Identification of sample annotation errors in gene expression datasets Lohr, Miriam; Hellwig, Birte; Edlund, Karolina; Mattsson, Johanna S. M.; Botling, Johan; Schmidt, Marcus; Hengstler, Jan G.; Micke, Patrick; Rahnenführer, Jörg Arch Toxicol Toxicogenomics
Springer Berlin Heidelberg 2015-11-25 2015 /pmc/articles/PMC4673097/ /pubmed/26608184 http://dx.doi.org/10.1007/s00204-015-1632-4 Text en © The Author(s) 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title Identification of sample annotation errors in gene expression datasets
topic Toxicogenomics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673097/
https://www.ncbi.nlm.nih.gov/pubmed/26608184
http://dx.doi.org/10.1007/s00204-015-1632-4