
A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis

BACKGROUND: Data errors, including sample swapping and mis-labeling, are inevitable in the process of large-scale omics data generation. Data errors need to be identified and corrected before integrative data analyses where different types of data are merged on the basis of the annotated labels. Dat...

Bibliographic Details
Main authors:	Lee, Eunjee, Yoo, Seungyeul, Wang, Wenhui, Tu, Zhidong, Zhu, Jun
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2019
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6615984/ https://www.ncbi.nlm.nih.gov/pubmed/31289834 http://dx.doi.org/10.1093/gigascience/giz080

author	Lee, Eunjee Yoo, Seungyeul Wang, Wenhui Tu, Zhidong Zhu, Jun
author_sort	Lee, Eunjee
collection	PubMed
description	BACKGROUND: Data errors, including sample swapping and mis-labeling, are inevitable in the process of large-scale omics data generation. Data errors need to be identified and corrected before integrative data analyses where different types of data are merged on the basis of the annotated labels. Data with labeling errors dampen true biological signals. More importantly, data analysis with sample errors could lead to wrong scientific conclusions. We developed a robust probabilistic multi-omics data matching procedure, proMODMatcher, to curate data and identify and correct data annotation and errors in large databases. RESULTS: Application to simulated datasets suggests that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Application of our proMODMatcher to multi-omics datasets in The Cancer Genome Atlas and International Cancer Genome Consortium identified sample errors in multiple cancer datasets. Our procedure was not only able to identify sample-labeling errors but also to unambiguously identify the source of the errors. Our results demonstrate that these errors should be identified and corrected before integrative analysis. CONCLUSIONS: Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be corrected before integrative analysis.
format	Online Article Text
id	pubmed-6615984
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6615984 2019-07-15 A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis Lee, Eunjee Yoo, Seungyeul Wang, Wenhui Tu, Zhidong Zhu, Jun Gigascience Research Oxford University Press 2019-07-09 /pmc/articles/PMC6615984/ /pubmed/31289834 http://dx.doi.org/10.1093/gigascience/giz080 Text en © The Author(s) 2019. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6615984/ https://www.ncbi.nlm.nih.gov/pubmed/31289834 http://dx.doi.org/10.1093/gigascience/giz080
