Cross-platform normalization of microarray and RNA-seq data for machine learning applications

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data ca...

Bibliographic Details
Main authors: Thompson, Jeffrey A.; Tan, Jie; Greene, Casey S.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2016
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4736986/
https://www.ncbi.nlm.nih.gov/pubmed/26844019
http://dx.doi.org/10.7717/peerj.1621
author Thompson, Jeffrey A.
Tan, Jie
Greene, Casey S.
collection PubMed
description Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
format Online
Article
Text
id pubmed-4736986
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4736986 2016-02-03 Cross-platform normalization of microarray and RNA-seq data for machine learning applications Thompson, Jeffrey A. Tan, Jie Greene, Casey S. PeerJ Bioinformatics Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PeerJ Inc. 2016-01-21 /pmc/articles/PMC4736986/ /pubmed/26844019 http://dx.doi.org/10.7717/peerj.1621 Text en © 2016 Thompson et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Cross-platform normalization of microarray and RNA-seq data for machine learning applications
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4736986/
https://www.ncbi.nlm.nih.gov/pubmed/26844019
http://dx.doi.org/10.7717/peerj.1621