Improving the value of public RNA-seq expression data by phenotype prediction

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We de...

Bibliographic Details
Main Authors:	Ellis, Shannon E, Collado-Torres, Leonardo, Jaffe, Andrew, Leek, Jeffrey T
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Methods Online
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961118/ https://www.ncbi.nlm.nih.gov/pubmed/29514223 http://dx.doi.org/10.1093/nar/gky102

author	Ellis, Shannon E Collado-Torres, Leonardo Jaffe, Andrew Leek, Jeffrey T
collection	PubMed
description	Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.
format	Online Article Text
id	pubmed-5961118
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-5961118 2018-06-06 Improving the value of public RNA-seq expression data by phenotype prediction Ellis, Shannon E Collado-Torres, Leonardo Jaffe, Andrew Leek, Jeffrey T Nucleic Acids Res Methods Online Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible. Oxford University Press 2018-05-18 2018-03-05 /pmc/articles/PMC5961118/ /pubmed/29514223 http://dx.doi.org/10.1093/nar/gky102 Text en © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Improving the value of public RNA-seq expression data by phenotype prediction
topic	Methods Online
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961118/ https://www.ncbi.nlm.nih.gov/pubmed/29514223 http://dx.doi.org/10.1093/nar/gky102
