Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data

MOTIVATION: Gene co-expression analysis is an attractive tool for leveraging enormous amounts of public RNA-seq datasets for the prediction of gene functions and regulatory mechanisms. However, the optimal data processing steps for the accurate prediction of gene co-expression from such large datase...

Bibliographic Details
Main author: Vandenbon, Alexis
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8797241/
https://www.ncbi.nlm.nih.gov/pubmed/35089979
http://dx.doi.org/10.1371/journal.pone.0263344
_version_ 1784641503859048448
author Vandenbon, Alexis
collection PubMed
description MOTIVATION: Gene co-expression analysis is an attractive tool for leveraging the enormous amount of public RNA-seq data to predict gene functions and regulatory mechanisms. However, the optimal data processing steps for accurately predicting gene co-expression from such large datasets remain unclear. In particular, the importance of batch effect correction is understudied. RESULTS: We processed RNA-seq data of 68 human and 76 mouse cell types and tissues using 50 different workflows into 7,200 genome-wide gene co-expression networks. We then conducted a systematic analysis of the factors that result in high-quality co-expression predictions, focusing on normalization, batch effect correction, and the measure of correlation. We confirmed the key importance of high sample counts for high-quality predictions. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression estimates, equivalent to a >80% and >40% increase in samples, respectively. In larger datasets, batch effect removal was equivalent to more than doubling the sample size. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets. CONCLUSION: A key point for accurate prediction of gene co-expression is the collection of many samples. However, paying attention to data normalization, batch effects, and the measure of correlation can significantly improve the quality of co-expression estimates.
format Online
Article
Text
id pubmed-8797241
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8797241 2022-01-29 Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data Vandenbon, Alexis PLoS One Research Article
Public Library of Science 2022-01-28 /pmc/articles/PMC8797241/ /pubmed/35089979 http://dx.doi.org/10.1371/journal.pone.0263344 Text en © 2022 Alexis Vandenbon https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Vandenbon, Alexis
Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data
title Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data
topic Research Article