Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data
Main author: | Vandenbon, Alexis |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2022 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8797241/ https://www.ncbi.nlm.nih.gov/pubmed/35089979 http://dx.doi.org/10.1371/journal.pone.0263344 |
_version_ | 1784641503859048448 |
---|---|
author | Vandenbon, Alexis |
collection | PubMed |
description | MOTIVATION: Gene co-expression analysis is an attractive tool for leveraging the enormous number of public RNA-seq datasets to predict gene functions and regulatory mechanisms. However, the optimal data processing steps for accurately predicting gene co-expression from such large datasets remain unclear. In particular, the importance of batch effect correction is understudied. RESULTS: We processed RNA-seq data of 68 human and 76 mouse cell types and tissues using 50 different workflows into 7,200 genome-wide gene co-expression networks. We then conducted a systematic analysis of the factors that result in high-quality co-expression predictions, focusing on normalization, batch effect correction, and the measure of correlation. We confirmed the key importance of high sample counts for high-quality predictions. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression estimates, equivalent to a >80% and >40% increase in samples, respectively. In larger datasets, batch effect removal was equivalent to more than doubling the sample size. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets. CONCLUSION: A key point for accurate prediction of gene co-expression is the collection of many samples. However, paying attention to data normalization, batch effects, and the measure of correlation can significantly improve the quality of co-expression estimates. |
format | Online Article Text |
id | pubmed-8797241 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8797241 2022-01-29 Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data Vandenbon, Alexis PLoS One Research Article MOTIVATION: Gene co-expression analysis is an attractive tool for leveraging the enormous number of public RNA-seq datasets to predict gene functions and regulatory mechanisms. However, the optimal data processing steps for accurately predicting gene co-expression from such large datasets remain unclear. In particular, the importance of batch effect correction is understudied. RESULTS: We processed RNA-seq data of 68 human and 76 mouse cell types and tissues using 50 different workflows into 7,200 genome-wide gene co-expression networks. We then conducted a systematic analysis of the factors that result in high-quality co-expression predictions, focusing on normalization, batch effect correction, and the measure of correlation. We confirmed the key importance of high sample counts for high-quality predictions. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression estimates, equivalent to a >80% and >40% increase in samples, respectively. In larger datasets, batch effect removal was equivalent to more than doubling the sample size. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets. CONCLUSION: A key point for accurate prediction of gene co-expression is the collection of many samples. However, paying attention to data normalization, batch effects, and the measure of correlation can significantly improve the quality of co-expression estimates. Public Library of Science 2022-01-28 /pmc/articles/PMC8797241/ /pubmed/35089979 http://dx.doi.org/10.1371/journal.pone.0263344 Text en © 2022 Alexis Vandenbon https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Vandenbon, Alexis Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data |
title | Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8797241/ https://www.ncbi.nlm.nih.gov/pubmed/35089979 http://dx.doi.org/10.1371/journal.pone.0263344 |
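
The abstract in this record compares Pearson and Spearman correlation as measures of gene co-expression. As a rough illustration only, the following Python sketch computes both measures for pairs of genes in a tiny, made-up expression table. It is not the workflow evaluated in the article: the gene names, the count values, and the `log2(x + 1)` transform are assumptions chosen for the example, and no batch effect correction is performed.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def log_transform(counts):
    """Assumed normalization for this sketch: log2(count + 1)."""
    return [math.log2(c + 1) for c in counts]


# Hypothetical expression values (one list per gene, one entry per sample).
expression = {
    "gene_a": [1, 2, 4, 8, 16],
    "gene_b": [3, 5, 9, 17, 33],
    "gene_c": [16, 8, 4, 2, 1],
    "gene_d": [1, 2, 3, 4, 100],
}

normalized = {gene: log_transform(v) for gene, v in expression.items()}

# A toy "co-expression network": one pair of scores per gene pair.
network = {
    (g1, g2): (pearson(normalized[g1], normalized[g2]),
               spearman(normalized[g1], normalized[g2]))
    for g1, g2 in combinations(sorted(normalized), 2)
}

for (g1, g2), (r, rho) in network.items():
    print(f"{g1} vs {g2}: Pearson={r:.3f} Spearman={rho:.3f}")
```

Because Spearman correlation depends only on ranks, it is unchanged by monotone transforms such as the logarithm, whereas Pearson correlation is sensitive to the chosen normalization and to outlying samples such as the last value of the hypothetical `gene_d`.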