Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality

BACKGROUND: The constant evolution and development of next-generation sequencing techniques lead to high-throughput data composed of datasets that include a large number of biological samples. Although a large number of samples are usually experimentally processed in batches, scientific publications...

Bibliographic Details
Main Authors:	Sprang, Maximilian; Andrade-Navarro, Miguel A.; Fontaine, Jean-Fred
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9284682/ https://www.ncbi.nlm.nih.gov/pubmed/35836114 http://dx.doi.org/10.1186/s12859-022-04775-y

collection	PubMed
description	BACKGROUND: The constant evolution and development of next-generation sequencing techniques lead to high-throughput data composed of datasets that include a large number of biological samples. Although a large number of samples are usually experimentally processed in batches, scientific publications often leave this batch processing undocumented, even though it can greatly impact the quality of the samples and confound further statistical analyses. Because dedicated bioinformatics methods developed to detect unwanted sources of variance in the data can wrongly detect real biological signals, such methods could benefit from using a quality-aware approach. RESULTS: We recently developed statistical guidelines and a machine learning tool to automatically evaluate the quality of a next-generation-sequencing sample. We leveraged this quality assessment to detect and correct batch effects in 12 publicly available RNA-seq datasets with available batch information. We were able to distinguish batches by our quality score and used it to correct for some batch effects in sample clustering. Overall, the correction was evaluated as comparable to or better than the reference method that uses a priori knowledge of the batches (in 10 and 1 datasets of 12, respectively; total = 92%). When coupled to outlier removal, the correction was more often evaluated as better than the reference (comparable or better in 5 and 6 datasets of 12, respectively; total = 92%). CONCLUSIONS: In this work, we show the capabilities of our software to detect batches in public RNA-seq datasets from differences in the predicted quality of their samples. We also use these insights to correct the batch effect and observe the relation between sample quality and batch effect. These observations reinforce our expectation that while batch effects do correlate with differences in quality, batch effects also arise from other artifacts and are more suitably corrected statistically in well-designed experiments.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04775-y.
id	pubmed-9284682
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	BMC Bioinformatics
publication_date	2022-07-14
license	© The Author(s) 2022. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.