
Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets


Bibliographic Details
Main authors: Oytam, Yalchin, Sobhanmanesh, Fariborz, Duesing, Konsta, Bowden, Joshua C., Osmond-McLeod, Megan, Ross, Jason
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009651/
https://www.ncbi.nlm.nih.gov/pubmed/27585881
http://dx.doi.org/10.1186/s12859-016-1212-5
author Oytam, Yalchin
Sobhanmanesh, Fariborz
Duesing, Konsta
Bowden, Joshua C.
Osmond-McLeod, Megan
Ross, Jason
collection PubMed
description BACKGROUND: Batch effects are a persistent and pervasive form of measurement noise that undermines the scientific utility of high-throughput genomic datasets. At their most benign, they reduce the power of statistical tests, so that real effects go unidentified. At their worst, they constitute confounds and render datasets useless. Attempting to remove batch effects will result in some of the biologically meaningful component of the measurement (i.e. signal) being lost. We present and benchmark a novel technique called Harman. Harman maximises the removal of batch noise under the constraint that the risk of also losing the biologically meaningful component of the measurement is kept to a fraction set by the user. RESULTS: Analyses of three independent, publicly available datasets reveal that Harman simultaneously removes more batch noise and preserves more signal than the current leading technique. Results also show that Harman is able to identify and remove batch effects whatever their size relative to other sources of variation in the dataset. Of particular advantage for meta-analyses and data integration is Harman’s superior consistency in achieving comparable trade-offs between noise suppression and signal preservation across multiple datasets with differing numbers of treatments, replicates and processing batches. CONCLUSION: Harman removes batch noise better and simultaneously preserves biologically meaningful signal better within a single study, and it maintains the user-set trade-off between batch noise rejection and signal preservation across different studies. This makes it an effective alternative method for dealing with batch effects in high-throughput genomic datasets. Harman is flexible in terms of the data types it can process. It is publicly available as an R package (https://bioconductor.org/packages/release/bioc/html/Harman.html), as well as a compiled Matlab package (http://www.bioinformatics.csiro.au/harman/) which does not require a Matlab license to run. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1212-5) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5009651
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5009651 2016-09-09 Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets Oytam, Yalchin; Sobhanmanesh, Fariborz; Duesing, Konsta; Bowden, Joshua C.; Osmond-McLeod, Megan; Ross, Jason BMC Bioinformatics Methodology Article BioMed Central 2016-09-01 /pmc/articles/PMC5009651/ /pubmed/27585881 http://dx.doi.org/10.1186/s12859-016-1212-5 Text en © The Author(s). 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets
topic Methodology Article