
Robust principal component analysis for accurate outlier sample detection in RNA-Seq data

BACKGROUND: High-throughput RNA sequencing is a powerful approach to studying gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from the other samples of its treatment group owing to technical variation or true biological differences. The...


Bibliographic Details
Main authors: Chen, Xiaoying, Zhang, Bo, Wang, Ting, Bonni, Azad, Zhao, Guoyan
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7324992/
https://www.ncbi.nlm.nih.gov/pubmed/32600248
http://dx.doi.org/10.1186/s12859-020-03608-0
author Chen, Xiaoying
Zhang, Bo
Wang, Ting
Bonni, Azad
Zhao, Guoyan
collection PubMed
description BACKGROUND: High-throughput RNA sequencing is a powerful approach to studying gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from the other samples of its treatment group owing to technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to detect such samples accurately, and the issue is not well studied in the current literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging the data points that deviate from that fit. Robust statistics have been widely used for outlier detection in multivariate data analysis in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis. RESULTS: We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all tests using positive control outliers with varying degrees of divergence. We applied the rPCA methods and classical principal component analysis (cPCA) to an RNA-seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples, whereas cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal, as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed best in terms of detecting biologically relevant differentially expressed genes. CONCLUSIONS: rPCA as implemented in the PcaGrid function is an accurate and objective method for detecting outlier samples. It is well suited to high-dimensional data with small sample sizes, such as RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
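The idea stated in the abstract (fit the majority of the data first, then flag the points that deviate from that fit) can be illustrated with a much simpler one-dimensional sketch. This is not the PcaHubert or PcaGrid procedure used in the article; it swaps in a median/MAD robust z-score applied to a single hypothetical per-sample score, and the data values, the cutoff of 3.0, and all function names are assumptions made only for illustration. It also shows how sensitivity and specificity could be computed against known positive-control outliers.

```python
from statistics import mean, median, stdev


def classical_z(values):
    """Z-scores from the mean and standard deviation (not robust)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def robust_z(values):
    """Z-scores from the median and the scaled median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [(v - med) / scale for v in values]


def flag(scores, cutoff=3.0):
    """Indices of samples whose absolute score exceeds the cutoff."""
    return {i for i, z in enumerate(scores) if abs(z) > cutoff}


def sensitivity_specificity(flagged, true_outliers, n_samples):
    """Sensitivity and specificity of a flagged set against known outliers."""
    tp = len(flagged & true_outliers)
    fn = len(true_outliers - flagged)
    fp = len(flagged - true_outliers)
    tn = n_samples - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical per-sample scores: six typical samples and two
# positive-control outliers at indices 6 and 7.
scores = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 27.0]
known_outliers = {6, 7}

classical_flags = flag(classical_z(scores))
robust_flags = flag(robust_z(scores))

print("classical:", sorted(classical_flags),
      sensitivity_specificity(classical_flags, known_outliers, len(scores)))
print("robust:   ", sorted(robust_flags),
      sensitivity_specificity(robust_flags, known_outliers, len(scores)))
```

In this toy data the two extreme samples inflate the mean and standard deviation enough that the classical scores stay under the cutoff, while the median and MAD are driven by the six typical samples and so expose both outliers. The same masking effect is one plausible reason a classical fit can miss outliers that a robust fit detects.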
format Online
Article
Text
id pubmed-7324992
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7324992 2020-06-30 Robust principal component analysis for accurate outlier sample detection in RNA-Seq data. Chen, Xiaoying; Zhang, Bo; Wang, Ting; Bonni, Azad; Zhao, Guoyan. BMC Bioinformatics. Research Article. BioMed Central 2020-06-29 /pmc/articles/PMC7324992/ /pubmed/32600248 http://dx.doi.org/10.1186/s12859-020-03608-0 Text en
© The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Robust principal component analysis for accurate outlier sample detection in RNA-Seq data
topic Research Article