
Cluster analysis of replicated alternative polyadenylation data using canonical correlation analysis

BACKGROUND: Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to the transcriptome complexity and dynamics of gene regulation. The current tsunami of whole genome poly(A) site data from various conditions generated by 3′ end sequencing provides a valuable data s...

Bibliographic Details
Main authors: Ye, Wenbin; Long, Yuqi; Ji, Guoli; Su, Yaru; Ye, Pengchao; Fu, Hongjuan; Wu, Xiaohui
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6343338/
https://www.ncbi.nlm.nih.gov/pubmed/30669970
http://dx.doi.org/10.1186/s12864-019-5433-7
author Ye, Wenbin
Long, Yuqi
Ji, Guoli
Su, Yaru
Ye, Pengchao
Fu, Hongjuan
Wu, Xiaohui
collection PubMed
description BACKGROUND: Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to the transcriptome complexity and dynamics of gene regulation. The current tsunami of whole genome poly(A) site data from various conditions generated by 3′ end sequencing provides a valuable data source for the study of APA-related gene expression. Cluster analysis is a powerful technique for investigating the association structure among genes; however, conventional gene clustering methods are not suitable for APA-related data as they fail to consider the information of poly(A) sites (e.g., location, abundance, number, etc.) within each gene or measure the association among poly(A) sites between two genes. RESULTS: Here we proposed a computational framework, named PASCCA, for clustering genes from replicated or unreplicated poly(A) site data using canonical correlation analysis (CCA). PASCCA incorporates multiple layers of gene expression data from both the poly(A) site level and gene level and takes into account the number of replicates and the variability within each experimental group. Moreover, PASCCA characterizes poly(A) sites in various ways including the abundance and relative usage, which can exploit the advantages of 3′ end deep sequencing in quantifying APA sites. Using both real and synthetic poly(A) site data sets, the cluster analysis demonstrates that PASCCA outperforms other widely used distance measures under five performance metrics including connectivity, the Dunn index, average distance, average distance between means, and the biological homogeneity index. We also used PASCCA to infer APA-specific gene modules from recently published poly(A) site data of rice and discovered some distinct functional gene modules. We have made PASCCA an easy-to-use R package for APA-related gene expression analyses, including the characterization of poly(A) sites, quantification of association between genes, and clustering of genes.
CONCLUSIONS: By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could be a general tool for clustering and analyzing APA-specific gene expression data. PASCCA could be used to elucidate the dynamic interplay of genes and their APA sites among various biological conditions from emerging 3′ end sequencing data to address the complex biological phenomenon. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5433-7) contains supplementary material, which is available to authorized users.
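The abstract describes scoring the association between two genes from their poly(A) site profiles with canonical correlation analysis and then clustering on that score. The following is a minimal Python/NumPy sketch of that general idea only; it is not the PASCCA R package and does not model replicates or within-group variability. The function names (`relative_usage`, `first_canonical_correlation`, `cca_distance_matrix`) and the choice of `1 - rho` as the distance are assumptions made for illustration.

```python
import numpy as np


def relative_usage(counts):
    """Turn a (samples x sites) abundance matrix into per-sample usage fractions."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Samples with no reads for the gene get usage 0 instead of NaN.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def _orthonormal_basis(matrix, tol=1e-10):
    """Orthonormal basis for the column space of the column-centered matrix."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    if s.size == 0 or s[0] <= tol:
        return u[:, :0]  # no variation across samples
    return u[:, s > tol * s[0]]


def first_canonical_correlation(x, y):
    """Largest canonical correlation between two (samples x sites) matrices."""
    qx = _orthonormal_basis(np.asarray(x, dtype=float))
    qy = _orthonormal_basis(np.asarray(y, dtype=float))
    if qx.shape[1] == 0 or qy.shape[1] == 0:
        return 0.0
    # Canonical correlations are the singular values of qx^T qy.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(min(1.0, rho[0]))


def cca_distance_matrix(genes):
    """Pairwise distance (1 - first canonical correlation) between genes.

    `genes` maps a gene name to its (samples x sites) matrix; all genes
    must share the same samples in the same row order.
    """
    names = sorted(genes)
    dist = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            rho = first_canonical_correlation(genes[a], genes[names[j]])
            dist[i, j] = dist[j, i] = 1.0 - rho
    return names, dist
```

The resulting matrix can be passed to any clustering routine that accepts precomputed distances. With few samples relative to the number of poly(A) sites per gene, the first canonical correlation saturates at 1, so a sketch like this is only informative when the sample count comfortably exceeds the site counts being compared.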
format Online
Article
Text
id pubmed-6343338
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6343338 2019-01-24 Cluster analysis of replicated alternative polyadenylation data using canonical correlation analysis Ye, Wenbin Long, Yuqi Ji, Guoli Su, Yaru Ye, Pengchao Fu, Hongjuan Wu, Xiaohui BMC Genomics Methodology Article BioMed Central 2019-01-22 /pmc/articles/PMC6343338/ /pubmed/30669970 http://dx.doi.org/10.1186/s12864-019-5433-7 Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Cluster analysis of replicated alternative polyadenylation data using canonical correlation analysis
topic Methodology Article