Statistical integration of two omics datasets using GO2PLS
BACKGROUND: Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various me...
Main Authors: | Gu, Zhujie; el Bouhaddani, Said; Pei, Jiayi; Houwing-Duistermaat, Jeanine; Uh, Hae-Won |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7977326/ https://www.ncbi.nlm.nih.gov/pubmed/33736604 http://dx.doi.org/10.1186/s12859-021-03958-3 |
_version_ | 1783667110441910272 |
---|---|
author | Gu, Zhujie el Bouhaddani, Said Pei, Jiayi Houwing-Duistermaat, Jeanine Uh, Hae-Won |
author_facet | Gu, Zhujie el Bouhaddani, Said Pei, Jiayi Houwing-Duistermaat, Jeanine Uh, Hae-Won |
author_sort | Gu, Zhujie |
collection | PubMed |
description | BACKGROUND: Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace. RESULTS: The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease. CONCLUSIONS: GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-03958-3. |
format | Online Article Text |
id | pubmed-7977326 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7977326 2021-03-22 Statistical integration of two omics datasets using GO2PLS Gu, Zhujie el Bouhaddani, Said Pei, Jiayi Houwing-Duistermaat, Jeanine Uh, Hae-Won BMC Bioinformatics Methodology Article BioMed Central 2021-03-18 /pmc/articles/PMC7977326/ /pubmed/33736604 http://dx.doi.org/10.1186/s12859-021-03958-3 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Article Gu, Zhujie el Bouhaddani, Said Pei, Jiayi Houwing-Duistermaat, Jeanine Uh, Hae-Won Statistical integration of two omics datasets using GO2PLS |
title | Statistical integration of two omics datasets using GO2PLS |
title_full | Statistical integration of two omics datasets using GO2PLS |
title_fullStr | Statistical integration of two omics datasets using GO2PLS |
title_full_unstemmed | Statistical integration of two omics datasets using GO2PLS |
title_short | Statistical integration of two omics datasets using GO2PLS |
title_sort | statistical integration of two omics datasets using go2pls |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7977326/ https://www.ncbi.nlm.nih.gov/pubmed/33736604 http://dx.doi.org/10.1186/s12859-021-03958-3 |
work_keys_str_mv | AT guzhujie statisticalintegrationoftwoomicsdatasetsusinggo2pls AT elbouhaddanisaid statisticalintegrationoftwoomicsdatasetsusinggo2pls AT peijiayi statisticalintegrationoftwoomicsdatasetsusinggo2pls AT houwingduistermaatjeanine statisticalintegrationoftwoomicsdatasetsusinggo2pls AT uhhaewon statisticalintegrationoftwoomicsdatasetsusinggo2pls |
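
The abstract describes group selection in the joint subspace without giving the algorithm. As a rough illustration only, the idea of group-wise selection on a joint loading vector might be sketched as follows. This is not the authors' implementation: the function names, the penalty parameter `lam_x`, the use of plain group soft-thresholding, and the omission of the orthogonal (dataset-specific) parts of O2PLS are all assumptions made for this example.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Shrink each group's sub-vector by lam in Euclidean norm; drop groups with norm <= lam."""
    out = np.zeros_like(w, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(w[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * w[idx]
    return out

def sparse_joint_component(X, Y, groups_x, lam_x, n_iter=100):
    """Hypothetical helper: one joint loading pair (w for X, c for Y), group-sparse in w only."""
    # Initialise from the leading singular vectors of the cross-product X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    w, c = U[:, 0], Vt[0]
    for _ in range(n_iter):
        # Update the X loading, then zero out whole groups with weak covariance.
        w_new = group_soft_threshold(X.T @ (Y @ c), groups_x, lam_x)
        if not w_new.any():
            raise ValueError("lam_x is too large: every group was removed")
        w = w_new / np.linalg.norm(w_new)
        # Update the Y loading without any sparsity penalty.
        c = Y.T @ (X @ w)
        c /= np.linalg.norm(c)
    return w, c

# Toy data (simulated, not from the paper): only the first group of X covaries with Y.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.hstack([np.outer(t, np.ones(3)), np.zeros((200, 3))]) + 0.5 * rng.normal(size=(200, 6))
Y = np.outer(t, np.ones(4)) + 0.5 * rng.normal(size=(200, 4))
groups_x = np.array([0, 0, 0, 1, 1, 1])
w, c = sparse_joint_component(X, Y, groups_x, lam_x=100.0)
```

In this toy setting the second group of `X` carries only noise, so its loadings are set exactly to zero while the first group is retained as a whole. In practice the penalty would need to be tuned, for example by cross-validation.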