A Novel Computational Framework to Predict Disease-Related Copy Number Variations by Integrating Multiple Data Sources

Copy number variation (CNV) may contribute to the development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene,...

Bibliographic Details
Main Authors:	Yuan, Lin; Sun, Tao; Zhao, Jing; Shen, Zhen
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8276077/ https://www.ncbi.nlm.nih.gov/pubmed/34267783 http://dx.doi.org/10.3389/fgene.2021.696956

author	Yuan, Lin; Sun, Tao; Zhao, Jing; Shen, Zhen
author_sort	Yuan, Lin
collection	PubMed
description	Copy number variation (CNV) may contribute to the development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides us with an opportunity to design a new machine learning framework to predict potential disease-related CNVs. In this paper, we developed a novel machine learning approach, namely, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict the CNV-disease path associations by using a data set containing CNV, disease state labels, and gene data. CNVs, genes, and diseases are connected through edges and then constitute a biological association network. To construct a biological network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Then, we used logistic regression with L1 penalty (LLR) function to detect genes related to disease. We added stability selection strategy, which can effectively reduce false positives, when using self-adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find top D path associations and important CNVs. The experimental results on both simulation and prostate cancer data show that IHI-BMLLR is significantly better than two state-of-the-art CNV detection methods (i.e., CCRET and DPtest) under false-positive control. Furthermore, we applied IHI-BMLLR to prostate cancer data and found significant path associations. Three new cancer-related genes were discovered in the paths, and these genes need to be verified by biological research in the future.
format	Online Article Text
id	pubmed-8276077
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8276077 2021-07-14 Front Genet Frontiers Media S.A. 2021-06-29 /pmc/articles/PMC8276077/ /pubmed/34267783 http://dx.doi.org/10.3389/fgene.2021.696956 Text en Copyright © 2021 Yuan, Sun, Zhao and Shen. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	A Novel Computational Framework to Predict Disease-Related Copy Number Variations by Integrating Multiple Data Sources
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8276077/ https://www.ncbi.nlm.nih.gov/pubmed/34267783 http://dx.doi.org/10.3389/fgene.2021.696956
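
The description field in this record names a biweight mid-correlation (BM) step for scoring CNV-gene pairs. As a rough illustration only, the sketch below computes the standard biweight midcorrelation between two numeric vectors in plain Python. It is not the authors' self-adaptive variant, whose details this record does not give; the function names `biweight_midcorrelation` and `_bicor_terms` and the conventional tuning constant 9 are assumptions made for this example.

```python
from math import sqrt
from statistics import median


def _bicor_terms(values):
    """Return biweight-weighted deviations from the median, scaled to unit norm.

    Hypothetical helper for this sketch; not taken from the paper.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; biweight midcorrelation is undefined")
    terms = []
    for v in values:
        u = (v - med) / (9.0 * mad)  # 9 is the conventional tuning constant
        # Points far from the median (|u| >= 1) get weight 0.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    norm = sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]


def biweight_midcorrelation(x, y):
    """Standard (non-adaptive) biweight midcorrelation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = _bicor_terms(x)
    b = _bicor_terms(y)
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    cnv = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]      # made-up CNV signal
    gene = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]   # made-up expression with one outlier
    print(biweight_midcorrelation(cnv, gene))
```

Because the weights fall to zero for values far from the median, the outlier in the made-up `gene` vector drops out of the coefficient entirely, which is the usual reason for choosing this measure over Pearson correlation.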
