A fast and high performance multiple data integration algorithm for identifying human disease genes

BACKGROUND: Integrating multiple data sources is indispensable in improving disease gene identification. This is not only because disease genes associated with similar genetic diseases tend to lie close to each other in various biological networks, but also because gene-dis...

Bibliographic Details
Main authors:	Chen, Bolin, Li, Min, Wang, Jianxin, Shang, Xuequn, Wu, Fang-Xiang
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4582601/ https://www.ncbi.nlm.nih.gov/pubmed/26399620 http://dx.doi.org/10.1186/1755-8794-8-S3-S2

_version_	1782391729449598976
author	Chen, Bolin Li, Min Wang, Jianxin Shang, Xuequn Wu, Fang-Xiang
author_sort	Chen, Bolin
collection	PubMed
description	BACKGROUND: Integrating multiple data sources is indispensable in improving disease gene identification. This is not only because disease genes associated with similar genetic diseases tend to lie close to each other in various biological networks, but also because gene-disease associations are complex. Although various algorithms have been proposed to identify disease genes, their prediction performance and computational time still leave room for improvement. RESULTS: In this study, we propose a fast and high performance multiple data integration algorithm for identifying human disease genes. A posterior probability of each candidate gene being associated with individual diseases is calculated by using a Bayesian analysis method and a binary logistic regression model. Two prior probability estimation strategies and two feature vector construction methods are developed to test the performance of the proposed algorithm. CONCLUSIONS: The proposed algorithm not only generates predictions with high AUC scores, but also runs very fast. When only a single PPI network is employed, the AUC score is 0.769 using F(2) as feature vectors, and the average running time for each leave-one-out experiment is only around 1.5 seconds. When three biological networks are integrated, the AUC score using F(3) as feature vectors increases to 0.830, and the average running time for each leave-one-out experiment is only about 12.54 seconds. It performs better than many existing algorithms.
format	Online Article Text
id	pubmed-4582601
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4582601 2015-09-28 A fast and high performance multiple data integration algorithm for identifying human disease genes Chen, Bolin Li, Min Wang, Jianxin Shang, Xuequn Wu, Fang-Xiang BMC Med Genomics Research
BioMed Central 2015-09-23 /pmc/articles/PMC4582601/ /pubmed/26399620 http://dx.doi.org/10.1186/1755-8794-8-S3-S2 Text en Copyright © 2015 Chen et al.; http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Chen, Bolin Li, Min Wang, Jianxin Shang, Xuequn Wu, Fang-Xiang A fast and high performance multiple data integration algorithm for identifying human disease genes
title	A fast and high performance multiple data integration algorithm for identifying human disease genes
title_sort	fast and high performance multiple data integration algorithm for identifying human disease genes
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4582601/ https://www.ncbi.nlm.nih.gov/pubmed/26399620 http://dx.doi.org/10.1186/1755-8794-8-S3-S2
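
The abstract in this record describes combining a prior probability with a binary logistic regression model to obtain a posterior probability per candidate gene, and evaluating predictions by AUC. The record gives no formulas, so the following is only a minimal sketch of that general idea, not the authors' actual method. It uses the standard prior-shift correction for a logistic model (subtract the training log-odds, add the target prior log-odds) and a pairwise AUC. All function and parameter names (`posterior_with_prior`, `train_prior`, `target_prior`, `auc`) are hypothetical and chosen for illustration.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def posterior_with_prior(logit, train_prior, target_prior):
    """Adjust a logistic-regression log-odds score to a different prior.

    logit        : model output (log-odds) for one candidate gene
    train_prior  : fraction of positives in the training data (assumed known)
    target_prior : prior probability that a candidate is a disease gene
    """
    # Removing the training log-odds leaves a log likelihood ratio;
    # adding the target prior log-odds yields the posterior log-odds.
    return sigmoid(logit - log_odds(train_prior) + log_odds(target_prior))


def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

For example, a gene with an uninformative score (`logit = 0.0`) from a model trained on balanced data simply returns the target prior, which is the behavior one would expect from a Bayesian update with a likelihood ratio of 1.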
