Positive-unlabeled learning for disease gene identification

Background: Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive...

Bibliographic Details
Main authors:	Yang, Peng; Li, Xiao-Li; Mei, Jian-Ping; Kwoh, Chee-Keong; Ng, See-Kiong
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2012
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3467748/ https://www.ncbi.nlm.nih.gov/pubmed/22923290 http://dx.doi.org/10.1093/bioinformatics/bts504

_version_	1782245866070867968
author	Yang, Peng; Li, Xiao-Li; Mei, Jian-Ping; Kwoh, Chee-Keong; Ng, See-Kiong
author_facet	Yang, Peng; Li, Xiao-Li; Mei, Jian-Ping; Kwoh, Chee-Keong; Ng, See-Kiong
author_sort	Yang, Peng
collection	PubMed
description	Background: Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (a non-disease gene set does not exist) to build classifiers that identify new disease genes among the unknown genes. However, such classifiers are actually built from a noisy negative set N, as there can be unknown disease genes in N itself. As a result, the classifiers do not perform as well as they could. Result: Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel positive-unlabeled (PU) learning algorithm PUDI (PU learning for disease gene identification) to build a classifier using P and U. We first partition U into four sets, namely, reliable negative set RN, likely positive set LP, likely negative set LN and weak negative set WN. Weighted support vector machines are then used to build a multi-level classifier based on the four training sets and the positive training set P to identify disease genes. Our experimental results demonstrate that our proposed PUDI algorithm outperformed the existing methods significantly. Conclusion: The proposed PUDI algorithm is able to identify disease genes more accurately by treating the unknown data more appropriately as unlabeled set U instead of negative set N. Given that many machine learning problems in biomedical research involve positive and unlabeled data instead of negative data, it is possible that the machine learning methods for these problems can be further improved by adopting PU learning methods, as we have done here for disease gene identification.
Availability and implementation: The executable program and data are available at http://www1.i2r.a-star.edu.sg/~xlli/PUDI/PUDI.html. Contact: xlli@i2r.a-star.edu.sg or yang0293@e.ntu.edu.sg. Supplementary information: Supplementary Data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-3467748
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3467748 2012-12-12 Positive-unlabeled learning for disease gene identification Yang, Peng; Li, Xiao-Li; Mei, Jian-Ping; Kwoh, Chee-Keong; Ng, See-Kiong Bioinformatics Original Papers Oxford University Press 2012-10-15 2012-08-24 /pmc/articles/PMC3467748/ /pubmed/22923290 http://dx.doi.org/10.1093/bioinformatics/bts504 Text en © The Author 2012. Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Positive-unlabeled learning for disease gene identification
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3467748/ https://www.ncbi.nlm.nih.gov/pubmed/22923290 http://dx.doi.org/10.1093/bioinformatics/bts504
work_keys_str_mv	AT yangpeng positiveunlabeledlearningfordiseasegeneidentification AT lixiaoli positiveunlabeledlearningfordiseasegeneidentification AT meijianping positiveunlabeledlearningfordiseasegeneidentification AT kwohcheekeong positiveunlabeledlearningfordiseasegeneidentification AT ngseekiong positiveunlabeledlearningfordiseasegeneidentification
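
The description field in this record outlines a two-stage idea: partition the unlabeled set U into RN, LP, LN and WN, then train weighted classifiers on those sets together with P. The following is only a minimal sketch of that general idea, not the PUDI procedure itself. The centroid-distance scoring, the quartile cut points, the ordering of the four sets by distance, and the label and weight values are all assumptions made for illustration; the record does not state how the authors partition U or which weights they use.

```python
from math import dist
from statistics import mean

# Hypothetical labels and confidence weights per set (not taken from the paper).
LABELS = {"P": +1, "LP": +1, "WN": -1, "LN": -1, "RN": -1}
WEIGHTS = {"P": 1.0, "LP": 0.5, "WN": 0.25, "LN": 0.5, "RN": 1.0}


def centroid(points):
    """Component-wise mean of equal-length feature vectors."""
    return [mean(column) for column in zip(*points)]


def partition_unlabeled(P, U, cuts=(0.25, 0.5, 0.75)):
    """Split U into four sets by rank of distance to the centroid of P.

    Assumption: the closest quarter is treated as likely positive (LP),
    the farthest quarter as reliable negative (RN), with WN and LN between.
    """
    center = centroid(P)
    ranked = sorted(U, key=lambda x: dist(x, center))
    a, b, c = (int(q * len(ranked)) for q in cuts)
    return {
        "LP": ranked[:a],
        "WN": ranked[a:b],
        "LN": ranked[b:c],
        "RN": ranked[c:],
    }


def weighted_training_set(P, parts):
    """Return (features, label, weight) rows for a weighted classifier."""
    rows = [(x, LABELS["P"], WEIGHTS["P"]) for x in P]
    for name, members in parts.items():
        rows += [(x, LABELS[name], WEIGHTS[name]) for x in members]
    return rows


# Toy two-dimensional feature vectors standing in for gene features.
P = [[1.0, 1.0], [1.2, 0.8]]
U = [[8.0, 8.0], [1.0, 0.9], [4.0, 4.0], [2.0, 2.0],
     [9.0, 9.0], [1.3, 1.0], [5.0, 3.0], [2.5, 1.5]]

parts = partition_unlabeled(P, U)
rows = weighted_training_set(P, parts)
for name, members in parts.items():
    print(name, members)
```

The resulting rows could then be passed to any learner that accepts per-example weights, for instance through the sample_weight argument of SVC.fit in scikit-learn, so that the less certain sets influence the decision boundary less than P and RN do.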
