Network-based logistic regression integration method for biomarker identification

Bibliographic Details
Main Authors: Zhang, Ke, Geng, Wei, Zhang, Shuqin
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311907/
https://www.ncbi.nlm.nih.gov/pubmed/30598085
http://dx.doi.org/10.1186/s12918-018-0657-8
Description
Summary: BACKGROUND: Many mathematical and statistical models and algorithms have been proposed for biomarker identification in recent years. However, biomarkers inferred from different datasets are often poorly reproducible because data generated on different platforms or in different laboratories are heterogeneous. This motivates the development of robust biomarker identification methods that integrate multiple datasets.
METHODS: In this paper, we developed an integrative classification method based on logistic regression. Different constant terms are set in the logistic regression model to measure the heterogeneity of the samples. By minimizing the differences between the constant terms within the same dataset, both the homogeneity within each dataset and the heterogeneity across datasets are preserved. The model is formulated as an optimization problem with a network penalty that measures the differences between the constant terms. An L1 penalty, an elastic net penalty and network-related penalties are added to the objective function for biomarker discovery. Algorithms based on the proximal Newton method are proposed to solve the optimization problem.
RESULTS: We first applied the proposed method to simulated datasets, where both the prediction AUC and the biomarker identification accuracy improved. We then applied the method to two breast cancer gene expression datasets. Integrating both datasets gave a higher prediction AUC than directly merging the datasets or using MetaLasso, and an AUC comparable to the best one obtained when identifying biomarkers in an individual dataset. The biomarkers identified with the network-related penalty on the variables were analyzed further, and meaningful subnetworks enriched for breast cancer were identified.
CONCLUSION: A network-based integrative logistic regression model is proposed in the paper. It improves both the prediction accuracy and the biomarker identification accuracy.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12918-018-0657-8) contains supplementary material, which is available to authorized users.
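
As a rough illustration of the model described under METHODS, the Python sketch below fits a logistic regression with one constant term per sample, an L1 penalty on the coefficients, and a network penalty that pulls together the constant terms of samples from the same dataset. It is not the authors' implementation. The exact form of the objective, the function names (`fit_integrative_logreg`, `dataset_laplacian`, `objective`), the graph that links every pair of samples within a dataset, the penalty weights, and the simulated data are all assumptions made for illustration. The solver is plain proximal gradient descent, swapped in for the proximal Newton method named in the abstract to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1 (sets small coefficients to zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def dataset_laplacian(dataset_ids):
    """Laplacian of an assumed graph linking all sample pairs within a dataset."""
    ids = np.asarray(dataset_ids)
    adjacency = (ids[:, None] == ids[None, :]).astype(float)
    np.fill_diagonal(adjacency, 0.0)
    return np.diag(adjacency.sum(axis=1)) - adjacency


def objective(beta, b, X, y, L, lam_l1, lam_net):
    """Mean logistic loss + L1 penalty on beta + network penalty on intercepts b."""
    z = X @ beta + b
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    return loss + lam_l1 * np.abs(beta).sum() + 0.5 * lam_net * (b @ L @ b)


def fit_integrative_logreg(X, y, dataset_ids, lam_l1=0.05, lam_net=1.0, n_iter=500):
    """Proximal gradient descent on the assumed objective (hypothetical helper)."""
    n, p = X.shape
    L = dataset_laplacian(dataset_ids)
    # Step size = 1 / (upper bound on the Lipschitz constant of the smooth part).
    Z = np.hstack([X, np.eye(n)])
    lipschitz = np.linalg.norm(Z, 2) ** 2 / (4.0 * n) + lam_net * np.linalg.eigvalsh(L).max()
    step = 1.0 / lipschitz
    beta, b = np.zeros(p), np.zeros(n)
    for _ in range(n_iter):
        residual = sigmoid(X @ beta + b) - y
        grad_beta = X.T @ residual / n
        grad_b = residual / n + lam_net * (L @ b)
        # Gradient step on both blocks; only beta gets the L1 proximal step.
        beta = soft_threshold(beta - step * grad_beta, step * lam_l1)
        b = b - step * grad_b
    return beta, b


# Simulated example: two "datasets" that differ only by a constant shift.
rng = np.random.default_rng(0)
ids = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(60, 8))
true_beta = np.array([2.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
shift = np.where(ids == 0, -1.0, 1.0)
y = (rng.random(60) < sigmoid(X @ true_beta + shift)).astype(float)

beta_hat, b_hat = fit_integrative_logreg(X, y, ids)
print("selected variables:", np.flatnonzero(beta_hat))
print("mean intercept per dataset:", b_hat[ids == 0].mean(), b_hat[ids == 1].mean())
```

In this sketch the nonzero entries of `beta_hat` play the role of candidate biomarkers, while the per-dataset averages of `b_hat` absorb platform or laboratory shifts. A larger `lam_net` forces the constant terms within a dataset closer together.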