
LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer

Although modern methods of whole genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics due to their high cost and complexity and due to the large amount of sample DNA required for the analysis. Therefore, it is crucial to be able to identif...


Bibliographic Details
Main authors: Babalyan, K., Sultanov, R., Generozov, E., Sharova, E., Kostryukova, E., Larin, A., Kanygina, A., Govorun, V., Arapidi, G.
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6214495/
https://www.ncbi.nlm.nih.gov/pubmed/30388122
http://dx.doi.org/10.1371/journal.pone.0204371
_version_ 1783367979147198464
author Babalyan, K.
Sultanov, R.
Generozov, E.
Sharova, E.
Kostryukova, E.
Larin, A.
Kanygina, A.
Govorun, V.
Arapidi, G.
author_facet Babalyan, K.
Sultanov, R.
Generozov, E.
Sharova, E.
Kostryukova, E.
Larin, A.
Kanygina, A.
Govorun, V.
Arapidi, G.
author_sort Babalyan, K.
collection PubMed
description Although modern methods of whole genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics due to their high cost and complexity and due to the large amount of sample DNA required for the analysis. Therefore, it is crucial to be able to identify a relatively small number of methylation sites that provide high precision and sensitivity for the diagnosis of pathological states. We propose an algorithm for constructing limited subsamples from high-dimensional data to form diagnostic panels. We have developed a tool that utilizes different methods of selection to find an optimal, minimum necessary combination of factors using cross-entropy loss metrics (LogLoss) to identify a subset of methylation sites. We show that the algorithm can work effectively with different genome methylation patterns using ensemble-based machine learning methods. Algorithm efficiency, precision and robustness were evaluated using five genome-wide DNA methylation datasets (totaling 626 samples), and each dataset was classified into tumor and non-tumor samples. The algorithm produced an AUC of 0.97 (95% CI: 0.94–0.99, 9 sites) for prostate adenocarcinoma and an AUC of 1.0 (from 2 to 6 sites) for urothelial bladder carcinoma, two types of kidney carcinoma and colorectal carcinoma. For prostate adenocarcinoma, we showed that the identified differential variability methylation patterns distinguish a cluster of samples with a higher recurrence rate (hazard ratio for recurrence = 0.48, 95% CI: 0.05–0.92; log-rank test, p-value < 0.03). We also identified several clusters of correlated interchangeable methylation sites that can be used for the elaboration of biological interpretation of the resulting models and for further selection of the sites most suitable for designing diagnostic panels.
LogLoss-BERAF is implemented as a standalone Python tool, and the open-source code is freely available from https://github.com/bioinformatics-IBCH/logloss-beraf along with the models described in this article.
format Online
Article
Text
id pubmed-6214495
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-62144952018-11-19 LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer Babalyan, K. Sultanov, R. Generozov, E. Sharova, E. Kostryukova, E. Larin, A. Kanygina, A. Govorun, V. Arapidi, G. PLoS One Research Article Although modern methods of whole genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics due to their high cost and complexity and due to the large amount of sample DNA required for the analysis. Therefore, it is crucial to be able to identify a relatively small number of methylation sites that provide high precision and sensitivity for the diagnosis of pathological states. We propose an algorithm for constructing limited subsamples from high-dimensional data to form diagnostic panels. We have developed a tool that utilizes different methods of selection to find an optimal, minimum necessary combination of factors using cross-entropy loss metrics (LogLoss) to identify a subset of methylation sites. We show that the algorithm can work effectively with different genome methylation patterns using ensemble-based machine learning methods. Algorithm efficiency, precision and robustness were evaluated using five genome-wide DNA methylation datasets (totaling 626 samples), and each dataset was classified into tumor and non-tumor samples. The algorithm produced an AUC of 0.97 (95% CI: 0.94–0.99, 9 sites) for prostate adenocarcinoma and an AUC of 1.0 (from 2 to 6 sites) for urothelial bladder carcinoma, two types of kidney carcinoma and colorectal carcinoma. For prostate adenocarcinoma we showed that identified differential variability methylation patterns distinguish cluster of samples with higher recurrence rate (hazard ratio for recurrence = 0.48, 95% CI: 0.05–0.92; log-rank test, p-value < 0.03). 
We also identified several clusters of correlated interchangeable methylation sites that can be used for the elaboration of biological interpretation of the resulting models and for further selection of the sites most suitable for designing diagnostic panels. LogLoss-BERAF is implemented as a standalone python code and open-source code is freely available from https://github.com/bioinformatics-IBCH/logloss-beraf along with the models described in this article. Public Library of Science 2018-11-02 /pmc/articles/PMC6214495/ /pubmed/30388122 http://dx.doi.org/10.1371/journal.pone.0204371 Text en © 2018 Babalyan et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Babalyan, K.
Sultanov, R.
Generozov, E.
Sharova, E.
Kostryukova, E.
Larin, A.
Kanygina, A.
Govorun, V.
Arapidi, G.
LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
title LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
title_full LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
title_fullStr LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
title_full_unstemmed LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
title_short LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
title_sort logloss-beraf: an ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6214495/
https://www.ncbi.nlm.nih.gov/pubmed/30388122
http://dx.doi.org/10.1371/journal.pone.0204371