
Robust identification of molecular phenotypes using semi-supervised learning

BACKGROUND: Modern molecular profiling techniques are yielding vast amounts of data from patient samples that could be utilized with machine learning methods to provide important biological insights and improvements in patient outcomes. Unsupervised methods have been successfully used to identify mo...


Bibliographic Details
Main Authors:	Roder, Heinrich; Oliveira, Carlos; Net, Lelia; Linstid, Benjamin; Tsypin, Maxim; Roder, Joanna
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6540576/ https://www.ncbi.nlm.nih.gov/pubmed/31138112 http://dx.doi.org/10.1186/s12859-019-2885-3

_version_	1783422651593654272
author	Roder, Heinrich; Oliveira, Carlos; Net, Lelia; Linstid, Benjamin; Tsypin, Maxim; Roder, Joanna
author_sort	Roder, Heinrich
collection	PubMed
description	BACKGROUND: Modern molecular profiling techniques are yielding vast amounts of data from patient samples that could be utilized with machine learning methods to provide important biological insights and improvements in patient outcomes. Unsupervised methods have been successfully used to identify molecularly-defined disease subtypes. However, these approaches do not take advantage of potential additional clinical outcome information. Supervised methods can be implemented when training classes are apparent (e.g., responders or non-responders to treatment). However, training classes can be difficult to define when assessing relative benefit of one therapy over another using gold standard clinical endpoints, since it is often not clear how much benefit each individual patient receives. RESULTS: We introduce an iterative approach to binary classification tasks based on the simultaneous refinement of training class labels and classifiers towards self-consistency. As training labels are refined during the process, the method is well suited to cases where training class definitions are not obvious or noisy. Clinical data, including time-to-event endpoints, can be incorporated into the approach to enable the iterative refinement to identify molecular phenotypes associated with a particular clinical variable. Using synthetic data, we show how this approach can be used to increase the accuracy of identification of outcome-related phenotypes and their associated molecular attributes. Further, we demonstrate that the advantages of the method persist in real world genomic datasets, allowing the reliable identification of molecular phenotypes and estimation of their association with outcome that generalizes to validation datasets. We show that at convergence of the iterative refinement, there is a consistent incorporation of the molecular data into the classifier yielding the molecular phenotype and that this allows a robust identification of associated attributes and the underlying biological processes. CONCLUSIONS: The consistent incorporation of the structure of the molecular data into the classifier helps to minimize overfitting and facilitates not only good generalization of classification and molecular phenotypes, but also reliable identification of biologically relevant features and elucidation of underlying biological processes.
format	Online Article Text
id	pubmed-6540576
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6540576 2019-06-03 Robust identification of molecular phenotypes using semi-supervised learning Roder, Heinrich; Oliveira, Carlos; Net, Lelia; Linstid, Benjamin; Tsypin, Maxim; Roder, Joanna BMC Bioinformatics Methodology Article BioMed Central 2019-05-28 /pmc/articles/PMC6540576/ /pubmed/31138112 http://dx.doi.org/10.1186/s12859-019-2885-3 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Robust identification of molecular phenotypes using semi-supervised learning
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6540576/ https://www.ncbi.nlm.nih.gov/pubmed/31138112 http://dx.doi.org/10.1186/s12859-019-2885-3
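
The RESULTS part of the abstract in this record describes an iterative refinement of training class labels and classifier towards self-consistency. A minimal Python sketch of that general idea follows, under clearly stated assumptions: the nearest-centroid classifier, the K-fold held-out relabeling, the toy data, and every function name are illustrative stand-ins chosen for brevity. They are not taken from the article and should not be read as the authors' implementation; clinical and time-to-event data are not modeled here.

```python
import random


def centroid(rows):
    """Mean feature vector of a non-empty list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def held_out_predictions(X, labels, n_folds, rng):
    """Classify each sample with a nearest-centroid model trained without it.

    ASSUMPTION: nearest-centroid and K-fold splitting are stand-ins; the
    record does not say which classifier or resampling scheme the article uses.
    """
    idx = list(range(len(X)))
    rng.shuffle(idx)
    preds = list(labels)
    for f in range(n_folds):
        test = idx[f::n_folds]
        held = set(test)
        train = [i for i in idx if i not in held]
        groups = {c: [X[i] for i in train if labels[i] == c] for c in (0, 1)}
        if not groups[0] or not groups[1]:
            continue  # a class is missing from this training fold: keep labels
        cents = {c: centroid(g) for c, g in groups.items()}
        for i in test:
            closer_to_0 = sq_dist(X[i], cents[0]) <= sq_dist(X[i], cents[1])
            preds[i] = 0 if closer_to_0 else 1
    return preds


def refine_labels(X, initial_labels, n_folds=5, max_iter=50, seed=0):
    """Alternate 'train on current labels' and 'relabel from held-out output'
    until the labels reproduce themselves (self-consistency)."""
    rng = random.Random(seed)
    labels = list(initial_labels)
    for iteration in range(1, max_iter + 1):
        new_labels = held_out_predictions(X, labels, n_folds, rng)
        if new_labels == labels:
            return labels, iteration  # converged: labels are self-consistent
        labels = new_labels
    return labels, max_iter


def make_demo_data():
    """Hypothetical toy data: two tight, well-separated groups of 20 samples,
    with every 4th training label deliberately flipped."""
    base = [(0.2 * (i % 5), 0.2 * (i // 5)) for i in range(20)]
    X = [list(p) for p in base] + [[x + 10.0, y + 10.0] for x, y in base]
    true_labels = [0] * 20 + [1] * 20
    noisy_labels = [1 - t if i % 4 == 0 else t for i, t in enumerate(true_labels)]
    return X, true_labels, noisy_labels


X, true_labels, noisy_labels = make_demo_data()
refined, n_iter = refine_labels(X, noisy_labels)
agree_before = sum(a == b for a, b in zip(noisy_labels, true_labels))
agree_after = sum(a == b for a, b in zip(refined, true_labels))
print(f"agreement with true groups: {agree_before}/40 before, "
      f"{agree_after}/40 after ({n_iter} iterations)")
```

Relabeling from held-out predictions, rather than from the model's output on its own training samples, is a design choice made in this sketch so that a sample's current label cannot simply vote for itself. Whether the article does something comparable cannot be determined from this record.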


Similar Items