
Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis

Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training an ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformat...


Bibliographic Details
Main Authors:	Raimondi, Daniele, Orlando, Gabriele, Vranken, Wim F., Moreau, Yves
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6858301/ https://www.ncbi.nlm.nih.gov/pubmed/31729443 http://dx.doi.org/10.1038/s41598-019-53324-w

author	Raimondi, Daniele Orlando, Gabriele Vranken, Wim F. Moreau, Yves
collection	PubMed
description	Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training an ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acid properties that are intuitively relevant for many structural and functional aspects of proteins, and are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and we show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the “biologically meaningful” scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and “real” propensity scales by properly prioritizing the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumption-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.
format	Online Article Text
id	pubmed-6858301
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-6858301 2019-11-27 Sci Rep Article
Nature Publishing Group UK 2019-11-15 /pmc/articles/PMC6858301/ /pubmed/31729443 http://dx.doi.org/10.1038/s41598-019-53324-w Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title	Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6858301/ https://www.ncbi.nlm.nih.gov/pubmed/31729443 http://dx.doi.org/10.1038/s41598-019-53324-w
