
rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments

In the last decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins in the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses....


Bibliographic Details
Main Authors: Mirabello, Claudio; Wallner, Björn
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6695225/
https://www.ncbi.nlm.nih.gov/pubmed/31415569
http://dx.doi.org/10.1371/journal.pone.0220182
_version_ 1783443998776492032
author Mirabello, Claudio
Wallner, Björn
author_facet Mirabello, Claudio
Wallner, Björn
author_sort Mirabello, Claudio
collection PubMed
description In the last decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins in the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses. The recent advent of Deep Learning has renewed interest in neural networks, with dozens of methods being developed to take advantage of these new architectures. However, most methods are still heavily based on pre-processing of the input data, as well as on the extraction and integration of multiple hand-picked and manually designed features. Multiple Sequence Alignments (MSA) are the most common source of information in de novo prediction methods. Deep Networks that automatically refine the MSA and extract useful features from it would be immensely powerful. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing: amino acid sequences are mapped into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering pre-calculated features such as sequence profiles and other features calculated from the MSA obsolete. We showcased the rawMSA methodology on three different prediction problems: secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on par with methods using more pre-calculated features in the inter-residue contact map prediction category in CASP12 and CASP13. These results demonstrate that rawMSA represents a promising development that can pave the way for improved methods using rawMSA instead of sequence profiles to represent evolutionary information in the coming years. Availability: datasets, dataset generation code, evaluation code and models are available at: https://bitbucket.org/clami66/rawmsa.
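The embedding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the alphabet ordering, the function names (`encode_msa`, `make_embedding_table`, `embed`), the embedding dimension and the toy alignment are all assumptions made for illustration, and the random table only stands in for weights that a real network would learn during training.

```python
import random

# 20 standard amino acids plus '-' for alignment gaps.
# The ordering is arbitrary and assumed here, not taken from rawMSA.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
TOKEN_ID = {symbol: index for index, symbol in enumerate(ALPHABET)}


def encode_msa(msa):
    """Turn aligned sequences (equal-length strings) into rows of integer tokens."""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [[TOKEN_ID[symbol] for symbol in row] for row in msa]


def make_embedding_table(dim, seed=0):
    """One vector per symbol; random here, trained weights in a real network."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in ALPHABET]


def embed(tokens, table):
    """Look up one vector per residue: shape (sequences, columns, dim)."""
    return [[table[token] for token in row] for row in tokens]


# Toy alignment, invented for illustration only.
msa = ["MKT-AY", "MRTQAY", "MKSQ-Y"]
tokens = encode_msa(msa)
table = make_embedding_table(dim=4)
embedded = embed(tokens, table)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 3 6 4
```

The point of the sketch is that the network receives the alignment itself, one integer token per residue, rather than a profile computed from it. Symbols outside the assumed alphabet (for example `X`) would raise a `KeyError` here and would need their own token in practice.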
format Online
Article
Text
id pubmed-6695225
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6695225 2019-08-16 rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments Mirabello, Claudio Wallner, Björn PLoS One Research Article Public Library of Science 2019-08-15 /pmc/articles/PMC6695225/ /pubmed/31415569 http://dx.doi.org/10.1371/journal.pone.0220182 Text en © 2019 Mirabello, Wallner http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Mirabello, Claudio
Wallner, Björn
rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments
title rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6695225/
https://www.ncbi.nlm.nih.gov/pubmed/31415569
http://dx.doi.org/10.1371/journal.pone.0220182