
Modeling aspects of the language of life through transfer-learning protein sequences

BACKGROUND: Predicting protein function and structure from sequence is an important challenge for computational biology. For 26 years, most state-of-the-art approaches have combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming.

Full description

Bibliographic Details
Main Authors: Heinzinger, Michael, Elnaggar, Ahmed, Wang, Yu, Dallago, Christian, Nechaev, Dmitrii, Matthes, Florian, Rost, Burkhard
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6918593/
https://www.ncbi.nlm.nih.gov/pubmed/31847804
http://dx.doi.org/10.1186/s12859-019-3220-8
author Heinzinger, Michael
Elnaggar, Ahmed
Wang, Yu
Dallago, Christian
Nechaev, Dmitrii
Matthes, Florian
Rost, Burkhard
author_sort Heinzinger, Michael
collection PubMed
description BACKGROUND: Predicting protein function and structure from sequence is an important challenge for computational biology. For 26 years, most state-of-the-art approaches have combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. The methodology introduced here addresses both problems. RESULTS: We introduce a novel way to represent protein sequences as continuous vectors (embeddings) using the language model ELMo, taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound proteins were distinguished from water-soluble ones (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods that use evolutionary information and, for some proteins, even beat the best. Thus, SeqVec embeddings condense the underlying principles of protein sequences. The important novelty, however, is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings in 0.03 s on average. As this speed-up is independent of the size of the growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, e.g. microbiome or metaproteome analysis. CONCLUSION: Transfer learning succeeded in extracting information relevant for various protein prediction tasks from unlabeled sequence databases. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available at the level of a single sequence.
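The two prediction levels described above (per-residue and per-protein) can be illustrated with a minimal sketch. Everything in it beyond what the record states is an assumption: the 1024-dimensional embedding size per residue, the random array standing in for real ELMo/SeqVec output, and mean-pooling over residues as the way to obtain one fixed-length per-protein vector.

```python
import numpy as np

def per_protein_embedding(residue_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (length x dim) matrix of per-residue embeddings into a
    single fixed-length per-protein vector by averaging over residues.
    (Hypothetical pooling step, not taken verbatim from the record.)"""
    return residue_embeddings.mean(axis=0)

# Stand-in for real language-model output: a protein of 120 residues,
# each represented by a 1024-dimensional embedding (assumed size).
rng = np.random.default_rng(0)
residue_emb = rng.standard_normal((120, 1024))

# Per-residue tasks (secondary structure, disorder) would consume the
# full (120, 1024) matrix; per-protein tasks (localization, membrane vs
# water-soluble) would consume the pooled (1024,) vector.
protein_emb = per_protein_embedding(residue_emb)
print(residue_emb.shape, protein_emb.shape)  # (120, 1024) (1024,)
```

A simple classifier trained on these fixed-length vectors is what makes the per-protein predictions independent of sequence length.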
format Online
Article
Text
id pubmed-6918593
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6918593 2019-12-20 Modeling aspects of the language of life through transfer-learning protein sequences Heinzinger, Michael; Elnaggar, Ahmed; Wang, Yu; Dallago, Christian; Nechaev, Dmitrii; Matthes, Florian; Rost, Burkhard. BMC Bioinformatics, Research Article. BioMed Central 2019-12-17 /pmc/articles/PMC6918593/ /pubmed/31847804 http://dx.doi.org/10.1186/s12859-019-3220-8 Text en © The Author(s). 2019 Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Modeling aspects of the language of life through transfer-learning protein sequences
topic Research Article