
The language of proteins: NLP, machine learning & protein sequences

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of...


Bibliographic Details
Main authors: Ofer, Dan; Brandes, Nadav; Linial, Michal
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2021
Subjects: Review Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8050421/
https://www.ncbi.nlm.nih.gov/pubmed/33897979
http://dx.doi.org/10.1016/j.csbj.2021.03.022
author Ofer, Dan
Brandes, Nadav
Linial, Michal
collection PubMed
description Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
format Online
Article
Text
id pubmed-8050421
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-8050421 2021-04-23 The language of proteins: NLP, machine learning & protein sequences Ofer, Dan Brandes, Nadav Linial, Michal Comput Struct Biotechnol J Review Article Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research. Research Network of Computational and Structural Biotechnology 2021-03-25 /pmc/articles/PMC8050421/ /pubmed/33897979 http://dx.doi.org/10.1016/j.csbj.2021.03.022 Text en © 2021 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title The language of proteins: NLP, machine learning & protein sequences
topic Review Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8050421/
https://www.ncbi.nlm.nih.gov/pubmed/33897979
http://dx.doi.org/10.1016/j.csbj.2021.03.022