Learning meaningful representations of protein sequences
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
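The two contexts named in the abstract can be made concrete with short sketches. In the transfer-learning setting, a common pipeline embeds each sequence with a pretrained protein language model, pools the per-residue representations into a fixed-length vector, and fits a simple downstream predictor on the frozen embeddings. The sketch below illustrates that general recipe only; the ESM-2 checkpoint, the example sequences, and the labels are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal transfer-learning sketch: frozen protein-LM embeddings -> downstream model.
# The specific checkpoint and data here are assumptions for illustration.
import torch
import esm  # pip install fair-esm
from sklearn.linear_model import Ridge

# Load a small pretrained ESM-2 model (weights download on first use).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Hypothetical labeled sequences; a real task would use an assay dataset.
data = [
    ("seq1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT"),
    ("seq2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIE"),
]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])
per_residue = out["representations"][6]  # shape: (batch, tokens, dim)

# Common practice: mean-pool over residue positions (skipping the BOS token
# at index 0 and any padding after the sequence) to get one vector per protein.
embeddings = torch.stack(
    [per_residue[i, 1 : len(seq) + 1].mean(dim=0) for i, (_, seq) in enumerate(data)]
)

# The downstream predictor is trained on the frozen embeddings.
y = [0.7, 0.3]  # hypothetical assay measurements
Ridge().fit(embeddings.numpy(), y)
```

For the second context, interpretable learning, the abstract points to representation geometry: distances in a learned latent space are measured with the metric pulled back through the decoder, so that they reflect changes in the decoded output rather than raw Euclidean latent coordinates. The toy sketch below shows the core computation under stated assumptions (a two-dimensional latent space and a small randomly initialized decoder); the paper's models and metric construction are more elaborate.

```python
# Toy pullback-metric sketch: Riemannian length of a latent curve.
# The 2-D latent space and the tiny decoder are assumptions for illustration.
import torch

decoder = torch.nn.Sequential(  # maps 2-D latent codes to a 20-D output space
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 20)
)

def pullback_metric(z: torch.Tensor) -> torch.Tensor:
    """Metric tensor M(z) = J(z)^T J(z), where J is the decoder Jacobian at z."""
    J = torch.autograd.functional.jacobian(decoder, z)  # shape: (20, 2)
    return J.T @ J

def curve_length(z_points: torch.Tensor) -> torch.Tensor:
    """Riemannian length of a discretized latent curve:
    the sum of sqrt(dz^T M(z) dz) over consecutive segments."""
    total = torch.zeros(())
    for a, b in zip(z_points[:-1], z_points[1:]):
        dz = b - a
        M = pullback_metric((a + b) / 2)  # metric evaluated at the midpoint
        total = total + torch.sqrt(dz @ M @ dz)
    return total

# A straight line in latent coordinates; its Riemannian length generally
# differs from its Euclidean length, which is why geodesics (not straight
# lines) are used for interpolation and distances in the latent space.
line = torch.stack(
    [torch.lerp(torch.tensor([-1.0, -1.0]), torch.tensor([1.0, 1.0]), t)
     for t in torch.linspace(0, 1, 16)]
)
print(float(curve_length(line)))
```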
Main Authors: Detlefsen, Nicki Skafte; Hauberg, Søren; Boomsma, Wouter
Format: Online Article Text
Language: English
Journal: Nat Commun
Published: Nature Publishing Group UK, 2022-04-08
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993921/ https://www.ncbi.nlm.nih.gov/pubmed/35395843 http://dx.doi.org/10.1038/s41467-022-29443-w
collection | PubMed
id | pubmed-8993921 |
institution | National Center for Biotechnology Information |
record_format | MEDLINE/PubMed |
© The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.