A self-supervised deep learning method for data-efficient training in genomics
Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labele...
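Main authors: | Gündüz, Hüseyin Anil; Binder, Martin; To, Xiao-Yin; Mreches, René; Bischl, Bernd; McHardy, Alice C.; Münch, Philipp C.; Rezaei, Mina |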
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10495322/ https://www.ncbi.nlm.nih.gov/pubmed/37696966 http://dx.doi.org/10.1038/s42003-023-05310-2 |
_version_ | 1785104868071964672 |
---|---|
author | Gündüz, Hüseyin Anil Binder, Martin To, Xiao-Yin Mreches, René Bischl, Bernd McHardy, Alice C. Münch, Philipp C. Rezaei, Mina |
author_facet | Gündüz, Hüseyin Anil Binder, Martin To, Xiao-Yin Mreches, René Bischl, Bernd McHardy, Alice C. Münch, Philipp C. Rezaei, Mina |
author_sort | Gündüz, Hüseyin Anil |
collection | PubMed |
description | Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models. |
format | Online Article Text |
id | pubmed-10495322 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-104953222023-09-13 A self-supervised deep learning method for data-efficient training in genomics Gündüz, Hüseyin Anil Binder, Martin To, Xiao-Yin Mreches, René Bischl, Bernd McHardy, Alice C. Münch, Philipp C. Rezaei, Mina Commun Biol Article Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models. Nature Publishing Group UK 2023-09-11 /pmc/articles/PMC10495322/ /pubmed/37696966 http://dx.doi.org/10.1038/s42003-023-05310-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. 
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Gündüz, Hüseyin Anil Binder, Martin To, Xiao-Yin Mreches, René Bischl, Bernd McHardy, Alice C. Münch, Philipp C. Rezaei, Mina A self-supervised deep learning method for data-efficient training in genomics |
title | A self-supervised deep learning method for data-efficient training in genomics |
title_full | A self-supervised deep learning method for data-efficient training in genomics |
title_fullStr | A self-supervised deep learning method for data-efficient training in genomics |
title_full_unstemmed | A self-supervised deep learning method for data-efficient training in genomics |
title_short | A self-supervised deep learning method for data-efficient training in genomics |
title_sort | self-supervised deep learning method for data-efficient training in genomics |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10495322/ https://www.ncbi.nlm.nih.gov/pubmed/37696966 http://dx.doi.org/10.1038/s42003-023-05310-2 |
work_keys_str_mv | AT gunduzhuseyinanil aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT bindermartin aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT toxiaoyin aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT mrechesrene aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT bischlbernd aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT mchardyalicec aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT munchphilippc aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT rezaeimina aselfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT gunduzhuseyinanil selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT bindermartin selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT toxiaoyin selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT mrechesrene selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT bischlbernd selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT mchardyalicec selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT munchphilippc selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics AT rezaeimina selfsuperviseddeeplearningmethodfordataefficienttrainingingenomics |
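
The description field above mentions two ideas: using reverse-complement sequences, and predicting targets of different lengths. The following Python sketch illustrates only those two notions on a toy DNA string. It is not taken from the Self-GenomeNet code and does not reproduce the authors' training procedure. The function names `reverse_complement` and `targets_of_lengths`, the context length, and the target lengths are all hypothetical choices made for illustration.

```python
# Illustrative sketch only; names and parameters are assumptions,
# not part of the Self-GenomeNet implementation.

# Translation table mapping each nucleotide to its complement (A<->T, C<->G).
# Characters outside this table (e.g. "N") are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def targets_of_lengths(seq: str, context_len: int, target_lens: tuple[int, ...]):
    """Split a sequence into a context prefix and several candidate targets.

    Each target starts right after the context and has one of the
    requested lengths; lengths that run past the end of the sequence
    are skipped.
    """
    context = seq[:context_len]
    targets = [
        seq[context_len:context_len + n]
        for n in target_lens
        if context_len + n <= len(seq)
    ]
    return context, targets


if __name__ == "__main__":
    seq = "ACGTACGT"
    print(reverse_complement("ATGC"))
    print(targets_of_lengths(seq, 4, (1, 2, 4)))
```

Applying `reverse_complement` twice returns the original sequence, which is why a strand and its reverse complement can be treated as two views of the same underlying DNA.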