Deep self-supervised learning for biosynthetic gene cluster detection and product classification
Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an...
Main authors: | Rios-Martinez, Carolina; Bhattacharya, Nicholas; Amini, Ava P.; Crawford, Lorin; Yang, Kevin K. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2023 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10241353/ https://www.ncbi.nlm.nih.gov/pubmed/37220151 http://dx.doi.org/10.1371/journal.pcbi.1011162 |
_version_ | 1785053966466285568 |
---|---|
author | Rios-Martinez, Carolina Bhattacharya, Nicholas Amini, Ava P. Crawford, Lorin Yang, Kevin K. |
author_facet | Rios-Martinez, Carolina Bhattacharya, Nicholas Amini, Ava P. Crawford, Lorin Yang, Kevin K. |
author_sort | Rios-Martinez, Carolina |
collection | PubMed |
description | Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification. |
format | Online Article Text |
id | pubmed-10241353 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-10241353 2023-06-06 Deep self-supervised learning for biosynthetic gene cluster detection and product classification Rios-Martinez, Carolina Bhattacharya, Nicholas Amini, Ava P. Crawford, Lorin Yang, Kevin K. PLoS Comput Biol Research Article Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification. Public Library of Science 2023-05-23 /pmc/articles/PMC10241353/ /pubmed/37220151 http://dx.doi.org/10.1371/journal.pcbi.1011162 Text en © 2023 Rios-Martinez et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Rios-Martinez, Carolina Bhattacharya, Nicholas Amini, Ava P. Crawford, Lorin Yang, Kevin K. Deep self-supervised learning for biosynthetic gene cluster detection and product classification |
title | Deep self-supervised learning for biosynthetic gene cluster detection and product classification |
title_full | Deep self-supervised learning for biosynthetic gene cluster detection and product classification |
title_fullStr | Deep self-supervised learning for biosynthetic gene cluster detection and product classification |
title_full_unstemmed | Deep self-supervised learning for biosynthetic gene cluster detection and product classification |
title_short | Deep self-supervised learning for biosynthetic gene cluster detection and product classification |
title_sort | deep self-supervised learning for biosynthetic gene cluster detection and product classification |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10241353/ https://www.ncbi.nlm.nih.gov/pubmed/37220151 http://dx.doi.org/10.1371/journal.pcbi.1011162 |
work_keys_str_mv | AT riosmartinezcarolina deepselfsupervisedlearningforbiosyntheticgeneclusterdetectionandproductclassification AT bhattacharyanicholas deepselfsupervisedlearningforbiosyntheticgeneclusterdetectionandproductclassification AT aminiavap deepselfsupervisedlearningforbiosyntheticgeneclusterdetectionandproductclassification AT crawfordlorin deepselfsupervisedlearningforbiosyntheticgeneclusterdetectionandproductclassification AT yangkevink deepselfsupervisedlearningforbiosyntheticgeneclusterdetectionandproductclassification |