
Deep self-supervised learning for biosynthetic gene cluster detection and product classification

Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an...


Bibliographic Details
Main authors: Rios-Martinez, Carolina; Bhattacharya, Nicholas; Amini, Ava P.; Crawford, Lorin; Yang, Kevin K.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10241353/
https://www.ncbi.nlm.nih.gov/pubmed/37220151
http://dx.doi.org/10.1371/journal.pcbi.1011162
_version_ 1785053966466285568
author Rios-Martinez, Carolina
Bhattacharya, Nicholas
Amini, Ava P.
Crawford, Lorin
Yang, Kevin K.
collection PubMed
description Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.
format Online
Article
Text
id pubmed-10241353
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10241353 2023-06-06 Deep self-supervised learning for biosynthetic gene cluster detection and product classification. Rios-Martinez, Carolina; Bhattacharya, Nicholas; Amini, Ava P.; Crawford, Lorin; Yang, Kevin K. PLoS Comput Biol. Research Article. Public Library of Science 2023-05-23 /pmc/articles/PMC10241353/ /pubmed/37220151 http://dx.doi.org/10.1371/journal.pcbi.1011162 Text en © 2023 Rios-Martinez et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Deep self-supervised learning for biosynthetic gene cluster detection and product classification
topic Research Article