PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS a...
Main Authors: | Tang, Xubo; Shang, Jiayu; Ji, Yongxin; Sun, Yanni |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Methods Online |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10450166/ https://www.ncbi.nlm.nih.gov/pubmed/37427782 http://dx.doi.org/10.1093/nar/gkad578 |
_version_ | 1785095137882275840 |
---|---|
author | Tang, Xubo Shang, Jiayu Ji, Yongxin Sun, Yanni |
author_facet | Tang, Xubo Shang, Jiayu Ji, Yongxin Sun, Yanni |
author_sort | Tang, Xubo |
collection | PubMed |
description | Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool PLASMe that capitalizes on the strengths of alignment and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools. |
format | Online Article Text |
id | pubmed-10450166 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-104501662023-08-26 PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer Tang, Xubo Shang, Jiayu Ji, Yongxin Sun, Yanni Nucleic Acids Res Methods Online Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool PLASMe that capitalizes on the strength of alignment and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, Transformer can learn the importance of proteins and their correlation through positionally token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools. Oxford University Press 2023-07-10 /pmc/articles/PMC10450166/ /pubmed/37427782 http://dx.doi.org/10.1093/nar/gkad578 Text en © The Author(s) 2023. 
Published by Oxford University Press on behalf of Nucleic Acids Research. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methods Online Tang, Xubo Shang, Jiayu Ji, Yongxin Sun, Yanni PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer |
title | PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer |
title_full | PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer |
title_fullStr | PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer |
title_full_unstemmed | PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer |
title_short | PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer |
title_sort | plasme: a tool to identify plasmid contigs from short-read assemblies using transformer |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10450166/ https://www.ncbi.nlm.nih.gov/pubmed/37427782 http://dx.doi.org/10.1093/nar/gkad578 |
work_keys_str_mv | AT tangxubo plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer AT shangjiayu plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer AT jiyongxin plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer AT sunyanni plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer |