PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer

Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS a...

Bibliographic Details
Main Authors: Tang, Xubo; Shang, Jiayu; Ji, Yongxin; Sun, Yanni
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10450166/
https://www.ncbi.nlm.nih.gov/pubmed/37427782
http://dx.doi.org/10.1093/nar/gkad578
_version_ 1785095137882275840
author Tang, Xubo
Shang, Jiayu
Ji, Yongxin
Sun, Yanni
author_facet Tang, Xubo
Shang, Jiayu
Ji, Yongxin
Sun, Yanni
author_sort Tang, Xubo
collection PubMed
description Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool, PLASMe, that capitalizes on the strengths of alignment-based and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe, while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, the Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.
format Online
Article
Text
id pubmed-10450166
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-104501662023-08-26 PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer Tang, Xubo Shang, Jiayu Ji, Yongxin Sun, Yanni Nucleic Acids Res Methods Online Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool, PLASMe, that capitalizes on the strengths of alignment-based and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe, while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, the Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools. Oxford University Press 2023-07-10 /pmc/articles/PMC10450166/ /pubmed/37427782 http://dx.doi.org/10.1093/nar/gkad578 Text en © The Author(s) 2023. 
Published by Oxford University Press on behalf of Nucleic Acids Research. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methods Online
Tang, Xubo
Shang, Jiayu
Ji, Yongxin
Sun, Yanni
PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
title PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
title_full PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
title_fullStr PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
title_full_unstemmed PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
title_short PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer
title_sort plasme: a tool to identify plasmid contigs from short-read assemblies using transformer
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10450166/
https://www.ncbi.nlm.nih.gov/pubmed/37427782
http://dx.doi.org/10.1093/nar/gkad578
work_keys_str_mv AT tangxubo plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer
AT shangjiayu plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer
AT jiyongxin plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer
AT sunyanni plasmeatooltoidentifyplasmidcontigsfromshortreadassembliesusingtransformer