
Accurate identification of bacteriophages from metagenomic data using Transformer

MOTIVATION: Bacteriophages are viruses infecting bacteria. Being key players in microbial communities, they can regulate the composition/function of microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic materials f...


Bibliographic Details
Main authors: Shang, Jiayu; Tang, Xubo; Guo, Ruocheng; Sun, Yanni
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9294416/
https://www.ncbi.nlm.nih.gov/pubmed/35769000
http://dx.doi.org/10.1093/bib/bbac258
_version_ 1784749848974589952
author Shang, Jiayu
Tang, Xubo
Guo, Ruocheng
Sun, Yanni
author_facet Shang, Jiayu
Tang, Xubo
Guo, Ruocheng
Sun, Yanni
author_sort Shang, Jiayu
collection PubMed
description MOTIVATION: Bacteriophages are viruses that infect bacteria. As key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from various microbiomes, has become a popular means of discovering new phages. However, accurate and comprehensive detection of phages in metagenomic data remains difficult. High diversity and abundance, together with limited reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data. RESULTS: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins’ positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predict the label for test contigs. We rigorously tested our tool, named PhaMer, on multiple datasets of increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
format Online
Article
Text
id pubmed-9294416
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-92944162022-07-20 Accurate identification of bacteriophages from metagenomic data using Transformer Shang, Jiayu Tang, Xubo Guo, Ruocheng Sun, Yanni Brief Bioinform Problem Solving Protocol MOTIVATION: Bacteriophages are viruses infecting bacteria. Being key players in microbial communities, they can regulate the composition/function of microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic materials from various microbiome, has become a popular means for new phage discovery. However, accurate and comprehensive detection of phages from the metagenomic data remains difficult. High diversity/abundance, and limited reference genomes pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or precision on metagenomic data. RESULTS: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins’ positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predicts the label for test contigs. We rigorously tested our developed tool named PhaMer on multiple datasets with increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%. Oxford University Press 2022-06-30 /pmc/articles/PMC9294416/ /pubmed/35769000 http://dx.doi.org/10.1093/bib/bbac258 Text en © The Author(s) 2022. Published by Oxford University Press. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Problem Solving Protocol
Shang, Jiayu
Tang, Xubo
Guo, Ruocheng
Sun, Yanni
Accurate identification of bacteriophages from metagenomic data using Transformer
title Accurate identification of bacteriophages from metagenomic data using Transformer
title_full Accurate identification of bacteriophages from metagenomic data using Transformer
title_fullStr Accurate identification of bacteriophages from metagenomic data using Transformer
title_full_unstemmed Accurate identification of bacteriophages from metagenomic data using Transformer
title_short Accurate identification of bacteriophages from metagenomic data using Transformer
title_sort accurate identification of bacteriophages from metagenomic data using transformer
topic Problem Solving Protocol
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9294416/
https://www.ncbi.nlm.nih.gov/pubmed/35769000
http://dx.doi.org/10.1093/bib/bbac258
work_keys_str_mv AT shangjiayu accurateidentificationofbacteriophagesfrommetagenomicdatausingtransformer
AT tangxubo accurateidentificationofbacteriophagesfrommetagenomicdatausingtransformer
AT guoruocheng accurateidentificationofbacteriophagesfrommetagenomicdatausingtransformer
AT sunyanni accurateidentificationofbacteriophagesfrommetagenomicdatausingtransformer
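
The abstract in this record describes feeding each contig's protein composition and protein positions into a Transformer through a protein-cluster vocabulary. The sketch below illustrates only that input-encoding idea and is not PhaMer's actual code. The function names, the `PC_…` cluster identifiers, the reserved `PAD`/`UNK` token ids and the fixed `max_len` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical reserved token ids (not taken from the paper).
PAD = 0  # padding for contigs with fewer than max_len proteins
UNK = 1  # protein that matched no known protein cluster


def build_vocab(cluster_ids: List[str]) -> Dict[str, int]:
    """Assign one integer token to each distinct protein-cluster id."""
    vocab: Dict[str, int] = {}
    for cid in cluster_ids:
        if cid not in vocab:
            vocab[cid] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode_contig(
    proteins: List[Optional[str]],
    vocab: Dict[str, int],
    max_len: int,
) -> Tuple[List[int], List[int]]:
    """Encode a contig as (tokens, positions), both of length max_len.

    `proteins` lists, in genomic order, the cluster id matched by each
    predicted protein, or None when no cluster matched.
    Positions are 1-based; padded slots get position 0.
    """
    kept = proteins[:max_len]  # truncate overly long contigs
    tokens = [UNK if p is None else vocab.get(p, UNK) for p in kept]
    positions = list(range(1, len(tokens) + 1))
    pad = max_len - len(tokens)
    return tokens + [PAD] * pad, positions + [0] * pad


if __name__ == "__main__":
    vocab = build_vocab(["PC_1", "PC_2", "PC_1", "PC_3"])
    tokens, positions = encode_contig(["PC_2", None, "PC_1"], vocab, max_len=5)
    print(tokens)
    print(positions)
```

In a full model, the token and position sequences would typically be turned into embeddings and summed before the self-attention layers. How PhaMer builds its clusters, handles unmatched proteins or chooses the sequence length is not stated in this record.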