Identification of bacteriophage genome sequences with representation learning

Bibliographic Details
Main Authors: Bai, Zeheng; Zhang, Yao-zhong; Miyano, Satoru; Yamaguchi, Rui; Fujimoto, Kosuke; Uematsu, Satoshi; Imoto, Seiya
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects: Original Papers
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9477532/
https://www.ncbi.nlm.nih.gov/pubmed/35920769
http://dx.doi.org/10.1093/bioinformatics/btac509
_version_ 1784790382275461120
author Bai, Zeheng
Zhang, Yao-zhong
Miyano, Satoru
Yamaguchi, Rui
Fujimoto, Kosuke
Uematsu, Satoshi
Imoto, Seiya
author_sort Bai, Zeheng
collection PubMed
description MOTIVATION: Bacteriophages (phages) are viruses that infect and replicate within bacteria and archaea, and they are abundant in the human body. To investigate the relationship between phages and microbial communities, the first step is to identify phages from metagenome sequences. Currently, there are two main methods for identifying phages: database-based (alignment-based) methods and alignment-free methods. Database-based methods typically use a large number of sequences as references; alignment-free methods usually learn the features of the sequences with machine learning and deep learning models. RESULTS: We propose INHERIT, which uses a deep representation learning model to integrate database-based and alignment-free methods, combining the strengths of both. Pre-training is used as an alternative way of acquiring knowledge representations from existing databases, while the BERT-style deep learning framework retains the advantage of alignment-free methods. We compare INHERIT with four existing methods on a third-party benchmark dataset. Our experiments show that INHERIT achieves better performance, with an F1-score of 0.9932. In addition, we find that pre-training on the two species separately helps the non-alignment deep learning model make more accurate predictions. AVAILABILITY AND IMPLEMENTATION: The code for INHERIT is available at https://github.com/Celestial-Bai/INHERIT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9477532
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9477532 2022-09-19 Identification of bacteriophage genome sequences with representation learning Bai, Zeheng Zhang, Yao-zhong Miyano, Satoru Yamaguchi, Rui Fujimoto, Kosuke Uematsu, Satoshi Imoto, Seiya Bioinformatics Original Papers Oxford University Press 2022-08-03 /pmc/articles/PMC9477532/ /pubmed/35920769 http://dx.doi.org/10.1093/bioinformatics/btac509 Text en © The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Identification of bacteriophage genome sequences with representation learning
topic Original Papers