
DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data

Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Due to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate functio...


Bibliographic Details
Main Authors: Chu, Yunmeng, Guo, Shun, Cui, Dachao, Fu, Xiongfei, Ma, Yingfei
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9188312/
https://www.ncbi.nlm.nih.gov/pubmed/35698617
http://dx.doi.org/10.7717/peerj.13404
_version_ 1784725346854109184
author Chu, Yunmeng
Guo, Shun
Cui, Dachao
Fu, Xiongfei
Ma, Yingfei
author_sort Chu, Yunmeng
collection PubMed
description Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Due to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for novel phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins from metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To overcome the false-positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values demonstrates high precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the results showed that, compared to conventional alignment-based methods, our proposed framework had a particular advantage in identifying novel phage-specific protein sequences of Portal and TerL with remote homology to their counterparts in the training datasets. In summary, our study develops, for the first time, a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.
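The two steps the abstract names, one-hot encoding of protein sequences and a cutoff on per-sequence loss values, can be illustrated with a minimal sketch. This is not the DeephageTP code: the 20-letter alphabet and its order, the zero-padding to a fixed length, the use of cross-entropy as the loss, the quantile rule for choosing the cutoff, and all function names are assumptions made here for illustration only.

```python
import math

# Assumption: the 20 standard amino acids, in this (arbitrary) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a protein as a max_len x 20 matrix of 0/1 values.

    Shorter sequences are padded with all-zero rows, longer ones are
    truncated, and residues outside the alphabet (e.g. X) become
    all-zero rows. These conventions are assumptions of this sketch.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def cross_entropy_loss(probabilities, class_index):
    """Loss of one prediction for one class: -log p(class)."""
    return -math.log(max(probabilities[class_index], 1e-12))


def cutoff_from_losses(reference_losses, quantile=0.95):
    """Choose a cutoff from the losses of known sequences of one category.

    Uses a nearest-rank quantile; the quantile rule itself is a
    hypothetical stand-in for however the cutoffs are actually chosen.
    """
    ordered = sorted(reference_losses)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def accept_prediction(probabilities, class_index, cutoff):
    """Keep a predicted label only if its loss does not exceed the cutoff."""
    return cross_entropy_loss(probabilities, class_index) <= cutoff
```

In use, a classifier's output probabilities for a candidate sequence would be passed to `accept_prediction` together with a cutoff derived from known members of the predicted category, so that low-confidence assignments are discarded rather than reported as hits.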
format Online
Article
Text
id pubmed-9188312
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9188312 2022-06-12 DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data Chu, Yunmeng Guo, Shun Cui, Dachao Fu, Xiongfei Ma, Yingfei PeerJ Bioinformatics Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Due to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for novel phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins from metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To overcome the false-positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values demonstrates high precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the results showed that, compared to conventional alignment-based methods, our proposed framework had a particular advantage in identifying novel phage-specific protein sequences of Portal and TerL with remote homology to their counterparts in the training datasets.
In summary, our study develops, for the first time, a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP. PeerJ Inc. 2022-06-08 /pmc/articles/PMC9188312/ /pubmed/35698617 http://dx.doi.org/10.7717/peerj.13404 Text en ©2022 Chu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9188312/
https://www.ncbi.nlm.nih.gov/pubmed/35698617
http://dx.doi.org/10.7717/peerj.13404