DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data
Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Owing to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate functio...
Main Authors: | Chu, Yunmeng; Guo, Shun; Cui, Dachao; Fu, Xiongfei; Ma, Yingfei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2022 |
Subjects: | Bioinformatics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9188312/ https://www.ncbi.nlm.nih.gov/pubmed/35698617 http://dx.doi.org/10.7717/peerj.13404 |
_version_ | 1784725346854109184 |
---|---|
author | Chu, Yunmeng Guo, Shun Cui, Dachao Fu, Xiongfei Ma, Yingfei |
author_facet | Chu, Yunmeng Guo, Shun Cui, Dachao Fu, Xiongfei Ma, Yingfei |
author_sort | Chu, Yunmeng |
collection | PubMed |
description | Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Owing to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for newly discovered phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three specific proteins of Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three specific proteins from metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To overcome the false-positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values demonstrates high precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the results showed that, compared to conventional alignment-based methods, our proposed framework had a particular advantage in identifying novel phage-specific protein sequences of Portal and TerL with remote homology to their counterparts in the training datasets. In summary, our study for the first time develops a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP. |
format | Online Article Text |
id | pubmed-9188312 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9188312 2022-06-12 DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data Chu, Yunmeng Guo, Shun Cui, Dachao Fu, Xiongfei Ma, Yingfei PeerJ Bioinformatics Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Owing to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for newly discovered phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three specific proteins of Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three specific proteins from metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To overcome the false-positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values demonstrates high precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the results showed that, compared to conventional alignment-based methods, our proposed framework had a particular advantage in identifying novel phage-specific protein sequences of Portal and TerL with remote homology to their counterparts in the training datasets. In summary, our study for the first time develops a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP. PeerJ Inc. 2022-06-08 /pmc/articles/PMC9188312/ /pubmed/35698617 http://dx.doi.org/10.7717/peerj.13404 Text en ©2022 Chu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Chu, Yunmeng Guo, Shun Cui, Dachao Fu, Xiongfei Ma, Yingfei DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_full | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_fullStr | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_full_unstemmed | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_short | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_sort | deephagetp: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9188312/ https://www.ncbi.nlm.nih.gov/pubmed/35698617 http://dx.doi.org/10.7717/peerj.13404 |
work_keys_str_mv | AT chuyunmeng deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT guoshun deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT cuidachao deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT fuxiongfei deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT mayingfei deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata |
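
The description field mentions two technical steps: one-hot encoding of protein sequences as model input, and a per-category cutoff on loss values to suppress false positives. The sketch below illustrates both ideas in plain Python. It is not taken from the DeephageTP repository: the 20-letter alphabet, the zero-padding scheme, the function names, the class list, and the cutoff numbers are all assumptions made for illustration, and the loss is assumed to be the cross-entropy of the top-scoring class.

```python
import math

# Assumption: the 20 standard amino acids; the record does not state the alphabet used.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Hypothetical class order for the classifier output.
CLASSES = ["Portal", "TerL", "TerS", "other"]


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Unknown residues (e.g. 'X') also map to zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def top_class_loss(probabilities):
    """Return (index, loss) for the top-scoring class.

    The loss is -log(p), i.e. the cross-entropy the prediction would incur
    if the top-scoring class were the true label.
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, -math.log(max(probabilities[best], 1e-12))


def apply_cutoff(probabilities, cutoffs):
    """Keep a phage-protein call only if its loss is within the class cutoff.

    `cutoffs` maps a class name to its maximum accepted loss (hypothetical
    values; in practice they would be derived from the loss distribution of
    known sequences in that category). Rejected calls fall back to 'other'.
    """
    best, loss = top_class_loss(probabilities)
    label = CLASSES[best]
    if label == "other":
        return label
    return label if loss <= cutoffs.get(label, 0.0) else "other"


if __name__ == "__main__":
    encoded = one_hot_encode("MKV", max_len=5)
    print(len(encoded), len(encoded[0]))
    cutoffs = {"Portal": 0.2, "TerL": 0.2, "TerS": 0.2}  # illustrative only
    print(apply_cutoff([0.05, 0.90, 0.03, 0.02], cutoffs))
    print(apply_cutoff([0.20, 0.60, 0.10, 0.10], cutoffs))
```

With a stricter (lower) cutoff, fewer sequences are accepted as Portal, TerL, or TerS, which trades recall for precision; that is the general effect the description attributes to the cutoff-loss-value strategy.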