
Deep learning for HGT insertion sites recognition

BACKGROUND: Horizontal gene transfer (HGT) refers to the sharing of genetic material between distant species that are not in a parent-offspring relationship. HGT insertion sites are important for understanding HGT mechanisms. Recent studies of the main agents of HGT, such as transposons and plasmids,...


Bibliographic Details
Main authors: Li, Chen, Chen, Jiaxing, Li, Shuai Cheng
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7771070/
https://www.ncbi.nlm.nih.gov/pubmed/33372605
http://dx.doi.org/10.1186/s12864-020-07296-1
_version_ 1783629641187065856
author Li, Chen
Chen, Jiaxing
Li, Shuai Cheng
author_facet Li, Chen
Chen, Jiaxing
Li, Shuai Cheng
author_sort Li, Chen
collection PubMed
description BACKGROUND: Horizontal gene transfer (HGT) refers to the sharing of genetic material between distant species that are not in a parent-offspring relationship. HGT insertion sites are important for understanding HGT mechanisms. Recent studies of the main agents of HGT, such as transposons and plasmids, demonstrate that insertion sites usually have specific sequence features. This motivates us to find a method to infer HGT insertion sites from sequence features. RESULTS: In this paper, we propose a deep residual network, DeepHGT, to recognize HGT insertion sites. To train DeepHGT, we extracted about 1.55 million sequence segments as training instances from 262 metagenomic samples, with a ratio of positive to negative instances of about 1:1. These segments were randomly partitioned into three subsets: 80% as the training set, 10% as the validation set, and the remaining 10% as the test set. The training loss of DeepHGT is 0.4163 and the validation loss is 0.423. On the test set, DeepHGT achieved an area under the curve (AUC) of 0.8782. To further evaluate the generalization of DeepHGT, we constructed an independent test set containing 689,312 sequence segments from another 147 gut metagenomic samples. On this set, DeepHGT achieved an AUC of 0.8428, close to the previous test AUC. For comparison, the gradient boosting classifier implemented in PyFeat achieves AUC values of 0.694 and 0.686 on the two test sets, respectively. Furthermore, DeepHGT can learn discriminant sequence features; for example, it learned a sequence pattern of palindromic subsequences as a significant (P-value = 0.0182) local feature. Hence, DeepHGT is a reliable model for recognizing HGT insertion sites. CONCLUSION: DeepHGT is the first deep learning model that can accurately recognize HGT insertion sites on genomes according to sequence patterns.
format Online
Article
Text
id pubmed-7771070
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7771070 2020-12-30 Deep learning for HGT insertion sites recognition Li, Chen Chen, Jiaxing Li, Shuai Cheng BMC Genomics Research BioMed Central 2020-12-29 /pmc/articles/PMC7771070/ /pubmed/33372605 http://dx.doi.org/10.1186/s12864-020-07296-1 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Li, Chen
Chen, Jiaxing
Li, Shuai Cheng
Deep learning for HGT insertion sites recognition
title Deep learning for HGT insertion sites recognition
title_full Deep learning for HGT insertion sites recognition
title_fullStr Deep learning for HGT insertion sites recognition
title_full_unstemmed Deep learning for HGT insertion sites recognition
title_short Deep learning for HGT insertion sites recognition
title_sort deep learning for hgt insertion sites recognition
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7771070/
https://www.ncbi.nlm.nih.gov/pubmed/33372605
http://dx.doi.org/10.1186/s12864-020-07296-1
work_keys_str_mv AT lichen deeplearningforhgtinsertionsitesrecognition
AT chenjiaxing deeplearningforhgtinsertionsitesrecognition
AT lishuaicheng deeplearningforhgtinsertionsitesrecognition
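
Note on the evaluation protocol described in the abstract: the record states that segments were randomly partitioned 80% / 10% / 10% and that performance was reported as AUC. The following is a minimal Python sketch of those two steps only, not the authors' code. The function names `split_80_10_10` and `auc`, the fixed seed, and the toy labels and scores are all assumptions made for illustration. The AUC here is computed by direct pairwise comparison, which is fine for small inputs but far too slow for the dataset sizes the abstract mentions.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into 80% / 10% / 10% parts.

    The seed is a hypothetical choice; the record does not say how the
    authors randomized their partition.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive instance has the higher score; ties count as half a win.
    This pairwise form is quadratic in the number of instances.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy stand-ins for sequence segments: 100 integer IDs.
    train, val, test = split_80_10_10(range(100))
    print(len(train), len(val), len(test))
    # Toy labels and classifier scores, invented for this example.
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

With a balanced 1:1 ratio of positives to negatives, as the abstract reports, a classifier that scores at random would be expected to reach an AUC near 0.5, which is the baseline against which the reported values can be read.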