Improving protein domain classification for third-generation sequencing reads using deep learning
BACKGROUND: With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the...
Main Authors: | Du, Nan; Shang, Jiayu; Sun, Yanni |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8033682/ https://www.ncbi.nlm.nih.gov/pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7 |
_version_ | 1783676445381361664 |
---|---|
author | Du, Nan Shang, Jiayu Sun, Yanni |
author_facet | Du, Nan Shang, Jiayu Sun, Yanni |
author_sort | Du, Nan |
collection | PubMed |
description | BACKGROUND: With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in TGS data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. RESULTS: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject reads not containing the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. CONCLUSIONS: In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s12864-021-07468-7). |
format | Online Article Text |
id | pubmed-8033682 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8033682 2021-04-09 Improving protein domain classification for third-generation sequencing reads using deep learning Du, Nan Shang, Jiayu Sun, Yanni BMC Genomics Research Article BioMed Central 2021-04-09 /pmc/articles/PMC8033682/ /pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Du, Nan Shang, Jiayu Sun, Yanni Improving protein domain classification for third-generation sequencing reads using deep learning |
title | Improving protein domain classification for third-generation sequencing reads using deep learning |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8033682/ https://www.ncbi.nlm.nih.gov/pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7 |
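
The description in this record says that ProDOMA uses a 3-frame translation encoding so that conserved features can be learned from partially correct translations. The sketch below only illustrates the general idea of translating a DNA read in its three forward reading frames; it is not taken from the ProDOMA code. The function name `three_frame_translate`, the use of the standard genetic code, and the choice of `X` for codons containing ambiguous bases are assumptions made for this example.

```python
# Minimal sketch (assumption: standard genetic code, forward strand only).
# Not the ProDOMA implementation; names here are hypothetical.

BASES = "TCAG"
# Amino acids for all 64 codons, ordered with T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES),
        AMINO_ACIDS,
    )
}


def three_frame_translate(dna: str) -> list[str]:
    """Translate a DNA read in forward frames 0, 1 and 2.

    Incomplete trailing codons are dropped; codons with characters
    outside A/C/G/T (for example N) are translated to 'X'.
    """
    dna = dna.upper()
    frames = []
    for offset in range(3):
        protein = []
        # Walk the read codon by codon, starting at this frame's offset.
        for i in range(offset, len(dna) - 2, 3):
            protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
        frames.append("".join(protein))
    return frames


if __name__ == "__main__":
    clean = "ATGGCCAAAGGG"    # error-free coding sequence
    noisy = "ATGGCCTAAAGGG"   # same sequence with one inserted base (T)
    print(three_frame_translate(clean))
    print(three_frame_translate(noisy))
```

In the noisy read, the residues downstream of the inserted base no longer appear in frame 0 but reappear in frame 1, which is why keeping all three frames preserves partially correct translations when insertions or deletions shift the reading frame.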