
Improving protein domain classification for third-generation sequencing reads using deep learning


Bibliographic Details
Main Authors: Du, Nan, Shang, Jiayu, Sun, Yanni
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8033682/
https://www.ncbi.nlm.nih.gov/pubmed/33836667
http://dx.doi.org/10.1186/s12864-021-07468-7
_version_ 1783676445381361664
author Du, Nan
Shang, Jiayu
Sun, Yanni
author_sort Du, Nan
collection PubMed
description BACKGROUND: With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in TGS data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads.
RESULTS: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject reads not containing the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification.
CONCLUSIONS: In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-021-07468-7.
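The abstract mentions two ideas that a short example may make more concrete: translating a DNA read in three forward reading frames, and rejecting reads that do not belong to any targeted domain (the open-set setting). The Python sketch below is illustrative only and is not taken from ProDOMA. The function names `translate`, `three_frame_translate`, and `classify_or_reject`, the use of `X` for unrecognized codons, and the maximum-probability threshold rule are all assumptions made for this example; the article itself should be consulted for the actual encoding and rejection criterion.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translate(dna):
    """Return the three forward-frame translations of a read."""
    return [translate(dna, frame) for frame in range(3)]


def classify_or_reject(probs, threshold):
    """Hypothetical open-set rule: return the index of the most probable
    class, or None when that probability falls below the threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


frames = three_frame_translate("ATGGCCAAATAG")
print(frames)
print(classify_or_reject([0.1, 0.7, 0.2], threshold=0.5))
print(classify_or_reject([0.4, 0.35, 0.25], threshold=0.5))
```

An insertion or deletion in a noisy read shifts the correct translation from one frame to another, which is why keeping all three frames side by side can preserve partially correct protein signal that a single-frame translation would lose.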
format Online
Article
Text
id pubmed-8033682
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8033682 2021-04-09 Improving protein domain classification for third-generation sequencing reads using deep learning Du, Nan Shang, Jiayu Sun, Yanni BMC Genomics Research Article
BioMed Central 2021-04-09 /pmc/articles/PMC8033682/ /pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7 Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Improving protein domain classification for third-generation sequencing reads using deep learning
title_sort improving protein domain classification for third-generation sequencing reads using deep learning
topic Research Article
work_keys_str_mv AT dunan improvingproteindomainclassificationforthirdgenerationsequencingreadsusingdeeplearning
AT shangjiayu improvingproteindomainclassificationforthirdgenerationsequencingreadsusingdeeplearning
AT sunyanni improvingproteindomainclassificationforthirdgenerationsequencingreadsusingdeeplearning