Improving protein domain classification for third-generation sequencing reads using deep learning

BACKGROUND: With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the...

Bibliographic Details
Main Authors:	Du, Nan, Shang, Jiayu, Sun, Yanni
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8033682/ https://www.ncbi.nlm.nih.gov/pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7

_version_	1783676445381361664
author	Du, Nan Shang, Jiayu Sun, Yanni
author_facet	Du, Nan Shang, Jiayu Sun, Yanni
author_sort	Du, Nan
collection	PubMed
description	BACKGROUND: With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in TGS data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. RESULTS: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject reads not containing the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. CONCLUSIONS: In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s12864-021-07468-7).
format	Online Article Text
id	pubmed-8033682
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8033682 2021-04-09 Improving protein domain classification for third-generation sequencing reads using deep learning Du, Nan Shang, Jiayu Sun, Yanni BMC Genomics Research Article
BioMed Central 2021-04-09 /pmc/articles/PMC8033682/ /pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Article Du, Nan Shang, Jiayu Sun, Yanni Improving protein domain classification for third-generation sequencing reads using deep learning
title	Improving protein domain classification for third-generation sequencing reads using deep learning
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8033682/ https://www.ncbi.nlm.nih.gov/pubmed/33836667 http://dx.doi.org/10.1186/s12864-021-07468-7

Similar Items