DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention

BACKGROUND: Repetitive elements contribute a large part of eukaryotic genomes. For example, about 40 to 50% of human, mouse and rat genomes are repetitive. So identifying and classifying repeats is an important step in genome annotation. This annotation step is traditionally performed using alignmen...

Bibliographic Details
Main Authors: Hausmann, Fabian; Kurtz, Stefan
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects: Software Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8381506/
https://www.ncbi.nlm.nih.gov/pubmed/34425870
http://dx.doi.org/10.1186/s13015-021-00199-0
_version_ 1783741383212793856
author Hausmann, Fabian
Kurtz, Stefan
collection PubMed
description BACKGROUND: Repetitive elements contribute a large part of eukaryotic genomes. For example, about 40 to 50% of human, mouse and rat genomes are repetitive. Identifying and classifying repeats is therefore an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements.
RESULTS: We have developed the methods of dna-brnn further and engineered a new software tool, DeepGRP. This combines the basic concepts of Li (Bioinformatics 35:4408–4410, 2019) with current techniques developed for neural machine translation, the attention mechanism, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP when compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, using RepeatMasker-based training data, to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats that are annotated in the Dfam database but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than dna-brnn, approx. 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database.
CONCLUSIONS: By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to dna-brnn. Improved running times are obtained by employing TensorFlow as the implementation framework and by using GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13015-021-00199-0.
format Online
Article
Text
id pubmed-8381506
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8381506 2021-08-23 DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention Hausmann, Fabian Kurtz, Stefan Algorithms Mol Biol Software Article BioMed Central 2021-08-23 /pmc/articles/PMC8381506/ /pubmed/34425870 http://dx.doi.org/10.1186/s13015-021-00199-0 Text en
© The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention
topic Software Article