GapPredict – A Language Model for Resolving Gaps in Draft Genome Assemblies

Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequen...

Bibliographic Details
Main authors: Chen, Eric, Chu, Justin, Zhang, Jessica, Warren, René L., Birol, Inanc
Format: Online Article Text
Language: English
Published: 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8772386/
https://www.ncbi.nlm.nih.gov/pubmed/34478378
http://dx.doi.org/10.1109/TCBB.2021.3109557
author Chen, Eric
Chu, Justin
Zhang, Jessica
Warren, René L.
Birol, Inanc
collection PubMed
description Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short-read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as “gaps”. Here, we introduce GapPredict – an implementation of a proof of concept that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that the former can fill 65.6% of the sampled gaps that were left unfilled by the latter with high similarity to the reference genome, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome assembly.
format Online
Article
Text
id pubmed-8772386
institution National Center for Biotechnology Information
language English
publishDate 2021
record_format MEDLINE/PubMed
spelling pubmed-8772386 2022-01-20 IEEE/ACM Trans Comput Biol Bioinform Article 2021 2021-12-08 /pmc/articles/PMC8772386/ /pubmed/34478378 http://dx.doi.org/10.1109/TCBB.2021.3109557 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 License.
title GapPredict – A Language Model for Resolving Gaps in Draft Genome Assemblies
topic Article
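
The description above says that a character-level language model predicts unresolved nucleotides in scaffold gaps. The record gives no architectural details, so the following is only a toy sketch of the general idea, not the authors' method: it swaps the deep learning model for a simple count-based k-mer model that learns which base tends to follow each k-base context in the reads, then extends the sequence flanking a gap one base at a time. The function names (`train_kmer_model`, `predict_gap`), the parameter `k`, and the toy reads are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def train_kmer_model(reads, k):
    """Count which base follows each k-length context across all reads.

    Hypothetical stand-in for training a character-level language model.
    """
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            counts[read[i:i + k]][read[i + k]] += 1
    return counts


def predict_gap(left_flank, counts, k, max_len):
    """Greedily extend the left flank by up to max_len predicted bases."""
    seq = left_flank
    filled = []
    for _ in range(max_len):
        context = seq[-k:]
        if context not in counts:
            break  # context never seen in the reads: stop predicting
        # Pick the base most often observed after this context.
        base = counts[context].most_common(1)[0][0]
        filled.append(base)
        seq += base
    return "".join(filled)


# Toy data (invented): overlapping "reads" drawn from ACGTTGCAAGGCTA.
reads = ["ACGTTGCA", "TGCAAGGC", "CAAGGCTA"]
model = train_kmer_model(reads, k=3)

# Pretend the five bases after the flank ACGTT are an unresolved gap.
filled = predict_gap("ACGTT", model, k=3, max_len=5)
print(filled)  # GCAAG
```

A neural character-level model plays the same role as the count table here: given the preceding bases, it scores the four possible next bases. Unlike this sketch, it can generalize to contexts it has not seen verbatim, which is where a lookup table simply stops.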