
RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes

Numerous published genomes contain gaps or unknown sequences. Gap filling is a critical final step in de novo genome assembly, particularly for large genomes. While certain computational approaches partially address the problem, others have shortcomings regarding the draft genome’s dependability and...


Bibliographic Details
Main authors: Midekso, Firaol Dida; Yi, Gangman
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9575681/
https://www.ncbi.nlm.nih.gov/pubmed/36262414
http://dx.doi.org/10.7717/peerj.14186
_version_ 1784811363659415552
author Midekso, Firaol Dida
Yi, Gangman
author_facet Midekso, Firaol Dida
Yi, Gangman
author_sort Midekso, Firaol Dida
collection PubMed
description Numerous published genomes contain gaps or unknown sequences. Gap filling is a critical final step in de novo genome assembly, particularly for large genomes. Existing computational approaches address the problem only partially, and some compromise the reliability and correctness of the draft genome (high rates of mis-assembly at gap-closing sites and high error rates). Although it is well established that genomic repeats result in gaps, existing approaches typically miss many sequence reads that originate from repeat-related gaps. This paper presents RFfiller, a fast and reliable statistical algorithm for closing gaps in a draft genome. It uses the alignment statistics between scaffolds, contigs, and paired-end reads to generate a Markov chain that assigns contigs or long reads to scaffold gap regions (correcting only candidate regions), resulting in accurate and efficient gap closure. To reconstruct the missing sequence between the two ends of the same insert, RFfiller searches for valid overlaps (in repeat regions) and generates transition tables for similar reads, allowing it to make a statistical estimate of the missing sequence. RFfiller was validated on assembly benchmarks. In experiments on sequence data from various organisms, its gap-closing accuracy was better than that of other publicly available tools. The results show that RFfiller fills gaps efficiently and is especially effective on longer gaps.
format Online
Article
Text
id pubmed-9575681
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9575681 2022-10-18 RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes Midekso, Firaol Dida Yi, Gangman PeerJ Bioinformatics PeerJ Inc.
2022-10-14 /pmc/articles/PMC9575681/ /pubmed/36262414 http://dx.doi.org/10.7717/peerj.14186 Text en ©2022 Midekso and Yi https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle Bioinformatics
Midekso, Firaol Dida
Yi, Gangman
RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes
title RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes
title_full RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes
title_fullStr RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes
title_full_unstemmed RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes
title_short RFfiller: a robust and fast statistical algorithm for gap filling in draft genomes
title_sort rffiller: a robust and fast statistical algorithm for gap filling in draft genomes
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9575681/
https://www.ncbi.nlm.nih.gov/pubmed/36262414
http://dx.doi.org/10.7717/peerj.14186
work_keys_str_mv AT mideksofiraoldida rffillerarobustandfaststatisticalalgorithmforgapfillingindraftgenomes
AT yigangman rffillerarobustandfaststatisticalalgorithmforgapfillingindraftgenomes
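
Illustrative note: the abstract mentions building transition tables from similar reads and using them to make a statistical guess at a missing sequence. This record gives no implementation details, so the following Python sketch is only a generic order-k Markov illustration of that idea, not RFfiller's actual method. The function names, the k-mer order, the greedy choice of the most frequent next base, and the toy reads are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_transition_table(reads, k=3):
    """Count which base follows each k-mer across all reads.

    Hypothetical helper: an order-k Markov transition table,
    keyed by k-mer, with a Counter of next-base frequencies.
    """
    table = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            table[read[i:i + k]][read[i + k]] += 1
    return table


def fill_gap(left_flank, gap_length, table, k=3):
    """Greedily extend the left flank into the gap.

    At each step the most frequent base after the current k-mer
    is appended. Positions with no supporting reads stay as 'N'.
    """
    seq = left_flank
    filled = []
    for _ in range(gap_length):
        counts = table.get(seq[-k:])
        if not counts:
            break  # no evidence for this k-mer; stop extending
        base = counts.most_common(1)[0][0]
        filled.append(base)
        seq += base
    return "".join(filled) + "N" * (gap_length - len(filled))


# Toy reads that overlap the region right of the flank "ACGT".
reads = ["ACGTTGCA", "CGTTGCAT", "GTTGCATG"]
table = build_transition_table(reads)
print(fill_gap("ACGT", 4, table))
```

A real gap filler would also have to use the right flank, paired-end insert sizes, and alignment quality, none of which this sketch models.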