
Genome assembly composition of the String “ACGT” array: a review of data structure accuracy and performance challenges


Bibliographic Details
Main authors: Magdy Mohamed Abdelaziz Barakat, Sherif, Sallehuddin, Roselina, Yuhaniz, Siti Sophiayati, R. Khairuddin, Raja Farhana, Mahmood, Yasir
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2023
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403225/
https://www.ncbi.nlm.nih.gov/pubmed/37547391
http://dx.doi.org/10.7717/peerj-cs.1180
author Magdy Mohamed Abdelaziz Barakat, Sherif
Sallehuddin, Roselina
Yuhaniz, Siti Sophiayati
R. Khairuddin, Raja Farhana
Mahmood, Yasir
collection PubMed
description BACKGROUND: Advances in sequencing technology have increased the number of genomes being sequenced. However, obtaining a high-quality genome sequence remains a challenge, because genome assembly must piece together a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using one of two approaches. The de novo approach concatenates reads based on exact matches between the suffix of one read and the prefix of another (overlapping). The reference-guided approach orders reads based on their offsets in a well-known reference genome (read alignment). Repeats increase ambiguity: the algorithm cannot distinguish between reads, which results in misassembly and reduces assembly accuracy. In addition, the massive number of reads poses a major performance challenge for assembly.
METHOD: Repeat identification methods were introduced to reduce misassembly by identifying repetitive sequences in advance and creating a repeat knowledge base that reduces ambiguity during assembly, thus improving the accuracy of the assembled genome. In addition, hybridizing the two assembly approaches lowered the degree of misassembly with the aid of the reference genome. Assembly performance is optimized through data structure indexing and parallelization. The primary aim and contribution of this article is an extensive review that eases researchers' search for genome assembly studies. The study also highlights the most recent developments and limitations in genome assembly accuracy and performance optimization.
RESULTS: Our findings show the limitations of the available repeat identification methods, which can detect only repeats of specific lengths and may not perform well when various types of repeats are present in a genome. We also found that most hybrid assembly approaches, whether they start with de novo or reference-guided assembly, have limitations in handling repetitive sequences because doing so is more computationally costly and time intensive. Although the hybrid approach was found to outperform the individual assembly approaches, optimizing its performance remains a challenge. In addition, parallelization of overlapping and read alignment has yet to be fully implemented in the hybrid assembly approach.
CONCLUSION: We suggest combining multiple repeat identification methods to improve the accuracy of repeat identification as an initial step of the hybrid assembly approach, and combining genome indexing with parallelization to better optimize its performance.
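The abstract names three building blocks: suffix-prefix overlapping (de novo), ordering reads by their offset in a reference (reference-guided), and prior identification of repeats through an index. The Python sketch below is a toy illustration of those ideas only, not the method of any assembler surveyed in the article. The function names, the `min_len` threshold, and the use of exact string matching for read placement are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def order_by_reference(reads: List[str], reference: str) -> List[Tuple[int, str]]:
    """Place each read at its first exact offset in the reference, then sort.

    Reads that do not occur in the reference are skipped. A read taken from
    a repeat occurs at several offsets; only the first one is reported here,
    which is exactly the ambiguity the abstract describes.
    """
    placed = []
    for read in reads:
        offset = reference.find(read)
        if offset != -1:
            placed.append((offset, read))
    return sorted(placed)


def repeated_kmers(sequence: str, k: int) -> Dict[str, List[int]]:
    """Index every k-mer by its offsets and keep those seen more than once."""
    index: Dict[str, List[int]] = {}
    for i in range(len(sequence) - k + 1):
        index.setdefault(sequence[i:i + k], []).append(i)
    return {kmer: offsets for kmer, offsets in index.items() if len(offsets) > 1}


if __name__ == "__main__":
    print(suffix_prefix_overlap("ACGTAC", "TACGGA"))
    print(order_by_reference(["GGAT", "ACGT", "TACG"], "ACGTACGGAT"))
    print(repeated_kmers("ACGTACGT", 4))
```

A dictionary of k-mer offsets is about the simplest form of the "repeat knowledge base" the abstract mentions. It also shows the stated limitation directly, since it detects only repeats of the one length `k` it was built for.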
format Online
Article
Text
id pubmed-10403225
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-10403225 2023-08-05 PeerJ Comput Sci Bioinformatics PeerJ Inc. 2023-07-13 /pmc/articles/PMC10403225/ /pubmed/37547391 http://dx.doi.org/10.7717/peerj-cs.1180 Text en
©2023 Magdy Mohamed Abdelaziz Barakat et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Genome assembly composition of the String “ACGT” array: a review of data structure accuracy and performance challenges
topic Bioinformatics