
New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads

Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms...

Bibliographic Details
Main Authors: Gonzalez-Garcia, Laura, Guevara-Barrientos, David, Lozano-Arce, Daniela, Gil, Juanita, Díaz-Riaño, Jorge, Duarte, Erick, Andrade, Germán, Bojacá, Juan Camilo, Hoyos-Sanchez, Maria Camila, Chavarro, Christian, Guayazan, Natalia, Chica, Luis Alberto, Buitrago Acosta, Maria Camila, Bautista, Edwin, Trujillo, Miller, Duitama, Jorge
Format: Online Article Text
Language: English
Published: Life Science Alliance LLC 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9946810/
https://www.ncbi.nlm.nih.gov/pubmed/36813568
http://dx.doi.org/10.26508/lsa.202201719
author Gonzalez-Garcia, Laura
Guevara-Barrientos, David
Lozano-Arce, Daniela
Gil, Juanita
Díaz-Riaño, Jorge
Duarte, Erick
Andrade, Germán
Bojacá, Juan Camilo
Hoyos-Sanchez, Maria Camila
Chavarro, Christian
Guayazan, Natalia
Chica, Luis Alberto
Buitrago Acosta, Maria Camila
Bautista, Edwin
Trujillo, Miller
Duitama, Jorge
collection PubMed
description Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
format Online
Article
Text
id pubmed-9946810
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Life Science Alliance LLC
record_format MEDLINE/PubMed
spelling pubmed-9946810 2023-02-24 Life Sci Alliance Methods Life Science Alliance LLC 2023-02-22 /pmc/articles/PMC9946810/ /pubmed/36813568 http://dx.doi.org/10.26508/lsa.202201719 Text en © 2023 Gonzalez-Garcia et al.
https://creativecommons.org/licenses/by/4.0/ This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/).
title New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9946810/
https://www.ncbi.nlm.nih.gov/pubmed/36813568
http://dx.doi.org/10.26508/lsa.202201719