New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads
Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms...
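Main authors: | Gonzalez-Garcia, Laura; Guevara-Barrientos, David; Lozano-Arce, Daniela; Gil, Juanita; Díaz-Riaño, Jorge; Duarte, Erick; Andrade, Germán; Bojacá, Juan Camilo; Hoyos-Sanchez, Maria Camila; Chavarro, Christian; Guayazan, Natalia; Chica, Luis Alberto; Buitrago Acosta, Maria Camila; Bautista, Edwin; Trujillo, Miller; Duitama, Jorge |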
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Life Science Alliance LLC 2023 |
Subjects: | Methods |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9946810/ https://www.ncbi.nlm.nih.gov/pubmed/36813568 http://dx.doi.org/10.26508/lsa.202201719 |
_version_ | 1784892413424173056 |
---|---|
author | Gonzalez-Garcia, Laura; Guevara-Barrientos, David; Lozano-Arce, Daniela; Gil, Juanita; Díaz-Riaño, Jorge; Duarte, Erick; Andrade, Germán; Bojacá, Juan Camilo; Hoyos-Sanchez, Maria Camila; Chavarro, Christian; Guayazan, Natalia; Chica, Luis Alberto; Buitrago Acosta, Maria Camila; Bautista, Edwin; Trujillo, Miller; Duitama, Jorge |
author_facet | Gonzalez-Garcia, Laura; Guevara-Barrientos, David; Lozano-Arce, Daniela; Gil, Juanita; Díaz-Riaño, Jorge; Duarte, Erick; Andrade, Germán; Bojacá, Juan Camilo; Hoyos-Sanchez, Maria Camila; Chavarro, Christian; Guayazan, Natalia; Chica, Luis Alberto; Buitrago Acosta, Maria Camila; Bautista, Edwin; Trujillo, Miller; Duitama, Jorge |
author_sort | Gonzalez-Garcia, Laura |
collection | PubMed |
description | Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species. |
format | Online Article Text |
id | pubmed-9946810 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Life Science Alliance LLC |
record_format | MEDLINE/PubMed |
spelling | pubmed-99468102023-02-24 New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads Gonzalez-Garcia, Laura Guevara-Barrientos, David Lozano-Arce, Daniela Gil, Juanita Díaz-Riaño, Jorge Duarte, Erick Andrade, Germán Bojacá, Juan Camilo Hoyos-Sanchez, Maria Camila Chavarro, Christian Guayazan, Natalia Chica, Luis Alberto Buitrago Acosta, Maria Camila Bautista, Edwin Trujillo, Miller Duitama, Jorge Life Sci Alliance Methods Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species. Life Science Alliance LLC 2023-02-22 /pmc/articles/PMC9946810/ /pubmed/36813568 http://dx.doi.org/10.26508/lsa.202201719 Text en © 2023 Gonzalez-Garcia et al. 
https://creativecommons.org/licenses/by/4.0/ This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Methods Gonzalez-Garcia, Laura Guevara-Barrientos, David Lozano-Arce, Daniela Gil, Juanita Díaz-Riaño, Jorge Duarte, Erick Andrade, Germán Bojacá, Juan Camilo Hoyos-Sanchez, Maria Camila Chavarro, Christian Guayazan, Natalia Chica, Luis Alberto Buitrago Acosta, Maria Camila Bautista, Edwin Trujillo, Miller Duitama, Jorge New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads |
title | New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads |
title_full | New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads |
title_fullStr | New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads |
title_full_unstemmed | New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads |
title_short | New algorithms for accurate and efficient de novo genome assembly from long DNA sequencing reads |
title_sort | new algorithms for accurate and efficient de novo genome assembly from long dna sequencing reads |
topic | Methods |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9946810/ https://www.ncbi.nlm.nih.gov/pubmed/36813568 http://dx.doi.org/10.26508/lsa.202201719 |
work_keys_str_mv | AT gonzalezgarcialaura newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT guevarabarrientosdavid newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT lozanoarcedaniela newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT giljuanita newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT diazrianojorge newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT duarteerick newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT andradegerman newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT bojacajuancamilo newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT hoyossanchezmariacamila newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT chavarrochristian newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT guayazannatalia newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT chicaluisalberto newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT buitragoacostamariacamila newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT bautistaedwin newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT trujillomiller newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads AT duitamajorge newalgorithmsforaccurateandefficientdenovogenomeassemblyfromlongdnasequencingreads |
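
The description field above mentions minimizers "selected by a hash function derived from the k-mer distribution". The record gives no further detail, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes that the ordering simply ranks rarer k-mers first, with ties broken lexicographically; the function names (`kmers`, `build_rank`, `minimizers`) and the parameters `k` and `w` are hypothetical and chosen for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_rank(reads, k):
    """Build an ordering from the k-mer distribution of the reads.

    Assumption for this sketch: rarer k-mers rank lower (are preferred),
    and ties are broken lexicographically.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return lambda km: (counts.get(km, 0), km)


def minimizers(seq, k, w, rank):
    """Pick the lowest-ranked k-mer in every window of w consecutive k-mers.

    Returns a sorted list of (position, k-mer) pairs without duplicates.
    """
    kms = kmers(seq, k)
    selected = set()
    for start in range(len(kms) - w + 1):
        window = range(start, start + w)
        # Leftmost position wins when two k-mers in the window rank equally.
        best = min(window, key=lambda i: (rank(kms[i]), i))
        selected.add((best, kms[best]))
    return sorted(selected)


if __name__ == "__main__":
    reads = ["AAAAC"]
    rank = build_rank(reads, k=2)
    print(minimizers("AAAAC", k=2, w=2, rank=rank))
```

In this toy input the k-mer `AC` occurs once and `AA` three times, so `AC` is chosen whenever it falls inside a window. Reads that share selected minimizers would then be candidates for overlap edges in the assembly graph; how the published method scores and selects those edges is not described in this record.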