AGC: compact representation of assembled genomes with fast queries and updates

Bibliographic Details
Main Authors: Deorowicz, Sebastian, Danek, Agnieszka, Li, Heng
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9994791/
https://www.ncbi.nlm.nih.gov/pubmed/36864624
http://dx.doi.org/10.1093/bioinformatics/btad097
_version_ 1784902693882429440
author Deorowicz, Sebastian
Danek, Agnieszka
Li, Heng
author_facet Deorowicz, Sebastian
Danek, Agnieszka
Li, Heng
author_sort Deorowicz, Sebastian
collection PubMed
description MOTIVATION: High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. RESULTS: Here, we show how to reduce the size of the sequenced genomes by 2–3 orders of magnitude. Our tool compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly reduced cost and improved accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundation tool to store, distribute and access pangenome data. AVAILABILITY AND IMPLEMENTATION: The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9994791
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9994791 2023-03-09 AGC: compact representation of assembled genomes with fast queries and updates Deorowicz, Sebastian Danek, Agnieszka Li, Heng Bioinformatics Original Paper Oxford University Press 2023-03-02 /pmc/articles/PMC9994791/ /pubmed/36864624 http://dx.doi.org/10.1093/bioinformatics/btad097 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title AGC: compact representation of assembled genomes with fast queries and updates
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9994791/
https://www.ncbi.nlm.nih.gov/pubmed/36864624
http://dx.doi.org/10.1093/bioinformatics/btad097