Foldcomp: a library and format for compressing and indexing large protein structure sets
SUMMARY: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm, and indexing system to address this challenge. By usin...
Main authors: | Kim, Hyunbin; Mirdita, Milot; Steinegger, Martin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Applications Note |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10085514/ https://www.ncbi.nlm.nih.gov/pubmed/36961332 http://dx.doi.org/10.1093/bioinformatics/btad153 |
_version_ | 1785021953201930240 |
---|---|
author | Kim, Hyunbin; Mirdita, Milot; Steinegger, Martin |
author_facet | Kim, Hyunbin; Mirdita, Milot; Steinegger, Martin |
author_sort | Kim, Hyunbin |
collection | PubMed |
description | SUMMARY: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm, and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures. AVAILABILITY AND IMPLEMENTATION: Foldcomp is a free open-source software (GPLv3) and available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) database ready-for-download. |
format | Online Article Text |
id | pubmed-10085514 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-100855142023-04-11 Foldcomp: a library and format for compressing and indexing large protein structure sets Kim, Hyunbin Mirdita, Milot Steinegger, Martin Bioinformatics Applications Note SUMMARY: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm, and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures. AVAILABILITY AND IMPLEMENTATION: Foldcomp is a free open-source software (GPLv3) and available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) database ready-for-download. Oxford University Press 2023-03-24 /pmc/articles/PMC10085514/ /pubmed/36961332 http://dx.doi.org/10.1093/bioinformatics/btad153 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Applications Note Kim, Hyunbin Mirdita, Milot Steinegger, Martin Foldcomp: a library and format for compressing and indexing large protein structure sets |
title | Foldcomp: a library and format for compressing and indexing large protein structure sets |
title_full | Foldcomp: a library and format for compressing and indexing large protein structure sets |
title_fullStr | Foldcomp: a library and format for compressing and indexing large protein structure sets |
title_full_unstemmed | Foldcomp: a library and format for compressing and indexing large protein structure sets |
title_short | Foldcomp: a library and format for compressing and indexing large protein structure sets |
title_sort | foldcomp: a library and format for compressing and indexing large protein structure sets |
topic | Applications Note |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10085514/ https://www.ncbi.nlm.nih.gov/pubmed/36961332 http://dx.doi.org/10.1093/bioinformatics/btad153 |
work_keys_str_mv | AT kimhyunbin foldcompalibraryandformatforcompressingandindexinglargeproteinstructuresets AT mirditamilot foldcompalibraryandformatforcompressingandindexinglargeproteinstructuresets AT steineggermartin foldcompalibraryandformatforcompressingandindexinglargeproteinstructuresets |
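
The description above mentions internal coordinates and a NeRF-based strategy. As a rough illustration of the general idea only, the sketch below shows a single textbook NeRF (Natural extension Reference Frame) step: given three already-placed atoms plus a bond length, bond angle, and torsion angle, it computes the Cartesian position of the next atom. This is not Foldcomp's code and does not cover its bi-directional scheme or its file format; the function names (`place_atom`, `dihedral`) and the example coordinates are hypothetical and chosen for illustration.

```python
import math

# Small 3-vector helpers on plain tuples (no third-party dependencies).
def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return (u[0] / n, u[1] / n, u[2] / n)

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Textbook NeRF step (hypothetical helper, not Foldcomp's API).

    Places atom d so that |c-d| = bond_length, angle(b, c, d) = bond_angle
    and dihedral(a, b, c, d) = torsion. Angles are in radians.
    """
    bc = unit(sub(c, b))                 # local x axis, along the b->c bond
    n = unit(cross(sub(b, a), bc))       # normal of the a-b-c plane
    m = cross(n, bc)                     # completes the orthonormal frame
    # Coordinates of d in the local frame centred on c.
    x = -bond_length * math.cos(bond_angle)
    y = bond_length * math.sin(bond_angle) * math.cos(torsion)
    z = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def dihedral(a, b, c, d):
    """Signed torsion angle a-b-c-d in radians, used here to check the result."""
    b0, b1, b2 = sub(a, b), unit(sub(c, b)), sub(d, c)
    v = tuple(b0[i] - dot(b0, b1) * b1[i] for i in range(3))
    w = tuple(b2[i] - dot(b2, b1) * b1[i] for i in range(3))
    return math.atan2(dot(cross(b1, v), w), dot(v, w))

# Example with made-up coordinates and internal-coordinate values.
a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
d = place_atom(a, b, c, 1.5, math.radians(110.0), math.radians(-60.0))
print(d, math.degrees(dihedral(a, b, c, d)))
```

Storing a bond length, a bond angle, and a torsion per atom instead of three Cartesian coordinates is what makes internal coordinates attractive for compression, since these values vary over narrow ranges. The trade-off is that rounding errors accumulate along the chain when atoms are placed one after another, which is the kind of drift a scheme that also keeps some Cartesian coordinates would limit.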