The MetaGens algorithm for metagenomic database lossy compression and subject alignment

The advancement of genetic sequencing techniques led to the production of a large volume of data. The extraction of genetic material from a sample is one of the early steps of the metagenomic study. With the evolution of the processes, the analysis of the sequenced data allowed the discovery of etio...

Bibliographic Details
Main authors: Cervi, Gustavo Henrique, Flores, Cecilia Dias, Thompson, Claudia Elizabeth
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10419334/
https://www.ncbi.nlm.nih.gov/pubmed/37566631
http://dx.doi.org/10.1093/database/baad053
author Cervi, Gustavo Henrique
Flores, Cecilia Dias
Thompson, Claudia Elizabeth
collection PubMed
description Advances in genetic sequencing techniques have led to the production of large volumes of data. Extracting genetic material from a sample is one of the early steps of a metagenomic study. As the processes evolved, analysis of the sequenced data enabled the discovery of etiological agents and, as a corollary, the diagnosis of infections. One of the biggest challenges of the technique is the huge volume of data generated with each new technology. This work introduces an algorithm that may reduce the data volume, allowing faster DNA matching against reference databases. Using techniques such as lossy compression and a substitution matrix, it is possible to match nucleotide sequences without losing the subject. The lossy compression exploits the nature of DNA mutations, insertions and deletions, and the possibility that different sequences correspond to the same subject. The algorithm can reduce the overall size of the database to 15% of its original size and, depending on the parameters, to as little as 5%. Although it is otherwise the same as on other platforms, the match algorithm is more sensitive because it ignores transitions and transversions, resulting in a faster way to obtain diagnostic results. In a first experiment, it ran 10 times faster than BLAST while maintaining high sensitivity. This performance gain can be extended by combining it with techniques already used in other studies, such as hash tables. Database URL: https://github.com/ghc4/metagens
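The idea of matching on a reduced alphabet can be illustrated with a toy example. The sketch below is an assumption made purely for illustration and is not the published MetaGens encoding: it collapses each nucleotide into its purine (R) or pyrimidine (Y) class, which halves the alphabet and makes transitions (A<->G, C<->T) invisible to the comparison. The abstract states that MetaGens also tolerates transversions; this toy reduction does not attempt that. The names `reduce_sequence` and `reduced_match` are hypothetical.

```python
# Illustrative only: a purine/pyrimidine reduction, NOT the MetaGens encoding.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def reduce_sequence(seq: str) -> str:
    """Map A/G to 'R' and C/T to 'Y'; any other symbol becomes 'N'."""
    out = []
    for base in seq.upper():
        if base in PURINES:
            out.append("R")
        elif base in PYRIMIDINES:
            out.append("Y")
        else:
            out.append("N")
    return "".join(out)


def reduced_match(query: str, subject: str) -> bool:
    """Return True if the reduced query occurs in the reduced subject."""
    return reduce_sequence(query) in reduce_sequence(subject)


reference = "ATGCGTACGTTAGC"
read_exact = "CGTACG"         # exact substring of the reference
read_transition = "CATACG"    # same read with a G->A transition
read_transversion = "CCTACG"  # same read with a G->C transversion
```

With these inputs, `reduced_match(read_exact, reference)` and `reduced_match(read_transition, reference)` are both true, while `reduced_match(read_transversion, reference)` is false, because a transversion changes the purine/pyrimidine class. The reduction is lossy: the original bases cannot be recovered from the R/Y string.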
format Online
Article
Text
id pubmed-10419334
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10419334 2023-08-12 The MetaGens algorithm for metagenomic database lossy compression and subject alignment Cervi, Gustavo Henrique Flores, Cecilia Dias Thompson, Claudia Elizabeth Database (Oxford) Original Article Oxford University Press 2023-08-11 /pmc/articles/PMC10419334/ /pubmed/37566631 http://dx.doi.org/10.1093/database/baad053 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title The MetaGens algorithm for metagenomic database lossy compression and subject alignment
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10419334/
https://www.ncbi.nlm.nih.gov/pubmed/37566631
http://dx.doi.org/10.1093/database/baad053