Improving taxonomic classification with feature space balancing

SUMMARY: Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Modern approaches either apply local alignment to existing databases, such as MMseqs2, or use deep neural networks, as in DeepMicr...

Bibliographic Details
Main authors: Fuhl, Wolfgang; Zabel, Susanne; Nieselt, Kay
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10415173/
https://www.ncbi.nlm.nih.gov/pubmed/37577265
http://dx.doi.org/10.1093/bioadv/vbad092
_version_ 1785087466128015360
author Fuhl, Wolfgang
Zabel, Susanne
Nieselt, Kay
author_facet Fuhl, Wolfgang
Zabel, Susanne
Nieselt, Kay
author_sort Fuhl, Wolfgang
collection PubMed
description SUMMARY: Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Modern approaches either apply local alignment to existing databases, such as MMseqs2, or use deep neural networks, as in DeepMicrobes and BERTax. Due to the increasing size of datasets and databases, alignment-based approaches are expensive in terms of runtime. Deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we were able to significantly increase their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines using our proposed feature extraction and dataset balancing in combination with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2, on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training were classified with high precision. AVAILABILITY AND IMPLEMENTATION: The open-source code and the code to reproduce the results are available in Seafile, at https://tinyurl.com/ysk47fmr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
format Online
Article
Text
id pubmed-10415173
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10415173 2023-08-12 Improving taxonomic classification with feature space balancing Fuhl, Wolfgang Zabel, Susanne Nieselt, Kay Bioinform Adv Original Article
Oxford University Press 2023-07-17 /pmc/articles/PMC10415173/ /pubmed/37577265 http://dx.doi.org/10.1093/bioadv/vbad092 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Fuhl, Wolfgang
Zabel, Susanne
Nieselt, Kay
Improving taxonomic classification with feature space balancing
title Improving taxonomic classification with feature space balancing
title_full Improving taxonomic classification with feature space balancing
title_fullStr Improving taxonomic classification with feature space balancing
title_full_unstemmed Improving taxonomic classification with feature space balancing
title_short Improving taxonomic classification with feature space balancing
title_sort improving taxonomic classification with feature space balancing
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10415173/
https://www.ncbi.nlm.nih.gov/pubmed/37577265
http://dx.doi.org/10.1093/bioadv/vbad092
work_keys_str_mv AT fuhlwolfgang improvingtaxonomicclassificationwithfeaturespacebalancing
AT zabelsusanne improvingtaxonomicclassificationwithfeaturespacebalancing
AT nieseltkay improvingtaxonomicclassificationwithfeaturespacebalancing
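
Note on the description field: the summary proposes k-mer profiles of DNA sequences as classification features. As a minimal illustration of what such a profile can look like, the Python sketch below counts overlapping k-mers and normalizes them to relative frequencies. It is not the authors' implementation: the function name kmer_profile, the default k = 3, the normalization, and the skipping of windows containing non-ACGT symbols are all assumptions made for this example, and the feature space balancing step described in the summary is not shown.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence: str, k: int = 3) -> list[float]:
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order.

    Hypothetical helper for illustration only; k and the normalization
    are assumptions, not values taken from the article.
    """
    alphabet = "ACGT"
    valid = set(alphabet)
    sequence = sequence.upper()
    # Slide a window of length k over the sequence; windows containing
    # ambiguous symbols (for example N) are skipped (an assumption).
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Fixed k-mer order so profiles of different sequences are comparable.
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


profile = kmer_profile("ACGTACGT", k=2)
print(len(profile))  # one entry per possible 2-mer
```

A vector of this kind has a fixed length of 4**k regardless of sequence length, which is what makes it usable as input to simple classifiers such as the bagged decision trees or KNNs named in the summary.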