Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks

Taxonomic classification, that is, the assignment to biological clades with shared ancestry, is a common task in genetics, mainly based on a genome similarity search of large genome databases. The classification quality depends heavily on the database, since representative relatives must be present....

Bibliographic Details
Main Authors: Mock, Florian; Kretschmer, Fleming; Kriese, Anton; Böcker, Sebastian; Marz, Manja
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2022
Subjects: Biological Sciences
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9436379/
https://www.ncbi.nlm.nih.gov/pubmed/36018838
http://dx.doi.org/10.1073/pnas.2122636119
_version_ 1784781350016909312
author Mock, Florian
Kretschmer, Fleming
Kriese, Anton
Böcker, Sebastian
Marz, Manja
author_sort Mock, Florian
collection PubMed
description Taxonomic classification, that is, the assignment to biological clades with shared ancestry, is a common task in genetics, mainly based on a genome similarity search of large genome databases. The classification quality depends heavily on the database, since representative relatives must be present. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a deep neural network program based on natural language processing to precisely classify the superkingdom and phylum of DNA sequences taxonomically without the need for a known representative relative from a database. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. For novel organisms, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality in almost all cases. Since BERTax is not based on similar entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences, thus increasing the overall information gain.
format Online
Article
Text
id pubmed-9436379
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-9436379 2023-02-26
Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks
Mock, Florian; Kretschmer, Fleming; Kriese, Anton; Böcker, Sebastian; Marz, Manja
Proc Natl Acad Sci U S A
Biological Sciences
Taxonomic classification, that is, the assignment to biological clades with shared ancestry, is a common task in genetics, mainly based on a genome similarity search of large genome databases. The classification quality depends heavily on the database, since representative relatives must be present. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a deep neural network program based on natural language processing to precisely classify the superkingdom and phylum of DNA sequences taxonomically without the need for a known representative relative from a database. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. For novel organisms, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality in almost all cases. Since BERTax is not based on similar entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences, thus increasing the overall information gain.
National Academy of Sciences 2022-08-26 2022-08-30
/pmc/articles/PMC9436379/ /pubmed/36018838 http://dx.doi.org/10.1073/pnas.2122636119
Text en
Copyright © 2022 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks
topic Biological Sciences