Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure

Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all cont...

Bibliographic Details
Main authors: Pronk, Lotte J.U., Medema, Marnix H.
Format: Online Article Text
Language: English
Published: Microbiology Society 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9465069/
https://www.ncbi.nlm.nih.gov/pubmed/35503723
http://dx.doi.org/10.1099/mgen.0.000823
author Pronk, Lotte J.U.
Medema, Marnix H.
collection PubMed
description Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic, likely resulting in less accurate annotation of eukaryotes in metagenomes. Early detection of eukaryotic contigs allows for eukaryote-specific gene prediction and functional annotation. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in terms of gene structure. We first developed Whokaryote, a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated recall, precision and accuracy of 94, 96 and 95 %, respectively, this classifier with features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By retraining our classifier with Tiara predictions as an additional feature, the weaknesses of both types of classifiers are compensated; the result is Whokaryote+Tiara, an enhanced classifier that outperforms all individual classifiers, with an F1 score of 0.99 for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endospheric microbial community, we show how using Whokaryote+Tiara to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Whokaryote (+Tiara) is wrapped in an easily installable package and is freely available from https://github.com/LottePronk/whokaryote.
format Online
Article
Text
id pubmed-9465069
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-9465069 2022-09-12 Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure Pronk, Lotte J.U. Medema, Marnix H. Microb Genom Methods Microbiology Society 2022-05-03 /pmc/articles/PMC9465069/ /pubmed/35503723 http://dx.doi.org/10.1099/mgen.0.000823 Text en © 2022 Not applicable https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License.
title Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure
topic Methods
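
The abstract in this record names intergenic distance, gene density and gene length as the most important features of the Whokaryote random forest classifier. The sketch below illustrates how such per-contig features could be derived from predicted gene coordinates. It is not the authors' implementation: the `Gene` class, the `gene_structure_features` function, the genes-per-kilobase scaling and the clamping of overlapping genes to a distance of zero are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: 1-based, inclusive coordinates on a contig."""
    start: int
    end: int


def gene_structure_features(genes, contig_length):
    """Summarise gene structure on one contig (illustrative definitions only)."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")

    ordered = sorted(genes, key=lambda g: g.start)
    lengths = [g.end - g.start + 1 for g in ordered]

    # Distance between each gene and the next one; overlaps are clamped to 0
    # (an assumption of this sketch, not a statement about Whokaryote).
    gaps = [
        max(0, nxt.start - cur.end - 1)
        for cur, nxt in zip(ordered, ordered[1:])
    ]

    return {
        # Genes per kilobase of contig (assumed scaling).
        "gene_density": 1000 * len(ordered) / contig_length,
        "mean_gene_length": mean(lengths) if lengths else 0.0,
        "mean_intergenic_distance": mean(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    example = [Gene(1, 900), Gene(1001, 1900), Gene(2101, 3000)]
    print(gene_structure_features(example, contig_length=3000))
```

In a setup like this, one feature dictionary per contig would form a row of the table passed to a classifier. The actual feature set and preprocessing are described in the article and the repository linked in the record.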