Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure
Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all cont...
Main Authors: | Pronk, Lotte J.U., Medema, Marnix H. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Microbiology Society 2022 |
Subjects: | Methods |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9465069/ https://www.ncbi.nlm.nih.gov/pubmed/35503723 http://dx.doi.org/10.1099/mgen.0.000823 |
_version_ | 1784787710042439680 |
---|---|
author | Pronk, Lotte J.U. Medema, Marnix H. |
author_facet | Pronk, Lotte J.U. Medema, Marnix H. |
author_sort | Pronk, Lotte J.U. |
collection | PubMed |
description | Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic, likely resulting in less accurate annotation of eukaryotes in metagenomes. Early detection of eukaryotic contigs allows for eukaryote-specific gene prediction and functional annotation. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in terms of gene structure. We first developed Whokaryote, a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated recall, precision and accuracy of 94, 96 and 95 %, respectively, this classifier with features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By retraining our classifier with Tiara predictions as an additional feature, the weaknesses of both types of classifiers are compensated; the result is Whokaryote+Tiara, an enhanced classifier that outperforms all individual classifiers, with an F1 score of 0.99 for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endospheric microbial community, we show how using Whokaryote+Tiara to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Whokaryote (+Tiara) is wrapped in an easily installable package and is freely available from https://github.com/LottePronk/whokaryote. |
format | Online Article Text |
id | pubmed-9465069 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Microbiology Society |
record_format | MEDLINE/PubMed |
spelling | pubmed-9465069 2022-09-12 Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure Pronk, Lotte J.U. Medema, Marnix H. Microb Genom Methods Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic, likely resulting in less accurate annotation of eukaryotes in metagenomes. Early detection of eukaryotic contigs allows for eukaryote-specific gene prediction and functional annotation. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in terms of gene structure. We first developed Whokaryote, a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated recall, precision and accuracy of 94, 96 and 95 %, respectively, this classifier with features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By retraining our classifier with Tiara predictions as an additional feature, the weaknesses of both types of classifiers are compensated; the result is Whokaryote+Tiara, an enhanced classifier that outperforms all individual classifiers, with an F1 score of 0.99 for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endospheric microbial community, we show how using Whokaryote+Tiara to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Whokaryote (+Tiara) is wrapped in an easily installable package and is freely available from https://github.com/LottePronk/whokaryote. Microbiology Society 2022-05-03 /pmc/articles/PMC9465069/ /pubmed/35503723 http://dx.doi.org/10.1099/mgen.0.000823 Text en © 2022 Not applicable https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License. |
spellingShingle | Methods Pronk, Lotte J.U. Medema, Marnix H. Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
title | Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
title_full | Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
title_fullStr | Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
title_full_unstemmed | Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
title_short | Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
title_sort | whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure |
topic | Methods |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9465069/ https://www.ncbi.nlm.nih.gov/pubmed/35503723 http://dx.doi.org/10.1099/mgen.0.000823 |
work_keys_str_mv | AT pronklotteju whokaryotedistinguishingeukaryoticandprokaryoticcontigsinmetagenomesbasedongenestructure AT medemamarnixh whokaryotedistinguishingeukaryoticandprokaryoticcontigsinmetagenomesbasedongenestructure |
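
The description names intergenic distance, gene density and gene length as the most important features of the classifier. The sketch below illustrates how such per-contig features could be derived from predicted gene coordinates. It is not the Whokaryote implementation: the `Gene` class, the `contig_features` function, the 1-based inclusive coordinate convention, the "genes per kilobase" definition of density and the clamping of overlapping genes to a gap of zero are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Gene:
    """Hypothetical container for one predicted gene (1-based, inclusive)."""
    start: int
    end: int


def contig_features(genes, contig_length):
    """Summarise gene structure on one contig.

    Assumed definitions (illustrative, not taken from the record):
      gene_density             genes per 1000 bp of contig
      mean_gene_length         mean of (end - start + 1)
      mean_intergenic_distance mean gap between consecutive genes,
                               with overlaps counted as 0
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")

    ordered = sorted(genes, key=lambda g: g.start)
    lengths = [g.end - g.start + 1 for g in ordered]
    # Gap between each gene and the next one along the contig.
    gaps = [
        max(0, nxt.start - cur.end - 1)
        for cur, nxt in zip(ordered, ordered[1:])
    ]

    return {
        "gene_density": len(ordered) / contig_length * 1000,
        "mean_gene_length": mean(lengths) if lengths else 0.0,
        "mean_intergenic_distance": mean(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    example = [Gene(1, 100), Gene(151, 250), Gene(301, 500)]
    print(contig_features(example, contig_length=1000))
```

A table of such feature vectors, one row per contig with a known eukaryote or prokaryote label, is the kind of input a random forest classifier could be trained on.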