Metagenome Fragment Classification Using N-Mer Frequency Profiles

A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the mass amounts of metagenomic data, all DNA present in an environmental sample. A major obstacle in metageno...

Bibliographic Details
Main Authors: Rosen, Gail; Garbarine, Elaine; Caseiro, Diamantino; Polikar, Robi; Sokhansanj, Bahrad
Format: Text
Language: English
Published: Hindawi Publishing Corporation 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777009/
https://www.ncbi.nlm.nih.gov/pubmed/19956701
http://dx.doi.org/10.1155/2008/205969
author Rosen, Gail
Garbarine, Elaine
Caseiro, Diamantino
Polikar, Robi
Sokhansanj, Bahrad
collection PubMed
description A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the mass amounts of metagenomic data, all DNA present in an environmental sample. A major obstacle in metagenomics is the inability to obtain accuracy using technology that yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach is scalable to identify any fragment from hundreds of genomes. It also performs quite well at the strain, species, and genera levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis demonstrates that species-accuracy achieves 90% for highly-represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
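The description above outlines the approach only at a high level: N-mer frequency profiles are built per genome and used to train a naive Bayes classifier that assigns a fragment to a genome. The following is a minimal sketch of that general idea, not the authors' implementation. The function names (`nmer_counts`, `train_profiles`, `classify`), the add-one smoothing, the uniform prior over genomes, the four-letter A/C/G/T alphabet, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping N-mers (substrings of length n) in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_profiles(genomes, n):
    """Build a log-probability N-mer profile for each genome.

    genomes maps a genome name to its sequence string.
    Add-one smoothing over the 4**n possible N-mers is an assumption
    of this sketch, not something stated in the record.
    """
    vocab_size = 4 ** n
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        denom = sum(counts.values()) + vocab_size
        seen = {mer: math.log((c + 1) / denom) for mer, c in counts.items()}
        unseen = math.log(1 / denom)  # log-probability of an N-mer never observed
        profiles[name] = (seen, unseen)
    return profiles


def classify(fragment, profiles, n):
    """Return the genome whose profile gives the fragment the highest log-likelihood.

    A uniform prior over genomes is assumed, so only the likelihood is compared.
    """
    frag = nmer_counts(fragment, n)
    best_name, best_score = None, -math.inf
    for name, (seen, unseen) in profiles.items():
        score = sum(c * seen.get(mer, unseen) for mer, c in frag.items())
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy, made-up "genomes" purely for illustration.
toy_genomes = {
    "genome_a": "ACGT" * 50,
    "genome_b": "AAGGCCTT" * 25,
}
toy_profiles = train_profiles(toy_genomes, n=3)
print(classify("ACGTACGTACGT", toy_profiles, n=3))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow, which matters when a fragment contains many N-mers or when N is large.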
format Text
id pubmed-2777009
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-2777009 2009-12-02 Adv Bioinformatics Research Article Hindawi Publishing Corporation 2008 2008-11-16 /pmc/articles/PMC2777009/ /pubmed/19956701 http://dx.doi.org/10.1155/2008/205969 Text en Copyright © 2008 Gail Rosen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Metagenome Fragment Classification Using N-Mer Frequency Profiles
topic Research Article