Binning sequences using very sparse labels within a metagenome

BACKGROUND: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most of the current methods of binning, such as BLAST, k-mer and PhyloPythia, involve assigning sequence fragments by comparing sequenc...

Bibliographic Details
Main authors:	Chan, Chon-Kit Kenneth; Hsu, Arthur L; Halgamuge, Saman K; Tang, Sen-Lin
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2383919/ https://www.ncbi.nlm.nih.gov/pubmed/18442374 http://dx.doi.org/10.1186/1471-2105-9-215

_version_	1782154834398412800
author	Chan, Chon-Kit Kenneth; Hsu, Arthur L; Halgamuge, Saman K; Tang, Sen-Lin
author_sort	Chan, Chon-Kit Kenneth
collection	PubMed
description	BACKGROUND: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most current binning methods, such as BLAST, k-mer and PhyloPythia, assign sequence fragments by comparing sequence similarity or sequence composition with already-sequenced genomes, a collection that is still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity.
RESULTS: The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM) and is called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed results comparable with PhyloPythia for each of the corresponding confidence settings: S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparably in the ≥ 8 kb benchmark tests.
CONCLUSION: In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well to the current most advanced binning method, PhyloPythia.
format	Text
id	pubmed-2383919
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2383919 2008-05-14 Binning sequences using very sparse labels within a metagenome. Chan, Chon-Kit Kenneth; Hsu, Arthur L; Halgamuge, Saman K; Tang, Sen-Lin. BMC Bioinformatics. Methodology Article. BioMed Central 2008-04-28 /pmc/articles/PMC2383919/ /pubmed/18442374 http://dx.doi.org/10.1186/1471-2105-9-215 Text en Copyright © 2008 Chan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Binning sequences using very sparse labels within a metagenome
topic	Methodology Article
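The abstract in this record describes using a few seed sequences (flanks of 16S rRNA genes) as labels and assigning the remaining reads by compositional similarity. The following is only a minimal sketch of that general idea, not the authors' S-GSOM: it swaps the Growing Self-Organising Map for a plain nearest-seed rule. The function names, the choice of tetranucleotide (k = 4) frequencies, the Euclidean distance, and the `max_distance` cutoff (a loose stand-in for a confidence threshold) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def composition(seq, k=4):
    """Return the normalised k-mer frequency vector of a DNA sequence.

    k-mers containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def distance(a, b):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_seeds(fragments, seeds, k=4, max_distance=None):
    """Label each fragment with its compositionally nearest seed.

    fragments and seeds are dicts mapping a name to a DNA string.
    If max_distance is given, fragments farther than that from every
    seed are left unassigned (None).
    """
    seed_vectors = {label: composition(s, k) for label, s in seeds.items()}
    labels = {}
    for name, seq in fragments.items():
        vec = composition(seq, k)
        best_label, best_dist = min(
            ((label, distance(vec, sv)) for label, sv in seed_vectors.items()),
            key=lambda pair: pair[1],
        )
        if max_distance is not None and best_dist > max_distance:
            labels[name] = None
        else:
            labels[name] = best_label
    return labels


# Toy data, invented for illustration only.
seeds = {"A_rich": "AAAAAAAAAAAAAAAA", "GC_rich": "GCGCGCGCGCGC"}
fragments = {"frag1": "AAAAAAAAAA", "frag2": "GCGCGCGC"}
print(assign_to_seeds(fragments, seeds))
```

Tightening `max_distance` leaves more fragments unassigned in exchange for fewer doubtful labels, which is the same kind of trade-off a confidence threshold controls.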