A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads

BACKGROUND: Metagenomics is the study of genetic materials derived directly from complex microbial samples, instead of from culture. One of the crucial steps in metagenomic analysis, referred to as “binning”, is to separate reads into clusters that represent genomes from closely related organisms. A...

Bibliographic Details
Main Authors: Vinh, Le Van; Lang, Tran Van; Binh, Le Thanh; Hoai, Tran Van
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4304631/
https://www.ncbi.nlm.nih.gov/pubmed/25648210
http://dx.doi.org/10.1186/s13015-014-0030-4
author Vinh, Le Van
Lang, Tran Van
Binh, Le Thanh
Hoai, Tran Van
author_facet Vinh, Le Van
Lang, Tran Van
Binh, Le Thanh
Hoai, Tran Van
author_sort Vinh, Le Van
collection PubMed
description BACKGROUND: Metagenomics is the study of genetic materials derived directly from complex microbial samples, instead of from culture. One of the crucial steps in metagenomic analysis, referred to as “binning”, is to separate reads into clusters that represent genomes from closely related organisms. Among the existing binning methods, unsupervised methods base the classification on features extracted from reads, and especially taking advantage in case of the limitation of reference database availability. However, their performance, under various aspects, is still being investigated by recent theoretical and empirical studies. The one addressed in this paper is among those efforts to enhance the accuracy of the classification. RESULTS: This paper presents an unsupervised algorithm, called BiMeta, for binning of reads from different species in a metagenomic dataset. The algorithm consists of two phases. In the first phase of the algorithm, reads are grouped into groups based on overlap information between the reads. The second phase merges the groups by using an observation on l-mer frequency distribution of sets of non-overlapping reads. The experimental results on simulated and real datasets showed that BiMeta outperforms three state-of-the-art binning algorithms for both short and long reads (≥700 bp) datasets. CONCLUSIONS: This paper developed a novel and efficient algorithm for binning of metagenomic reads, which does not require any reference database. The software implementing the algorithm and all test datasets mentioned in this paper can be downloaded at http://it.hcmute.edu.vn/bioinfo/bimeta/index.htm. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-014-0030-4) contains supplementary material, which is available to authorized users.
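The abstract's second phase relies on l-mer frequency distributions computed over groups of reads. As an illustration only, the sketch below pools the reads of one group and returns a normalized l-mer frequency vector. The function name `lmer_frequencies`, the default `l=4`, the skipping of windows with non-ACGT characters, and the normalization are assumptions made for this example; the record does not specify BiMeta's actual procedure or parameters.

```python
from collections import Counter
from itertools import product


def lmer_frequencies(reads, l=4):
    """Return a normalized l-mer frequency vector pooled over a group of reads.

    Hypothetical helper for illustration; not taken from the BiMeta software.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of length l across the read.
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            # Assumption: ignore windows containing ambiguous bases such as N.
            if set(lmer) <= set("ACGT"):
                counts[lmer] += 1
    total = sum(counts.values())
    # Fixed lexicographic ordering of all 4**l possible l-mers.
    alphabet = ["".join(p) for p in product("ACGT", repeat=l)]
    return [counts[m] / total if total else 0.0 for m in alphabet]


# Example: pooled 2-mer frequencies for a small group of reads.
group = ["ACGTAC", "GTACGT"]
vector = lmer_frequencies(group, l=2)
```

Vectors built this way for two groups can then be compared with any distance measure; which measure and threshold a merging step should use is left open here.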
format Online
Article
Text
id pubmed-4304631
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4304631 2015-02-03 A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads Vinh, Le Van Lang, Tran Van Binh, Le Thanh Hoai, Tran Van Algorithms Mol Biol Research BioMed Central 2015-01-16 /pmc/articles/PMC4304631/ /pubmed/25648210 http://dx.doi.org/10.1186/s13015-014-0030-4 Text en © Vinh et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Vinh, Le Van
Lang, Tran Van
Binh, Le Thanh
Hoai, Tran Van
A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
title A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
title_full A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
title_fullStr A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
title_full_unstemmed A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
title_short A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
title_sort two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4304631/
https://www.ncbi.nlm.nih.gov/pubmed/25648210
http://dx.doi.org/10.1186/s13015-014-0030-4