Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

BACKGROUND: With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, which is also known as environmental genomics. The first step, which is still a major bottleneck...

Bibliographic Details
Main authors: Yang, Bin, Peng, Yu, Leung, Henry Chi-Ming, Yiu, Siu-Ming, Chen, Jing-Chi, Chin, Francis Yuk-Lun
Format: Online Article Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3165929/
https://www.ncbi.nlm.nih.gov/pubmed/20406503
http://dx.doi.org/10.1186/1471-2105-11-S2-S5
_version_ 1782211098977501184
author Yang, Bin
Peng, Yu
Leung, Henry Chi-Ming
Yiu, Siu-Ming
Chen, Jing-Chi
Chin, Francis Yuk-Lun
collection PubMed
description BACKGROUND: With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, which is also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as “binning”. Existing binning methods are based on supervised or semi-supervised approaches which rely heavily on reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, existing binning methods may not be applicable in many cases. RESULTS: In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that our method can accurately bin DNA fragments with various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its error robustness. The binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%. CONCLUSIONS: We provide a new and effective tool to solve the metagenome binning problem without using any reference datasets or marker information of any known reference genomes (species). The source code of our software tool, the reference genomes of the species for generating the test datasets and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
format Online
Article
Text
id pubmed-3165929
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3165929 2011-09-03 Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers Yang, Bin Peng, Yu Leung, Henry Chi-Ming Yiu, Siu-Ming Chen, Jing-Chi Chin, Francis Yuk-Lun BMC Bioinformatics Proceedings BioMed Central 2010-04-16 /pmc/articles/PMC3165929/ /pubmed/20406503 http://dx.doi.org/10.1186/1471-2105-11-S2-S5 Text en Copyright ©2010 Yang and Chin; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
topic Proceedings
work_keys_str_mv AT yangbin unsupervisedbinningofenvironmentalgenomicfragmentsbasedonanerrorrobustselectionoflmers
AT pengyu unsupervisedbinningofenvironmentalgenomicfragmentsbasedonanerrorrobustselectionoflmers
AT leunghenrychiming unsupervisedbinningofenvironmentalgenomicfragmentsbasedonanerrorrobustselectionoflmers
AT yiusiuming unsupervisedbinningofenvironmentalgenomicfragmentsbasedonanerrorrobustselectionoflmers
AT chenjingchi unsupervisedbinningofenvironmentalgenomicfragmentsbasedonanerrorrobustselectionoflmers
AT chinfrancisyuklun unsupervisedbinningofenvironmentalgenomicfragmentsbasedonanerrorrobustselectionoflmers
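
The abstract above describes binning reads by the distribution of l-mers (substrings of length l in DNA fragments). As a rough illustration of that one idea only, the following Python sketch computes the relative frequency of every l-mer in a single read. The function name lmer_profile, the default l = 4, and the choice to skip windows containing characters other than A, C, G, T are all assumptions made for this example; this record does not say how the authors select l-mers or cluster reads, so none of that is shown here.

```python
from collections import Counter
from itertools import product


def lmer_profile(read: str, l: int = 4) -> list[float]:
    """Relative frequency of each of the 4**l possible l-mers in one read.

    Hypothetical helper for illustration; not taken from the MetaCluster code.
    """
    read = read.upper()
    # Count every length-l window of the read.
    counts = Counter(read[i:i + l] for i in range(len(read) - l + 1))
    # Fixed lexicographic ordering so profiles of different reads are comparable.
    keys = ["".join(p) for p in product("ACGT", repeat=l)]
    # Windows with ambiguous bases (e.g. N) are not in keys and are ignored.
    total = sum(counts[k] for k in keys)
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


# Example: the 2-mer profile of a short read.
profile = lmer_profile("ACGTACGT", l=2)
```

Profiles like this could be passed to any clustering routine as feature vectors; which l-mers to keep and how to group the reads is exactly what the paper addresses and is outside this sketch.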