The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments
BACKGROUND: The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possi...
Main authors: | Saeed, Isaam; Halgamuge, Saman K |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2009 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788362/ https://www.ncbi.nlm.nih.gov/pubmed/19958473 http://dx.doi.org/10.1186/1471-2164-10-S3-S10 |
_version_ | 1782174964106919936 |
---|---|
author | Saeed, Isaam Halgamuge, Saman K |
author_facet | Saeed, Isaam Halgamuge, Saman K |
author_sort | Saeed, Isaam |
collection | PubMed |
description | BACKGROUND: The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods. RESULTS: We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency. CONCLUSION: We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further analysis and investigation. Unsupervised clustering revealed OFDEG-related features performed better than standard tetranucleotide frequency in representing a relevant organism-specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG. The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms. |
format | Text |
id | pubmed-2788362 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2788362 2009-12-04 The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments Saeed, Isaam Halgamuge, Saman K BMC Genomics Proceedings BACKGROUND: The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods. RESULTS: We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency. CONCLUSION: We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further analysis and investigation. Unsupervised clustering revealed OFDEG-related features performed better than standard tetranucleotide frequency in representing a relevant organism-specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG. The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms. BioMed Central 2009-12-03 /pmc/articles/PMC2788362/ /pubmed/19958473 http://dx.doi.org/10.1186/1471-2164-10-S3-S10 Text en Copyright ©2009 Saeed and Halgamuge; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Saeed, Isaam Halgamuge, Saman K The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
title | The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
title_full | The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
title_fullStr | The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
title_full_unstemmed | The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
title_short | The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
title_sort | oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788362/ https://www.ncbi.nlm.nih.gov/pubmed/19958473 http://dx.doi.org/10.1186/1471-2164-10-S3-S10 |
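
The description in this record compares OFDEG against two compositional features: tetranucleotide frequency and G-C content. The following is a minimal sketch of how those two baseline features might be computed for a single DNA fragment. It is not the authors' implementation and it does not compute OFDEG itself; the function names `gc_content` and `kmer_frequencies`, the handling of ambiguous bases, and the example fragment are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T).

    Assumption: ambiguous symbols such as N are ignored rather than counted.
    """
    seq = seq.upper()
    valid = sum(seq.count(base) for base in "ACGT")
    if valid == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / valid


def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer over A/C/G/T (k=4: tetranucleotides).

    Overlapping windows are counted on the given strand only. Windows that
    contain a non-ACGT symbol are skipped (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Hypothetical fragment, used only to show the call pattern.
fragment = "ACGTACGTGGCC"
profile = kmer_frequencies(fragment)
print(gc_content(fragment), len(profile))
```

For k = 4 the profile has 4^4 = 256 dimensions per fragment, which is the kind of higher dimensional feature space that the one-dimensional OFDEG signature is described as compressing.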