
A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads

BACKGROUND: Taxonomic assignment is a crucial step in a metagenomic project which aims to identify the origin of sequences in an environmental sample. Among the existing methods, since composition-based algorithms are not sufficient for classifying short reads, recent algorithms use only the feature...


Bibliographic Details
Main Authors:	Le, Vinh Van; Tran, Lang Van; Tran, Hoai Van
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702387/ https://www.ncbi.nlm.nih.gov/pubmed/26740458 http://dx.doi.org/10.1186/s12859-015-0872-x

author	Le, Vinh Van; Tran, Lang Van; Tran, Hoai Van
collection	PubMed
description	BACKGROUND: Taxonomic assignment is a crucial step in a metagenomic project which aims to identify the origin of sequences in an environmental sample. Among the existing methods, since composition-based algorithms are not sufficient for classifying short reads, recent algorithms use only the feature of similarity, or similarity-based combined features. However, those algorithms are computationally expensive because similarity search is very time-consuming. Besides, the lack of similarity information between reads and reference sequences, due to the short length of the reads, significantly reduces the classification quality. RESULTS: This paper presents a novel taxonomic assignment algorithm, called SeMeta, which is based on semi-supervised learning to produce a fast and highly accurate classification of short-length reads with sufficient mutual overlap. The proposed algorithm first separates reads into clusters using their composition feature. It then labels the clusters by applying an efficient filtering technique to the results of the similarity search between their reads and reference databases. Furthermore, instead of performing the similarity search for all reads in the clusters, SeMeta does so only for reads in subgroups of those clusters, making use of sequence-overlap information. The experimental results demonstrate that SeMeta outperforms two other similarity-based algorithms in several respects. CONCLUSIONS: By using a semi-supervised method and taking advantage of several kinds of features, the proposed algorithm not only achieves high classification quality but also substantially reduces computational cost. The source code of the algorithm can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMeta.html ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0872-x) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4702387
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4702387 2016-01-07 A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads Le, Vinh Van; Tran, Lang Van; Tran, Hoai Van BMC Bioinformatics Methodology Article BioMed Central 2016-01-06 /pmc/articles/PMC4702387/ /pubmed/26740458 http://dx.doi.org/10.1186/s12859-015-0872-x Text en © Le et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702387/ https://www.ncbi.nlm.nih.gov/pubmed/26740458 http://dx.doi.org/10.1186/s12859-015-0872-x
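
The abstract describes clustering reads by composition and then labelling whole clusters from similarity-search results obtained for only a subset of their reads. The sketch below illustrates that general idea in Python; it is not the SeMeta implementation. The function names (kmer_profile, label_clusters, assign_reads), the min_support threshold, and the majority-vote filter are assumptions made for illustration, and the cluster memberships and similarity hits are taken as given inputs rather than computed.

```python
from collections import Counter
from itertools import product


def kmer_profile(read, k=2):
    """Composition feature: normalized k-mer frequencies of one read (hypothetical helper)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    # Only k-mers over A/C/G/T are counted; guard against empty or ambiguous reads.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def label_clusters(clusters, hits, min_support=0.6):
    """Label each cluster from the taxa of its reads that have a similarity hit.

    clusters: cluster id -> list of read ids (assumed to come from a composition-based clustering)
    hits:     read id -> taxon, available for only a subset of reads
    A cluster gets the majority taxon if its share of the votes reaches
    min_support (an assumed filter, not the one used in the paper); otherwise None.
    """
    labels = {}
    for cid, reads in clusters.items():
        votes = Counter(hits[r] for r in reads if r in hits)
        if not votes:
            labels[cid] = None
            continue
        taxon, n = votes.most_common(1)[0]
        labels[cid] = taxon if n / sum(votes.values()) >= min_support else None
    return labels


def assign_reads(clusters, labels):
    """Propagate each cluster label to every read in that cluster."""
    return {r: labels[cid] for cid, reads in clusters.items() for r in reads}


clusters = {0: ["r1", "r2", "r3"], 1: ["r4", "r5"], 2: ["r6"]}
hits = {"r1": "E. coli", "r2": "E. coli", "r4": "B. subtilis", "r5": "S. aureus"}
labels = label_clusters(clusters, hits)
assignment = assign_reads(clusters, labels)
```

In this toy input, read r3 has no similarity hit of its own but inherits the label of its cluster, while cluster 1 stays unlabelled because its two hits disagree and cluster 2 stays unlabelled because it has no hits at all.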


Similar Items