Comparing K-mer based methods for improved classification of 16S sequences

BACKGROUND: The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most appl...

Bibliographic Details
Main Authors:	Vinje, Hilde; Liland, Kristian Hovde; Almøy, Trygve; Snipen, Lars
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4487979/ https://www.ncbi.nlm.nih.gov/pubmed/26130333 http://dx.doi.org/10.1186/s12859-015-0647-4

author	Vinje, Hilde; Liland, Kristian Hovde; Almøy, Trygve; Snipen, Lars
collection	PubMed
description	BACKGROUND: The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of accessible sequence data, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project. Four other methods were also implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read length. RESULTS: The differences in classification error obtained by the methods seemed to be small, but they were stable for both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau. CONCLUSIONS: We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data is needed to improve classification from here. Classification errors occur most frequently for genera with few sequences present. For improving the taxonomy and testing new classification methods, a better, more universal and robust training data set is crucial.
format	Online Article Text
id	pubmed-4487979
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4487979; 2015-07-02; Comparing K-mer based methods for improved classification of 16S sequences; Vinje, Hilde; Liland, Kristian Hovde; Almøy, Trygve; Snipen, Lars; BMC Bioinformatics; Research Article; BioMed Central; 2015-07-01; /pmc/articles/PMC4487979/; /pubmed/26130333; http://dx.doi.org/10.1186/s12859-015-0647-4; Text; en; © Vinje et al. 2015. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Comparing K-mer based methods for improved classification of 16S sequences
topic	Research Article
