Protein sequence classification using feature hashing

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional i...


Bibliographic Details
Main authors:	Caragea, Cornelia; Silvescu, Adrian; Mitra, Prasenjit
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380737/ https://www.ncbi.nlm.nih.gov/pubmed/22759572 http://dx.doi.org/10.1186/1477-5956-10-S1-S14

author	Caragea, Cornelia; Silvescu, Adrian; Mitra, Prasenjit
author_sort	Caragea, Cornelia
collection	PubMed
description	Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
format	Online Article Text
id	pubmed-3380737
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3380737 2012-06-25 Protein sequence classification using feature hashing Caragea, Cornelia; Silvescu, Adrian; Mitra, Prasenjit Proteome Sci Proceedings BioMed Central 2012-06-21 /pmc/articles/PMC3380737/ /pubmed/22759572 http://dx.doi.org/10.1186/1477-5956-10-S1-S14 Text en Copyright ©2012 Caragea et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Protein sequence classification using feature hashing
topic	Proceedings
work_keys_str_mv	AT carageacornelia proteinsequenceclassificationusingfeaturehashing AT silvescuadrian proteinsequenceclassificationusingfeaturehashing AT mitraprasenjit proteinsequenceclassificationusingfeaturehashing
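
The description field above summarizes feature hashing as mapping k-gram features to hash keys and aggregating the counts of features that share a key. The sketch below illustrates that idea only. It is not the authors' implementation: the record does not say which hash function or bucket count the paper uses, so the CRC32 hash, the toy sequence, and the names `kgrams`, `bag_of_kgrams`, `hashed_features`, and `num_buckets` are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping length-k substrings of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kgrams(sequence, k):
    """Count each distinct k-gram (the "bag of k-grams" representation)."""
    return Counter(kgrams(sequence, k))


def hashed_features(sequence, k, num_buckets):
    """Hash k-gram counts into a fixed-length vector.

    Assumption: CRC32 modulo num_buckets stands in for the hash function;
    the source record does not name the one used in the paper.
    """
    vector = [0] * num_buckets
    for gram, count in bag_of_kgrams(sequence, k).items():
        index = zlib.crc32(gram.encode("ascii")) % num_buckets
        # Different k-grams may collide on the same index; their counts
        # are simply added together ("aggregated").
        vector[index] += count
    return vector


# Hypothetical toy protein sequence, not taken from the paper's data.
seq = "MKVLAAGIVGLLLAQ"
k = 3
num_buckets = 8

bag = bag_of_kgrams(seq, k)
vec = hashed_features(seq, k, num_buckets)

# With a 20-letter amino acid alphabet, the unhashed space has 20**k
# possible k-grams; the hashed vector always has num_buckets entries.
print("possible k-grams:", 20 ** k)
print("hashed vector length:", len(vec))
print("total count preserved:", sum(vec) == sum(bag.values()))
```

Because colliding k-grams only have their counts added, the total number of k-gram occurrences is preserved, while the vector length is fixed by `num_buckets` rather than growing with k.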

Similar items