
Protein sequence classification using feature hashing


Bibliographic Details
Main Authors: Caragea, Cornelia; Silvescu, Adrian; Mitra, Prasenjit
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380737/
https://www.ncbi.nlm.nih.gov/pubmed/22759572
http://dx.doi.org/10.1186/1477-5956-10-S1-S14
author Caragea, Cornelia
Silvescu, Adrian
Mitra, Prasenjit
collection PubMed
description Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space using a hash function. That is, features are mapped to hash keys, multiple features can be mapped (at random) to the same hash key, and the counts of features that share a key are "aggregated". We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
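The contrast the abstract draws between the "bag of k-grams" representation and feature hashing can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the record does not say which hash function or bucket count the paper uses, so the MD5-based `hash_index`, the `num_buckets` parameter, the 20-letter `AMINO_ACIDS` alphabet, and the sample sequence are all assumptions made for the example.

```python
import hashlib
from collections import Counter

# Assumption: the 20 standard amino acids, so there are 20**k possible k-grams.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kgrams(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def bag_of_kgrams(sequence, k):
    """Sparse count vector keyed by the k-gram itself (one dimension per k-gram)."""
    return Counter(kgrams(sequence, k))


def hash_index(feature, num_buckets):
    """Map a k-gram to a bucket in [0, num_buckets).

    MD5 is used only because it is deterministic across runs;
    it is a stand-in, not the hash function from the paper.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_features(sequence, k, num_buckets):
    """Dense count vector of fixed length num_buckets.

    k-grams that collide on the same bucket have their counts added together.
    """
    vector = [0] * num_buckets
    for gram in kgrams(sequence, k):
        vector[hash_index(gram, num_buckets)] += 1
    return vector


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence fragment
    k = 3
    print("possible k-grams:", len(AMINO_ACIDS) ** k)
    print("distinct k-grams in seq:", len(bag_of_kgrams(seq, k)))
    print("hashed vector:", hashed_features(seq, k, num_buckets=16))
```

The hashed vector has a fixed length regardless of k, whereas the bag-of-k-grams space grows as 20^k; the price is that colliding k-grams become indistinguishable to the classifier.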
format Online
Article
Text
id pubmed-3380737
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3380737 2012-06-25 Proteome Sci Proceedings BioMed Central 2012-06-21 Text en Copyright ©2012 Caragea et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Protein sequence classification using feature hashing
topic Proceedings