Protein sequence classification using feature hashing
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional i...
Main authors: | Caragea, Cornelia; Silvescu, Adrian; Mitra, Prasenjit |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2012 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380737/ https://www.ncbi.nlm.nih.gov/pubmed/22759572 http://dx.doi.org/10.1186/1477-5956-10-S1-S14 |
_version_ | 1782236337671241728 |
---|---|
author | Caragea, Cornelia Silvescu, Adrian Mitra, Prasenjit |
author_facet | Caragea, Cornelia Silvescu, Adrian Mitra, Prasenjit |
author_sort | Caragea, Cornelia |
collection | PubMed |
description | Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks. |
format | Online Article Text |
id | pubmed-3380737 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-33807372012-06-25 Protein sequence classification using feature hashing Caragea, Cornelia Silvescu, Adrian Mitra, Prasenjit Proteome Sci Proceedings Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks. BioMed Central 2012-06-21 /pmc/articles/PMC3380737/ /pubmed/22759572 http://dx.doi.org/10.1186/1477-5956-10-S1-S14 Text en Copyright ©2012 Caragea et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Caragea, Cornelia Silvescu, Adrian Mitra, Prasenjit Protein sequence classification using feature hashing |
title | Protein sequence classification using feature hashing |
title_full | Protein sequence classification using feature hashing |
title_fullStr | Protein sequence classification using feature hashing |
title_full_unstemmed | Protein sequence classification using feature hashing |
title_short | Protein sequence classification using feature hashing |
title_sort | protein sequence classification using feature hashing |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380737/ https://www.ncbi.nlm.nih.gov/pubmed/22759572 http://dx.doi.org/10.1186/1477-5956-10-S1-S14 |
work_keys_str_mv | AT carageacornelia proteinsequenceclassificationusingfeaturehashing AT silvescuadrian proteinsequenceclassificationusingfeaturehashing AT mitraprasenjit proteinsequenceclassificationusingfeaturehashing |
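
The description in this record outlines feature hashing only in words: k-grams are mapped to hash keys, several k-grams may share a key, and their counts are aggregated. The following is a minimal Python sketch of that idea, not the authors' implementation. The function names, the use of MD5 as the hash function, the example sequence, and the bucket count are all assumptions made for illustration; the record does not say which hash function or parameters the paper used.

```python
import hashlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping substrings of length k in the sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kgrams(sequence, k):
    """Count each distinct k-gram: the high-dimensional representation."""
    return Counter(kgrams(sequence, k))


def hash_bucket(feature, num_buckets):
    """Map a k-gram to a bucket index in [0, num_buckets).

    MD5 is used only because it is deterministic across runs; it is an
    assumption of this sketch, not a choice documented in the record.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_features(sequence, k, num_buckets):
    """Aggregate k-gram counts into a fixed-length vector of hash buckets."""
    vector = [0] * num_buckets
    for gram, count in bag_of_kgrams(sequence, k).items():
        # k-grams that collide on the same bucket have their counts summed.
        vector[hash_bucket(gram, num_buckets)] += count
    return vector


if __name__ == "__main__":
    seq = "MKVLAAGIVGLMKV"  # hypothetical protein fragment
    print(bag_of_kgrams(seq, 3))
    print(hashed_features(seq, 3, 8))
```

The length of the hashed vector is fixed by `num_buckets` regardless of k, whereas the bag-of-k-grams representation can have up to 20^k dimensions for the 20 standard amino acids. Collisions lose some information, and the total count is preserved because every k-gram occurrence lands in exactly one bucket.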