Sequence-Based Classification Using Discriminatory Motif Feature Selection
Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings...
Main Authors: | Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2011 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213122/ https://www.ncbi.nlm.nih.gov/pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382 |
_version_ | 1782216083450626048 |
---|---|
author | Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R. |
author_facet | Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R. |
author_sort | Xiong, Hao |
collection | PubMed |
description | Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/. |
format | Online Article Text |
id | pubmed-3213122 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-3213122 2011-11-18 Sequence-Based Classification Using Discriminatory Motif Feature Selection Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R. PLoS One Research Article Public Library of Science 2011-11-10 /pmc/articles/PMC3213122/ /pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382 Text en Xiong et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R. Sequence-Based Classification Using Discriminatory Motif Feature Selection |
title | Sequence-Based Classification Using Discriminatory Motif Feature Selection |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213122/ https://www.ncbi.nlm.nih.gov/pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382 |
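
The three-partition workflow summarized in the description (discovery, then training, then validation) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the function names, the split fractions, the k-mer presence score used as a stand-in for a discriminatory motif finder, the nearest-centroid classifier, and the synthetic data with a planted `GATTACA` motif are all assumptions made for this example. Class labels are assumed to be 0 and 1.

```python
import random
from collections import Counter


def three_way_split(items, fractions=(0.2, 0.5, 0.3), seed=0):
    """Shuffle and split into discovery, training and validation partitions.

    The fractions are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    a = int(fractions[0] * len(shuffled))
    b = a + int(fractions[1] * len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def discover_motifs(labeled, k=4, top=5):
    """Stand-in motif finder (hypothetical): rank k-mers by how differently
    often they are present in class-1 versus class-0 sequences."""
    present = {0: Counter(), 1: Counter()}
    totals = Counter()
    for seq, label in labeled:
        totals[label] += 1
        # Count each k-mer at most once per sequence (presence, not frequency).
        present[label].update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    kmers = set(present[0]) | set(present[1])

    def score(m):
        return abs(present[1][m] / max(totals[1], 1)
                   - present[0][m] / max(totals[0], 1))

    return sorted(kmers, key=lambda m: (-score(m), m))[:top]


def featurize(seq, motifs):
    """One feature per motif: its non-overlapping occurrence count."""
    return [seq.count(m) for m in motifs]


def fit_centroids(labeled, motifs):
    """Nearest-centroid classifier: mean feature vector per class."""
    sums, counts = {}, Counter()
    for seq, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(motifs))
        for j, value in enumerate(featurize(seq, motifs)):
            acc[j] += value
        counts[label] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(seq, motifs, centroids):
    vec = featurize(seq, motifs)
    return min(sorted(centroids),
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(vec, centroids[lab])))


def run_pipeline(labeled, k=4, top=5, seed=0):
    discovery, training, validation = three_way_split(labeled, seed=seed)
    motifs = discover_motifs(discovery, k, top)      # features from discovery only
    centroids = fit_centroids(training, motifs)      # classifier from training only
    correct = sum(predict(s, motifs, centroids) == y for s, y in validation)
    return motifs, correct / len(validation)         # accuracy on validation only


def make_toy_data(n=200, length=30, motif="GATTACA", seed=1):
    """Synthetic DNA: class-1 sequences carry a planted motif (assumption)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if label == 1:
            pos = rng.randrange(length - len(motif) + 1)
            seq = seq[:pos] + motif + seq[pos + len(motif):]
        data.append((seq, label))
    return data


if __name__ == "__main__":
    motifs, accuracy = run_pipeline(make_toy_data())
    print("selected motifs:", motifs)
    print("validation accuracy:", accuracy)
```

Keeping the three partitions disjoint means the features are chosen without reference to the sequences later used to fit or assess the classifier. Because the approach is described as modular, either stand-in here could be swapped for a real discriminatory motif finder or a different classifier without changing the overall flow.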