
Sequence-Based Classification Using Discriminatory Motif Feature Selection

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings...


Bibliographic Details
Main authors: Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R.
Format: Online Article Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213122/
https://www.ncbi.nlm.nih.gov/pubmed/22102890
http://dx.doi.org/10.1371/journal.pone.0027382
collection PubMed
description Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters.
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
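The three-partition workflow described in the abstract (discover motifs, train a classifier on motif features, assess on held-out data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the k-mer ranking below is only a toy stand-in for a real discriminatory motif finder, the nearest-centroid rule is a stand-in classifier, and every function name, the planted motif `ACGTAC`, the partition fractions, and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter


def make_toy_data(n_per_class=100, motif="ACGTAC", seed=1):
    """Synthetic, unequal-length DNA; class 1 carries a planted motif (assumption)."""
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(30, 50)))
            if label == 1:
                pos = rng.randrange(len(seq) + 1)
                seq = seq[:pos] + motif + seq[pos:]
            data.append((seq, label))
    return data


def three_way_split(items, fractions=(0.3, 0.4, 0.3), seed=0):
    """Shuffle, then split into discovery, training and validation partitions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    a = int(fractions[0] * len(shuffled))
    b = a + int(fractions[1] * len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def find_discriminatory_motifs(labelled, k=6, top=2):
    """Toy motif finder: rank k-mers by the absolute difference between the
    fractions of class-1 and class-0 sequences that contain them."""
    present = {0: Counter(), 1: Counter()}
    totals = Counter()
    for seq, label in labelled:
        totals[label] += 1
        present[label].update({seq[i:i + k] for i in range(len(seq) - k + 1)})

    def score(m):
        return abs(present[1][m] / max(totals[1], 1)
                   - present[0][m] / max(totals[0], 1))

    candidates = set(present[0]) | set(present[1])
    return sorted(candidates, key=lambda m: (-score(m), m))[:top]


def featurize(seq, motifs):
    """One feature per motif: number of (possibly overlapping) occurrences."""
    return [sum(seq.startswith(m, i) for i in range(len(seq))) for m in motifs]


def fit_centroids(labelled, motifs):
    """Stand-in classifier: mean feature vector per class."""
    sums, counts = {}, Counter()
    for seq, label in labelled:
        acc = sums.setdefault(label, [0.0] * len(motifs))
        for j, v in enumerate(featurize(seq, motifs)):
            acc[j] += v
        counts[label] += 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(seq, motifs, centroids):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    x = featurize(seq, motifs)
    return min(sorted(centroids),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


data = make_toy_data()
discovery, training, validation = three_way_split(data)
motifs = find_discriminatory_motifs(discovery)    # features come from discovery only
centroids = fit_centroids(training, motifs)       # classifier sees training only
accuracy = sum(predict(s, motifs, centroids) == y
               for s, y in validation) / len(validation)
print(motifs, round(accuracy, 3))
```

Keeping motif discovery, classifier fitting, and assessment on disjoint partitions means the validation accuracy is not inflated by features selected on the same sequences. Because the two stages only communicate through the motif list, either stand-in could in principle be swapped for another motif finder or classifier, which is the modularity the abstract emphasizes.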
format Online
Article
Text
id pubmed-3213122
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3213122 2011-11-18 Sequence-Based Classification Using Discriminatory Motif Feature Selection Xiong, Hao Capurso, Daniel Sen, Śaunak Segal, Mark R. PLoS One Research Article Public Library of Science 2011-11-10 /pmc/articles/PMC3213122/ /pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382 Text en Xiong et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
topic Research Article