
Sequence-Based Classification Using Discriminatory Motif Feature Selection

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings...

Bibliographic Details
Main authors:	Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2011
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213122/ https://www.ncbi.nlm.nih.gov/pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382

_version_	1782216083450626048
author	Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R.
author_facet	Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R.
author_sort	Xiong, Hao
collection	PubMed
description	Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k₀, such that potentially important, longer (> k₀) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
format	Online Article Text
id	pubmed-3213122
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3213122 2011-11-18 Sequence-Based Classification Using Discriminatory Motif Feature Selection Xiong, Hao; Capurso, Daniel; Sen, Śaunak; Segal, Mark R. PLoS One Research Article Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k₀, such that potentially important, longer (> k₀) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/. Public Library of Science 2011-11-10 /pmc/articles/PMC3213122/ /pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382 Text en Xiong et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Xiong, Hao Capurso, Daniel Sen, Śaunak Segal, Mark R. Sequence-Based Classification Using Discriminatory Motif Feature Selection
title	Sequence-Based Classification Using Discriminatory Motif Feature Selection
title_full	Sequence-Based Classification Using Discriminatory Motif Feature Selection
title_fullStr	Sequence-Based Classification Using Discriminatory Motif Feature Selection
title_full_unstemmed	Sequence-Based Classification Using Discriminatory Motif Feature Selection
title_short	Sequence-Based Classification Using Discriminatory Motif Feature Selection
title_sort	sequence-based classification using discriminatory motif feature selection
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213122/ https://www.ncbi.nlm.nih.gov/pubmed/22102890 http://dx.doi.org/10.1371/journal.pone.0027382
work_keys_str_mv	AT xionghao sequencebasedclassificationusingdiscriminatorymotiffeatureselection AT capursodaniel sequencebasedclassificationusingdiscriminatorymotiffeatureselection AT sensaunak sequencebasedclassificationusingdiscriminatorymotiffeatureselection AT segalmarkr sequencebasedclassificationusingdiscriminatorymotiffeatureselection
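
The three-partition scheme summarized in the description field (discovery, then training, then validation) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the split fractions, the toy data with a planted GATTACA motif, the k-mer ranking used as a stand-in for a real discriminatory motif finder, and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter

def split_three(data, fractions=(0.3, 0.4, 0.3), seed=0):
    """Shuffle and split into discovery, training and validation partitions."""
    items = list(data)
    random.Random(seed).shuffle(items)
    a = int(len(items) * fractions[0])
    b = a + int(len(items) * fractions[1])
    return items[:a], items[a:b], items[b:]

def kmers(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def find_motifs(discovery, k=4, top=3):
    """Toy stand-in for a discriminatory motif finder (an assumption):
    rank k-mers by the absolute difference between the fraction of
    class-1 and class-0 sequences that contain them."""
    pos = [s for s, y in discovery if y == 1]
    neg = [s for s, y in discovery if y == 0]
    pos_counts = Counter(m for s in pos for m in kmers(s, k))
    neg_counts = Counter(m for s in neg for m in kmers(s, k))

    def score(m):
        return abs(pos_counts[m] / max(len(pos), 1)
                   - neg_counts[m] / max(len(neg), 1))

    candidates = sorted(set(pos_counts) | set(neg_counts))
    return sorted(candidates, key=score, reverse=True)[:top]

def featurize(seq, motifs):
    """One 0/1 presence feature per discovered motif."""
    return [int(m in seq) for m in motifs]

def train_centroids(training, motifs):
    """Illustrative nearest-centroid classifier: mean feature vector per class."""
    sums, counts = {}, Counter()
    for seq, y in training:
        acc = sums.setdefault(y, [0.0] * len(motifs))
        for j, v in enumerate(featurize(seq, motifs)):
            acc[j] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(seq, motifs, centroids):
    """Assign the class whose centroid is closest in squared distance."""
    x = featurize(seq, motifs)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def make_toy_data(n=200, motif="GATTACA", seed=1):
    """Hypothetical data: unequal-length random DNA; class 1 carries the motif."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        length = rng.randint(25, 40)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        label = i % 2
        if label == 1:
            p = rng.randrange(length - len(motif) + 1)
            seq = seq[:p] + motif + seq[p + len(motif):]
        data.append((seq, label))
    return data

data = make_toy_data()
discovery, training, validation = split_three(data)
motifs = find_motifs(discovery)                 # features come from discovery only
centroids = train_centroids(training, motifs)   # classifier is fit on training only
accuracy = sum(predict(s, motifs, centroids) == y
               for s, y in validation) / len(validation)
print(motifs, accuracy)
```

Keeping motif discovery on its own partition means the features are chosen without seeing the training or validation labels, so the validation accuracy is not inflated by feature selection. Because `find_motifs` and the classifier only meet through `featurize`, either could be swapped for another method, which mirrors the modularity the description claims.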
