
A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations

BACKGROUND: With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and var...


Bibliographic Details
Main author: Kiiveri, Harri T
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2390543/
https://www.ncbi.nlm.nih.gov/pubmed/18410693
http://dx.doi.org/10.1186/1471-2105-9-195
author Kiiveri, Harri T
collection PubMed
description BACKGROUND: With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking. RESULTS: The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to) multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse, well-fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download. CONCLUSION: The method described in this paper enables existing work on response models when there are fewer variables than observations to be leveraged in the situation where there are many more variables than observations. It is a powerful approach to finding parsimonious models for such data sets. The method is capable of handling problems with millions of variables and a large variety of response types within a single framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It also works for data types which these methods were not designed to handle. The method usually produces very sparse models, which make biological interpretation simpler and more focused.
format Text
id pubmed-2390543
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2390543 2008-05-21 A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations Kiiveri, Harri T BMC Bioinformatics Methodology Article BioMed Central 2008-04-15 /pmc/articles/PMC2390543/ /pubmed/18410693 http://dx.doi.org/10.1186/1471-2105-9-195 Text en Copyright © 2008 Kiiveri; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2390543/
https://www.ncbi.nlm.nih.gov/pubmed/18410693
http://dx.doi.org/10.1186/1471-2105-9-195