Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features

BACKGROUND: Machine learning methods are now used for many biological prediction problems involving drugs, ligands, or polypeptide segments of a protein. To build a prediction model, a so-called training data set of molecules with measured target properties is needed. For many such probl...

Bibliographic Details
Main authors:	Demir-Kavuk, Ozgur; Kamada, Mayumi; Akutsu, Tatsuya; Knapp, Ernst-Walter
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3224215/ https://www.ncbi.nlm.nih.gov/pubmed/22026913 http://dx.doi.org/10.1186/1471-2105-12-412

author	Demir-Kavuk, Ozgur; Kamada, Mayumi; Akutsu, Tatsuya; Knapp, Ernst-Walter
author_sort	Demir-Kavuk, Ozgur
collection	PubMed
description	BACKGROUND: Machine learning methods are now used for many biological prediction problems involving drugs, ligands, or polypeptide segments of a protein. To build a prediction model, a so-called training data set of molecules with measured target properties is needed. For many such problems, the size of the training data set is limited because measurements have to be performed in a wet lab. Furthermore, the problems considered are often complex, so it is not clear which molecular descriptors (features) may be suitable for establishing a strong correlation with the target property. In many applications, all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered and only a few (e.g. fewer than one hundred) molecules are available for training. RESULTS: The CoEPrA contest provides four data sets that are typical of biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure to all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features, reducing the initial set of more than 6,000 features to about 50. In the second stage, we used only the features selected in the first stage and applied a milder L2 regularization, which generally yielded a further improvement in prediction performance. Our linear model employed a soft loss function that minimizes the influence of outliers. CONCLUSIONS: The proposed two-step method showed good results on all four CoEPrA regression tasks. Thus, it may be useful for many other biological prediction problems in which only a small number of molecules, described by thousands of descriptors, are available for training.
format	Online Article Text
id	pubmed-3224215
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3224215 2011-11-30 BMC Bioinformatics Methodology Article BioMed Central 2011-10-25 /pmc/articles/PMC3224215/ /pubmed/22026913 http://dx.doi.org/10.1186/1471-2105-12-412 Text en Copyright ©2011 Demir-Kavuk et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3224215/ https://www.ncbi.nlm.nih.gov/pubmed/22026913 http://dx.doi.org/10.1186/1471-2105-12-412
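
The two-step procedure described in the abstract (L1 regularization to select features, followed by a milder L2 regularization on the selected features only) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes a plain squared-error loss in place of the soft loss function mentioned in the abstract, uses coordinate descent as the solver, and runs on synthetic data. The names `soft_threshold`, `fit_linear`, and `two_step_fit`, and the penalty values `l1=0.5` and `l2=0.01`, are hypothetical choices made for this example.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal step for an L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_linear(X, y, l1=0.0, l2=0.0, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||^2.

    Assumption: squared-error loss, not the soft loss used in the paper.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def two_step_fit(X, y, l1, l2):
    """Stage 1: L1 fit to pick features. Stage 2: L2 refit on those features."""
    stage1 = fit_linear(X, y, l1=l1)
    selected = [j for j, wj in enumerate(stage1) if wj != 0.0]
    X_sel = [[row[j] for j in selected] for row in X]
    weights = fit_linear(X_sel, y, l2=l2) if selected else []
    return selected, weights


# Synthetic data (hypothetical): 40 samples, 200 features, 2 of them informative.
random.seed(0)
n, p = 40, 200
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[3] - 3.0 * row[17] + random.gauss(0, 0.1) for row in X]

selected, weights = two_step_fit(X, y, l1=0.5, l2=0.01)
print("selected features:", selected)
print("stage-2 weights:", [round(wj, 3) for wj in weights])
```

In this sketch the L1 stage sets most weights exactly to zero, and the second stage refits the surviving features with a weak L2 penalty, which removes most of the shrinkage that the L1 stage introduced. In practice both penalty strengths would be tuned, for example by cross-validation, rather than fixed as they are here.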
