Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling

BACKGROUND: Proteochemometrics is a new methodology that allows prediction of protein function directly from real interaction measurement data without the need of 3D structure information. Several reported proteochemometric models of ligand-receptor interactions have already yielded significant insi...

Bibliographic Details
Main authors: Freyhult, Eva; Prusis, Peteris; Lapinsh, Maris; Wikberg, Jarl ES; Moulton, Vincent; Gustafsson, Mats G
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC555743/
https://www.ncbi.nlm.nih.gov/pubmed/15760465
http://dx.doi.org/10.1186/1471-2105-6-50
author Freyhult, Eva
Prusis, Peteris
Lapinsh, Maris
Wikberg, Jarl ES
Moulton, Vincent
Gustafsson, Mats G
collection PubMed
description BACKGROUND: Proteochemometrics is a new methodology that allows prediction of protein function directly from real interaction measurement data without the need for 3D structure information. Several reported proteochemometric models of ligand-receptor interactions have already yielded significant insights into various forms of bio-molecular interactions. The proteochemometric models are multivariate regression models that predict binding affinity for a particular combination of features of the ligand and protein. Although proteochemometric models have already offered interesting results in various studies, no detailed statistical evaluation of their average predictive power has been performed. In particular, variable subset selection performed to date has always relied on using all available examples, a situation also encountered in microarray gene expression data analysis. RESULTS: A methodology for an unbiased evaluation of the predictive power of proteochemometric models was implemented and results from applying it to two of the largest proteochemometric data sets yet reported are presented. A double cross-validation loop procedure is used to estimate the expected performance of a given design method. The unbiased performance estimates (P²) obtained for the data sets that we consider confirm that properly designed single proteochemometric models have useful predictive power, but that a standard design based on cross validation may yield models with quite limited performance. The results also show that different commercial software packages employed for the design of proteochemometric models may yield very different and therefore misleading performance estimates. In addition, the differences in the models obtained in the double CV loop indicate that detailed chemical interpretation of a single proteochemometric model is uncertain when data sets are small.
CONCLUSION: The double CV loop employed offers unbiased performance estimates for a given proteochemometric modelling procedure, making it possible to identify cases where the proteochemometric design does not result in useful predictive models. Chemical interpretations of single proteochemometric models are uncertain and should instead be based on all the models selected in the double CV loop employed here.
format Text
id pubmed-555743
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-555743 2005-04-01 Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling Freyhult, Eva Prusis, Peteris Lapinsh, Maris Wikberg, Jarl ES Moulton, Vincent Gustafsson, Mats G BMC Bioinformatics Research Article BioMed Central 2005-03-10 /pmc/articles/PMC555743/ /pubmed/15760465 http://dx.doi.org/10.1186/1471-2105-6-50 Text en Copyright © 2005 Freyhult et al; licensee BioMed Central Ltd.
title Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC555743/
https://www.ncbi.nlm.nih.gov/pubmed/15760465
http://dx.doi.org/10.1186/1471-2105-6-50