
Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cr...


Bibliographic Details
Main authors:	Kim, Yohan; Sidney, John; Buus, Søren; Sette, Alessandro; Nielsen, Morten; Peters, Bjoern
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111843/ https://www.ncbi.nlm.nih.gov/pubmed/25017736 http://dx.doi.org/10.1186/1471-2105-15-241

_version_	1782328127930761216
author	Kim, Yohan Sidney, John Buus, Søren Sette, Alessandro Nielsen, Morten Peters, Bjoern
author_facet	Kim, Yohan Sidney, John Buus, Søren Sette, Alessandro Nielsen, Morten Peters, Bjoern
author_sort	Kim, Yohan
collection	PubMed
description	BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set. RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates. CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-241) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4111843
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4111843 2014-07-27 Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions Kim, Yohan Sidney, John Buus, Søren Sette, Alessandro Nielsen, Morten Peters, Bjoern BMC Bioinformatics Research Article BioMed Central 2014-07-14 /pmc/articles/PMC4111843/ /pubmed/25017736 http://dx.doi.org/10.1186/1471-2105-15-241 Text en © Kim et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111843/ https://www.ncbi.nlm.nih.gov/pubmed/25017736 http://dx.doi.org/10.1186/1471-2105-15-241
