TITLE: On Optimal Data Split for Generalization Estimation and Model Selection

AUTHOR: Jan Larsen and Cyril Goutte
Department of Mathematical Modelling, Building 321
Technical University of Denmark, DK-2800 Lyngby, Denmark
emails: jl,cg@imm.dtu.dk
www: http://eivind.imm.dtu.dk


Modeling with flexible models, such as neural networks, requires careful control of the model complexity and generalization ability of the resulting model. Whereas general asymptotic estimators of generalization ability have been developed in recent years, it is widely acknowledged that in most modeling scenarios there is insufficient data available to reliably use these estimators for assessing generalization or for selecting/optimizing models. As a consequence, one resorts to resampling techniques such as cross-validation, jackknife, or bootstrap. In this paper, we address a crucial problem of cross-validation estimators: how to split the data into the various sets. We study the very different behavior of three data splitting schemes: hold-out cross-validation, K-fold cross-validation, and randomized permutation cross-validation. The theoretical basics of the various cross-validation techniques, with the purpose of reliably estimating the generalization error and optimizing the model structure, are described. Theoretical and numerical experiments clarify the very different behavior of the data splitting schemes.
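The three splitting schemes compared in the abstract can be sketched concretely. Below is a minimal illustration (not the authors' code; function names and the use of NumPy are our own assumptions) of how each scheme partitions a dataset of n examples into training and validation sets:

```python
import numpy as np

def hold_out_split(n, gamma, rng):
    """Hold-out cross-validation (illustrative sketch): a single random
    split in which a fraction gamma of the n examples is held out for
    validation and the rest is used for training."""
    idx = rng.permutation(n)
    n_val = int(np.floor(gamma * n))
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)

def k_fold_splits(n, K, rng):
    """K-fold cross-validation (illustrative sketch): the data is
    partitioned into K disjoint folds; each fold serves once as the
    validation set while the remaining K-1 folds form the training set."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, K)
    for k in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train, folds[k]

def randomized_permutation_splits(n, gamma, K, rng):
    """Randomized permutation cross-validation (illustrative sketch):
    K independent random hold-out splits, each holding out a fraction
    gamma of the data; validation sets may overlap across repetitions."""
    for _ in range(K):
        yield hold_out_split(n, gamma, rng)
```

Note the structural difference: K-fold validation sets are disjoint and jointly cover the data exactly once, whereas randomized permutation splits draw each validation set independently, so examples may appear in several validation sets or in none.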

Submitted to IEEE Neural Networks for Signal Processing, 1999.