This M.Sc. thesis by Carl Edward Rasmussen is entitled `Generalization in
Neural Networks'. The PostScript file contains about 80 pages.
ABSTRACT
This report is concerned with methods for optimizing the generalization
ability of neural networks. The framework is developed to deal with
regression-type problems, where the networks are trained on a limited amount
of noisy data. In this context the problem can be formulated as finding the
optimal trade-off between data fit and model complexity.
Two paradigms for reducing model complexity are discussed: pruning and
weight decay. Numerical experiments show that the application of weight
decay is essential for obtaining good generalization performance. This is
explained by the way in which weight decay confines the space of possible
networks to a space of `reasonable' networks.
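As a minimal sketch of this trade-off (our own illustration, not the thesis's experiments; the names `fit_ridge` and `alpha` are ours), consider a linear model whose squared-error loss is augmented with a weight-decay penalty alpha * sum(w**2). Increasing alpha shrinks the weights, trading a worse data fit for lower model complexity:

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form minimizer of ||Xw - y||^2 + alpha * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                      # limited amount of data
w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)        # noisy targets

w_no_decay = fit_ridge(X, y, alpha=0.0)
w_decay = fit_ridge(X, y, alpha=1.0)

# The penalty confines the solution to "smaller" (more reasonable) weight
# vectors: the norm of the solution decreases monotonically with alpha.
print(np.linalg.norm(w_decay) < np.linalg.norm(w_no_decay))  # prints True
```

For a nonlinear network the same penalty is simply added to the training error before taking gradients; the closed form above exists only in the linear case.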
Two methods for making statistical estimates of the generalization
performance {\it without}\/ the use of validation sets are presented: the
Generalization method and the Bayesian method. The advantage of not needing
validation sets is that all available data can be used for training. This
feature is important, since the optimal generalization ability of a model is
directly related to the amount of available training data.
The Generalization method is an extension of Akaike's FPE estimator that
explicitly takes the application of weight decay into account. In this
method the generalization ability is estimated by averaging over the
ensemble of possible training sets consistent with the `true' function.
The method allows the weight decay to be set optimally and yields an
estimate of the generalization performance.
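For reference, Akaike's classical FPE estimator (in our notation, not necessarily the thesis's: $N$ training examples, $p$ free parameters, $E_{\mathrm{train}}$ the average squared training error) reads

\[
\mathrm{FPE} \;=\; \frac{N+p}{N-p}\, E_{\mathrm{train}} .
\]

An extension accounting for weight decay can be expected to replace $p$ by an effective number of parameters that shrinks as the weight decay grows; the thesis's exact expression is not reproduced here.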
The Bayesian framework uses the {\it evidence}\/ to measure the plausibility
of models. This framework places {\it priors}\/ on the network weights,
which are shown to be equivalent to the application of weight decay.
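The correspondence follows the usual Gaussian-prior argument (our notation): a zero-mean Gaussian prior on the weights becomes a quadratic penalty in the negative log posterior,

\begin{align*}
p(\mathbf{w}) &\propto \exp\Bigl(-\tfrac{\alpha}{2}\textstyle\sum_i w_i^2\Bigr),\\
-\log p(\mathbf{w}\mid D) &= \beta\, E_D(\mathbf{w}) \;+\; \tfrac{\alpha}{2}\textstyle\sum_i w_i^2 \;+\; \mathrm{const},
\end{align*}

so maximizing the posterior is the same as minimizing the data error $E_D$ with a weight-decay term of relative strength $\alpha/\beta$.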
The two methods are tested numerically on the sunspot problem. For linear
models on this (limited) problem both methods work well, and marked
similarities are found between the two methods. Both methods also exhibit
the ability to prune weights when individual weight-decay parameters are
assigned to each weight in the network.