On the Relationship between Generalization Error and Sample Complexity for Radial Basis Functions and Incremental Training Algorithm
Outline of the presentation:
Introduction of basic concepts and notations
There are many heuristic techniques in the neural network literature for performing various tasks within the supervised learning paradigm, such as optimizing training, selecting an appropriate network size, and predicting how much data is required to achieve a given generalization error. Our results are mainly based on RBF networks; however, similar formulas can be applied to MLPs.
Traditional Point of View: Center parameters are fixed before training, theory of linear model, bias-variance dilemma, cross-validation, center selection, localized basis function, curse of dimensionality
Probably approximately correct (PAC) framework: The concept class $F$ is PAC learnable by a network if, for every concept $f \in F$ and every distribution $P_X$, whenever the network is given at least $p(N, 1/\epsilon, 1/\delta)$ training data, where $p$ is a polynomial, the network can output a hypothesis $h$ such that

$\Pr(\mathrm{error}(h, f) > \epsilon) < \delta.$

Think of $\delta$ as a measure of confidence and $\epsilon$ as an error tolerance; this is a worst-case definition.
Regression Estimation Model: The task is defined by the distribution $P_{X \times Y}$ on a function space, i.e., the input-output pair $(x, y)$ is drawn from $P_{X \times Y}$. The approximation error is defined as the difference between the network with the best possible parameter set and the target function to be approximated.
The measure of performance of the RBF network: the average squared error between the prediction and the target (the expected risk),

$\mathrm{err}_{\mathrm{ave}}[f] = E[(y - f(x))^2] = \int_{X \times Y} (y - f(x))^2 \, dP_{X \times Y}(x, y).$
The expected risk can be decomposed into two parts:

$\mathrm{err}_{\mathrm{ave}}[f] = E[(f_0(x) - f(x))^2] + E[(y - f_0(x))^2],$

where $f_0(x)$, the regression function, is the conditional mean of the output given a particular input. In practice, $\mathrm{err}_{\mathrm{ave}}[f]$ is unavailable to us because $P_{X \times Y}$ is unknown.
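This decomposition holds because the cross term vanishes; a short standard derivation, added here only for completeness:

```latex
\begin{align*}
E[(y - f(x))^2]
  &= E[(y - f_0(x) + f_0(x) - f(x))^2] \\
  &= E[(y - f_0(x))^2] + E[(f_0(x) - f(x))^2]
     + 2\,E[(y - f_0(x))(f_0(x) - f(x))] \\
  &= E[(y - f_0(x))^2] + E[(f_0(x) - f(x))^2],
\end{align*}
```

since $E[(y - f_0(x))(f_0(x) - f(x))] = E_x\big[(f_0(x) - f(x))\,E[y - f_0(x)\mid x]\big] = 0$ by the definition $f_0(x) = E[y\mid x]$.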
What to do: use the empirical risk instead of the expected risk.
Assume we have $l$ training examples $(x_i, y_i)$, $i = 1, \dots, l$; then the empirical risk is

$\mathrm{err}_{\mathrm{emp}}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2.$
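As a concrete illustration, a minimal sketch of the empirical risk for a Gaussian RBF model; the function and variable names are ours, not from the presentation:

```python
import numpy as np

def rbf_predict(X, centers, widths, weights):
    """Gaussian RBF network: f(x) = sum_i w_i * exp(-||x - t_i||^2 / s_i^2)."""
    # Squared distances between every input (rows of X) and every center: shape (l, n)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / widths ** 2) @ weights

def empirical_risk(X, y, centers, widths, weights):
    """Empirical risk (1/l) * sum_i (y_i - f(x_i))^2 over the l training pairs."""
    residual = y - rbf_predict(X, centers, widths, weights)
    return np.mean(residual ** 2)
```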
Question: What is the relationship between generalization error and empirical risk?
First, bound the approximation error
Second, bound the estimation error
Under certain mild assumptions, the generalization error is bounded by the sum of an approximation-error term and an estimation-error term.

Applying this framework to RBFs: for any $\delta \in (0, 1)$, for an RBF network with $n$ nodes trained on $l$ points with $k$ input dimensions each, with probability greater than $1 - \delta$,

$\mathrm{err}_{\mathrm{ave}} \le O\!\left(\frac{1}{n}\right) + O\!\left(\left[\frac{nk\ln(nl) + \ln(1/\delta)}{l}\right]^{1/2}\right).$
Comments: The first term is the approximation error, which is of no interest to us here. The second term is the estimation error. For a fixed network, it is governed by the number of training patterns and decreases as $O([\ln l / l]^{1/2})$. This is considerably slower than the VC-dimension approach for Gaussian RBFs, where the estimation error scales as $1/l$; this is the price paid for (almost) distribution-free bounds.
Choice of the number of neurons: Assuming we have enough training data $l$, one can calculate the optimal network size $n^*(l)$; in fact, $n^*(l)$ is obtained by minimizing the above bound over $n$, and for large $n$ we have $n^* \propto \left[l/(k \ln l)\right]^{1/3}$.
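A small numeric sketch of the estimation term and the rule of thumb above; the hidden constants in the O(·) bounds are unknown, so this only tracks orders of magnitude, and the helper names are ours:

```python
import numpy as np

def estimation_term(n, l, k, delta=0.05, c=1.0):
    """Estimation-error term [ (n*k*ln(n*l) + ln(1/delta)) / l ]^(1/2),
    up to an unknown constant c."""
    return c * np.sqrt((n * k * np.log(n * l) + np.log(1.0 / delta)) / l)

def optimal_size_rule(l, k):
    """Rule of thumb for the optimal number of neurons: n proportional to (l / (k ln l))^(1/3)."""
    return (l / (k * np.log(l))) ** (1.0 / 3.0)

# Illustration only: l = 39768 hourly samples; k = 3 inputs is an assumption
print(optimal_size_rule(39768, 3))   # about 11, up to the unknown proportionality constant
```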
It must again be emphasized that these results depend on finding the best possible estimator for a given size of data set, and are based on worst-case bounds which require almost no knowledge of the input distribution. The generalization error bound can be viewed as a benchmark that a well-chosen RBF should at least meet. In practice, one can seek better performance by pre-analysis of the training data; hopefully we will have some prior knowledge from statistical analysis.
What steps should we follow?
Good or bad?
Sometimes we cannot expect to find the global minimum when training a NN; a sub-optimum will suffice. We may get stuck in a local minimum and not know whether we need to add neurons.
Furthermore, the error bound does not assume any specific form of the input distribution, and we would like to reduce the number of neurons by exploiting the statistics of the training data.
Incremental training algorithm:
Good or bad?
Each new training epoch does not use the previous result. As $n_0$ increases, we will face the curse of dimensionality.
We do not want to forget all the historical results. There are two ways to use the history: initialize the network parameters with the previous training result, or fix some parameters and train only the new neuron and the output weights.
The latter seems promising, though it may not yield the best possible parameter set for an RBF with $n_0$ neurons. A sketch of this strategy is given after the setup below.
For example, assume the set $X$ is the $k$-dimensional Euclidean space and $Y$ is the bounded interval $[-M, M]$. The radial basis functions used for approximation can be written as $g(x; t_i)$ with center $t_i$, and the finite series approximation has the form

$f_n(x) = \sum_{i=1}^{n} b_i\, g(x; t_i).$

When adding the $(n+1)$-th neuron, the tuning parameters for the new training epoch are $b_i$, $i = 1, 2, \dots, n+1$, and $t_{n+1}$. Then what can we say about the approximation accuracy of incremental training?
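A minimal sketch of one incremental step under the "fix the old centers, train only the new neuron and the output weights" strategy; the initialization of the new center, the fixed width, the learning rate, and all names are our own assumptions:

```python
import numpy as np

def gaussian(X, t, s=1.0):
    """Gaussian basis g(x; t) = exp(-||x - t||^2 / s^2), evaluated for every row of X."""
    return np.exp(-((X - t) ** 2).sum(axis=1) / s ** 2)

def add_neuron(X, y, centers, lr=0.01, steps=200):
    """Add the (n+1)-th neuron: keep the existing centers fixed, adapt the new center
    t_{n+1} by gradient descent, and refit the output weights b_1..b_{n+1} by least squares."""
    t_new = X[np.argmax(np.abs(y))].astype(float)        # crude initialization of the new center
    for _ in range(steps):
        G = np.column_stack([gaussian(X, t) for t in centers] + [gaussian(X, t_new)])
        b, *_ = np.linalg.lstsq(G, y, rcond=None)        # output weights in closed form
        r = G @ b - y                                    # residual of the current model
        # Gradient of sum(r^2) with respect to the new center (width s = 1 assumed)
        grad = (2.0 * b[-1] * (r * G[:, -1])[:, None] * 2.0 * (X - t_new)).sum(axis=0)
        t_new -= lr * grad
    return list(centers) + [t_new], b
```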
Claim: Let $F$ be the concept class and $(F, \|\cdot\|_2)$ be a Hilbert space with a subset $G$. Then for every $f \in F$ and $n > 0$, we have

$\| f - \mathrm{span}_n G \|_2^2 \le \frac{V(f, G_0)^2 - \|f\|^2}{n},$

where $V(f, G) = \inf\{ B \in \mathbb{R}^+ : f \in \mathrm{cl\,conv}\, G(B) \}$ and also

$V(f, G_0) = \inf_{a: G \to \mathbb{R}} V(f, G_a), \quad \sup_{g \in G} \|a(g)\, g\| < \infty.$
The proof is quite complicated and is not our contribution. In practice, we hope the RBF basis functions satisfy the G-variation condition above.
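To spell out the consequence of the claim (a standard reading of Maurey-Jones-Barron-type bounds, not a new result here): whenever $V(f, G_0)$ is finite, the $n$-term approximation error satisfies

```latex
\left\| f - \operatorname{span}_n G \right\|_2^2
  \;\le\; \frac{V(f, G_0)^2 - \|f\|^2}{n} \;=\; O\!\left(\frac{1}{n}\right),
```

so the rate in $n$ does not deteriorate with the input dimension $k$; any dimension dependence is hidden in the size of $V(f, G_0)$.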
Now, we move to a real application and see how to use the theoretical results for electrical demand prediction.
Data set:
hourly demand of NU from Jan. 1, 1994 to Jul. 15, 1998
temperature and humidity for each hour, also collected from Jan. 1, 1994 to Jul. 15, 1998
What is next?
Data pre-analysis before training
Try to see the average performance for the overall data trace.
Some intuitive ideas about the data trace
Performance measures: root mean square error (RMSE) and relative (mean) error (RME).
Since these are average measures, we do not try to explore the local behavior of learnability.
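A minimal sketch of the two measures; the exact normalization of RME is not spelled out in the slides, so the definition below (mean of |error| / |target|) is an assumption:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rme(y_true, y_pred):
    """Relative (mean) error, taken here as the mean of |error| / |target|
    (the slides do not specify the normalization)."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
```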
Traditional Regression:

model      | best parameters | RMSE   | RME
AR(p)      | p = 7           | 643.83 | 0.0286
ARMA(p, q) | p = 10, q = 3   | 575.92 | 0.0265
polynomial | n = 5           | 978.49 | 0.0428
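For reference, an AR(p) model of the kind reported above can be fit by ordinary least squares on lagged values; this is a generic sketch, not the exact procedure behind the numbers in the table:

```python
import numpy as np

def lagged_design(y, p):
    """Design matrix whose row t is [y_{t-1}, ..., y_{t-p}, 1], aligned with targets y_p..y_{L-1}."""
    cols = [y[p - i - 1:len(y) - i - 1] for i in range(p)]
    return np.column_stack(cols + [np.ones(len(y) - p)])

def fit_ar(y, p):
    """AR(p) coefficients (plus intercept) by ordinary least squares."""
    coef, *_ = np.linalg.lstsq(lagged_design(y, p), y[p:], rcond=None)
    return coef

def ar_predict(y, coef, p):
    """One-step-ahead in-sample predictions for t >= p."""
    return lagged_design(y, p) @ coef
```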
Data length = 39768, mean = 1.6158e+004, std = 5.2679e+003, max = 3.2427e+004, min = 0.5589e+004.
Requirement of generalization error < 1% (Is it possible?)
RBFs to approximate the weekly demand for the whole data trace: n = 180, RME = 0.0103 (using Gaussian basis functions).
RBFs to approximate the hourly demand per week: n = 48, RME = 0.0109 (using Gaussian basis functions).
How about the overall performance of the RBFs?
It is not easy to train a large RBF network that approximates the data trace over the whole time horizon. Using the generalization error bound, one can check that the RME will not be below 1% with 95% confidence.
In fact, the sample complexity need not be very large for a small network size (e.g., good prediction for a short period of time after training).
Two ways to deal with the growing network size: on-line tuning and multiple models for different prediction regions.
The first approach may forget the "history" and only has localized prediction capability, so we will discuss the second approach. For instance, we have 236 patterns of hourly demand per week, each of which has good localized prediction power. When we get an unknown time series, we first determine which pattern the trace belongs to and then choose the model with the maximal likelihood to perform the prediction (see the sketch below). Of course, one can cluster the patterns or choose variable-length patterns to improve performance.
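A minimal sketch of the selection step; the Gaussian residual likelihood and the model interface are our assumptions, since the slides only say "maximal likelihood":

```python
import numpy as np

def select_model(models, x_recent, y_recent):
    """Pick the candidate model whose predictions best explain the most recent observations,
    scored by a Gaussian log-likelihood of the residuals (assumed criterion)."""
    best, best_ll = None, -np.inf
    for m in models:                              # each m is assumed to expose m.predict(x)
        r = y_recent - m.predict(x_recent)
        sigma2 = max(np.mean(r ** 2), 1e-12)      # residual variance, floored for stability
        ll = -0.5 * len(r) * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(r ** 2) / sigma2
        if ll > best_ll:
            best, best_ll = m, ll
    return best
```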
Benchmark test:
Use the hourly electrical demand in 1997 (8760 points) to test the prediction ability after training. Predictions are made 48 hours ahead of the current time.
model                  | RMSE        | RME
AR(p)                  | 1.7895e+004 | 0.1244
ARMA(p, q)             | 3.7729e+003 | 0.0963
multilayer perceptron+ | 1.1817e+003 | 0.0539
multi-model RBFs*      | 660.833     | 0.0372
+The MLP also uses the history of temperature and humidity as inputs.
*The RBFs use only the electrical demand data trace; 25 models are chosen as candidates to match the hourly demand per week, each of which is trained by the incremental algorithm.
Concluding remarks:
Acknowledgement
One of the project teammates, Huimin Chen, would like to thank Ms. Shuqing Huang of Huawei Ltd., Shenzhen, China for her constructive ideas on incremental training for neural networks. He would also like to give special thanks to Mr. Xin Lu of Tsinghua University, China for his valuable comments on the statistical approach to bounding the generalization error, as well as his comments on algorithm comparison from an experimental design point of view.
Parts of the theoretical work are also Huang's and Lu's contributions, which cannot be discussed thoroughly here due to the time limit.