On the Relationship between Generalization Error and Sample Complexity for Radial Basis Functions and Incremental Training Algorithm

Outline of the presentation:

  1. Notations and basic concepts of learning from a statistical point of view: estimation error, generalization error, and empirical risk minimization
  2. Generalization error bound for radial basis functions (RBF): two parts—approximation error and estimation error
  3. Insights into neural network training: an incremental training algorithm for choosing an appropriate network size
  4. Discussion of a real application: data pre-analysis before neural network training, generalization error bound estimation, and a performance test on electrical demand data

Introduction of basic concepts and notations

There are many heuristic techniques in the neural network literature for performing various tasks within the supervised learning paradigm, such as optimizing training, selecting an appropriate network size, and predicting how many data points are required to achieve a certain generalization error. Our results are mainly based on RBF networks; however, similar formulas can be applied to MLPs.

Traditional point of view: center parameters fixed before training, theory of linear models, bias-variance dilemma, cross-validation, center selection, localized basis functions, curse of dimensionality…

Probably approximately correct (PAC) framework: The concept class F is PAC learnable by a network if, for all concepts f ∈ F and for all distributions P_X, when the network is given at least p(N, 1/ε, 1/δ) training data, where p is a polynomial, the network can produce a hypothesis h such that

Pr(error(h, f) > ε) < δ

Think of δ as a measure of confidence and ε as an error tolerance; this is a worst-case definition.
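As a toy illustration of the PAC definition (not part of the original slides), the sketch below empirically estimates Pr(error(h, f) > ε) for learning a simple threshold concept on [0, 1]; the concept class and the ERM learner here are assumptions chosen only to make the definition concrete.

```python
import random

def pac_threshold_demo(m=200, eps=0.1, delta=0.05, trials=500):
    """Estimate Pr(error(h, f) > eps) for learning the threshold concept
    f(x) = 1 iff x >= 0.5 under the uniform distribution on [0, 1].
    The learner (an assumption for illustration) returns the smallest
    positive training example as its threshold."""
    failures = 0
    for _ in range(trials):
        xs = [random.random() for _ in range(m)]
        positives = [x for x in xs if x >= 0.5]
        h = min(positives) if positives else 1.0  # ERM hypothesis threshold
        true_error = h - 0.5                      # mass of the disagreement region
        if true_error > eps:
            failures += 1
    return failures / trials

print(pac_threshold_demo())  # with m = 200 this is far below delta = 0.05
```

With m large enough (polynomial in 1/ε and 1/δ), the empirical failure rate stays below δ, which is exactly what the PAC guarantee promises.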

Regression Estimation Model: The task is defined by a distribution P_{X×Y} on the input-output space, i.e., each input-output pair (x, y) is drawn from P_{X×Y}. The approximation error is the distance between the target function and the best network in the chosen hypothesis class.

The measure of performance of an RBF network: the average squared error between the prediction and the target (the expected risk)

err_ave[f] = E[(y − f(x))²] = ∫ (y − f(x))² dP_{X×Y}(x, y)

The expected risk can be decomposed into two parts

err_ave[f] = E[(f0(x) − f(x))²] + E[(y − f0(x))²]

where f0(x), the regression function, is the conditional mean of the output given the input. In practice, err_ave[f] is unavailable to us because P_{X×Y} is unknown.
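This decomposition can be checked numerically. A minimal Monte Carlo sketch, assuming a toy regression function f0(x) = sin(x), additive Gaussian noise, and a deliberately imperfect predictor f(x) = x (all assumptions for illustration):

```python
import math
import random

def risk_decomposition_demo(n=100000, noise_std=0.3):
    """Monte Carlo check of the decomposition
    E[(y - f(x))^2] = E[(f0(x) - f(x))^2] + E[(y - f0(x))^2]
    for a toy task: f0(x) = sin(x), y = f0(x) + Gaussian noise,
    and the imperfect predictor f(x) = x."""
    def f0(x):
        return math.sin(x)

    def f(x):
        return x

    expected_risk = approx_term = noise_term = 0.0
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        y = f0(x) + random.gauss(0.0, noise_std)
        expected_risk += (y - f(x)) ** 2       # left-hand side
        approx_term += (f0(x) - f(x)) ** 2     # distance to the regression function
        noise_term += (y - f0(x)) ** 2         # irreducible noise variance
    return expected_risk / n, (approx_term + noise_term) / n

lhs, rhs = risk_decomposition_demo()
print(lhs, rhs)  # the two sides agree up to Monte Carlo error
```

The cross term E[(y − f0(x))(f0(x) − f(x))] vanishes because the noise has zero conditional mean, which is why the two estimates coincide up to sampling error.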

What to do: use empirical risk instead of the expected risk

Assume we have l data points as training examples; then the empirical risk is

err_emp[f] = (1/l) Σ_{i=1}^{l} (y_i − f(x_i))²

Question: What is the relationship between generalization error and empirical risk?

First, bound the approximation error

Second, bound the estimation error

Under certain mild assumptions, the generalization error is bounded by the sum of these two terms.

Apply this framework to RBFs: for any δ ∈ (0, 1), for an RBF network with n nodes and l training points of k input dimensions each, with probability greater than 1 − δ,

E[(f0(x) − f_{n,l}(x))²] ≤ O(1/n) + O([(nk ln(nl) + ln(1/δ)) / l]^{1/2})

Comments: The first term is the approximation error, which is of no further interest to us here. The second term is the estimation error. For a fixed network it is governed by the number of training patterns and decreases as O([ln l / l]^{1/2}). This is considerably slower than the VC-dimension approach for Gaussian RBFs, where the estimation error scales as 1/l; that is the price paid for (almost) distribution-free bounds.

Choice of neurons: assuming we have enough training data l, one can calculate the optimal network size n*(l); in fact, n*(l) satisfies the first-order condition obtained by balancing the two terms of the bound.

For large n, we also have n* ∝ (l / (k ln l))^{1/3}.
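A rough numerical sketch of this trade-off, assuming a bound of the form 1/n + sqrt(nk ln(nl)/l) with all constants set to one (the true constants are unknown in practice):

```python
import math

def optimal_network_size(l, k, c_approx=1.0, c_est=1.0):
    """Pick the RBF size n minimizing a bound of the form
    c_approx / n + c_est * sqrt(n * k * log(n * l) / l)
    by direct search; unit constants are an illustrative assumption."""
    def bound(n):
        return c_approx / n + c_est * math.sqrt(n * k * math.log(n * l) / l)
    return min(range(1, l + 1), key=bound)

# the minimizer grows roughly like (l / (k * ln l))^(1/3)
for l in (1000, 10000, 100000):
    n_star = optimal_network_size(l, k=2)
    print(l, n_star, (l / (2 * math.log(l))) ** (1 / 3))
```

The printed values show the optimal size growing slowly with l, consistent with the cube-root scaling above.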

It must again be emphasized that these results depend on finding the best possible estimator for a given data set size, and are based on worst-case bounds that require almost no knowledge of the input distribution. The generalization error bound can be viewed as a benchmark that a well-chosen RBF network should at least match. In practice, one can seek better performance through pre-analysis of the training data; hopefully statistical analysis will supply some prior knowledge.

What steps should we follow?

Good or bad?

Sometimes we cannot expect to find the global minimum when training a neural network; a sub-optimum will suffice. We may get stuck in local minima and not know whether we need to add neurons.

Furthermore, the error bound does not assume any specific form of input distribution, and we would like to reduce the number of neurons by exploiting the statistics of the training data.

Incremental training algorithm:

Good or bad?

Each new training epoch does not reuse the previous result. As n0 increases, we face the curse of dimensionality.

We do not want to discard all the historical results. There are two ways to use the history: initialize the network parameters with the previous training result, or fix some parameters and train only the new neuron and the output weights.

The latter seems promising, though the result may no longer be the best parameter set for an RBF network with n0 neurons.
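A minimal sketch of the "fix the old parameters, train only the new neuron" strategy. The greedy placement rule (put the new center at the sample with the largest residual and set its weight to cancel that residual) is an assumption for illustration, not the exact algorithm in the slides:

```python
import math

def incremental_rbf_fit(xs, ys, n_neurons, width=0.5):
    """Greedy incremental Gaussian-RBF fit: each stage freezes the
    existing neurons, places one new neuron at the sample with the
    largest residual, and sets only the new output weight."""
    centers, weights = [], []

    def predict(x):
        return sum(w * math.exp(-((x - c) / width) ** 2)
                   for c, w in zip(centers, weights))

    for _ in range(n_neurons):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        i = max(range(len(xs)), key=lambda j: abs(residuals[j]))
        centers.append(xs[i])          # new neuron center: t_{n+1}
        weights.append(residuals[i])   # new output weight: b_{n+1}
    return predict

# toy usage: fit one period of a sine from 21 samples
xs = [i / 20 for i in range(21)]
ys = [math.sin(2 * math.pi * x) for x in xs]
model = incremental_rbf_fit(xs, ys, n_neurons=8, width=0.15)
err = max(abs(model(x) - y) for x, y in zip(xs, ys))
print(err)  # training error shrinks as neurons are added
```

Retraining all the output weights at each stage (e.g. by least squares) would generally do better; this sketch only tunes the newest neuron, matching the cheaper of the two options above.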

For example, assume the set X is the k-dimensional Euclidean space and Y is the bounded interval [−M, M]. The radial basis functions used for approximation can be written as g(x; t) = exp(−||x − t||²), and the finite series approximation has the form f_n(x) = Σ_{i=1}^{n} b_i g(x; t_i). When adding the (n+1)th neuron, the tuning parameters for the new training epoch are b_i (i = 1, 2, …, n+1) and t_{n+1}. Then what can we say about the approximation accuracy of incremental training?

Claim: Let F be the concept class and (F, ||·||₂) be a Hilbert space with a subset G. Then for every f ∈ F and n > 0, we have

|| f − span_n G ||₂² ≤ [V(f, G⁰)² − ||f||²] / n

where V(f, G) = inf{ B ∈ ℝ⁺ : f ∈ cl conv G(B) } and also

V(f, G⁰) = inf_{α: G→ℝ} V(f, G_α) with sup_{g∈G} ||α(g)g|| < ∞.

The proof is quite involved and is not our contribution. In practice, we hope the RBF basis functions satisfy the G-variation condition above.

Now, we move to a real application and see how to use the theoretical results for electrical demand prediction.

Data set:

hourly demand of NU from Jan. 1, 1994 to Jul. 15, 1998

temperature and humidity for each hour, also collected from Jan. 1, 1994 to Jul. 15, 1998

What is next?

Data pre-analysis before training

Try to see the average performance for the overall data trace.

Some intuitive ideas about the data trace

Performance measure: Root mean square error (RMSE), relative (mean) error (RME).
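For concreteness, the two measures can be computed as follows; the slides do not spell out the RME normalization, so dividing the RMSE by the mean absolute demand level is an assumption here:

```python
import math

def rmse(predictions, targets):
    """Root mean square error between prediction and target series."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets))
                     / len(targets))

def rme(predictions, targets):
    """Relative (mean) error: RMSE normalized by the mean absolute
    target level (one common convention, assumed here)."""
    mean_level = sum(abs(t) for t in targets) / len(targets)
    return rmse(predictions, targets) / mean_level

preds, targs = [1.1, 1.9, 3.2], [1.0, 2.0, 3.0]
print(round(rmse(preds, targs), 4), round(rme(preds, targs), 4))
```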

Hence we do not try to explore the local behavior of learnability.

Traditional Regression:

  model        best parameters   RMSE     RME
  AR(p)        p = 7             643.83   0.0286
  ARMA(p, q)   p = 10, q = 3     575.92   0.0265
  polynomial   n = 5             978.49   0.0428

data length = 39768, mean = 1.6158e+004, std = 5.2679e+003,
max = 3.2427e+004, min = 0.5589e+004

Requirement: generalization error < 1% (is this achievable?)

RBFs approximating weekly demand for the whole data trace (Gaussian basis functions): n = 180, RME = 0.0103

RBFs approximating hourly demand per week (Gaussian basis functions): n = 48, RME = 0.0109

How about the overall performance of the RBFs?

It is not easy to train a single large RBF network that approximates the data trace over the whole time horizon. One can evaluate the generalization error bound and see that, with 95% confidence, the RME will not fall below 1%.

In fact, the sample complexity need not be very large for a small network size (e.g., good prediction for a short period after training).

Two ways to deal with the growing network size: on-line tuning and multiple models for different prediction regions.

The first approach may forget the "history" and only has localized prediction capability, so we discuss the second approach. For instance, we have 236 patterns of hourly demand per week, each with good localized prediction power. Given an unknown time series, we first determine which pattern the trace belongs to, choosing the one with maximal likelihood, and then use that model to perform the prediction. Of course, one can cluster the patterns or use variable-length patterns to improve performance.
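A minimal sketch of the multi-model selection step, where each candidate pattern model is a hypothetical (predict function, noise level) pair and a Gaussian likelihood is assumed:

```python
import math

def pick_model(window, models):
    """Among candidate pattern models, choose the one with maximal
    likelihood for an observed demand window. Each model is a
    hypothetical (predict_fn, noise_std) pair; the Gaussian
    likelihood is an assumption for illustration."""
    def log_likelihood(model):
        predict, sigma = model
        ll = 0.0
        for t, y in enumerate(window):
            r = y - predict(t)
            ll += -0.5 * (r / sigma) ** 2 - math.log(sigma)
        return ll
    return max(models, key=log_likelihood)

# toy usage: two candidate "patterns"; the observed window matches the second
models = [(lambda t: 10.0, 1.0), (lambda t: t + 1.0, 1.0)]
window = [1.1, 1.9, 3.2, 4.0]
best = pick_model(window, models)
print(best is models[1])  # True
```

The selected model is then used for prediction over its region; clustering the patterns would simply reduce the number of candidates scored here.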

Benchmark test:

Use the hourly electrical demand in 1997 (8760 points) to test prediction ability after training. The prediction starts 48 hours ahead of the current time horizon.

  model                    RMSE          RME
  AR(p)                    1.7895e+004   0.1244
  ARMA(p, q)               3.7729e+003   0.0963
  multilayer perceptron+   1.1817e+003   0.0539
  multi-model RBFs*        660.833       0.0372

+The MLP also uses the history of temperature and humidity as inputs.

*The RBFs use only the electrical demand trace; 25 models are chosen as candidates to match the hourly demand per week, each of which is trained by the incremental algorithm.

Concluding remarks:

Acknowledgement

One of the project teammates, Huimin Chen, would like to thank Ms. Shuqing Huang of Huawei Ltd., Shenzhen, China for her constructive ideas on incremental training for neural networks. He would also like to give special thanks to Mr. Xin Lu of Tsinghua University, China for his valuable comments on the statistical approach to bounding the generalization error, as well as comments on algorithm comparison from an experimental design point of view.

Parts of the theoretical work are also Huang and Lu's contribution, which cannot be discussed in a thorough manner due to the time limit.

