On the Relationship between Generalization Error and Sample Complexity for Radial Basis Functions and Incremental Training Algorithm
Outline of the presentation:
Introduction of basic concepts and notations
There are many heuristic techniques in the neural network literature for performing various tasks within the supervised learning paradigm, such as optimizing training, selecting an appropriate network size, and predicting how much data is required to achieve a given generalization error. Our results are mainly based on RBF networks; however, similar formulas can be applied to MLPs.
Traditional Point of View: Center parameters are fixed before training, theory of linear model, bias-variance dilemma, cross-validation, center selection, localized basis function, curse of dimensionality
Probably approximately correct (PAC) framework: The concept class $F$ is PAC learnable by a network if, for every concept $f \in F$ and every distribution $P_X$, whenever the network is given at least $p(N, 1/\epsilon, 1/\delta)$ training data, where $p$ is a polynomial, the network can output a hypothesis $h$ such that

$\Pr(\mathrm{error}(h, f) > \epsilon) < \delta.$

Think of $\delta$ as a measure of confidence and $\epsilon$ as an error tolerance; this is a worst-case definition.
Regression Estimation Model: The task is defined by the distribution $P_{X \times Y}$ on a function space, i.e., the input-output pair $(x, y)$ is drawn from $P_{X \times Y}$. The approximation error is defined as the difference between the network with the best possible parameter set and the target function to be approximated.
The measure of performance of the RBF network: the average squared error between the prediction and the target (the expected risk),

$\mathrm{err}_{\mathrm{ave}}[f] = E[(y - f(x))^2] = \int_{X \times Y} (y - f(x))^2 \, dP_{X \times Y}(x, y).$
The expected risk can be decomposed into two parts:

$\mathrm{err}_{\mathrm{ave}}[f] = E[(f_0(x) - f(x))^2] + E[(y - f_0(x))^2],$

where $f_0(x)$, the regression function, is the conditional mean of the output given a particular input. In practice, $\mathrm{err}_{\mathrm{ave}}[f]$ is unavailable to us because $P_{X \times Y}$ is unknown.
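This decomposition holds because the cross term vanishes; a short standard derivation, added here only for completeness:

```latex
\begin{align*}
E[(y - f(x))^2]
  &= E[(y - f_0(x) + f_0(x) - f(x))^2] \\
  &= E[(y - f_0(x))^2] + E[(f_0(x) - f(x))^2]
     + 2\,E[(y - f_0(x))(f_0(x) - f(x))] \\
  &= E[(y - f_0(x))^2] + E[(f_0(x) - f(x))^2],
\end{align*}
```

since $E[(y - f_0(x))(f_0(x) - f(x))] = E_x\big[(f_0(x) - f(x))\,E[y - f_0(x)\mid x]\big] = 0$ by the definition $f_0(x) = E[y\mid x]$.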
What to do: use the empirical risk instead of the expected risk.
Assume we have $l$ training examples $(x_i, y_i)$, $i = 1, \dots, l$; then the empirical risk is

$\mathrm{err}_{\mathrm{emp}}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2.$
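As a concrete illustration, a minimal sketch of the empirical risk for a Gaussian RBF model; the function and variable names are ours, not from the presentation:

```python
import numpy as np

def rbf_predict(X, centers, widths, weights):
    """Gaussian RBF network: f(x) = sum_i w_i * exp(-||x - t_i||^2 / s_i^2)."""
    # Squared distances between every input (rows of X) and every center: shape (l, n)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / widths ** 2) @ weights

def empirical_risk(X, y, centers, widths, weights):
    """Empirical risk (1/l) * sum_i (y_i - f(x_i))^2 over the l training pairs."""
    residual = y - rbf_predict(X, centers, widths, weights)
    return np.mean(residual ** 2)
```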
Question: What is the relationship between generalization error and empirical risk?
First, bound the approximation error
Second, bound the estimation error
Under certain mild assumptions, the generalization error is bounded by the sum of an approximation-error term and an estimation-error term.

Applying this framework to RBFs: for any $\delta \in (0, 1)$, for an RBF network with $n$ nodes trained on $l$ points with $k$ input dimensions each, with probability greater than $1 - \delta$,

$\mathrm{err}_{\mathrm{ave}} \le O\!\left(\frac{1}{n}\right) + O\!\left(\left[\frac{nk\ln(nl) + \ln(1/\delta)}{l}\right]^{1/2}\right).$
Comments: The first term is the approximation error, which is of no interest to us here. The second term is the estimation error. For a fixed network, it is governed by the number of training patterns and decreases as $O([\ln l / l]^{1/2})$. This is considerably slower than the VC-dimension approach for Gaussian RBFs, where the estimation error scales as $1/l$; this is the price paid for (almost) distribution-free bounds.
Choice of the number of neurons: Assuming we have enough training data $l$, one can calculate the optimal network size $n^*(l)$; in fact, $n^*(l)$ is obtained by minimizing the above bound over $n$, and for large $n$ we have $n^* \propto \left[l/(k \ln l)\right]^{1/3}$.
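A small numeric sketch of the estimation term and the rule of thumb above; the hidden constants in the O(·) bounds are unknown, so this only tracks orders of magnitude, and the helper names are ours:

```python
import numpy as np

def estimation_term(n, l, k, delta=0.05, c=1.0):
    """Estimation-error term [ (n*k*ln(n*l) + ln(1/delta)) / l ]^(1/2),
    up to an unknown constant c."""
    return c * np.sqrt((n * k * np.log(n * l) + np.log(1.0 / delta)) / l)

def optimal_size_rule(l, k):
    """Rule of thumb for the optimal number of neurons: n proportional to (l / (k ln l))^(1/3)."""
    return (l / (k * np.log(l))) ** (1.0 / 3.0)

# Illustration only: l = 39768 hourly samples; k = 3 inputs is an assumption
print(optimal_size_rule(39768, 3))   # about 11, up to the unknown proportionality constant
```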
It must again be emphasized that these results depend on finding the best possible estimator for a given size of data set, and are based on worst-case bounds which require almost no knowledge of the input distribution. The generalization error bound can be viewed as a benchmark that a well-chosen RBF should at least meet. In practice, one can seek better performance by pre-analysis of the training data; hopefully we will have some prior knowledge from statistical analysis.
What steps should we follow?
Good or bad?
Sometimes we cannot expect to find the global minimum when training a NN; a sub-optimum will suffice. We may get stuck in a local minimum and not know whether we need to add neurons.
Furthermore, the error bound does not assume any specific form of the input distribution, and we would like to reduce the number of neurons by exploiting the statistics of the training data.
Incremental training algorithm:
Good or bad?
Each new training epoch does not use the previous result. As $n_0$ increases, we will face the curse of dimensionality.
We do not want to forget all the historical results. There are two ways to use the history: initialize the network parameters with the previous training result, or fix some parameters and train only the new neuron and the output weights.
The latter seems promising, though it may not yield the best possible parameter set for an RBF with $n_0$ neurons. A sketch of this strategy is given after the setup below.
For example, assume the set $X$ is the $k$-dimensional Euclidean space and $Y$ is the bounded interval $[-M, M]$. The radial basis functions used for approximation can be written as $g(x; t_i)$ with center $t_i$, and the finite series approximation has the form

$f_n(x) = \sum_{i=1}^{n} b_i\, g(x; t_i).$

When adding the $(n+1)$-th neuron, the tuning parameters for the new training epoch are $b_i$, $i = 1, 2, \dots, n+1$, and $t_{n+1}$. Then what can we say about the approximation accuracy of incremental training?
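A minimal sketch of one incremental step under the "fix the old centers, train only the new neuron and the output weights" strategy; the initialization of the new center, the fixed width, the learning rate, and all names are our own assumptions:

```python
import numpy as np

def gaussian(X, t, s=1.0):
    """Gaussian basis g(x; t) = exp(-||x - t||^2 / s^2), evaluated for every row of X."""
    return np.exp(-((X - t) ** 2).sum(axis=1) / s ** 2)

def add_neuron(X, y, centers, lr=0.01, steps=200):
    """Add the (n+1)-th neuron: keep the existing centers fixed, adapt the new center
    t_{n+1} by gradient descent, and refit the output weights b_1..b_{n+1} by least squares."""
    t_new = X[np.argmax(np.abs(y))].astype(float)        # crude initialization of the new center
    for _ in range(steps):
        G = np.column_stack([gaussian(X, t) for t in centers] + [gaussian(X, t_new)])
        b, *_ = np.linalg.lstsq(G, y, rcond=None)        # output weights in closed form
        r = G @ b - y                                    # residual of the current model
        # Gradient of sum(r^2) with respect to the new center (width s = 1 assumed)
        grad = (2.0 * b[-1] * (r * G[:, -1])[:, None] * 2.0 * (X - t_new)).sum(axis=0)
        t_new -= lr * grad
    return list(centers) + [t_new], b
```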
Claim: Let $F$ be the concept class and $(F, \|\cdot\|_2)$ be a Hilbert space with a subset $G$. Then for every $f \in F$ and $n > 0$, we have

$\| f - \mathrm{span}_n G \|_2^2 \le \frac{V(f, G_0)^2 - \|f\|^2}{n},$

where $V(f, G) = \inf\{ B \in \mathbb{R}^+ : f \in \mathrm{cl\,conv}\, G(B) \}$ and also

$V(f, G_0) = \inf_{a: G \to \mathbb{R}} V(f, G_a), \quad \sup_{g \in G} \|a(g)\, g\| < \infty.$
The proof is quite complicated and is not our contribution. In practice, we hope the RBF basis functions satisfy the G-variation condition above.
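To spell out the consequence of the claim (a standard reading of Maurey-Jones-Barron-type bounds, not a new result here): whenever $V(f, G_0)$ is finite, the $n$-term approximation error satisfies

```latex
\left\| f - \operatorname{span}_n G \right\|_2^2
  \;\le\; \frac{V(f, G_0)^2 - \|f\|^2}{n} \;=\; O\!\left(\frac{1}{n}\right),
```

so the rate in $n$ does not deteriorate with the input dimension $k$; any dimension dependence is hidden in the size of $V(f, G_0)$.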
Now, we move to a real application and see how to use the theoretical results for electrical demand prediction.
Data set:
hourly demand of NU from Jan. 1, 1994 to Jul. 15, 1998
temperature and humidity for each hour, also collected from Jan. 1, 1994 to Jul. 15, 1998
What is next?
Data pre-analysis before training
Try to see the average performance for the overall data trace.
Some intuitive ideas about the data trace
Performance measures: root mean square error (RMSE) and relative (mean) error (RME).
Since these are average measures, we do not try to explore the local behavior of learnability.
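A minimal sketch of the two measures; the exact normalization of RME is not spelled out in the slides, so the definition below (mean of |error| / |target|) is an assumption:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rme(y_true, y_pred):
    """Relative (mean) error, taken here as the mean of |error| / |target|
    (the slides do not specify the normalization)."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
```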
Traditional Regression:

model      | best parameters | RMSE   | RME
AR(p)      | p = 7           | 643.83 | 0.0286
ARMA(p, q) | p = 10, q = 3   | 575.92 | 0.0265
polynomial | n = 5           | 978.49 | 0.0428
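For reference, an AR(p) model of the kind reported above can be fit by ordinary least squares on lagged values; this is a generic sketch, not the exact procedure behind the numbers in the table:

```python
import numpy as np

def lagged_design(y, p):
    """Design matrix whose row t is [y_{t-1}, ..., y_{t-p}, 1], aligned with targets y_p..y_{L-1}."""
    cols = [y[p - i - 1:len(y) - i - 1] for i in range(p)]
    return np.column_stack(cols + [np.ones(len(y) - p)])

def fit_ar(y, p):
    """AR(p) coefficients (plus intercept) by ordinary least squares."""
    coef, *_ = np.linalg.lstsq(lagged_design(y, p), y[p:], rcond=None)
    return coef

def ar_predict(y, coef, p):
    """One-step-ahead in-sample predictions for t >= p."""
    return lagged_design(y, p) @ coef
```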
Data length = 39768, mean = 1.6158e+004, std = 5.2679e+003, max = 3.2427e+004, min = 0.5589e+004.
Requirement of generalization error < 1% (Is it possible?)
RBFs to approximate the weekly demand for the whole data trace: n = 180, RME = 0.0103 (using Gaussian basis functions).
RBFs to approximate the hourly demand per week: n = 48, RME = 0.0109 (using Gaussian basis functions).
How about the overall performance of the RBFs?
It is not easy to train a large RBF network that approximates the data trace over the whole time horizon. Using the generalization error bound, one can check that the RME will not be below 1% with 95% confidence.
In fact, the sample complexity need not be very large for a small network size (e.g., good prediction for a short period of time after training).
Two ways to deal with the growing network size: on-line tuning and multiple models for different prediction regions.
The first approach may forget the "history" and only has localized prediction capability, so we will discuss the second approach. For instance, we have 236 patterns of hourly demand per week, each of which has good localized prediction power. When we get an unknown time series, we first determine which pattern the trace belongs to and then choose the model with the maximal likelihood to perform the prediction (see the sketch below). Of course, one can cluster the patterns or choose variable-length patterns to improve performance.
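A minimal sketch of the selection step; the Gaussian residual likelihood and the model interface are our assumptions, since the slides only say "maximal likelihood":

```python
import numpy as np

def select_model(models, x_recent, y_recent):
    """Pick the candidate model whose predictions best explain the most recent observations,
    scored by a Gaussian log-likelihood of the residuals (assumed criterion)."""
    best, best_ll = None, -np.inf
    for m in models:                              # each m is assumed to expose m.predict(x)
        r = y_recent - m.predict(x_recent)
        sigma2 = max(np.mean(r ** 2), 1e-12)      # residual variance, floored for stability
        ll = -0.5 * len(r) * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(r ** 2) / sigma2
        if ll > best_ll:
            best, best_ll = m, ll
    return best
```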
Benchmark test:
Use the hourly electrical demand in 1997 (8760 points) to test the prediction ability after training. Predictions are made 48 hours ahead of the current time.
model                  | RMSE        | RME
AR(p)                  | 1.7895e+004 | 0.1244
ARMA(p, q)             | 3.7729e+003 | 0.0963
multilayer perceptron+ | 1.1817e+003 | 0.0539
multi-model RBFs*      | 660.833     | 0.0372
+The MLP also uses the history of temperature and humidity as inputs.
*The RBFs use only the electrical demand data trace; 25 models are chosen as candidates to match the hourly demand per week, each of which is trained by the incremental algorithm.
Concluding remarks:
Acknowledgement
One of the project teammates, Huimin Chen, would like to thank Ms. Shuqing Huang of Huawei Ltd., Shenzhen, China for her constructive ideas on incremental training for neural networks. He would also like to give special thanks to Mr. Xin Lu of Tsinghua University, China for his valuable comments on the statistical approach to bounding the generalization error, as well as his comments on algorithm comparison from an experimental design point of view.
Parts of the theoretical work are also Huang's and Lu's contributions, which cannot be discussed thoroughly here due to the time limit.