Download a free evaluation copy of NeuroSolutions to discover how to apply neural network technology to your artificial intelligence application.

contents.gifindex.gifprev1.gifnext1.gif

1.7 Regression for Multiple Variables

Assume that d is now a function of several inputs x1, x2,...xD (independent variables), and the goal is to find the best linear regressor of d on all the inputs (Figure 1-13). For D =2 this corresponds to fitting a plane through the N input samples or a hyperplane in the general case of D dimensions.

NEURAL AND ADAPTIVE SYSTEMS00000069.gif

Figure 1-13 Fitting a regression plane to a set of samples in 2D space.


As an example, let us assume that we have two variables x1 (speed) and x2 (feed rate) that affect the surface roughness (d) of a machined workpiece. In abstract units the values of x1, x2, d for 15 workpieces are presented in Table1-2.

Table 1-2 Multiple Regression Data

NEURAL AND ADAPTIVE SYSTEMS00000070.gif

The goal is to find how well one can explain the quality of machining by the two variables x1 and x2, and which is the most important parameter.

As before, we assume that the measurements x are noise free and that d is contaminated by a noise vector NEURAL AND ADAPTIVE SYSTEMS00090000.gifNEURAL AND ADAPTIVE SYSTEMS00090006.gifwith these properties: Gaussian distributed with components that are zero mean, of equal variance NEURAL AND ADAPTIVE SYSTEMS00090014.gif², and uncorrelated with the inputs. The regression equation when D =2 is now

NEURAL AND ADAPTIVE SYSTEMS00000071.gif (1.22)

where xi1 is the ith value of x1 (the ith workpiece in the training set). In the general case we write Eq. 1.22 as

NEURAL AND ADAPTIVE SYSTEMS00000072.gif (1.23)

where we made w0 =b and xi0 =1 (compare with Eq. 1.3). The goal of the regression problem is to find the coefficients w0, ….wD. To simplify the notation we will put all these values into a vector w = [w0, ….wD ]NEURAL AND ADAPTIVE SYSTEMS00000033.gif that minimizes the MSE of NEURAL AND ADAPTIVE SYSTEMS00090000.gifi over the N samples. Figure 1-14 shows that the linear PE now has D inputs and one bias.

NEURAL AND ADAPTIVE SYSTEMS00000073.gif

Figure 1-14 Regression system for multiple inputs


The mean square error (MSE) becomes for this case

NEURAL AND ADAPTIVE SYSTEMS00000074.gif (1.24)

The solution to the extreme (minimum) of this equation can be found in exactly the same way as before, that is, by taking the derivatives of J with respect to the unknowns (wk), and equating the result to zero (derivation of normal equations).

This yields a set of D+1 equations in D+1 unknowns called the famous normal equations

NEURAL AND ADAPTIVE SYSTEMS00000075.gif (1.25)

The normal equations can be written much more compactly with matrix notation (see the Appendix A section 3) Let us define

NEURAL AND ADAPTIVE SYSTEMS00000076.gif (1.26)

as the autocorrelation of the input samples for indices k and j. As you can see, the autocorrelation measures similarity across the samples of the training set. When k =j, R is just the sum of the squares of the input samples (the variance if the data is zero mean). When k differs from j, R measures the sum of the cross products for every possible combination of the indices. As we did for w, we will also put all these Rkj values into a matrix R, that is,

NEURAL AND ADAPTIVE SYSTEMS00000077.gif.

Thus we obtain pairwise information about the structure of the data set.

Let us call

NEURAL AND ADAPTIVE SYSTEMS00000078.gif (1.27)

the cross-correlation of the input x for index j and desired response d, which can also be put into a vector p of dimension D+1. Pj measures the similarity between the input x and the desired response d at shift j. Substituting these definitions in Eq. 1.25, the set of normal equations can be written simply

NEURAL AND ADAPTIVE SYSTEMS00000079.gif (1.28)

where w is a vector with the D+1 weights wi. w* represents the value of the D+1 weights for the optimal (minimum) solution. NEURAL AND ADAPTIVE SYSTEMS00000080.gifdenotes the inverse of the autocorrelation matrix (see Appendix A Section 3.11). Eq. 1.28 states that the solution of the multiple regression problem can be computed analytically as the product of the inverse of the autocorrelation matrix of the input samples and the cross-correlation vector between the input and the desired response. The least square solution for this problem yields

NEURAL AND ADAPTIVE SYSTEMS00000081.gif

It is remarkable that we are able to write an equation that describes the relationship between the two variables when only measured data samples were given. This attests the power of linear regression. But as for the single variable case, we still do not know how accurately the equation fits the data, that is, how much of the variance of the input is actually captured by the regression model. The multiple correlation coefficient rm can also be defined in the multiple dimensional case for a single output, as multiple variable correlation coefficient

NEURAL AND ADAPTIVE SYSTEMS00000082.gif (1.29)

and measures the amount of variation explained by the linear regression, normalized by the variance of d. In this expression d is the vector built from the desired responses di, and U is a matrix whose columns are the input data vectors. For this case rm=0.68, so there is a large portion of the variability that is not explained by the linear regression. (Either the process is nonlinear, or there are more variables involved.) We can still approximate the correlation coefficient for the multiple regression case by Eq. 1.14 after the system has adapted.

NEUROSOLUTIONS EXAMPLE 1.15

Multivariable Regression

Moving to multi-dimensional inputs is very simple in NeuroSolutions. You simply change the input and desired files (for the new input data) and change the input Axon to accept two inputs. The rest is automatic. In this example we do all this for you using macros. Note that in the two-dimensional case the regression line is now a regression plane. In NeuroSolutions, there is not currently a good way of showing a plane in three dimensions, so we will not plot our regression plane plot. When we run the network, we will see that the learning curve (one of our only indications of whether the network is training correctly) decreases steadily and that the weights eventually approach the theoretical optimal weights.

The amazing thing about the adaptive system's methodology is that we changed the problem, but the technique to solve it did not change significantly. It is true that we have to dimension the system properly and choose new values for the step size, but the fundamental aspects of the methodology did not change at all.

NeuroSolutions Example

1.7.1 Setting the Problem as a Search Procedure

All the concepts previously mentioned for linear regression can be extended to the multiple regression case. The performance surface concept can be extended to multidimensions, making J a paraboloid in D+1 dimensions, facing upwards (Figure 1-15 depicts the two weight case). J now involves matrix computations, but it remains a scalar quantity that is a quadratic function of the weights

NEURAL AND ADAPTIVE SYSTEMS00000083.gif (1.30)

where the superscript T means the transpose.

NEURAL AND ADAPTIVE SYSTEMS00000084.gif

Figure 1-15 The performance surface for two dimensions and its contour plot


The coefficients that minimize the solution are

NEURAL AND ADAPTIVE SYSTEMS00000085.gif (1.31)

which gives exactly the same solution as Eq.1.28 (see derivation of the optimal solution). In the space (w1,w2), J is a paraboloid facing upwards. See more performance surface properties.

As in the one dimensional case, the autocorrelation of the input (R) completely specifies the shape of the performance surface Eq.1B.19. However, the location of the performance surface in the space of the weights Eq.1.31 and its minimum value Eq.1B.18 depend also on the desired response.

NEUROSOLUTIONS EXAMPLE 1.16

Checking the LMS Solution with the Optimal Weights

Let us first consider a least square solution with only two weights w1, and w2, since we can still compute it easily by hand. For the data set of Table 1-2, the autocorrelation matrix is Eq. 1.26

NEURAL AND ADAPTIVE SYSTEMS00000086.gif

To determine the eigenvalues, we solve the equation

NEURAL AND ADAPTIVE SYSTEMS00000087.gif

which yields NEURAL AND ADAPTIVE SYSTEMS00090003.gif1 =59 and NEURAL AND ADAPTIVE SYSTEMS00090003.gif2 =1.5. From these results we can immediately see that the eigenvalue spread is roughly 40, so the performance surface paraboloid is very skewed (i.e., much narrower in one direction). The performance surface is shown in the following figure. Notice how it is very steep in one direction and very shallow in the other. Thus, if we train the network with gradient descent, we would expect it to move very quickly down the steep slope at first and then to move slowly down the valley toward the optimum.

NEURAL AND ADAPTIVE SYSTEMS00000088.gif

To compute the optimal solution, we first need to compute the cross-correlation vector Eq. 1.27 as

NEURAL AND ADAPTIVE SYSTEMS00000089.gif

For the two-dimensional case it is still easy to solve for w1 and w2, by writing from Eq. 1.28

NEURAL AND ADAPTIVE SYSTEMS00000090.gif

which gives for optimal weights w1 =0.2848 and w2 =0.2180. The minimum J is 0.390 from Eq.1B.18. When we run the simulator with the Axon substituting the Bias Axon (no bias), the network weights approach these values.

NeuroSolutions Example

1.7.2 Steepest Descent for Multiple Weights

Gradient techniques can also be used to find the minimum of the performance surface, but now the gradient is a vector with D+1 components

NEURAL AND ADAPTIVE SYSTEMS00000091.gif (1.32)

The extension of Eq. 1.11 is

NEURAL AND ADAPTIVE SYSTEMS00000092.gif (1.33)

where all quantities are vectors, i.e. w(k) = [w0(k), …wD(k)]NEURAL AND ADAPTIVE SYSTEMS00000033.gif. In order to calculate the largest step size NEURAL AND ADAPTIVE SYSTEMS00090002.gif, we again rewrite the update equation in the form of

NEURAL AND ADAPTIVE SYSTEMS00000093.gif (1.34)

where I is the identity matrix, R is the input autocorrelation matrix and NEURAL AND ADAPTIVE SYSTEMS00000094.gif. The solution of this equation is cross coupled, that is, the way w converges to w* depends on the behavior of the geometric progression in all the D+1 directions. Therefore, the simple picture of having w(k+1) converge to w* with a single geometric ratio, as in the unidimensional case, has to be modified. One can show that the weights converge with different time constants, each related to an eigenvalue of R (convergence for multiple weights case ).

1.7.3 Step Size Control

As we have seen, the set of values taken by the weight during adaptation is called the weight track. The weight moves in the opposite direction of the gradient at each point, so the weight track depicts the gradient direction at each point of the performance surface visited during adaptation. Therefore the gradient direction tells us about the performance surface shape. In particular, it is useful to construct the contour plot of J since the gradient has to be perpendicular to the lines that link points with the same J value. The contour plot provides a graphical construction for the gradient at each point of the performance surface.

Given a point in a contour, we take the tangent of the contour at the point. The gradient is perpendicular to the tangent, so the weights will move along the gradient line and point in the opposite direction. Likewise, if we run the adaptation algorithm with several initial conditions and we record the value of J at each point, we can determine the contour plots by taking ellipses that pass through the points of equal cost and are perpendicular to the weight tracks.

When the eigenvalues of R are the same (see Appendix A section 3.17), the contour plots are circular and the gradient always points to the center, that is to the minimum. In this case the gradient descent has only a single time constant as in the one-dimensional case. But this is an exceptional condition. In general, the eigenvalues of R will be different. When the eigenvalues are different, the weight track bends because it follows the direction of the gradient at each point, which is perpendicular to the contours (Figure 1-17). The gradient direction no longer points to the minimum, which means that the weight tracks will not be straight lines to the minimum. The adaptation will take longer for two reasons. First, a longer path to the minimum will be taken. Second, the step size must be decreased compared with the circular case. Let us address the step size aspect further.

NEURAL AND ADAPTIVE SYSTEMS00000095.gif

Figure 1-17 Weight track toward the minimum: (a) equal eigenvalues; (b) unequal eigenvalues


For guaranteed convergence the learning rate in each ith principal direction of the performance surface must be

NEURAL AND ADAPTIVE SYSTEMS00000096.gif (1.35)

where NEURAL AND ADAPTIVE SYSTEMS00090003.gifi is the corresponding eigenvalue. The worst case condition to guarantee convergence to the optimal w* in all directions is therefore

NEURAL AND ADAPTIVE SYSTEMS00000097.gif (1.36)

That is, the step size NEURAL AND ADAPTIVE SYSTEMS00090002.gif must be smaller than the inverse of the largest eigenvalue of the autocorrelation matrix. Otherwise the iteration will diverge in one (or more) directions. Since the adaptation is coupled, divergence in one direction will cause the entire system to diverge.

In the early stages of adaptation the convergence is primarily along the direction of the largest eigenvalue, since the weight update along this direction will be bigger. On the other hand, toward the end of adaptation the algorithm will mainly adapt only the weight associated with the smallest eigenvalue (which corresponds to the smallest time constant). The time constant of adaptation is therefore

NEURAL AND ADAPTIVE SYSTEMS00000098.gif (1.37)

An implication of this analysis is that when the eigenvalue spread of R is large, there will be very different time constants of adaptation in each direction. This reasoning gives a clear picture of the fundamental constraint of adapting the weights using gradient descent with a single step size NEURAL AND ADAPTIVE SYSTEMS00090002.gif: The speed of adaptation is controlled by the smallest eigenvalue, while the largest step size is constrained by the inverse of the largest eigenvalue. This means that if the ratio between the largest and the smallest eigenvalues (the eigenvalue spread) is large, the convergence will be intrinsically slow. This problem cannot be avoided when only a single step size is used in the steepest descent.

The learning curve will approach Jmin in a geometric progression as before. However, there will be many different time constants of adaptation, one for each direction. Initially the learning curve will decrease at the rate of the largest eigenvalue, but toward the end of adaptation the rate of decrease of J is controlled by the time constant of the smallest eigenvalue (estimation of eigenvalue spread).

NEUROSOLUTIONS EXAMPLE 1.17

Visualizing the Weight Tracks and Speed of Adaptation

According to our previous calculations, the largest step size for convergence is Eq. 1.36 3.3e-2. The critically damped mode along the largest eigenvector should be 1.6e-2. The time constant of adaptation for the largest step size is around 20 iterations (epochs for batch), that is, the convergence should take 80 epochs with this step size.

When we run the simulations, the algorithm converges first along the direction of the largest eigenvalue (largest eigenvector direction), and then along the direction of the smallest eigenvector. Since the eigenvalue spread is 40, the steps are much bigger along the largest eigenvector direction. If we look at the figure of Example 1.16, we can see that the weights converge perpendicular to the contour plots since this is the steepest descent path. As we will see, there are two distinct regions in the learning curve: In the beginning it is controlled by the geometric ratio along the largest eigenvector, while toward the end it is controlled by the geometric ratio of the smallest eigenvector.

After running this example and observing the weight tracks let us change the input data file to a data set with a smaller eigenvalue spread. Click the input file icon and bring up its Inspector by clicking the right mouse button. Remove the present input file, and add the file regression2a.asc from the \data\chapters\ch1 folder on the CD-ROM.

The modification was made only in the variable x2, all the rest is the same. Respond to the panel associate by clicking on the close button. In the customize panel, skip the desired signal, and click on close. You have just modified the input data to this example. This new file has a much smaller eigenvalue spread, so we can expect that the weight tracks are basically straight lines to the minimum. Compute the new eigenvalue spread, and adjust the learning rates so that the convergence is as fast as possible.

NeuroSolutions Example

1.7.4 The LMS Algorithm for Multiple Weights

It is straightforward to extend the gradient estimation given by the LMS algorithm from one dimension to many dimensions. We just apply the instantaneous gradient estimate Eq.1.12 to each element of Eq.1.33. The LMS for multiple dimensions reads

NEURAL AND ADAPTIVE SYSTEMS00000099.gif (1.38)

An interesting feature is that the LMS adaptation rule still uses local computations, i.e. for the i'th weight we can write

NEURAL AND ADAPTIVE SYSTEMS00000100.gif (1.39)

Note that although the analysis of the gradient descent techniques became complex, the LMS algorithm itself is still very simple. This is one reason why LMS is so widely used. But since LMS is a steepest-descent algorithm, the analysis and discussions concerning the largest step size for convergence and coupling of modes also apply to the LMS algorithm.

NEUROSOLUTIONS EXAMPLE 1.18

Visualizing Weight Tracks with On-Line Learning

In this example we configure the Backprop Controller to on-line learning to implement the LMS algorithm. Notice that the weight tracks follow basically the same path as before, but now the path is much more irregular due to the sample-by-sample update of the weights. When the eigenvalue spread is very large (the performance surface is very steep in one direction and shallow in others), the problem is difficult for LMS to solve. Any small perturbation in the smallest eigenvector direction gets amplified by the large eigenvalue spread.

NeuroSolutions Example

1.7.5 Multiple Regression with Bias

Up to now we have implemented and solved analytically the multiple regression problem without bias. The reason for this is only for simplicity. With two weights we can still easily solve the multiple regression case by hand; however, if the bias is added, we must do the computations with three parameters. The simulations are transparent to these difficulties, since we just substitute the Axon by a Bias Axon. Note that the largest step size between the two cases will differ since the input data was effectively changed if one interprets the bias as a weight connected to an extra constant input of one. Hence the autocorrelation function and its eigenvalue spread changed.

We should state that the use of a bias is called the full least square solution, and it is the recommended way to apply least squares. The reason can be understood easily: When a bias is utilized in the PE, the regression line is not restricted to pass through the origin of the space, and smaller errors are normally achieved. There are two equivalent ways to solve the full least squares problem for D input variables:

· The input and desired responses need to be modified such that they become zero mean variables (this is called the deviation or z scores). In this case a D weight regression will effectively solve the original problem. The bias b is computed indirectly by

NEURAL AND ADAPTIVE SYSTEMS00000101.gif

where wi are the optimal weights and the bars represent mean values.

· Alternatively, the input matrix has to be extended with an extra column of 1s (the first column). This transforms R into a (D+1)x(D+1) matrix, which introduces a D+1 weight in the solution (the bias).

NEUROSOLUTIONS EXAMPLE 1.19

Linear Regression without Bias

We will now substitute a Bias Axon for the Axon in the previous breadboard. This will effectively provide the regression solution without constraining the regression plane to pass through the origin. We see that the weight tracks are very similar in the beginning but that the error continues to drop, and the weights advance towards the w1=0 line. This means that the optimal solution changed. We now have a better solution than before but with increased complexity of the performance surface (four dimensional instead of three) and an increased number of adjustable parameters in our system (two weights and a bias).

NeuroSolutions Example

1.7.6 The LMS Algorithm in Practice

We can use some rules of thumb to choose the step size in the LMS algorithm. The step size should be normalized by the variance of the input data estimated by the trace of R.

NEURAL AND ADAPTIVE SYSTEMS00000102.gif (1.40)

where NEURAL AND ADAPTIVE SYSTEMS00090002.gif0 = 0.1 to 0.01. This normalization by the input variance was the original rule proposed by Widrow to adapt the Adaline. We can expect the algorithm to converge (settling time) in a number of iterations k given by four times the time constant of adaptation:

NEURAL AND ADAPTIVE SYSTEMS00000103.gif (1.41)

The LMS algorithm has a misadjustment that is basically one half the trace of R times NEURAL AND ADAPTIVE SYSTEMS00090002.gifNEURAL AND ADAPTIVE SYSTEMS00090015.gif

NEURAL AND ADAPTIVE SYSTEMS00000104.gif (1.42)

When the eigenvalues are equal, the misadjustment can be approximated by

NEURAL AND ADAPTIVE SYSTEMS00000105.gif

which allows us to give the following rule of thumb: The misadjustment equals the number of weights divided by the settling time, or equivalently, selecting NEURAL AND ADAPTIVE SYSTEMS00090002.gif so that it produces 10% misadjustment means a training duration in iterations of 10 times the number of inputs.

Go to next section