| Download a free evaluation copy of NeuroSolutions to discover how to apply neural network technology to your artificial intelligence application. |
1.7 Regression for Multiple Variables
Assume that d is now a function of several inputs x1, x2,...xD (independent variables), and the goal is to find the best linear regressor of d on all the inputs (Figure 1-13). For D =2 this corresponds to fitting a plane through the N input samples or a hyperplane in the general case of D dimensions.
Figure 1-13 Fitting a regression plane to a set of samples in 2D space.
As an example, let us assume that we have two variables x1 (speed) and x2 (feed rate) that affect the surface roughness (d) of a machined workpiece. In abstract units the values of x1, x2, d for 15 workpieces are presented in Table1-2.
Table 1-2 Multiple Regression Data
The goal is to find how well one can explain the quality of machining by the two variables x1 and x2, and which is the most important parameter.
As
before, we assume that the measurements x are noise
free and that d is contaminated by a noise vector ![]()
with these properties: Gaussian distributed with
components that are zero mean, of equal variance
², and uncorrelated with the inputs. The regression
equation when D =2 is now
(1.22)
where xi1 is the ith value of x1 (the ith workpiece in the training set). In the general case we write Eq. 1.22 as
(1.23)
where we
made w0 =b and xi0 =1
(compare with Eq. 1.3). The goal of
the regression problem is to find the coefficients w0,
.wD. To simplify the
notation we will put all these values into a vector w = [w0,
.wD ]
that minimizes the MSE of
i over the N samples. Figure 1-14
shows that the linear PE now has D inputs and one bias.
Figure 1-14 Regression system for multiple inputs
The mean square error (MSE) becomes for this case
(1.24)
The solution to the extreme (minimum) of this equation can be found in exactly the same way as before, that is, by taking the derivatives of J with respect to the unknowns (wk), and equating the result to zero (derivation of normal equations).
This yields a set of D+1 equations in D+1 unknowns called the famous normal equations
(1.25)
The normal equations can be written much more compactly with matrix notation (see the Appendix A section 3) Let us define
(1.26)
as the autocorrelation of the input samples for indices k and j. As you can see, the autocorrelation measures similarity across the samples of the training set. When k =j, R is just the sum of the squares of the input samples (the variance if the data is zero mean). When k differs from j, R measures the sum of the cross products for every possible combination of the indices. As we did for w, we will also put all these Rkj values into a matrix R, that is,
.
Thus we obtain pairwise information about the structure of the data set.
Let us call
(1.27)
the cross-correlation of the input x for index j and desired response d, which can also be put into a vector p of dimension D+1. Pj measures the similarity between the input x and the desired response d at shift j. Substituting these definitions in Eq. 1.25, the set of normal equations can be written simply
(1.28)
where w
is a vector with the D+1 weights wi. w*
represents the value of the D+1 weights for the optimal (minimum)
solution.
denotes the inverse of the autocorrelation
matrix (see Appendix A Section 3.11).
Eq. 1.28 states that the solution of the multiple regression
problem can be computed analytically as the product of the
inverse of the autocorrelation matrix of the input samples and
the cross-correlation vector between the input and the desired
response. The least square solution for this problem yields
It is remarkable that we are able to write an equation that describes the relationship between the two variables when only measured data samples were given. This attests the power of linear regression. But as for the single variable case, we still do not know how accurately the equation fits the data, that is, how much of the variance of the input is actually captured by the regression model. The multiple correlation coefficient rm can also be defined in the multiple dimensional case for a single output, as multiple variable correlation coefficient
(1.29)
and measures the amount of variation explained by the linear regression, normalized by the variance of d. In this expression d is the vector built from the desired responses di, and U is a matrix whose columns are the input data vectors. For this case rm=0.68, so there is a large portion of the variability that is not explained by the linear regression. (Either the process is nonlinear, or there are more variables involved.) We can still approximate the correlation coefficient for the multiple regression case by Eq. 1.14 after the system has adapted.
NEUROSOLUTIONS EXAMPLE 1.15
Multivariable Regression
Moving to multi-dimensional inputs is very simple in NeuroSolutions. You simply change the input and desired files (for the new input data) and change the input Axon to accept two inputs. The rest is automatic. In this example we do all this for you using macros. Note that in the two-dimensional case the regression line is now a regression plane. In NeuroSolutions, there is not currently a good way of showing a plane in three dimensions, so we will not plot our regression plane plot. When we run the network, we will see that the learning curve (one of our only indications of whether the network is training correctly) decreases steadily and that the weights eventually approach the theoretical optimal weights.
The amazing thing about the adaptive system's methodology is that we changed the problem, but the technique to solve it did not change significantly. It is true that we have to dimension the system properly and choose new values for the step size, but the fundamental aspects of the methodology did not change at all.
1.7.1 Setting the Problem as a Search Procedure
All the concepts previously mentioned for linear regression can be extended to the multiple regression case. The performance surface concept can be extended to multidimensions, making J a paraboloid in D+1 dimensions, facing upwards (Figure 1-15 depicts the two weight case). J now involves matrix computations, but it remains a scalar quantity that is a quadratic function of the weights
(1.30)
where the superscript T means the transpose.
Figure 1-15 The performance surface for two dimensions and its contour plot
The coefficients that minimize the solution are
(1.31)
which gives exactly the same solution as Eq.1.28 (see derivation of the optimal solution). In the space (w1,w2), J is a paraboloid facing upwards. See more performance surface properties.
As in the one dimensional case, the autocorrelation of the input (R) completely specifies the shape of the performance surface Eq.1B.19. However, the location of the performance surface in the space of the weights Eq.1.31 and its minimum value Eq.1B.18 depend also on the desired response.
NEUROSOLUTIONS EXAMPLE 1.16
Checking the LMS Solution with the Optimal Weights
Let us first consider a least square solution with only two weights w1, and w2, since we can still compute it easily by hand. For the data set of Table 1-2, the autocorrelation matrix is Eq. 1.26
To determine the eigenvalues, we solve the equation
which
yields
1 =59 and
2 =1.5. From these results we
can immediately see that the eigenvalue spread is roughly 40, so
the performance surface paraboloid is very skewed (i.e., much
narrower in one direction). The performance surface is shown in
the following figure. Notice how it is very steep in one
direction and very shallow in the other. Thus, if we train the
network with gradient descent, we would expect it to move very
quickly down the steep slope at first and then to move slowly
down the valley toward the optimum.
To compute the optimal solution, we first need to compute the cross-correlation vector Eq. 1.27 as
For the two-dimensional case it is still easy to solve for w1 and w2, by writing from Eq. 1.28
which gives for optimal weights w1 =0.2848 and w2 =0.2180. The minimum J is 0.390 from Eq.1B.18. When we run the simulator with the Axon substituting the Bias Axon (no bias), the network weights approach these values.
1.7.2 Steepest Descent for Multiple Weights
Gradient techniques can also be used to find the minimum of the performance surface, but now the gradient is a vector with D+1 components
(1.32)
The extension of Eq. 1.11 is
(1.33)
where
all quantities are vectors, i.e. w(k) = [w0(k),
wD(k)]
. In order to calculate the largest step size
, we again rewrite the update equation in the form of
(1.34)
where I
is the identity matrix, R is the input autocorrelation
matrix and
. The solution of this equation is
cross coupled, that is, the way w converges to w*
depends on the behavior of the geometric progression in all the D+1
directions. Therefore, the simple picture of having w(k+1)
converge to w* with a single geometric ratio, as in the
unidimensional case, has to be modified. One can show that the
weights converge with different time constants, each related to
an eigenvalue of R (convergence for
multiple weights case ).
1.7.3 Step Size Control
As we have seen, the set of values taken by the weight during adaptation is called the weight track. The weight moves in the opposite direction of the gradient at each point, so the weight track depicts the gradient direction at each point of the performance surface visited during adaptation. Therefore the gradient direction tells us about the performance surface shape. In particular, it is useful to construct the contour plot of J since the gradient has to be perpendicular to the lines that link points with the same J value. The contour plot provides a graphical construction for the gradient at each point of the performance surface.
Given a point in a contour, we take the tangent of the contour at the point. The gradient is perpendicular to the tangent, so the weights will move along the gradient line and point in the opposite direction. Likewise, if we run the adaptation algorithm with several initial conditions and we record the value of J at each point, we can determine the contour plots by taking ellipses that pass through the points of equal cost and are perpendicular to the weight tracks.
When the eigenvalues of R are the same (see Appendix A section 3.17), the contour plots are circular and the gradient always points to the center, that is to the minimum. In this case the gradient descent has only a single time constant as in the one-dimensional case. But this is an exceptional condition. In general, the eigenvalues of R will be different. When the eigenvalues are different, the weight track bends because it follows the direction of the gradient at each point, which is perpendicular to the contours (Figure 1-17). The gradient direction no longer points to the minimum, which means that the weight tracks will not be straight lines to the minimum. The adaptation will take longer for two reasons. First, a longer path to the minimum will be taken. Second, the step size must be decreased compared with the circular case. Let us address the step size aspect further.
Figure 1-17 Weight track toward the minimum: (a) equal eigenvalues; (b) unequal eigenvalues
For guaranteed convergence the learning rate in each ith principal direction of the performance surface must be
(1.35)
where
i is the corresponding eigenvalue. The
worst case condition to guarantee convergence to the optimal w*
in all directions is therefore
(1.36)
That is,
the step size
must be smaller than the inverse of the largest
eigenvalue of the autocorrelation matrix. Otherwise the iteration
will diverge in one (or more) directions. Since the adaptation is
coupled, divergence in one direction will cause the entire system
to diverge.
In the early stages of adaptation the convergence is primarily along the direction of the largest eigenvalue, since the weight update along this direction will be bigger. On the other hand, toward the end of adaptation the algorithm will mainly adapt only the weight associated with the smallest eigenvalue (which corresponds to the smallest time constant). The time constant of adaptation is therefore
(1.37)
An
implication of this analysis is that when the eigenvalue spread of R is
large, there will be very different time constants of adaptation
in each direction. This reasoning gives a clear picture of the
fundamental constraint of adapting the weights using gradient
descent with a single step size
: The speed of adaptation is controlled by the
smallest eigenvalue, while the largest step size is constrained
by the inverse of the largest eigenvalue. This means that if
the ratio between the largest and the smallest eigenvalues (the
eigenvalue spread) is large, the convergence will be
intrinsically slow. This problem cannot be avoided when only a
single step size is used in the steepest descent.
The learning curve will approach Jmin in a geometric progression as before. However, there will be many different time constants of adaptation, one for each direction. Initially the learning curve will decrease at the rate of the largest eigenvalue, but toward the end of adaptation the rate of decrease of J is controlled by the time constant of the smallest eigenvalue (estimation of eigenvalue spread).
NEUROSOLUTIONS EXAMPLE 1.17
Visualizing the Weight Tracks and Speed of Adaptation
According to our previous calculations, the largest step size for convergence is Eq. 1.36 3.3e-2. The critically damped mode along the largest eigenvector should be 1.6e-2. The time constant of adaptation for the largest step size is around 20 iterations (epochs for batch), that is, the convergence should take 80 epochs with this step size.
When we run the simulations, the algorithm converges first along the direction of the largest eigenvalue (largest eigenvector direction), and then along the direction of the smallest eigenvector. Since the eigenvalue spread is 40, the steps are much bigger along the largest eigenvector direction. If we look at the figure of Example 1.16, we can see that the weights converge perpendicular to the contour plots since this is the steepest descent path. As we will see, there are two distinct regions in the learning curve: In the beginning it is controlled by the geometric ratio along the largest eigenvector, while toward the end it is controlled by the geometric ratio of the smallest eigenvector.
After running this example and observing the weight tracks let us change the input data file to a data set with a smaller eigenvalue spread. Click the input file icon and bring up its Inspector by clicking the right mouse button. Remove the present input file, and add the file regression2a.asc from the \data\chapters\ch1 folder on the CD-ROM.
The modification was made only in the variable x2, all the rest is the same. Respond to the panel associate by clicking on the close button. In the customize panel, skip the desired signal, and click on close. You have just modified the input data to this example. This new file has a much smaller eigenvalue spread, so we can expect that the weight tracks are basically straight lines to the minimum. Compute the new eigenvalue spread, and adjust the learning rates so that the convergence is as fast as possible.
1.7.4 The LMS Algorithm for Multiple Weights
It is straightforward to extend the gradient estimation given by the LMS algorithm from one dimension to many dimensions. We just apply the instantaneous gradient estimate Eq.1.12 to each element of Eq.1.33. The LMS for multiple dimensions reads
(1.38)
An interesting feature is that the LMS adaptation rule still uses local computations, i.e. for the i'th weight we can write
(1.39)
Note that although the analysis of the gradient descent techniques became complex, the LMS algorithm itself is still very simple. This is one reason why LMS is so widely used. But since LMS is a steepest-descent algorithm, the analysis and discussions concerning the largest step size for convergence and coupling of modes also apply to the LMS algorithm.
NEUROSOLUTIONS EXAMPLE 1.18
Visualizing Weight Tracks with On-Line Learning
In this example we configure the Backprop Controller to on-line learning to implement the LMS algorithm. Notice that the weight tracks follow basically the same path as before, but now the path is much more irregular due to the sample-by-sample update of the weights. When the eigenvalue spread is very large (the performance surface is very steep in one direction and shallow in others), the problem is difficult for LMS to solve. Any small perturbation in the smallest eigenvector direction gets amplified by the large eigenvalue spread.
1.7.5 Multiple Regression with Bias
Up to now we have implemented and solved analytically the multiple regression problem without bias. The reason for this is only for simplicity. With two weights we can still easily solve the multiple regression case by hand; however, if the bias is added, we must do the computations with three parameters. The simulations are transparent to these difficulties, since we just substitute the Axon by a Bias Axon. Note that the largest step size between the two cases will differ since the input data was effectively changed if one interprets the bias as a weight connected to an extra constant input of one. Hence the autocorrelation function and its eigenvalue spread changed.
We should state that the use of a bias is called the full least square solution, and it is the recommended way to apply least squares. The reason can be understood easily: When a bias is utilized in the PE, the regression line is not restricted to pass through the origin of the space, and smaller errors are normally achieved. There are two equivalent ways to solve the full least squares problem for D input variables:
· The input and desired responses need to be modified such that they become zero mean variables (this is called the deviation or z scores). In this case a D weight regression will effectively solve the original problem. The bias b is computed indirectly by
where wi are the optimal weights and the bars represent mean values.
· Alternatively, the input matrix has to be extended with an extra column of 1s (the first column). This transforms R into a (D+1)x(D+1) matrix, which introduces a D+1 weight in the solution (the bias).
NEUROSOLUTIONS EXAMPLE 1.19
Linear Regression without Bias
We will now substitute a Bias Axon for the Axon in the previous breadboard. This will effectively provide the regression solution without constraining the regression plane to pass through the origin. We see that the weight tracks are very similar in the beginning but that the error continues to drop, and the weights advance towards the w1=0 line. This means that the optimal solution changed. We now have a better solution than before but with increased complexity of the performance surface (four dimensional instead of three) and an increased number of adjustable parameters in our system (two weights and a bias).
1.7.6 The LMS Algorithm in Practice
We can use some rules of thumb to choose the step size in the LMS algorithm. The step size should be normalized by the variance of the input data estimated by the trace of R.
(1.40)
where
0 = 0.1 to 0.01. This normalization by the
input variance was the original rule proposed by Widrow to adapt
the Adaline. We can expect the algorithm to converge (settling
time) in a number of iterations k given by four times the time
constant of adaptation:
(1.41)
The LMS
algorithm has a misadjustment that is basically one half the
trace of R times ![]()
(1.42)
When the eigenvalues are equal, the misadjustment can be approximated by
which
allows us to give the following rule of thumb: The
misadjustment equals the number of weights divided by the
settling time, or equivalently, selecting
so that it produces 10% misadjustment means a
training duration in iterations of 10 times the number of inputs.