1.5 Estimation of the Gradient: The LMS Algorithm
An adaptive system can use the gradient to optimize its parameters. The gradient, however, is usually not known explicitly and thus must be estimated. Traditionally, the difference operator is used to estimate the derivative as outlined in Figure 1-8. A good estimate, however, requires many small perturbations to the operating point to obtain a robust estimation through averaging. The method is straightforward but not very practical.
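As a rough illustration of the difference-operator approach, the sketch below perturbs the weight around the operating point and compares the resulting estimate with the analytical derivative. It assumes the cost is J = 1/(2N) times the sum of squared errors and uses synthetic data; the function names and the data are hypothetical, not taken from the text.

```python
import numpy as np

def mse_cost(w, b, x, d):
    """Assumed cost: J = 1/(2N) * sum of squared errors of y = w*x + b."""
    err = d - (w * x + b)
    return 0.5 * np.mean(err ** 2)

def finite_difference_gradient(w, b, x, d, delta=1e-4):
    """Estimate dJ/dw with a central difference around the operating point w."""
    return (mse_cost(w + delta, b, x, d) - mse_cost(w - delta, b, x, d)) / (2 * delta)

# Synthetic data for illustration only (hypothetical values).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
d = 0.5 * x + 1.0 + rng.normal(0.0, 0.1, size=50)

w, b = 0.0, 0.0
fd_grad = finite_difference_gradient(w, b, x, d)
exact_grad = -np.mean((d - (w * x + b)) * x)  # analytical dJ/dw for comparison
print(fd_grad, exact_grad)
```

Each estimate costs two evaluations of the cost over the whole data set per weight, which hints at why the method is not very practical.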
In 1960 Widrow and Hoff proposed an extremely elegant algorithm to estimate the gradient that revolutionized the application of gradient descent procedures. The idea is very simple: use the instantaneous value as the estimator for the true quantity. For our problem this means dropping the summation in Eq. 1.9 and defining the gradient estimate at step k as its instantaneous value. Substituting Eq. 1.4 into Eq. 1.10, removing the summation, and then taking the derivative with respect to w yields
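$$\nabla J(k) = \frac{\partial J}{\partial w(k)} \approx \frac{\partial}{\partial w(k)}\left[\tfrac{1}{2}\varepsilon^{2}(k)\right] = -\varepsilon(k)\,x(k) \tag{1.12}$$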
What Eq. 1.12 tells us is that an instantaneous estimate of the gradient at iteration k is simply the product of the current input to the weight times the current error. The amazing thing is that the gradient can be estimated with one multiplication per weight. This is the gradient estimate that led to the famous least mean squares (LMS) algorithm (or LMS rule). The estimate will be noisy, however, since the algorithm uses the error from a single sample instead of summing the error over every point in the data set (i.e., the MSE is estimated by the error for the current sample). But remember that the adaptation process does not find the minimum in one step. Normally, many iterations are required to find the minimum of the performance surface, and during this process the noise in the gradient is averaged (or filtered) out.
If the estimator of Eq. 1.12 is substituted in Eq. 1.11, the steepest descent equation becomes
$$w(k+1) = w(k) + \eta\,\varepsilon(k)\,x(k) \tag{1.13}$$
This equation is the LMS algorithm. With the LMS rule one does not need to worry about perturbation and averaging to properly estimate the gradient at each iteration; it is the iterative process itself that improves the gradient estimator. The small constant η is called the step size, or the learning rate.
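A minimal sketch of the LMS rule for a linear PE with one weight and one bias follows. The function name `lms_fit`, the synthetic noise-free data, and the chosen step size are assumptions for illustration; the error is taken as desired minus output, and the bias is treated as a weight on a constant input of 1.

```python
import numpy as np

def lms_fit(x, d, step_size=0.01, epochs=200, w0=0.0, b0=0.0):
    """Adapt y = w*x + b sample by sample with the LMS rule."""
    w, b = w0, b0
    for _ in range(epochs):
        for xk, dk in zip(x, d):
            err = dk - (w * xk + b)      # current error: desired minus output
            w += step_size * err * xk    # one multiplication of error and input per weight
            b += step_size * err         # the bias sees a constant input of 1
    return w, b

# Synthetic, noise-free data (hypothetical): d = 2*x + 0.5.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
d = 2.0 * x + 0.5

w, b = lms_fit(x, d, step_size=0.05, epochs=200)
print(w, b)
```

Because the data here are noise-free, the weights settle on the generating values; with noisy data they would keep jittering around the optimum by an amount that grows with the step size.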
NEUROSOLUTIONS EXAMPLE 1.6
Adapting the Linear PE with LMS
Several things have to be added to the previous breadboard of the linear PE to make it learn automatically using the LMS algorithm. The methodology will be explained in more detail later. However, the technique used in NeuroSolutions is called back-propagation. In short, the algorithm passes the input data forward through the network and the error information (desired minus output) backward through another network. The error is propagated through a second layer, which can be obtained from the original network with minor and well-established modifications (more about this later). Thus, at every component there is a local activity (x) and a local error (ε) such that the weights of the network can be modified by Eq. 1.13.
NeuroSolutions implements this technique by adding two additional layers to the network: the back-propagation layer and the gradient-search layer. These two layers can be automatically added to the breadboard. The back-propagation layer looks like a small version of the network that sits on top of the original network (in red instead of orange). The gradient-search layer sits on top of the back-propagation layer and uses one of the gradient search methods to adjust the weights. In this case, the gradient-search layer is a simple "step" layer, which implements the gradient descent rule of Eq. 1.13. Notice that only the components that have adjustable weights (the Synapse (w) and Bias Axon (b)) have gradient-search components.
In addition to the two layers, we need an additional controller to manage the back-propagation layer. The Backprop Controller sits above the yellow Controller and sets parameters such as whether we use batch or on-line learning. In this example we use batch learning; that is, the learning rule computes the weight updates for every sample in the training set, adds them up, and at the end of the epoch (one presentation of all the training data) updates the weights according to Eq. 1.13. The step size is set to 0.01, and training runs for 200 iterations.
When you run the network, watch the regression line move toward the optimal value in the Scatter Plot. When the simulation ends, notice that the weight is approximately 0.139, the bias is approximately 1.33, and the error is approximately 0.033, all in excellent agreement with the optimal values we computed analytically.
You should explore this breadboard by entering several values of the step size and opening the Inspector to see how each component is configured.
1.5.1 Batch and Sample-by-Sample Learning
The LMS algorithm was presented in a form in which the weight updates are computed for each input sample, and the weights are modified after each sample. This procedure is called sample-by-sample learning, or on-line training. As we have mentioned, the estimate of the gradient is going to be noisy; that is, the path towards the minimum is going to zigzag around the gradient direction.
An alternative solution is to compute the weight update for each input sample and store these values (without changing the weights) during one pass through the training set, which is called an epoch. At the end of the epoch, all the weight updates are added together, and only then are the weights updated with the composite value. This method adapts the weights with a cumulative weight update, so it follows the gradient more closely. It is called the batch training mode, or batch learning. Batch learning is also an implementation of the steepest-descent procedure. In fact, it provides an estimator for the gradient that is smoother than the LMS. We will see that the agreement between the analytical quantities that describe adaptation and the ones obtained experimentally is excellent with the batch update. See batch versus on-line learning for more detail.
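The two update schedules can be contrasted in a short sketch. Everything here is an assumption for illustration: the function `train`, the synthetic data, and the choice to record the cost once per epoch over the full training set (so the per-sample jitter of on-line learning is not visible in these curves).

```python
import numpy as np

def train(x, d, step_size, epochs, batch):
    """Adapt w and b of y = w*x + b; return them with the per-epoch cost."""
    w, b = 0.0, 0.0
    curve = []
    for _ in range(epochs):
        dw, db = 0.0, 0.0
        for xk, dk in zip(x, d):
            err = dk - (w * xk + b)          # desired minus output
            if batch:
                dw += step_size * err * xk   # store the update, weights unchanged
                db += step_size * err
            else:
                w += step_size * err * xk    # on-line: apply the update immediately
                b += step_size * err
        if batch:
            w += dw                          # apply the summed update once per epoch
            b += db
        curve.append(0.5 * np.mean((d - (w * x + b)) ** 2))
    return w, b, curve

# Synthetic noisy data (hypothetical values).
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=20)
d = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=20)

w_batch, b_batch, curve_batch = train(x, d, step_size=0.01, epochs=500, batch=True)
w_online, b_online, curve_online = train(x, d, step_size=0.01, epochs=500, batch=False)
print(w_batch, b_batch, w_online, b_online)
```

Note that summing the updates over an epoch makes the effective batch step grow with the number of samples, so the step size may need to shrink for larger training sets.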
To visualize the differences between these two update methods, we will plot the value of the cost during adaptation (called the learning curve).
NEUROSOLUTIONS EXAMPLE 1.7
Batch versus On-Line Adaptation
It is important to visualize the differences in adaptation for on-line and batch learning. Up to now we have been using the batch mode. In this example we set the Backprop Controller to use on-line training. To display the learning curve, we have to introduce one new component: the Megascope. The Megascope is a probe that acts just like an oscilloscope: it plots a continuous stream of inputs, using the iteration number as the x axis. The Megascope sits on top of the Data Storage component.
The important controls of the Megascope are the scales of the x and y axes. In the scope level of the Inspector, we can select the vertical scale (y axis) and the offset of each channel. Alternatively we can use the auto scale feature. The horizontal scale is selected in the sweep level of the Inspector (number of samples per division). Remember that the number of samples displayed is defined by the Data Storage component.
To create the learning curve, we simply place a Data Storage component over the L2 Criterion at one of the cost access points and then place a Megascope on top of it. In batch learning, the Average Cost is selected, while in on-line learning, the Cost access point is utilized.
We will see that the learning curve is no longer smooth because we are updating the weights after each example. Since the individual errors vary from sample to sample, our updates make the learning curve noisy. The learning curve will have a periodic component superimposed on a decaying exponential. The exponential tells us that we are approaching a better overall solution. The periodic features show the error obtained for each input sample, while the envelope is related to the learning curve for the batch mode. Note that the weights never stabilize; if they did, the learning curve would be smooth and converge to a single final value. Since there is more noise in on-line learning, we must decrease the step size to get smoother adaptation. We recommend using a step size 10 times smaller than the step size for batch learning. The price paid is a longer adaptation time: the system needs more iterations to reach a predefined final error. Experiment with the learning rates to observe this behavior.
1.5.2 Robustness and System Testing
One of the interesting aspects of the LMS solution is its robustness. From the explanation given (Figure 1-9), no matter what the initial condition for the weights, the solution always converges to basically the same value. We can even add some noise to the desired response and find that the linear regressor parameters are basically unchanged. This robustness is rather important for real-world problems, where noise is omnipresent.
The group of input samples and desired responses (shown in Table 1-1) used to train the system are called collectively the training set for obvious reasons. It is with their information that the system parameters were adapted. But once the optimal parameters are found, the parameters should be fixed. When the system is utilized for new inputs never encountered before, it will produce for each input a response based on the parameters obtained during training. If the new data comes from the same experiment, the response should resemble the value of the desired response for that particular input value.
Thus we see that the system has the ability to extrapolate responses for new data. This is an important feature since in general we wish the performance obtained in the training set to also apply (generalize) to the new data when the system is deployed. But due to the methodology utilized to derive the parameter values we can never be exactly sure of how well the system will respond to new data.
For this reason it is good methodology to use a test set to verify the system performance before deploying it in the real-world application. The test set consists of new data not used for training but for which we still know the desired response. It is like the final rehearsal before a play's opening night. We should also compute the correlation coefficient in the test set. Normally, we will find a slight decrease in performance from the training set. If the performance in the test set is not acceptable, we have to go back to the drawing board. When this happens in regression, the most common problem is that the training data is inadequate either in quantity or in exhaustive coverage of the experimental conditions. This point will be addressed in more depth in the following chapters.
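The train-then-test procedure can be sketched as follows. To keep the example short, the parameters are fitted with closed-form least squares (`np.polyfit`) rather than LMS; the data, the 40/20 split, and the helper `evaluate` are all hypothetical choices made for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): 60 samples from one "experiment".
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=60)
d = 0.3 * x + 1.0 + rng.normal(0.0, 0.2, size=60)

# Hold out data that is never used to adapt the parameters.
x_train, d_train = x[:40], d[:40]
x_test, d_test = x[40:], d[40:]

# Fit on the training set only, then freeze w and b.
w, b = np.polyfit(x_train, d_train, deg=1)

def evaluate(x, d, w, b):
    """Return the mean squared error and correlation coefficient of y = w*x + b."""
    y = w * x + b
    mse = np.mean((d - y) ** 2)
    r = np.corrcoef(y, d)[0, 1]
    return mse, r

train_mse, train_r = evaluate(x_train, d_train, w, b)
test_mse, test_r = evaluate(x_test, d_test, w, b)
print(train_mse, train_r, test_mse, test_r)
```

Comparing the two pairs of numbers is the check described above: a large gap between training and test performance suggests the training data did not cover the conditions well enough.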
NEUROSOLUTIONS EXAMPLE 1.8
Robustness of LMS to Noise
The LMS algorithm is very robust. It works from any arbitrary location and even works well with noise added to the desired data. To demonstrate that the system works well even with noisy data, we add one additional component to the Breadboard from the previous example: the Noise component. The Noise component allows uniform, Gaussian, or "user-defined" noise to be added to the input or desired signals. We will add the Noise component to the desired signal and watch as the system moves close to the optimal location even with the noisy data.
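Outside NeuroSolutions, the same experiment can be approximated with a few lines of code. The sketch below adds Gaussian noise to the desired signal and runs LMS from several arbitrary starting points; the function `lms`, the data, the noise level, and the starting points are assumptions chosen for illustration.

```python
import numpy as np

def lms(x, d, w0, b0, step_size=0.01, epochs=300):
    """Sample-by-sample LMS for y = w*x + b, starting from (w0, b0)."""
    w, b = w0, b0
    for _ in range(epochs):
        for xk, dk in zip(x, d):
            err = dk - (w * xk + b)      # desired minus output
            w += step_size * err * xk
            b += step_size * err
    return w, b

# Synthetic data (hypothetical): the clean relation is d = 2*x + 0.5.
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=50)
noisy_d = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=50)  # noise on the desired signal

# Arbitrary initial conditions for (w, b).
starts = [(-5.0, 5.0), (0.0, 0.0), (10.0, -3.0)]
solutions = [lms(x, noisy_d, w0, b0) for w0, b0 in starts]
for (w0, b0), (w, b) in zip(starts, solutions):
    print((w0, b0), "->", (w, b))
```

All starting points should end near the same weight and bias, close to the values that generated the clean data.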
1.5.3 Computing the Correlation Coefficient in Adaptive Systems
The correlation coefficient, r, tells how much of the variance of d is captured by a linear regression on the independent variable x. As such, r is a very powerful quantifier of the modeling result. It has a great advantage with respect to the MSE because it is automatically normalized, while the MSE is not. However, the correlation coefficient is blind to differences in means because it is a ratio of variances (see Eq. 1.8); that is, as long as the desired data and the input covary, r will be large in magnitude, in spite of the fact that they may be far apart in actual value. Thus we need both quantities (r and MSE) when testing the results of regression.
Although the correlation coefficient can be computed directly from x and d (Eq. 1.8), we would like to estimate r at the output of the linear system to follow the adaptive systems' methodology. From Eq. 1B.4, we can write
$$r = \sqrt{\frac{\sum_i (y_i - \bar{d})^2}{\sum_i (d_i - \bar{d})^2}} \tag{1.14}$$
Note, however, that y changes during adaptation, so we should wait until the system has adapted before reading the final correlation coefficient. During adaptation the numerator of Eq. 1.14 can be larger than the denominator, giving a value for r larger than 1, which is meaningless. We therefore propose to compute a new parameter that is a reasonable proxy for the correlation coefficient, even during adaptation. We subtract a term from the numerator of Eq. 1.14 that becomes zero at the optimal setting but limits the new parameter so that its value is always between -1 and 1, even during adaptation. We can write (see computation of correlation coefficient)
(1.15)
Note that all these quantities can be computed on-line with the information of the error, the output, and the desired response. Remember, however, that Eq. 1.14 measures the correlation coefficient only when the Adaline has been totally adapted to the data.
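The sketch below checks the behavior described for Eq. 1.14 under an explicit assumption: that the equation is the square root of the variation of the output y about the mean of d, divided by the variation of d about its own mean (explained over total variance). The function `r_from_output`, the synthetic data, and the use of `np.polyfit` to obtain the adapted parameters are hypothetical choices, not the text's method.

```python
import numpy as np

def r_from_output(y, d):
    """Assumed form of Eq. 1.14: sqrt of output variation over desired variation."""
    d_mean = np.mean(d)
    return np.sqrt(np.sum((y - d_mean) ** 2) / np.sum((d - d_mean) ** 2))

# Synthetic data (hypothetical values).
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 10.0, size=40)
d = 0.3 * x + 1.0 + rng.normal(0.0, 0.3, size=40)

# Fully adapted system: least-squares weight and bias.
w_opt, b_opt = np.polyfit(x, d, deg=1)
r_adapted = r_from_output(w_opt * x + b_opt, d)
r_direct = abs(np.corrcoef(x, d)[0, 1])   # computed directly from x and d

# Far from the optimum the ratio is no longer a valid correlation coefficient.
r_far = r_from_output(3.0 * w_opt * x + b_opt, d)
print(r_adapted, r_direct, r_far)
```

At the least-squares solution the two values agree, while the badly adapted system yields a value above 1, which is the situation that motivates the bounded proxy of Eq. 1.15.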
NEUROSOLUTIONS EXAMPLE 1.9
Estimating the Correlation Coefficient During Learning
NeuroSolutions does not include a component to compute the correlation coefficient. It does, however, allow you to write your own components. These custom components are called DLLs. A custom component looks just like the component it takes the place of, except that its icon has "DLL" printed on it. In this example, we include a custom component to compute the correlation coefficient. This component looks exactly like an L2 Criterion component, except it has "DLL" printed on it.
Plug in the values of the optimal weights and verify that the formula of Eq. 1.14 gives the correct correlation coefficient. Slightly modify w to 0.120 and verify that the correlation coefficient decreases. If you plug in values for w and b that are very far away from the fitted regression, the estimate of r using Eq. 1.15 becomes less accurate, but is still bounded by -1 and 1. The example also uses LMS to adapt the coefficients. Observe that the correlation coefficient is always between -1 and 1 during adaptation and that the final value corresponds to the computed one.