Download a free evaluation copy of NeuroSolutions to discover how to apply neural network technology to your artificial intelligence application. |
1.6 A Methodology for Stable Adaptation
During adaptation the learning algorithm automatically changes the system parameters using Eq 1.13. This adaptation algorithm has one free parameter (the step size) that must be user selected. To appropriately set the step size, the user should have a good understanding of what is happening inside the system. In this section we quantify the adaptation process and develop visualization tools that will help you understand how well the system is learning.
1.6.1 Learning Curve
As is readily apparent from Figure 1-9, when the weights approach the optimum value, the values of J(k) (the MSE at iteration k) will also decrease, approaching its minimum value Jmin. One of the best ways to monitor the convergence of the adaptation process is to plot the error at each iteration. The plot of the MSE across iterations is called the learning curve (Figure 1-10). The learning curve is as important for adaptive systems as the thermometer is to check your health. It is an external, scalar, easy-to-compute indication of how well the system is learning. But similar to body temperature, it is unspecific; that is when the system is not learning, it does not tell us why.
Figure 1-10 The learning curve
Notice that the error approaches the minimum in a one-sided manner (i.e., always larger than Jmin). As we can expect, the rate of decrease of the error depends on the value of the step size . Larger step sizes will take fewer iterations to reach the neighborhood of the minimum, provided that the adaptation converges. However, too large a step size creates a divergent iterative process, and the optimal solution is not obtained. It is interesting to note that we would like as large a step size as possible, because this decreases the convergence time. However, if the step size is increased too much, divergence will result. We therefore must seek a way to find the largest possible step size that guarantees convergence.
NEUROSOLUTIONS EXAMPLE 1.10
The Learning Curve
The goal of this example is to display the learning curve and to show how the learning rate affects its shape. This example plots the mean squared error during adaptation, which we called the learning curve, the thermometer of learning.
When you run the simulation, watch how the error moves towards zero as the regression line moves towards the optimal location. You can also change the learning rates and watch the regression line moving more quickly or slowly toward the optimal location, thus causing the learning curve to be steeper or shallower. The visualization of the regression line contains more information about what the system is doing but is very difficult to compute and display in higher dimensions. The learning curve, however, is an external, scalar quantity that can be easily measured with minimal overhead.
1.6.2 Weight Tracks
An adaptive system modifies its weights in an effort to find the best solution. The plot of the value of a weight over time is called the weight track. Weight tracks are an important and direct measure of the adaptation process. The problem is that normally our system has many weights, and we don't know what their optimal values are. Nevertheless, the dynamics of learning can be inferred and monitored from the weight tracks.
In the gradient-descent adaptation, adjustments to the weights are governed by two quantities Eq. 1.11: the step size and the value of the gradient at the point. Even for a constant step size, the weight adjustments become smaller and smaller as the adaptation approaches w*, since the slope of the quadratic performance surface is decreasing near the bottom of the performance surface. Thus the weights approach their final values asymptotically (Figure 1-11).
Three cases are depicted in Figure 1-11. If the step size is small, the weight converges monotonically to w*, and the number of iterations to reach the bottom of the bowl may be large. If the step size is increased, the convergence will be faster but still monotonic. After reaches a value called critically damped, the weight will approach w* in an oscillatory fashion (2>1), that is, it will overshoot and undershoot the final solution. The number of iterations necessary to reach the neighborhood of w* will start increasing again. If the step size is too large (3>2), the iterative process will diverge, that is instead of getting closer to the minimum, the search will visit points of larger and larger MSE until there is a numeric overflow. We say that the learning diverged.
Figure 1-11 Weight tracks and plots of the weight values across iterations for three values of .
NEUROSOLUTIONS EXAMPLE 1.11
Weight Tracks
It is very instructive to observe the linear PE parameters during learning and how they change as a function of the step size. Let us install a Mega Scope over the Synapse to visualize the slope parameter of the regressor and over the Bias Axon to visualize the regressor bias. These are called weight tracks. Run the simulation and watch how changing the step sizes affects the way the system approaches its final weights.
The weight tracks are a finer display of how adaptation is progressing, but the problem is that in systems with many weights it becomes impractical to observe all the weight tracks. Why do we say that weight tracks give us a better handle on the adaptation parameters? Enter 0.02 for the step size and see the weight tracks converge monotonically to their minimum value. Now enter 0.035. The weight tracks are oscillating toward the final value which means that the system is already in the underdamped regime (but the learning curve is still monotonically decreasing toward the minimum at a faster rate). We can expect divergence if we increase the weights further. Try 0.038 and see it happen. Relate this behavior to Figure 1-11.
1.6.3 Largest Step Size for Convergence
As we have just discussed, we would like to choose the largest step size possible for fastest convergence without creating an unstable system. Since adjustment to the weights is a product of the step size and the local gradient of the performance surface, it is clear that the largest step size depends on the shape of the performance surface. We already saw (Eq. 1B.10) that the shape of the performance surface is controlled by the input data. Therefore we can conclude that the maximum step size will be dictated by the input data. But how?
If we rewrite and manipulate the equations that produce the weight values in terms of the first weight w(0), Derivation of the Largest Step Size we get
(1.16)
where
The term (1-) must be less than or equal to 1 to guarantee weight convergence (and less than 1 to guarantee convergence to zero, giving the solution w(k+1)=w*). This implies that
(1.17)
where is the geometric ratio of the iterative process. Hence, the value of the step size must always be smaller than 2/. The fastest convergence is obtained with the critically damped step size of 1/. The closer is to 1/ the faster is the convergence, but faster convergence also means that the iterative process is closer to instability. We can visualize this in Figure 1-11. When is increased, a monotonic (overdamped) convergence to w* is substituted by an alternating (underdamped) convergence that finally degenerates into divergence.
There is a slight practical problem that must be solved. During batch learning the weight updates are added together during an epoch to obtain the new weight. This effectively includes a factor of N in the LMS weight update formula Eq. 1.13. To apply the analysis of the largest step size Eq. 1.17 one has to use a normalized step size
(1.18)
With this modification, even if the number of samples in our experiment changes, the step sizes do not need to be modified. Note that for on-line learning (N =1) we get the LMS rule again. We will always use normalized step sizes but to make the notation simpler, we will drop the subscript n in the normalized step size. An added advantage of using normalized step sizes is that we can switch between on-line updates and batch updates without having to change the step size in the simulations.
The analysis of the largest step size Eq. 1.17 also applies in the mean to the LMS algorithm. However, since LMS uses an instantaneous (noisy) estimate of the gradient, even when obeys Eq 1.17, instability may occur. When the iterative process diverges, the algorithm "forgets" its location in the performance surface; that is the values of the weights change drastically. This means that all the iterations up to that point were wasted. Hence with LMS it is common to include a safety factor of 10 in the step size max) or to use batch training, which reduces the noise in the estimate of the gradient.
1.6.4 Time Constant of Adaptation
An alternative view of the adaptive process is to quantify the convergence of w(k) to w* in terms of an exponential decrease. We know that w(k) converges to w* as a geometric Eq.1.16. The envelope of the geometric progression of weight values can be approximated by an exponential decay exp(-t/), where is the time constant of weight adaptation. A single iteration or epoch can be considered a unit of time. One may want to know approximately how many iterations are needed until the weights converge. The time constant of weight adaptation can be written as
(1.19)
which clearly shows that fast adaptation (small time constant ) requires large step sizes. (see derivation of the time constant of adaptation). For all practical purposes the iterative process converges after four time constants.
The steps used to deriv1 the time constant of weight adaptation can be applied also to come up with a closed-form solution to the decrease of the cost across iterations, which is called the time constant of adaptation. Eq. 1.16 tells us how the weights converge to w*. If the equation for the weight recursion is substituted in the equation for the cost (Eq.1B.11) we get
which means that J also approximates Jmin in a geometric progression, with a ratio equal to ². Therefore the time constant of adaptation is
Since the geometric ratio is always positive, J approximates Jmin monotonically (i.e., an exponential decrease). The time constant of adaptation describes the learning time (in number of iterations) needed to adapt the system. Notice that these expressions assume that the adaptation follows the gradient. With the instantaneous estimate used in the LMS, J may oscillate during adaptation since the estimate is noisy. But even in the LMS, J will approach Jmin in a one-sided way (i.e. always greater than or equal to Jmin).
NEUROSOLUTIONS EXAMPLE 1.12
Linear Regression Without Bias
The previous example solved the linear regression problem with one weight and one bias. To compare the equations previously given (which are a function of a single parameter) with the simulations, we have to make a modification in the data set or in the simulation. Shortly we will see how to extend the analysis for multiple weights, but for the time being let us work with the simpler case.
We substitute the Bias Axon by an Axon, a component that simply adds its inputs, so the regression solution becomes y =wx which has to pass through the origin. With this new Breadboard we can compare the numerical results of the simulations directly with all the equations derived in this section, since there is only one free parameter. Batch updates are used throughout.
The optimal value of the slope parameter is computed by Eq. 1.6, which gives w =0.30009, with an average error of 0.23. This solution is different from the value obtained previously (w = 0.139511) for the bias regressor because the regression line is now constrained to pass through the origin. It turns out that this constrained solution is worse than before, as we can see by the error (0.23 versus 0.033). Observing in the scatter plot the output (red points) and the input samples (blue) shows clearly what we are describing.
Computing by Eq. 1.16 yields 54, so according to Eq. 1.17 the maximum step size is =3.6e-2. The critically damped solution is obtained with a step size of 1.8e-2, and adaptation with a step size below this value is overdamped. When we run the simulator in the overdamped case the weights approach the final value monotonically; for the critically damped case, they stabilize quite rapidly; while for the underdamped case they oscillate around the final value, and the convergence takes more iterations. Notice also that the linear regressor "vibrates" around the final position, since the slope parameter is overshooting and undershooting the optimal value.
According to Eq. 1.19 for the critically damped step size =1, so the solution should stabilize in four updates (epochs). This step size yields the fastest convergence. Go to the Controller Inspector and use the epoch button to verify the number of samples until convergence.
1.6.5 Rattling
Up to now our main focus was the speed of adaptation, that is, how fast the weights approximate w*, or equivalently, how fast J approximates Jmin. Unfortunately, this is only part of the story. For fast convergence we need large step sizes (). But when the search is close to the minimum w*, where the gradient is small but not zero, the iterative process continues to wander around a neighborhood of the minimum solution without ever stabilizing. This phenomenon is called rattling (Figure 1-12), and the rattling basin increases proportionally to the step size . This means that when the adaptive process is stopped by an external command (such as the number of iterations through the data), the weights may not be exactly at w*. We know they are in the neighborhood of the minimum but not exactly at the optimum.
Figure 1-12 Rattling of the iteration procedure
If we picture the performance surface (Figure 1-12), when the final weights are not at w*, there will be a penalty in performance, that is the final MSE will be higher than Jmin. In the theory of adaptation the difference between the final MSE and the Jmin (normalized by Jmin) is called the misadjustment M.
(1.20)
This means that in search procedures that use gradient descent there is an intrinsic compromise between accuracy of the final solution ( a small misadjustment) and speed of convergence. The parameter that controls this compromise is the step size . High means fast convergence but also large misadjustment, while small means slow convergence but little misadjustment.
NEUROSOLUTIONS EXAMPLE 1.13
Rattling
We observed in Example 1.7 how noisy the learning curve became with the on-line update. This is an external indication that the weights were changing from sample to sample even after the system reached the neighborhood of the optimum. The implication of this random movement in the weights is a penalty in the final MSE. In this example we show and exactly quantify the rattling.
The rattling has important consequences for adaptation, since if we set the step size large for fast convergence, we pay a price of inaccurate coefficients, which is translated into an excess MSE. The rule of thumb for LMS is to use a step size that is 1/10 of the largest possible step size. For a step size close to the largest possible, the MSE for the epoch is effectively smaller than the theoretical minimum, which is impossible. This happens because the parameters are changing so much with each update that the slope is continuously varying. The problem is that when we stop the training we do not know if the final value of the weight is a good approximation to the theoretical regression line.
This shows that for adaptive systems the final MSE is only part of the story. We have to make sure that the system coefficients have stabilized. It is interesting to note that with batch updates there is no rattling, so in the linear case the batch solution is more appropriate. Observe this in the simulations by displaying the MSE for large and small step sizes. We are just paying the small price of storing the individual weight updates. For nonlinear systems the batch is unfortunately no longer always superior to the on-line update as we will see.
This example shows that obtaining a small MSE is a necessary but not sufficient condition for stable adaptation. Adaptation also requires that the weights of the model settle to stable values. This second condition is required because the system can be endlessly changed its parameters to fit the present sample. This will always give a small MSE, but from a modeling point of view it is a useless solution because no single model is found to fit the data set.
1.6.6 Scheduling the Step Sizes
As we saw in the latest examples, for fast convergence to the neighborhood of the minimum a large step size is desired. However, the solution with a large step size suffers from rattling. One attractive solution is to use a large learning rate in the beginning of training to move quickly toward the location of the optimal weights, but then the learning rate should be decreased to obtain good accuracy on the final weight values. This is called learning rate scheduling. This simple idea can be implemented with a variable step size controlled by
(1.21)
where (0) =0 is the initial step size, and is a small constant. Note that the step size is being linearly decreased at each iteration. If we have control of the number of iterations we can start with a large step size and decrease it to practically zero toward the end of training. The value of needs to be experimentally determined. Alternatively, we can decrease the step size slowly (in optimization this slow decrease is called annealing) using either a linear, geometric, or logarithmic rule (see more on scheduling step sizes).
NEUROSOLUTIONS EXAMPLE 1.14
Scheduling of Step Sizes
In this demonstration we vary the step size using the scheduling component already introduced in Example 1.4. The scheduler is a component that takes an initial value from the component it is attached to, and changes the value according to a predetermined rule. Here we use the linear rule, and since we want to decrease the step size, the factor is negative. We should set a maximum and a minimum value just to make sure that the parameters are always within the range we want. should be set according to the number of iterations and the initial and final values (init -N=residual).
Here the important parameter is the minimum (the residual step size is set at 0.001), because after scheduling we want to let the system fine-tune its parameters to find the best approximation of the minimum. However, notice that this implies that the parameter is already in its optimal neighborhood, and this depends on a lot of unknown factors. Thus, if the scheduling is not right, the adaptation may stall in positions far from the minimum.
You should explore the breadboard by entering other values for and the final value and see their impact on the final weight value. You can also use the exponential or the logarithmic schedulers and see how they behave. Which one do you prefer for this case?