| Download a free evaluation copy of NeuroSolutions to discover how to apply neural network technology to your artificial intelligence application. |
1.4 Adaptive Linear Systems
1.4.1 Least Squares as a Search for the Parameters of a Linear System
The purpose of least squares is to find parameters (b, w) that minimize the difference between the system output yi and the desired response di. Regression is effectively computing the optimal parameters of an interpolating system (linear in this case) which predicts the value of d from the value of x.
Figure 1-6 Regression as a linear system design problem
Figure 1-6 shows graphically the operation of adapting the parameters of the linear system. The system output y is always a linear combination of the input x with the bias, so it has to lie on a straight line of equation y = wx+b. Changing b modifies the y intercept, while changing w modifies the slope. Therefore we conclude that the goal of linear regression is to adjust the position of the line to minimize the average square difference between the y values (on the line) and the cloud of points di (i.e. the criterion J).
The key
point is to recognize that the error contains information that
can be used to optimally place the line. Figure 1-6 shows this by
including a subsystem that accepts the error as input and
modifies the parameters of the system. Thus, the error
i is fed back to the system and
indirectly affects the output through a change in the parameters (b,
w). Effectively the system is made "aware" of its
performance through the error. With the incorporation of the
mechanism that automatically modifies the system parameters, a
very powerful linear system can be built that will constantly
seek optimal parameters. Such systems are called neural and adaptive systems and are the
focus of this book.
1.4.2 Neural and Adaptive Systems
Before pursuing the study of adaptive systems, it is important to reflect briefly on the implications of neural and adaptive systems in engineering design. System design usually begins with specifications. First, the problem domain is studied and modeled, specifications are established, and then a system is built to meet the specifications. The key point is that the system is built to meet the current specifications and will always use the designed set of parameters, even if the external conditions change.
Here we are proposing a very different system design approach based on adaptation, which has a biological flavor to it. In the beginning the system parameters may be way off, creating a large error. However, through the feedback from the error, the system can change its parameters to decrease the error as much as possible. The system's "experience" with the data designs the best set of parameters. An adaptive system is more complex because it has to not only accomplish the desired task, but also be equipped with a subsystem that adapts its parameters. But notice that even if the data changes in the future, this design methodology will modify the system parameters so that the best possible performance is obtained. Additionally, the same system can be used for multiple problems.
There are basically two ways to adapt the system parameters: supervised learning and unsupervised learning. The method described until now belongs to supervised learning because there is a desired response. Later on in the book we will find other methods that also adapt the system parameters, but using only an internal rule. Since there is no desired response, these methods are called unsupervised. We will concentrate here on supervised learning methods.
The ingredients of supervised adaptive system design are
A system (linear in this case) with adaptive parameters
The existence of a desired or target response d
An optimality criterion (the MSE in this case) to be minimized
A method (subsystem) to compute the optimal parameters.
The method of least squares finds the optimal parameters (b, w) analytically. Our goal is to find alternate ways of computing the same parameters using a search procedure.
1.4.3 Analysis of the Error in the Space of the Parameters: The Performance Surface
Let us analyze the mean square error (J) as we change the parameters of the system (w and b). Without loss of generality, we are going to assume that b =0 (or equivalently, that the mean of x and d have been removed), such that J becomes a function of the single variable w
(1.9)
If w is treated as the variable and all other parameters are held constant, we can immediately see that J is quadratic on w with the coefficient of w² (i.e. xi²) being always positive. In the space of the possible w values, J is a parabola facing upward (J is always positive since it is a sum of squares). The function J(w) is called the performance surface for the regression problem (Figure 1-7). The performance surface is an important tool that helps us visualize how the adaptation of the weights affects the mean square error.
Figure 1-7 The performance surface for the regression problem
NEUROSOLUTIONS EXAMPLE 1.4
Plotting the Performance Surface
The performance surface is just a plot of the error criterion (J) versus the value of the weights, so what we will do during the simulation is to vary the Synapse weight (which corresponds to the slope parameter of the linear regressor) between two appropriate values. We can imagine that the error will be minimum at an intermediate value of the weight and that it will increase for both lower values and higher values.
To modify the Synapse weight incrementally we attach a Linear Scheduler to the Synapse and place the Matrix Viewer on it so we can see how the weight is changing. To visualize the MSE we will bring another Scatter Plot to the L2 Criterion. This will allow us to plot the cost versus weight (performance surface).
Run the example and see how the slope parameter of the linear PE affects the mean square error of the linear regressor. As we are going to see the input and desired signals tremendously affect the shape of the performance surface. But how can we change the shape of the performance curve without touching the data files? Let us substitute the L2 Criterion by the Lp Criterion. Go to Palettes and open the Error Criteria menu. Click on the Lp Criterion and bring the pointer to the Breadboard. Notice that the pointer changed to a stamper. If you left-click on the L2 Criterion component, the L2 is substituted by the Lp component, which computes a cost given by the p norm
By default the norm is p =5. Run the simulation again. What do you see? Does the location of the minimum change appreciably? What about the shape of the performance surface? Do you now better understand the function of the cost criterion?
Using the performance surface, we can develop a geometric method for finding the value of w, here denoted by w*, which minimizes the performance criterion. Previously (Eq. 1.5), we computed w* by setting to zero the derivative of J with respect to w.
The gradient of the performance surface is a vector (with the dimension of w) that always points toward the direction of maximum J change and with a magnitude equal to the slope of the tangent of the performance surface (Figure 1-8). If you visualize the performance surface as a hillside, each point on the hill will have a gradient arrow that points in the direction of steepest ascent at that point, with larger magnitudes for steeper slopes. A ball rolling down the hill will always attempt to roll in the direction opposite to the gradient arrow (the steepest descent). The slope at the bottom is zero, so the gradient is also zero (that is the reason the ball stops there).
In our
special case the gradient has just one component along the weight
axis w,
J =
wJ given by
(1.10)
(see the
gradient
definition and construction). A graphical way to construct
wJ at a point w0 is to first find
the level curve (curve of constant J value) that passes
through the point (also called the contour plot). Then take
the tangent to the level curve at w0. The gradient
component
wJ is always perpendicular to the
contour curve at w0, with a magnitude given by the partial
derivative of J with respect to the weight w (Eq.
1.10). For one weight (one-dimensional problem), as in Figure 1-8
the construction is simplified and we have to only find the
direction of the gradient on the axis.
Figure 1-8 Performance surface and its gradient
At the bottom of the bowl, the gradient is zero, because the parabola has slope zero at the vertex. Thus, for a parabolic performance surface, computing the gradient and equating it to zero finds the value of the coefficients that minimize the cost, just as we did in Eq.1.6. The important observation is that the analytical solution found by the least squares coincides with the minimum of the performance surface. Substituting the value of w* into Eq.1.9, the minimum value of the error (Jmin) can be computed (properties of the performance surface).
NEUROSOLUTIONS EXAMPLE 1.5
Comparison of Performance Curves for Different Data Sets
In this example we provide two sets of input files and two sets of output files. By changing the input data, we find that the minimum error, its location in the weight space (a simple line for this 1D example), and the shape of the performance surface change. On the other hand, if we change the desired signal, only the minimum value of the performance and its location change, but the overall shape remains the same.
1.4.4 Search of the Performance Surface with Steepest Descent
Since the performance surface is a paraboloid, which has a single minimum, an alternative procedure to find the best value of the coefficient w is to search the performance surface instead of computing the best coefficient analytically by Eq.1.6. The search for the minimum of a function can be done efficiently using a broad class of methods based on gradient information. The gradient has two main advantages for the search:
The gradient can be computed locally.
The gradient always points in the direction of maximum change.
If the goal is to reach the minimum, the search must be in the direction opposite to the gradient. Thus, the overall method of searching can be stated in the following way.
Start the search with an arbitrary initial weight w(0), where the iteration number is denoted by the index in parentheses. Then compute the gradient of the performance surface at w(0), and modify the initial weight proportionally to the negative of the gradient at w(0). This changes the operating point to w(1). Then compute the gradient at the new position w(1), and apply the same procedure again, that is,
(1.11)
where
is a small constant and
J(k) denotes the gradient of the
performance surface at the kth iteration. The constant
is used to maintain stability in the search by
ensuring that the operating point does not move too far along the
performance surface. This search procedure is called the steepest descent method. Figure
1-9 illustrates the search procedure.
Figure 1-9 The search using the gradient information
If we
trace the path of the weights from iteration to iteration,
intuitively we see that if the constant
is small, eventually the best value for the
coefficient w* will be found. Whenever w >w*, we
decrease w, and whenever w <w*, we increase w.