| Download a free evaluation copy of NeuroSolutions to discover how to apply neural network technology to your artificial intelligence application. |
1.8 Newton's Method
If you are familiar with numerical analysis, you may be asking why we aren't using Newton's method for the search. Newton's method is known to find the roots of quadratic equations in one iteration. The minimum of the performance surface can be equated to the root of the gradient equation Eq.1.32, as is outlined by Eq.1.31. Hence, Newton's method can also be used in search. The adaptive weight equation using the Newton's method is Newton's Derivation
(1.43)
Compare
this with Eq.1.33 and note that
the gradient information is weighted by the inverse of the
correlation matrix of the input and that
is equal to 1. This means that Newton's method
corrects the direction of the search such that it always
points to the minimum, while the gradient descent points to
the maximum direction of change. These two directions may or may
not coincide (Figure 1-18).
Figure 1-18 Directions of the steepest descent and Newton's method
They coincide when the contour plots are circles, that is, when the largest and the smallest eigenvalue of the correlation matrix are the same. When the ratio of the largest to the smallest eigenvalue (the eigenvalue spread) increases, the slope of the performance surface in the two directions differs more and more. Thus, for large eigenvalue spreads the optimization path taken by gradient descent is normally much longer than the path taken by Newton's method. This implies that Newton's method will be faster than LMS when the input data correlation matrix has a large eigenvalue spread.
Another
advantage of Newton's method versus the steepest descent method
is in the time constant of adaptation. When the gradient is
multiplied by R
not only the direction of the gradient is being
changed but also the different eigenvalues in each direction
are being equalized. What this means is that Newton's method
is automatically correcting the time constant of adaptation for
each direction such that all the weights converge at the same
rate. Hence Newton's method has a single time constant of
adaptation, unlike the steepest descent method.
These
advantages of the Newton's method should not come as a surprise,
because Newton's method uses much more information about the
performance surface (the curvature). In fact, to implement
Newton's method we need to compute the inverse of the correlation
matrix, which takes significantly longer than the single
multiplication required by the LMS method and also requires
global information. Newton's method is also brittle, that is, if
the surface is not exactly quadratic, the method may diverge.
This is the reason Newton's method is normally modified to also
include a small step size
instead of using
=1 as in Eq.1.43.
(1.44)
Note
that x(k) is a vector and R
is a matrix, so the update for one weight influences
all the other inputs in the system. This is the reason that the
computations are no longer local to each weight. However, they
are not difficult if one assumes that the inverse of R is
known a priori. The algorithm or Eq. 1.44 is called the
LMS/Newton algorithm. The case where R
has to be estimated on-line is much more involved
and leads to the recursive
least squares (RLS) algorithm.
Alternatively, to improve convergence speed with the LMS, we can implement an orthogonalizing transformation of the input correlation function followed by an equalization of the eigenvalues which is called a whitening transformation. (see Appendix A section 3.18). Since Newton's method coincides with the steepest descent for performance surfaces that are symmetric, this preprocessing will make the LMS perform as Newton's method.
NEUROSOLUTIONS EXAMPLE 1.20
Newton's Method
In
this example, we implement Newton's method with a custom DLL. For
this example, we must compute R
and apply Eq. 1.44 to the
simulator. The autocorrelation function for this example is
so R
becomes (see the Appendix, section 3.11)
By
applying Newton's method to the learning algorithm, we have
essentially compensated for the eigenvalue spread. This means
that Newton's method behaves as the steepest descent for a
circular performance surface where the steepest descent direction
always points directly to the optimal value. Thus, although the
calculations are more complicated and more demanding (we need to
know compute R
), the convergence is much faster (in fact, the
algorithm can converge in one epoch). When we run the simulator,
notice that no matter where the algorithm starts, it always heads
directly toward the optimum.