
1.3 Least Squares

We face a problem when trying to fit a straight line to the noisy observations of Table 1-1. A single line will fit any two observations (two points define a line), but it is unlikely that all points will fall on exactly the same line. Since no single line will fit every point, a global property of the points is needed to find the best fit. The problem of fitting a line to noisy data can be formulated as follows: What is the best choice of w, b such that the fitted line passes the closest to all the points?

Least squares solves the problem of fitting a line to data by finding the line for which the sum of the squared deviations (or residuals) in the d direction (the noisy variable direction) is minimized. The fitted points on the line will be denoted by $\tilde{d}_i = b + w x_i$. The residuals are defined as $\varepsilon_i = d_i - \tilde{d}_i$. The fitted points $\tilde{d}_i$ can also be interpreted as approximated values of $d_i$ estimated by a linear model when the input $x_i$ is known:

$$d_i = \tilde{d}_i + \varepsilon_i = b + w x_i + \varepsilon_i$$ (1.3)


Figure 1-5 Regression line showing the deviations


This linear model will be called the linear regressor. Estimated quantities will be denoted by the tilde (~) throughout the book. The outputs of the linear system of Figure 1-4 are the fitted points, $\tilde{d}_i = y_i$ in Figure 1-5. To pick the line that best fits the data, we need a criterion to determine which linear estimator is the "best." The average sum of square errors J (also called the mean square error (MSE)) is a widely used performance criterion given by

$$J = \frac{1}{2N} \sum_{i=1}^{N} \varepsilon_i^2$$ (1.4)

where N is the number of observations. To simplify the notation, we sometimes drop the upper limit of the sum.
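As a rough illustration of how the criterion is evaluated for one candidate line, the sketch below computes J for a given slope and bias. The function name `mse` and the four data points are hypothetical (they are not the values of Table 1-1), and the 1/(2N) normalization is an assumption about Eq. 1.4; with 1/N instead, every value doubles but the best line is unchanged.

```python
def mse(xs, ds, w, b):
    """Return J for the line b + w*x, assuming the 1/(2N) normalization."""
    n = len(xs)
    # Residuals: desired response minus the fitted point on the line.
    residuals = [d - (b + w * x) for x, d in zip(xs, ds)]
    return sum(e * e for e in residuals) / (2 * n)


# Hypothetical data for illustration only (not the values of Table 1-1).
xs = [1.0, 2.0, 3.0, 4.0]
ds = [1.0, 3.0, 2.0, 4.0]

print(mse(xs, ds, w=1.0, b=0.0))  # 0.25
print(mse(xs, ds, w=0.0, b=2.5))  # 0.625
```

The line with slope 1 and bias 0 gives the smaller value of the two, so under this criterion it is the better fit to these points.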

NEUROSOLUTIONS EXAMPLE 1.2

Computing the MSE for the Linear PE

To create a simulation that displays the MSE, we have to add a new component to the Breadboard, the L2 Criterion. The L2 Criterion implements the mean square error of Eq. 1.4. It requires two inputs to compute the MSE: the system output and the desired response. We will attach the L2 Criterion to the output of the linear PE (system output) and attach a File Input component to the L2 Criterion to load the values of the desired response from Table 1-1. In order to visualize the MSE, we will place a Matrix Viewer probe over the L2 Criterion (cost access point). This Matrix Viewer simply displays the data from the component that it resides over, which in this case is the mean square error.


Run the demonstration and try to set the slope and bias to minimize the mean square error. Compute the error by hand according to Eq. 1.4 and see whether it matches the value displayed.


Our goal is to minimize J analytically, which according to Gauss can be done by taking its partial derivatives with respect to the unknowns and setting the resulting expressions to zero:

$$\frac{\partial J}{\partial b} = 0, \qquad \frac{\partial J}{\partial w} = 0$$ (1.5)

which yields, after some manipulation,

$$b = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N \sum_i (x_i - \bar{x})^2}, \qquad w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}$$ (1.6)

where an overbar represents the variable's mean value; for example, $\bar{x} = \frac{1}{N} \sum_i x_i$.

This procedure to determine the coefficients of the line is called the least square method. If we apply these equations to the data of Table 1-1, we get the regression equation (best line through the data)

[Equation image not available: the numerical regression line obtained for the data of Table 1-1] (1.7)

The least square computation for a large data set is time-consuming, even with a computer.
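The closed-form solution can be sketched in a few lines of code. This is a minimal illustration rather than the book's implementation: the function name `least_squares_line` and the data are hypothetical (not the values of Table 1-1), the slope is the ratio of the centered cross-products to the centered squares of x, and the bias is computed as the mean of d minus the slope times the mean of x, which is what setting the derivative of J with respect to b to zero gives.

```python
def least_squares_line(xs, ds):
    """Return (w, b) minimizing the sum of squared residuals d_i - (b + w*x_i)."""
    n = len(xs)
    x_bar = sum(xs) / n
    d_bar = sum(ds) / n
    # Sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Sum of cross-products of the deviations of x and d.
    sxd = sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds))
    w = sxd / sxx
    # From dJ/db = 0: the best line passes through the point of means.
    b = d_bar - w * x_bar
    return w, b


# Hypothetical data for illustration only (not the values of Table 1-1).
xs = [1.0, 2.0, 3.0, 4.0]
ds = [1.0, 3.0, 2.0, 4.0]

w, b = least_squares_line(xs, ds)
print(f"slope = {w:.3f}, bias = {b:.3f}")  # slope = 0.800, bias = 0.500
```

Each sum is a single pass over the data, so the cost grows linearly with the number of observations.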

NEUROSOLUTIONS EXAMPLE 1.3

Finding the Minimum Error by Trial and Error

Enter these values for the slope and bias by typing them in the respective Edit boxes. Verify that the error is smallest with these values. Change the values slightly (in either direction) and see that the MSE increases. Enter a negative slope and see how much the error increases. For the negative slope, what is the value of the bias that gives the smallest error? Note that when one of the coefficients is wrong, the value of the other that gives the best performance also changes; that is, the two coefficients are coupled.
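The coupling between the two coefficients can also be checked numerically. The sketch below is an illustration outside NeuroSolutions, with hypothetical function names and data (not the values of Table 1-1) and an assumed 1/(2N) normalization for the MSE. When the slope is held fixed at w, setting the derivative of J with respect to b to zero gives the best bias as the mean of d minus w times the mean of x, so the best bias moves whenever the slope does.

```python
def mse(xs, ds, w, b):
    """MSE of the line b + w*x, assuming the 1/(2N) normalization."""
    n = len(xs)
    return sum((d - (b + w * x)) ** 2 for x, d in zip(xs, ds)) / (2 * n)


def best_bias_for_slope(xs, ds, w):
    """Bias that minimizes the MSE when the slope is held fixed at w."""
    n = len(xs)
    # Setting dJ/db = 0 with w fixed gives b = mean(d) - w * mean(x).
    return sum(ds) / n - w * sum(xs) / n


# Hypothetical data for illustration only (not the values of Table 1-1).
xs = [1.0, 2.0, 3.0, 4.0]
ds = [1.0, 3.0, 2.0, 4.0]

for w in (0.8, 0.0, -1.0):
    b = best_bias_for_slope(xs, ds, w)
    print(f"slope {w:+.1f}: best bias {b:.2f}, MSE {mse(xs, ds, w, b):.3f}")
```

For this data the best bias is 0.50 at the optimal slope of 0.8 but 5.00 at a slope of -1.0, and the smallest error reachable with the wrong slope is ten times larger than the minimum.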

It is important to explore the NeuroSolutions breadboards. The best way to accomplish this is to open the Inspector associated with each icon. Select a component with the mouse, then press the right mouse button and select Properties. The Inspector will appear on the screen. The Inspector has fields that allow us to configure the NeuroSolutions components and tell us what settings are being used. For instance, go to the input Axon and open the Inspector. You will see that it has one input, one output, and no weights (go to the Soma level to look at the weights). If you do the same in the Synapse, you will see that it also has a single input and output and one weight, which happens to be our slope parameter. The Bias Axon has a single input, a single output, and a single weight, which is the system's bias.

The large barrel on the input Axon is a probe that collects data. Since the barrel is placed on the Activity point, it is storing the 12 data samples that are injected into the network. This is exactly what gets displayed on the x axis of the Scatter Plot. The y axis values are sent from the L2 Criterion by the small barrel (a Data Transmitter). So the Scatter Plot is effectively displaying the pairs of points $(x_i, d_i)$. Likewise, it is also displaying the output of the system in blue, i.e., the pairs of points $(x_i, y_i)$.

If you want to know what the component is and what it does, just go to the NeuroSolutions control bar, select the arrow icon with the question mark, and click on the component that you want to know about (this is called context-sensitive help).


1.3.1 Correlation Coefficient

We have found a way to compute the regression equation, but we still do not have a measure of how successfully the regression line represents the relationship between x and d. The size of the mean square error (MSE) can be used to determine which line best fits the data, but it doesn't necessarily reflect whether a line fits the data tightly because the MSE depends on the magnitude of the data samples. For instance, by simply scaling the data, we can change the MSE without changing how well the data is fit by the regression line. The correlation coefficient (r) solves this problem. By definition, the correlation coefficient between two random variables x and d is

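$$r = \frac{\frac{1}{N} \sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sqrt{\frac{1}{N} \sum_i (d_i - \bar{d})^2} \, \sqrt{\frac{1}{N} \sum_i (x_i - \bar{x})^2}}$$ (1.8)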

The numerator is the covariance of the two variables (see Appendix A, Section 4.11), and the denominator is the product of the corresponding standard deviations (the square roots of the variances).

The correlation coefficient is confined to the range [-1, 1]. When r = 1 there is a perfect positive linear correlation between x and d; that is, they covary, and all the points lie exactly on a line of positive slope. When r = -1 there is a perfect negative linear correlation between x and d; that is, they vary in opposite ways (when x increases, d decreases proportionally). When r = 0 there is no linear correlation between x and d, and the variables are called uncorrelated. Intermediate values describe partial correlations. In our example r = 0.88, which means that the fit of the linear model to the data is reasonably good. Notice that the correlation coefficient is a property of the data, as we can see from Eq. 1.8 (it is independent of the model). However, its square, $r^2$, also represents the fraction of the variance of d captured by the optimal linear regression.
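The difference between the MSE and the correlation coefficient under scaling can be seen in a small numerical sketch. The function names and the data below are hypothetical (not the values of Table 1-1), and the MSE again assumes the 1/(2N) normalization. Multiplying every desired value by 10 leaves r unchanged but multiplies the minimum MSE by 100.

```python
import math


def correlation(xs, ds):
    """Correlation coefficient: covariance over the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    d_bar = sum(ds) / n
    cov = sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds)) / n
    std_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    std_d = math.sqrt(sum((d - d_bar) ** 2 for d in ds) / n)
    return cov / (std_x * std_d)


def min_mse(xs, ds):
    """MSE of the least squares line, assuming the 1/(2N) normalization."""
    n = len(xs)
    x_bar = sum(xs) / n
    d_bar = sum(ds) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds)) / sxx
    b = d_bar - w * x_bar
    return sum((d - (b + w * x)) ** 2 for x, d in zip(xs, ds)) / (2 * n)


# Hypothetical data for illustration only (not the values of Table 1-1).
xs = [1.0, 2.0, 3.0, 4.0]
ds = [1.0, 3.0, 2.0, 4.0]
ds_scaled = [10.0 * d for d in ds]  # same shape, ten times the magnitude

print(f"original: r = {correlation(xs, ds):.3f}, MSE = {min_mse(xs, ds):.3f}")
print(f"scaled:   r = {correlation(xs, ds_scaled):.3f}, MSE = {min_mse(xs, ds_scaled):.3f}")
```

For this data r is 0.8 in both cases, while the minimum MSE goes from 0.225 to 22.5, which is why the MSE alone cannot say how tightly a line fits.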

The method of least squares is very powerful. Estimation theory says that, when the noise is zero-mean, uncorrelated, and of equal variance (the Gauss-Markov conditions), the least square estimator is the best linear unbiased estimator (BLUE), since it has no bias and has minimal variance among all linear unbiased estimators. Least squares can be generalized to higher-order polynomial curves, such as quadratics, cubics, and so on. In this case, regression models that are nonlinear in x are obtained, although they remain linear in the coefficients. More coefficients need to be computed, but the methodology still applies. Regression can also be extended to multiple variables, as we will do later in the chapter (Section 1.7, Regression for Multiple Variables). The dependent variable d in multiple variable regression is a function of a vector $\mathbf{x} = [x_1, \ldots, x_D]^T$, where T means the transpose and D is the number of inputs. In this book, vectors are denoted by bold letters. In the multivariate case, the regression line becomes a hyperplane in the space $x_1, x_2, \ldots, x_D$.
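To make the polynomial generalization concrete, the sketch below fits a polynomial of a chosen degree by setting the partial derivative of the squared error with respect to each coefficient to zero and solving the resulting linear system (the normal equations). This is one standard way to do it, not necessarily the procedure the book has in mind; the function names and the data are hypothetical, and the plain Gaussian elimination is meant for small, well-conditioned problems only.

```python
def solve_linear_system(a, y):
    """Solve a * c = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the data cannot determine the fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (m[r][n] - tail) / m[r][r]
    return coeffs


def polyfit_least_squares(xs, ds, degree):
    """Return [c0, c1, ..., c_degree] minimizing sum((d_i - sum_k c_k * x_i**k)**2)."""
    size = degree + 1
    # Normal equations: sum_k c_k * sum_i x_i**(j+k) = sum_i d_i * x_i**j.
    a = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    y = [sum(d * x ** j for x, d in zip(xs, ds)) for j in range(size)]
    return solve_linear_system(a, y)


# Hypothetical noise-free samples of d = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ds = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]

print(polyfit_least_squares(xs, ds, degree=2))  # approximately [1.0, 2.0, 3.0]
```

With `degree=1` the same function returns the bias and slope of the ordinary regression line, which shows that the straight-line case is a special instance of the same methodology.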
