1.3 Least Squares
We face a problem when trying to fit a straight line to the noisy observations of Table 1-1. A single line will fit any two observations (two points define a line), but it is unlikely that all points will fall on exactly the same line. Since no single line will fit every point, a global property of the points is needed to find the best fit. The problem of fitting a line to noisy data can be formulated as follows: What is the best choice of w, b such that the fitted line passes the closest to all the points?
Least squares solves the problem of fitting a line to data by finding the line for which the sum of the squared deviations (or residuals) in the d direction (the noisy variable direction) is minimized. The fitted points on the line will be denoted by $\tilde{d}_i = b + w x_i$. The residuals are defined as $\varepsilon_i = d_i - \tilde{d}_i$. The fitted points $\tilde{d}_i$ can also be interpreted as approximated values of $d_i$ estimated by a linear model when the input $x_i$ is known:
$$d_i = b + w x_i + \varepsilon_i = \tilde{d}_i + \varepsilon_i \qquad (1.3)$$
Figure 1-5 Regression line showing the deviations
This linear model will be called the linear regressor. Estimated quantities will be denoted by the tilde (~) throughout the book. The outputs of the linear system of Figure 1-4 are the fitted points, $\tilde{d}_i = y_i$ in Figure 1-5.
To pick the line that best fits the data, we need a criterion to determine which linear estimator is the "best." The average of the squared errors J (also called the mean square error (MSE)) is a widely used performance criterion given by
$$J = \frac{1}{N}\sum_{i=1}^{N} (d_i - b - w x_i)^2 \qquad (1.4)$$
where N is the number of observations. To simplify the notation, we sometimes drop the top index in the sum.
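As a rough illustration, the criterion can be evaluated for any candidate slope and bias with a few lines of Python. This is only a sketch: it takes Eq. 1.4 to be the plain average of the squared residuals, as the text describes it (a constant factor in front of the sum would not change which line is best), and the function name `mse` and the four observations are made up for the example; they are not the values of Table 1-1.

```python
def mse(x, d, w, b):
    """Average of the squared residuals for the line d~ = b + w*x."""
    if len(x) != len(d) or not x:
        raise ValueError("x and d must be non-empty and of equal length")
    # Residual: desired response minus the fitted point on the line.
    residuals = [di - (b + w * xi) for xi, di in zip(x, d)]
    return sum(e * e for e in residuals) / len(residuals)


# Hypothetical observations (NOT the data of Table 1-1).
x = [1.0, 2.0, 3.0, 4.0]
d = [1.1, 1.9, 3.2, 3.8]

good = mse(x, d, w=1.0, b=0.0)  # line close to the data
poor = mse(x, d, w=0.5, b=0.0)  # slope too small
print(good, poor)
```

Trying a few (w, b) pairs this way mirrors the trial-and-error search that the simulation invites: the pair with the smaller returned value is the better line under this criterion.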
NEUROSOLUTIONS EXAMPLE 1.2
Computing the MSE for the Linear PE
To create a simulation that displays the MSE, we have to add a new component to the Breadboard, the L2 Criterion. The L2 Criterion implements the mean square error of Eq. 1.4. It requires two inputs to compute the MSE: the system output and the desired response. We will attach the L2 Criterion to the output of the linear PE (system output) and attach a File Input component to the L2 Criterion to load in the value of the desired response from Table 1-1. In order to visualize the MSE, we will place a Matrix Viewer probe over the L2 Criterion (cost access point). This Matrix Viewer simply displays the data from the component that it resides over; in this case, the mean square error.
Run the demonstration and try to set the slope and bias to minimize the mean square error. Compute the error by hand according to Eq. 1.4 and see whether it matches the value displayed.
Our goal is to minimize J analytically, which according to Gauss can be done by taking its partial derivatives with respect to the unknowns and setting them equal to zero:
$$\frac{\partial J}{\partial b} = 0, \qquad \frac{\partial J}{\partial w} = 0 \qquad (1.5)$$
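which yields, after some manipulation,
$$w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{d} - w\,\bar{x} \qquad (1.6)$$
where an overbar represents the variable's mean value; for example, $\bar{x} = \frac{1}{N}\sum_i x_i$.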
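The manipulation can be sketched as follows. This is one standard route, not necessarily the one the authors had in mind; it only uses the fact that J is a positive constant times the sum of squared residuals $d_i - b - w x_i$, so the constant drops out when the derivatives are set to zero.

```latex
% Derivative with respect to the bias b:
\frac{\partial J}{\partial b} = 0
  \;\Rightarrow\; \sum_i (d_i - b - w x_i) = 0
  \;\Rightarrow\; b = \bar{d} - w\,\bar{x}

% Derivative with respect to the slope w:
\frac{\partial J}{\partial w} = 0
  \;\Rightarrow\; \sum_i (d_i - b - w x_i)\,x_i = 0

% Substitute b = \bar{d} - w\bar{x}:
\sum_i \bigl[(d_i - \bar{d}) - w\,(x_i - \bar{x})\bigr]\,x_i = 0

% The bracketed deviations sum to zero, so x_i may be replaced
% by (x_i - \bar{x}) without changing the sum:
\sum_i \bigl[(d_i - \bar{d}) - w\,(x_i - \bar{x})\bigr]\,(x_i - \bar{x}) = 0
  \;\Rightarrow\;
  w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}
```

The first result also makes the coupling between the two coefficients explicit: for any fixed slope w, the bias that gives the smallest error is $\bar{d} - w\,\bar{x}$, so changing one coefficient changes the best value of the other.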
This procedure to determine the coefficients of the line is called the least squares method. If we apply these equations to the data of Table 1-1, we get the regression equation (best line through the data)
(1.7)
The least squares computation for a large data set is time-consuming, even with a computer.
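A minimal sketch of the closed-form computation is given below, assuming the slope is the ratio of the summed cross-deviations to the summed squared deviations of x, and the bias follows from the means. The name `fit_line` and the five data points are hypothetical; they are not the data of Table 1-1, so the printed line is not Eq. 1.7.

```python
def fit_line(x, d):
    """Return (w, b) of the least squares line d~ = b + w*x."""
    n = len(x)
    if n != len(d) or n < 2:
        raise ValueError("need at least two (x, d) pairs of equal length")
    x_bar = sum(x) / n
    d_bar = sum(d) / n
    # Sum of squared deviations of x and sum of cross-deviations.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxd = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
    w = sxd / sxx
    b = d_bar - w * x_bar
    return w, b


# Hypothetical observations (NOT the data of Table 1-1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
d = [2.1, 3.9, 6.2, 7.8, 10.0]

w, b = fit_line(x, d)
print(f"fitted line: d~ = {b:.3f} + {w:.3f} x")
```

The function makes two passes over the data (one for the means, one for the deviations), and the coefficients it returns are the kind of values one would type into the slope and bias Edit boxes when checking a fit by hand.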
NEUROSOLUTIONS EXAMPLE 1.3
Finding the Minimum Error by Trial and Error
Enter these values for the slope and bias by typing them in the respective Edit boxes. Verify that with these values the error is the smallest. Change the values slightly (in either direction) and see that the MSE increases. Enter a negative slope and see how much the error increases. For the negative slope, what is the value of the bias that gives the smallest error? Note that when one of the coefficients is wrong, the value of the other for best performance is also wrong; that is, the two coefficients are coupled.
It is important to explore the NeuroSolutions breadboards. The best way to accomplish this is to open the Inspector associated with each icon. Select a component with the mouse. Then press the right mouse button, and select Properties. The Inspector will appear on the screen. The Inspector has fields that allow us to configure the NeuroSolutions components and tell us what settings are being used. For instance, go to the input Axon and open the Inspector. You will see that it has one input, one output, and no weights (go to the Soma level to look at the weights). If you do the same in the Synapse, you will see that it also has a single input and output and one weight, which happens to be our slope parameter. The Bias Axon has a single input, a single output, and a single weight, which is the system's bias.
The large barrel on the input Axon is a probe that collects data. Since the barrel is placed on the Activity point, it is storing the 12 data samples that are injected into the network. This is exactly what gets displayed on the x axis of the Scatter Plot. The y axis values are sent from the L2 Criterion by the small barrel (a Data Transmitter). So the Scatter Plot is effectively displaying the pairs of points (xi,di). It also displays the output of the system in blue, i.e., the pairs of points (xi,yi).
If you want to know what the component is and what it does, just go to the NeuroSolutions control bar, select the arrow icon with the question mark, and click on the component that you want to know about (this is called context-sensitive help).
1.3.1 Correlation Coefficient
We have found a way to compute the regression equation, but we still do not have a measure of how successfully the regression line represents the relationship between x and d. The size of the mean square error (MSE) can be used to determine which line best fits the data, but it doesn't necessarily reflect whether a line fits the data tightly because the MSE depends on the magnitude of the data samples. For instance, by simply scaling the data, we can change the MSE without changing how well the data is fit by the regression line. The correlation coefficient (r) solves this problem. By definition, the correlation coefficient between two random variables x and d is
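$$r = \frac{\frac{1}{N}\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sqrt{\frac{1}{N}\sum_i (d_i - \bar{d})^2}\;\sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2}} \qquad (1.8)$$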
The numerator is the covariance of the two variables (see Appendix A, Section 4.11), and the denominator is the product of the corresponding standard deviations (the square roots of the variances).
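The definition translates directly into code: covariance divided by the product of the two standard deviations. The sketch below is illustrative only; the function name `correlation` and the data are assumptions made for the example (not the values of Table 1-1). It also checks the point made above, that rescaling the data leaves r unchanged even though it would change the MSE.

```python
import math


def correlation(x, d):
    """Correlation coefficient: covariance over the product of std deviations."""
    n = len(x)
    if n != len(d) or n < 2:
        raise ValueError("need at least two (x, d) pairs of equal length")
    x_bar = sum(x) / n
    d_bar = sum(d) / n
    cov = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d)) / n
    std_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    std_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / n)
    if std_x == 0.0 or std_d == 0.0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / (std_x * std_d)


# Hypothetical observations (NOT the data of Table 1-1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
d = [2.1, 3.9, 6.2, 7.8, 10.0]

r = correlation(x, d)
# Scaling d by 10 multiplies every residual by 10, but r stays the same.
r_scaled = correlation(x, [10.0 * di for di in d])
print(r, r_scaled)
```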
The correlation coefficient is confined to the range [-1,1]. When r = 1 there is a perfect positive linear correlation between x and d, that is, they covary, which means that they vary together in fixed proportion. When r = -1, there is a perfect negative linear correlation between x and d, that is, they vary in opposite ways (when x increases, d decreases proportionally). When r = 0 there is no correlation between x and d, i.e., the variables are called uncorrelated. Intermediate values describe partial correlations. In our example r = 0.88, which means that the fit of the linear model to the data is reasonably good. Notice that the correlation coefficient is a property of the data, as we can see from Eq. 1.8 (it is independent of the model). However, the value $r^2$ also represents the fraction of the variance in the data captured by the optimal linear regression.
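The last statement can be checked numerically. The sketch below fits the least squares line, measures the fraction of the variance of d that the line captures (one minus the ratio of the residual sum of squares to the total sum of squares about the mean), and compares it with the square of r. The helper names and the data are hypothetical, chosen only for illustration.

```python
import math


def fit_line(x, d):
    """Least squares slope and bias for d~ = b + w*x."""
    n = len(x)
    x_bar, d_bar = sum(x) / n, sum(d) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxd = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
    w = sxd / sxx
    return w, d_bar - w * x_bar


def correlation(x, d):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    x_bar, d_bar = sum(x) / n, sum(d) / n
    cov = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d)) / n
    std_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    std_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / n)
    return cov / (std_x * std_d)


def captured_variance(x, d):
    """Fraction of the variance of d explained by the least squares line."""
    w, b = fit_line(x, d)
    d_bar = sum(d) / len(d)
    residual_ss = sum((di - (b + w * xi)) ** 2 for xi, di in zip(x, d))
    total_ss = sum((di - d_bar) ** 2 for di in d)
    return 1.0 - residual_ss / total_ss


# Hypothetical observations (NOT the data of Table 1-1).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
d = [1.2, 2.3, 2.9, 4.4, 4.8, 6.1]

r = correlation(x, d)
fraction = captured_variance(x, d)
print(r ** 2, fraction)  # the two values agree up to rounding
```

Note that the agreement holds only for the optimal line; for any other slope and bias the captured fraction is smaller than the square of r.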
The method of least squares is very powerful. Estimation theory says that the least squares estimator is the best linear unbiased estimator (BLUE), since it has no bias and has minimal variance among all linear unbiased estimators. Least squares can be generalized to higher-order polynomial curves, such as quadratics, cubics, and so on (generalized least squares). In this case, regression models that are nonlinear in x are obtained. More coefficients need to be computed, but the methodology still applies. Regression can also be extended to multiple variables, as we will do later in the chapter (Section 1.7, Regression for Multiple Variables). The dependent variable d in multiple variable regression is a function of a vector $\mathbf{x} = [x_1, \ldots, x_D]^T$, where T means the transpose and D is the number of inputs. In this book, vectors are denoted by bold letters. In the multivariate case, the regression line becomes a hyperplane in the space $x_1, x_2, \ldots, x_D$.
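To give a flavor of these extensions, the sketch below fits a hyperplane with D inputs by solving the normal equations with a small Gaussian elimination routine. This is an assumption about method: the normal equations are a standard route that this section does not spell out, and the chapter's own treatment in Section 1.7 may differ. The function names and the data are hypothetical.

```python
def solve_linear_system(a, y):
    """Solve a p = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    # Augmented matrix, copied so the caller's data is not modified.
    m = [row[:] + [yi] for row, yi in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: inputs are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    p = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * p[c] for c in range(r + 1, n))
        p[r] = (m[r][n] - s) / m[r][r]
    return p


def fit_hyperplane(xs, d):
    """Return [b, w1, ..., wD] minimizing the sum of squared residuals."""
    # Prepend a constant 1 to each input vector so the bias is fitted too.
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    # Normal equations: (R^T R) p = R^T d, with R the matrix of rows.
    rtr = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rtd = [sum(r[i] * di for r, di in zip(rows, d)) for i in range(k)]
    return solve_linear_system(rtr, rtd)


# Hypothetical noiseless data generated from d = 1 + 2*x1 - 3*x2.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
d = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in xs]

b, w1, w2 = fit_hyperplane(xs, d)
print(b, w1, w2)
```

The same function covers the polynomial case: a quadratic in a single variable x can be fitted by passing the pairs (x, x*x) as the input vectors, since the model remains linear in its coefficients.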