
1.10 The Linear Regression Model

We started this chapter by pointing out the advantages of building models from experimental data. In the previous sections we developed a set of techniques that adapt the parameters of a linear system (the Adaline) to fit the relationship between the input (x) and the desired data (d) as well as possible. This is our first model, and it explains the relationship f(x, d) as a hyperplane that minimizes the sum of the squared residuals. We will have the opportunity to study other (nonlinear) models in later chapters.

It is instructive to stop and ask the question: How can we use the newly developed regression model? One interesting aspect of model building that we mentioned previously is the ability to predict the behavior of the experimental system. Basically, what this means is that once the Adaline is trained, we can forecast the value of d when x is available. We do this by computing the system output y and assuming that the error ε = d - y is small (Eq. 1.3). You can now understand why we want to minimize the square of the error: if the square of the error is small, d is going to be close to y in the training data. Figure 1-19 shows a productive way of looking at the input-output pairs that we used to train the Adaline.


Figure 1-19 A view of the desired response as the output of an unknown model


We assume that the experimental system produces the desired response d for each input x according to a rule that we do not know. The purpose of building the model is to approximate as well as possible this hidden relationship.

We also expect that, even for x values that were not used for training, y is going to be close to the corresponding unknown value d. Our intuition tells us that if:

· the data used for training covered all the possible cases well,

· we had enough training data, and

· the correlation coefficient is close to 1,

then in fact y should be close to the unknown value d. However, this is an inductive principle, which has no guarantee of being true. The ability to extrapolate the good performance from the training set to the test set is called generalization. Generalization is a central issue in the adaptive systems approach since it is the only guarantee that the model will perform well with the future data that will be presented to the system while in operation.
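The idea of generalization can be illustrated with a minimal sketch in Python (not NeuroSolutions). It assumes a one-input Adaline trained with the LMS rule on synthetic data; the "unknown" system d = 0.8x + 0.1, the noise level, the step size, and the 150/50 split are all hypothetical choices made only for illustration. The weights are adapted on the training set and then held fixed while the error is measured on samples that were never used for training.

```python
import random

def lms_train(xs, ds, step=0.1, epochs=200):
    """Train a one-input Adaline (y = w*x + b) with the LMS rule."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            err = d - (w * x + b)   # instantaneous error
            w += step * err * x     # LMS weight update
            b += step * err         # bias update (its input is fixed at 1)
    return w, b

def mse(xs, ds, w, b):
    """Mean square error with the weights held fixed (no learning)."""
    return sum((d - (w * x + b)) ** 2 for x, d in zip(xs, ds)) / len(xs)

# Synthetic data: a hypothetical "unknown" system d = 0.8*x + 0.1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ds = [0.8 * x + 0.1 + rng.gauss(0.0, 0.02) for x in xs]

# Hold out the last quarter of the data as a test set.
split = 150
w, b = lms_train(xs[:split], ds[:split])
train_mse = mse(xs[:split], ds[:split], w, b)
test_mse = mse(xs[split:], ds[split:], w, b)   # weights are frozen here
print(f"w={w:.3f} b={b:.3f} train MSE={train_mse:.5f} test MSE={test_mse:.5f}")
```

If the test error is of the same order as the training error, the model is generalizing on this data; a test error that is much larger suggests that the training set did not cover the cases well.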

Remember that in the test mode the system parameters must be kept constant, that is, the learning algorithm MUST be disabled. In the next section we familiarize ourselves with training and using the linear model.

10.1 Regression Project

Getting Real-World Data

We will end Chapter 1 by giving you a flavor of the power of linear regression to solve real-life problems. We will go to the World Wide Web and seek real data sets, import them into NeuroSolutions, and solve regression problems. We will use the breadboard from Example 1.7.

The first thing is to decide on the data with which we will work. There are many interesting Web sites to visit in the search for data. We suggest the following sites:

http://ferret.wrc.noaa.gov/fbin/climate_server

http://seamonkey.ed.asu.edu/~behrens/teach/WWW_data.html

These sites have plenty of data (some duplicated). We assume that you know how to connect to the Web and how to download data. You should get the data in ASCII and store it in column format with one of the variables (the independent variable) in the input file and the dependent variable in the output file. Alternatively, we have provided sample data on the CD-ROM under the Data directory. Read the readme file to choose the data sets that interest you.

NeuroSolutions Project

The fundamental question is to find out how well a linear relation explains the dependence between the input data and the desired data. We will exemplify the project with a one-dimensional set of input data, but the multidimensional case is similar.

The first thing to do is to modify the NeuroSolutions breadboard so that it will be able to work with the data you downloaded. The data should be stored in an ASCII file and formatted in columns. Right-click on the input file icon and select properties. The Inspector will appear on the screen. Remove the present file (click the remove button) and click on the add button. The Windows file Inspector will appear; open the file that contains the input data, that is, the input to your linear model.

In NeuroSolutions, the associate panel appears, which you can close (we assume that the input file has ASCII data in column format). The next panel that pops open is the customize panel. Here you select the columns that you want to use (for those columns that you do not want, select the column label and click the skip button), and then click the close button. The input file is now open and ready to be used by NeuroSolutions. You should repeat the procedure for the desired data file. Make sure that the number of samples is the same in the input and desired data files.
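If you want to check the files outside NeuroSolutions first, the following Python sketch reads one column from each file and verifies that the sample counts match. The helper name read_column and the file contents are hypothetical; the in-memory streams stand in for the input and desired files you downloaded.

```python
import io

def read_column(stream, column=0):
    """Read one numeric column from whitespace-separated ASCII text."""
    values = []
    for line in stream:
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        values.append(float(fields[column]))
    return values

# Stand-ins for the input and desired files (hypothetical contents).
input_file = io.StringIO("20.1\n21.4\n19.8\n22.0\n")
desired_file = io.StringIO("1012.3\n1010.9\n1013.0\n1009.7\n")

x = read_column(input_file)
d = read_column(desired_file)

# The two files must be synchronized: one desired value per input sample.
if len(x) != len(d):
    raise ValueError(f"sample count mismatch: {len(x)} inputs, {len(d)} desired")
print(len(x), "exemplars loaded")
```

With real data, replace the io.StringIO objects with files opened by open(path).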

Another thing that we should do is to normalize the data. Sometimes the input and desired variables have very different ranges so one should always normalize both the desired and input data files between 0 (or -1) and 1. To do this, go to the stream page (click on the stream tab) of the Inspector to access the normalization panel. Click the normalize check box and set the normalization.
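The arithmetic behind this kind of normalization is a linear map of each column onto the target interval. The sketch below is an illustration in Python, not the NeuroSolutions implementation; the function names and the sample temperatures are hypothetical. It computes a scale and an offset from the column minimum and maximum, then applies them.

```python
def fit_normalizer(values, lo=0.0, hi=1.0):
    """Return (scale, offset) so that scale*v + offset maps values onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("cannot normalize a constant column")
    scale = (hi - lo) / (vmax - vmin)
    offset = lo - scale * vmin
    return scale, offset

def normalize(values, scale, offset):
    """Apply the linear map v -> scale*v + offset to every value."""
    return [scale * v + offset for v in values]

temps = [19.8, 20.1, 21.4, 22.0]                 # hypothetical raw input values
scale, offset = fit_normalizer(temps, -1.0, 1.0)
temps_n = normalize(temps, scale, offset)
print(scale, offset, temps_n)
```

Keep the scale and offset: they are needed later to express results in the original units.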

We always recommend that you visually check the data, either with a plotting program or with the Scatter plot in NeuroSolutions, to ensure that no outliers are present. Outliers can distort any linear relationship that exists in the data.

Once the data sets are open, we can effectively start the adaptation of the linear regressor. The first important consideration is the largest step size that can be used for convergence. When the data is normalized we can always guess an initial value of 0.1. By plotting the learning curve or the weight tracks (if the problem has few input channels), we can judge how appropriate this value might be. Alternatively, we can compute the eigenvalues and find the exact largest possible step size, but this is rarely done. The trial-and-error method is OK for small problems.
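For a single input plus bias, the eigenvalue computation is small enough to do by hand. The sketch below is written in Python under stated assumptions: the cost is J = (1/2N) Σ ε², the weights are updated by batch steepest descent, and the bound is therefore η < 2/λ_max, where λ_max is the largest eigenvalue of the autocorrelation matrix R of the augmented input [x, 1]. Under other scalings of the cost the constant changes, and sample-by-sample LMS usually needs a step size well below this bound. The input values are hypothetical.

```python
import math

def autocorrelation_2x2(xs):
    """Entries (a, b, c) of R = [[a, b], [b, c]] for the augmented input [x, 1]."""
    n = len(xs)
    rxx = sum(x * x for x in xs) / n
    rx1 = sum(xs) / n
    return rxx, rx1, 1.0

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

xs = [i / 99 for i in range(100)]        # hypothetical inputs normalized to [0, 1]
lam_min, lam_max = eigenvalues_sym_2x2(*autocorrelation_2x2(xs))
eta_max = 2.0 / lam_max                  # bound under the stated assumptions
spread = lam_max / lam_min               # a large spread means slow convergence
print(f"lambda_max={lam_max:.4f} eta_max={eta_max:.3f} spread={spread:.1f}")
```

For inputs normalized to [0, 1], the initial guess of 0.1 falls comfortably below the bound computed here.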

If the problem takes a long time to converge and increasing the step size creates instability, then the eigenvalue spread is large, and there is little we can do short of using Newton's method.

After the algorithm converges (the error stabilizes), we should use the correlation coefficient DLL to estimate the correlation coefficient. Note that it is always possible to pass a hyperplane through some data points, but the real issue is whether the hyperplane provides a good model. To answer this question one needs to estimate the correlation coefficient.
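The correlation coefficient can also be computed directly from the data. The following Python sketch uses the standard Pearson formula; it is not the NeuroSolutions DLL, and the two small data sets are hypothetical. One desired response is exactly linear in x, the other has no linear trend at all.

```python
import math

def correlation_coefficient(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

x = [0.0, 0.25, 0.5, 0.75, 1.0]
d_linear = [0.1, 0.3, 0.5, 0.7, 0.9]      # exactly linear in x
d_curved = [0.0, 0.9, 1.0, 0.9, 0.0]      # symmetric bump, no linear trend

r_linear = correlation_coefficient(x, d_linear)
r_curved = correlation_coefficient(x, d_curved)
print(r_linear, r_curved)
```

A regression line can be fitted to both data sets, but only the first has a correlation coefficient near 1; for the second the coefficient is near 0, so the line explains almost none of the variation in d.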

For the multiple-variable regression case, the relative weight magnitude tells us about the relative importance of each variable in the regression equation (for normalized inputs). It is therefore interesting to look at the values of the regression weights, including the bias. Remember that if the data is normalized, the displayed weight values must be "unnormalized" in order to compare them with the original data. You can find the values that NeuroSolutions used to normalize the data by going to the dataset page of the Inspector and opening the normalization file. NeuroSolutions multiplies the data set by the first value in the normalization file (range) and adds the second value in the normalization file (offset). To reverse this process, you must subtract the second value and then divide by the first value.
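For a one-input model, the reversal can be carried through to the weights themselves. The Python sketch below assumes that both files were normalized by a multiply-then-add map, x_n = x_scale*x + x_offset and d_n = d_scale*d + d_offset, and that the trained model is d_n ≈ w_n*x_n + b_n. Substituting and solving for d gives the slope and bias in original units. The function name and all numeric values are hypothetical.

```python
def unnormalize_line(w_n, b_n, x_scale, x_offset, d_scale, d_offset):
    """Convert a line fitted on normalized data back to original units.

    Assumes x_n = x_scale*x + x_offset and d_n = d_scale*d + d_offset,
    and that the trained model is d_n ~ w_n*x_n + b_n.
    """
    w = w_n * x_scale / d_scale
    b = (w_n * x_offset + b_n - d_offset) / d_scale
    return w, b

# Hypothetical normalization coefficients and trained weights.
x_scale, x_offset = 0.5, -10.0
d_scale, d_offset = 0.1, -101.0
w_n, b_n = 0.4, 0.2

w, b = unnormalize_line(w_n, b_n, x_scale, x_offset, d_scale, d_offset)

# Check on one raw input: both routes must give the same prediction.
x_raw = 21.0
y_n = w_n * (x_scale * x_raw + x_offset) + b_n   # predict in normalized units
y_via_normalized = (y_n - d_offset) / d_scale    # subtract offset, divide by scale
y_direct = w * x_raw + b                         # predict with unnormalized weights
print(w, b, y_via_normalized, y_direct)
```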

Remember that the parameters of the regression equation can be used to predict the system output when only the input is known. We can do this by testing the system with another data file for which we do not have a desired response. To do this in NeuroSolutions, you should go to the Controller Inspector (the yellow dial) and turn off the learning check box; this fixes the weights.

No problem is finalized without a critical assessment of the results obtained. You should start with a hypothesis about the data relationship and confirm your hypothesis with NeuroSolutions results. If there is a discrepancy between what you expect and the results, you must explain it. This is where the NeuroSolutions probes are very effective. You should verify that the data is being properly read, whether the input and output files are synchronized, whether the system is converging (weight tracks, learning curve), and so on. Computers are great tools, but they are very susceptible to the garbage in, garbage out syndrome, so it is the user's responsibility to check the input and the methodology of data analysis.

NEUROSOLUTIONS EXAMPLE 1.21

Linear Regression Project

We will illustrate the project with a regression between two time series: the sea temperature and the atmospheric pressure downloaded from the NOAA site (climate database). We will start with the breadboard from Example 1.7. We replace the input with the file containing the sea temperature and replace the desired response data with the file containing the pressure data. NeuroSolutions automatically sets the number of inputs from the file (verify this in the File Inspector), and the number of exemplars in an epoch (verify this in the Controller Inspector). We also have to decide how many iterations we need. In the Controller Inspector, enter 1,000 in the Epochs/Run field. This number may be too large, but when the weights or the MSE stabilize, we can always interrupt the simulation. Experiment with everything we have learned in this chapter.
