Section 5.2 The Least Squares Problem
Download the Excel data set from Moodle. We're going to develop a way to fit the data sets using linear algebra. Before doing the linear algebra versions of this problem, though, we'll use Excel to fit the data sets. Our goal is to guess the type of polynomial that models the data (the type is given in this case) and to minimize the error between our guess and the data. Follow these steps in Excel to find best-fitting curves. We'll start with the linear data.
In column C set up a function for your guess of the model based on the \(x\) data in column A. You will need to set up cells for the polynomial parameters (for the linear case these are the slope and the \(y\)-intercept).
Fill in your guesses based on initial approximations for the slope and \(y\)-intercept.
Use column D to calculate the residual for each data point: Residual = actual \(y\) value \(-\) approximate \(y\) value.
Use column E to calculate the square of the residual for each data point.
Our goal is to minimize the sum of the squares of the residuals. \begin{equation*} \text{ min } \left( \sum_{j=1}^n \left( y_j - \hat{y}_j \right)^2 \right) \end{equation*} For this we can use the Excel Solver to minimize the sum of the squared residuals.
Now repeat the process for the quadratic and cubic data.
Now we'll use linear algebra to complete the same types of problems. Set up a MATLAB script that reads the linear data from the Excel sheet. We are assuming that the data are linear, so for each \(x\)-value \(x_j\) we assume \begin{equation*} c_0 + c_1 x_j \approx y_j. \end{equation*}
This gives rise to a system of linear equations \begin{equation*} \begin{pmatrix}1 \amp x_1 \\ 1 \amp x_2 \\ 1 \amp x_3 \\ \vdots \amp \vdots \\ 1 \amp x_n \end{pmatrix} \begin{pmatrix}c_0 \\ c_1 \end{pmatrix} \approx \begin{pmatrix}y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} \end{equation*}
We'll call the left-hand matrix \(A\) and the right-hand vector \(\by\).
Get \(A\) and \(\by\) into MATLAB so we can use them.
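Here is a minimal sketch of the read-and-build step. The file name and column layout are assumptions; adjust them to match your download.
\begin{verbatim}
% Read the linear data; file name and column layout are assumptions.
data = readmatrix('LeastSquaresData.xlsx');   % x in column A, y in column B
x = data(:,1);
y = data(:,2);
n = length(x);
A = [ones(n,1), x];   % j-th row is [1, x_j]
\end{verbatim}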
This is not a square system. In fact, it is overdetermined. Consider that the column space of \(A\) (\(Col(A)\)) is a subspace of \(\mathbb{R}^n\) and that \(\by \in \mathbb{R}^n\) is likely not in the column space of \(A\). If we project \(\by\) onto \(Col(A)\) we should be able to find the best approximation of \(\by\) that lies in \(Col(A)\). Projections onto a subspace can be achieved with matrix multiplication, but which matrix shall we multiply by … \begin{equation*} (??) A \begin{pmatrix}c_0 \\ c_1 \end{pmatrix} = (??) \by. \end{equation*} Once you are satisfied that the right-hand side is a projection of \(\by\) onto \(Col(A)\) you have formed the normal equations. If you've done everything right then you also now have a square system (\(2 \times 2\) in this case).
Now that you have a square system, you can solve for \(c_0\) and \(c_1\) using your \(LU\) code. (You can also use MATLAB's backslash (\) command to do the linear solve.)
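As a sketch (swap in your own \(LU\) solver if you prefer), once you have formed the normal equations the solve is one line:
\begin{verbatim}
% Solve the (2 x 2) normal equations A'*A*c = A'*y with backslash.
c  = (A'*A) \ (A'*y);
c0 = c(1);   % y-intercept
c1 = c(2);   % slope
\end{verbatim}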
Now that you have \(c_0\) and \(c_1\) you can form the linear approximation. Plot the approximation along with your data. Also write code to plot the residuals (subplot would be great here).
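One possible plotting sketch, assuming the variables x, y, A, and c from above:
\begin{verbatim}
yhat = A*c;                          % fitted y values
subplot(2,1,1)
plot(x, y, 'ko', x, yhat, 'b-')      % data and least squares line
legend('data', 'least squares fit')
subplot(2,1,2)
plot(x, y - yhat, 'r.')              % residuals
title('residuals')
\end{verbatim}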
Repeat the process for the quadratic and cubic data. For the quadratic data you are assuming that \begin{equation*} c_0 + c_1 x_j + c_2 x_j^2 \approx y_j, \end{equation*} and for the cubic data you are assuming that \begin{equation*} c_0 + c_1 x_j + c_2 x_j^2 + c_3 x_j^3 \approx y_j. \end{equation*}
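Only the matrix \(A\) changes; the normal-equations solve is identical. A sketch, assuming x and y now hold the quadratic or cubic data:
\begin{verbatim}
n  = length(x);
A2 = [ones(n,1), x, x.^2];           % quadratic model columns
A3 = [ones(n,1), x, x.^2, x.^3];     % cubic model columns
c2 = (A2'*A2) \ (A2'*y);             % [c0; c1; c2]
c3 = (A3'*A3) \ (A3'*y);             % [c0; c1; c2; c3]
\end{verbatim}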
In the same Excel document you will find a set of data representing a nonlinear function. Use Excel to find the best-fit curve by making a guess about the form of the function. Since this is a nonlinear relationship, our method of using the normal equations will not work.
The method of using the normal equations to solve a least squares regression problem also works for multiple regression. In this example go to the second tab of the Excel document to find the Hollywood Box Office Receipts (measured in millions). We are seeking a model of the form \begin{equation*} X_1 \approx \beta_1 + \beta_2 X_2 + \beta_3 X_3. \end{equation*}
That is, we seek to predict first year box office receipts (\(X_1\)) from production cost data (\(X_2\)) and promotional cost data (\(X_3\)).
Read the data into MATLAB.
Find the matrix \(A\) to form the overdetermined system of equations \begin{equation*} A \begin{pmatrix}\beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \approx \bx_1. \end{equation*}
Form the normal equations and solve for \(\beta_1, \beta_2,\) and \(\beta_3\).
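A sketch of the whole pipeline, assuming the second tab lists the three variables in the column order \(X_1, X_2, X_3\) (the file name and column positions are assumptions; adjust them to match the sheet):
\begin{verbatim}
M  = readmatrix('LeastSquaresData.xlsx', 'Sheet', 2);  % hypothetical file name
X1 = M(:,1);  X2 = M(:,2);  X3 = M(:,3);   % receipts, production, promotion
A  = [ones(size(X1)), X2, X3];
beta = (A'*A) \ (A'*X1);                   % [beta1; beta2; beta3]
\end{verbatim}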
Star Wars: The Force Awakens had a production cost of $32M and a promotional cost of only $17M. What does our model predict about the first year box office receipts?
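With the coefficients in hand, the prediction is one line (both costs in millions, matching the units of the data):
\begin{verbatim}
receipts = beta(1) + beta(2)*32 + beta(3)*17;  % predicted first year receipts ($M)
\end{verbatim}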
Let's go back to the World Health Organization, but this time you're looking for a data set where you can use multiple regression (linear or polynomial). Give a full description of the data set, discuss why regression is a sensible thing to do, find a regression model, and make a prediction using your regression model.
Find Chapter 7 exercise #17 on page 179 of the textbook. You can find the “mnist_all.mat” file linked from the Moodle page. Follow all of the instructions for part (a). The result of part (a) should be the same picture that you see in the book. In part (b) we are going to do something a bit different from what the textbook author intended. There are 10,000 test images that are not in the training set, and I want to try two different ways of determining which digit is in each test image. The textbook author gives the method of finding the minimum norm distance between the mean training images and the test image, suggesting that the mean training image that is closest in \(\mathbb{R}^{784}\) is likely the correct numeral. That is, the textbook author wants you to minimize the expression \begin{equation*} \| d - T(i,:) \| \end{equation*} over \(i \in \{1,2,\ldots,10\}.\)
I suggest one other test: project the test image vector \(d\) onto the mean training images T(i,:), find the error in the projection, and minimize the size of this error. Recall that the projection of \(d\) onto \(T(i,:)\) is: \begin{equation*} \frac{d \cdot T(i,:)}{T(i,:) \cdot T(i,:)} \end{equation*} and the error in the projection is \begin{equation*} d - \left( \frac{d \cdot T(i,:)}{T(i,:) \cdot T(i,:)} \right) T(i,:). \end{equation*} I'm suggesting that the mean training image that gives the smallest projection error, \begin{equation*} \left\|d - \left( \frac{d \cdot T(i,:)}{T(i,:) \cdot T(i,:)} \right) T(i,:)\right\|, \end{equation*} will likely be the correct numeral.
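To make the two tests concrete, here is a sketch of both decision rules for a single test image. I'm assuming d is a 1-by-784 double row vector and T is the 10-by-784 matrix of mean training images from part (a); the variable names are mine, not the textbook's.
\begin{verbatim}
normDist = zeros(10,1);
projErr  = zeros(10,1);
for i = 1:10
    t = T(i,:);
    normDist(i) = norm(d - t);                 % textbook method: distance to mean image
    proj        = (dot(d,t) / dot(t,t)) * t;   % projection of d onto T(i,:)
    projErr(i)  = norm(d - proj);              % size of the projection error
end
[~, i1] = min(normDist);   % predicted numeral is i1 - 1
[~, i2] = min(projErr);    % predicted numeral is i2 - 1
\end{verbatim}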
Write MATLAB code that tests all 10,000 test images and gathers statistics on how often each method correctly identifies the test image. Report your answers by stating the proportion of correct identifications for each of the 10 numerals.
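One possible skeleton for the full test, under the assumption that mnist_all.mat stores the test images as matrices test0 through test9 with one image per row (check with whos('-file','mnist_all.mat')) and that T is the 10-by-784 matrix of mean training images:
\begin{verbatim}
S = load('mnist_all.mat');
correct = zeros(10,2);   % rows: numerals 0-9; columns: [norm, projection]
total   = zeros(10,1);
for digit = 0:9
    D = double(S.(sprintf('test%d', digit)));  % cast uint8 pixels to double
    total(digit+1) = size(D,1);
    for k = 1:size(D,1)
        d  = D(k,:);
        nd = vecnorm(d - T, 2, 2);             % || d - T(i,:) || for each i
        cf = (T*d') ./ sum(T.^2, 2);           % projection coefficients
        pe = vecnorm(d - cf.*T, 2, 2);         % projection errors
        [~, g1] = min(nd);
        [~, g2] = min(pe);
        correct(digit+1,1) = correct(digit+1,1) + (g1-1 == digit);
        correct(digit+1,2) = correct(digit+1,2) + (g2-1 == digit);
    end
end
proportions = correct ./ total   % proportion correct per numeral, per method
\end{verbatim}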