In Chapter 2, “A Review of Multiple Linear Regression,” we saw that fitting a regression model produces residuals. Residuals can also form a basis for detecting incongruous values in multivariate procedures such as PCA and PLS. However, one must remember that residuals only judge samples in relation to the model currently being considered. It is always best to precede model building with exploratory analysis of one’s data.
Dimensionality Reduction via PCA
In the context of the data table Solubility.jmp, we saw that the first two principal components explain about 95% of the variation in the six variables. Let’s continue using the script DimensionalityReduction.jsl to gain some intuition for exactly how PCA reduces dimensionality.
With the slider in Panel 2 set at 0 Degrees, we see the axes shown in Figure 3.10. The vertical blue lines in Panel 3 show the distances of the points from the horizontal axis. The sum of the squares of these distances, called the Sum of Squared Residuals, is given to the right of the plot. This sum equals 10 for your simulated data as well as for ours. The reason is that the sum is computed for the centered and scaled data: each variable then has a sample variance of 1, so its sum of squares is n – 1, and there are n = 11 data points.
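The value of 10 can be checked outside of JMP with a short sketch. The Python code below is only an illustration, not the contents of DimensionalityReduction.jsl: the 11 data values are made up, and we assume that scaling divides by the sample standard deviation (divisor n – 1), which is what the value 10 implies.

```python
import statistics


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation, divisor n - 1
    return [(v - mean) / sd for v in values]


# Made-up vertical coordinates for 11 points (not the simulated data in the script).
y_raw = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0, 12.0]
y = standardize(y_raw)

# With no rotation, each point's distance from the horizontal axis is simply
# its standardized vertical coordinate.
ssr_at_zero_degrees = sum(v * v for v in y)

print(len(y) - 1, round(ssr_at_zero_degrees, 10))
```

Whatever values are placed in `y_raw`, the printed sum agrees with n – 1, provided the values are not all identical.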
Figure 3.10: No Rotation of Axes
But now, move your slider in Panel 2 to the right to rotate the axes until you come close to minimizing the Sum of Squared Residuals given in Panel 3 (Figure 3.11). The total length of the blue lines in the third panel is greatly reduced. In effect, the third panel gives a view of the cloud of data from a rotated coordinate system (defined by the red axes in the second panel).
Figure 3.11: Rotation of Axes
From this new point of view, we have explained much of the variation in the data using a single coordinate. We can think of each point as being projected onto the horizontal line in Panel 3, or, equivalently, onto the rotated axis pointing up and to the right in the second panel. In fact, PCA proceeds in just this manner, identifying the first principal component to be the axis along which the variation of the projected points is maximized.
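The rotation exercise can be imitated with a short sketch. The Python code below is an assumption-laden illustration rather than the logic of the script: the data are made up, the function name `ssr_for_angle` is hypothetical, and the slider is replaced by a scan over angles in steps of 0.1 degrees. It shows that minimizing the Sum of Squared Residuals is the same as maximizing the variation of the projected points, because the two quantities always add up to the same total.

```python
import math
import statistics


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def ssr_for_angle(x, y, degrees):
    """Sum of squared perpendicular distances from an axis rotated by `degrees`."""
    t = math.radians(degrees)
    # (-sin t, cos t) is the direction perpendicular to the rotated axis.
    return sum((-xi * math.sin(t) + yi * math.cos(t)) ** 2 for xi, yi in zip(x, y))


def projected_ss_for_angle(x, y, degrees):
    """Sum of squared coordinates of the points projected onto the rotated axis."""
    t = math.radians(degrees)
    return sum((xi * math.cos(t) + yi * math.sin(t)) ** 2 for xi, yi in zip(x, y))


# Made-up, positively correlated data for 11 points.
x = standardize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0])
y = standardize([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0, 12.0])
n = len(x)
r = sum(xi * yi for xi, yi in zip(x, y)) / (n - 1)  # sample correlation

# Stand-in for moving the slider: try every angle from 0.0 to 179.9 degrees.
angles = [a / 10 for a in range(1800)]
best_angle = min(angles, key=lambda a: ssr_for_angle(x, y, a))
ssr_min = ssr_for_angle(x, y, best_angle)
projected_max = projected_ss_for_angle(x, y, best_angle)

print("best angle:", best_angle)
print("minimum SSR:", round(ssr_min, 4))
print("share of variation along the rotated axis:",
      round(projected_max / (2 * (n - 1)), 4))
```

For two standardized variables with positive correlation r, the minimizing angle is 45 degrees and the minimum Sum of Squared Residuals is (n – 1)(1 – r). The remaining (n – 1)(1 + r) is the variation captured along the rotated axis, which is the direction of the first principal component.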
You can close the report generated by the script DimensionalityReduction.jsl.
4
A Deeper Understanding of PLS
PLS as a Multivariate Technique
An Example Exploring Prediction
One-Factor NIPALS Model
Two-Factor NIPALS Model
Variable Selection
SIMPLS Fits
Choosing the Number of Factors
Cross Validation
Types of Cross Validation
A Simulation of K-Fold Cross Validation
Validation in the PLS Platform
The NIPALS and SIMPLS Algorithms
Useful Things to Remember About PLS
Centering and Scaling in PLS
Although it can be adapted to more general situations, PLS usually involves only two sets of variables, one interpreted as predictors, X, and one as responses, Y. As with PCA, it is usually best to apply PLS to data that have been centered and scaled. As shown in Chapter 3, this puts all variables on an equal footing. This is why the Centering and Scaling options are turned on by default in the JMP PLS launch window.
There are sometimes cases where it might be useful or necessary to scale blocks of variables in X and/or Y differently. This can easily be done using JMP column formulas (as we saw in LoWarp.jmp) or using JMP scripting (Help > Books > Scripting Guide). In cases where you define your own scaling, be sure to deselect the relevant options in the PLS launch window. For simplicity, we assume for now that we always want all variables to be centered and scaled.
PLS as a Multivariate Technique
When all variables are centered and scaled, their covariance matrix equals their correlation matrix. The correlation matrix becomes the natural vehicle for representing the relationship between the variables. We have already talked about correlation, specifically in the context of predictors and the X matrix in MLR.
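The equality of the two matrices is easy to confirm numerically. The following Python sketch uses three made-up columns measured on very different scales; the helper names are hypothetical and the code is not tied to JMP. It compares the covariances of the centered and scaled columns with the correlations of the original columns.

```python
import statistics
from itertools import product


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def covariance(a, b):
    """Sample covariance with divisor n - 1."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Sample correlation computed from the raw columns."""
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))


# Made-up columns on deliberately different scales.
raw = [
    [150.0, 160.0, 165.0, 172.0, 180.0, 191.0],
    [52.0, 58.0, 66.0, 70.0, 79.0, 90.0],
    [0.31, 0.12, 0.45, 0.20, 0.38, 0.27],
]
scaled = [standardize(col) for col in raw]

# Largest disagreement between cov(scaled) and corr(raw) over all column pairs.
max_difference = max(
    abs(covariance(scaled[i], scaled[j]) - correlation(raw[i], raw[j]))
    for i, j in product(range(3), repeat=2)
)
print(max_difference)
```

The printed difference is zero up to floating-point rounding, including on the diagonal, where both matrices contain ones.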
But the distinction between predictors and responses is contextual. Given any data matrix, we can compute the sample correlation between each pair of columns regardless of the interpretation we choose to assign to the columns. The sample correlations form a square matrix with ones on the main diagonal. Most linear multivariate methods, PLS included, start from a consideration of the correlation matrix (Tobias 1995).
Suppose, then, that we have a total of v variables measured on our n units or samples. We consider k of these to be responses and m of these to be predictors, so that v = k + m. The correlation matrix, denoted Σ, is v x v. Suppose that k = 2 and m = 4. Then we can represent the correlation matrix schematically as shown in Figure 4.1, where we order the variables in such a way that responses come first.
Figure 4.1: Schematic of Correlation Matrix Σ, for Ys and Xs
Recall that the elements of Σ must be between –1 and +1 and that the matrix is square and symmetric. But not every matrix with these properties is a correlation matrix. Because of the very meaning of correlation, the elements of a correlation matrix cannot be completely independent of one another. For example, in a 3 x 3 correlation matrix there are three distinct off-diagonal elements: if one of these elements is 0.90 and another is –0.80, it is possible to show that the third must be between –0.98 and –0.46.
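One way to obtain these limits is to require that the determinant of the 3 x 3 matrix be nonnegative, which a correlation matrix must satisfy. Writing the two known correlations as r12 and r13, this condition is a quadratic inequality in the third correlation, r23, and solving it gives r12·r13 plus or minus the square root of (1 – r12²)(1 – r13²). The Python sketch below evaluates that expression; the function name is ours and is purely illustrative.

```python
import math


def third_correlation_bounds(r12, r13):
    """Range of r23 for which the 3 x 3 matrix has a nonnegative determinant."""
    center = r12 * r13
    half_width = math.sqrt((1 - r12 ** 2) * (1 - r13 ** 2))
    return center - half_width, center + half_width


def determinant(r12, r13, r23):
    """Determinant of the 3 x 3 correlation matrix with these off-diagonals."""
    return 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2


low, high = third_correlation_bounds(0.90, -0.80)
print(round(low, 2), round(high, 2))  # -0.98 -0.46

# At either limit the determinant is zero (up to rounding); inside it is positive.
print(determinant(0.90, -0.80, low), determinant(0.90, -0.80, -0.72))
```

The interval narrows as the two known correlations become stronger, which is the sense in which the elements of a correlation matrix constrain one another.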
As we will soon see in a simulation, PLS builds models linking predictors and responses. PLS does this using projection to reduce dimensionality by extracting factors (also called latent variables in the PLS context). The information it uses to do this is contained in the dark green elements of Σ, in this case a 2 x 4 submatrix. For general values of k and m, this submatrix is not square and does not have any special symmetry. We describe exactly how the factors are constructed in Appendix 1.
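To make the layout of Σ concrete, the following Python sketch builds a 6 x 6 correlation matrix with the responses ordered first and then slices out the 2 x 4 block that links the Ys to the Xs. Everything here is assumed for illustration: the data are simulated, the sample size of 20 and the seed are arbitrary, and the sketch only extracts the block. It does not perform any part of a PLS fit.

```python
import random
import statistics


def correlation_matrix(columns):
    """Sample correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = statistics.fmean(col)
        sd = statistics.stdev(col)
        standardized.append([(v - mean) / sd for v in col])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in standardized]
        for ci in standardized
    ]


random.seed(0)      # arbitrary seed, chosen only for repeatability
n, k, m = 20, 2, 4  # n samples, k responses, m predictors

# Simulated predictors, and responses that depend on them plus noise.
xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
ys = [
    [xs[0][i] + 0.5 * xs[1][i] + random.gauss(0, 1) for i in range(n)],
    [xs[2][i] - xs[3][i] + random.gauss(0, 1) for i in range(n)],
]

columns = ys + xs                   # responses come first, as in Figure 4.1
sigma = correlation_matrix(columns)  # v x v, with v = k + m

# Rows belonging to the Ys, columns belonging to the Xs: a k x m block.
yx_block = [row[k:] for row in sigma[:k]]

for row in yx_block:
    print([round(value, 2) for value in row])
```

The block has k rows and m columns, so it is square only when k = m, and even then it need not be symmetric, since its rows and columns refer to different variables.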
For now, though, note that the submatrix used by PLS contains the correlations