The plot on the far right in Figure 3.2, called a loading plot, gives insight into the data structure. All six of the variables have positive loadings on the first component. This means that the largest component of the variability is explained by a linear combination of all six variables with positive coefficients for each variable. But the second component has positive loadings only for 1-Octanol and Ether, while all other variables have negative loadings. This indicates that the next largest source of variability results from a difference between a compound’s solubility in 1-Octanol and Ether and its solubility in the other four solvents.
For more information about the PCA platform in JMP, select Analyze > Multivariate Methods > Principal Components. Then, in the launch window, click Help.
Centering and Scaling: An Example
As mentioned in Chapter 1, multivariate methods, such as PCA, are very sensitive to the scale of the data. Open the data table LoWarp.jmp by clicking on the correct link in the master journal. These data, presented in Eriksson et al. (2006), are from an experiment run to minimize the warp and maximize the strength of a polymer used in a mobile phone. In the column group called Y, you see eight measurements of warp (warp1 – warp8) and six measurements of strength (strength1 – strength6).
Run the first script in the table, Raw Y. You obtain a set of comparative box plots, as shown in Figure 3.3. Most of the box plots are dwarfed by the four larger box plots.
Figure 3.3: Comparative Box Plots for Raw Data
The plot shows that the variables strength2 and strength4 dominate in terms of the raw measurement scale: their values are much larger than those of the other variables. If these raw values were used in a multivariate analysis such as PCA or PLS, these two variables would dominate the analysis.
We can lessen the impact of the size of their values by subtracting each variable’s mean from all its measurements. As mentioned in Chapter 1, this is called centering. The Columns panel in the data table contains a column group for the centered variables, called Y Centered. Each variable in this group is given by a formula. To see the formulas, click on the disclosure icon to the left of the group name to display the 14 variables in the group (eight warp and six strength measurements). Next, click on any of the + signs to the right of the variable names. You see that the calculated values in any given column consist of the raw data minus the appropriate column mean.
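Outside JMP, the centering formula can be sketched in a few lines of Python. This is only an illustration: the column names are borrowed from the table, but the numbers are made up and are not taken from LoWarp.jmp.

```python
from statistics import mean, stdev

# Hypothetical toy values (NOT from LoWarp.jmp): two responses measured
# on very different scales.
raw = {
    "warp1": [0.8, 1.1, 0.9, 1.3, 1.0],
    "strength2": [1500.0, 1720.0, 1610.0, 1480.0, 1690.0],
}

# Centering: subtract each column's mean from every value in that column.
centered = {
    name: [value - mean(values) for value in values]
    for name, values in raw.items()
}

for name in raw:
    # The mean moves to (numerically) zero, but the spread does not change.
    print(name, mean(centered[name]), stdev(raw[name]), stdev(centered[name]))
```

Because centering leaves each column's standard deviation untouched, a column with a large spread still stands out after centering.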
Run the script Centered Y to obtain the box plots for the centered data shown in Figure 3.4. Although the data are now centered at 0, the variables strength2_Centered and strength4_Centered still dominate because of their relatively high variability.
Figure 3.4: Comparative Box Plots for Centered Data
Let’s not only center the data in any given column, but let’s also divide these centered values by the standard deviation of the column to scale the data. JMP has a function that both centers and scales a column of data. The function is called Standardize. The column group Y Centered and Scaled contains standardized versions of each of the variables. You can check this by looking at the formulas that define the columns. Run the script Centered and Scaled Y to obtain the comparative box plots of the standardized variables shown in Figure 3.5.
Figure 3.5: Comparative Box Plots for Centered and Scaled Data
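As a rough sketch of what standardizing does, the Python function below centers a column and divides by its standard deviation. The helper name standardize and the data are hypothetical, and we assume the sample standard deviation (divisor n - 1); check the column formulas in JMP if the exact divisor matters to you.

```python
from statistics import mean, stdev

def standardize(values):
    """Center a column at its mean and scale it by its sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: n - 1 divisor
    return [(v - m) / s for v in values]

# Hypothetical toy values (NOT from LoWarp.jmp).
columns = {
    "warp1": [0.8, 1.1, 0.9, 1.3, 1.0],
    "strength2": [1500.0, 1720.0, 1610.0, 1480.0, 1690.0],
}

standardized = {name: standardize(values) for name, values in columns.items()}

for name, z in standardized.items():
    # Every standardized column has mean 0 and standard deviation 1,
    # whatever its original units were.
    print(name, [round(v, 2) for v in z])
```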
As anticipated in Chapter 1, centering and scaling (or standardizing) the variables does indeed place all of them on an equal footing. Although there can be exceptions, it is generally the case that, in PCA and PLS, centering and scaling your variables is desirable.
In PCA, the JMP default is to calculate Principal Components > on Correlations, as shown in Figure 3.6. This means that the variables are first centered and scaled, so that the matrix containing their inner products is the correlation matrix. JMP also enables the user to select Principal Components > on Covariances, which means that the data are simply centered, or Principal Components > on Unscaled, which means that the raw data are used.
Figure 3.6: PCA Default Calculation
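To get a feel for how the three options differ, the sketch below compares the share of total variation carried by the first component for two hypothetical variables on very different scales. It is not JMP's implementation. For two variables the largest eigenvalue of a symmetric 2 x 2 matrix has a closed form, so no linear algebra library is needed. The divisor used for the unscaled case is an assumption, and any constant divisor leaves the shares unchanged.

```python
from math import sqrt
from statistics import mean

def cross_products(a, b, divisor):
    """Entries (aa, ab, bb) of the 2 x 2 matrix of sums of squares and cross products."""
    saa = sum(x * x for x in a) / divisor
    sab = sum(x * y for x, y in zip(a, b)) / divisor
    sbb = sum(y * y for y in b) / divisor
    return saa, sab, sbb

def first_component_share(saa, sab, sbb):
    """Largest eigenvalue of [[saa, sab], [sab, sbb]] as a share of the trace."""
    largest = (saa + sbb) / 2 + sqrt(((saa - sbb) / 2) ** 2 + sab ** 2)
    return largest / (saa + sbb)

# Hypothetical toy values on very different scales.
x1 = [0.8, 1.1, 0.9, 1.3, 1.0]
x2 = [1500.0, 1720.0, 1610.0, 1480.0, 1690.0]
n = len(x1)

# Centered columns, then centered and scaled columns.
c1 = [v - mean(x1) for v in x1]
c2 = [v - mean(x2) for v in x2]
s1 = sqrt(sum(v * v for v in c1) / (n - 1))
s2 = sqrt(sum(v * v for v in c2) / (n - 1))
z1 = [v / s1 for v in c1]
z2 = [v / s2 for v in c2]

on_correlations = first_component_share(*cross_products(z1, z2, n - 1))
on_covariances = first_component_share(*cross_products(c1, c2, n - 1))
on_unscaled = first_component_share(*cross_products(x1, x2, n))

print("on Correlations:", round(on_correlations, 4))
print("on Covariances: ", round(on_covariances, 4))
print("on Unscaled:    ", round(on_unscaled, 4))
```

With data like these, the first component on covariances or on unscaled data is essentially the large-scale variable alone, whereas on correlations both variables contribute.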
The Importance of Exploratory Data Analysis in Multivariate Studies
Visual data exploration should be a first step in any multivariate study. In the next section, we use some simulated data to see how PCA reduces dimensionality. But first, let’s explore the data that we use for that demonstration.
Run the script DimensionalityReduction.jsl by clicking on the correct link in the master journal. This script generates three panels. Each panel gives a plot of 11 quasi-random values for two variables. The first panel that appears when you run the script shows the raw data for X1 and X2, which we refer to as the Measured Data Values (Figure 3.7). Your plot will be slightly different, because the points are random, and the Summary of Measured Data information will differ to reflect this.
Figure 3.7: Panel 1, Measured Data Values
In Panel 1, the Summary of Measured Data gives the mean of each variable and the Variance-Covariance Matrix for X1 and X2. Note that the variance-covariance matrix is symmetric. The diagonal entries are the variance of X1 (upper left) and the variance of X2 (lower right), while the off-diagonal entries give the covariance of X1 and X2.
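The structure of the variance-covariance matrix can be reproduced by hand. The sketch below uses 11 made-up points standing in for X1 and X2 (the script's own values are random, so yours will differ) and the usual n - 1 divisor. The helper name covariance is ours.

```python
from statistics import mean, variance

def covariance(a, b):
    """Sample covariance of two equal-length columns (n - 1 divisor)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

# Hypothetical values; the last point sits in the lower right of the plot.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 9.5]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.2, 7.9, 9.1, 9.8, 0.5]

cov_matrix = [
    [covariance(x1, x1), covariance(x1, x2)],  # variance of X1, covariance
    [covariance(x2, x1), covariance(x2, x2)],  # covariance, variance of X2
]

for row in cov_matrix:
    print([round(v, 3) for v in row])
```

The covariance of a variable with itself is its variance, which is why the variances appear on the diagonal.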
Covariance measures the joint variation in X1 and X2. Because the covariance value depends on the units of measurement of X1 and X2, its absolute size is not all that meaningful. However, the pattern of joint variation between X1 and X2 can be discerned and assessed from the scatterplot in Figure 3.7. Note that, as X1 increases, X2 tends to increase as well, but there appears to be one point in the lower right corner of the plot that doesn’t fit this general pattern. Although the points generated by the script are random, you should see an outlier in the lower right corner of your plot as well.
Panel 2, shown in Figure 3.8, displays the Centered and Scaled Data Values. For the centered and scaled data, the covariance matrix is just the correlation matrix. Here, the off-diagonal entries give the correlation between X1 and X2. This correlation value does have an interpretation based on its size and its sign. In our example, the correlation is 0.221, indicating a weak positive relationship.
Figure 3.8: Panel 2, Centered and Scaled Data Values
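A short derivation shows why the covariance matrix of the centered and scaled data is the correlation matrix. The notation here is ours, not JMP's: we write $\bar{x}_j$ and $s_j$ for the mean and sample standard deviation of variable $j$, $s_{12}$ for the sample covariance of X1 and X2, $r_{12}$ for their correlation, and $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$ for the standardized values.

```latex
\begin{align*}
\operatorname{cov}(z_1, z_2)
  &= \frac{1}{n-1} \sum_{i=1}^{n} z_{i1}\, z_{i2}
   = \frac{1}{s_1 s_2} \cdot \frac{1}{n-1}
     \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)
   = \frac{s_{12}}{s_1 s_2} = r_{12}, \\
\operatorname{var}(z_j)
  &= \frac{1}{n-1} \sum_{i=1}^{n} z_{ij}^{2}
   = \frac{s_j^{2}}{s_j^{2}} = 1 .
\end{align*}
```

The diagonal entries are therefore all 1 and the off-diagonal entries are the correlation, which does not depend on the units of measurement.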
However, as you might suspect, the outlying point might be causing the correlation coefficient to be smaller than expected. In the top panel, you can use your mouse to drag that outlying point, which enables you to investigate its effect on the various numerical summaries, and in particular, on the correlation shown in the second panel. The effect of moving the rogue point into the cloud of the remaining X1, X2 values is shown for our data in Figure 3.9. (The point is now the one at the top right.) The correlation increases from 0.221 to 0.938.
Figure 3.9: Effect of Dragging Outlier to Cloud of Points
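The dragging experiment can be mimicked numerically. The sketch below uses 11 made-up points, not the script's values, so the correlations will not match the 0.221 and 0.938 reported above. It computes the correlation with the rogue point in the lower right, then again after that point is moved into the cloud.

```python
from math import sqrt
from statistics import mean

def correlation(a, b):
    """Pearson correlation of two equal-length columns."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

# Hypothetical values: ten points near a line plus one outlier (the last point).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 9.5]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.2, 7.9, 9.1, 9.8, 0.5]

r_before = correlation(x1, x2)

# "Drag" the outlier up into the cloud by changing only its X2 value.
x2_moved = x2[:-1] + [9.4]
r_after = correlation(x1, x2_moved)

print("with outlier:   ", round(r_before, 3))
print("outlier dragged:", round(r_after, 3))
```

A single point out of 11 is enough to change the correlation substantially, which is one reason to look at the scatterplot before trusting the numerical summary.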
We move on to Panel 3 in the next section, but first a few remarks. Remember that PCA is based