Many statistical models can be thought of as relating one or more inputs (which we call collectively X) to one or more outputs (collectively Y). These quantities are measured on the items or units of interest, and models are constructed from these observations. Such observations yield quantitative data that can be expressed numerically or coded in numerical form.
Statistical models are generally most useful when current knowledge is limited, at least by the standards of fundamental physics, chemistry, and biology, and when the underlying mechanisms that link the values in X and Y are obscure. So although one of the perennial challenges of any modeling activity is to take proper account of whatever is already known, the fact remains that statistical models are generally empirical in nature. This is not in any sense a failing, since there are many situations in research, engineering, the natural and physical sciences, the life and behavioral sciences, and other areas in which such empirical knowledge has practical utility or opens new, useful lines of inquiry.
However, along with this diversity of contexts comes a diversity of data. No matter what its intrinsic beauty, a useful model must be flexible enough to adequately support the more specific objectives of prediction from or explanation of the data presented to it. As we shall see, one of the appealing aspects of partial least squares as a modeling approach is that, unlike some more traditional approaches that might be familiar to you, it is able to encompass much of this diversity within a single framework.
A final comment on modeling in general: all data is contextual. Only you can determine the plausibility and relevance of the data that you have, and you overlook this simple fact at your peril. Although statistical modeling can be invaluable, simply looking at the data in the right way can and should illuminate and guide the specifics of building empirical statistical models of any kind (Chatfield 1995).
Partial Least Squares in Today’s World
Increasingly, we are finding data everywhere. This data explosion, supported by innovative and convergent technologies, has arguably made data exploration (e-Science) a fourth learning paradigm, joining theory, experimentation, and simulation as a way to drive new understanding (Microsoft Research 2009).
In simple retail businesses, sellers and buyers are wrestling for more leverage over the selling/buying process, and are attempting to make better use of data in this struggle. Laboratories, production lines, and even cars are increasingly equipped with relatively low-cost instrumentation routinely producing data of a volume and complexity that was difficult to foresee even thirty years ago. This book shows you how partial least squares, with its appealing flexibility, fits into this exciting picture.
This abundance of data, supported by the widespread use of automated test equipment, results in data sets with a large number of columns, or variables, v and/or a large number of observations, or rows, n. Often, but not always, it is cheap to increase v and expensive to increase n.
When the interpretation of the data permits a natural separation of variables into predictors and responses, partial least squares, or PLS for short, is a flexible approach to building statistical models for prediction. PLS can deal effectively with the following:
• Wide data (when v >> n, and v is large or very large)
• Tall data (when n >> v, and n is large or very large)
• Square data (when n ~ v, and n is large or very large)
• Collinear variables, namely, variables that convey the same, or nearly the same, information
• Noisy data
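As background to the first bullet, one way to see why wide data is awkward for ordinary least squares is that, when v > n, the cross-product matrix X'X that least squares needs to invert is singular. The minimal sketch below checks this for a made-up table with 2 rows and 3 columns. The numbers and the helper names gram and det3 are illustrative assumptions, not part of any JMP workflow.

```python
# Toy "wide" table: n = 2 rows (observations), v = 3 columns (variables).
# The values are invented purely for illustration.
X = [
    [1, 2, 3],
    [4, 5, 6],
]


def gram(matrix):
    """Return X'X, the v-by-v matrix of column cross-products."""
    cols = list(zip(*matrix))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det3(m):
    """Determinant of a 3-by-3 matrix by cofactor expansion along row 0."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


xtx = gram(X)
wide_det = det3(xtx)
print(xtx)       # the 3-by-3 cross-product matrix
print(wide_det)  # a zero determinant means X'X cannot be inverted
```

Because X has only two rows, X'X has rank at most 2, so its determinant is zero and the usual least squares solution is not unique.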
Just to whet your appetite, we point out that PLS routinely finds application in the following disciplines as a way of taming multivariate data:
• Psychology
• Education
• Economics
• Political science
• Environmental science
• Marketing
• Engineering
• Chemistry (organic, analytical, medical, and computational)
• Bioinformatics
• Ecology
• Biology
• Manufacturing
Transforming, and Centering and Scaling Data
Data should always be screened for outliers and anomalies prior to any formal analysis, and PLS is no exception. In fact, PLS works best when the variables involved have somewhat symmetric distributions. For that reason, for example, highly skewed variables are often logarithmically transformed prior to any analysis.
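The effect of a logarithmic transformation on a skewed variable can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the measurements are invented, and the skewness helper uses the simple population formula, which may differ slightly from the statistic that JMP reports.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed sd."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (len(values) * sd ** 3)


# Invented right-skewed measurements (each value doubles the previous one).
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
logged = [math.log(x) for x in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(raw_skew)  # clearly positive: a long right tail
print(log_skew)  # essentially zero: the logged values are evenly spaced
```

Real data will rarely become perfectly symmetric, but the direction of the change is typical for variables with a long right tail.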
Also, the data are usually centered and scaled prior to conducting the PLS analysis. By centering, we mean that, for each variable, the mean of all its observations is subtracted from each observation. By scaling, we mean that each centered observation is then divided by the variable’s standard deviation. Centering and scaling each variable results in a working data table where each variable has mean 0 and standard deviation 1.
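As a minimal sketch of these two steps, the following Python fragment centers and scales each column of a small invented table. The column names x1 and x2 are placeholders. The use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which denominator is intended.

```python
import statistics


def center_and_scale(column):
    """Subtract the column mean, then divide by the column standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample sd (n - 1); an assumption
    return [(x - mean) / sd for x in column]


# Invented table, stored as one list per variable.
table = {
    "x1": [2.0, 4.0, 6.0, 8.0],
    "x2": [100.0, 300.0, 200.0, 400.0],
}

scaled = {name: center_and_scale(col) for name, col in table.items()}

for name, col in scaled.items():
    # Each standardized column should have mean 0 and standard deviation 1.
    print(name, round(statistics.fmean(col), 6), round(statistics.stdev(col), 6))
```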
Centering and scaling are important because the weights that form the basis for the PLS model are very sensitive to the measurement units of the variables. Without centering and scaling, variables with higher variance have more influence on the model. The process of centering and scaling puts all variables on an equal footing. If certain variables in X are indeed more important than others, and you want them to have higher influence, you can accomplish this by assigning them a higher scaling weight (Eriksson et al. 2006). As you will see, JMP makes centering and scaling easy.
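The sensitivity to measurement units can be illustrated with a small sketch. The weights have not yet been defined at this point, so the example assumes the usual single-response form of PLS, in which the first weight vector is proportional to X'y. The data are invented, and the helper names center, standardize, and first_weights are hypothetical.

```python
import math
import statistics


def center(column):
    """Subtract the column mean from every value."""
    mean = statistics.fmean(column)
    return [x - mean for x in column]


def standardize(column):
    """Center, then divide by the sample standard deviation."""
    sd = statistics.stdev(column)
    return [x / sd for x in center(column)]


def first_weights(columns, y):
    """Unit-length vector proportional to X'y for prepared columns."""
    raw = [sum(a * b for a, b in zip(col, y)) for col in columns]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


# Invented data: x2 is recorded in much smaller units than x1,
# so its numbers (and its variance) are far larger.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2000.0, 1000.0, 4000.0, 3000.0, 5000.0]
y = center([1.0, 2.0, 3.0, 4.0, 5.0])

w_centered = first_weights([center(x1), center(x2)], y)
w_scaled = first_weights([standardize(x1), standardize(x2)], y)

print([round(w, 3) for w in w_centered])  # centered only
print([round(w, 3) for w in w_scaled])    # centered and scaled
```

With centering alone, almost all of the weight falls on x2 simply because of its units. After scaling, the two variables receive weights of comparable size.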
Later we discuss how PLS relates to other modeling and multivariate methods. But for now, let’s dive into an example so that we can compare and contrast PLS with the more familiar multiple linear regression (MLR).
An Example of a PLS Analysis
The Data and the Goal
The data table Spearheads.jmp contains data relating to the chemical composition of spearheads known to originate from one of two African tribes (Figure 1.1). You can open this table by clicking on the correct link in the master journal. A total of 19 spearheads of known origin were studied. The Tribe of origin is recorded in the first column (“Tribe A” or “Tribe B”). Chemical measurements of 10 properties were made. These are given in the subsequent columns and are represented in the Columns panel in a column group called Xs. There is a final column called Set, indicating whether an observation will be used in building our model (“Training”) or in assessing that model (“Test”).
Figure 1.1: The Spearheads.jmp Data Table
Our goal is to build a model that uses the chemical measurements to help us decide whether other spearheads collected in the vicinity were made by “Tribe A” or “Tribe B”. Note that there are 10 columns in X (the chemical compositions) and only one column in Y (the attribution of the tribe).
The model will be built using the training set, rows 1–9. The test set, rows 10–19, enables us to assess the ability of the model to predict the tribe of origin for newly discovered spearheads. The column Tribe actually contains the numerical values +1 and –1, with –1 representing “Tribe A” and +1 representing “Tribe B”. The Tribe column displays Value Labels for these numerical values. It is the numerical values that the model actually predicts from the chemical measurements.
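The coding of Tribe and the use of the Set column can be sketched outside JMP as follows. This is an illustration only: the four rows are placeholders rather than the contents of Spearheads.jmp, and the rule that turns a numerical prediction back into a tribe (taking its sign) is an assumption that follows naturally from the –1/+1 coding but is not stated in the text.

```python
# Hypothetical stand-in for the Tribe and Set columns. The real table has
# 19 rows; these labels are NOT the actual Spearheads.jmp values.
rows = [
    {"tribe": "Tribe A", "set": "Training"},
    {"tribe": "Tribe B", "set": "Training"},
    {"tribe": "Tribe B", "set": "Test"},
    {"tribe": "Tribe A", "set": "Test"},
]

# Numerical coding described in the text: -1 for Tribe A, +1 for Tribe B.
CODES = {"Tribe A": -1, "Tribe B": +1}
LABELS = {code: label for label, code in CODES.items()}


def encode(label):
    """Map a tribe label to the numerical value that the model predicts."""
    return CODES[label]


def classify(predicted_value):
    """Assumed decision rule: the sign of the prediction picks the tribe."""
    return LABELS[+1] if predicted_value > 0 else LABELS[-1]


# Split the response by the Set column, as done for model building and testing.
train_y = [encode(r["tribe"]) for r in rows if r["set"] == "Training"]
test_y = [encode(r["tribe"]) for r in rows if r["set"] == "Test"]

print(train_y, test_y)
print(classify(0.37), classify(-1.12))  # made-up predicted values
```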