Principal component analysis (PCA), introduced in Section 2.1, is one of the most popular algorithms for dimensionality reduction [39–42]. Given a set of observations x of dimension M (they lie in ℝ^M), PCA is the standard technique for finding the single best (in the sense of least-squares error) subspace of a given dimension, m. Without loss of generality, we may assume that the data are zero-mean and that the subspace to fit is a linear subspace (passing through the origin).
This algorithm is based on the search for orthogonal directions that explain as much of the variance of the data as possible. In terms of dimensionality reduction, it can be formulated [43] as the problem of finding the m orthonormal directions w_i that minimize the representation error.
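One standard way to write this representation error (a sketch, assuming N zero-mean observations x_n, n = 1, …, N, and W = [w_1 ⋯ w_m] with orthonormal columns w_i) is

\[
J_{\mathrm{PCA}}=\sum_{n=1}^{N}\Bigl\|\mathbf{x}_n-\sum_{i=1}^{m}\bigl(\mathbf{w}_i^{T}\mathbf{x}_n\bigr)\mathbf{w}_i\Bigr\|^{2}
=\sum_{n=1}^{N}\bigl\|\mathbf{x}_n-\mathbf{W}\mathbf{W}^{T}\mathbf{x}_n\bigr\|^{2},
\]

that is, the squared distance between each observation and its orthogonal projection onto the subspace spanned by the w_i.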
In Figure 2.15, we show a graphical representation of a PCA transformation in only two dimensions (x ∈ ℝ^2), with a slightly different notation (χ represented by z). As can be seen from Figure 2.15, the variance of the data in the original data space is best captured in the rotated space given by the vectors χ = W^T x. χ_1 is the first principal component, and it points in the direction of greatest variance; χ_2 is the second principal component, which is orthogonal to the first direction and points in the second direction of greatest variance (in ℝ^2 there is not much choice, but in the general case, ℝ^M, there is). Observe that, without loss of generality, the data are centered about the origin of the output space. We can rewrite the objective function in matrix form, as sketched below.
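A sketch of this matrix form (writing X for the M × N matrix whose columns are the observations x_n) is

\[
J_{\mathrm{PCA}}=\bigl\|\mathbf{X}-\mathbf{W}\bigl(\mathbf{W}^{T}\mathbf{X}\bigr)\bigr\|_{F}^{2}.
\]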
Note that the class membership matrix (U in vector quantization) has been substituted in this case by W^T X, which in general can take any positive or negative value. It has thus lost its membership meaning and simply constitutes the weights of the linear combination of the column vectors of W that best approximate each input x. Finally, the PCA objective function can also be written as
J_PCA = Tr{W^T Σ_X W} [44], where Σ_X is the covariance matrix of the (zero-mean) input data x.
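As a brief sketch of how this trace expression relates to the representation error (assuming W^T W = I_m and the normalization Σ_X = (1/N) Σ_n x_n x_n^T),

\[
\frac{1}{N}\sum_{n=1}^{N}\bigl\|\mathbf{x}_n-\mathbf{W}\mathbf{W}^{T}\mathbf{x}_n\bigr\|^{2}
=\operatorname{Tr}\{\boldsymbol{\Sigma}_X\}-\operatorname{Tr}\{\mathbf{W}^{T}\boldsymbol{\Sigma}_X\mathbf{W}\},
\]

so that, since Tr{Σ_X} does not depend on W, minimizing the representation error amounts to maximizing Tr{W^T Σ_X W}.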
The matrix projection of the input vectors onto a lower-dimensional space (χ = W^T x) is a widespread technique in dimensionality reduction. As an illustration, let us look at the following example [46]:
Design Example 2.5
Assume that we are analyzing scientific articles related to a specific domain. Each article will be represented by a vector x of word frequencies; that is, we choose a set of M words representative of our scientific area, and we annotate how many times each word appears in each article. Each vector x is then orthogonally projected onto the new subspace defined by the vectors wi. Each vector wi has dimension M, and it can be understood as a “topic” (i.e. a topic is characterized by the relative frequencies of the M different words; two different topics will differ in the relative frequencies of the M words). The projection of x onto each wi gives an idea of how important topic wi is for representing the article. Important topics have large projection values and, therefore, large values in the corresponding component of χ.
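The following short NumPy sketch illustrates this example numerically; the word-frequency matrix, the choice of M = 5 words, N = 6 articles, and m = 2 topics are all hypothetical and purely illustrative.

import numpy as np

# Hypothetical word-frequency matrix: N = 6 articles, M = 5 representative words.
# Each row is one article's word counts (illustrative numbers only).
X = np.array([
    [8, 1, 0, 5, 2],
    [7, 0, 1, 6, 1],
    [1, 9, 7, 0, 3],
    [0, 8, 9, 1, 2],
    [4, 4, 3, 3, 3],
    [2, 7, 8, 0, 4],
], dtype=float)

# Center the data (PCA assumes zero-mean observations).
Xc = X - X.mean(axis=0)

# Sample covariance matrix of the word frequencies (M x M).
Sigma = np.cov(Xc, rowvar=False)

# Eigen-decomposition; keep the m = 2 eigenvectors with largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]          # columns w_i: the two "topics"

# Project every article onto the topic subspace: chi = W^T x.
chi = Xc @ W                       # N x 2 matrix of topic weights
print(chi)

Each row of chi gives, for one article, the weight of each of the two "topics"; large (in magnitude) entries correspond to topics that are important for representing that article.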
It can be shown [43, 47], as already indicated in Section 2.1, that when the input vectors x are zero-mean (if they are not, we can transform the input data simply by subtracting the sample average vector), the solution of the minimization of J_PCA is given by the m eigenvectors associated with the m largest eigenvalues of the covariance matrix of x, Σ_X.
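As a minimal numerical sketch of this result (not from the text; synthetic data, with hypothetical sizes M, N, and m), the m leading eigenvectors of the sample covariance matrix give a representation error no larger than that of an arbitrary orthonormal basis:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean data in R^M (illustrative only), one observation per row.
M, N, m = 10, 500, 3
A = rng.normal(size=(M, M))
X = rng.normal(size=(N, M)) @ A      # correlated observations
X -= X.mean(axis=0)                  # enforce zero mean

Sigma = np.cov(X, rowvar=False)      # covariance matrix of x (M x M)

# PCA solution: eigenvectors of Sigma with the m largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Sigma)
W_pca = eigvecs[:, np.argsort(eigvals)[::-1][:m]]

def representation_error(W):
    # Mean squared error of approximating each x by its projection W W^T x.
    return np.mean(np.sum((X - X @ W @ W.T) ** 2, axis=1))

# Any other m-dimensional orthonormal basis should do no better.
W_rand, _ = np.linalg.qr(rng.normal(size=(M, m)))
print(representation_error(W_pca), "<=", representation_error(W_rand))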