An interesting property of any dimensionality reduction technique is to consider its stability. In this context, a technique is said to be ε‐stable if for any two input data points, x1 and x2, the following inequality holds [36]:
Methods based on statistics and information theory: This family of methods reduces the input data according to some statistical or information theory criterion. Somehow, the methods based on information theory can be seen as a generalization of the ones based on statistics in the sense that they can capture nonlinear relationships between variables, can handle interval and categorical variables at the same time, and many of them are invariant to monotonic transformations of the input variables.
Vector quantization and mixture models: Probably the simplest way of reducing dimensionality is by assigning a class {among a total of K classes) to each one of the observations xn. This can be seen as an extreme case of dimensionality reduction in which we go from M dimensions to 1 (the discrete class label χ). Each class, χ, has a representative
The goal is thus to find the representatives
Figure 2.19 Black circles represent the input data, xn; gray squares represent class representatives,
With our previous definition of uχ(x), we can express it as
The log likelihood of observing the whole dataset xn{n = 1, 2, …, N) after removing all constants is
Under this generative model, the probability density function of the observations is the convolution of a Gaussian function and a set of delta functions located at the xχ vectors, that is, a set of Gaussians located at the xχ vectors. The vector quantization then is an attempt to find the centers of the Gaussians forming the probability density function of the input data. This idea has been further pursued by Mixture Models, which are a generalization of vector quantization in which, instead of looking only for the means of the Gaussians associated with each class, we also allow each class to have a different covariance matrix ∑χ and different a priori probability πχ. The algorithm looks for estimates of all these parameters by Expectation–Maximization, and at the end produces for each input observation xn, the label χ of the Gaussian that has the