or in matrix notation,
The sample mean vector and sample covariance matrix of
are given by
Obviously, (2.9) and (2.10) are generalizations of (2.7) and (2.8), respectively.
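As a numerical check of the matrix form, the following sketch assumes that (2.9) and (2.10) are the standard results for a linear transformation z = Cx, namely that the sample mean vector of z is C times the sample mean vector of x and that the sample covariance matrix of z is C S C^T. The equations themselves are not reproduced in this section, so that reading is an assumption. The data matrix X and the coefficient matrix C are made up for illustration.

```r
# Hypothetical data: 5 observations on 3 variables (values are made up)
X <- matrix(c(2, 4, 1,
              3, 5, 0,
              5, 1, 2,
              4, 2, 6,
              6, 3, 3), ncol = 3, byrow = TRUE)

# Hypothetical coefficient matrix: each row defines one linear combination
C <- matrix(c(0.5, 0.5, 0,
              1,  -1,   2), nrow = 2, byrow = TRUE)

x_bar <- colMeans(X)   # sample mean vector of x
S <- cov(X)            # sample covariance matrix of x

# Assumed forms of (2.9) and (2.10)
z_bar <- drop(C %*% x_bar)
S_z <- C %*% S %*% t(C)

# Direct computation from the transformed observations z_i = C x_i
Z <- X %*% t(C)
print(all.equal(z_bar, colMeans(Z)))
print(all.equal(S_z, cov(Z)))
```

Both comparisons use all.equal() rather than ==, since the two routes to the same quantity can differ by floating-point rounding.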
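Example 2.5 For the auto.spec data set, using the mean() function of R, the sample means of the variables city.mpg and highway.mpg are found to be 25.22 and 30.75, respectively. Suppose we are interested in the overall MPG of a car, denoted by z, defined as the following weighted average of x1 = city.mpg and x2 = highway.mpg:
$$z = 0.4 x_1 + 0.6 x_2 = \mathbf{c}^T \mathbf{x},$$
where $\mathbf{c} = (0.4 \;\; 0.6)^T$ and $\mathbf{x} = (x_1 \;\; x_2)^T$. Then by (2.7) the sample mean of the overall MPG in the data set is
$$\bar{z} = \mathbf{c}^T \bar{\mathbf{x}} = 0.4 \times 25.22 + 0.6 \times 30.75 \approx 28.54.$$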
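The arithmetic behind this step can be sketched in R as follows. The sketch starts from the rounded sample means reported above rather than from the data set itself, and the names x_bar, c_vec, and z_bar are illustrative choices, not names used in the text.

```r
# Rounded sample means of city.mpg and highway.mpg, as reported in the text
x_bar <- c(city.mpg = 25.22, highway.mpg = 30.75)

# Weights defining the overall MPG z = 0.4 * x1 + 0.6 * x2
c_vec <- c(0.4, 0.6)

# Sample mean of z as the linear combination of the two sample means
z_bar <- sum(c_vec * x_bar)
print(round(z_bar, 2))
```

Because the inputs are already rounded to two decimals, the result may differ slightly in later digits from a value computed on the full data.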
To find the sample variance of z, we first obtain the sample covariance matrix of city.mpg and highway.mpg using the cov() function of R:
cov(auto.spec.df[, c("city.mpg", "highway.mpg")])  # sample covariance matrix
cor(auto.spec.df[, c("city.mpg", "highway.mpg")])  # sample correlation matrix
The function cor() calculates the sample correlation matrix. Based on the output of the above R code, we have
By (2.8), the sample variance of z is
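The two steps of Example 2.5 can be checked numerically. The sketch below is a minimal illustration that assumes (2.7) and (2.8) are the usual results for a linear combination z = c^T x: the sample mean of z is c^T times the sample mean vector, and the sample variance of z is c^T S c. Because auto.spec.df is not reproduced here, the sketch uses a small made-up data frame, toy.df, with the same two column names; its values are hypothetical and do not come from the data set.

```r
# Hypothetical toy data standing in for auto.spec.df (values are made up)
toy.df <- data.frame(
  city.mpg    = c(21, 24, 19, 30, 27),
  highway.mpg = c(27, 30, 26, 37, 33)
)

c_vec <- c(0.4, 0.6)   # weights for the overall MPG
X <- as.matrix(toy.df[, c("city.mpg", "highway.mpg")])

x_bar <- colMeans(X)   # sample mean vector
S <- cov(X)            # sample covariance matrix

# Mean and variance of z obtained from the summary statistics
z_bar <- sum(c_vec * x_bar)
s2_z <- drop(t(c_vec) %*% S %*% c_vec)

# Direct computation from the individual z values, for comparison
z <- drop(X %*% c_vec)
print(all.equal(z_bar, mean(z)))
print(all.equal(s2_z, var(z)))
```

Replacing toy.df with auto.spec.df, once that data frame is loaded, would apply the same steps to the data of the example.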
Bibliographic Notes
Data visualization methods are discussed in books in the data mining area, for example, Shmueli et al. [2017] and Williams [2011]. In this chapter, we mostly use the graphics functions from base R. A popular dedicated graphics package in R is the ggplot2 package by Wickham [2016]. The ggplot2 package provides more flexible and powerful graphics capabilities and can create presentation-quality visualizations. However, it also comes with a significant learning curve, since users must become familiar with the special technical language used in ggplot2. For those who use data visualizations on a regular basis, it is worth the time and effort to learn ggplot2.
Sample statistics such as the sample mean vector and the sample covariance matrix for multivariate observations are discussed in detail in many multivariate statistics books, for example, Johnson et al. [2002] and Rencher [2003].
Exercises
1 Consider the data in the following table with two numerical variables x1 and x2 and two categorical variables x3 and x4.
x1 | x2 | x3 | x4 |
9 | 1 | Yes | On |
5 | 3 | No | Off |
1 | 2 | Yes | Off |
3 | 4 | Yes | On |
6 | −1 | No | On |
3 | 3 | Yes | On |
(a) Manually sketch the scatter plot for x1 and x2.
(b) Manually sketch the mosaic plot for x3 and x4.
2 Consider the data set in Exercise 1. Manually calculate the sample mean vector, the sample covariance matrix, and the sample correlation matrix of x = (x1 x2)^T.
3 Consider the data in the following table with two numerical variables x1 and x2 and two categorical variables x3 and x4.
x1 | x2 | x3 | x4 |
1 | 0 | Yes | Working |
4 | 6 | No | Fail |
2 | 2 | Yes | Fail |
0 | 3 | No | Fail |
3 | 4 | No | Working |
5 | 7 | Yes | Working |