Covariance and correlation matrices play a vital role in multivariate statistics. Multivariate statistics studies n × p data sets, where n is the sample size (the number of observations) and p is the number of variables measured on each sampled unit. For instance, suppose that four respondents took part in a study. They were asked to report: 1) the volume of internet data they consume per month (in gigabytes), 2) their monthly income (in million rupiahs), and 3) the volume of gasoline they consume for transportation per month (in liters). The results are summarized in the following table.
The data can be presented in a matrix, namely X, as follows.
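In general, n measurements on p variables can be presented in an n × p matrix as follows.

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$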
Note: Each row of X is a multivariate observation.
The Mean Matrix
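The mean of the data on each variable can be obtained from the mean matrix as follows:

$$\bar{\mathbf{x}} = \frac{1}{n}\,\mathbf{1}_n X$$

where $\mathbf{1}_n$ is a row matrix with n columns, all of whose entries are 1. So, the mean matrix for the data above is:

$$\bar{\mathbf{x}} = \frac{1}{4}\,\mathbf{1}_4 X = \begin{bmatrix} 120 & 9 & 22 \end{bmatrix}$$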
From the mean matrix, it can be seen that the average volume of internet data consumption per month is 120 GB, the average monthly income is IDR 9 million, and the average volume of gasoline consumption per month is 22 liters.
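The mean-matrix computation can be sketched in NumPy as below. The data matrix here is hypothetical, made up purely for illustration; it is not the respondents' table from this post, and the names `ones_row` and `mean_matrix` are likewise just illustrative.

```python
import numpy as np

# Hypothetical 4 x 3 data matrix (NOT the table from this post):
# each row is one respondent, each column is one variable.
X = np.array([
    [10.0, 2.0, 5.0],
    [20.0, 4.0, 7.0],
    [30.0, 6.0, 6.0],
    [40.0, 8.0, 10.0],
])

n, p = X.shape
ones_row = np.ones((1, n))         # 1_n: a 1 x n row matrix of ones

# Mean matrix: (1/n) * 1_n * X, a 1 x p row holding the column means.
mean_matrix = (ones_row @ X) / n
print(mean_matrix)
```

For this made-up data the column means are 25, 5 and 7. In practice `X.mean(axis=0)` gives the same numbers; the matrix product is written out only to mirror the formula.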
The Deviation Matrix
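The matrix of deviations from the mean values is called the deviation matrix, denoted by T throughout this post. It can be determined by the formula below.

$$T = \left(I - \frac{1}{n}\,\mathbf{1}_n^{\mathsf{T}}\,\mathbf{1}_n\right) X$$

Here I is the identity matrix of order n, and $\mathbf{1}_n$ is the 1 × n row matrix of ones, so $\frac{1}{n}\,\mathbf{1}_n^{\mathsf{T}}\,\mathbf{1}_n$ is the n × n matrix whose entries are all 1/n.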
So, the deviation matrix of the data is determined as follows:
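As a rough sketch, the deviation matrix can be computed by forming the n × n centering matrix and multiplying it by the data. The data below is again hypothetical (not the table from this post), and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical 4 x 3 data matrix (NOT the table from this post).
X = np.array([
    [10.0, 2.0, 5.0],
    [20.0, 4.0, 7.0],
    [30.0, 6.0, 6.0],
    [40.0, 8.0, 10.0],
])

n = X.shape[0]
ones_row = np.ones((1, n))                      # 1 x n row matrix of ones

# Centering matrix: I - (1/n) * 1_n^T 1_n  (n x n)
centering = np.eye(n) - (ones_row.T @ ones_row) / n

# Deviation matrix: subtracts each column's mean from that column.
T = centering @ X
print(T)
```

Each column of `T` sums to zero, which is a quick sanity check. For large n it is cheaper to compute `X - X.mean(axis=0)` directly, since the centering matrix has n² entries.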
The Sample Covariance Matrix
The sample variances and covariances are collected in the sample covariance matrix S:

$$S = \frac{1}{n-1}\,T^{\mathsf{T}}\,T$$
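The matrix can also be expressed as:

$$S = \frac{1}{n-1}\,X^{\mathsf{T}}\left(I - \frac{1}{n}\,\mathbf{1}_n^{\mathsf{T}}\,\mathbf{1}_n\right) X$$

or

$$S = \frac{1}{n-1}\left(X^{\mathsf{T}} X - n\,\bar{\mathbf{x}}^{\mathsf{T}}\,\bar{\mathbf{x}}\right)$$

where $\bar{\mathbf{x}}$ is the mean matrix.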
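So, for the above data, we have:

$$S = \begin{bmatrix} 4600 & 173.33 & 373.33 \\ 173.33 & 6.67 & 13.33 \\ 373.33 & 13.33 & 48 \end{bmatrix}$$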
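A minimal sketch of the sample covariance computation, assuming the usual n − 1 divisor (which is consistent with the values quoted in this post). The function name `sample_covariance` and the data are hypothetical; the result is compared with NumPy's built-in `np.cov`.

```python
import numpy as np

def sample_covariance(X):
    """Return S = T^T T / (n - 1), where T is the deviation matrix of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    T = X - X.mean(axis=0)          # deviations from the column means
    return (T.T @ T) / (n - 1)

# Hypothetical 4 x 3 data matrix (NOT the table from this post).
X = np.array([
    [10.0, 2.0, 5.0],
    [20.0, 4.0, 7.0],
    [30.0, 6.0, 6.0],
    [40.0, 8.0, 10.0],
])

S = sample_covariance(X)
print(S)

# np.cov treats rows as variables by default, so rowvar=False is needed
# when each row is an observation, as it is here.
print(np.allclose(S, np.cov(X, rowvar=False)))
```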
The diagonal entries of S are the variances: sii denotes the variance of Xi, for i = 1, 2, 3, …, p.
Consequently, the diagonal entries of S are interpreted as follows.
s11 = the variance of X1 = the variance of the volume of internet data consumption per month = 4600 GB²
s22 = the variance of X2 = the variance of the monthly income = 6.67 (million IDR)² = 6.67 × 10¹² IDR²
s33 = the variance of X3 = the variance of the volume of gasoline consumption per month = 48 liter²
In the covariance matrix S, if i ≠ j then sij represents the covariance between Xi and Xj (i, j = 1, 2, 3, …, p). Hence, in the matrix S above:
s12 = s21 = the covariance between X1 and X2 = 173.33 GB·(million IDR) = 1.7333 × 10⁸ GB·IDR
s13 = s31 = the covariance between X1 and X3 = 373.33 GB·liter
s23 = s32 = the covariance between X2 and X3 = 13.33 (million IDR)·liter = 1.333 × 10⁷ IDR·liter
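The unit conversions above follow a general rule: if each variable is rescaled by a constant, the covariance matrix is multiplied on both sides by the diagonal matrix of those constants. The sketch below applies this to the rounded entries of S quoted in this post, converting income from million IDR to IDR; the names `scale` and `S_idr` are illustrative.

```python
import numpy as np

# Covariance matrix from this post (rounded values), in units of
# GB, million IDR and liters.
S = np.array([
    [4600.00, 173.33, 373.33],
    [ 173.33,   6.67,  13.33],
    [ 373.33,  13.33,  48.00],
])

# Rescale only the second variable: 1 million IDR = 1e6 IDR.
scale = np.diag([1.0, 1e6, 1.0])

# Covariance of the rescaled variables: scale * S * scale.
S_idr = scale @ S @ scale

print(S_idr[1, 1])   # variance of income in IDR^2
print(S_idr[0, 1])   # covariance of data volume and income in GB*IDR
print(S_idr[1, 2])   # covariance of income and gasoline in IDR*liter
```

The variance picks up the factor twice (10¹²) and each covariance involving income picks it up once (10⁶), while entries that do not involve income are unchanged.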
The Correlation Matrix
The correlation coefficients can be obtained from the following correlation matrix:

$$R = D^{-1}\,S\,D^{-1}$$

where $D^{-1}$ is the inverse of the matrix $D$, and $D$ is defined as follows.
The element in the i-th row and j-th column of $D$ is 0 if i ≠ j and $\sqrt{s_{ii}}$ if i = j.
Then, it is easy to check that:

$$r_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}}\,\sqrt{s_{jj}}}$$
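In the example above, the correlation matrix is:

$$R = \begin{bmatrix} 1 & 0.9898 & 0.7945 \\ 0.9898 & 1 & 0.7454 \\ 0.7945 & 0.7454 & 1 \end{bmatrix}$$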
From this correlation matrix, it can be concluded that:
r12 = 0.9898 is the correlation coefficient between X1 and X2, that is, between the volume of internet data consumption per month and the monthly income.
r13 = 0.7945 is the correlation coefficient between X1 and X3, that is, between the volume of internet data consumption per month and the volume of gasoline consumption per month.
r23 = 0.7454 is the correlation coefficient between X2 and X3, that is, between the monthly income and the volume of gasoline consumption per month.
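These coefficients can be reproduced from the covariance matrix with a short sketch. The entries of S are entered as thirds (for example 520/3 ≈ 173.33); this is an assumption about the unrounded values behind the figures quoted in this post, made so that rounding does not affect the fourth decimal place. The names `D` and `D_inv` are illustrative.

```python
import numpy as np

# Covariance matrix from this post, with entries assumed to be exact
# thirds (13800/3 = 4600, 520/3 ~ 173.33, 20/3 ~ 6.67, and so on).
S = np.array([
    [13800.0,  520.0, 1120.0],
    [  520.0,   20.0,   40.0],
    [ 1120.0,   40.0,  144.0],
]) / 3.0

# D is diagonal, holding the standard deviations sqrt(s_ii).
D = np.diag(np.sqrt(np.diag(S)))
D_inv = np.linalg.inv(D)

# Correlation matrix: scale each row and column of S by 1 / sqrt(s_ii).
R = D_inv @ S @ D_inv

print(np.round(R, 4))
```

Because correlation is unit-free, the same R results whether income is recorded in IDR or in million IDR. When the raw data matrix is available, `np.corrcoef(X, rowvar=False)` computes R directly.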