Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Hereinafter, each linear combination is referred to as a component. The number of components can be chosen so that the total variance produced by these components is almost equal to the total variance of the original variables. Thus, the components carry almost as much information as the original variables. In addition, the components derived are orthogonal to each other; in other words, they are uncorrelated with each other.

The resulting components are rarely treated as the ultimate objective in multivariate statistics. They are often required when applying other multivariate statistical analyses such as multiple regression, cluster analysis, and factor analysis.

Suppose that the random vector \vec{X} = {[X_1,X_2, \cdots X_p]}^{\prime} has the covariance matrix Σ with eigenvalues λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. Consider the p linear combinations below.

Y_1 = {\vec{a}_1}^{\: \prime} X = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p
Y_2 = {\vec{a}_2}^{\: \prime} X = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p
\vdots
Y_p = {\vec{a}_p}^{\: \prime} X = a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p

Therefore,

Var(Y_i) = {\vec{a}_i}^{\: \prime} \Sigma \vec{a}_i ; i = 1, 2, …, p

Cov(Y_i, Y_k) = {\vec{a}_i}^{\: \prime} \Sigma \vec{a}_k ; i, k = 1, 2, …, p
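These two identities are easy to check numerically. The sketch below (assuming NumPy is available; the covariance matrix and the coefficient vectors are made-up numbers chosen only for illustration) simulates a large sample with covariance Σ and compares the sample variance and covariance of the linear combinations with {\vec{a}_1}^{\: \prime} \Sigma \vec{a}_1 and {\vec{a}_1}^{\: \prime} \Sigma \vec{a}_2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) covariance matrix and coefficient vectors
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])
a1 = np.array([0.6, -0.3, 0.5])
a2 = np.array([0.2, 0.9, -0.1])

# Simulate a large sample of X with covariance Sigma
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)
Y1, Y2 = X @ a1, X @ a2

print(np.var(Y1), a1 @ Sigma @ a1)            # sample Var(Y1) vs a1' Sigma a1
print(np.cov(Y1, Y2)[0, 1], a1 @ Sigma @ a2)  # sample Cov(Y1, Y2) vs a1' Sigma a2
```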

The principal components are those uncorrelated linear combinations Y1, Y2, …, Yp whose variances Var(Y_i), i = 1, 2, …, p, are as large as possible.

As defined, however, Var(Y_i) can be made arbitrarily large simply by multiplying \vec{a}_i by a constant. To remove this indeterminacy, we require each \vec{a}_i to be a unit vector. The principal components are therefore defined as follows.

First principal component = linear combination {\vec{a}_1}^{\: \prime} X that maximizes Var({\vec{a}_1}^{\: \prime} X) subject to {\vec{a}_1}^{\: \prime} \cdot \vec{a}_1 = 1 .
Second principal component = linear combination {\vec{a}_2}^{\: \prime} X that maximizes Var({\vec{a}_2}^{\: \prime} X) subject to {\vec{a}_2}^{\: \prime} \cdot \vec{a}_2 = 1 and Cov({\vec{a}_1}^{\: \prime} X, {\vec{a}_2}^{\: \prime} X) = 0.
At the i-th step,
i-th principal component = linear combination {\vec{a}_i}^{\: \prime} X that maximizes Var({\vec{a}_i}^{\: \prime} X) subject to {\vec{a}_i}^{\: \prime} \cdot \vec{a}_i = 1 and Cov({\vec{a}_i}^{\: \prime} X, {\vec{a}_k}^{\: \prime} X) = 0 for k < i.

 

Theorem 1

Let Σ be the covariance matrix associated with the random vector \vec{X} = {[X_1,X_2, \cdots X_p]}^{\prime}. Also suppose that Σ has the eigenvalue-eigenvector pairs (\lambda_1, \vec{e}_1), (\lambda_2, \vec{e}_2), \ldots , (\lambda_p, \vec{e}_p) where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. Then, the i-th principal component is as follows:

Y_i = {\vec{e}_i}^{\: \prime} X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p for i = 1, 2, …, p
Further consequences:

Var(Y_i) = {\vec{e}_i}^{\: \prime} \Sigma \vec{e}_i = \lambda_i ; i = 1, 2, …, p

Cov(Y_i,Y_k) = {\vec{e}_i}^{\: \prime} \Sigma \vec{e}_k = 0 ; i ≠ k.

If some λi are equal, the choices of the corresponding coefficient vectors, \vec{e}_i (and hence Yi) are not unique.
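To see how Theorem 1 connects with the maximization definition above, the following sketch (again using NumPy, with an arbitrary covariance matrix chosen only for illustration) checks that the variance along the leading eigenvector equals λ1 and that no random unit vector gives a larger variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary covariance matrix, used only for illustration
Sigma = np.array([[5.0, 2.0, 1.0],
                  [2.0, 4.0, 1.5],
                  [1.0, 1.5, 3.0]])

# For a symmetric matrix, eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
lambda1, e1 = eigvals[-1], eigvecs[:, -1]   # largest eigenvalue and its unit eigenvector

print(lambda1, e1 @ Sigma @ e1)             # Var(e1' X) equals lambda1

# No random unit vector should yield a variance larger than lambda1
a = rng.standard_normal((1000, 3))
a /= np.linalg.norm(a, axis=1, keepdims=True)
print((np.einsum('ij,jk,ik->i', a, Sigma, a) <= lambda1 + 1e-9).all())  # True
```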

 

Example
Suppose that the random vector \vec{X} = {[X_1,X_2,X_3,X_4]}^{\prime} has the covariance matrix below.

\Sigma = \begin{pmatrix} 30 & -8 & -8 & 4 \\ -8 & 32 & 12 & -28 \\ -8 & 12 & 13 & -3 \\ 4 & -28 & -3 & 45 \end{pmatrix}

To determine the principal components, first calculate the eigenvalues and the corresponding eigenvectors of Σ. The eigenvectors are chosen to have unit length. The eigenvalues (ordered from largest to smallest) and their corresponding eigenvectors are as follows.

λ1 = 71.224, {\vec{e}_1}^{\: \prime} = [-0.229, 0.622, 0.197, -0.722]
λ2 = 31.511, {\vec{e}_2}^{\: \prime} = [0.861, -0.028, -0.328, -0.387]
λ3 = 14.343, {\vec{e}_3}^{\: \prime} = [0.447, 0.477, 0.618, 0.437]
λ4 = 2.923, {\vec{e}_4}^{\: \prime} = [-0.075, 0.620, -0.687, 0.371]

According to Theorem 1, the principal components are:
Y1 = -0.229 X1 + 0.622 X2 + 0.197 X3 - 0.722 X4
Y2 =  0.861 X1 - 0.028 X2 - 0.328 X3 - 0.387 X4
Y3 =  0.447 X1 + 0.477 X2 + 0.618 X3 + 0.437 X4
Y4 = -0.075 X1 + 0.620 X2 - 0.687 X3 + 0.371 X4

Also, by Theorem 1:

Var(Y1) = λ1 = 71.224
Var(Y2) = λ2 = 31.511
Var(Y3) = λ3 = 14.343
Var(Y4) = λ4 = 2.923
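For readers who want to reproduce these numbers, a minimal NumPy sketch is given below; the eigenvalues come out as listed above, and the eigenvectors match up to a possible sign flip of each one (a sign flip does not change the component in any essential way).

```python
import numpy as np

Sigma = np.array([[30,  -8,  -8,   4],
                  [-8,  32,  12, -28],
                  [-8,  12,  13,  -3],
                  [ 4, -28,  -3,  45]], dtype=float)

# eigh is appropriate for symmetric matrices; it returns eigenvalues in ascending order,
# so reverse to order them from largest to smallest
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(np.round(eigvals, 3))    # the eigenvalues lambda_1 >= ... >= lambda_4 listed above
print(np.round(eigvecs.T, 3))  # rows are e_1', e_2', e_3', e_4' (up to a sign flip per row)
```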

 

Theorem 2

Suppose that the random vector \vec{X} = {[X_1,X_2, \cdots X_p]}^{\prime} has the covariance matrix Σ with eigenvalue-eigenvector pairs (\lambda_1, \vec{e}_1), (\lambda_2, \vec{e}_2), \ldots , (\lambda_p, \vec{e}_p) and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. Let Y_1 = {\vec{e}_1}^{\: \prime} X, Y_2 = {\vec{e}_2}^{\: \prime} X, \ldots , Y_p = {\vec{e}_p}^{\: \prime} X be the principal components. Then, the sum of the variances of X1, X2, …, Xp is equal to the sum of the variances of Y1, Y2, …, Yp.

 

Based on one of the consequences of Theorem 1, namely Var(Y_i) = {\vec{e}_i}^{\: \prime} \Sigma \vec{e}_i = \lambda_i for i = 1, 2, …, p, Theorem 2 can be restated as \sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p} Var(X_i) = \sum_{i=1}^{p} Var(Y_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p.

In the example above, \sum_{i=1}^{4} Var(Y_i) = λ1 + λ2 + λ3 + λ4 = 71.224 + 31.511 + 14.343 + 2.923 = 120.001. The sum of the diagonal elements of matrix Σ is nothing but \sum_{i=1}^{4} Var(X_i) = σ11 + σ22 + σ33 + σ44 = 30 + 32 + 13 + 45 = 120. This is in accordance with the conclusion of Theorem 2.
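Continuing the NumPy sketch from the example (the names Sigma and eigvals are assumed to be defined there), the conclusion of Theorem 2 can be checked in one line:

```python
# Theorem 2: the total variance is preserved by the principal components
print(np.trace(Sigma), eigvals.sum())  # both approximately 120
```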

 

Theorem 3

If Y_1 = {\vec{e}_1}^{\: \prime} X, Y_2 = {\vec{e}_2}^{\: \prime} X, \ldots , Y_p = {\vec{e}_p}^{\: \prime} X are the principal components obtained from the covariance matrix Σ, then the correlation coefficient between the component Y_i and the variable X_k is \rho_{Y_i,X_k} = \frac{e_{ik} \sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}} for i, k = 1, 2, …, p, where (\lambda_1, \vec{e}_1), (\lambda_2, \vec{e}_2), \ldots , (\lambda_p, \vec{e}_p) are the eigenvalue-eigenvector pairs of Σ.

 

As an example of how to apply Theorem 3, suppose that we want to find the correlation between Y4 and X1. From the equation Y4 = -0.075 X1 + 0.620 X2 - 0.687 X3 + 0.371 X4 we have e_{41} = -0.075. By Theorem 1, λ4 = 2.923, and from the covariance matrix, σ11 = 30. Theorem 3 then gives \rho_{Y_4,X_1} = \frac{-0.075 \sqrt{2.923}}{\sqrt{30}} \approx -0.023 . Similarly, the correlation between Y4 and X2 is \rho_{Y_4,X_2} = \frac{0.620 \sqrt{2.923}}{\sqrt{32}} \approx 0.187.
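The same correlations can be computed for all pairs (Y_i, X_k) at once from the eigen-decomposition in the earlier sketch (Sigma, eigvals, and eigvecs are assumed from that snippet; individual signs may be flipped if NumPy returns an eigenvector with the opposite sign):

```python
# Theorem 3: rho(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(sigma_kk)
rho = eigvecs.T * np.sqrt(eigvals)[:, None] / np.sqrt(np.diag(Sigma))[None, :]
print(np.round(rho[3, 0], 3))  # rho(Y4, X1), approximately -0.023
print(np.round(rho[3, 1], 3))  # rho(Y4, X2), approximately  0.187
```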

 

To measure the importance of variable Xk in component Yi, some statisticians use eik while others use \rho_{Y_i,X_k}. One of the reasons for not using \rho_{Y_i,X_k} is that it measures only the univariate contribution of an individual X to a component Y; that is, it does not indicate the importance of an X to a component Y in the presence of the other X’s. In particular, Rencher, as cited in Johnson and Wichern (2002), recommends using eik instead of \rho_{Y_i,X_k} to interpret the components. However, Johnson and Wichern (2002) state, “Although coefficients and the correlations can lead to different rankings as measures of the importance of the variables to a given component, it is our experience that these rankings are often not appreciably different,” and they recommend that both eik and \rho_{Y_i,X_k} be examined to help interpret the principal components.

 

At the beginning of this article, it was mentioned that principal component analysis produces new variables (called components) that are fewer in number than the original variables yet retain as much of the total variance of the original variables as possible. Because they retain most of that variability, the resulting components can replace the original variables. This can be demonstrated as follows.

 

In the example above, suppose that we keep only two components, namely Y1 and Y2. What fraction of the total variance of the original variables do Y1 and Y2 retain together? The proportion of the total population variance retained by the first principal component, Y1, is \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4} = \frac{\lambda_1}{\sigma_{11} + \sigma_{22} + \sigma_{33} + \sigma_{44}} = \frac{71.224}{30+32+13+45} = 59.35%. The proportion retained by the second component, Y2, is \frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4} = \frac{\lambda_2}{\sigma_{11} + \sigma_{22} + \sigma_{33} + \sigma_{44}} = \frac{31.511}{30+32+13+45} = 26.26%. Consequently, if we use only these two components in place of the original variables, the proportion of the total variance they preserve is 59.35% + 26.26% = 85.61%. Thus, we can replace X1, X2, X3, X4 with the two components Y1 and Y2 while retaining most (85.61%) of the total variance.
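These proportions follow directly from the eigenvalues computed in the earlier sketch (eigvals assumed from that snippet):

```python
# Proportion of the total variance explained by each component
prop = eigvals / eigvals.sum()
print(np.round(prop, 4))            # approximately [0.5935 0.2626 0.1195 0.0244]
print(np.round(prop[:2].sum(), 4))  # approximately 0.8561: the first two components retain ~85.6%
```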

 

Reference

Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis (5th ed.). Pearson Education International.
