A small simulated data example
This package performs sparse or group-sparse principal component analysis. Both deflation and block algorithms are implemented, along with five different definitions of the explained variance of a set of non-orthogonal components.
library(sparsePCA)
The function simuPCA generates data from an underlying PCA model. More precisely, data are drawn from a centered multivariate normal distribution whose covariance matrix is constructed from its \(p\) eigenvalues and its \(m < p\) first eigenvectors (see details in Chavent & Chavent, 2020).
Here the Shen & Huang (2008) example is used:
The positions of the zeros in the eigenvectors indicate the variables that should not be selected to build the first and second principal components, i.e. the zero loadings.
# True eigenvalues
eigenval <- c(200,100,50,50,6,5,4,3,2,1)
# True loading matrix
z1 <- c(1,1,1,1,0,0,0,0,0.9,0.9)
z2 <- c(0,0,0,0,1,1,1,1,-0.3,0.3)
Ztrue <- cbind(z1/sqrt(sum(z1^2)), z2/sqrt(sum(z2^2)))
variable | z1 | z2 |
---|---|---|
V1 | 0.422 | 0.000 |
V2 | 0.422 | 0.000 |
V3 | 0.422 | 0.000 |
V4 | 0.422 | 0.000 |
V5 | 0.000 | 0.489 |
V6 | 0.000 | 0.489 |
V7 | 0.000 | 0.489 |
V8 | 0.000 | 0.489 |
V9 | 0.380 | -0.147 |
V10 | 0.380 | 0.147 |
In this underlying PCA model, the four variables V5 to V8 are not used to build the first principal component (zero loadings) and the four variables V1 to V4 are not used to build the second one.
Let us now generate a dataset of \(n=100\) observations and \(p=10\) variables using this underlying model.
# Simulation of n=100 observations
A <- simuPCA(n=100, Ztrue, eigenval, seed=1) # seed=1 for reproducibility
Standard PCA (on the covariance matrix) is applied by SVD (singular value decomposition) of the centered dataset.
B <- scale(A, center=TRUE, scale=FALSE)
Zpca <- svd(B)$v[,1:2] # PCA loadings
variables | z1 | z2 |
---|---|---|
V1 | 0.395 | 0.244 |
V2 | 0.338 | 0.208 |
V3 | 0.446 | 0.027 |
V4 | 0.359 | 0.031 |
V5 | -0.107 | 0.513 |
V6 | -0.183 | 0.532 |
V7 | -0.143 | 0.336 |
V8 | -0.015 | 0.353 |
V9 | 0.417 | -0.212 |
V10 | 0.402 | 0.260 |
The values are not far from the true loadings above, but the zero loadings are not recovered.
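A quick way to see this is to count the exact zeros in each loading matrix, reusing the objects Ztrue and Zpca defined above:

colSums(Ztrue == 0)  # 4 true zero loadings per component
colSums(Zpca == 0)   # standard PCA yields no exact zeros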
In order to promote sparsity of the loadings, sparse PCA (on the covariance matrix) is applied to the dataset with the following parameters:
Zspca <- sparsePCA(A, m=2, lambda=c(0.1,0.1), scale=FALSE)$Z
variables | z1 | z2 |
---|---|---|
V1 | 0.423 | 0.089 |
V2 | 0.354 | 0.067 |
V3 | 0.459 | 0.000 |
V4 | 0.361 | 0.000 |
V5 | -0.001 | 0.524 |
V6 | -0.082 | 0.590 |
V7 | -0.058 | 0.390 |
V8 | 0.000 | 0.381 |
V9 | 0.397 | -0.229 |
V10 | 0.433 | 0.125 |
The true zero loadings are found for V8 on the first component and for V3 and V4 on the second, but some of them are not recovered. In order to promote group-sparsity of the loadings, group-sparse PCA (on the covariance matrix) is applied to the dataset with the following parameters:
# 3 groups of variables of size 4, 4, 2
index <- rep(c(1,2,3), c(4,4,2))
Zgspca <- groupsparsePCA(A, m=2, lambda=c(0.2,0.2), index,
                         scale=FALSE)$Z
variables | z1 | z2 |
---|---|---|
V1 | 0.446 | 0.000 |
V2 | 0.386 | 0.000 |
V3 | 0.464 | 0.000 |
V4 | 0.389 | 0.000 |
V5 | 0.000 | 0.493 |
V6 | 0.000 | 0.596 |
V7 | 0.000 | 0.453 |
V8 | 0.000 | 0.429 |
V9 | 0.360 | -0.098 |
V10 | 0.395 | 0.039 |
Now the loadings of the variables within the same group are set to zero at the same time. The true zero loadings of G1 (resp. G2) on the first (resp. second) component are found. But the loadings of G3 on the second component are wrongly set to zero, probably due to their lower values.
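This groupwise behaviour can be checked directly on the loading matrix; for each component and each group, the sketch below tests whether the whole group has zero loadings, reusing Zgspca and index from above:

# TRUE when all loadings of a group are zero on a given component
apply(Zgspca == 0, 2, function(z) tapply(z, index, all))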
The function groupsparsePCA implements an optimal projected variance sparse block PCA algorithm and a deflation algorithm that applies the block algorithm with a single component iteratively to each deflated data matrix.
The arguments of this function are:
groupsparsePCA(A, m, lambda, index = 1:ncol(A), block = 1, mu = 1/1:m, groupsize=FALSE, center = TRUE, scale = TRUE)
where A is the data matrix, m the number of components, lambda the vector of reduced sparsity parameters (one value in [0,1] per component), index the vector giving the group membership of each variable, block selects the algorithm (0 for deflation, 1 for the block algorithm), mu is the vector of weights used by the block algorithm, and groupsize, center and scale are additional options (see the package documentation).
Let us apply this function to the small data example simulated from the Shen & Huang model (see previous section) with \(m=2\) components, \(\lambda_1=\lambda_2=0.3\) and the three groups of variables defined above.
The deflation approach is obtained with the parameter block=0.
groupsparsePCA(A, block = 0, lambda = c(0.3,0.3), m = 2, index)$Z
## comp1 comp2
## [1,] 0.4407766 0.0000000
## [2,] 0.4302970 0.0000000
## [3,] 0.4731344 0.0000000
## [4,] 0.4452412 0.0000000
## [5,] 0.0000000 0.4681241
## [6,] 0.0000000 0.5113443
## [7,] 0.0000000 0.5315177
## [8,] 0.0000000 0.4866988
## [9,] 0.3091337 0.0000000
## [10,] 0.3207818 0.0000000
Note that when deflation is chosen, the solutions for different values of \(m\) are nested. In other words, the loading vector obtained with \(m=1\) component is the same as the first loading vector obtained with \(m=2\). But no global criterion is optimized with this approach.
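This nesting can be verified numerically; a minimal check, assuming that lambda is a single value when \(m=1\) and that $Z is returned as a one-column matrix in that case:

# Deflation: m=1 loading vector vs first column of the m=2 solution
z_m1 <- groupsparsePCA(A, block = 0, lambda = 0.3, m = 1, index)$Z
z_m2 <- groupsparsePCA(A, block = 0, lambda = c(0.3, 0.3), m = 2, index)$Z
max(abs(abs(z_m1) - abs(z_m2[, 1])))  # should be (numerically) zero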
The block approach is obtained with the parameter block=1. This approach also uses a parameter mu, a vector of weights of size \(m\). Strictly decreasing weights \(\mu_j\), \(j=1, \ldots, m\), relieve the underdetermination that happens in some situations and drive the solution closer to the PCA solution. Two possible choices for the parameter mu are equal weights \(\mu_j=1\) and decreasing weights \(\mu_j=1/j\).
In the first case, the weights are \(\mu_1=\mu_2=1\).
groupsparsePCA(A, block = 1, lambda = c(0.3,0.3), m = 2, index, mu = c(1,1))$Z
## comp1 comp2
## [1,] 0.4426067 0.0000000
## [2,] 0.4322088 0.0000000
## [3,] 0.4729542 0.0000000
## [4,] 0.4464606 0.0000000
## [5,] 0.0000000 0.4676194
## [6,] 0.0000000 0.5077296
## [7,] 0.0000000 0.5276222
## [8,] 0.0000000 0.4951339
## [9,] 0.3005142 0.0000000
## [10,] 0.3224489 0.0000000
In the second case, the weights are \(\mu_1=1\) and \(\mu_2=1/2\).
groupsparsePCA(A, block = 1, lambda = c(0.3,0.3), m = 2, index, mu = c(1,1/2))$Z
## comp1 comp2
## [1,] 0.4414332 0.0000000
## [2,] 0.4309783 0.0000000
## [3,] 0.4730742 0.0000000
## [4,] 0.4456833 0.0000000
## [5,] 0.0000000 0.4673043
## [6,] 0.0000000 0.5058661
## [7,] 0.0000000 0.5256596
## [8,] 0.0000000 0.4994079
## [9,] 0.3059766 0.0000000
## [10,] 0.3214653 0.0000000
The loadings obtained in both cases are very similar here.
Note that when block optimisation is chosen, the solutions for different values of \(m\) are not nested. In other words, the loading vector obtained with \(m=1\) component may differ from the first loading vector obtained with \(m=2\). But a global criterion is optimized with this approach.
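The same kind of check as above with block=1 would typically show a difference; again a sketch under the same assumptions:

# Block: m=1 loading vector vs first column of the m=2 solution
z_m1 <- groupsparsePCA(A, block = 1, lambda = 0.3, m = 1, index)$Z
z_m2 <- groupsparsePCA(A, block = 1, lambda = c(0.3, 0.3), m = 2, index, mu = c(1, 1/2))$Z
max(abs(abs(z_m1) - abs(z_m2[, 1])))  # not necessarily zero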
The numerical comparison in Chavent & Chavent (2020) between block optimisation and deflation shows that block optimisation consistently produces slightly better and more robust results for the retrieval of hidden sparse structures.
The protein dataset originates from A. Weber and is cited in Hand et al., A Handbook of Small Data Sets (1994, p. 297). This small dataset gives the amount of protein consumed for nine food groups in 25 European countries.
data(protein)
Let us first apply standard PCA (on the correlation matrix) by SVD (singular value decomposition) of the standardized dataset to get the variance of each principal component and the loadings.
n <- nrow(protein)
B <- scale(protein)*sqrt(n/(n-1))
eig <- svd(B)$d^2/n
The barplot of the variance explained by the components suggests choosing \(m=4\) components.
pev <- eig/sum(eig)
barplot(pev, names.arg=paste("comp", 1:9, sep=""),
        las=2, main = "Proportion of explained variance")
The 4 first principal components explain 85.8% of the variance of the standardized data, with the following proportion of variance (in %) on each component:
comp1 | comp2 | comp3 | comp4 |
---|---|---|---|
44.52 | 18.17 | 12.53 | 10.61 |
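These figures can be recovered from the vector pev computed above (multiplied by 100 to get percentages):

round(100 * pev[1:4], 2)       # per-component proportion of explained variance (%)
sum(round(100 * pev[1:4], 2))  # cumulative: about 85.8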
The PCA loadings below show that some variables are not really necessary to build the principal components (loadings close to zero). For instance, the variable Red.Meat is almost irrelevant in the construction of the second principal component, with a loading of -0.056.
Zpca <- svd(B)$v[,1:4] # PCA loadings
variables | comp1 | comp2 | comp3 | comp4 |
---|---|---|---|---|
Red.Meat | -0.303 | -0.056 | -0.298 | -0.646 |
White.Meat | -0.311 | -0.237 | 0.624 | 0.037 |
Eggs | -0.427 | -0.035 | 0.182 | -0.313 |
Milk | -0.378 | -0.185 | -0.386 | 0.003 |
Fish | -0.136 | 0.647 | -0.321 | 0.216 |
Cereals | 0.438 | -0.233 | 0.096 | 0.006 |
Starchy.Foods | -0.297 | 0.353 | 0.243 | 0.337 |
Nuts | 0.420 | 0.143 | -0.054 | -0.330 |
Fruite.veg. | 0.110 | 0.536 | 0.408 | -0.462 |
In order to promote sparsity of the loadings, the optimal projected variance sparse block PCA is applied to the dataset with \(m=4\) components. In the usual situation where no a priori information on the sparsity of the underlying loadings is available, one can apply the same level of regularization to all components simply by using the same reduced parameter \(\lambda \in [0, 1]\) for all loadings: \[0 \leq \lambda_1 = \ldots = \lambda_4 \leq 1\]
The influence of the common reduced sparsity parameter \(\lambda\) is studied by letting it vary from 0 to 1 by steps of 0.01.
m <- 4
grid <- seq(0, 1, by=0.01) # grid of lambda
For each sparsity parameter \(\lambda\) in this grid, the proportion of variance explained by the principal components is calculated using the pev function. This function calculates the proportion of variance explained by not necessarily orthogonal principal components, using the optimal projected variance (optVar). The number of variables selected to build each principal component with the sparse loadings is also computed for each \(\lambda\).
pevdim <- matrix(NA, length(grid), m)
nseldim <- matrix(NA, length(grid), m)

for (k in 1:length(grid))
{
  x <- groupsparsePCA(protein, lambda=rep(grid[k], m),
                      m=m, block = 1)
  pevdim[k,] <- pev(x)
  nseldim[k,] <- apply(x$Z, 2, function(x){sum(x != 0)})
}
colnames(pevdim) <- colnames(nseldim) <- paste("comp", 1:m, sep="")
The figure below plots the proportion of variance explained by the principal components (pev) as a function of the reduced sparsity parameter \(\lambda\). This plot suggests choosing \(\lambda=0.6\) (vertical black line).
matplot(grid, pevdim, xlab="lambda", ylab="pev",
type="l", col=1:m, lty=2)
legend("topright",legend=colnames(pevdim),
lty=2, col=1:m, bty="n")
abline(v=0.6)
The next figure plots the number of selected variables (nsel) as a function of \(\lambda\).
matplot(grid,nseldim, xlab="lambda", ylab="nsel",
type="l",col=1:m,lty=2)
legend("topright",legend=colnames(pevdim),
lty=2, col=1:m,bty="n")
abline(v=0.6)
When \(\lambda=0.6\), four variables are selected to build the first component, one is selected to build the second component, two for the third and one for the fourth.
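For instance, these counts can be read directly from the row of nseldim corresponding to \(\lambda=0.6\):

# Number of selected variables per component at lambda = 0.6
nseldim[which.min(abs(grid - 0.6)), ]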
The sparse loadings obtained with \(\lambda=0.6\) are given below.
x <- groupsparsePCA(protein, lambda=rep(0.6, m),
                    m=m, block = 1)
Zspca <- x$Z # sparse PCA loadings
variables | comp1 | comp2 | comp3 | comp4 |
---|---|---|---|---|
Red.Meat | 0.000 | 0 | 0.000 | -1 |
White.Meat | 0.000 | 0 | 0.523 | 0 |
Eggs | -0.527 | 0 | 0.000 | 0 |
Milk | -0.358 | 0 | 0.000 | 0 |
Fish | 0.000 | 0 | -0.852 | 0 |
Cereals | 0.630 | 0 | 0.000 | 0 |
Starchy.Foods | 0.000 | 0 | 0.000 | 0 |
Nuts | 0.445 | 0 | 0.000 | 0 |
Fruite.veg. | 0.000 | 1 | 0.000 | 0 |
The four principal components built with these sparse loadings explain 62.9% of the variance, against 85.8% for standard PCA.
sum(pev(x))
## [1] 0.6290102
The proportion of variance (in %) explained by each component is given below:
comp1 | comp2 | comp3 | comp4 |
---|---|---|---|
30.14 | 10.63 | 13.27 | 8.86 |
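These values can be obtained from the pev function applied to the sparse solution (multiplied by 100 to get percentages):

round(100 * pev(x), 2)  # proportion of variance explained by each sparse component (%)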
The four principal components \(y_j\), \(j=1,\ldots,4\), are new synthetic variables built using only the variables with non-zero loadings:
\[y_1 = -0.527 \; Eggs - 0.358 \; Milk + 0.630 \; Cereals + 0.445 \; Nuts\]
\[y_2 = Fruit.veg\]
\[y_3 = 0.523 \; White.Meat - 0.852 \; Fish\]
\[y_4=-Red.Meat\]
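To make the construction explicit, here is a sketch of how these synthetic variables could be computed by projecting the standardized data onto the sparse loadings; the exact standardization convention, and whether the package returns the scores directly, should be checked against the package documentation:

# Principal component scores from the standardized data and the sparse loadings
Y <- scale(protein) %*% Zspca
colnames(Y) <- paste("comp", 1:4, sep = "")
head(round(Y, 2))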