A small simulated data example
This package performs sparse or group-sparse principal component analysis. Both deflation and block algorithms are implemented, along with five different definitions of the explained variance of a set of non-orthogonal components.
library(sparsePCA)
The function simuPCA generates data from an underlying PCA model. More precisely, data are drawn from a centered multivariate normal distribution whose covariance matrix is constructed from its \(p\) eigenvalues and its \(m < p\) first eigenvectors (see details in Chavent & Chavent, 2020).
Here the Shen & Huang (2008) example is used:
The positions of the zeros in the eigenvectors indicate the variables that should not be selected to build the first and second principal components, i.e. the zero loadings.
# True eigenvalues
eigenval <- c(200,100,50,50,6,5,4,3,2,1)
# True loading matrix
z1 <- c(1,1,1,1,0,0,0,0,0.9,0.9)
z2 <- c(0,0,0,0,1,1,1,1,-0.3,0.3)
Ztrue <- cbind(z1/sqrt(sum(z1^2)), z2/sqrt(sum(z2^2)))
variable | z1 | z2 |
---|---|---|
V1 | 0.422 | 0.000 |
V2 | 0.422 | 0.000 |
V3 | 0.422 | 0.000 |
V4 | 0.422 | 0.000 |
V5 | 0.000 | 0.489 |
V6 | 0.000 | 0.489 |
V7 | 0.000 | 0.489 |
V8 | 0.000 | 0.489 |
V9 | 0.380 | -0.147 |
V10 | 0.380 | 0.147 |
In this underlying PCA model, the four variables V5 to V8 are not used to build the first principal component (zero loadings) and the four variables V1 to V4 are not used to build the second one.
Let us now generate a dataset of \(n=100\) observations and \(p=10\) variables using this underlying model.
# Simulation of n=100 observations
A <- simuPCA(n=100, Ztrue, eigenval, seed=1) # seed=1 for reproducibility
Standard PCA (on the covariance matrix) is applied by SVD (singular value decomposition) of the centered dataset.
B <- scale(A, center=TRUE, scale=FALSE)
Zpca <- svd(B)$v[,1:2] # PCA loadings
variables | z1 | z2 |
---|---|---|
V1 | 0.395 | 0.244 |
V2 | 0.338 | 0.208 |
V3 | 0.446 | 0.027 |
V4 | 0.359 | 0.031 |
V5 | -0.107 | 0.513 |
V6 | -0.183 | 0.532 |
V7 | -0.143 | 0.336 |
V8 | -0.015 | 0.353 |
V9 | 0.417 | -0.212 |
V10 | 0.402 | 0.260 |
The values are not far from the true loadings above, but the zero loadings are not recovered.
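A quick way to see this is to count the exact zeros in each loading matrix, reusing the objects Ztrue and Zpca defined above:

colSums(Ztrue == 0)  # 4 true zero loadings per component
colSums(Zpca == 0)   # standard PCA yields no exact zeros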
In order to promote sparsity of the loadings, sparse PCA (on the covariance matrix) is applied to the dataset with the following parameters:
Zspca <- sparsePCA(A, m=2, lambda=c(0.1,0.1), scale=FALSE)$Z
variables | z1 | z2 |
---|---|---|
V1 | 0.423 | 0.089 |
V2 | 0.354 | 0.067 |
V3 | 0.459 | 0.000 |
V4 | 0.361 | 0.000 |
V5 | -0.001 | 0.524 |
V6 | -0.082 | 0.590 |
V7 | -0.058 | 0.390 |
V8 | 0.000 | 0.381 |
V9 | 0.397 | -0.229 |
V10 | 0.433 | 0.125 |
The true zero loadings are found for V8 on the first component and for V3 and V4 on the second, but some of them are not recovered. In order to promote group-sparsity of the loadings, group-sparse PCA (on the covariance matrix) is applied to the dataset with the following parameters:
# 3 groups of variables of size 4, 4, 2
index <- rep(c(1,2,3), c(4,4,2))
Zgspca <- groupsparsePCA(A, m=2, lambda=c(0.2,0.2), index,
                         scale=FALSE)$Z
variables | z1 | z2 |
---|---|---|
V1 | 0.446 | 0.000 |
V2 | 0.386 | 0.000 |
V3 | 0.464 | 0.000 |
V4 | 0.389 | 0.000 |
V5 | 0.000 | 0.493 |
V6 | 0.000 | 0.596 |
V7 | 0.000 | 0.453 |
V8 | 0.000 | 0.429 |
V9 | 0.360 | -0.098 |
V10 | 0.395 | 0.039 |
Now the loadings of the variables within the same group are set to zero at the same time. The true zero loadings of G1 (resp. G2) on the first (resp. second) component are found. But the loadings of G3 on the second component are wrongly set to zero, probably due to their lower values.
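This groupwise behaviour can be checked directly on the loading matrix; for each component and each group, the sketch below tests whether the whole group has zero loadings, reusing Zgspca and index from above:

# TRUE when all loadings of a group are zero on a given component
apply(Zgspca == 0, 2, function(z) tapply(z, index, all))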
The function groupsparsePCA implements an optimal projected variance sparse block PCA algorithm and a deflation algorithm that applies the block algorithm with a single component iteratively to each deflated data matrix.
The arguments of this function are:
groupsparsePCA(A, m, lambda, index = 1:ncol(A), block = 1, mu = 1/1:m, groupsize=FALSE, center = TRUE, scale = TRUE)
where A is the data matrix, m the number of components, lambda the vector of reduced sparsity parameters (one value in [0,1] per component), index the vector giving the group membership of each variable, block selects the algorithm (0 for deflation, 1 for the block algorithm), mu is the vector of weights used by the block algorithm, and groupsize, center and scale are additional options (see the package documentation).
Let us apply this function to the small data example simulated from the Shen & Huang model (see previous section) with \(m=2\) components, \(\lambda_1=\lambda_2=0.3\) and the three groups of variables defined above.
The deflation approach is obtained with the parameter block=0.
groupsparsePCA(A, block = 0, lambda = c(0.3,0.3), m = 2, index)$Z
## comp1 comp2
## [1,] 0.4407766 0.0000000
## [2,] 0.4302970 0.0000000
## [3,] 0.4731344 0.0000000
## [4,] 0.4452412 0.0000000
## [5,] 0.0000000 0.4681241
## [6,] 0.0000000 0.5113443
## [7,] 0.0000000 0.5315177
## [8,] 0.0000000 0.4866988
## [9,] 0.3091337 0.0000000
## [10,] 0.3207818 0.0000000
Note that when deflation is chosen, the solutions for different values of \(m\) are nested. In other words, the loading vector obtained with \(m=1\) component is the same as the first loading vector obtained with \(m=2\). But no global criterion is optimized with this approach.
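This nesting can be verified numerically; a minimal check, assuming that lambda is a single value when \(m=1\) and that $Z is returned as a one-column matrix in that case:

# Deflation: m=1 loading vector vs first column of the m=2 solution
z_m1 <- groupsparsePCA(A, block = 0, lambda = 0.3, m = 1, index)$Z
z_m2 <- groupsparsePCA(A, block = 0, lambda = c(0.3, 0.3), m = 2, index)$Z
max(abs(abs(z_m1) - abs(z_m2[, 1])))  # should be (numerically) zero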
The block approach is obtained with the parameter block=1. This approach also uses a parameter mu, a vector of weights of size \(m\). Strictly decreasing weights \(\mu_j\), \(j=1, \ldots, m\), relieve the underdetermination that happens in some situations and drive the solution closer to the PCA solution. Two possible choices for the parameter mu are equal weights \(\mu_j=1\) and decreasing weights \(\mu_j=1/j\).
In the first case, the weights are \(\mu_1=\mu_2=1\).
groupsparsePCA(A, block = 1, lambda = c(0.3,0.3), m = 2, index, mu = c(1,1))$Z
## comp1 comp2
## [1,] 0.4426067 0.0000000
## [2,] 0.4322088 0.0000000
## [3,] 0.4729542 0.0000000
## [4,] 0.4464606 0.0000000
## [5,] 0.0000000 0.4676194
## [6,] 0.0000000 0.5077296
## [7,] 0.0000000 0.5276222
## [8,] 0.0000000 0.4951339
## [9,] 0.3005142 0.0000000
## [10,] 0.3224489 0.0000000
In the second case, the weights are \(\mu_1=1\) and \(\mu_2=1/2\).
groupsparsePCA(A, block = 1, lambda = c(0.3,0.3), m = 2, index, mu = c(1,1/2))$Z
## comp1 comp2
## [1,] 0.4414332 0.0000000
## [2,] 0.4309783 0.0000000
## [3,] 0.4730742 0.0000000
## [4,] 0.4456833 0.0000000
## [5,] 0.0000000 0.4673043
## [6,] 0.0000000 0.5058661
## [7,] 0.0000000 0.5256596
## [8,] 0.0000000 0.4994079
## [9,] 0.3059766 0.0000000
## [10,] 0.3214653 0.0000000
The loadings obtained in both cases are very similar here.
Note that when block optimisation is chosen, the solutions for different values of \(m\) are not nested. In other words, the loading vector obtained with \(m=1\) component may differ from the first loading vector obtained with \(m=2\). But a global criterion is optimized with this approach.
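The same kind of check as above with block=1 would typically show a difference; again a sketch under the same assumptions:

# Block: m=1 loading vector vs first column of the m=2 solution
z_m1 <- groupsparsePCA(A, block = 1, lambda = 0.3, m = 1, index)$Z
z_m2 <- groupsparsePCA(A, block = 1, lambda = c(0.3, 0.3), m = 2, index, mu = c(1, 1/2))$Z
max(abs(abs(z_m1) - abs(z_m2[, 1])))  # not necessarily zero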
The numerical comparison in Chavent & Chavent (2020) between block optimisation and deflation shows that block optimisation consistently produces slightly better and more robust results for the retrieval of hidden sparse structures.
The protein dataset originates from A. Weber and is cited in Hand et al., A Handbook of Small Data Sets (1994, p. 297). This small dataset gives the amount of protein consumed for nine food groups in 25 European countries.
data(protein)
Let us first apply standard PCA (on the correlation matrix) by SVD (singular value decomposition) of the standardized dataset to get the variance of each principal component and the loadings.
n <- nrow(protein)
B <- scale(protein)*sqrt(n/(n-1))
eig <- svd(B)$d^2/n
The barplot of the variance explained by the components suggests choosing \(m=4\) components.
pev <- eig/sum(eig)
barplot(pev, names.arg=paste("comp", 1:9, sep=""),
        las=2, main = "Proportion of explained variance")
The 4 first principal components explain 85.8% of the variance of the standardized data, with the following proportion of variance (in %) on each component:
comp1 | comp2 | comp3 | comp4 |
---|---|---|---|
44.52 | 18.17 | 12.53 | 10.61 |
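These figures can be recovered from the vector pev computed above (multiplied by 100 to get percentages):

round(100 * pev[1:4], 2)       # per-component proportion of explained variance (%)
sum(round(100 * pev[1:4], 2))  # cumulative: about 85.8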
The PCA loadings below show that some variables are not really necessary to build the principal components (loadings close to zero). For instance, the variable Red.Meat is almost irrelevant in the construction of the second principal component, with a loading of -0.056.
Zpca <- svd(B)$v[,1:4] # PCA loadings
variables | comp1 | comp2 | comp3 | comp4 |
---|---|---|---|---|
Red.Meat | -0.303 | -0.056 | -0.298 | -0.646 |
White.Meat | -0.311 | -0.237 | 0.624 | 0.037 |
Eggs | -0.427 | -0.035 | 0.182 | -0.313 |
Milk | -0.378 | -0.185 | -0.386 | 0.003 |
Fish | -0.136 | 0.647 | -0.321 | 0.216 |
Cereals | 0.438 | -0.233 | 0.096 | 0.006 |
Starchy.Foods | -0.297 | 0.353 | 0.243 | 0.337 |
Nuts | 0.420 | 0.143 | -0.054 | -0.330 |
Fruite.veg. | 0.110 | 0.536 | 0.408 | -0.462 |
In order to promote sparsity of the loadings, the optimal projected variance sparse block PCA is applied to the dataset with \(m=4\) components. In the usual situation where no a priori information on the sparsity of the underlying loadings is available, one can apply the same level of regularization to all components simply by using the same reduced parameter \(\lambda \in [0, 1]\) for all loadings: \[0 \leq \lambda_1 = \ldots = \lambda_4 \leq 1\]
The influence of the common reduced sparsity parameter \(\lambda\) is studied by letting it vary from 0 to 1 by steps of 0.01.
m <- 4
grid <- seq(0, 1, by=0.01) # grid of lambda
For each sparsity parameter \(\lambda\) in this grid, the proportion of variance explained by the principal components is calculated using the pev function. This function calculates the proportion of variance explained by not necessarily orthogonal principal components, using the optimal projected variance (optVar). The number of variables selected to build each principal component with the sparse loadings is also computed for each \(\lambda\).
pevdim <- matrix(NA, length(grid), m)
nseldim <- matrix(NA, length(grid), m)

for (k in 1:length(grid))
{
  x <- groupsparsePCA(protein, lambda=rep(grid[k], m),
                      m=m, block = 1)
  pevdim[k,] <- pev(x)
  nseldim[k,] <- apply(x$Z, 2, function(x){sum(x != 0)})
}
colnames(pevdim) <- colnames(nseldim) <- paste("comp", 1:m, sep="")
The figure below plots the proportion of variance explained by the principal components (pev) as a function of the reduced sparsity parameter \(\lambda\). This plot suggests choosing \(\lambda=0.6\) (vertical black line).
matplot(grid, pevdim, xlab="lambda", ylab="pev",
type="l", col=1:m, lty=2)
legend("topright",legend=colnames(pevdim),
lty=2, col=1:m, bty="n")
abline(v=0.6)
The next figure plots the number of selected variables (nsel) as a function of \(\lambda\).
matplot(grid,nseldim, xlab="lambda", ylab="nsel",
type="l",col=1:m,lty=2)
legend("topright",legend=colnames(pevdim),
lty=2, col=1:m,bty="n")
abline(v=0.6)
When \(\lambda=0.6\), four variables are selected to build the first component, one is selected to build the second component, two for the third and one for the fourth.
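For instance, these counts can be read directly from the row of nseldim corresponding to \(\lambda=0.6\):

# Number of selected variables per component at lambda = 0.6
nseldim[which.min(abs(grid - 0.6)), ]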
The sparse loadings obtained with \(\lambda=0.6\) are given below.
x <- groupsparsePCA(protein, lambda=rep(0.6, m),
                    m=m, block = 1)
Zspca <- x$Z # sparse PCA loadings
variables | comp1 | comp2 | comp3 | comp4 |
---|---|---|---|---|
Red.Meat | 0.000 | 0 | 0.000 | -1 |
White.Meat | 0.000 | 0 | 0.523 | 0 |
Eggs | -0.527 | 0 | 0.000 | 0 |
Milk | -0.358 | 0 | 0.000 | 0 |
Fish | 0.000 | 0 | -0.852 | 0 |
Cereals | 0.630 | 0 | 0.000 | 0 |
Starchy.Foods | 0.000 | 0 | 0.000 | 0 |
Nuts | 0.445 | 0 | 0.000 | 0 |
Fruite.veg. | 0.000 | 1 | 0.000 | 0 |
The four principal components built with these sparse loadings explain 62.9% of the variance, against 85.8% for standard PCA.
sum(pev(x))
## [1] 0.6290102
The proportion of variance (in %) explained by each component is given below:
comp1 | comp2 | comp3 | comp4 |
---|---|---|---|
30.14 | 10.63 | 13.27 | 8.86 |
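These values can be obtained from the pev function applied to the sparse solution (multiplied by 100 to get percentages):

round(100 * pev(x), 2)  # proportion of variance explained by each sparse component (%)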
The four principal components \(y_j\), \(j=1,\ldots,4\), are new synthetic variables built using only the variables with non-zero loadings:
\[y_1 = -0.527 \; Eggs - 0.358 \; Milk + 0.630 \; Cereals + 0.445 \; Nuts\]
\[y_2 = Fruit.veg\]
\[y_3 = 0.523 \; White.Meat - 0.852 \; Fish\]
\[y_4=-Red.Meat\]
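To make the construction explicit, here is a sketch of how these synthetic variables could be computed by projecting the standardized data onto the sparse loadings; the exact standardization convention, and whether the package returns the scores directly, should be checked against the package documentation:

# Principal component scores from the standardized data and the sparse loadings
Y <- scale(protein) %*% Zspca
colnames(Y) <- paste("comp", 1:4, sep = "")
head(round(Y, 2))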