Sparse PCA for mixed data

Marie Chavent

Statlog Heart Data Set

PCAmix

sparsePCAmix

References

Statlog Heart Data Set

This dataset from UCI Machine Learning repository1 is a heart disease database similar to the database already present in the repository (Heart Disease databases) but in a slightly different form. The 14 variables are:

library(sparsePCA)
data("HDdata")

PCAmix

The dataset contains 7 quantitative variables and 6 qualitative variables. We first use the function PCAmix to compute PCA of a mixture of numerical and categorical variables.

library(PCAmixdata)
dat <- HDdata[,-14]
X.quanti <- splitmix(dat)$X.quanti
X.quali <- splitmix(dat)$X.quali

pca <- PCAmix(X.quanti, X.quali,
              rename.level = TRUE, graph=FALSE)
pca$eig[1:5,]
##       Eigenvalue Proportion Cumulative
## dim 1   3.216257  17.868093   17.86809
## dim 2   1.670727   9.281815   27.14991
## dim 3   1.486730   8.259613   35.40952
## dim 4   1.275941   7.088559   42.49808
## dim 5   1.159581   6.442117   48.94020

The first principal component (PC) explains 17.8% of the variance of the data. The second explains 9.2% with a total of 27% explained by the two fist PCs.

HD <- HDdata$heart_disease
levels(HD) <- c("absence","presence")
plot(pca, coloring.ind = HD, main="",
     cex.leg=0.8,
     posleg="topright", label=FALSE, cex = 0.8)

The first PC discriminates relatively well between the presence and absence of heart disease.

plot(pca, choice="sqload", cex = 0.8,  cex.leg=0.8,)

The graph of the squared loadings shows for instance that the qualitative variable slope and the quantitative variable oldpeak are linked to the first PC.

plot(pca, choice = "cor", cex = 0.8,  cex.leg=0.8,)

The graph of the correlations between the quantitative variables and the PCs shows the sign of this correlation.

plot(pca, choice = "levels", cex = 0.8,  cex.leg=0.8,)

The graph of the levels shows that the patients with slope=1 are located rather on the left (absence of heart disease) of the factorial map while those with slope=2 or slope=2 are on the right (presence of heart disease).

sparsePCAmix

We use now the function sparsePCAmix to compute sparse PCA of a mixture of quantitative and qualitative variables:

sparsePCAmix(A, m, lambda, block = 1, mu = 1/1:m, rename.level = FALSE)

The result of this function depends on the number of components used. Here we choose \(m=2\) components according to the results of PCAmix above. We choose arbitrarily \(\lambda_1=\lambda_2=0.4\).

m <- 2
res <- sparsePCAmix(X.quanti, X.quali, m=m, 
                      lambda=c(0.4, 0.4), block = 1, 
                      rename.level = TRUE) 

The main results are given below:

res$pev # proportion of variance explained be the components
## [1] 0.09891714 0.05554993
res$varsel # list of variables selected to build each component
## $dim.1
## [1] "max_heart_rate" "oldpeak"        "slope"         
## 
## $dim.2
## [1] "serum_cholestoral"

The proportion of variance explained by the components has necessarily decreased: from 17.9% to 9.8% for the first PC and from 9.2% to 5.5% for the second.

Choice of the sparsity parameters.

In order to choose the sparsity parameters \(\lambda_1\) and \(\lambda_2\), we build a grid of values \(\lambda=\lambda_1=\lambda_2\) in \([0,1]\). Then we compute the proportion of variance explained by the two sparse PCs for each commun sparsity parameter of the grid.

library(sparsePCA)
m <- 2 # number of dimension 
lamb <- seq(0, 1, by=0.01) # grid of common lambda values
pev <- matrix(NA,length(lamb),m) # proportion of explained variance
for (k in 1:length(lamb))
{
  res <- sparsePCAmix(X.quanti, X.quali, m=m, 
                      lambda=rep(lamb[k], m), block = 1, 
                      rename.level = TRUE) 
  pev[k,] <-res$pev
}
colnames(pev) <- paste("dim", 1:m,sep="")

The plot of the pev according to \(\lambda\) suggests to choose \(\lambda=\lambda_1=\lambda_2=0.37\) (vertical blue line).

k=38
matplot(lamb, pev, main="", ylab="pev", xlab = "lambda",
        type="l",col=1:m,lty=2,cex.main=1) 
legend("topright",legend=colnames(pev),
       lty=2,col=1:m,bty="n", cex=1)
abline(v=lamb[k], col="blue")

Let us now compute sparsePCAmix with these values \(\lambda=\lambda_1=\lambda_2=0.37\).

res <- sparsePCAmix(X.quanti, X.quali, lambda=rep(0.37,m),
                  m=m, block = 1, rename.level = TRUE)
res$pev # proportion of explained variance
## [1] 0.1425241 0.0554981
res$varsel # selected variables
## $dim.1
## [1] "max_heart_rate" "oldpeak"        "chest_pain"     "induced_angina"
## [5] "slope"          "thal"          
## 
## $dim.2
## [1] "serum_cholestoral"

With this choice of \(\lambda\), the proportion of variance explained by the first components decreased only from 17.9% to 14.25% for the first PC. This component is build with only 6 variables (max_heart_rate, oldpeak, chest_pain, induced_angina, slope, and thal).

Plot of the patients according to the sparse PCs

The sparse principal components are build with the variables selected previously.

dim1 <- 1
dim2 <- 2
lab.x <- paste("Dim ", dim1, 
               " (", signif(res$pev[dim1]*100, 4), " %)", 
               sep = "")
lab.y <- paste("Dim ", dim2, 
                 " (", signif(res$pev[dim2]*100, 4), " %)", 
                 sep = "")
graphics::plot(res$scores[, c(dim1,dim2)], main="sparsePCAmix",
               xlab = lab.x, ylab = lab.y, 
               pch = 20, col = HD, cex=0.8)
graphics::abline(h = 0, lty = 2)
graphics::abline(v = 0, lty = 2)
graphics::legend("topright", text.col = 1:2,
                 legend = paste("HD=", levels(HD), sep="")
                 )

References


  1. https://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29↩︎