Sparse PCA for mixed data

Statlog Heart Data Set

This dataset from UCI Machine Learning repository¹ is a heart disease database similar to the database already present in the repository (Heart Disease databases) but in a slightly different form. The 14 variables are:

age
sex (0=female, 1=male),
chest pain type ( 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4:=asymptomatic),
resting blood pressure in mm Hg
serum cholestoral in mg/dl
fasting blood sugar > 120 mg/dl (1=True, 0=False)
resting electrocardiographic results (0=normal,1=ST-T wave abnormality, 2= elevation or depression of > 0.05 mV))
maximum heart rate achieved
exercise induced angina (1=Yes, 0=No)
oldpeak = ST depression induced by exercise relative to rest
the slope of the peak exercise ST segment (1=upsloping, 2=flat,3=downsloping)
number of major vessels (0-3) colored by flourosopy
thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
heart disease : 1=absence, 2=presence

library(sparsePCA)
data("HDdata")

PCAmix

The dataset contains 7 quantitative variables and 6 qualitative variables. We first use the function PCAmix to compute PCA of a mixture of numerical and categorical variables.

library(PCAmixdata)
dat <- HDdata[,-14]
X.quanti <- splitmix(dat)$X.quanti
X.quali <- splitmix(dat)$X.quali

pca <- PCAmix(X.quanti, X.quali,
              rename.level = TRUE, graph=FALSE)
pca$eig[1:5,]

##       Eigenvalue Proportion Cumulative
## dim 1   3.216257  17.868093   17.86809
## dim 2   1.670727   9.281815   27.14991
## dim 3   1.486730   8.259613   35.40952
## dim 4   1.275941   7.088559   42.49808
## dim 5   1.159581   6.442117   48.94020

The first principal component (PC) explains 17.8% of the variance of the data. The second explains 9.2% with a total of 27% explained by the two fist PCs.

HD <- HDdata$heart_disease
levels(HD) <- c("absence","presence")
plot(pca, coloring.ind = HD, main="",
     cex.leg=0.8,
     posleg="topright", label=FALSE, cex = 0.8)

The first PC discriminates relatively well between the presence and absence of heart disease.

plot(pca, choice="sqload", cex = 0.8,  cex.leg=0.8,)

The graph of the squared loadings shows for instance that the qualitative variable slope and the quantitative variable oldpeak are linked to the first PC.

plot(pca, choice = "cor", cex = 0.8,  cex.leg=0.8,)

The graph of the correlations between the quantitative variables and the PCs shows the sign of this correlation.

plot(pca, choice = "levels", cex = 0.8,  cex.leg=0.8,)

The graph of the levels shows that the patients with slope=1 are located rather on the left (absence of heart disease) of the factorial map while those with slope=2 or slope=2 are on the right (presence of heart disease).

sparsePCAmix

We use now the function sparsePCAmix to compute sparse PCA of a mixture of quantitative and qualitative variables:

sparsePCAmix(A, m, lambda, block = 1, mu = 1/1:m, rename.level = FALSE)

A is the data matrix,
m is the number of components,
lambda is the vector of the \(m\) reduced sparsity parameters \(\lambda_j \in [0,1]\),
block defines the approach (block=0 for deflation and block=1 for block).
mu is the vector of weights used in the block approach.

The result of this function depends on the number of components used. Here we choose \(m=2\) components according to the results of PCAmix above. We choose arbitrarily \(\lambda_1=\lambda_2=0.4\).

m <- 2
res <- sparsePCAmix(X.quanti, X.quali, m=m, 
                      lambda=c(0.4, 0.4), block = 1, 
                      rename.level = TRUE)

The main results are given below:

res$pev # proportion of variance explained be the components

## [1] 0.09891714 0.05554993

res$varsel # list of variables selected to build each component

## $dim.1
## [1] "max_heart_rate" "oldpeak"        "slope"         
## 
## $dim.2
## [1] "serum_cholestoral"

The proportion of variance explained by the components has necessarily decreased: from 17.9% to 9.8% for the first PC and from 9.2% to 5.5% for the second.

Choice of the sparsity parameters.

In order to choose the sparsity parameters \(\lambda_1\) and \(\lambda_2\), we build a grid of values \(\lambda=\lambda_1=\lambda_2\) in \([0,1]\). Then we compute the proportion of variance explained by the two sparse PCs for each commun sparsity parameter of the grid.

library(sparsePCA)
m <- 2 # number of dimension 
lamb <- seq(0, 1, by=0.01) # grid of common lambda values
pev <- matrix(NA,length(lamb),m) # proportion of explained variance
for (k in 1:length(lamb))
{
  res <- sparsePCAmix(X.quanti, X.quali, m=m, 
                      lambda=rep(lamb[k], m), block = 1, 
                      rename.level = TRUE) 
  pev[k,] <-res$pev
}
colnames(pev) <- paste("dim", 1:m,sep="")

The plot of the pev according to \(\lambda\) suggests to choose \(\lambda=\lambda_1=\lambda_2=0.37\) (vertical blue line).

k=38
matplot(lamb, pev, main="", ylab="pev", xlab = "lambda",
        type="l",col=1:m,lty=2,cex.main=1) 
legend("topright",legend=colnames(pev),
       lty=2,col=1:m,bty="n", cex=1)
abline(v=lamb[k], col="blue")

Let us now compute sparsePCAmix with these values \(\lambda=\lambda_1=\lambda_2=0.37\).

res <- sparsePCAmix(X.quanti, X.quali, lambda=rep(0.37,m),
                  m=m, block = 1, rename.level = TRUE)
res$pev # proportion of explained variance

## [1] 0.1425241 0.0554981

res$varsel # selected variables

## $dim.1
## [1] "max_heart_rate" "oldpeak"        "chest_pain"     "induced_angina"
## [5] "slope"          "thal"          
## 
## $dim.2
## [1] "serum_cholestoral"

With this choice of \(\lambda\), the proportion of variance explained by the first components decreased only from 17.9% to 14.25% for the first PC. This component is build with only 6 variables (max_heart_rate, oldpeak, chest_pain, induced_angina, slope, and thal).

Plot of the patients according to the sparse PCs

The sparse principal components are build with the variables selected previously.

dim1 <- 1
dim2 <- 2
lab.x <- paste("Dim ", dim1, 
               " (", signif(res$pev[dim1]*100, 4), " %)", 
               sep = "")
lab.y <- paste("Dim ", dim2, 
                 " (", signif(res$pev[dim2]*100, 4), " %)", 
                 sep = "")
graphics::plot(res$scores[, c(dim1,dim2)], main="sparsePCAmix",
               xlab = lab.x, ylab = lab.y, 
               pch = 20, col = HD, cex=0.8)
graphics::abline(h = 0, lty = 2)
graphics::abline(v = 0, lty = 2)
graphics::legend("topright", text.col = 1:2,
                 legend = paste("HD=", levels(HD), sep="")
                 )

References

M. Chavent, G. Chavent. Optimal Projected Variance Group-Sparse Block PCA. 2021. ⟨hal-03125264⟩
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. Multivariate Analysis of Mixed Data: The R Package PCAmixdata. Electronic Journal of Applied Statistical Analysis. ⟨hal-01662595⟩
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.