This dataset from the UCI Machine Learning repository [1] is a heart disease database similar to the database already present in the repository (Heart Disease databases) but in a slightly different form. It contains 14 variables.
library(sparsePCA)
data("HDdata")The dataset contains 7 quantitative variables and 6 qualitative
variables. We first use the function PCAmix to compute PCA
of a mixture of numerical and categorical variables.
library(PCAmixdata)
dat <- HDdata[,-14]
X.quanti <- splitmix(dat)$X.quanti
X.quali <- splitmix(dat)$X.quali
pca <- PCAmix(X.quanti, X.quali,
rename.level = TRUE, graph=FALSE)
pca$eig[1:5,]
## Eigenvalue Proportion Cumulative
## dim 1 3.216257 17.868093 17.86809
## dim 2 1.670727 9.281815 27.14991
## dim 3 1.486730 8.259613 35.40952
## dim 4 1.275941 7.088559 42.49808
## dim 5 1.159581 6.442117 48.94020
The first principal component (PC) explains 17.9% of the variance of the data and the second explains 9.2%, for a total of about 27% explained by the first two PCs.
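These percentages are simply each eigenvalue divided by the total inertia; a quick sketch of that arithmetic, using only the pca object created above (and assuming, as the subsetting above suggests, that pca$eig stores the eigenvalues of all dimensions):
tot.inertia <- sum(pca$eig[, "Eigenvalue"]) # total inertia of the mixed data
round(100 * pca$eig[1:5, "Eigenvalue"] / tot.inertia, 5) # reproduces the Proportion column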
HD <- HDdata$heart_disease
levels(HD) <- c("absence","presence")
plot(pca, coloring.ind = HD, main="",
cex.leg=0.8,
posleg="topright", label=FALSE, cex = 0.8)The first PC discriminates relatively well between the presence and absence of heart disease.
plot(pca, choice="sqload", cex = 0.8, cex.leg=0.8,)The graph of the squared loadings shows for instance that the
qualitative variable slope and the quantitative variable
oldpeak are linked to the first PC.
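The underlying values can also be read off numerically; a short sketch, assuming the squared loadings are stored in the $sqload slot of the PCAmix object, with one row per original variable:
# Squared loadings of oldpeak and slope on the first two PCs
# ($sqload and its row names are assumptions about the PCAmix object).
round(pca$sqload[c("oldpeak", "slope"), 1:2], 3)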
plot(pca, choice = "cor", cex = 0.8, cex.leg=0.8,)The graph of the correlations between the quantitative variables and the PCs shows the sign of this correlation.
plot(pca, choice = "levels", cex = 0.8, cex.leg=0.8,)The graph of the levels shows that the patients with
slope=1 are located rather on the left (absence of heart
disease) of the factorial map while those with slope=2 or
slope=2 are on the right (presence of heart disease).
We now use the function sparsePCAmix to compute a sparse PCA of a mixture of quantitative and qualitative variables:
sparsePCAmix(A, m, lambda, block = 1, mu = 1/1:m, rename.level = FALSE)
A is the data matrix, m is the number of components, lambda is the vector of the \(m\) reduced sparsity parameters \(\lambda_j \in [0,1]\), block selects the approach (block=0 for deflation, block=1 for the block approach), and mu is the vector of weights used in the block approach. The result of this function depends on the number of components used.
Here we choose \(m=2\) components
according to the results of PCAmix above. We arbitrarily choose \(\lambda_1=\lambda_2=0.4\).
m <- 2
res <- sparsePCAmix(X.quanti, X.quali, m=m,
lambda=c(0.4, 0.4), block = 1,
rename.level = TRUE)
The main results are given below:
res$pev # proportion of variance explained by the components
## [1] 0.09891714 0.05554993
res$varsel # list of variables selected to build each component
## $dim.1
## [1] "max_heart_rate" "oldpeak" "slope"
##
## $dim.2
## [1] "serum_cholestoral"
The proportion of variance explained by the components has necessarily decreased: from 17.9% to 9.9% for the first PC and from 9.2% to 5.5% for the second.
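The sparsity of this solution can be summarised directly from the varsel list printed above; a small sketch:
# Number of variables selected on each sparse component
# (3 for dim.1 and 1 for dim.2, matching the list above).
sapply(res$varsel, length)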
In order to choose the sparsity parameters \(\lambda_1\) and \(\lambda_2\), we build a grid of values \(\lambda=\lambda_1=\lambda_2\) in \([0,1]\). Then we compute the proportion of variance explained by the two sparse PCs for each common sparsity parameter of the grid.
library(sparsePCA)
m <- 2 # number of components
lamb <- seq(0, 1, by=0.01) # grid of common lambda values
pev <- matrix(NA,length(lamb),m) # proportion of explained variance
for (k in 1:length(lamb))
{
res <- sparsePCAmix(X.quanti, X.quali, m=m,
lambda=rep(lamb[k], m), block = 1,
rename.level = TRUE)
pev[k,] <- res$pev
}
colnames(pev) <- paste("dim", 1:m, sep="")
The plot of the pev against \(\lambda\) suggests choosing \(\lambda=\lambda_1=\lambda_2=0.37\) (vertical blue line).
k <- 38 # index of lambda = 0.37 in the grid (lamb[38] is 0.37)
matplot(lamb, pev, main="", ylab="pev", xlab = "lambda",
type="l",col=1:m,lty=2,cex.main=1)
legend("topright",legend=colnames(pev),
lty=2,col=1:m,bty="n", cex=1)
abline(v=lamb[k], col="blue")
Let us now compute sparsePCAmix with this value \(\lambda=\lambda_1=\lambda_2=0.37\).
res <- sparsePCAmix(X.quanti, X.quali, lambda=rep(0.37,m),
m=m, block = 1, rename.level = TRUE)
res$pev # proportion of explained variance
## [1] 0.1425241 0.0554981
res$varsel # selected variables
## $dim.1
## [1] "max_heart_rate" "oldpeak" "chest_pain" "induced_angina"
## [5] "slope" "thal"
##
## $dim.2
## [1] "serum_cholestoral"
With this choice of \(\lambda\), the proportion of variance explained by the first PC decreases only from 17.9% to 14.25%. This component is built with only 6 variables (max_heart_rate, oldpeak, chest_pain, induced_angina, slope, and thal).
The sparse principal components are built from the variables selected previously. We plot the scores of the individuals on the first two sparse PCs, colored by heart-disease status.
dim1 <- 1
dim2 <- 2
lab.x <- paste("Dim ", dim1,
" (", signif(res$pev[dim1]*100, 4), " %)",
sep = "")
lab.y <- paste("Dim ", dim2,
" (", signif(res$pev[dim2]*100, 4), " %)",
sep = "")
graphics::plot(res$scores[, c(dim1,dim2)], main="sparsePCAmix",
xlab = lab.x, ylab = lab.y,
pch = 20, col = HD, cex=0.8)
graphics::abline(h = 0, lty = 2)
graphics::abline(v = 0, lty = 2)
graphics::legend("topright", text.col = 1:2,
legend = paste("HD=", levels(HD), sep="")
)