This dataset from UCI Machine Learning repository1 is a heart disease database similar to the database already present in the repository (Heart Disease databases) but in a slightly different form. The 14 variables are:
library(sparsePCA)
data("HDdata")
The dataset contains 7 quantitative variables and 6 qualitative
variables. We first use the function PCAmix
to compute PCA
of a mixture of numerical and categorical variables.
library(PCAmixdata)
<- HDdata[,-14]
dat <- splitmix(dat)$X.quanti
X.quanti <- splitmix(dat)$X.quali
X.quali
<- PCAmix(X.quanti, X.quali,
pca rename.level = TRUE, graph=FALSE)
$eig[1:5,] pca
## Eigenvalue Proportion Cumulative
## dim 1 3.216257 17.868093 17.86809
## dim 2 1.670727 9.281815 27.14991
## dim 3 1.486730 8.259613 35.40952
## dim 4 1.275941 7.088559 42.49808
## dim 5 1.159581 6.442117 48.94020
The first principal component (PC) explains 17.8% of the variance of the data. The second explains 9.2% with a total of 27% explained by the two fist PCs.
<- HDdata$heart_disease
HD levels(HD) <- c("absence","presence")
plot(pca, coloring.ind = HD, main="",
cex.leg=0.8,
posleg="topright", label=FALSE, cex = 0.8)
The first PC discriminates relatively well between the presence and absence of heart disease.
plot(pca, choice="sqload", cex = 0.8, cex.leg=0.8,)
The graph of the squared loadings shows for instance that the
qualitative variable slope
and the quantitative variable
oldpeak
are linked to the first PC.
plot(pca, choice = "cor", cex = 0.8, cex.leg=0.8,)
The graph of the correlations between the quantitative variables and the PCs shows the sign of this correlation.
plot(pca, choice = "levels", cex = 0.8, cex.leg=0.8,)
The graph of the levels shows that the patients with
slope=1
are located rather on the left (absence of heart
disease) of the factorial map while those with slope=2
or
slope=2
are on the right (presence of heart disease).
We use now the function sparsePCAmix
to compute sparse
PCA of a mixture of quantitative and qualitative variables:
sparsePCAmix(A, m, lambda, block = 1, mu = 1/1:m, rename.level = FALSE)
A
is the data matrix,m
is the number of components,lambda
is the vector of the \(m\) reduced sparsity parameters \(\lambda_j \in [0,1]\),block
defines the approach (block=0
for
deflation and block=1
for block).mu
is the vector of weights used in the block
approach.The result of this function depends on the number of components used.
Here we choose \(m=2\) components
according to the results of PCAmix
above. We choose
arbitrarily \(\lambda_1=\lambda_2=0.4\).
<- 2
m <- sparsePCAmix(X.quanti, X.quali, m=m,
res lambda=c(0.4, 0.4), block = 1,
rename.level = TRUE)
The main results are given below:
$pev # proportion of variance explained be the components res
## [1] 0.09891714 0.05554993
$varsel # list of variables selected to build each component res
## $dim.1
## [1] "max_heart_rate" "oldpeak" "slope"
##
## $dim.2
## [1] "serum_cholestoral"
The proportion of variance explained by the components has necessarily decreased: from 17.9% to 9.8% for the first PC and from 9.2% to 5.5% for the second.
In order to choose the sparsity parameters \(\lambda_1\) and \(\lambda_2\), we build a grid of values \(\lambda=\lambda_1=\lambda_2\) in \([0,1]\). Then we compute the proportion of variance explained by the two sparse PCs for each commun sparsity parameter of the grid.
library(sparsePCA)
<- 2 # number of dimension
m <- seq(0, 1, by=0.01) # grid of common lambda values
lamb <- matrix(NA,length(lamb),m) # proportion of explained variance
pev for (k in 1:length(lamb))
{<- sparsePCAmix(X.quanti, X.quali, m=m,
res lambda=rep(lamb[k], m), block = 1,
rename.level = TRUE)
<-res$pev
pev[k,]
}colnames(pev) <- paste("dim", 1:m,sep="")
The plot of the pev
according to \(\lambda\) suggests to choose \(\lambda=\lambda_1=\lambda_2=0.37\)
(vertical blue line).
=38
kmatplot(lamb, pev, main="", ylab="pev", xlab = "lambda",
type="l",col=1:m,lty=2,cex.main=1)
legend("topright",legend=colnames(pev),
lty=2,col=1:m,bty="n", cex=1)
abline(v=lamb[k], col="blue")
Let us now compute sparsePCAmix
with these values \(\lambda=\lambda_1=\lambda_2=0.37\).
<- sparsePCAmix(X.quanti, X.quali, lambda=rep(0.37,m),
res m=m, block = 1, rename.level = TRUE)
$pev # proportion of explained variance res
## [1] 0.1425241 0.0554981
$varsel # selected variables res
## $dim.1
## [1] "max_heart_rate" "oldpeak" "chest_pain" "induced_angina"
## [5] "slope" "thal"
##
## $dim.2
## [1] "serum_cholestoral"
With this choice of \(\lambda\), the
proportion of variance explained by the first components decreased only
from 17.9% to 14.25% for the first PC. This component is build with only
6 variables (max_heart_rate
, oldpeak
,
chest_pain
, induced_angina
,
slope
, and thal
).
The sparse principal components are build with the variables selected previously.
<- 1
dim1 <- 2
dim2 <- paste("Dim ", dim1,
lab.x " (", signif(res$pev[dim1]*100, 4), " %)",
sep = "")
<- paste("Dim ", dim2,
lab.y " (", signif(res$pev[dim2]*100, 4), " %)",
sep = "")
::plot(res$scores[, c(dim1,dim2)], main="sparsePCAmix",
graphicsxlab = lab.x, ylab = lab.y,
pch = 20, col = HD, cex=0.8)
::abline(h = 0, lty = 2)
graphics::abline(v = 0, lty = 2)
graphics::legend("topright", text.col = 1:2,
graphicslegend = paste("HD=", levels(HD), sep="")
)