Data Analytics - Machine Learning

Introduction

Welcome to this Machine Learning tutorial. Please make sure you have completed the Statistical Learning tutorial.

This tutorial introduces supervised and unsupervised learning.

Classifications

We will use the Titanic dataset. Please see the Statistical Learning tutorial for a brief introduction to it and for initial classifications with logistic regression. Here, we will show how to use the decision tree and random forest methods.

Warm-up and Kaggle

What is the proportion of survivors?

tbl = table(titanic_train$Survived)
prop.table(tbl)

Answer: 38.4% survived.

Create a file that you can submit to Kaggle’s competition. Predict that there are no survivors.

# initialise such that there are no survivors
ns = rep(0, nrow(titanic_test))
# create submission file for Kaggle competition
submit <- data.frame(PassengerId = titanic_test$PassengerId, Survived = ns)
# write.csv(submit,file="no-survivors.csv", row.names = FALSE)
head(submit, n=3L)

The file can be submitted using this link on Kaggle.

Decision trees

Plot a simple decision tree using Survived ~ Sex.

# library(rpart)
dtp <- rpart(Survived ~ Sex, data=titanic_train, method="class" )
plot(dtp,margin=0.2); text(dtp, pretty=0, use.n=TRUE, all=TRUE, cex=.8)

This is pretty ugly. For a nicer decision tree visualisation we will use another package.

# library(pacman); p_load(rattle, rpart.plot, RColorBrewer)
dt <- rpart(Survived ~ Sex, data=titanic_train, method="class" )
fancyRpartPlot(dt)

This is much better. What does it mean? Green nodes indicate that passengers died and blue nodes that they survived. The root node (1) states that 62% of the passengers died; hence, 38% survived, which agrees with the table we created earlier. Next comes the question Sex = male (is the passenger male)? If the answer is yes we follow the left branch, otherwise the right one. Going to the left we end up at the green node 2. That means males have an 81% chance of dying, and they constitute 65% of the data. So this is a tree that is very easy to interpret.
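
These percentages can be cross-checked directly from the training data; a small sketch using base R tables (nothing beyond the loaded data is assumed):

prop.table(table(titanic_train$Sex)) # share of female/male passengers (~65% male)
prop.table(table(titanic_train$Sex, titanic_train$Survived), margin=1) # survival rate per sex (~81% of males died)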

What is the training-accuracy of this tree?

yh = predict(dt,type="class")
y  = titanic_train$Survived
mean(y == yh)

Answer: 78.7%

Let us now get more factors into the tree. First we will convert several fields to factors. Furthermore, we will set the missing (NA) Age values to the average age. The PassengerId, Name, Ticket and Cabin columns will be dropped.

train = titanic_train; test = titanic_test;
cols <- c("Pclass", "Sex", "Ticket", "Cabin", "Embarked")
train$Survived = as.factor(train$Survived)
train[cols] <- lapply(train[cols], factor)
test[cols]  <- lapply(test[cols], factor)
mv = mean(c(train$Age,test$Age), na.rm =TRUE)
train$Age[is.na(train$Age)] <- mv
test$Age [is.na(test$Age) ] <- mv

# remove columns , e.g. train$Name = NULL
train = subset( train, select = -c(PassengerId, Name, Ticket, Cabin))
test = subset( test, select = -c(PassengerId, Name, Ticket, Cabin))

Is the new train-accuracy better using all factors?

dtc <- rpart(Survived ~ ., data=train, method="class" )
yh = predict(dtc,type="class"); y  = train$Survived
mean(y == yh)

The accuracy has improved.

Let’s visualise the new model.

fancyRpartPlot(dtc)

Unfortunately, over-fitting can lead to worse test-accuracy.

yh <- predict(dtc, test, type="class")
head(yh, n=3L)

Pruning

Pruning a decision tree is used to reduce its complexity and avoid over-fitting. printcp displays the complexity parameter (CP) table and plotcp gives a visual display of it.

printcp(dtc)
plotcp(dtc)

Let us use the displayed CPs and relative error to prune the tree. More specifically, we use a CP of 0.027.

dtp <- prune(dtc, cp=0.027)
#plot(dtp,margin=0.2); text(dtp, pretty=0, use.n=TRUE, all=TRUE, cex=.8)
fancyRpartPlot(dtp)

This is also known as post-pruning.

Try to submit the new model to Kaggle and see whether you improved your ranking.
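
For instance, a submission file for the pruned tree could be built along the lines of the no-survivors example above (a sketch; the file name is only a suggestion):

yh <- predict(dtp, test, type="class")
submit <- data.frame(PassengerId = titanic_test$PassengerId, Survived = yh)
# write.csv(submit, file="decision-tree.csv", row.names = FALSE)
head(submit, n=3L)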

Please have a look at Trevor Stephens’ Titanic tutorial for several more interesting ideas.

Random Forest

Let us now use the random forest method. This method creates several decision trees by varying the features and the subset of the data used for each tree. For a given observation, the trees then take a majority vote on which class to choose.

The randomForest function cannot deal with factors that have more than 53 categories. So we had to remove Ticket and Cabin from the input data.

#library(randomForest)
set.seed(7)
rf <- randomForest(Survived ~ ., data = train)
summary(rf)

What is the training-accuracy of this model?

yh = predict(rf) 
mean( yh == train$Survived)
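
As a quick additional inspection (a sketch, not part of the Kaggle workflow), the variable importance of the fitted forest can be examined with the randomForest helpers importance and varImpPlot:

importance(rf)  # mean decrease in Gini impurity per variable
varImpPlot(rf)  # visualise the variable importance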

Predictions

We will use linear regression and the regression tree (= decision tree) for predictions. The models’ quality will be compared using the root-mean-squared-error (RMSE) and the mean-absolute-percentage-error (MAPE) introduced later.

We will use the Wage dataset from the ISLR package. Please see the related book James et al. (2013) for more details.

# library(ISLR) #loads the Wage data (pre-loaded)
data(Wage); head(Wage, n=3L);
str(Wage)

There are 3000 observations and 11 variables. Most variables are factors. We observe that the data is given per year. The salary (wage) is in thousands of dollars.

Does the wage depend on the person’s age?

a = ggplot(Wage, aes(age, wage))
a+geom_point(colour="gray")+geom_smooth(method="lm")+geom_smooth()
# plot(Wage$age,Wage$wage) # built-in plot (avoids ggplot)

Visually, it appears that there is a small correlation. We note two distinct groups: one with a continuously high salary and one where the salary increases initially before dropping. The group distinction might be modelled well with decision trees. The non-linearity in the second group may require a new polynomial feature to improve the model quality.

What is the correlation coefficient?

cor(Wage$age,Wage$wage)

The correlation value indicates that there is only a very weak correlation. That means inferring the salary from the age variable is risky.

We will now derive a few models and compare their test-accuracy.

Split the data into 70% training samples and 30% test samples. We will also remove logwage (column 10), since it is the logarithmic wage (see also cor(Wage$wage, Wage$logwage)) and keeping it would defeat the purpose of predicting the wage.

n = nrow(Wage); ns = floor(.7*n);
set.seed(1) # random generator seed
I = sample(1:n, ns)
train = Wage[I,-10]
test = Wage[-I,-10]
cat("train: ",ns, " observations")

Note: once the best model has been identified, the entire dataset can be used as training data.
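
As a minimal sketch (using the simple linear model introduced below merely as a placeholder), refitting on the full dataset would look like this:

# refit the chosen model on all observations (the formula is a placeholder)
pm_final <- lm(wage ~ age, data = Wage)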

Linear Regression

Create a linear regression and “predict” the salary (wage) at the age of 20, 25 and 30.

pm <- lm(wage ~ age, train)
predict(pm, data.frame(age=c(20,25,30)))

RMSE

What is the test root-mean-squared-error (rmse)? \[\sqrt{\frac{1}{n}\sum_{k=1}^n \left(y_k -\hat{y_k}\right)^2} \]

rmse <- function(y,yh) {sqrt(mean((y-yh)^2))}

pm <- lm(wage ~ age, train)
yh = predict(pm, newdata=test)
y  = test$wage
rmse(y,yh)

MAPE

The mean-absolute-percentage-error (mape, MAPE) is defined as: \[\frac{1}{n}\sum_{k=1}^n \left|\frac{y_k - \hat{y_k}}{y_k}\right| \]

What is the test-MAPE ?

mape     <- function(y,yh) {mean(abs((y-yh)/y))} 

pm <- lm(wage ~ age, train)
yh = predict(pm, newdata=test)
y  = test$wage
mape(y,yh)

Note: What happens when an element is zero (\(y_k=0\))? The formula breaks. That means it is assumed that all elements are non-zero. Sometimes the MAPE is defined with the absolute value only in the numerator: \[\frac{1}{n}\sum_{k=1}^n \frac{|y_k - \hat{y_k}|}{y_k} \] In that case negative values of \(y_k\) would reduce the error; it works fine as long as all elements are positive.

That means, on average we expect a model error of 29.2%.
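
To illustrate the note about zero elements, a tiny example with made-up values shows that a single zero target makes the MAPE infinite:

# one zero in y leads to a division by zero
mape(c(0, 100, 200), c(10, 110, 190))  # returns Inf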

Non-linearity

Previously, we saw that the salary seems to have some non-linearity: initially the salary increases with age and then it drops again.

What is the correlation between age squared and wage?

cor(Wage$age^2,Wage$wage)

This is even lower than before. This does not support the visual observation very well.

Create a new model that integrates the squared age in the model. What is the MAPE?

pm <- lm(wage ~ age + I(age^2), train )
mape ( test$wage, predict(pm, newdata=test) )

We see that adding this feature has improved the accuracy.

Note: I() (i.e. as.is) needs to be used because ^ has a special meaning (interactions) in formulas. Alternatively, you could create numeric data with cbind(age, age^2) or use poly(age, 2, raw=TRUE).

Adding more terms will improve the accuracy. Try to add age^3 and age^4 or try to add a polynomial of degree 20. The theory of Taylor expansions is behind these improvements.
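
A sketch of such an extension, using poly() with a degree-4 polynomial (the variable names follow the code above):

pm4 <- lm(wage ~ poly(age, 4, raw=TRUE), train)
mape(test$wage, predict(pm4, newdata=test))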

Regression tree

Now, let us use the regression tree (decision tree) to gain more insights.

We begin with the simple formula wage ~ age.

dt <- rpart( wage ~ age, train)
yh = predict(dt, newdata = test)
y  = test$wage
mape(y, yh); # rmse(y, yh)
fancyRpartPlot(dt)

The test-MAPE of the decision tree (DT) is slightly worse than that of the linear regression (LR).

Let us use now all variables.

dt <- rpart( wage ~ ., train)
mape(train$wage, predict(dt)); # training error
mape(test$wage, predict(dt, newdata=test)); # test error 

The low training-MAPE is deceiving: it makes us believe that we can predict the expected salary very well given all the previous factors. However, the high test error means that over-fitting took place. That means a less complex tree would be better. We can obtain such a tree by pruning.

Pruning

Pruning a decision tree is used to reduce its complexity and avoid over-fitting. printcp displays the complexity parameter (CP) table and plotcp gives a visual display of it.

printcp(dt)
plotcp(dt)

Let us use the displayed CPs and relative error to prune the tree. More specifically, we choose a CP between 0.016 and 0.026.

dtp <- prune(dt, cp=0.021)
mape(test$wage, predict(dtp, newdata=test)); # test error 
plot(dtp,margin=0.2); text(dtp, pretty=0, use.n=TRUE, all=TRUE, cex=.8)

This is also known as post-pruning.

Additionally, we could have done pre-pruning. That means we provide a few control parameters to rpart such as minbucket (minimum number of observations in leaf node), minsplit (minimum number of observations for a split), etc.

dt.pp <- rpart( wage ~ ., train, control = rpart.control(minbucket=100))
mape(train$wage, predict(dt.pp)); # training error
mape(test$wage, predict(dt.pp, newdata=test)); # test error 
fancyRpartPlot(dt.pp)

Clustering

There are a few clustering algorithms. Often used is the \(k\)-means algorithm. Hierarchical clustering comes in two main flavours: bottom-up (agglomerative clustering) and top-down (divisive clustering).

k-means

The \(k\)-means algorithm is a popular heuristic which finds \(k\) clusters by minimising the distance between the observations and their cluster centres.
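
A minimal illustration with a handful of made-up one-dimensional points (two obvious groups):

set.seed(7)
kmeans(c(1, 2, 3, 10, 11, 12), centers = 2)$cluster  # assigns the two groups to two clusters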

Example - Iris

Irises are purple flowers. We will consider the following species: Setosa, Virginica and Versicolor.

We will use the built-in dataset.

data(iris); head(iris); summary(iris); str(iris)

As we can see the species is already within the dataset, which defeats the purpose of clustering. However, we pretend that we want to define three different species automatically. We will use the pre-defined species and compare them to the ones suggested by the clustering algorithm.

Let’s visualise the cluster using the petal’s length and width.

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point() + theme(legend.position = c(.18,.73))
unique(iris$Species)

The cluster on the left seems to be straightforward to identify. The top ones are slightly more difficult to distinguish.

We will focus only on the petal length and width, and ignore the sepal (green cover for petals) variables.

Use the kmeans function and form three clusters. Restart the algorithm 30 times (nstart = 30), i.e. the initial cluster centres are placed randomly thirty times. How well do the clusters fit the given species classification?

set.seed(7)
cm <- kmeans(iris[, 3:4], 3, nstart = 30)
cm
table(cm$cluster, iris$Species)

We see that the setosa cluster is perfectly matched. 96% of the versicolor species are identified and 92% of the virginica species.
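
These percentages can be read off the contingency table, for instance as column-wise proportions (a small sketch; note that the cluster labels 1 to 3 are arbitrary):

# proportion of each species assigned to each cluster (columns sum to 1)
prop.table(table(cm$cluster, iris$Species), margin = 2)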

Plot the new 3-means clusters.

cm$cluster <- as.factor(cm$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cm$cluster)) + geom_point()+ theme(legend.position = "none")

Hierarchical Clustering

We will consider an agglomerative (bottom-up) approach.

Simple Call

Hierarchically cluster the iris petal data using complete linkage. First, scale the data by mean and standard deviation. Then determine the distances between all observations.

What distance measure was used by default?

# p_load(ggplot2, latex2exp) # for plots
# p_load(ggdendro, dendextend) # for dendrograms
# p_load(magrittr) # for the %>% pipe (in case it is not already loaded)
md = "complete"
hc = iris[, 3:4] %>% scale %>% dist %>% 
  hclust(method = md)
hc

Determine the top three clusters and compare them to the actual species.

I = cutree(hc, 3)
table(I, iris$Species)

We see that the setosa and virginica species were modelled perfectly. However, only 50% of the versicolor irises were identified correctly. The problem is that we used the complete linkage, which was inappropriate in this case.
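
As a quick experiment (a sketch; the outcome is left for you to inspect), the clustering can be repeated with a different linkage, e.g. average linkage, and compared to the species again:

hc.avg = iris[, 3:4] %>% scale %>% dist %>% hclust(method = "average")
table(cutree(hc.avg, 3), iris$Species)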

Visualise the modelled clusters.

X = iris[, 3:4];
qplot(X[,1], X[,2], geom = "point",
        color = as.factor(I)) +
    xlab(TeX("$x_1$")) + ylab(TeX("$x_2$")) +
    theme(legend.position = "none")

Dendrogram

Let’s use just a small sample (30 observations) to gain more insight into dendrograms and their linkages. We begin by defining a function that prepares our dendrogram for plotting.

ppdend <- function(D, method="average", nb_clusters=3){
  return(D %>% scale %>% dist %>% 
    hclust(method = method) %>% as.dendrogram %>%
    set("branches_k_color", k=nb_clusters) %>% set("branches_lwd", .6) %>%
    set("labels_colors") %>% set("labels_cex", c(.6,.6)) %>% 
    set("leaves_pch", 19) %>% set("leaves_col", c("blue", "purple")))
}

set.seed(1); I = sample(1:150, 30);
X = iris[I, 3:4]; 

par(mfrow =c(2,3))
plot( ppdend(X,"single"),main="single")
plot( ppdend(X,"complete"),main="complete")
plot( ppdend(X,"average"),main="average")
plot( ppdend(X,"centroid"),main="centroid")
plot( ppdend(X,"ward.D"),main="ward.D")
plot( ppdend(X,"ward.D2"),main="ward.D2")

Let us have a look at a few more dendrogram visualisations.

dend <- ppdend(X)
plot( dend ) # as before simple plot
ggplot(as.ggdend(dend)) # as ggplot
ggplot(as.ggdend(dend), labels = FALSE) + 
  scale_y_reverse(expand = c(0.2, 0)) +
  coord_polar(theta="x") # as polar coordinate plot

Classifications - Iris

We will use the Iris dataset and divide it into test and train data.

n = nrow(iris)
ns = floor(.7*n)
set.seed(1); I = sample(1:n, ns)
train = iris[I,]; test = iris[-I,]

Decision Tree

Create a classification tree and determine its test-accuracy.

#library(rpart); library(pacman); p_load(rattle, rpart.plot, RColorBrewer)
dt <- rpart(Species ~ ., data=train, method="class" )
# fancyRpartPlot(dt) # not much use, too complex
yh <- predict(dt, type="class", newdata = test)
mean(yh == test$Species)

Random Forest

Create a random forest and determine its test-accuracy.

# library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data=train, importance=TRUE)
yh <- predict(rf,type="class", newdata = test)
mean(yh == test$Species)

kNN

Create a kNN classifier and determine its test-accuracy.

# library (class)
knn.cf = knn(train[,1:4], test[,1:4], train$Species, k=3, prob=TRUE)
table(knn.cf, test$Species)
mean(knn.cf == test$Species)

Resources

Acknowledgment

This tutorial was created using RStudio, R, rmarkdown, and many other tools and libraries. The packages learnr and gradethis were particularly useful. I’m very grateful to Prof. Andy Field for sharing his discovr package, which allowed me to improve the style of this tutorial and get more familiar with learnr. Allison Horst wrote a very instructive blog “Teach R with learnr: a powerful tool for remote teaching”, which encouraged me to continue with learnr. By the way, I find her statistics illustrations amazing.

References

“Data Analytics Tutorial for Beginners - From Beginner to Pro in 10 Mins! - DataFlair.” 2019. DataFlair. https://data-flair.training/blogs/data-analytics-tutorial.

Field, Andy P. 2021. Discovering Statistics Using R and RStudio. Second. London: Sage.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge University Press.

Wolfgang Garn