Introduction

Welcome to this Machine Learning tutorial! In this tutorial supervised and unsupervised learning will be introduced.

Iris field

Machine Learning (ML) is mainly concerned with identifying patterns in the data in order to predict and classify. Previously we used linear regression and logistic regression to achieve this. However, we heard about them in the context of statistical learning (SL), since they are widely used for inference about the population and hypotheses. Typically, SL relies on assumptions such as normality, homoscedasticity, independent variables, and others; whereas ML often ignores those. We continue with supervised learning (i.e. response is known) approaches. Please be sure you have done the Statistical Learning tutorial. Particularly, we will look at tree-based methods such as the decision tree and random forest. Then we move to unsupervised learning approaches. Here, we will focus on clustering algorithms.

Predictions and Trees

Regression Tree Introduction

We will use the regression tree (= decision tree) for predictions. The quality of models will be compared using the root-mean-squared-error (RMSE) and the mean-absolute-percentage-error (MAPE).

In the Statistical Learning tutorial we introduced the faithful dataset and predicted eruption times using linear regression. Now, we will do the prediction using the regression tree method.

We will begin by developing some basic understanding of the regression tree algorithm.

First we extract 10 records from the original dataset.

ns = 10  # intended number of sample records
set.seed(1) # fix random generator (for reproducibility)
I = sample(1:nrow(faithful), ns) # get the sample indices
D = faithful[I,] # extract the sample
D

x = D$waiting; y = D$eruptions; # for easy access
cat('average of y is ', mean(y))

Let us visualise the data.

# library(ggplot2)
g <- ggplot(aes(x=waiting,y=eruptions), data = D)+
  geom_point(size=4, color="brown")+
  geom_text(label=1:ns, nudge_x = .5, nudge_y = .05)+
  xlab("x = waiting time [min]")+ ylab("y=eruption time[min]")+
  theme(text = element_text(size=12))
g

How would you split the data to minimise MSE? Recall the mean squared error is: \[\text{MSE} = \frac{1}{n}\sum_{k=1}^n \left(y_k -\hat{y_k}\right)^2, \] where \(n\) are the number of observations, \(y_k\) is the observed value (response, output), \(\hat{y}_k = \hat{f}(x_k)\) is the modelled value (prediction), and \(x_k\) is the input.

Before using the MSE, try it intuitively.

Probably, you did the split in a way similar to:
\[x_s = \frac{x_1+x_5}{2} = \frac{63+73}{2}=68,\] i.e. xs = (x[1]+x[5])/2.

Now, we can use this to predict (approximate) the \(y\)-values on the left \(c_1\) and right side \(c_2\): \[c_1 = \frac{1}{5}(y_1+y_2+y_6+y_8+y_9) = 2.04\] That means the MSE on the left side is: \[ \text{MSE}_\text{left} = \frac{1}{5}( (y_1-c_1)^2+(y_2-c_1)^2+(y_6-c_1)^2+(y_8-c_1)^2+(y_9-c_1)^2) \]

\[c_2 = \frac{1}{5}(y_5+y_7+y_{10}+y_4+y_3) = 4.1\] Similarly, to before we compute the \(\text{MSE}_\text{right}\). Other splits will give a higher MSE.

Let us visualise this additional information.

c1 = mean(y[x<68]) # left region
c2 = mean(y[x>=68]) # right region
g2 <- g + geom_vline(xintercept = 68, size=2, color = "blue")+ 
  geom_text(x=68,y=3, label="68", color="red")+
  geom_segment(aes(x=c(45),xend=c(68), y=c(c1), yend=c(c1)), 
               size=2, color="red", linetype=4)+
  geom_segment(aes(x=c(68),xend=c(95), y=c(c2), yend=c(c2)), 
               size=2, color="red", linetype=4)+
  geom_hline(yintercept = 3.1, size=2, color = "red")
g2

Decision Stump

A decision stump is a one level decision tree. That means, the root (internal node) is connected to the terminal nodes (known as leaves). The prediction is based on a single feature (also known as 1-rule).

How would you represent this information as a decision stump?

#library(pacman); # package manager
#p_load(rpart, rattle, rpart.plot, RColorBrewer) # load packages
dt <- rpart( eruptions ~ waiting, D, minsplit=2, maxdepth=1)
fancyRpartPlot(dt, sub="decision stump")

Let’s explain the graph. The numbers on the top of each node show the node numbers 1, 2 and 3. Let us consider node 1 (root node). If we take the average of all observations, then we get \(\bar{y}=3.1\). This is the average eruption time taken from the \(n=10\) observations, which represents 100% of the data.

Underneath node 1 we have the 1-rule (decision, if condition). That means, if the the waiting time is less than 68 minutes we follow the yes-branch on the left; otherwise the no-branch on the right. The yes-branch takes us to node 2 (a leaf node). We can now read the node 2 information. In case we waited less than 68 minutes we predict that the waiting time is 2 minutes. This is based on \(n=5\) eruption time observations, which represent 50% of the data.

What is the eruption time if we waited 80 minutes? Answer: 4.1 minutes.

The above used the function rpart - Recursive Partitioning and Regression Trees. It has a couple of rpart.control parameters such as the minimum number of observations that must exist in a node in order for a split to be attempted (minsplit), and the maximum depth of the final tree (maxdepth). The root node is at depth zero.

Level 2 - Decision Tree

Can you split again? “The growing of a tree” split region 1 (left) and 2 (right). This is a bit harder now - looking at the left region. It is not obvious where to split. We need to test all possible splits and select the one that minimises the MSE.

The rpart function helps us finding these splits. We just have to increase the maxdepth by one level.

dt <- rpart( eruptions ~ waiting, D, minsplit=2, maxdepth=2)
fancyRpartPlot(dt, sub="Three split nodes - a level 2 regression tree.")
c3 = mean(y[c(9,8,2,6)])
c4 = y[1]
c5 = mean(y[c(5,7,10,4)])
c6 = y[3]

g2 + 
  geom_segment(aes(x=c(45),xend=c(61), y=c(c3), yend=c(c3)), 
               size=1, color="green", linetype=4)+
  geom_segment(aes(x=c(61),xend=c(68), y=c(c4), yend=c(c4)), 
               size=1, color="green", linetype=4)+
  geom_vline(xintercept = 61, size=1, color = "blue", linetype=4)+
  geom_vline(xintercept = 87, size=1, color = "blue", linetype=4)+
  geom_segment(aes(x=c(68),xend=c(87), y=c(c5), yend=c(c5)), 
               size=1, color="green", linetype=4)+
  geom_segment(aes(x=c(87),xend=c(95), y=c(c6), yend=c(c6)), 
               size=1, color="green", linetype=4)+
  geom_hline(yintercept = 3.1, size=1, color = "red", linetype=4)

What is the expected waiting time when we waited:

  • 50 or 60 minutes?
  • 80 minutes?
  • 65 minutes?
  • 100 minutes?

Use the previous two diagrams and compare it to the following.

predict(dt, data.frame(waiting=c(50, 60, 80, 65, 100)))

Reflect how did you obtain this tree? How many split points are created? Can you formulate this as an algorithm!?

Next, learn a prediction tree for the train data (e.g. 70% of the faithful data). Test the quality of the model by computing the MSE, RMSE and MAPE. Compare the results to linear regression computed in Statistical Learning tutorial.

# training & test data split
n  = nrow(faithful); ns = floor(.7*n);
set.seed(1); I = sample(1:n,ns);
train = faithful[I,]; test = faithful[-I,];

# regression tree
dt <- rpart(eruptions ~ waiting, train, minsplit=3, maxdepth=7)
# fancyRpartPlot(dt)

# Quality measures
mse  <- function(y,yh) {mean((y-yh)^2)}
rmse <- function(y,yh) {sqrt(mse(y,yh))}
mape <- function(y,yh) {mean(abs( (y-yh)/y ))}
derr <- function(y,yh) {sprintf('MSE = %.3f, RMSE = %.3f, MAPE = %.2f%%\n',
                                    mse(y,yh),rmse(y,yh),100*mape(y,yh))}
y  = test$eruptions
yh = predict(dt, newdata = test)
cat("Regression tree: ", derr(y,yh))

# compare with linear regression
lr <- lm(eruptions ~ waiting, train)
yh = predict(lr, newdata = test)
cat("linear regression: ", derr(y,yh))

Regression Tree for Wage Data

We will use the Wage dataset from the ISLR package. Please see the related book James et al. (2013) for more details.

# library(ISLR) #loads the Wage data (pre-loaded)
data(Wage); head(Wage, n=3L);
str(Wage)

There are 3000 observations and 11 variables. Most variables are factors. We observe that the data is given per year. The salary (wage) is thousands of dollars.

Does the wage depend on the person’s age?

a = ggplot(Wage, aes(age, wage))
a+geom_point(colour="gray")+geom_smooth(method="lm")+geom_smooth()
# plot(Wage$age,Wage$wage) # built-in plot (avoids ggplot)

Visually, it appears that there is a small correlation. We not two distinct groups one with continuously high salary and one that increases initially before dropping. The group distinction might be well modeled with decision trees. The nonlinearity in the second group may require a new polynomial feature to improve the model quality.

What is the correlation coefficient?

cor(Wage$age,Wage$wage)

The correlation value indicates that there is only a very week correlation. That means, to infer the salary from the age variable is risky.

We will now derive a few models and compare their test-accuracy.

Split the data into 70% training samples and 30% test samples. We will also remove the logwage (column 10) since this is the logarithmic wage (see also cor(Wage$wage, Wage$logwage)). Hence, defeating the purpose of predicting the wage.

n = nrow(Wage); ns = floor(.7*n);
set.seed(1) # random generator seed
I = sample(1:n, ns)
train = Wage[I,-10]
test = Wage[-I,-10]
cat("train: ",ns, " observations")

Note, once the best model is derived the entire data-set can be used as training data.

Linear Regression

Create a linear regression and “predict” the salary (wage) at the age of 20, 25 and 30.

pm <- lm(wage ~ age, train)
predict(pm, data.frame(age=c(20,25,30)))

What is the test root-mean-squared-error (rmse)? \[\sqrt{\frac{1}{n}\sum_{k=1}^n \left(y_k -\hat{y_k}\right)^2} \]

rmse <- function(y,yh) {sqrt(mean((y-yh)^2))}

pm <- lm(wage ~ age, train)
yh = predict(pm, newdata=test)
y  = test$wage
rmse(y,yh)

The mean-absolute-percentage-error (mape, MAPE) is defined as: \[\frac{1}{n}\sum_{k=1}^n \left|\frac{y_k - \hat{y_k}}{y_k}\right| \]

What is the test-MAPE ?

mape     <- function(y,yh) {mean(abs((y-yh)/y))} 

pm <- lm(wage ~ age, train)
yh = predict(pm, newdata=test)
y  = test$wage
mape(y,yh)

Note: What happens, when an element is zero (\(y_i=0\))? The formula breaks. That means it is assumed that all elements are non-zero. Sometimes the MAPE is defined as: \[\frac{1}{n}\sum_{k=1}^n \frac{|y_k - \hat{y_k}|}{y_k} \] This means, negative values would reduce the error; but it works fine if all elements are positive.

That means, on average we expect a model error of 29.2%.

Non-linearity

Previously, we saw that the salary seems to have some non-linearity initially salary increases with age and then it drops again.

What is the correlation between age squared and wage?

cor(Wage$age^2,Wage$wage)

This is even lower than before. This does not support the visual-observation very well.

Create a new model that integrates the squared age in the model. What is the MAPE?

pm <- lm(wage ~ age + I(age^2), train )
mape ( test$wage, predict(pm, newdata=test) )

We see by adding this feature the accuracy has improved.

Note: I() (i.e. as.is) needs to be used, because ^ is used to describe interactions in formulas . Alternatively, you could create numeric data cbind(age, age^2) or poly(age,2,raw=TRUE).

Adding more terms will improve the accuracy. Try to add age^3 and age^4 or try to add a polynomial of degree 20. The theory of Taylor expansions is behind these improvements.

Regression tree

Now, let us use the regression tree (decision tree) to gain more insights.

We begin with the simple formula wage ~ age.

dt <- rpart( wage ~ age, train)
yh = predict(dt, newdata = test)
y  = test$wage
mape(y, yh); # rmse(y, yh)
fancyRpartPlot(dt)

The test-MAPE of the DT is slightly worse than the LR.

Let us use now all variables.

dt <- rpart( wage ~ ., train)
mape(train$wage, predict(dt)); # training error
mape(test$wage, predict(dt, newdata=test)); # test error 

The low training-MAPE is deceiving, making us believe that we can predict very well the expected salary given all the previous factors. However, the high test-error means that overfitting took place. That means a less complex tree would be better. We can obtain such a tree by pruning.

Pruning

Pruning a decision tree is used to reduce complexity (bias). printcp displays the complexity parameter (CP) table and plotcp gives a visual display of it.

printcp(dt)
plotcp(dt)

Let us use the displayed CPs and relative error to prune the tree. More specifically the CP between 0.016 and 0.026.

dtp <- prune(dt, cp=0.021)
mape(test$wage, predict(dtp, newdata=test)); # test error 
plot(dtp,margin=0.2); text(dtp, pretty=0, use.n=TRUE, all=TRUE, cex=.8)

This is also know as post-pruning.

Additionally, we could have done pre-pruning. That means we provide a few control parameters to rpart such as minbucket (minimum number of observations in leaf node), minsplit (minimum number of observations for a split), etc.

dt.pp <- rpart( wage ~ ., train, control = rpart.control(minbucket=100))
mape(train$wage, predict(dt.pp)); # training error
mape(test$wage, predict(dt.pp, newdata=test)); # test error 
fancyRpartPlot(dt.pp)

Classifications

We will use the Titanic dataset. Please see the Statistical Learning tutorial for a brief introduction and the logistic regression for initial classifications. Here, we will show how to use the decision tree and random forest methods.

Warm-up and Kaggle

What is the proportion of survivors?

T = table(titanic_train$Survived)
prop.table(T)

Answer: 38.4% survived.

Create a file that you can submit at Kaggle’s competition. Predict that there are no survivors.

# initialise such that there are no survivors
ns = rep(0,nrow(titanic_test)) #
# create submission file for Kaggle competition
submit <- data.frame(PassengerId = titanic_test$PassengerId, Survived = ns)
# write.csv(submit,file="no-survivors.csv", row.names = FALSE)
head(submit, n=3L)

The file can be submitted using this link on Kaggle.

Decision trees

Plot a simple decision tree using Survived ~ Sex.

# library(rpart)
dtp <- rpart(Survived ~ Sex, data=titanic_train, method="class" )
plot(dtp,margin=0.2); text(dtp, pretty=0, use.n=TRUE, all=TRUE, cex=.8)

This is pretty ugly. We want to have a better decision tree visualisation and will use another package.

# library(pacman); p_load(rattle, rpart.plot, RColorBrewer)
dt <- rpart(Survived ~ Sex, data=titanic_train, method="class" )
fancyRpartPlot(dt)

This is much better. What is the meaning? The green colours mean passengers have died and blue means they survived. The root node (1) states that 62% of the passengers died. Hence, 38% survived. This agrees with the table we created earlier. Now, there is the questions Sex = male (is the passenger male)? If the answer is yes then we follow the left branch, otherwise the right one. Going to the left we end up at the green node 2. That means, males have a 81% chance of dying; and they constitute 65% of the data. So, a tree that is very easy to interpret.

What is the training-accuracy of this tree?

yh = predict(dt,type="class")
y  = titanic_train$Survived
mean(y == yh)

Answer: 78.7%

Let us now get more factors into the tree. First we will convert several fields to factors. Furthermore, we will set the age na fields to its average. The Name column will be dropped.

train = titanic_train; test = titanic_test;
cols <- c("Pclass", "Sex", "Ticket", "Cabin", "Embarked")
train$Survived = as.factor(train$Survived)
train[cols] <- lapply(train[cols], factor)
test[cols]  <- lapply(test[cols], factor)
mv = mean(c(train$Age,test$Age), na.rm =TRUE)
train$Age[is.na(train$Age)] <- mv
test$Age [is.na(test$Age) ] <- mv

# remove columns , e.g. train$Name = NULL
train = subset( train, select = -c(PassengerId, Name, Ticket, Cabin))
test  = subset( test, select = -c(PassengerId, Name, Ticket, Cabin))
cat("train: ",nrow(train)," rows"); head(train)
cat("test: " ,nrow(test)," rows"); head(test)

Please note that this test set is for Kaggle submissions. That means it does not include the Survived field.

Is the new train-accuracy better using all factors?

dtc <- rpart(Survived ~ ., data=train, method="class" )
yh = predict(dtc,type="class"); y  = train$Survived
mean(y == yh)

The accuracy has improved.

Let’s visualise the new model.

fancyRpartPlot(dtc)

Unfortunately, over-fitting can lead to worse test-accuracy.

yh <- predict(dtc, test, type="class")
head(yh, n=3L)

Pruning

Pruning a decision tree is used to reduce complexity (bias). printcp displays the complexity parameter (CP) table and plotcp gives a visual display of it.

printcp(dtc)
plotcp(dtc)

Let us use the displayed CPs and relative error to prune the tree. More specifically the CP between 0.027.

dtp <- prune(dtc, cp=0.027)
#plot(dtp,margin=0.2); text(dtp, pretty=0, use.n=TRUE, all=TRUE, cex=.8)
fancyRpartPlot(dtp)

This is also know as post-pruning.

Try to submit the new model to Kaggle and see whether you improved your ranking.

Please have a look at Trevor Stephens - Titanic for several more interesting ideas.

Random Forest

Let us now use the random forest method. This method creates several decision trees by varying the number of features and amount of data used. These trees then have a majority vote of which group to choose for a given observation.

The random forest function cannot deal with factors that have more than 53 categories. So we had to remove Ticket and Cabine from the input data.

#library(randomForest)
set.seed(7)
rf <- randomForest(Survived ~ ., data = train)
summary(rf)

What is the training-accuracy of this model?

yh = predict(rf) 
mean( yh == train$Survived)

Classifications - Iris

We will use the Iris dataset and divide it into test and train data.

n = nrow(iris)
ns = floor(.7*n)
set.seed(1); I = sample(1:n, ns)
train = iris[I,]; test = iris[-I,]

Decision Tree

Create a classification tree and determine it’s test-accuracy.

#library(rpart); library(pacman); p_load(rattle, rpart.plot, RColorBrewer)
dt <- rpart(Species ~ ., data=train, method="class" )
# fancyRpartPlot(dt) # not much use too complex
yh <- predict(dt, type="class", newdata = test)
mean(yh == test$Species)

Random Forest

Create a random forest and determine it’s test-accuracy.

# library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data=train, importance=TRUE)
yh <- predict(rf,type="class", newdata = test)
mean(yh == test$Species)

kNN

Create a kNN classifier and determine it’s test-accuracy.

# library (class)
knn.cf=knn (train[,1:4], test[,1:4], train$Species, k=3, prob=TRUE)
table(knn.cf , test$Species)
mean(knn.cf == test$Species)

Clustering

There are a few clustering algorithms. Often used is the \(k\)-means algorithm. Hierarchical clustering comes in two main flavours: bottom-up (agglomerative clustering) and top-down (divisive clustering).

k-means

The \(k\)-means algorithm is a popular heuristic, which finds \(k\) clusters by minimising the distance between observation and cluster centre.

Example - Iris

Irises are purple flowers. We will consider the following species: Setosa, Virginica and Versicolor. Iris-species

We will use the built in dataset.

data(iris); head(iris); summary(iris); str(iris)

As we can see the species is already within the dataset, which defeats the purpose of clustering. However, we pretend that we want to define three different species automatically. We will use the pre-defined species and compare them to the ones suggested by the clustering algorithm.

Let’s visualise the cluster using the petal’s length and width.

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +   geom_point()+ theme(legend.position = c(.18,.73))
  unique(iris$Species)

The cluster on the left seems to be straight forward to identify. The top ones are slightly more difficult to distinguish.

We will focus only on the petal length and width, and ignore the sepal (green cover for petals) variables.

Use the kmeans function and form three clusters. Start the algorithms-core 30 times, i.e. the initial centres of the clusters are placed thirty times randomly. How well do they fit the given species classification?

set.seed(7)
cm <- kmeans(iris[, 3:4], 3, nstart = 30)
cm
table(cm$cluster, iris$Species)

We see that the setosa cluster is perfectly matched. 96% of the versicolor species are identified and 92% of the virginica species.

Plot the new 3-means clusters.

cm$cluster <- as.factor(cm$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cm$cluster)) + geom_point()+ theme(legend.position = "none")

Hierarchical Clustering

We will consider an agglomerative (bottom-up) approach.

Simple Call

Hierarchically cluster the iris petal data using complete linkage. First scale the data by mean and standard-deviation. Then determine all distances between all observations.

What distance measure was used by default?

# p_load(ggplot2, latex2exp) # for plots
# p_load(ggdendro, dendextend) # for dendograms
md = "complete"
hc = iris[, 3:4] %>% scale %>% dist %>% 
  hclust(method = md)
hc

Determine the top three clusters and compare them to the actual species.

I = cutree (hc , 3)
table(I, iris$Species)

We see that the setosa and virginica species were modelled perfectly. However, only 50% of the versicolor irises were identified correctly. The problem is that we used the complete linkage, which was inappropriate in this case.

Visualise the modelled clusters.

X = iris[, 3:4];
qplot(X[,1],X[,2],geom = "point",
        color=as.factor(I))+
    xlab(TeX("x_1"))+ylab(TeX("x_2"))+
    theme(legend.position = "none")

Dendogram

Let’s just use only a small sample (30 observations) to gain more insights about dendograms and their linkages. We begin by defining a function that prepares ourdendogram for plotting.

ppdend <- function(D, method="average", nb_clusters=3){
  return(D %>% scale %>% dist %>% 
    hclust(method = method) %>% as.dendrogram %>%
    set("branches_k_color", k=nb_clusters) %>% set("branches_lwd", .6) %>%
    set("labels_colors") %>% set("labels_cex", c(.6,.6)) %>% 
    set("leaves_pch", 19) %>% set("leaves_col", c("blue", "purple")))
}

set.seed(1); I = sample(1:150, 30);
X = iris[I, 3:4]; 

par(mfrow =c(2,3))
plot( ppdend(X,"single"),main="single")
plot( ppdend(X,"complete"),main="complete")
plot( ppdend(X,"average"),main="average")
plot( ppdend(X,"centroid"),main="centroid")
plot( ppdend(X,"ward.D"),main="ward.D")
plot( ppdend(X,"ward.D2"),main="ward.D2")

Let us have a look at a few more dendogram visualisations.

dend <- ppdend(X)
plot( dend ) # as before simple plot
ggplot(as.ggdend(dend)) # as ggplot
ggplot(as.ggdend(dend), labels = FALSE) + 
  scale_y_reverse(expand = c(0.2, 0)) +
  coord_polar(theta="x") # as ploar coordinate plot

Resources

Acknowledgment

This tutorial was created using RStudio, R, rmarkdown, and many other tools and libraries. The packages learnr and gradethis were particularly useful. I’m very grateful to Prof. Andy Field for sharing his disovr package, which allowed me to improve the style of this tutorial and get more familiar with learnr. Allison Horst wrote a very instructive blog “Teach R with learnr: a powerful tool for remote teaching”, which encouraged me to continue with learnr. By the way, I find her statistic illustrations amazing.

References

Brauer, Claudia. 2020. A-very-short-introduction-to-R.” GitHub. https://github.com/ClaudiaBrauer/A-very-short-introduction-to-R/blob/master/documents/A%20(very)%20short%20introduction%20to%20R.pdf.
Data Analytics Tutorial for Beginners - From Beginner to Pro in 10 Mins! - DataFlair.” 2019. DataFlair. https://data-flair.training/blogs/data-analytics-tutorial.
Field, Andy P. 2021. Discovering Statistics Using R and RStudio. Second. London: Sage.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge University Press.