Welcome to this Statistical Learning tutorial. Please make sure you have completed the datatypes tutorial and the controls & functions tutorial.
In this tutorial, linear regression will be applied to predict the eruption duration of a geyser, and logistic regression will be used to predict whether a passenger of the RMS Titanic would have survived.
Old Faithful is a geyser in Yellowstone National Park in Wyoming, United States.
Predict the duration of the geyser’s eruption depending on how long you had to wait. Assume you have waited 80 minutes. How long will the eruption last?
How many observations and variables are in the “faithful” data set?
str(faithful)
Answer: 272 observations and 2 variables
Display the first six observations.
head(faithful)
Answer: the first row shows an eruption of 3.600 min after 79 min of waiting, …
What are the shortest eruption duration, the expected (mean) duration, and the third quartile?
summary(faithful)
Answer: minimum 1.6 min, mean 3.488 min, and 3rd quartile 4.454 min.
Display the data as a scatterplot and add a “trend line”.
# library(ggplot2)
g <- ggplot(data = faithful, aes(waiting, eruptions))
g + geom_point() + stat_smooth(method = "lm")
Develop a prediction model using linear regression. What are the coefficients of the model \(\hat{y}= \beta_0+\beta_1 x\)?
formula = eruptions ~ waiting # symbolic description of the model
model = lm(formula, data = faithful) # fit a linear model
c = model$coefficients # coefficients(model) is an alternative
c
Solution: \(\beta_0\) = -1.874 and \(\beta_1\)=0.0756
What is the eruption duration given that we waited \(w=80\) minutes? Hint: Use the coefficients from before.
w = 80 #min waiting time
d = c[1] + c[2]*w
d
Solution: We expect to observe an eruption lasting for 4.18min (given 80min waiting time).
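As a quick sanity check, inserting the waiting time into the fitted model reproduces this value:
\[\hat{y} = \beta_0 + \beta_1 w = -1.874 + 0.0756 \cdot 80 \approx 4.18 \text{ min}\]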
Multiple predictions: predict the eruption times for waiting times of 40, 60, 80, and 100 minutes. Can you observe any systematic change?
p = data.frame(waiting = c(40, 60, 80, 100))
predict(model,p)
Answer: every additional 20 minutes of waiting adds about 1.51 minutes of eruption time (20 times the waiting coefficient \(\beta_1\)).
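The constant step can also be checked directly from these predictions (a small check using model and p from above):
diff(predict(model, p)) # each additional 20 minutes adds the same amount, 20*0.0756, roughly 1.51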
How much of the total variation can be explained by the model?
summary(model)$r.squared
Answer: 81.1% can be explained.
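For reference, \(R^2\) can be reconstructed from the residual and total sums of squares (a sketch using the fitted model from above):
rss = sum(residuals(model)^2) # residual sum of squares
tss = sum((faithful$eruptions - mean(faithful$eruptions))^2) # total sum of squares
1 - rss/tss # matches summary(model)$r.squared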
Determine the expected eruption time using many independent samples of 80 min of waiting. What is our 95% confidence interval of the expected eruption duration?
attach(faithful) # attach the data frame
my.lm = lm(eruptions ~ waiting)
w = data.frame(waiting=80)
predict(my.lm, w, interval = "confidence")
Answer: The 95% CI of the expected eruption time is [4.105, 4.248] minutes.
Which eruption time will we observe with 95% certainty when the waiting time for the eruption was 80 minutes?
predict(my.lm, w, interval = "prediction")
Answer: we will observe an eruption time between 3.2 min and 5.2 min.
Overall, we expect the eruption to last 4.18 minutes (with a 95% CI of [4.105, 4.248]). With 95% certainty, we predict an eruption lasting between 3.2 min and 5.2 min when it occurs after 80 minutes of waiting.
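The prediction interval is wider because it also accounts for the residual variance of individual eruptions. A sketch of both calculations, using my.lm and w from above:
pr = predict(my.lm, w, se.fit = TRUE) # fit and standard error of the expected value
tq = qt(0.975, df = pr$df) # 95% t-quantile
pr$fit + c(-1, 1) * tq * pr$se.fit # confidence interval for the expected duration
pr$fit + c(-1, 1) * tq * sqrt(pr$se.fit^2 + pr$residual.scale^2) # prediction interval for a single eruption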
RMS Titanic
The ship - RMS Titanic - sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean. The disaster led to significant changes in maritime regulations.
Develop a prediction model that checks whether a passenger would have survived based on their profile.
The data is obtained from the titanic package, which is based on the Kaggle data set. Check the content of the titanic package. Does it contain a training and a test data set?
# library(titanic)
ls("package:titanic")
Please inspect the training and test data set. Does the test data set contain survival information?
#?titanic_train
?titanic_test
Answer: no, the survival information is not available, i.e. we cannot determine the test accuracy of the model with this data set. Note: the information was withheld because it is a competition on Kaggle.
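A quick check (sketch) confirms this; only the Survived column should be missing from the test set:
setdiff(names(titanic_train), names(titanic_test)) # expected: "Survived"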
By default there is another data set about the Titanic in the R environment. However, it contains only aggregated information.
There is a Titanic website, which contains even more data about the disaster.
However, we will use the previous train data set, and we will assume that this is the entire data available.
Similar to the linear regression, explore the data using names, head, and summary. 1) Is there a Survived and a Sex field? 2) What is the name of the last passenger shown by head? 3) How many passengers are in the data set? 4) How many missing values are in the Age field?
T = titanic_train # assume from now on this is the entire available (training and test) data
names(T)
head(T)
summary(T)
Answers: 1) yes; 2) James Moran; 3) 891; 4) 177.
The field Age has several “not available” (NA) entries. Your task is to set those to the average age. What is the average age?
# sum(is.na(T$Age))
mv = mean(T$Age, na.rm = TRUE)
T$Age[is.na(T$Age)] = mv
mv
Answer: 29.7 years.
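A short optional check that no missing ages remain after the imputation:
sum(is.na(T$Age)) # expected: 0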
Create a training set D
that contains 80% of the data (round down). How many records has D
?
n = length(T$PassengerId) # number of passengers
ns = floor(0.8*n) # number of records for the training data set
set.seed(1) # fix the random number generator
I = sample(1:n, ns) # create sample indices (no repetitions)
I[1:10] # show the first 10 indices
length(I) # or, a bit more precisely: length(unique(I))
D = T[I,]
Answer: 712.
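The remaining passengers, addressed later via the negative index -I, form the test data (a small check):
length(T$PassengerId[-I]) # 891 - 712 = 179 test records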
Let’s derive two simple prediction models.
How many passengers have survived (died)? What is the corresponding percentage?
# library(dplyr)
aD <- D %>% group_by(Survived) %>% summarise(nb = length(PassengerId))
aD %>% mutate(pnb = nb/ns)
Answer: 278 survived (39.0%) and 434 died (61.0%).
That means that if we assume everyone dies, the model is correct 61% of the time on the training data.
Hence, let model 1 assume that everyone dies (=0). What is the accuracy for the test data?
test = T$Survived[-I]
mean(0 == test)
The above shows that model 1 is 64.2% correct for the test data. This is better than a random guess.
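Equivalently, this accuracy is simply the share of non-survivors in the test split:
1 - mean(test) # proportion of zeros among the test labels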
Let us inspect whether the gender has any impact.
aD2 <- D %>% group_by(Survived, Sex) %>% summarise(nb = length(PassengerId))
# aD2
# library(tidyr) # reshape aD2
aD2 %>% pivot_wider(id_cols = Sex, names_from = Survived, values_from = nb)
The above table states: if you are female, you are more likely to survive. What is the test-accuracy of a model in which all females survive (and all males die)?
pm <- rep(0, length(test)) # no one survives (as model 1)
pm[T$Sex[-I] == "female"] = 1 # all females survive
mean(pm == test)
Answer: The test-accuracy is 81%.
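For a closer look at the gender rule’s errors, the test predictions can be tabulated (using pm and test from above):
table(predicted = pm, actual = test) # confusion table for the gender rule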
Let’s derive a logistic regression model. Previously, we found the existing field names.
names(D)
From the previous models we have learned that Sex is an important feature. Let us use this field to build the logistic regression model f.
f <- glm(Survived ~ Sex, data = D, family = binomial) # gives the same predictions as the gender rule above
#f <- glm(Survived ~ Sex + Age + Pclass+ SibSp + Fare + Embarked, data = D, family = binomial )
#summary(f)
summary(f)$coef
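The coefficients are on the log-odds scale. A small sketch converting them into survival probabilities (with the default alphabetical factor coding, female is the reference level of Sex):
b = coef(f)
plogis(b[1]) # predicted survival probability for a female passenger
plogis(b[1] + b[2]) # predicted survival probability for a male passenger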
Predict the probabilities with f and display the first 10 probabilities.
f_prob = predict(f, type = "response")
f_prob[1:10]
Transform the probabilities to the response variable. What threshold is used? Display the first five elements of \(\hat{y}\).
yh = rep(0, length(f_prob))
yh[f_prob>=0.5] = 1
yh[1:5]
Answer: 0.5
Build a matrix (the confusion matrix) that shows the values predicted correctly and wrongly. How many true positives (survivors identified correctly by the prediction model) are there? How many true negatives (passengers that did not survive, predicted correctly)?
C = table(yh, D$Survived)
C
Answer: The number of true positives is 192, and the number of true negatives is 364.
What is the train-accuracy of this model?
# correctly predicted in training set
C[1,1]+C[2,2] # true negatives + true positives
# cat( 'ns = ', ns, ' = ', length(D$Survived), ' = ', sum(C), '\n')
(C[1,1]+C[2,2]) / ns
# mean(yh == D$Survived) # same as before
Answer: The train-accuracy is 78.1%.
What is the test-accuracy of the model?
# correctly predicted in test set
ft_prob = predict(f, newdata= T[-I,], type = "response") # note newdata
# length(ft_prob)
# ft_prob[1:10]
yht = rep(0, length(ft_prob)) # init
yht[ft_prob>=0.5] = 1 # set modeled response
Ct = table(yht, T[-I, "Survived"])
# Ct
mean(yht == T$Survived[-I])
The test-accuracy is 81.0%, which is the same as for our simple gender model.
Let us now add more features. Why don’t we add the passenger identifier?
fe <- glm(Survived ~ Sex + Age + Pclass+ SibSp + Fare + Embarked, data = D, family = binomial )
fe
Answer: The passenger identifier does not contain any information about survival; it is merely an arbitrary label.
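This can be checked empirically by adding the identifier to a small model (a sketch; f_id is a name chosen here):
f_id = glm(Survived ~ Sex + PassengerId, data = D, family = binomial) # sketch model
summary(f_id)$coef # expect a PassengerId coefficient near zero and far from significant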
What is the test-accuracy of the new model?
fte_prob = predict(fe, newdata= T[-I,], type = "response") # test data probs.
yhte = rep(0, length(fte_prob)) # init
yhte[fte_prob>=0.5] = 1 # set response
mean(yhte == T$Survived[-I])
Answer: The test-accuracy is 77.1%.
It is interesting to note that the additional features did not improve the test-accuracy for this particular sample.
This tutorial was created using RStudio, R, rmarkdown, and many other tools and libraries. The packages learnr and gradethis were particularly useful. I’m very grateful to Prof. Andy Field for sharing his discovr package, which allowed me to improve the style of this tutorial and become more familiar with learnr. Allison Horst wrote a very instructive blog, “Teach R with learnr: a powerful tool for remote teaching”, which encouraged me to continue with learnr. By the way, I find her statistics illustrations amazing.