Data Analytics - Statistical Learning

Introduction

Welcome to this Statistical Learning tutorial. Please make sure you have completed the datatypes tutorial and the controls & functions tutorial first.

In this tutorial linear regression will be applied to predict the eruption duration of a geyser.

Logistic regression will be used to predict whether a passenger will survive.

Linear Regression

Old Faithful

Old Faithful is a geyser in Yellowstone National Park in Wyoming, United States.

Predict the duration of the geyser’s eruption based on how long you had to wait. Assume you have waited 80 min. How long will the eruption last?

Explore the data

How many observations and variables are in the “faithful” data set?

str(faithful) # structure: number of observations, variables and their types

Answer: 272 observations and 2 variables
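
A quick alternative check: dim returns the number of rows (observations) and columns (variables) directly.

dim(faithful) # 272 rows and 2 columns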

Display the first six observations.

head(faithful) # first six observations

Answer: the first observation is an eruption of 3.6 min after a waiting time of 79 min, …

What are the shortest (minimum), expected (mean) and third-quartile eruption durations?

summary(faithful)

Answer: minimum 1.6 min, mean 3.488 min and 3rd quartile 4.454 min.

Visual inspection

Display the data as a scatterplot and add a “trend line”.

# library(ggplot2)
g <- ggplot(data = faithful, aes(x = waiting, y = eruptions))
g + geom_point() + stat_smooth(method = "lm") # scatterplot with linear trend line

Prediction Model

Develop a prediction model using linear regression. What are the coefficients of the model \(\hat{y}= \beta_0+\beta_1 x\)?

formula =  eruptions ~ waiting #symbolic description of model
model   = lm(formula, data=faithful) # fit a linear model

c = model$coefficients # coefficients(model) #an alternative
c

Solution: \(\beta_0 = -1.874\) and \(\beta_1 = 0.0756\), i.e. the fitted model is \(\hat{y} = -1.874 + 0.0756\,x\).

Prediction

What is the eruption duration given that we waited \(w=80\) minutes? Hint: Use the coefficients from before.

w = 80 #min waiting time
d = c[1] + c[2]*w
d

Solution: We expect to observe an eruption lasting 4.18 min (given an 80 min waiting time).

Multiple predictions: Predict the eruption times for waiting times of 40, 60, 80 and 100 minutes. Can you observe any systematic change?

p = data.frame(waiting=c(40, 60,80,100))
predict(model,p)

Answer: every additional 20 minutes of waiting gives us roughly 1.51 extra minutes of eruption (see the waiting coefficient \(\beta_1\)).
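
As a quick check of this statement, multiply the waiting coefficient (c[2] from above) by 20:

20*c[2] # roughly 1.51 additional minutes of eruption per extra 20 minutes of waiting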

Goodness of model

How much of the total variation can be explained by the model?

summary(model)$r.squared

Answer: 81.1% of the total variation can be explained by the model.
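
If you want to see where this value comes from, \(R^2\) can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares (a short sketch using the model object from above; rss and tss are just helper names):

rss = sum(residuals(model)^2) # residual sum of squares
tss = sum((faithful$eruptions - mean(faithful$eruptions))^2) # total sum of squares
1 - rss/tss # matches summary(model)$r.squared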

Confidence Interval

Determine the expected eruption time for a waiting time of 80 minutes (i.e. the average over many independent eruptions with this waiting time). What is the 95% confidence interval of the expected eruption duration?

attach(faithful)     # attach the data frame 
my.lm = lm(eruptions ~ waiting)
w = data.frame(waiting=80)
predict(my.lm, w, interval = "confidence")

Answer: The 95% CI of the expected eruption time is [4.105, 4.248] min.

Prediction Interval

Which eruption time will we observe with 95% certainty when the waiting time for the eruption was 80 minutes?

predict(my.lm, w, interval = "prediction") # prediction interval for a single future observation

Answer: We will observe an eruption time between 3.2 min and 5.2 min.

Overall, we expect the eruption to last 4.18 minutes (with a 95% CI of [4.105, 4.248]). We predict with 95% certainty that the eruption will last between 3.2 min and 5.2 min, given a waiting time of 80 minutes.

Logistic Regression

RMS Titanic

The ship - RMS Titanic - sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean. The disaster led to significant changes in maritime regulations.

Develop a prediction model that determines whether a passenger would have survived based on their profile.

Data source

The data is obtained using the titanic package, which is based on the Kaggle data set. Check the content of the titanic package. Does it contain a training and a test data set?

# library(titanic)
ls("package:titanic")

Please inspect the training and the test data set. Does the test data set contain the survival information?

#?titanic_train
?titanic_test

Answer: no, the survival outcome (Survived) is not included in the test data set, i.e. we cannot determine the test accuracy of the model with it. Note: this information was withheld because the data set is part of a Kaggle competition.
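
You can verify this by checking whether the Survived column is present in each data set (assuming the titanic package is loaded as above):

"Survived" %in% names(titanic_train) # TRUE - the outcome is available for training
"Survived" %in% names(titanic_test)  # FALSE - the outcome is withheld for the competition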

By default there is another data set about the Titanic in the R environment (Titanic in the datasets package). However, it contains only aggregated information.

There is a Titanic website, which contains even more data about the disaster.

However, we will use the titanic_train data set from above and assume that it is the entire data available.

Explore data

Similar to the linear regression example, explore the data using names, head and summary.

  1. Does the data contain a Survived and Sex field?
  2. What is the name in the sixth record?
  3. How many records does the data set contain?
  4. How many elements are not available in the Age field?

T = titanic_train # assume from now on this contains all the available data (training and test)
names(T)
head(T)
summary(T)

Answers: 1) yes; 2) James Moran; 3) 891; 4) 177.

Data cleaning

The field Age has several “not available” (NA) entries. Your task is to set those to the average age. What is the average age?

# sum(is.na(T$Age))
mv = mean(T$Age, na.rm = TRUE)
T$Age[is.na(T$Age)] = mv
mv

Answer: 29.7 years.

Create a training set D that contains 80% of the data (round down). How many records does D have?

n = length(T$PassengerId) # number of passengers
ns = floor(0.8*n) # number of records for the training data set

set.seed(1) # fix the random number generator (for reproducibility)
I = sample(1:n, ns) # create sample indices (no repetitions)

I[1:10] # show the first 10 indices
length(I) # or a bit more precise: length(unique(I))
D = T[I,] # training data set

Answer: 712.

Two models

Let’s derive two simple prediction models.

Model 1 - everyone dies

How many passengers survived and how many died? What are the corresponding percentages?

# library(dplyr)
aD <- D %>% group_by(Survived) %>% summarise(nb = length(PassengerId)) 
aD %>% mutate(pnb = nb/ns)

Answer: 278 survived (39.0%) and 434 died (61.0%).

That means if we assume everyone will die, the model is correct 61% of the time on the training data.
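
The same number can be obtained directly as the training accuracy of model 1 (“everyone dies”):

mean(0 == D$Survived) # proportion of passengers in D who did not survive, i.e. ~61%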

Hence, let model 1 assume that everyone dies (=0). What is the accuracy for the test data?

test = T$Survived[-I]
mean(0 == test) 

The above shows that model 1 is 64.2% correct for the test data. This is better than a random guess.
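
For comparison, a purely random guess would only be right about half of the time. A small illustrative simulation (the seed is arbitrary):

set.seed(2) # arbitrary seed, only for reproducibility
guess = sample(0:1, length(test), replace = TRUE) # random survive/die guesses
mean(guess == test) # roughly 0.5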

Model 2 - females survive

Let us inspect whether the gender has any impact.

aD2 <- D %>% group_by(Survived, Sex) %>% summarise(nb = length(PassengerId)) 
# aD2
# library(tidyr) # reshape aD2
aD2 %>% pivot_wider(id_cols = Sex, names_from = Survived, values_from = nb)

The above table states: if you are female, you are more likely to survive. What is the test accuracy of a model where all females survive?

pm <- rep(0, length(test)) # no one survives (as in model 1)
pm[T$Sex[-I] == "female"] = 1 # all females survive
mean(pm == test) # test accuracy of model 2

Answer: There is an 81% test accuracy.
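
For a later comparison with the logistic regression model, it is also useful to compute the training accuracy of model 2; a short sketch (pm_train is just a helper name):

pm_train = rep(0, ns) # start with "no one survives" (model 1)
pm_train[D$Sex == "female"] = 1 # all females survive
mean(pm_train == D$Survived) # training accuracy of model 2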

Logistic Regression Model

Let’s derive a logistic regression model. Previously, we found the existing field names.

names(D)

From the previous models we have learned that Sex is an important feature. Let us use this field to build the logistic regression model f.

First model: Survived ~ Sex

f <- glm(Survived ~ Sex, data = D, family = binomial) # effectively the same predictions as model 2 above
#f <- glm(Survived ~ Sex + Age + Pclass+ SibSp + Fare + Embarked, data = D, family = binomial )
#summary(f)
summary(f)$coef

Predict the survival probabilities with f and display the first 10.

f_prob = predict(f, type = "response")
f_prob[1:10]
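
As an optional sanity check, these probabilities can be reproduced by hand from the coefficients via the logistic (inverse logit) function; with the default factor coding, female is the reference level of Sex:

b = coef(f) # intercept and Sexmale coefficient
plogis(b[1]) # predicted survival probability for a female passenger
plogis(b[1] + b[2]) # predicted survival probability for a male passenger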

Transform the probabilities to the response variable. What threshold is used? Display the first five elements of \(\hat{y}\).

yh = rep(0, length(f_prob))
yh[f_prob>=0.5] = 1
yh[1:5]

Answer: 0.5

Confusion matrix

Build a matrix that shows the values predicted correctly and wrongly. What are the true positives (survivors identified correctly by the prediction model)? What are the true negatives (passengers that did not survive predicted correctly)?

C = table(yh, D$Survived)
C

Answer: The number of true positives is 192, and the number of true negatives is 364.
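
These numbers can be read off the matrix directly, since the rows correspond to the predictions \(\hat{y}\) and the columns to the actual Survived values:

C[2,2] # true positives: predicted 1 (survived) and actually 1
C[1,1] # true negatives: predicted 0 (died) and actually 0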

What is the train-accuracy of this model?

# correctly predicted in training set
C[1,1]+C[2,2] # true negatives + true positives
# cat( 'ns = ', ns, ' = ', length(D$Survived), ' = ', sum(C), '\n')
(C[1,1]+C[2,2]) / ns
# mean(yh == D$Survived) # same as before 

Answer: The train-accuracy is 78.1%.

What is the test-accuracy of the model?

# correctly predicted in test set
ft_prob = predict(f, newdata= T[-I,], type = "response") # note newdata
# length(ft_prob)
# ft_prob[1:10]

yht = rep(0, length(ft_prob)) # init
yht[ft_prob>=0.5] = 1 # set modeled response

Ct = table(yht, T[-I, "Survived"])
# Ct
mean(yht == T$Survived[-I])

The test accuracy is 81.0%, which is the same as for our simple model 2 (all females survive).

Model with several features

Let us now add more features. Why don’t we add the passenger identifier?

fe <- glm(Survived ~ Sex + Age + Pclass+ SibSp + Fare + Embarked, data = D, family = binomial )
fe

Answer: The passenger identifier is an arbitrary index; it does not contain any information about survival.
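
If you want to convince yourself of this, a short sketch (fid is just an illustrative name): fit a model that also includes PassengerId and inspect its coefficient, which should be close to zero with a large p-value.

fid <- glm(Survived ~ Sex + PassengerId, data = D, family = binomial) # illustrative model
summary(fid)$coef # the PassengerId row should show no significant effect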

What is the test-accuracy of the new model?

fte_prob = predict(fe, newdata= T[-I,], type = "response") # test data probs.

yhte = rep(0, length(fte_prob)) # init
yhte[fte_prob>=0.5] = 1 # set response

mean(yhte == T$Survived[-I])

Answer: The test accuracy is 77.1%.

It is interesting to note that the additional features did not improve the test accuracy for this specific sample.

Resources

Acknowledgment

This tutorial was created using RStudio, R, rmarkdown, and many other tools and libraries. The packages learnr and gradethis were particularly useful. I’m very grateful to Prof. Andy Field for sharing his discovr package, which allowed me to improve the style of this tutorial and become more familiar with learnr. Allison Horst wrote a very instructive blog, “Teach R with learnr: a powerful tool for remote teaching”, which encouraged me to continue with learnr. By the way, I find her statistics illustrations amazing.

References

“Data Analytics Tutorial for Beginners - From Beginner to Pro in 10 Mins! - DataFlair.” 2019. DataFlair. https://data-flair.training/blogs/data-analytics-tutorial.

Field, Andy P. 2021. Discovering Statistics Using R and RStudio. Second. London: Sage.

Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge University Press.

Wolfgang Garn