Titanic Dataset

Let's work with the Titanic dataset.

1. Construct a table showing the distribution of passengers by class and survival.

library(magrittr)  # provides the %$% exposition pipe
titanic %$% table(survived, pclass)
        pclass
survived   1   2   3
       0 123 158 528
       1 200 119 181
# %$% (from magrittr) exposes the columns of titanic, so the call above is equivalent to table(titanic$survived, titanic$pclass)

2. Construct a logistic regression model that links survival to the passenger class. Write out the equation first without running it in R. HINT: Class is a factor variable.

\[ \log(odds(y)) = \beta_0 + \beta_1 \cdot pclass2 + \beta_2 \cdot pclass3 \]

\[ odds(y) = e^{\beta_0} \cdot e^{\beta_1 \cdot pclass2} \cdot e^{\beta_2 \cdot pclass3} \]

β0 = intercept

y = survival
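
To make the meaning of each coefficient explicit, it can help to plug in the indicator variables. This assumes R's default dummy (treatment) coding for a factor, with first class as the reference level, so that pclass2 and pclass3 are 0/1 indicators:

\[ \text{first class } (pclass2 = 0,\ pclass3 = 0): \quad \log(odds(y)) = \beta_0 \]

\[ \text{second class } (pclass2 = 1,\ pclass3 = 0): \quad \log(odds(y)) = \beta_0 + \beta_1 \]

\[ \text{third class } (pclass2 = 0,\ pclass3 = 1): \quad \log(odds(y)) = \beta_0 + \beta_2 \]

Under that coding, β1 and β2 are differences in log odds relative to first class, and \( e^{\beta_1} \) and \( e^{\beta_2} \) are the corresponding odds ratios.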

3. Using hand calculations, determine the coefficients in the model and interpret them (HINT: all you need to do is use the table, calculate the odds for the default category and the odds ratios for the other categories versus the default).

Not surviving is the default outcome (survived = 0). The probability of survival by class is:

first class => 200/(200+123) = 62% of people survived

second class => 119/(119+158) = 43% of people survived

third class => 181/(181+528) = 26% of people survived

Probabilities are bounded between zero and one. For logistic regression we work with the odds of survival instead, which range from zero to infinity:

first class => 200/123 = 1.63 => for each first class passenger who died, 1.63 first class passengers survived.

second class => 119/158 = 0.75 => for each second class passenger who died, 0.75 second class passengers survived.

third class => 181/528 = 0.34 => for each third class passenger who died, 0.34 third class passengers survived.

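Taking the log of the odds gives a quantity that ranges from minus infinity to plus infinity. The log odds of survival are:

first class: log(200/123) = 0.486

second class: log(119/158) = -0.28

third class: log(181/528) = -1.07

With first class as the default category, the intercept is the first class log odds, β0 = 0.486. The other coefficients are differences in log odds from first class: β1 = -0.28 - 0.486 = -0.77 and β2 = -1.07 - 0.486 = -1.56. Equivalently, they are logs of the odds ratios: log(0.75/1.63) and log(0.34/1.63).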
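
The hand calculation can be sketched in R using only the counts from the table in part 1. The variable names below are chosen for illustration, and first class is assumed to be the reference category, which is R's default for this factor:

```r
# Counts copied from the survived-by-pclass table in part 1
died     <- c(first = 123, second = 158, third = 528)
survived <- c(first = 200, second = 119, third = 181)

odds     <- survived / died   # odds of survival within each class
log_odds <- log(odds)         # log odds of survival within each class

# Dummy coding with first class as the reference category:
b0 <- log_odds[["first"]]                         # intercept
b1 <- log_odds[["second"]] - log_odds[["first"]]  # second vs first class
b2 <- log_odds[["third"]]  - log_odds[["first"]]  # third vs first class

round(c(b0 = b0, b1 = b1, b2 = b2), 4)
```

The three values can then be compared with the `Estimate` column that `glm` reports in part 4.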

4. Now run the model in R. Confirm that you got the same results as in part 3. Interpret the results and talk about significance (both statistical and substantive).

titanic %$% summary(glm(survived ~ factor(pclass), family = "binomial"))

Call:
glm(formula = survived ~ factor(pclass), family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3896  -0.7678  -0.7678   0.9791   1.6525  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)       0.4861     0.1146   4.242 2.21e-05 ***
factor(pclass)2  -0.7696     0.1669  -4.611 4.02e-06 ***
factor(pclass)3  -1.5567     0.1433 -10.860  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1741.0  on 1308  degrees of freedom
Residual deviance: 1613.3  on 1306  degrees of freedom
AIC: 1619.3

Number of Fisher Scoring iterations: 4
question2 <- titanic %$% glm(survived ~ factor(pclass), family = "binomial")
exp(question2$coefficients)
    (Intercept) factor(pclass)2 factor(pclass)3 
      1.6260163       0.4631962       0.2108239 

The fitted model gives the following equation for the log odds of survival:

log(odds(survived)) = 0.486 - 0.770*pclass2 - 1.557*pclass3

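All three p-values are significant. For the intercept, this means the log odds of survival in first class differ from zero, i.e. the odds differ from 1. For β1 and β2, it means the log odds of survival in second and third class differ from those in first class, so survival is related to class. Because the coefficients are differences in log odds, their magnitude is hard to read directly. Exponentiating them gives odds ratios: the odds of survival in second class were about 0.46 times those in first class, and in third class about 0.21 times. These are substantively large differences.

Because both coefficients are negative, classes 2 and 3 had lower chances of survival than first class. Third class had the lowest chances, since its coefficient (β2) is the most negative.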
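
As a cross-check that does not need the passenger-level data, the same coefficients can be recovered by fitting the model to the six counts in the table. This is only a sketch. The data frame `counts` and its column names are made up for illustration, and the grouped `cbind(successes, failures)` response is an alternative way of specifying a binomial `glm`, not the call used in the output above:

```r
# One row per class, with counts taken from the table in part 1
counts <- data.frame(
  pclass   = factor(1:3),
  survived = c(200, 119, 181),
  died     = c(123, 158, 528)
)

# Grouped binomial fit: the response is a two-column matrix (successes, failures)
fit <- glm(cbind(survived, died) ~ pclass, family = binomial, data = counts)

coef(fit)       # log odds scale: intercept and differences from first class
exp(coef(fit))  # odds scale: first class odds, then odds ratios vs first class
```

The coefficient estimates should agree with the passenger-level fit. The deviance and AIC will not, because they are computed on grouped rather than individual observations.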

5. What's the probability of survival for each class of passengers?

First class had odds of survival of 1.626. The probability of survival is 1.626/(1 + 1.626) = 0.619, or about 61.9%.

Second class had odds of survival of 0.753. The probability of survival is 0.753/(1 + 0.753) = 0.430, or about 43.0%.

Third class had odds of survival of 0.343. The probability of survival is 0.343/(1 + 0.343) = 0.255, or about 25.5%. These agree with the proportions computed directly from the table in part 3.
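
The same conversion can be done from the fitted coefficients in one step. A minimal sketch, with the coefficient values typed in from the part 4 output and `plogis()` (base R's inverse logit) doing the odds/(1 + odds) step:

```r
# Log odds per class, built from the part 4 estimates
log_odds <- c(
  first  = 0.4861,
  second = 0.4861 - 0.7696,
  third  = 0.4861 - 1.5567
)

# plogis(x) = exp(x) / (1 + exp(x)), i.e. odds / (1 + odds)
prob <- plogis(log_odds)
round(prob, 3)
```

The results can be compared with the raw proportions 200/323, 119/277 and 181/709 from the table.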

6. Construct a model that interacts class of passenger and his/her gender. Interpret the results the same way you did before.

We build the model as before, but this time we add gender (the ‘sex’ variable) as a second factor. The two predictors can be combined in the formula with either a ‘+’ or a ‘*’ sign. The fitted model lets us predict survival for a passenger of a given class and gender.

Because we want to allow for an interaction between class and gender, we construct the model with a ‘*’ sign. With ‘*’ the model includes interaction terms, meaning that the effect of class depends on gender and, equivalently, the effect of gender depends on class.

The alternative is the ‘+’ sign. With ‘+’ we assume that there are no interactions, meaning that the effect of class does not depend on gender. This was our assumption when we were doing linear models.
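
The difference between the two formula operators can be seen without fitting anything, by looking at the columns R would build for each formula. A small sketch on a made-up grid of the six class and gender combinations (`grid` is illustrative, not the titanic data):

```r
# All combinations of class and gender, as factors
grid <- expand.grid(
  pclass = factor(1:3),
  sex    = factor(c("female", "male"))
)

# '+' : main effects only
additive_cols <- colnames(model.matrix(~ pclass + sex, data = grid))

# '*' : main effects plus the pclass:sex interaction columns
interaction_cols <- colnames(model.matrix(~ pclass * sex, data = grid))

additive_cols
interaction_cols
setdiff(interaction_cols, additive_cols)  # the extra interaction terms
```

The two extra columns are the terms that let the male effect differ by class.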

summary(glm(survived~factor(pclass) * factor(sex), family = "binomial", data = titanic))

Call:
glm(formula = survived ~ factor(pclass) * factor(sex), family = "binomial", 
    data = titanic)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5924  -0.5745  -0.5745   0.4902   1.9610  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       3.3250     0.4549   7.309 2.68e-13 ***
factor(pclass)2                  -1.2666     0.5485  -2.309   0.0209 *  
factor(pclass)3                  -3.3621     0.4748  -7.081 1.43e-12 ***
factor(sex)male                  -3.9848     0.4815  -8.277  < 2e-16 ***
factor(pclass)2:factor(sex)male   0.1617     0.6104   0.265   0.7911    
factor(pclass)3:factor(sex)male   2.3039     0.5158   4.467 7.95e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1741  on 1308  degrees of freedom
Residual deviance: 1210  on 1303  degrees of freedom
AIC: 1222

Number of Fisher Scoring iterations: 5

Predicted survival for a male passenger in first class (intercept plus the male coefficient):

log odds of survival => 3.325 - 3.9848 = -0.6598

odds of survival => exp(-0.6598) = 0.5169547

probability of survival => 0.5169547/(1+0.5169547) = 0.3407845 or 34%

Predicted survival for a male passenger in second class (intercept, pclass2, male, and the pclass2:male interaction):

log odds of survival => 3.325 - 1.2666 - 3.9848 + 0.1617 = -1.7647

odds of survival => exp(-1.7647) = 0.1712382

probability of survival => 0.1712382/(1+0.1712382) = 0.1462027 or 14.6%

Predicted survival for a male passenger in third class (intercept, pclass3, male, and the pclass3:male interaction):

log odds of survival => 3.325 - 3.3621 - 3.9848 + 2.3039 = -1.718

odds of survival => exp(-1.718) = 0.1794246

probability of survival => 0.1794246/(1+0.1794246) = 0.1521289 or 15.2%

Predicted survival for a female passenger in first class (the reference group, so only the intercept applies):

log odds of survival => 3.325

odds of survival => exp(3.325) = 27.799

probability of survival => 27.799/(1+27.799) = 0.9652766 or 96.5%

Predicted survival for a female passenger in second class (intercept plus pclass2):

log odds of survival => 3.325 - 1.2666 = 2.0584

odds of survival => exp(2.0584) = 7.833426

probability of survival => 7.833426/(1+7.833426) = 0.8867936 or 88.7%

Predicted survival for a female passenger in third class (intercept plus pclass3):

log odds of survival => 3.325 - 3.3621 = -0.0371

odds of survival => exp(-0.0371) = 0.9635798

probability of survival => 0.9635798/(1+0.9635798) = 0.4907261 or 49%
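
The six calculations above all follow the same pattern, so they can be sketched as a single table. The coefficient values are typed in from the `glm` output, and the names in `b` and the `grid` data frame are labels chosen here for illustration:

```r
# Estimates from the interaction model output
b <- c(intercept = 3.3250, class2 = -1.2666, class3 = -3.3621,
       male = -3.9848, class2_male = 0.1617, class3_male = 2.3039)

# One row per class and gender combination
grid <- expand.grid(pclass = 1:3, sex = c("female", "male"),
                    stringsAsFactors = FALSE)
is_male <- grid$sex == "male"

# Add up the coefficients that apply to each row (logicals act as 0/1)
grid$log_odds <- b[["intercept"]] +
  b[["class2"]]      * (grid$pclass == 2) +
  b[["class3"]]      * (grid$pclass == 3) +
  b[["male"]]        * is_male +
  b[["class2_male"]] * (grid$pclass == 2 & is_male) +
  b[["class3_male"]] * (grid$pclass == 3 & is_male)

grid$odds <- exp(grid$log_odds)     # log odds -> odds
grid$prob <- plogis(grid$log_odds)  # log odds -> probability

grid
```

With a fitted model object at hand, `predict(model, newdata, type = "response")` would give the same probabilities without typing in the coefficients.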

Interpretation of results

In general, the higher the class, the better the chance of survival. One exception is among men: those in third class had a slightly higher predicted probability of survival than those in second class (15.2% vs 14.6%), although a difference this small may be due to random noise. Women had higher chances of survival than men in every class. Even women in third class had a survival probability 15 percentage points higher than men in first class (49% vs 34%). Around 9 out of 10 women in first and second class survived, and about one in two women in third class. Only about one in three men in first class survived, and men in second and third class had the lowest chances of survival.
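
The interaction terms themselves can be read as changes in the male-versus-female odds ratio across classes. A short sketch using the estimates from the output above (the vector names are just labels):

```r
# Male vs female difference in log odds, within each class
male_effect <- c(
  first  = -3.9848,           # male coefficient alone
  second = -3.9848 + 0.1617,  # plus the pclass2:male interaction
  third  = -3.9848 + 2.3039   # plus the pclass3:male interaction
)

# Odds ratio of survival, male vs female, within each class
or_male <- exp(male_effect)
round(or_male, 3)

# How much larger the third class odds ratio is than the first class one
or_male[["third"]] / or_male[["first"]]
```

The odds ratio is far below 1 in every class, but it is roughly ten times larger in third class than in first class. This is what the significant pclass3:male term reflects: the gender gap in survival odds was much narrower in third class.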

How we arrived at those results:

After running the logistic regression model we obtain the intercept and coefficients, which are on the log odds scale, so they have to be converted to odds. Before converting, we sum the coefficients that apply to the passenger of interest to get that passenger's log odds of survival (e.g. for a female in second class we add the intercept 3.325 and the pclass2 coefficient -1.2666).

Then we exponentiate the log odds of survival to find the odds of survival (e.g. female in second class = 7.83, which means that for each female who died, 7.83 survived).

Because odds are less intuitive than probabilities, we converted them to probabilities using the formula:

\[ probability = \frac{odds}{1 + odds} \]

We did that for each of the six combinations of class and gender.