titanic %$% table(survived, pclass)
        pclass
survived   1   2   3
       0 123 158 528
       1 200 119 181
# magrittr's %$% exposes the columns of titanic, so this is equivalent to table(titanic$survived, titanic$pclass)
Let's work with the titanic dataset.
1. Construct a table showing the distribution of passengers by class and survival.
        pclass
survived   1   2   3
       0 123 158 528
       1 200 119 181
2. Construct a logistic regression model that links survival to the passenger class. Write out the equation first without running it in R. HINT: Class is a factor variable
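\[ \log(\text{odds}(y)) = \beta_0 + \beta_1 \cdot \text{pclass2} + \beta_2 \cdot \text{pclass3} \]
\[ \text{odds}(y) = e^{\beta_0} \cdot e^{\beta_1 \cdot \text{pclass2}} \cdot e^{\beta_2 \cdot \text{pclass3}} \]
β0 = intercept (log odds of survival for first class, the reference category)
pclass2, pclass3 = indicator variables equal to 1 for second- and third-class passengers, 0 otherwise
y = survival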
3. Using hand-calculations, determine the coefficients in the model and interpret them (HINT: all you need to do is to use the table, calculate odds for the default category and the odds-ratios for the other categories versus the default)
Not surviving is the baseline outcome (survived = 0). Probability of survival for:
first class => 200/(200+123) = 62% of people survived
second class => 119/(119+158) = 43% of people survived
third class => 181/(181+528) = 26% of people survived
Probabilities range between zero and one. For logistic regression we work with odds instead. Odds of survival for:
first class => 200/123 = 1.63 => for each first class person who died, 1.63 first class people survived.
second class => 119/158 = 0.75 => for each second class person who died, 0.75 second class people survived.
third class => 181/528 = 0.34 => for each third class person who died, 0.34 third class people survived.
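When we take the log of the odds, the result ranges from -infinity to +infinity. Log odds of survival for:
first class people: log(200/123) = 0.486
second class people: log(119/158) = -0.28
third class people: log(181/528) = -1.07
First class is the default category, so the intercept is its log odds: β0 = 0.486. The other coefficients are differences in log odds from first class: β1 = -0.28 - 0.486 = -0.77 and β2 = -1.07 - 0.486 = -1.56.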
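The hand calculation can be checked with a few lines of R. This is a minimal sketch that types in the counts from the table directly rather than reading them from the titanic data frame; the variable names are only illustrative.

```r
# Counts copied from the class-by-survival table above.
died     <- c(first = 123, second = 158, third = 528)
survived <- c(first = 200, second = 119, third = 181)

odds     <- survived / died   # odds of survival per class
log_odds <- log(odds)         # natural log, the scale glm() reports

# First class is the reference category, so the intercept is its log odds
# and the other coefficients are differences from it.
hand_coefs <- c(
  intercept = unname(log_odds["first"]),
  pclass2   = unname(log_odds["second"] - log_odds["first"]),
  pclass3   = unname(log_odds["third"]  - log_odds["first"])
)
print(round(hand_coefs, 4))
```

The three values should agree with the Estimate column that glm() reports for the class-only model.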
4. Now run the model in R. Confirm that you got the same results as in part 3. Interpret the results and talk about significance (both statistical and substantive).
Call:
glm(formula = survived ~ factor(pclass), family = "binomial")
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3896  -0.7678  -0.7678   0.9791   1.6525

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.4861     0.1146   4.242 2.21e-05 ***
factor(pclass)2  -0.7696     0.1669  -4.611 4.02e-06 ***
factor(pclass)3  -1.5567     0.1433 -10.860  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1741.0 on 1308 degrees of freedom
Residual deviance: 1613.3 on 1306 degrees of freedom
AIC: 1619.3
Number of Fisher Scoring iterations: 4
question2 <- titanic %$% glm(survived ~ factor(pclass), family = "binomial")
exp(question2$coefficients)
    (Intercept) factor(pclass)2 factor(pclass)3
      1.6260163       0.4631962       0.2108239
This gives the following equation for the log odds of survival:
\[ \log(\text{odds}(y)) = 0.486 - 0.77 \cdot \text{pclass2} - 1.56 \cdot \text{pclass3} \]
All three p-values are far below 0.05, so the intercept and both class coefficients (β1 and β2) are statistically different from zero. Survival therefore differs between classes. Because the coefficients are differences in log odds, their magnitude is hard to read directly. Exponentiating them gives odds ratios: the odds of survival in second class are about 0.46 times those in first class, and in third class about 0.21 times, which is a substantively large gap.
Because both class coefficients are negative, second- and third-class passengers had lower chances of survival than first-class passengers. Third class had the lowest chance of survival, since its coefficient (β2) is the most negative.
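To confirm that the model reproduces the hand calculation without access to the original data, one option is to rebuild a passenger-level data frame from the table counts and fit the same formula. This is only a sketch: the data frame passengers is a reconstruction from the six cell counts (an assumption), not the titanic data itself, but it contains all the information this model uses.

```r
# Rebuild one row per passenger from the six cells of the table.
counts <- data.frame(
  pclass   = rep(1:3, times = 2),
  survived = rep(c(0, 1), each = 3),
  n        = c(123, 158, 528, 200, 119, 181)
)
passengers <- counts[rep(seq_len(nrow(counts)), counts$n),
                     c("pclass", "survived")]

# Same formula and family as the model in this exercise.
fit <- glm(survived ~ factor(pclass), family = "binomial", data = passengers)

print(round(coef(fit), 4))        # log-odds scale
print(round(exp(coef(fit)), 4))   # odds for first class, then odds ratios
```

The exponentiated intercept is the odds of survival in first class, and the other two exponentiated values are odds ratios relative to first class.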
5. What's the probability of survival for each class of passengers?
First class had odds of survival of 1.626. The probability of survival is 1.626/(1 + 1.626) = 0.619, or about 61.9%.
Second class had odds of survival of 0.753. The probability of survival is 0.753/(1 + 0.753) = 0.430, or about 43.0%.
Third class had odds of survival of 0.343. The probability of survival is 0.343/(1 + 0.343) = 0.255, or about 25.5%.
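The same probabilities can be obtained straight from the model coefficients. The sketch below types in the estimates from the summary output and uses plogis(), the logistic function from base R's stats package, which computes exp(x)/(1 + exp(x)); the names b0, b1 and b2 are only illustrative.

```r
# Coefficients copied from the class-only model summary.
b0 <- 0.4861    # intercept: log odds of survival for first class
b1 <- -0.7696   # second class versus first class
b2 <- -1.5567   # third class versus first class

log_odds <- c(first = b0, second = b0 + b1, third = b0 + b2)
prob     <- plogis(log_odds)   # same as exp(x) / (1 + exp(x))
print(round(prob, 3))
```

These should match the raw survival proportions from the table (200/323, 119/277 and 181/709), because a model with one factor predictor reproduces the observed group proportions.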
6. Construct a model that interacts class of passenger and his/her gender. Interpret the results the same way you did before.
We build the model as before, but this time we add gender (the ‘sex’ variable) as a second factor. Two factors can be combined in a formula either with ‘+’ or with ‘*’. The fitted model lets us predict survival for any combination of class and gender.
Because the question asks for an interaction between class and gender, we use the ‘*’ sign. With ‘*’ the model includes interaction terms, meaning that the effect of class depends on gender and, equivalently, the effect of gender depends on class.
The alternative is the ‘+’ sign. With ‘+’ we assume that there are no interactions, meaning that the effect of class does not depend on gender. This was our assumption when we were fitting linear models.
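The difference between the two formulas can be seen by listing the terms each one creates. The sketch below uses a small made-up grid of class and gender combinations (the data frame grid is hypothetical, used only to inspect the design matrix) and base R's model.matrix().

```r
# A hypothetical grid with every class/sex combination, used only to
# inspect which terms each formula creates.
grid <- expand.grid(pclass = 1:3, sex = c("female", "male"))

additive    <- colnames(model.matrix(~ factor(pclass) + factor(sex), data = grid))
interaction <- colnames(model.matrix(~ factor(pclass) * factor(sex), data = grid))

print(additive)
print(setdiff(interaction, additive))   # the extra class-by-sex terms
```

The ‘*’ formula contains everything the ‘+’ formula does, plus one class-by-sex term for each non-reference class.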
Call:
glm(formula = survived ~ factor(pclass) * factor(sex), family = "binomial",
data = titanic)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5924  -0.5745  -0.5745   0.4902   1.9610

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                       3.3250     0.4549   7.309 2.68e-13 ***
factor(pclass)2                  -1.2666     0.5485  -2.309   0.0209 *
factor(pclass)3                  -3.3621     0.4748  -7.081 1.43e-12 ***
factor(sex)male                  -3.9848     0.4815  -8.277  < 2e-16 ***
factor(pclass)2:factor(sex)male   0.1617     0.6104   0.265   0.7911
factor(pclass)3:factor(sex)male   2.3039     0.5158   4.467 7.95e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1741 on 1308 degrees of freedom
Residual deviance: 1210 on 1303 degrees of freedom
AIC: 1222
Number of Fisher Scoring iterations: 5
Take a male passenger in first class.
log odds of survival => 3.325 - 3.9848 = -0.6598
odds of survival => exp(-0.6598) = 0.5169547
probability of survival => 0.5169547/(1+0.5169547) = 0.3407845 or 34%
Take a male passenger in second class.
log odds of survival => 3.325 -1.2666 - 3.9848+0.1617 = -1.7647
odds of survival => exp(-1.7647) = 0.1712382
probability of survival => 0.1712382/(1+0.1712382) = 0.1462027 or 14.6%
Take a male passenger in third class.
log odds of survival => 3.325 - 3.3621 - 3.9848 + 2.3039 = -1.718
odds of survival => exp(-1.718) = 0.1794246
probability of survival => 0.1794246/(1+0.1794246) = 0.1521289 or 15.2%
Take a female passenger in first class.
log odds of survival => 3.325 (the intercept alone, since females in first class are the reference group)
odds of survival => exp(3.325) = 27.799
probability of survival => 27.799/(1+27.799) = 0.9652766 or 96.5%
Take a female passenger in second class.
log odds of survival => 3.325 - 1.2666 = 2.0584
odds of survival => exp(2.0584) = 7.833426
probability of survival => 7.833426/(1+7.833426) = 0.8867936 or 88.7%
Take a female passenger in third class.
log odds of survival => 3.325 -3.3621 = -0.0371
odds of survival => exp(-0.0371) = 0.9635798
probability of survival => 0.9635798/(1+0.9635798) = 0.4907261 or 49%
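The six calculations above follow the same pattern, so they can be done in one pass. This is a sketch that types in the coefficients from the summary output; the short names in b and the helper columns are only illustrative, and plogis() from base R's stats package converts log odds to probabilities.

```r
# Coefficients copied from the interaction model summary.
b <- c(
  intercept   = 3.3250,
  pclass2     = -1.2666,
  pclass3     = -3.3621,
  male        = -3.9848,
  pclass2male = 0.1617,
  pclass3male = 2.3039
)

# One row per class/sex combination; pclass varies fastest.
groups <- expand.grid(pclass = 1:3, sex = c("female", "male"),
                      stringsAsFactors = FALSE)
is2  <- groups$pclass == 2
is3  <- groups$pclass == 3
male <- groups$sex == "male"

# Add up only the terms that apply to each group (TRUE counts as 1).
groups$log_odds <- b[["intercept"]] +
  b[["pclass2"]] * is2 + b[["pclass3"]] * is3 +
  b[["male"]] * male +
  b[["pclass2male"]] * (is2 & male) +
  b[["pclass3male"]] * (is3 & male)

groups$odds <- exp(groups$log_odds)
groups$prob <- plogis(groups$log_odds)   # odds / (1 + odds)
print(groups)
```

With the fitted model object available, predict() with type = "response" on the same six combinations would be an alternative that avoids copying the coefficients by hand.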
Interpretation of results
The higher the passenger class, the better the chance of survival, and within every class women were much more likely to survive than men. Around 9 out of 10 women in first and second class survived (96.5% and 88.7%), and about one in two women in third class (49%). Even third-class women had a survival probability 15 percentage points higher than first-class men (49% vs 34%). Only about one in three first-class men survived, and men in second and third class had the lowest chances (14.6% and 15.2%). Men in third class did slightly better than men in second class, but a difference this small may be due to random noise.
How we arrived at those results:
The logistic regression output reports the intercept and coefficients on the log-odds scale, so they must be converted to odds. Before converting, we add up the terms that apply to the group of interest to get its log odds of survival (e.g. for a female in second class we add the intercept 3.325 and the pclass2 coefficient -1.2666).
Then we exponentiate the log odds to get the odds of survival (e.g. female in second class = 7.83, which means that for each second-class female who died, 7.83 survived).
Because odds are harder to interpret intuitively than probabilities, we converted them to probabilities using the formula:
\[ \text{probability} = \frac{\text{odds}}{1 + \text{odds}} \]
We did that for each of the six groups:
male 1st class = 34%
male 2nd class = 14.6%
male 3rd class = 15.2%
female 1st class = 96.5%
female 2nd class = 88.7%
female 3rd class = 49%