The Wisconsin Diabetes Registry Study targeted all individuals $<30$ years of age diagnosed with Type I diabetes in southern Wisconsin, USA. Participants were asked to submit blood samples and were sent a questionnaire inquiring about hospitalizations and other events. The blood samples were used to determine glycosylated hemoglobin (GHb), an important indicator of glycemic control.

The data set diabetes.txt, which can be downloaded from here, contains the data from the Wisconsin Diabetes Registry Study. The data items are:

Variable name meaning
ID a unique identification number
HEALTH self-reported health status: 1=excellent, 2=good, 3=fair, 4=poor
BH dichotomized health status: 1=excellent health, 0=worse than excellent health
GENDER sex code: 1=female, 0=male
GHb overall mean glycosylated hemoglobin value in the study
AGE age at diagnosis (years)
  • In the section “Relationship of GHb with age”, we are interested in the relationship of GHb with age at diagnosis and/or self-reported health status.
  • In the section “Modeling of dichotomized health status”, we use BH as the dependent (response) variable and use models such as logistic regression to examine the explanatory power of the other independent variables.
library(psych)
library(dplyr)
library(ggplot2)
df <- read.table("http://ghuang.stat.nctu.edu.tw/course/datasci17/files/data/diabetes.txt", header=T)
head(df)

Relationship of GHb with age

Investigate the association between GHb and age at diagnosis

Our goal is to fit a linear regression to investigate the association between GHb and age at diagnosis.

First, let us observe the distribution of response variable, $GHb$.

par(mfrow=c(3,1))
hist(df$GHb, main="Histogram for GHb", xlab="GHb", ylab="", border="blue",  col="green", breaks=100, prob = TRUE)
lines(density(df$GHb, na.rm=TRUE))

hist(sqrt(df$GHb), main=expression("Histogram for" ~ sqrt(GHb)), xlab=expression(sqrt(GHb)), ylab="", border="blue",  col="green", breaks=100, prob = TRUE)
lines(density(sqrt(df$GHb), na.rm=TRUE))

hist(log(df$GHb), main="Histogram for log(GHb)", xlab="log(GHb)", ylab="", border="blue",  col="green", breaks=100, prob = TRUE)
lines(density(log(df$GHb), na.rm=TRUE))

Since $GHb$ is right-skewed, we transform it to $log(GHb)$, as the histograms above suggest.

y <- describe(df$GHb)
logy <- describe(log(df$GHb))

## Show Result
tab <- matrix(c(y$skew, logy$skew), ncol=2, byrow=F)
colnames(tab) <- c("GHb", "log(GHb)")
rownames(tab) <- c("skewness")
tab
##                GHb   log(GHb)
## skewness 0.8574995 0.03079306
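As an optional cross-check of the skewness comparison, normal Q-Q plots of the raw and log-transformed values point in the same direction. A minimal sketch (g is a temporary vector introduced only for this illustration):

## Optional cross-check: normal Q-Q plots of GHb and log(GHb)
g <- na.omit(df$GHb)
par(mfrow=c(1,2))
qqnorm(g, main="GHb"); qqline(g)
qqnorm(log(g), main="log(GHb)"); qqline(log(g))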

Now let’s fit linear regression models for $GHb$ and $AGE$, using both the original values and their transformations.

df$AGE2 <- df$AGE^2
s1_1 <- df %>% lm(GHb ~ AGE, dat=.) %>% summary
print(s1_1)
## 
## Call:
## lm(formula = GHb ~ AGE, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1885 -1.4515 -0.2526  1.2454 11.0896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.27118    0.17685  63.733   <2e-16 ***
## AGE         -0.02470    0.01391  -1.776   0.0763 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 555 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.005652,   Adjusted R-squared:  0.00386 
## F-statistic: 3.155 on 1 and 555 DF,  p-value: 0.07625
s1_2 <- df %>% lm(GHb ~ AGE + AGE2, dat=.) %>% summary
print(s1_2)
## 
## Call:
## lm(formula = GHb ~ AGE + AGE2, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3567 -1.3668 -0.2108  1.2241 10.8219 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.444419   0.275136  37.961  < 2e-16 ***
## AGE          0.140172   0.044566   3.145 0.001748 ** 
## AGE2        -0.006034   0.001552  -3.889 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.105 on 554 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.03207,    Adjusted R-squared:  0.02858 
## F-statistic: 9.179 on 2 and 554 DF,  p-value: 0.0001198
s1_3 <- df %>% lm(log(GHb) ~ AGE, dat=.) %>% summary
print(s1_3)
## 
## Call:
## lm(formula = log(GHb) ~ AGE, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62850 -0.12590 -0.00618  0.12537  0.71620 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.414185   0.015712 153.650   <2e-16 ***
## AGE         -0.003141   0.001236  -2.542   0.0113 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1894 on 555 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01151,    Adjusted R-squared:  0.009732 
## F-statistic: 6.464 on 1 and 555 DF,  p-value: 0.01128
s1_4 <- df %>% lm(log(GHb) ~ AGE + AGE2, dat=.) %>% summary
print(s1_4)
## 
## Call:
## lm(formula = log(GHb) ~ AGE + AGE2, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6432 -0.1218 -0.0054  0.1230  0.6928 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.3420739  0.0244565  95.765  < 2e-16 ***
## AGE          0.0112391  0.0039615   2.837 0.004719 ** 
## AGE2        -0.0005263  0.0001379  -3.816 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1871 on 554 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.03683,    Adjusted R-squared:  0.03335 
## F-statistic: 10.59 on 2 and 554 DF,  p-value: 3.062e-05

Q. Is the square term of AGE needed?

From the table below we can conclude that

  1. When $AGE^2$ is not included in the model, with $GHb$ as the response, the variable $AGE$ is significant at only about the 85% confidence level.
  2. When $AGE^2$ is included in the model, with $GHb$ as the response, the variable $AGE$ is significant at about the 99.65% confidence level and the variable $AGE^2$ at about the 99.99% confidence level.
  3. The value of $adjusted~R^2$ increases by about 0.0247 when $AGE^2$ is included in the model.
tab <- matrix(c(s1_1$coefficients[2], s1_2$coefficients[2], 
                s1_1$coefficients[8], s1_2$coefficients[11], 
                0, s1_2$coefficients[3], 
                0, s1_2$coefficients[12], 
                s1_1$adj.r.squared, s1_2$adj.r.squared), ncol=2, byrow=TRUE)
colnames(tab) <- c("[GHb ~ AGE]", "[GHb ~ AGE + AGE^2]")
rownames(tab) <- c("Coefficient of AGE", "(P-Value of AGE)", "Coefficient of AGE^2", "(P-Value of AGE^2)", 
                   "Adjusted r-squared")
tab
##                       [GHb ~ AGE] [GHb ~ AGE + AGE^2]
## Coefficient of AGE   -0.024701508        0.1401722095
## (P-Value of AGE)      0.076253926        0.0017484520
## Coefficient of AGE^2  0.000000000       -0.0060337124
## (P-Value of AGE^2)    0.000000000        0.0001129816
## Adjusted r-squared    0.003860494        0.0285792238

From the table below we can conclude that

  1. When $AGE^2$ is not included in the model, with $log(GHb)$ as the response, the variable $AGE$ is significant at about the 98% confidence level.
  2. When $AGE^2$ is included in the model, with $log(GHb)$ as the response, the variable $AGE$ is significant at about the 99% confidence level and the variable $AGE^2$ at about the 99.97% confidence level.
  3. The value of $adjusted~R^2$ increases by about 0.0236 when $AGE^2$ is included in the model.
tab <- matrix(c(s1_3$coefficients[2], s1_4$coefficients[2], 
                s1_3$coefficients[8], s1_4$coefficients[11], 
                0, s1_4$coefficients[3], 
                0, s1_4$coefficients[12], 
                s1_3$adj.r.squared, s1_4$adj.r.squared), ncol=2, byrow=TRUE)
colnames(tab) <- c("[log(GHb) ~ AGE]", "[log(GHb) ~ AGE + AGE^2]")
rownames(tab) <- c("Coefficient of AGE", "(P-Value of AGE)", "Coefficient of AGE^2", "(P-Value of AGE^2)", 
                   "Adjusted r-squared")
tab
##                      [log(GHb) ~ AGE] [log(GHb) ~ AGE + AGE^2]
## Coefficient of AGE       -0.003141446             0.0112390969
## (P-Value of AGE)          0.011277878             0.0047190001
## Coefficient of AGE^2      0.000000000            -0.0005262698
## (P-Value of AGE^2)        0.000000000             0.0001510187
## Adjusted r-squared        0.009731938             0.0333503337

In conclusion, the linear regression model fits better when the variable $AGE^2$ is included, so the square term of $AGE$ is needed.
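The same conclusion can also be checked with a partial F-test comparing the nested models. A minimal sketch (m_lin and m_quad are names introduced only for this illustration; with a single added term, the F statistic equals the squared t statistic of AGE2 reported above):

## Partial F-test for the AGE^2 term via a nested-model comparison
m_lin  <- lm(log(GHb) ~ AGE,        data=df)
m_quad <- lm(log(GHb) ~ AGE + AGE2, data=df)
anova(m_lin, m_quad)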

Q. How to interpret the regression coefficients?

print(s1_4)
## 
## Call:
## lm(formula = log(GHb) ~ AGE + AGE2, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6432 -0.1218 -0.0054  0.1230  0.6928 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.3420739  0.0244565  95.765  < 2e-16 ***
## AGE          0.0112391  0.0039615   2.837 0.004719 ** 
## AGE2        -0.0005263  0.0001379  -3.816 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1871 on 554 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.03683,    Adjusted R-squared:  0.03335 
## F-statistic: 10.59 on 2 and 554 DF,  p-value: 3.062e-05

The glycosylated hemoglobin value of a person is correlated with his/her age at diagnosis. The model above states that the logarithm of a person’s glycosylated hemoglobin value is a quadratic function of his/her age at diagnosis.

That is,

  1. The fitted value of $log(GHb)$ is $2.3420739$ when $AGE$ is 0.
  2. When $AGE$ changes from 0 to 1, $log(GHb)$ increases by $(1-0) \times 0.0112391 + (1^2-0^2) \times (-0.0005263) = 0.0107128$ units.
  3. When $AGE$ changes from 1 to 2, $log(GHb)$ increases by $(2-1) \times 0.0112391 + (2^2-1^2) \times (-0.0005263) = 0.0096602$ units.
  4. …and so on (the sketch below generalizes this calculation).
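A minimal sketch of the general calculation, plugging in the rounded coefficient estimates printed above (b1, b2, and delta_logGHb are names introduced only for this illustration):

## Fitted one-year change in log(GHb) implied by the quadratic model
b1 <- 0.0112391    # coefficient of AGE
b2 <- -0.0005263   # coefficient of AGE^2
delta_logGHb <- function(age) b1 + b2 * ((age + 1)^2 - age^2)
delta_logGHb(0)    # change from AGE 0 to 1: 0.0107128
delta_logGHb(1)    # change from AGE 1 to 2: 0.0096602
delta_logGHb(11)   # negative: the fitted curve peaks near AGE 10.7 (= -b1 / (2 * b2))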

Investigate the association between GHb and self-reported health status

Now fit a linear regression model for GHb versus self-reported health status.

First, we create dummy variables for self-reported health status, using the group that has the highest mean GHb value as the baseline. Which group is the baseline? (Note: when creating dummy variables, pay attention to the missing values in HEALTH).

tab <- matrix(c(mean(subset(df, df$HEALTH == 1 & !is.na(df$GHb))$GHb), 
                mean(subset(df, df$HEALTH == 2 & !is.na(df$GHb))$GHb), 
                mean(subset(df, df$HEALTH == 3 & !is.na(df$GHb))$GHb), 
                mean(subset(df, df$HEALTH == 4 & !is.na(df$GHb))$GHb)), ncol=4, byrow=TRUE)
colnames(tab) <- c(1:4)
rownames(tab) <- c("Mean GHb")
tab
##                 1        2        3        4
## Mean GHb 10.62296 11.13718 12.08103 11.38529

We will use the group with $HEALTH = 3$ as baseline when creating dummy variables since it is the group that has the highest mean GHb value.

## Create dummy variables
df$H1 <- ifelse(df$HEALTH == 1, 1, 0)
df$H2 <- ifelse(df$HEALTH == 2, 1, 0)
df$H4 <- ifelse(df$HEALTH == 4, 1, 0)

## Remove NA values in HEALTH column to prevent counting these records as HEALTH == 3
df %>% select(HEALTH, H1, H2, H4, GHb) %>% 
       na.omit() %>% 
       lm(log(GHb) ~ H1 + H2 + H4, dat=.) %>% 
       summary()
## 
## Call:
## lm(formula = log(GHb) ~ H1 + H2 + H4, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66510 -0.11337  0.00143  0.12009  0.64702 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46878    0.02758  89.520  < 2e-16 ***
## H1          -0.12256    0.03011  -4.071 5.38e-05 ***
## H2          -0.07558    0.02992  -2.526   0.0118 *  
## H4          -0.03657    0.13510  -0.271   0.7867    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.187 on 544 degrees of freedom
## Multiple R-squared:  0.03483,    Adjusted R-squared:  0.02951 
## F-statistic: 6.543 on 3 and 544 DF,  p-value: 0.0002368

Q. What is the interpretation of regression coefficients?

The linear model we obtain is \(log(GHb) = 2.46878 - 0.12256 \times H1 - 0.07558 \times H2 - 0.03657 \times H4\)

  1. For people whose health status is 3, the mean value of $log(GHb)$ is $2.46878$.
  2. $-0.12256$ is the difference in mean value of $log(GHb)$ between people whose health status is 1 and people whose health status is 3.
  3. $-0.07558$ is the difference in mean value of $log(GHb)$ between people whose health status is 2 and people whose health status is 3.
  4. $-0.03657$ is the difference in mean value of $log(GHb)$ between people whose health status is 4 and people whose health status is 3.
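For reference, the same model can be fit without hand-coded dummy variables by treating HEALTH as a factor and releveling it so that group 3 is the reference. A minimal sketch (HEALTH_f is a column name introduced only for this illustration):

## Equivalent fit using a releveled factor instead of hand-coded dummies
## (lm drops rows with missing HEALTH or GHb automatically)
df$HEALTH_f <- relevel(factor(df$HEALTH), ref="3")
summary(lm(log(GHb) ~ HEALTH_f, data=df))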

Investigate the association between GHb and self-reported health status after adjusting for age

Based on previous results, study the relationship of GHb with age at diagnosis and self-reported health status. Is self-reported health status significantly associated with GHb after adjusting for age at diagnosis, and vice versa?

## Fit the several linear models needed
s3_1 <- df %>% lm(log(GHb) ~ HEALTH, dat=.) %>% summary
s3_2 <- df %>% lm(log(GHb) ~ AGE, dat=.) %>% summary
s3_3 <- df %>% lm(log(GHb) ~ HEALTH + AGE, dat=.) %>% summary
## Compare coefficients of the fitted models
tab <- matrix(c(s3_1$coefficients[2], 0, 
                0, s3_2$coefficients[2], 
                s3_3$coefficients[2], s3_3$coefficients[3]), ncol=3, byrow=FALSE)
colnames(tab) <- c("[log(GHb) ~ HEALTH]", "[log(GHb) ~ AGE]", "[log(GHb) ~ HEALTH + AGE]")
rownames(tab) <- c("Coefficient of HEALTH", "Coefficient of AGE")
tab
##                       [log(GHb) ~ HEALTH] [log(GHb) ~ AGE]
## Coefficient of HEALTH          0.05361706      0.000000000
## Coefficient of AGE             0.00000000     -0.003141446
##                       [log(GHb) ~ HEALTH + AGE]
## Coefficient of HEALTH               0.059016012
## Coefficient of AGE                 -0.003951742

Since the coefficient of $HEALTH$ does not change substantially when $AGE$ is added to the model, $AGE$ is not a confounder of the association between $HEALTH$ and $log(GHb)$.

Likewise, the coefficient of $AGE$ does not change substantially when $HEALTH$ is added to the model, so $HEALTH$ is not a confounder of the association between $AGE$ and $log(GHb)$ either.

## Compare p-values of the fitted models
tab <- matrix(c(s3_1$coefficients[8], 0, 
                0, s3_2$coefficients[8], 
                s3_3$coefficients[11], s3_3$coefficients[12]), ncol=3, byrow=FALSE)
colnames(tab) <- c("[log(GHb) ~ HEALTH]", "[log(GHb) ~ AGE]", "[log(GHb) ~ HEALTH + AGE]")
rownames(tab) <- c("P-value of HEALTH", "P-value of AGE")
tab
##                   [log(GHb) ~ HEALTH] [log(GHb) ~ AGE]
## P-value of HEALTH        1.743485e-05       0.00000000
## P-value of AGE           0.000000e+00       0.01127788
##                   [log(GHb) ~ HEALTH + AGE]
## P-value of HEALTH              2.411324e-06
## P-value of AGE                 1.411564e-03

The p-value of $HEALTH$ becomes slightly smaller (roughly from the 99.997% to the 99.999% confidence level) when $AGE$ is included in the model, so self-reported health status is, if anything, slightly more significantly associated with GHb after adjusting for age at diagnosis.

At the same time, the p-value of $AGE$ becomes smaller (roughly from the 97% to the 99.7% confidence level) when $HEALTH$ is included in the model, so age at diagnosis is more significantly associated with GHb after adjusting for self-reported health status.

Association between GHb and AGE varies across different health status groups

Now we need a model that allows the association between GHb and AGE to vary across health status groups. The square term of AGE is not included in this model.

Q. What is the model?

lr <- df %>% lm(GHb ~ AGE + H1 + H2 + H4 + H1*AGE + H2*AGE + H4*AGE, dat=.)
print(summary(lr))
## 
## Call:
## lm(formula = GHb ~ AGE + H1 + H2 + H4 + H1 * AGE + H2 * AGE + 
##     H4 * AGE, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0986 -1.4412 -0.2503  1.1178 10.2820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.203788   0.671771  16.678   <2e-16 ***
## AGE          0.064940   0.044260   1.467   0.1429    
## H1          -0.256499   0.721944  -0.355   0.7225    
## H2           0.573093   0.718160   0.798   0.4252    
## H4          -0.005499   2.299794  -0.002   0.9981    
## AGE:H1      -0.096599   0.049536  -1.950   0.0517 .  
## AGE:H2      -0.122140   0.048391  -2.524   0.0119 *  
## AGE:H4      -0.052905   0.114284  -0.463   0.6436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.077 on 540 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.05972,    Adjusted R-squared:  0.04753 
## F-statistic: 4.899 on 7 and 540 DF,  p-value: 2.226e-05

Q. Average GHb change for each health status group?

For each health status group, what is the mean GHb change per one-year increase in AGE (the slope)? (The sketch after the list below reproduces these values directly from the fitted coefficients.)

  1. When $HEALTH = 3$, the mean $GHb$ increases by $0.064940$ per one-year increase in $AGE$, and its fitted value at $AGE = 0$ is $11.203788$.
  2. When $HEALTH = 1$, the mean $GHb$ changes by $0.064940+(-0.096599)=-0.031659$ per one-year increase in $AGE$, and its fitted value at $AGE = 0$ is $11.203788+(-0.256499)=10.947289$.
  3. When $HEALTH = 2$, the mean $GHb$ changes by $0.064940+(-0.122140)=-0.0572$ per one-year increase in $AGE$, and its fitted value at $AGE = 0$ is $11.203788+0.573093=11.776881$.
  4. When $HEALTH = 4$, the mean $GHb$ changes by $0.064940+(-0.052905)=0.012035$ per one-year increase in $AGE$, and its fitted value at $AGE = 0$ is $11.203788+(-0.005499)=11.198289$.
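A minimal sketch that recomputes these group-specific intercepts and slopes from the fitted coefficients of the interaction model above (intercepts and slopes are names introduced only for this illustration):

## Group-specific intercepts and AGE slopes implied by the interaction model
b <- coef(lr)   # lr is the lm with the HEALTH-by-AGE interactions fitted above
intercepts <- c(HEALTH_3 = unname(b["(Intercept)"]),
                HEALTH_1 = unname(b["(Intercept)"] + b["H1"]),
                HEALTH_2 = unname(b["(Intercept)"] + b["H2"]),
                HEALTH_4 = unname(b["(Intercept)"] + b["H4"]))
slopes <- c(HEALTH_3 = unname(b["AGE"]),
            HEALTH_1 = unname(b["AGE"] + b["AGE:H1"]),
            HEALTH_2 = unname(b["AGE"] + b["AGE:H2"]),
            HEALTH_4 = unname(b["AGE"] + b["AGE:H4"]))
rbind(intercepts, slopes)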

Modeling of dichotomized health status

Diagnostic plots of linear model

Draw various diagnostic plots (e.g., the q-q plot, the (studentized)-residuals-versus-predicted-values plot) of the model built in above. Do you find any outliers/extreme points? Are there any model assumptions being violated?

plot(lr, which=1)

The first plot shows residuals versus fitted values; it is useful for checking the linearity and homoscedasticity assumptions.

This model does NOT perfectly meet the linearity assumption, since the residuals take on a faint parabolic shape instead of being spread evenly around the y = 0 line.

plot(lr, which=2)

The normality assumption is evaluated based on the residuals and can be evaluated using a QQ-plot by comparing the residuals to “ideal” normal observations along the 45-degree line.

With this model, the tails are observed to be ‘heavier’ (have larger values) than what we would expect, which violates the normality of residuals assumption.

plot(lr, which=3)

The third plot is a scale-location plot (the square root of the standardized residuals versus the fitted values). It is useful for checking the homoscedasticity assumption.

We can see there is almost constant variance across the range of standardized residuals for each fitted value, suggesting that the homoscedasticity assumption is satisfied here.

In this scale-location plot, we can also see some potential outliers (observations 544, 273, and 369) that have particularly large standardized residuals (> 2.0).

plot(lr, which=5)

The fourth plot helps us find influential cases, if any are present in the data. Influential outliers are of the greatest concern while outliers may or may not be influential points.

Observations 369, 32, and 547 stand out. However, they are not influential outliers, since they do not cross the Cook's distance lines.
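If one prefers numbers to reading the plot, the same influence diagnostics can be extracted directly. A minimal sketch (cd is a name introduced only for this illustration):

## Numeric influence diagnostics for the interaction model lr
cd <- cooks.distance(lr)
head(sort(cd, decreasing=TRUE))            # observations with the largest Cook's distance
sum(abs(rstudent(lr)) > 2, na.rm=TRUE)     # rough count of large studentized residuals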

Age adjusted odds ratios for sex versus dichotomized health status

Use logistic regression to obtain the crude odds ratio as well as the age adjusted odds ratio for sex versus dichotomized health status (use age as a continuous variable).

df %>% glm(BH ~ GENDER, family=binomial, dat=.) %>% summary()
## 
## Call:
## glm(formula = BH ~ GENDER, family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1481  -1.1481  -0.9703   1.2070   1.3997  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.06947    0.11792  -0.589   0.5558  
## GENDER      -0.43937    0.17251  -2.547   0.0109 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 761.49  on 556  degrees of freedom
## Residual deviance: 754.95  on 555  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 758.95
## 
## Number of Fisher Scoring iterations: 4

The crude odds ratio:

  1. For people with $GENDER = 0$ (male), the odds of $BH = 1$ (excellent health) are $e^{-0.06947} = 0.9328881$.
  2. For people with $GENDER = 1$ (female), the odds ratio of $BH = 1$ (excellent health), compared with people with $GENDER = 0$ (male), is $e^{-0.43937} = 0.6444423$.
df %>% glm(BH ~ GENDER + AGE, family=binomial, dat=.) %>% summary()
## 
## Call:
## glm(formula = BH ~ GENDER + AGE, family = binomial, data = .)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.313  -1.056  -0.900   1.240   1.682  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.34282    0.19032   1.801  0.07166 . 
## GENDER      -0.44678    0.17377  -2.571  0.01014 * 
## AGE         -0.03716    0.01350  -2.753  0.00591 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 761.49  on 556  degrees of freedom
## Residual deviance: 747.12  on 554  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 753.12
## 
## Number of Fisher Scoring iterations: 4

The adjusted odds ratio:

  1. For people with $GENDER = 0$ (male) and $AGE = 0$, the odds of $BH = 1$ (excellent health) are $e^{0.34282} = 1.408915$.
  2. At any fixed $AGE$, the odds ratio of $BH = 1$ (excellent health) for people with $GENDER = 1$ (female), compared with people with $GENDER = 0$ (male), is $e^{-0.44678} = 0.6396846$.
  3. At any fixed $GENDER$, each one-unit increase in $AGE$ multiplies the odds of $BH = 1$ (excellent health) by $e^{-0.03716} = 0.963522$.
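Equivalently, the crude and adjusted odds ratios (with Wald confidence intervals) can be read off by exponentiating the coefficients. A minimal sketch (m_crude and m_adj are names introduced only for this illustration):

## Odds ratios and Wald 95% CIs on the odds-ratio scale
m_crude <- glm(BH ~ GENDER, family=binomial, data=df)
m_adj   <- glm(BH ~ GENDER + AGE, family=binomial, data=df)
exp(coef(m_crude))["GENDER"]                       # crude odds ratio for sex
exp(cbind(OR = coef(m_adj), confint.default(m_adj)))   # adjusted ORs with Wald CIs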

Age-stratum adjusted odds ratios for sex versus dichotomized health status

Create an age stratum variable (values 1=$\mbox{age}\leq 7$, 2=$7<\mbox{age}\leq 11$, 3=$11<\mbox{age}\leq 15$, 4=$15<\mbox{age}$), and corresponding dummy variables, using ages 0-7 as the baseline group. Use logistic regression to obtain age-stratum adjusted odds ratios for sex versus dichotomized health status.

## Create age stratum variables
df$AGE_7_11 <- ifelse(df$AGE>7 & df$AGE<=11, 1, 0)
df$AGE_11_15 <- ifelse(df$AGE>11 & df$AGE<=15, 1, 0)
df$AGE_15_30 <- ifelse(df$AGE>15, 1, 0)

## Fit logistic regression model
s7 <- df %>% glm(BH ~ GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, family=binomial, dat=.) %>% summary()
print(s7)
## 
## Call:
## glm(formula = BH ~ GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, 
##     family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2546  -1.0617  -0.8012   1.2664   1.6079  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.174604   0.172706   1.011  0.31202   
## GENDER      -0.457927   0.175081  -2.616  0.00891 **
## AGE_7_11     0.004949   0.233725   0.021  0.98311   
## AGE_11_15   -0.381393   0.238721  -1.598  0.11012   
## AGE_15_30   -0.688456   0.246636  -2.791  0.00525 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 761.49  on 556  degrees of freedom
## Residual deviance: 744.27  on 552  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 754.27
## 
## Number of Fisher Scoring iterations: 4

The adjusted odds ratio:

  1. For people with $GENDER = 0$ (male) and $AGE \leq 7$, the odds of $BH = 1$ (excellent health) are $e^{0.174604} = 1.190775$.
  2. Within any age stratum (0-7, 7-11, 11-15, or 15-30), the odds ratio of $BH = 1$ (excellent health) for people with $GENDER = 1$ (female), compared with people with $GENDER = 0$ (male), is $e^{-0.457927} = 0.6325937$.
  3. At any fixed $GENDER$, for people with $AGE$ in $(7, 11]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{0.004949} = 1.004961$.
  4. At any fixed $GENDER$, for people with $AGE$ in $(11, 15]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.381393} = 0.6829095$.
  5. At any fixed $GENDER$, for people with $AGE$ in $(15, 30]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.688456} = 0.5023511$.
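An equivalent, slightly more compact way to build the strata is cut(), which keeps the age group as a single factor with ages 0-7 as the reference level. A minimal sketch (AGE_strata is a column name introduced only for this illustration):

## Same age strata built with cut(); the first level (0-7) is the reference
df$AGE_strata <- cut(df$AGE, breaks=c(-Inf, 7, 11, 15, Inf),
                     labels=c("0-7", "7-11", "11-15", "15+"))
summary(glm(BH ~ GENDER + AGE_strata, family=binomial, data=df))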

Differences in two ways of adjusting for age

Comment on whether the adjusted odds ratio for sex versus health status is different with the two ways of adjusting for age (i.e., compare adjusting for continuous age versus adjusting for age strata). What assumptions do you make in each way of adjusting for age? Which analysis is more appropriate?

The table below compares the one-year odds ratios (holding $GENDER$ fixed while $AGE$ increases by one year) under the two ways of adjusting for age.

We can see that the odds ratio is the same for every one-year increase when adjusting for continuous age, whereas under the age-strata adjustment it equals 1 within a stratum and changes only when $AGE$ crosses a stratum boundary.

One-year change in AGE   OR, adjusting for continuous age   OR, adjusting for age strata
0 to 1                   0.963522                           1
1 to 2                   0.963522                           1
7 to 8                   0.963522                           1.004961
11 to 12                 0.963522                           0.6829095/1.004961 = 0.6795383
15 to 16                 0.963522                           0.5023511/0.6829095 = 0.7356042
29 to 30                 0.963522                           1

The table below compares the odds (when $GENDER = 0$ and the value of $AGE$ increases) using the two ways of adjusting for age.

We can see that the odds decline by the same factor every year when adjusting for continuous age, whereas they change in steps, only at the stratum boundaries, when adjusting for age strata.

AGE   Odds, adjusting for continuous age   Odds, adjusting for age strata
0     1.408915                             1.190775
1     1.357521                             1.190775
8     1.046593                             1.190775*1.004961 = 1.196682
12    0.9020371                            1.190775*0.6829095 = 0.8131916
16    0.7774473                            1.190775*0.5023511 = 0.5981871
30    0.4620977                            0.5981871

When we adjust for continuous $AGE$, we assume that the odds ratio is the same for every one-unit increase in $AGE$.

However, the plot below suggests that the decline in the proportion of people with $BH = 1$ is not uniform across $AGE$.

## Create discrete number of AGE and factor type of BH for plot usage
df$AGE_floor <- floor(df$AGE)
df$BH_type <- as.factor(df$BH)

## Plot the stacked bar chart to show the changes in count for BH=0 and BH=1 as AGE increases
## (par(mfrow=...) is omitted here because it has no effect on ggplot2 output)
ggplot(data = df, aes(x=AGE_floor, fill = BH_type)) + geom_bar()

The odds ratio for $AGE$ changing from 7 to 8 should not be assumed to equal the odds ratio for $AGE$ changing from 15 to 16. So, in my opinion, the age-stratum adjusted odds ratios are more appropriate, provided the stratum boundaries are chosen carefully.
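One way to check this claim against the raw data is to tabulate the observed proportion and odds of BH = 1 within each age stratum. A minimal sketch using dplyr (stratum is a name introduced only for this illustration):

## Observed proportion and odds of BH = 1 within each age stratum
df %>%
  filter(!is.na(BH), !is.na(AGE)) %>%
  mutate(stratum = factor(case_when(AGE <= 7             ~ "0-7",
                                    AGE > 7  & AGE <= 11 ~ "7-11",
                                    AGE > 11 & AGE <= 15 ~ "11-15",
                                    TRUE                 ~ "15+"),
                          levels = c("0-7", "7-11", "11-15", "15+"))) %>%
  group_by(stratum) %>%
  summarise(n = n(), prop_BH1 = mean(BH), odds_BH1 = mean(BH) / (1 - mean(BH)))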

Interpretation of the odds ratios for age

Interpret the odds ratios for age in the two logistic regression approaches above (i.e., one adjusted for continuous age and one adjusted for age strata).

  1. Odds ratios for age when adjusted for continuous age
    • For people with $GENDER = 0$ (male) and $AGE = 0$, the odds of $BH = 1$ (excellent health) are $e^{0.34282} = 1.408915$.
    • At any fixed $GENDER$, each one-unit increase in $AGE$ multiplies the odds of $BH = 1$ (excellent health) by $e^{-0.03716} = 0.963522$.
  2. Odds ratios for age when adjusted for age strata
    • For people with $GENDER = 0$ (male) and $AGE \leq 7$, the odds of $BH = 1$ (excellent health) are $e^{0.174604} = 1.190775$.
    • At any fixed $GENDER$, for people with $AGE$ in $(7, 11]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{0.004949} = 1.004961$.
    • At any fixed $GENDER$, for people with $AGE$ in $(11, 15]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.381393} = 0.6829095$.
    • At any fixed $GENDER$, for people with $AGE$ in $(15, 30]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.688456} = 0.5023511$.

Add GHb to age-stratum adjusted logistic regression model

Add GHb to the logistic regression in section “Age-stratum adjusted odds ratios for sex versus dichotomized health status”. Do the relationships of sex and/or age with health status change with and without adding GHb?

## Add GHb to the logistic regression in Problem 7
s10 <- df %>% glm(BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, family=binomial, dat=.) %>% summary()
print(s10)
## 
## Call:
## glm(formula = BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, 
##     family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5019  -1.0810  -0.7762   1.1771   2.0793  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.22501    0.54087   4.114 3.89e-05 ***
## GHb         -0.18960    0.04742  -3.998 6.39e-05 ***
## GENDER      -0.47525    0.17869  -2.660  0.00782 ** 
## AGE_7_11     0.14736    0.23993   0.614  0.53910    
## AGE_11_15   -0.27022    0.24427  -1.106  0.26863    
## AGE_15_30   -0.72966    0.25671  -2.842  0.00448 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.23  on 547  degrees of freedom
## Residual deviance: 719.08  on 542  degrees of freedom
##   (18 observations deleted due to missingness)
## AIC: 731.08
## 
## Number of Fisher Scoring iterations: 4
## Create table that compares the model created in Problem 7 with the model created in Problem 10
tab <- matrix(c(s7$coefficients[2:5], s10$coefficients[3:6], s10$coefficients[3:6]-s7$coefficients[2:5])
              ,ncol=3, byrow=FALSE)

## Show results
colnames(tab) <- c("Without adjusted for GHb (A)", "Adjusted for GHb (B)", "Difference (B-A)")
rownames(tab) <- paste("Coefficient of", rownames(s7$coefficients)[2:5])
tab
##                          Without adjusted for GHb (A) Adjusted for GHb (B)
## Coefficient of GENDER                    -0.457927163           -0.4752462
## Coefficient of AGE_7_11                   0.004948754            0.1473575
## Coefficient of AGE_11_15                 -0.381392527           -0.2702213
## Coefficient of AGE_15_30                 -0.688455966           -0.7296650
##                          Difference (B-A)
## Coefficient of GENDER         -0.01731905
## Coefficient of AGE_7_11        0.14240873
## Coefficient of AGE_11_15       0.11117118
## Coefficient of AGE_15_30      -0.04120900

From the table above, we can see that the difference in the coefficient of $GENDER$ is small, so the relationship between sex and health status barely changes when adjusting for $GHb$.

For the relationship between age and health status, the coefficient of AGE_15_30 also barely changes, but the changes in the coefficients of AGE_7_11 and AGE_11_15 are both noticeable, indicating that the relationship between age and health status does change when adjusting for GHb.
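Because the model with GHb drops 9 additional rows (18 vs. 9 missing), a direct deviance comparison of the two printed fits is not quite apples-to-apples. A minimal sketch that refits both models on the same complete cases and tests whether adding GHb improves the fit (dat_cc, m_without, and m_with are names introduced only for this illustration):

## Refit both models on the same complete cases and compare with a likelihood-ratio test
dat_cc <- na.omit(df[, c("BH", "GHb", "GENDER", "AGE_7_11", "AGE_11_15", "AGE_15_30")])
m_without <- glm(BH ~ GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, family=binomial, data=dat_cc)
m_with    <- glm(BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, family=binomial, data=dat_cc)
anova(m_without, m_with, test="Chisq")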

Interactions between GHb and sex

For the model above, examine whether GHb interacts with sex in its relationship with binary self reported health status. Interpret the odds ratio for GHb with and without the interaction term between sex and GHb.

## Add interaction term of GHb and sex to the logistic regression in Problem 10
lr <- df %>% glm(BH ~ GHb*GENDER + GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, family=binomial, dat=.)
s11 <- summary(lr)
print(s11)
## 
## Call:
## glm(formula = BH ~ GHb * GENDER + GHb + GENDER + AGE_7_11 + AGE_11_15 + 
##     AGE_15_30, family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4803  -1.0843  -0.7648   1.1717   2.1282  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  2.10289    0.68891   3.053  0.00227 **
## GHb         -0.17839    0.06150  -2.901  0.00372 **
## GENDER      -0.19460    1.00698  -0.193  0.84677   
## AGE_7_11     0.14893    0.24010   0.620  0.53509   
## AGE_11_15   -0.27181    0.24436  -1.112  0.26599   
## AGE_15_30   -0.73012    0.25646  -2.847  0.00441 **
## GHb:GENDER  -0.02590    0.09148  -0.283  0.77711   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.23  on 547  degrees of freedom
## Residual deviance: 719.00  on 541  degrees of freedom
##   (18 observations deleted due to missingness)
## AIC: 733
## 
## Number of Fisher Scoring iterations: 4

The table below compares the coefficient estimates of the model “BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30” with those of the model “BH ~ GHb*GENDER + GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30”.

We can see that only the coefficient of $GENDER$ changes substantially when the interaction term between sex and GHb is included in the model.

## Create a table to compares the coefficient estimates of the model created in Problem 10
## with that of the model created in Problem 11
c10 <- c(s10$coefficients[1:6], 0)
tab <- matrix(c(c10, s11$coefficients[1:7], s11$coefficients[1:7]-c10), ncol=3, byrow=FALSE)
colnames(tab) <- c("Without interaction term (A)", 
                   "With GHbxGENDER interaction term (B)", 
                   "Difference (B-A)")
rownames(tab) <- paste("Coefficient of", rownames(s11$coefficients)[1:7])
tab
##                            Without interaction term (A)
## Coefficient of (Intercept)                    2.2250121
## Coefficient of GHb                           -0.1895995
## Coefficient of GENDER                        -0.4752462
## Coefficient of AGE_7_11                       0.1473575
## Coefficient of AGE_11_15                     -0.2702213
## Coefficient of AGE_15_30                     -0.7296650
## Coefficient of GHb:GENDER                     0.0000000
##                            With GHbxGENDER interaction term (B)
## Coefficient of (Intercept)                            2.1028939
## Coefficient of GHb                                   -0.1783871
## Coefficient of GENDER                                -0.1945955
## Coefficient of AGE_7_11                               0.1489271
## Coefficient of AGE_11_15                             -0.2718147
## Coefficient of AGE_15_30                             -0.7301234
## Coefficient of GHb:GENDER                            -0.0258973
##                            Difference (B-A)
## Coefficient of (Intercept)    -0.1221181704
## Coefficient of GHb             0.0112123582
## Coefficient of GENDER          0.2806507166
## Coefficient of AGE_7_11        0.0015696085
## Coefficient of AGE_11_15      -0.0015933818
## Coefficient of AGE_15_30      -0.0004584788
## Coefficient of GHb:GENDER     -0.0258973009

The table below compares the p-values of the coefficient estimates of the model “BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30” with those of the model “BH ~ GHb*GENDER + GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30”.

We can see that only the p-values of the $AGE$ dummy variables decrease, and the decrease is very small.

## Create a table to compares the p-values of coefficient estimates of the model created in Problem 10
## with that of the model created in Problem 11
c10 <- c(s10$coefficients[19:24])
tab <- matrix(c(paste(c10*100,"%"), "", paste(s11$coefficients[22:28]*100,"%"), 
                c10 > s11$coefficients[22:27], ""), ncol=3, byrow=FALSE)
colnames(tab) <- c("Without interaction term (A)", "With GHbxGENDER interaction term (B)", "Improve? (B<A)")
rownames(tab) <- paste("P-Value of", rownames(s11$coefficients)[1:7])
tab
##                        Without interaction term (A)
## P-Value of (Intercept) "0.00389207609628805 %"     
## P-Value of GHb         "0.00638515903016335 %"     
## P-Value of GENDER      "0.78244556917265 %"        
## P-Value of AGE_7_11    "53.9096020131784 %"        
## P-Value of AGE_11_15   "26.8628262334859 %"        
## P-Value of AGE_15_30   "0.447848256770686 %"       
## P-Value of GHb:GENDER  ""                          
##                        With GHbxGENDER interaction term (B) Improve? (B<A)
## P-Value of (Intercept) "0.226931741023107 %"                "FALSE"       
## P-Value of GHb         "0.3724813715314 %"                  "FALSE"       
## P-Value of GENDER      "84.6765569053408 %"                 "FALSE"       
## P-Value of AGE_7_11    "53.5086408698207 %"                 "TRUE"        
## P-Value of AGE_11_15   "26.5986364359719 %"                 "TRUE"        
## P-Value of AGE_15_30   "0.441444423092713 %"                "TRUE"        
## P-Value of GHb:GENDER  "77.7107997742709 %"                 ""

Q. Does GHb interact with sex in its relationship with binary self-reported health status?

In the second table above, the p-value for the interaction term $GHb:GENDER$ is about 0.78, so we have essentially no evidence against the null hypothesis that its coefficient is 0. Thus, in my opinion, $GHb$ does not interact meaningfully with $GENDER$, even though the coefficient estimate of $GENDER$ changes substantially when the interaction term $GHb:GENDER$ is included in the model.
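The same conclusion follows from a likelihood-ratio test of the interaction term; both fits drop the same 18 rows, so they are directly comparable. A minimal sketch (m_no_int is a name introduced only for this illustration):

## Likelihood-ratio test for the GHb:GENDER interaction term
m_no_int <- df %>% glm(BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, family=binomial, dat=.)
anova(m_no_int, lr, test="Chisq")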

We can also check the diagnostic plots below. The model with the interaction term appears to violate the linearity, homoscedasticity, and normality assumptions, since the residuals are not randomly spread around the line and their variability is not constant.

## Diagnostic plots for the model with the GHb:GENDER interaction term
par(mfrow=c(2,2))
plot(lr)

Q. Interpret the odds ratio for GHb without the interaction term between sex and GHb

print(s10$call)
print(s10$coefficients)
## glm(formula = BH ~ GHb + GENDER + AGE_7_11 + AGE_11_15 + AGE_15_30, 
##     family = binomial, data = .)

##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  2.2250121 0.54086611  4.1137947 3.892076e-05
## GHb         -0.1895995 0.04742233 -3.9981051 6.385159e-05
## GENDER      -0.4752462 0.17869406 -2.6595524 7.824456e-03
## AGE_7_11     0.1473575 0.23992534  0.6141806 5.390960e-01
## AGE_11_15   -0.2702213 0.24427295 -1.1062270 2.686283e-01
## AGE_15_30   -0.7296650 0.25671349 -2.8423320 4.478483e-03
  • For people with $GHb = 0$, $GENDER = 0$ (male), and $AGE \leq 7$, the odds of $BH = 1$ (excellent health) are $e^{2.2250121} = 9.253595$.
  • At any fixed $GENDER$ and within any age stratum (0-7, 7-11, 11-15, or 15-30), each one-unit increase in $GHb$ multiplies the odds of $BH = 1$ (excellent health) by $e^{-0.1895995} = 0.8272904$.
  • At any fixed $GHb$ and within any age stratum, the odds ratio of $BH = 1$ (excellent health) for people with $GENDER = 1$ (female), compared with people with $GENDER = 0$ (male), is $e^{-0.4752462} = 0.621732$.
  • At any fixed $GENDER$ and $GHb$, for people with $AGE$ in $(7, 11]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{0.1473575} = 1.158768$.
  • At any fixed $GENDER$ and $GHb$, for people with $AGE$ in $(11, 15]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.2702213} = 0.7632106$.
  • At any fixed $GENDER$ and $GHb$, for people with $AGE$ in $(15, 30]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.7296650} = 0.4820705$.

Q. Interpret the odds ratio for GHb with the interaction term between sex and GHb

print(s11$call)
print(s11$coefficients)
## glm(formula = BH ~ GHb * GENDER + GHb + GENDER + AGE_7_11 + AGE_11_15 + 
##     AGE_15_30, family = binomial, data = .)

##               Estimate Std. Error    z value    Pr(>|z|)
## (Intercept)  2.1028939 0.68890517  3.0525159 0.002269317
## GHb         -0.1783871 0.06150065 -2.9005727 0.003724814
## GENDER      -0.1945955 1.00697835 -0.1932470 0.846765569
## AGE_7_11     0.1489271 0.24010416  0.6202603 0.535086409
## AGE_11_15   -0.2718147 0.24436011 -1.1123531 0.265986364
## AGE_15_30   -0.7301234 0.25646081 -2.8469201 0.004414444
## GHb:GENDER  -0.0258973 0.09148087 -0.2830898 0.777107998
  • For people with $GHb = 0$, $GENDER = 0$ (male), and $AGE \leq 7$, the odds of $BH = 1$ (excellent health) are $e^{2.1028939} = 8.189836$.
  • Within any age stratum (0-7, 7-11, 11-15, or 15-30), the effect of $GHb$ now depends on sex (see the sketch after this list):
    - When $GENDER = 0$ (male), each one-unit increase in $GHb$ multiplies the odds of $BH = 1$ (excellent health) by $e^{-0.1783871} = 0.8366185$.
    - When $GENDER = 1$ (female), each one-unit increase in $GHb$ multiplies the odds of $BH = 1$ (excellent health) by $e^{-0.1783871 - 0.0258973} = e^{-0.2042844} \approx 0.8152$.
  • Within any age stratum, the odds ratio for sex now depends on $GHb$:
    - When $GHb = 0$, the odds ratio of $BH = 1$ (excellent health) for people with $GENDER = 1$ (female), compared with people with $GENDER = 0$ (male), is $e^{-0.1945955} = 0.8231676$.
    - When $GHb$ is held at some constant value $c$, that odds ratio is $e^{-0.1945955 - 0.0258973c}$, which shrinks as $GHb$ increases.
  • At any fixed $GENDER$ and $GHb$, for people with $AGE$ in $(7, 11]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{0.1489271} = 1.160588$.
  • At any fixed $GENDER$ and $GHb$, for people with $AGE$ in $(11, 15]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.2718147} = 0.7619955$.
  • At any fixed $GENDER$ and $GHb$, for people with $AGE$ in $(15, 30]$, the odds ratio of $BH = 1$ (excellent health), compared with people with $AGE$ in $(0, 7]$, is $e^{-0.7301234} = 0.4818495$.
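A minimal sketch reproducing the sex-specific odds ratios for GHb from the fitted interaction model:

## Odds ratio per one-unit increase in GHb, by sex, from the interaction model lr
b <- coef(lr)
c(male   = exp(unname(b["GHb"])),                      # about 0.837
  female = exp(unname(b["GHb"] + b["GHb:GENDER"])))    # about 0.815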

References