Back to Article
Hypothesis Testing
Download Source

Hypothesis Testing

Author

Lukas Graz

Published

February 13, 2025

In [1]:
Code
source("R/data_prep.R")
Number of matches per filter criteria (not disjoint)
  Headphone  PRS_all_NA    Distance Activity_NA    Duration  HMNoise_NA 
        303         226         221         102          96          96 
JourneyTime 
         20 
Keep  1494 of 2206 observations
Imputing PRS_orig_vars
Code
options(digits = 3)

# interactive <- function() FALSE
# D_trn$HM_NOISE_nrm <- scale(D_trn$HM_NOISE)
# D_tst$HM_NOISE_nrm <- scale(D_tst$HM_NOISE)
## No suspicious patterns in missing data
# mice::md.pattern(D[c(Mediator_vars, GIS_vars)[nNAs>0]], plot = FALSE)
# mice::md.pattern(D[PRS_orig_vars])

Linear Modeling

Imputation with MissForest on Training Data

In [2]:
Number of NAs in Mediators and GIS variables
sapply(D[Mediator_vars], \(x) sum(is.na(x)))
 FEELNAT   LNOISE LOC_SENS LOC_SOUN LOC_SCEN LOC_VISE LOC_VEGE LOC_FAUN 
      16      291       28       30       36       62       69       88 
Number of NAs in Mediators and GIS variables
sapply(D[GIS_vars], \(x) sum(is.na(x)))
  LCARTIF_sqrt  LCFOREST_sqrt          HETER    OVDIST_sqrt     VIS5K_sqrt 
             0              0              0              0              0 
       RL_NDVI       RL_NOISE    DISTKM_sqrt   JNYTIME_sqrt STRIMP123_sqrt 
             0              0              0             86              0 
STRIMP999_sqrt 
             0 
In [3]:
Impute missing values using MissForest
# Mediator imputation
D_trn[Mediator_vars] <- xfun::cache_rds({
  missForest(as.matrix(D_trn[Mediator_vars]))
  }, 
  file = "Mediator_imputation.rds", 
  dir = "cache/",
  hash = list(as.matrix(D_trn[Mediator_vars]))
)$ximp |> as.data.frame()

# GIS imputation (missForest)
D_trn[GIS_vars] <- xfun::cache_rds({
  missForest(as.matrix(D_trn[GIS_vars]))
  }, 
  file = "GIS_imputation.rds", 
  dir = "cache/",
  hash = list(as.matrix(D_trn[GIS_vars]))
)$ximp |> as.data.frame()

Scaling Test Data

In [4]:
Scaling variables and show old scale
all_vars <- c(Mediator_vars, GIS_vars, PRS_vars)
old_scale <- t(sapply(D_tst[c(all_vars)], \(x) 
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))))

D_tst[c(all_vars)] <- lapply(D_tst[c(all_vars)], scale)

old_scale
                 mean     sd
FEELNAT         6.142  1.055
LNOISE          4.210  0.747
LOC_SENS        4.098  1.016
LOC_SOUN        4.296  0.947
LOC_SCEN        3.967  1.056
LOC_VISE        4.080  1.027
LOC_VEGE        4.343  0.859
LOC_FAUN        3.298  1.365
LCARTIF_sqrt    0.271  0.269
LCFOREST_sqrt   0.454  0.311
HETER           1.305  0.402
OVDIST_sqrt    21.797 10.144
VIS5K_sqrt      3.323  1.620
RL_NDVI         0.635  0.202
RL_NOISE       41.615  9.261
DISTKM_sqrt     1.473  1.156
JNYTIME_sqrt    3.830  2.247
STRIMP123_sqrt  6.555 10.739
STRIMP999_sqrt 47.557 13.162
MEAN            4.989  0.882
FA              5.266  1.111
BA              5.141  1.156
EC              4.545  1.285
ES              5.009  1.432

Testing VIF

In [5]:
VIF: PRS ~ Mediators + GIS_vars (NO interaction)
car::vif(fit_PRS_MED <- lm(as.formula(paste0(
  PRS_vars[1], 
  " ~ ", 
  paste(Mediator_vars, collapse = " + "), " + ",
  paste(GIS_vars,      collapse = " + ")
)), D_trn)) |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.21    1.46    1.94    2.03    2.24    4.99 
In [6]:
VIF: PRS ~ (Mediators + GIS_vars)^2 (WITH interaction)
suppressMessages(
car::vif(fit_PRS_MED <- lm(as.formula(paste0(
  PRS_vars[1], 
  " ~ ", 
  "(", paste(Mediator_vars, collapse = " + "), 
     " + ", paste(GIS_vars, collapse = " + "), 
  ")^2"
)), D_trn))) |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     12      81     148     205     271    1392 

Since we model with interactions later, the latter VIF are relevant for us. Given that they are very high (c.f. median and max), we would have no hope of finding any significant results in the full interaction model. Therefore, we will first perfom a variable selection, to reduce the VIF and enable us to find significant effects.

All Interactions: Mediators ~ (GIS)^2

In [7]:
Res3 <- list()
for (mediator in Mediator_vars) {
  intercept_model <- lm(as.formula(paste0(
    mediator, " ~ 1")), D_trn)
  step_model <- step(intercept_model, 
    scope = as.formula(paste0(
      mediator, " ~ ", 
      "(", paste(GIS_vars, collapse = " + "), ")^2"
    )),
    trace = FALSE, k = log(nrow(D_trn))
  )
  Res3[[mediator]] <- lm(formula(step_model), D_tst)
}
lapply(Res3, summary)
$FEELNAT

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-4.959 -0.391  0.264  0.607  1.685 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.0618     0.0410    1.51  0.13260    
LCARTIF_sqrt          -0.1524     0.0570   -2.67  0.00770 ** 
RL_NDVI                0.1498     0.0436    3.43  0.00063 ***
OVDIST_sqrt            0.0270     0.0452    0.60  0.55112    
LCARTIF_sqrt:RL_NDVI   0.1146     0.0402    2.85  0.00446 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.948 on 733 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.106, Adjusted R-squared:  0.101 
F-statistic: 21.8 on 4 and 733 DF,  p-value: <2e-16


$LNOISE

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.999 -0.527  0.012  0.676  1.622 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.00097    0.03862   -0.03    0.980    
LCARTIF_sqrt -0.12357    0.04841   -2.55    0.011 *  
RL_NOISE     -0.24203    0.04890   -4.95  9.7e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.945 on 596 degrees of freedom
  (148 observations deleted due to missingness)
Multiple R-squared:  0.11,  Adjusted R-squared:  0.107 
F-statistic: 36.6 on 2 and 596 DF,  p-value: 9.79e-16


$LOC_SENS

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.319 -0.217 -0.007  0.831  1.390 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.000148   0.036675    0.00  0.99678    
HETER           0.129769   0.038233    3.39  0.00073 ***
STRIMP999_sqrt -0.072664   0.038328   -1.90  0.05837 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.993 on 730 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared:  0.0168,    Adjusted R-squared:  0.0141 
F-statistic: 6.22 on 2 and 730 DF,  p-value: 0.00209


$LOC_SOUN

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.623 -0.402  0.482  0.698  1.512 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.000403   0.036322    0.01   0.9912    
LCARTIF_sqrt -0.175213   0.037180   -4.71  2.9e-06 ***
HETER         0.109010   0.037186    2.93   0.0035 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.984 on 731 degrees of freedom
  (13 observations deleted due to missingness)
Multiple R-squared:  0.0344,    Adjusted R-squared:  0.0317 
F-statistic:   13 on 2 and 731 DF,  p-value: 2.82e-06


$LOC_SCEN

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0383 -0.2135  0.0669  0.8101  1.8029 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.00114    0.03618   -0.03     0.97    
RL_NDVI      0.21701    0.03629    5.98  3.5e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.977 on 727 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.0469,    Adjusted R-squared:  0.0456 
F-statistic: 35.8 on 1 and 727 DF,  p-value: 3.48e-09


$LOC_VISE

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0692 -0.1488 -0.0235  0.8511  1.0861 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.000356   0.037305   -0.01    0.992  
LCARTIF_sqrt -0.071106   0.037580   -1.89    0.059 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.998 on 714 degrees of freedom
  (31 observations deleted due to missingness)
Multiple R-squared:  0.00499,   Adjusted R-squared:  0.0036 
F-statistic: 3.58 on 1 and 714 DF,  p-value: 0.0589


$LOC_VEGE

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.971 -0.492  0.443  0.706  1.599 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.0192     0.0378   -0.51   0.6111    
RL_NDVI        0.2195     0.0382    5.75  1.4e-08 ***
JNYTIME_sqrt  -0.1139     0.0379   -3.01   0.0027 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.98 on 671 degrees of freedom
  (73 observations deleted due to missingness)
Multiple R-squared:  0.0549,    Adjusted R-squared:  0.052 
F-statistic: 19.5 on 2 and 671 DF,  p-value: 6.03e-09


$LOC_FAUN

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-1.898 -0.914  0.300  0.796  1.744 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.00127    0.03685   -0.03     0.97    
LCARTIF_sqrt -0.21409    0.03697   -5.79  1.1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.978 on 702 degrees of freedom
  (43 observations deleted due to missingness)
Multiple R-squared:  0.0456,    Adjusted R-squared:  0.0442 
F-statistic: 33.5 on 1 and 702 DF,  p-value: 1.05e-08

All Interactions: PRS ~ (Mediators + GIS)^2

In [8]:
Res4 <- list()
for (prs in PRS_vars) {
  intercept_model <- lm(as.formula(paste0(
    prs, " ~ 1")), D_trn)
  step_model <- step(intercept_model, 
    scope = as.formula(paste0(
      prs, " ~ ", 
      "(", paste(GIS_vars, collapse = " + "), " + ", 
      paste(Mediator_vars, collapse = " + "), ")^2"
    )),
    trace = FALSE, k = log(nrow(D_trn))
  )
  Res4[[prs]] <- lm(formula(step_model), D_tst)
}
lapply(Res4, summary)
$MEAN

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7969 -0.5836 -0.0034  0.5952  2.8496 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.00405    0.03819   -0.11  0.91547    
LOC_VISE          0.15995    0.04455    3.59  0.00036 ***
FEELNAT           0.20119    0.04557    4.41  1.2e-05 ***
LOC_SENS          0.09361    0.04437    2.11  0.03531 *  
LNOISE            0.16980    0.04201    4.04  6.1e-05 ***
DISTKM_sqrt      -0.00204    0.03977   -0.05  0.95908    
LCARTIF_sqrt      0.02009    0.04143    0.49  0.62784    
LOC_SCEN          0.04419    0.04611    0.96  0.33836    
FEELNAT:LOC_SENS  0.05374    0.02953    1.82  0.06935 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.897 on 560 degrees of freedom
  (178 observations deleted due to missingness)
Multiple R-squared:  0.19,  Adjusted R-squared:  0.178 
F-statistic: 16.4 on 8 and 560 DF,  p-value: <2e-16


$FA

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.392 -0.518  0.092  0.592  2.566 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.002857   0.037976   -0.08  0.94006    
LOC_VISE          0.127624   0.041264    3.09  0.00208 ** 
LOC_FAUN          0.176321   0.041690    4.23  2.7e-05 ***
LNOISE            0.133255   0.040187    3.32  0.00097 ***
RL_NDVI          -0.132538   0.039072   -3.39  0.00074 ***
FEELNAT           0.168093   0.045299    3.71  0.00023 ***
LOC_SCEN          0.163809   0.045689    3.59  0.00037 ***
FEELNAT:LOC_SCEN -0.000854   0.031493   -0.03  0.97837    
RL_NDVI:LOC_SCEN  0.024030   0.034060    0.71  0.48079    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.865 on 556 degrees of freedom
  (182 observations deleted due to missingness)
Multiple R-squared:  0.255, Adjusted R-squared:  0.245 
F-statistic: 23.8 on 8 and 556 DF,  p-value: <2e-16


$BA

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.966 -0.512  0.070  0.647  2.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.00836    0.03572   -0.23  0.81498    
LOC_VISE     0.12118    0.04046    3.00  0.00284 ** 
FEELNAT      0.18712    0.03684    5.08  4.9e-07 ***
LOC_SENS     0.14908    0.04055    3.68  0.00025 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.948 on 700 degrees of freedom
  (43 observations deleted due to missingness)
Multiple R-squared:  0.115, Adjusted R-squared:  0.111 
F-statistic: 30.2 on 3 and 700 DF,  p-value: <2e-16


$EC

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9353 -0.5355 -0.0494  0.6754  2.3735 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.00940    0.03666   -0.26  0.79767    
LOC_SENS       0.14370    0.04085    3.52  0.00046 ***
LCFOREST_sqrt -0.09087    0.03751   -2.42  0.01566 *  
LOC_SCEN       0.00375    0.04121    0.09  0.92746    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.986 on 719 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.0304,    Adjusted R-squared:  0.0263 
F-statistic:  7.5 on 3 and 719 DF,  p-value: 5.99e-05


$ES

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.492 -0.539  0.140  0.688  2.288 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.03143    0.04159    0.76   0.4500    
LNOISE          0.13206    0.04315    3.06   0.0023 ** 
LOC_SENS        0.09993    0.04085    2.45   0.0147 *  
DISTKM_sqrt     0.08078    0.04039    2.00   0.0460 *  
FEELNAT         0.25739    0.04639    5.55  4.4e-08 ***
LNOISE:FEELNAT -0.00638    0.03426   -0.19   0.8522    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.95 on 576 degrees of freedom
  (165 observations deleted due to missingness)
Multiple R-squared:  0.141, Adjusted R-squared:  0.133 
F-statistic: 18.9 on 5 and 576 DF,  p-value: <2e-16

Legacy Code

In [9]:
Variable selection over multiple y (not desired)

# library(grpreg)
# fit <- cv.grpreg(X = model.matrix(fit_MED_GIS)[,-1], y = D_trn[Mediator_vars[1:2]])
# coef(fit) |> t()
# fit$beta
# plot(fit)
In [10]:
Linear models with mice

Y <- D[]

library(mice, quietly = TRUE)
library(car, quietly = TRUE)
library(miceadds, quietly = TRUE)
data(nhanes2, package = "mice")
set.seed(9090)

mi.res <- miceadds::mice.1chain(nhanes2, burnin = 4, iter = 20, Nimp = 8)
an2a <- miceadds::mi.anova(mi.res = mi.res, formula = "bmi ~ age * chl")

mod1 <- with(mi.res, stats::lm(bmi ~ age * chl))
mod0 <- with(mi.res, stats::lm(bmi ~ age + chl))

mitml::testModels(model = mod1$analyses, null.model = mod0$analyses, method = "D1")
mitml::testModels(model = mod1$analyses, null.model = mod0$analyses, method = "D2")

an2b <- miceadds::mi.anova(mi.res = mi.res, formula = "bmi ~ age * chl", type = 3)
In [11]:
Mediators ~ GIS
stop("this should not run")
Res1 <- list()
for (mediator in Mediator_vars) {
  full_model <- lm(as.formula(paste0(
    mediator, " ~ ", 
    "HM_NOISE_nrm * (", paste(GIS_vars, collapse = " + "), ")"
  )), D_trn)
  small_model <- step(full_model, trace = FALSE, k = log(nrow(D_trn)))
  Res1[[mediator]] <- lm(formula(small_model), D_tst)
}
lapply(Res1, summary)
In [12]:
PRS ~ Mediators
Res2 <- list()
for (mediator in Mediator_vars) {
  full_model <- lm(as.formula(paste0(
    mediator, " ~ ", 
    "HM_NOISE_nrm * (", paste(GIS_vars, collapse = " + "), ")"
  )), D_trn)
  small_model <- step(full_model, trace = FALSE, k = log(nrow(D_trn)))
  Res2[[mediator]] <- lm(formula(small_model), D_tst)
}
lapply(Res2, summary)