Back to Article
Hypothesis Testing
Download Source

Hypothesis Testing

Author

Lukas Graz

Published

February 13, 2025

In [1]:
Code
source("R/data_prep.R")
Number of matches per filter criteria (not disjoint)
  Headphone  PRS_all_NA    Distance Activity_NA    Duration  HMNoise_NA 
        303         226         221         102          96          96 
JourneyTime 
         20 
Keep  1494 of 2206 observations
Imputing PRS_orig_vars
Code
options(digits = 3)

# interactive <- function() FALSE
# D_trn$HM_NOISE_nrm <- scale(D_trn$HM_NOISE)
# D_tst$HM_NOISE_nrm <- scale(D_tst$HM_NOISE)
## No suspicious patterns in missing data
# mice::md.pattern(D[c(Mediator_vars, GIS_vars)[nNAs>0]], plot = FALSE)
# mice::md.pattern(D[PRS_orig_vars])

Linear Modeling

Imputation with MissForest on Training Data

In [2]:
Number of NAs in Mediators and GIS variables
sapply(D[Mediator_vars], \(x) sum(is.na(x)))
 FEELNAT   LNOISE LOC_SENS LOC_SOUN LOC_SCEN LOC_VISE LOC_VEGE LOC_FAUN 
      16      291       28       30       36       62       69       88 
Number of NAs in Mediators and GIS variables
sapply(D[GIS_vars], \(x) sum(is.na(x)))
  LCARTIF_sqrt  LCFOREST_sqrt          HETER    OVDIST_sqrt     VIS5K_sqrt 
             0              0              0              0              0 
       RL_NDVI       RL_NOISE    DISTKM_sqrt   JNYTIME_sqrt STRIMP123_sqrt 
             0              0              0             86              0 
STRIMP999_sqrt 
             0 
In [3]:
Impute missing values using MissForest
# Mediator imputation
D_trn[Mediator_vars] <- xfun::cache_rds({
  missForest(as.matrix(D_trn[Mediator_vars]))
  }, 
  file = "Mediator_imputation.rds", 
  dir = "cache/",
  hash = list(as.matrix(D_trn[Mediator_vars]))
)$ximp |> as.data.frame()

# GIS imputation (missForest)
D_trn[GIS_vars] <- xfun::cache_rds({
  missForest(as.matrix(D_trn[GIS_vars]))
  }, 
  file = "GIS_imputation.rds", 
  dir = "cache/",
  hash = list(as.matrix(D_trn[GIS_vars]))
)$ximp |> as.data.frame()

Remove “LOC_” variables from Mediators

In [4]:
Mediator_vars <- grep("LOC_", Mediator_vars, invert = TRUE, value = TRUE)

Scaling Test Data

In [5]:
Scaling variables and show old scale
all_vars <- c(Mediator_vars, GIS_vars, PRS_vars)
old_scale <- t(sapply(D_tst[c(all_vars)], \(x) 
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))))

D_tst[c(all_vars)] <- lapply(D_tst[c(all_vars)], scale)
D_trn_scaled <- D_trn
D_trn_scaled[c(all_vars)] <- lapply(D_trn[c(all_vars)], scale)

old_scale
                 mean     sd
FEELNAT         6.142  1.055
LNOISE          4.210  0.747
LCARTIF_sqrt    0.271  0.269
LCFOREST_sqrt   0.454  0.311
HETER           1.305  0.402
OVDIST_sqrt    21.797 10.144
VIS5K_sqrt      3.323  1.620
RL_NDVI         0.635  0.202
RL_NOISE       41.615  9.261
DISTKM_sqrt     1.473  1.156
JNYTIME_sqrt    3.830  2.247
STRIMP123_sqrt  6.555 10.739
STRIMP999_sqrt 47.557 13.162
MEAN            4.987  0.881
FA              5.266  1.111
BA              5.140  1.157
EC              4.539  1.288
ES              5.010  1.429

Testing VIF

In [6]:
VIF: PRS ~ Mediators + GIS_vars (NO interaction)
car::vif(fit_PRS_MED <- lm(as.formula(paste0(
  PRS_vars[1], 
  " ~ ", 
  paste(Mediator_vars, collapse = " + "), " + ",
  paste(GIS_vars,      collapse = " + ")
)), D_trn)) |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.16    1.37    1.93    2.07    2.31    4.94 
In [7]:
VIF: PRS ~ (Mediators + GIS_vars)^2 (WITH interaction)
suppressMessages(
car::vif(fit_PRS_MED <- lm(as.formula(paste0(
  PRS_vars[1], 
  " ~ ", 
  "(", paste(Mediator_vars, collapse = " + "), 
     " + ", paste(GIS_vars, collapse = " + "), 
  ")^2"
)), D_trn))) |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.46   37.91  103.11  152.87  214.68  928.44 

Since we model with interactions later, the latter VIF are relevant for us. Given that they are very high (c.f. median and max), we would have no hope of finding any significant results in the full interaction model. Therefore, we will first perfom a variable selection, to reduce the VIF and enable us to find significant effects.

All Interactions: Mediators ~ (GIS)^2

In [8]:
Code: Print Coef Table

# Elegant function to create coefficient tables from model summaries
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code: Print Coef Table

library(tidyr)
library(knitr)
library(purrr)

#' Create a formatted coefficient table from model summary list
#' 
#' @param model_summaries List of model summaries (e.g., output from lapply(models, summary))
#' @param sig_threshold Significance threshold for bold formatting (default: 0.001)
#' @param covariate_order Optional vector specifying order of covariates
#' @return Formatted table with models as columns and covariates as rows
create_coef_table <- function(model_summaries, sig_threshold = 0.001, covariate_order = NULL) {
  
  # Extract and format coefficients for all models
  format_model_coef <- function(coef_matrix, model_name) {
    estimates <- coef_matrix[, "Estimate"]
    p_values <- coef_matrix[, "Pr(>|t|)"]
    
    # Format with significance stars (common notation)
    formatted_coef <- sapply(seq_along(estimates), function(i) {
      est_str <- sprintf("%.3f", estimates[i])
      stars <- case_when(
        p_values[i] < 0.001 ~ "***",
        p_values[i] < 0.01 ~ "**",
        p_values[i] < 0.05 ~ "*",
        p_values[i] < 0.1 ~ ".",
        TRUE ~ ""
      )
      paste0(est_str, stars)
    })
    
    tibble(
      Model = model_name,
      Covariate = rownames(coef_matrix),
      Coefficient = formatted_coef
    )
  }
  
  # Process all models and create wide table
  coef_list <- map(model_summaries, coef)
  
  results_table <- map2_dfr(coef_list, names(coef_list), format_model_coef) %>%
    pivot_wider(names_from = Model, values_from = Coefficient, values_fill = "")
  
  # Apply covariate ordering
  if (is.null(covariate_order)) {
    # Default: Intercept first, then alphabetical
    all_covariates <- unique(results_table$Covariate)
    covariate_order <- c("(Intercept)", sort(all_covariates[all_covariates != "(Intercept)"]))
  }
  
  # Filter and reorder covariates
  results_table <- results_table %>%
    filter(Covariate %in% covariate_order) %>%
    slice(match(covariate_order, Covariate))
  
  kable(results_table, 
        format = "pipe",
        align = c("l", rep("c", ncol(results_table) - 1)))
}
In [9]:
Res3 <- list()
for (mediator in Mediator_vars) {
  intercept_model <- lm(as.formula(paste0(
    mediator, " ~ 1")), D_trn)
  step_model <- step(intercept_model, 
    scope = as.formula(paste0(
      mediator, " ~ ", 
      "(", paste(GIS_vars, collapse = " + "), ")^2"
    )),
    trace = FALSE, k = log(nrow(D_trn))
  )
  Res3[[mediator]] <- lm(formula(step_model), D_tst)
}
(ResSum3 <- lapply(Res3, summary))
$FEELNAT

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-4.959 -0.391  0.264  0.607  1.685 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.0618     0.0410    1.51  0.13260    
LCARTIF_sqrt          -0.1524     0.0570   -2.67  0.00770 ** 
RL_NDVI                0.1498     0.0436    3.43  0.00063 ***
OVDIST_sqrt            0.0270     0.0452    0.60  0.55112    
LCARTIF_sqrt:RL_NDVI   0.1146     0.0402    2.85  0.00446 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.948 on 733 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.106, Adjusted R-squared:  0.101 
F-statistic: 21.8 on 4 and 733 DF,  p-value: <2e-16


$LNOISE

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.999 -0.527  0.012  0.676  1.622 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.00097    0.03862   -0.03    0.980    
LCARTIF_sqrt -0.12357    0.04841   -2.55    0.011 *  
RL_NOISE     -0.24203    0.04890   -4.95  9.7e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.945 on 596 degrees of freedom
  (148 observations deleted due to missingness)
Multiple R-squared:  0.11,  Adjusted R-squared:  0.107 
F-statistic: 36.6 on 2 and 596 DF,  p-value: 9.79e-16

Visualization of Interactioneffect of LCARTIF_sqrt * RL_NDVI

In [10]:

library(ggplot2)
D_plot <- rbind(D_trn_scaled, D_tst)
D_plot$FEELNAT <- pmax(-1,pmin(1,D_plot$FEELNAT))
ggplot(D_plot, aes(x = LCARTIF_sqrt, y=RL_NDVI, col = FEELNAT)) +
  geom_jitter(width=0.07)


# Erstelle RL_NDVI Gruppen
D_plot$RL_NDVI_group <- cut(D_plot$RL_NDVI, 
                          breaks = c(-Inf, -0.25, 0.25, Inf),
                          labels = c("Low (< -0.25)", "Medium (-0.25 to 0.25)", "High (> 0.25)"))

ggplot(D_plot, aes(x = LCARTIF_sqrt, y = FEELNAT, 
                  col = RL_NDVI_group, shape = RL_NDVI_group)) +
 geom_point(width = 0.07, alpha = 0.6) +
 geom_smooth(method = "loess", se = TRUE) +
 scale_shape_manual(values = c(16, 17, 15)) +  # Kreis, Dreieck, Quadrat
 scale_color_manual(values = c("red", "blue", "green")) +
 labs(x = "LCARTIF_sqrt", 
      y = "FEELNAT",
      color = "RL_NDVI Group",
      shape = "RL_NDVI Group") +
 theme_minimal()
Warning in geom_point(width = 0.07, alpha = 0.6): Ignoring unknown parameters:
`width`
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 9 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).

All Interactions: PRS ~ (Mediators + GIS)^2

In [11]:
Res4 <- list()
for (prs in PRS_vars) {
  intercept_model <- lm(as.formula(paste0(
    prs, " ~ 1")), D_trn)
  step_model <- step(intercept_model, 
    scope = as.formula(paste0(
      prs, " ~ ", 
      "(", paste(GIS_vars, collapse = " + "), " + ", 
      paste(Mediator_vars, collapse = " + "), ")^2"
    )),
    trace = FALSE, k = log(nrow(D_trn))
  )
  Res4[[prs]] <- lm(formula(step_model), D_tst)
}
(ResSum4 <- lapply(Res4, summary))
$MEAN

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.204 -0.587 -0.018  0.567  2.396 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.00524    0.04015   -0.13    0.896    
FEELNAT         0.25177    0.04540    5.55  4.4e-08 ***
LNOISE          0.21239    0.04182    5.08  5.1e-07 ***
LCFOREST_sqrt  -0.06579    0.03901   -1.69    0.092 .  
DISTKM_sqrt    -0.01873    0.03878   -0.48    0.629    
FEELNAT:LNOISE  0.02186    0.03321    0.66    0.511    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.926 on 586 degrees of freedom
  (155 observations deleted due to missingness)
Multiple R-squared:  0.136, Adjusted R-squared:  0.129 
F-statistic: 18.5 on 5 and 586 DF,  p-value: <2e-16


$FA

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.799 -0.580  0.032  0.634  2.263 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.00562    0.03998    0.14   0.8883    
FEELNAT         0.26555    0.04521    5.87  7.1e-09 ***
LNOISE          0.17048    0.04140    4.12  4.4e-05 ***
RL_NDVI        -0.10495    0.03887   -2.70   0.0071 ** 
FEELNAT:LNOISE -0.03572    0.03300   -1.08   0.2795    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.922 on 587 degrees of freedom
  (155 observations deleted due to missingness)
Multiple R-squared:  0.143, Adjusted R-squared:  0.137 
F-statistic: 24.5 on 4 and 587 DF,  p-value: <2e-16


$BA

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.775 -0.607 -0.030  0.675  2.846 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.00788    0.03554   -0.22     0.82    
FEELNAT      0.25245    0.03557    7.10    3e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.966 on 736 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.0641,    Adjusted R-squared:  0.0628 
F-statistic: 50.4 on 1 and 736 DF,  p-value: 2.99e-12


$EC

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8576 -0.5267 -0.0496  0.6828  2.1388 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)   -0.00646    0.03651   -0.18   0.8595   
LCFOREST_sqrt -0.11357    0.03769   -3.01   0.0027 **
FEELNAT        0.05802    0.03766    1.54   0.1238   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.992 on 735 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.0131,    Adjusted R-squared:  0.0104 
F-statistic: 4.89 on 2 and 735 DF,  p-value: 0.00778


$ES

Call:
lm(formula = formula(step_model), data = D_tst)

Residuals:
   Min     1Q Median     3Q    Max 
-3.397 -0.533  0.159  0.693  2.365 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.0271     0.0415    0.65    0.515    
LNOISE           0.1418     0.0430    3.29    0.001 ** 
FEELNAT          0.2697     0.0460    5.86  7.6e-09 ***
DISTKM_sqrt      0.0727     0.0401    1.81    0.070 .  
LNOISE:FEELNAT  -0.0127     0.0343   -0.37    0.711    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.958 on 587 degrees of freedom
  (155 observations deleted due to missingness)
Multiple R-squared:  0.128, Adjusted R-squared:  0.122 
F-statistic: 21.6 on 4 and 587 DF,  p-value: <2e-16

Table Summarizing Coefficients

Significant codes as usual: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All Interactions: Mediators ~ (GIS)^2

In [12]:
ResTab3 <- create_coef_table(ResSum3)
saveRDS(ResTab3, "cache/ResSum3.rds")
ResTab3
Covariate FEELNAT LNOISE
(Intercept) 0.062 -0.001
LCARTIF_sqrt -0.152** -0.124*
LCARTIF_sqrt:RL_NDVI 0.115**
OVDIST_sqrt 0.027
RL_NDVI 0.150***
RL_NOISE -0.242***

PRS ~ (Mediators + GIS)^2

In [13]:
ResTab4 <- create_coef_table(ResSum4)
saveRDS(ResTab4, "cache/ResSum4.rds")
ResTab4
Covariate MEAN FA BA EC ES
(Intercept) -0.005 0.006 -0.008 -0.006 0.027
DISTKM_sqrt -0.019 0.073.
FEELNAT 0.252*** 0.266*** 0.252*** 0.058 0.270***
FEELNAT:LNOISE 0.022 -0.036
LCFOREST_sqrt -0.066. -0.114**
LNOISE 0.212*** 0.170*** 0.142**
LNOISE:FEELNAT -0.013
RL_NDVI -0.105**