Part 3: Secondary Data Tools for Causal Identification
This part covers three secondary-data tools for causal identification — instrumental variables (IV), regression discontinuity design (RDD), and difference-in-differences (DiD). They are not three unrelated methods. They share a common logic: find variation in the treatment that is as good as random — variation that is unrelated to the unobserved confounders that make simple comparisons misleading. IV finds that variation through an external lever; RDD finds it at a sharp threshold; DiD finds it in a policy timing that differs across groups. Understanding IV first makes RDD and DiD easier to see as extensions of the same core idea.
Instrumental Variables
Scott Cunningham’s Causal Inference: The Mixtape has a fantastic, and more thorough, deep dive into instrumental variables at mixtape.scunning.com/07-instrumental_variables.
The core problem
Part 2 showed that when an unobserved variable \(U\) causes both treatment \(X\) and outcome \(Y\), no amount of regression adjustment on observed covariates can recover the causal effect of \(X\) on \(Y\). The path \(X \leftarrow U \rightarrow Y\) is a backdoor that remains open whenever \(U\) is unmeasured.
In the eco-label context: suppose stores that display eco-labels (\(X = 1\)) tend to be in neighbourhoods with higher baseline environmental orientation (\(U\)). \(U\) independently raises consumer WTP (\(Y\)). A naive comparison of WTP between labelling and non-labelling stores overstates the label’s causal effect, because part of the WTP gap is driven by neighbourhood composition, not the label.
The IV solution
An instrument \(Z\) provides a second source of variation in \(X\) — one that is unrelated to \(U\) and therefore free of confounding. Instead of using all variation in \(X\) to estimate its effect on \(Y\), IV uses only the variation in \(X\) that comes from \(Z\). Because \(Z\) is unrelated to \(U\), this slice of variation in \(X\) is clean.
For \(Z\) to be a valid instrument, three conditions must hold:
| Condition | What it requires | Testable? |
|---|---|---|
| Relevance | \(Z\) predicts \(X\) (strong first stage) | Yes — run the first-stage regression and check \(F > 10\) |
| Independence | \(Z\) is unrelated to \(U\) (and to everything else that affects \(Y\) except through \(X\)) | No — requires theory or design |
| Exclusion | \(Z\) affects \(Y\) only through \(X\), not through any direct path | No — requires theory |
Relevance is the only condition you can test directly. Independence and exclusion are assumptions — they must be argued from the design, not demonstrated from the data.
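In practice, the relevance check takes two lines of code. A minimal sketch, assuming a data frame `df` containing your treatment `X` and instrument `Z` (the simulation below works through a complete example):
▶ Sketch: testing relevance via the first-stage F
fs <- lm(X ~ Z, data = df) # first-stage regression of treatment on instrument
summary(fs)$fstatistic[1] # rule of thumb: F above 10 signals a usable instrument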
The eco-label pilot: a concrete instrument
The government randomly selects stores to participate in a subsidized eco-labelling pilot program (\(Z = 1\) if selected). Not every selected store actually installs the label — some lack the shelf space, staff time, or interest — so \(X \neq Z\). The instrument works as follows:
- Relevance: being selected for the pilot substantially raises the probability of displaying a label (first stage)
- Independence: pilot selection is random, so \(Z\) is unrelated to neighbourhood eco-orientation \(U\)
- Exclusion: being selected for the pilot affects consumer WTP only through the store actually displaying the label, not through any other route
This is precisely the ITT / LATE framework from Part 1. Pilot selection is the encouragement (\(Z\)), label display is compliance (\(X\)), and the Wald estimator gives the causal effect for compliers — stores that display the label because they were selected for the pilot, not those that would have displayed it anyway.
The Wald estimator
\[\hat{\beta}_{IV} = \frac{\widehat{\text{Cov}}(Y, Z)}{\widehat{\text{Cov}}(X, Z)} = \frac{\text{Effect of } Z \text{ on } Y}{\text{Effect of } Z \text{ on } X} = \frac{\text{ITT}}{\text{First stage}} = \text{LATE}\]
This is also called two-stage least squares (2SLS): first regress \(X\) on \(Z\) to get predicted values \(\hat{X}\); then regress \(Y\) on \(\hat{X}\). The second stage coefficient is \(\hat{\beta}_{IV}\).
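The two-stage recipe can be verified by hand. The sketch below uses self-contained toy data (not the eco-label simulation that follows) and shows that the manual second-stage coefficient equals the Wald ratio. Note that the manual second stage produces incorrect standard errors; use a dedicated routine such as iv_robust() or AER::ivreg() for inference.
▶ Sketch: 2SLS by hand on toy data
set.seed(1)
Z <- rbinom(5000, 1, 0.5) # instrument
U <- rnorm(5000) # unobserved confounder
X <- rbinom(5000, 1, plogis(-1 + 2 * Z + U)) # endogenous treatment
Y <- 1 + 0.5 * X + 0.7 * U + rnorm(5000) # outcome; true effect = 0.5
X_hat <- fitted(lm(X ~ Z)) # stage 1: keep only the Z-driven part of X
coef(lm(Y ~ X_hat))["X_hat"] # stage 2: close to 0.5, unlike the biased OLS
cov(Y, Z) / cov(X, Z) # identical to the Wald ratio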
Simulation
▶ Simulate IV setting: U confounds X → Y; Z is a valid instrument
set.seed(2024)
N_iv <- 600
# ── YOUR DATA: U is the unobserved confounder (not in your dataset);
# Z is your instrument (must be in your dataset); X is your endogenous treatment;
# Y is your outcome. The structural equations below define the DGP.
U <- rnorm(N_iv) # neighbourhood eco-orientation: UNOBSERVED
Z <- rbinom(N_iv, 1, 0.5) # random pilot selection: instrument
# Compliance: high-U stores adopt even without the pilot (always-takers);
# pilot selection tilts borderline stores into adoption (compliers)
X <- rbinom(N_iv, 1, plogis(-2.0 + 2.5 * Z + 1.0 * U)) # label adoption
# True causal effect of the label = $0.80; U creates confounding
Y <- 5.00 + 0.80 * X + 0.60 * U + rnorm(N_iv, 0, 0.50) # WTP
df_iv <- tibble(Y, X, Z, U_true = U)
# Compliance breakdown from observed (X, Z) pairs. Under monotonicity (no
# defiers), X = 1 & Z = 0 identifies always-takers and X = 0 & Z = 1 identifies
# never-takers, but the X == Z cells mix compliers with always-/never-takers,
# so the "Complier" count below overstates the true complier share (compare
# the 43.8% first-stage estimate printed afterwards)
compliance <- case_when(
X == 1 & Z == 0 ~ "Always-taker",
X == 0 & Z == 1 ~ "Never-taker",
X == Z ~ "Complier",
TRUE ~ "Defier"
)
cat("Compliance breakdown:\n")Compliance breakdown:
print(table(compliance))
compliance
Always-taker Complier Never-taker
50 434 116
cat(sprintf("\nFirst-stage compliance rate: %.1f%%\n", 100 * (mean(X[Z==1]) - mean(X[Z==0]))))
First-stage compliance rate: 43.8%
▶ OLS vs. IV: OLS is biased; IV recovers the true effect
# ── YOUR DATA: replace Y ~ X with your outcome ~ treatment; replace | Z with
# | your_instrument. The pipe syntax means "instrument for X using Z".
# Add covariates on both sides of | to control for observed confounders
# (e.g., Y ~ X + age + income | Z + age + income).
# OLS: biased — uses all variation in X, including the U-driven part
ols_iv <- lm_robust(Y ~ X, data = df_iv)
# IV (2SLS): uses only the Z-driven variation in X
iv_est <- iv_robust(Y ~ X | Z, data = df_iv)
# ── CHECK: first-stage F-statistic should exceed 10 (rule of thumb for strong
# instruments). Weak instruments (F < 10) produce severely biased IV estimates
# and wide confidence intervals.
fs_F <- summary(lm(X ~ Z, data = df_iv))$fstatistic[1] # first-stage F-statistic
tibble(
Method = c("OLS (biased — ignores U)", "IV / 2SLS (valid — uses Z only)"),
Estimate = round(c(coef(ols_iv)["X"], coef(iv_est)["X"]), 3),
SE = round(c(ols_iv$std.error["X"], iv_est$std.error["X"]), 3),
`95% CI` = sprintf("[%.3f, %.3f]",
c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
c(ols_iv$conf.high["X"], iv_est$conf.high["X"])),
`True effect` = 0.80
) |> knitr::kable(caption = sprintf(
"OLS vs. IV: true eco-label effect = $0.80 | first-stage F = %.1f | N = %d", fs_F, N_iv))| Method | Estimate | SE | 95% CI | True effect |
|---|---|---|---|---|
| OLS (biased — ignores U) | 1.151 | 0.060 | [1.033, 1.270] | 0.8 |
| IV / 2SLS (valid — uses Z only) | 0.734 | 0.138 | [0.462, 1.006] | 0.8 |
▶ First stage: does Z predict X?
df_iv |>
group_by(Z) |>
summarise(P_X = mean(X), .groups="drop") |>
ggplot(aes(x = factor(Z, labels=c("Not selected\n(Z = 0)", "Selected\n(Z = 1)")),
y = P_X, fill = factor(Z))) +
geom_col(width = 0.5, alpha = 0.85) +
geom_text(aes(label = sprintf("%.0f%%", 100 * P_X)), vjust = -0.4, size = 4) +
scale_fill_manual(values = c("0" = clr_ctrl, "1" = clr_eco), guide = "none") +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
labs(x = "Pilot selection (instrument Z)", y = "Proportion implementing label (X)",
title = "First stage: pilot selection strongly predicts label adoption",
subtitle = sprintf("F = %.1f — well above the weak-instrument threshold of 10", fs_F)) +
theme_mod3()
▶ OLS vs. IV coefficient comparison
tibble(
Method = factor(c("OLS (biased)", "IV / 2SLS (valid)"),
levels = c("OLS (biased)", "IV / 2SLS (valid)")),
Estimate = c(coef(ols_iv)["X"], coef(iv_est)["X"]),
lo = c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
hi = c(ols_iv$conf.high["X"], iv_est$conf.high["X"])
) |>
ggplot(aes(y = Method, x = Estimate, xmin = lo, xmax = hi, colour = Method)) +
geom_vline(xintercept = 0.80, linetype = "dashed", colour = "grey40", linewidth = 0.9) +
geom_pointrange(size = 1, linewidth = 1.2) +
scale_colour_manual(values = c("OLS (biased)" = clr_ctrl, "IV / 2SLS (valid)" = clr_eco),
guide = "none") +
scale_x_continuous(limits = c(0.5, 1.6),
labels = function(x) sprintf("$%.2f", x)) +
labs(y = NULL, x = "Estimated eco-label effect",
title = "OLS overstates the effect; IV recovers the truth",
subtitle = "Dashed line = true effect ($0.80) | OLS biased upward by U") +
theme_mod3()
The Danger of a Weak Instrument
The simulation above uses a strong instrument — pilot selection has a large, reliable first-stage relationship with label adoption. Real instruments are rarely so clean. A weak instrument is one where \(Z\) barely predicts \(X\): the first-stage \(F\) is low, and only a small fraction of the variation in \(X\) can be traced to \(Z\).
The consequence is severe and somewhat counterintuitive. Recall the Wald estimator:
\[\hat{\beta}_{IV} = \frac{\text{Cov}(Y,\, Z)}{\text{Cov}(X,\, Z)}\]
When the denominator — the first stage — is close to zero, two things go wrong simultaneously. First, the estimator’s variance explodes: dividing by a small number amplifies every source of noise, producing confidence intervals that are orders of magnitude wider than OLS. Second, any slight violation of the exclusion restriction gets amplified by the same factor. Even if \(Z\) has a tiny direct effect on \(Y\) that would normally be negligible — say, the announcement of pilot selection itself slightly changed consumer expectations — dividing that small numerator violation by a near-zero denominator turns it into a large bias in the IV estimate. The result is an estimate that can be more biased than OLS, not less, with confidence intervals whose nominal coverage can no longer be trusted.
▶ Simulate a weak instrument: low first-stage F leads to biased, imprecise IV
set.seed(2024)
N_wk <- 600
U_wk <- rnorm(N_wk)
Z_wk <- rbinom(N_wk, 1, 0.5)
# Weak compliance: pilot selection barely shifts adoption probability
X_wk <- rbinom(N_wk, 1, plogis(-1.5 + 0.30 * Z_wk + 1.2 * U_wk))
Y_wk <- 5.00 + 0.80 * X_wk + 0.60 * U_wk + rnorm(N_wk, 0, 0.50)
df_wk <- tibble(Y = Y_wk, X = X_wk, Z = Z_wk)
ols_wk <- lm_robust(Y ~ X, data = df_wk)
iv_wk <- iv_robust(Y ~ X | Z, data = df_wk)
fs_wk <- summary(lm(X ~ Z, data = df_wk))$fstatistic[1]
cat(sprintf("Weak first-stage F = %.1f (well below the threshold of 10)\n", fs_wk))Weak first-stage F = 0.2 (well below the threshold of 10)
▶ Strong vs. weak instrument: estimates, SEs, and confidence intervals
bind_rows(
tibble(Instrument = sprintf("Strong (F = %.0f)", fs_F),
Method = c("OLS", "IV / 2SLS"),
Estimate = c(coef(ols_iv)["X"], coef(iv_est)["X"]),
SE = c(ols_iv$std.error["X"], iv_est$std.error["X"]),
lo = c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
hi = c(ols_iv$conf.high["X"], iv_est$conf.high["X"])),
tibble(Instrument = sprintf("Weak (F = %.0f)", fs_wk),
Method = c("OLS", "IV / 2SLS"),
Estimate = c(coef(ols_wk)["X"], coef(iv_wk)["X"]),
SE = c(ols_wk$std.error["X"], iv_wk$std.error["X"]),
lo = c(ols_wk$conf.low["X"], iv_wk$conf.low["X"]),
hi = c(ols_wk$conf.high["X"], iv_wk$conf.high["X"]))
) |>
mutate(`95% CI` = sprintf("[%.2f, %.2f]", lo, hi),
Estimate = round(Estimate, 3),
SE = round(SE, 3)) |>
select(Instrument, Method, Estimate, SE, `95% CI`) |>
knitr::kable(caption = "Strong vs. weak instrument: true eco-label effect = $0.80. Weak IV is imprecise and biased.")| Instrument | Method | Estimate | SE | 95% CI |
|---|---|---|---|---|
| Strong (F = 154) | OLS | 1.151 | 0.060 | [1.03, 1.27] |
| Strong (F = 154) | IV / 2SLS | 0.734 | 0.138 | [0.46, 1.01] |
| Weak (F = 0) | OLS | 1.359 | 0.061 | [1.24, 1.48] |
| Weak (F = 0) | IV / 2SLS | -1.077 | 6.737 | [-14.31, 12.15] |
▶ Visualise strong vs. weak IV: coefficient estimates and CIs
bind_rows(
tibble(Instrument = factor("Strong instrument", levels = c("Strong instrument","Weak instrument")),
Method = factor(c("OLS","IV / 2SLS"), levels = c("OLS","IV / 2SLS")),
Estimate = c(coef(ols_iv)["X"], coef(iv_est)["X"]),
lo = c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
hi = c(ols_iv$conf.high["X"], iv_est$conf.high["X"])),
tibble(Instrument = factor("Weak instrument", levels = c("Strong instrument","Weak instrument")),
Method = factor(c("OLS","IV / 2SLS"), levels = c("OLS","IV / 2SLS")),
Estimate = c(coef(ols_wk)["X"], coef(iv_wk)["X"]),
lo = c(ols_wk$conf.low["X"], iv_wk$conf.low["X"]),
hi = c(ols_wk$conf.high["X"], iv_wk$conf.high["X"]))
) |>
ggplot(aes(y = Method, x = Estimate, xmin = lo, xmax = hi, colour = Method)) +
geom_vline(xintercept = 0.80, linetype = "dashed", colour = "grey40", linewidth = 0.9) +
geom_pointrange(size = 0.9, linewidth = 1.1) +
facet_wrap(~Instrument) +
scale_colour_manual(values = c("OLS" = clr_ctrl, "IV / 2SLS" = clr_eco), guide = "none") +
labs(y = NULL, x = "Estimated eco-label effect ($)",
title = "Weak instruments: the CI explodes and the estimate may be worse than OLS",
subtitle = "Dashed line = true effect ($0.80) | Weak IV amplifies noise and exclusion-restriction violations") +
theme_mod3()
The weak instrument problem has a direct parallel in Module 1’s concept of discriminant validity. Recall that discriminant validity requires a scale to clearly differentiate its target construct from adjacent constructs. A scale with poor discriminant validity picks up variance from multiple constructs simultaneously — it cannot cleanly isolate what it is supposed to measure.
A weak instrument fails for the same structural reason, but at the causal level. The instrument \(Z\) is supposed to isolate a specific pathway — \(Z \rightarrow X \rightarrow Y\) — and exclude all others. When \(Z\) weakly predicts \(X\), it fails to adequately differentiate the intended causal channel from background noise. Just as a scale with a high HTMT ratio cannot clearly separate itself from an adjacent construct, a weak instrument cannot clearly separate the X-pathway from the confounding pathways. The clean variation in \(X\) that \(Z\) provides is so small that any contamination from unmeasured channels dominates the estimate.
The Wald estimator makes this explicit: whatever tiny amount of \(Z\)-driven variation exists in \(X\), the second stage amplifies it to recover the causal effect — but the amplification is indiscriminate. It magnifies the causal signal and any violation of independence or exclusion equally. A negligible exclusion-restriction violation that would be safely ignorable with a strong instrument becomes the dominant source of bias when the instrument is weak. This mirrors what happens in Module 1 when a scale has poor discriminant validity: contamination from adjacent constructs overwhelms the target signal.
Practical takeaway: Test the first stage before trusting your IV estimate, just as you would run HTMT and CFA before trusting scale scores. An F-statistic below 10 is a warning that your instrument does not isolate its intended pathway clearly enough for IV to be informative.
The IV estimate is the LATE: the causal effect of the eco-label for complier stores — those that installed the label because they were selected for the pilot, not those that would have installed it regardless. Always-takers (high-U stores that adopt with or without the pilot) and never-takers (stores that never adopt) do not contribute to the IV estimate.
This is identical to the LATE from Part 1 of this module. The instrument is a different mechanism — a government lottery rather than a laboratory randomisation — but the estimand is the same: the average effect for units whose treatment status was actually changed by the instrument.
Practical limits of IV:
- Weak instruments (\(F < 10\)) produce severely biased estimates that can be worse than OLS
- Instrument validity is untestable in full — independence and exclusion rest on argument, not data
- LATE may not generalise — compliers near the instrument may differ systematically from the full population of treated units
From IV to Regression Discontinuity
The eco-label pilot above exploited randomness that a government deliberately created. In most secondary data settings, no one ran a lottery. But sometimes nature or policy creates an instrument-like discontinuity: a threshold that sharply determines treatment for units near it, and that is effectively random in a narrow window around the cutoff.
Regression discontinuity is local IV. The instrument is \(Z_i = \mathbf{1}[\text{audit score}_i \geq 70]\) — being above the threshold. Near the cutoff, audit scores have measurement noise, so whether a product lands at 69 or 71 is close to random. This near-randomness makes \(Z\) approximately independent of unobserved product quality \(U\) in that local window. The exclusion restriction is that the audit score crossing 70 affects WTP only through badge receipt, not through any other route. And relevance is guaranteed — the badge is awarded precisely at 70.
The RDD estimate is therefore a Wald estimator applied locally. In a sharp RDD — where the threshold deterministically assigns treatment (every product at ≥ 70 receives the badge; none below do) — the jump in badge receipt at the cutoff is exactly 1, and the Wald ratio simplifies to the raw jump in WTP: a local average treatment effect (LATE) for products near the threshold.
In a fuzzy RDD, crossing the threshold changes the probability of receiving treatment but does not guarantee it. Some products above 70 may not display the badge; some below may receive it through other means. Here the threshold itself serves as an instrument — crossing 70 predicts badge receipt without perfectly determining it — and the Wald ratio identifies the LATE for cutoff compliers: products whose badge status would change if their observed score crossed the threshold. This is the same logic as the standard IV LATE: the effect is identified for the sub-population whose treatment was moved by the instrument (the threshold crossing), not the full population.
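A sketch of the fuzzy case in code, using the fuzzy= argument of rdrobust (toy data; all variable names here are illustrative, not from the simulation below):
▶ Sketch: fuzzy RDD as local IV with rdrobust
library(rdrobust)
set.seed(1)
score <- runif(2000, 40, 100) # running variable
# Fuzzy take-up: crossing 70 raises badge probability from 20% to 80%
badge <- rbinom(2000, 1, ifelse(score >= 70, 0.8, 0.2))
wtp <- 4 + 0.01 * (score - 70) + 0.6 * badge + rnorm(2000, 0, 0.5)
# fuzzy= makes rdrobust report the local Wald ratio:
# (jump in wtp at 70) / (jump in Pr(badge) at 70), here close to 0.6
summary(rdrobust(y = wtp, x = score, c = 70, fuzzy = badge))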
Regression Discontinuity Design
Scott Cunningham’s Causal Inference: The Mixtape has a fantastic, and more thorough, deep dive into regression discontinuity design at mixtape.scunning.com/06-regression_discontinuity.
AlterEco’s retailer awards a “Certified Sustainable” badge to any coffee product scoring 70 or above on a third-party environmental audit (scored 0–100). Products at 69 miss the badge; products at 71 get it. Close to the cutoff, the small score differences that separate receiving from not receiving the badge are plausibly driven more by auditing variability than by genuine differences in underlying quality — making near-threshold assignment close to random. We exploit this local near-randomness.
The identification logic rests on a continuity assumption: in expectation, products just above and just below 70 would have had similar WTP absent the badge. This is an assumption about the smoothness of potential outcomes near the threshold, not a claim that any two adjacent products are physically identical. It must be argued from context and verified with diagnostics.
This local near-randomness is what makes the jump causal: in expectation, products just above and just below 70 are approximately exchangeable in everything except badge receipt, so the jump in WTP is attributable to the badge.
The cost of this elegance: the estimate is a local average treatment effect (LATE). It tells you the causal effect of receiving the badge for products hovering around a 70-point score — not for top-scoring 90s or low-scoring 40s, which may respond very differently. This LATE is conceptually identical to the LATE from Part 1 of this module and from the IV section above — all three arise because the causal estimate is anchored to a specific sub-population near the cutoff or instrument, not the full population.
A Module 2 connection: The “as-good-as-random near the cutoff” logic is local randomisation — a naturally occurring version of the randomised experiment studied in Module 2. The same assumption that makes experiments valid (exchangeability of treated and control) holds here, but only locally. In Module 2, randomisation made exchangeability a design guarantee; here, it is an empirical claim that must be verified through the diagnostics below.
A Module 1 connection: The running variable itself is a measurement, and the Module 1 validity framework applies directly. But the key threat is not simply random noise. If the observed audit score is the rule that assigns the badge, then auditor imprecision in the score does not automatically attenuate the estimated jump — it shifts the estimand (the effect is for products near 70 on the observed scale, not necessarily on the latent sustainability scale). The more fundamental threat is construct contamination: if the audit score reflects sustainability plus firm size, lobbying capacity, or brand reputation, crossing 70 is not a clean sustainability threshold but a composite one. Units just above and below may differ on those adjacent constructs in ways that independently drive WTP. This is developed in the measurement section below.
A practical check to run first: If brands can game the audit score to land just above 70, the continuity assumption breaks down. The density test below checks for suspicious bunching at the cutoff.
The Running Variable as a Measurement
The running variable is itself a measurement, and the Module 1 validity framework applies to it directly. But the measurement threats to RDD are more specific than the general idea that a noisy running variable is bad. It helps to distinguish three situations.
Recall the Classical Test Theory decomposition from Module 1, Part 1:
\[X_{\text{obs}} = T_{\text{true}} + \varepsilon\]
Applied to the audit score: the observed score equals true environmental quality plus auditor noise. Whether that noise creates problems for RDD depends on how the running variable is used and what it is actually measuring.
Case A — Observed score is the assignment rule (the typical situation). The badge is awarded to every product scoring ≥ 70 on the observed audit score. Here, random auditor noise does not automatically attenuate the estimated discontinuity or invalidate the design. The sharp step in treatment assignment exists at the observed threshold, and the RDD estimates the causal effect of crossing that threshold. What noise does change is the estimand: the effect is identified for products near 70 on the observed scale, which may include products with a range of true sustainability levels due to auditor imprecision. This is a consideration for interpretation — the LATE is for “products that scored near 70” rather than “products whose true sustainability is near 70” — but it is not a design failure.
Case B — True score determines treatment, but the researcher observes a noisy proxy. If the certifying body awards badges based on a latent quality standard and the observed audit score is an imperfect proxy for that standard, noise can genuinely blur the discontinuity. Misclassification in both directions near the cutoff — some high-quality products score below 70, some lower-quality products score above — produces a gradual slope rather than a sharp step, attenuating the estimated jump. This is the analogue of attenuation bias in OLS when predictors are measured with error (Module 1, Part 2). In practice, Case A is more common: the observed score is the actual assignment rule.
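A quick sketch of Case B on toy data (an assumed setup, not the main simulation): the badge follows the latent score, the researcher observes a noisy proxy, and the step in treatment probability at the observed cutoff becomes a slope.
▶ Sketch: Case B, a noisy proxy of the true assignment score
set.seed(1)
true_score <- runif(5000, 40, 100)
obs_score <- true_score + rnorm(5000, 0, 4) # CTT: observed = true + noise
D <- as.integer(true_score >= 70) # badge follows the TRUE score
# Pr(badge) by observed-score bin: a gradual slope through 70, not a 0-to-1 jump
round(tapply(D, cut(obs_score, breaks = c(60, 65, 70, 75, 80)), mean), 2)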
Case C — Construct contamination (the principal Module 1 threat). This is the most important case and the one most directly connected to Module 1’s concept of discriminant validity. It has nothing to do with random noise. If the audit score systematically reflects sustainability plus firm size, lobbying capacity, auditor relationships, or brand reputation, the threshold is a composite threshold rather than a clean sustainability threshold. Units just above and just below 70 will differ not only in their observed score but on every adjacent construct the score inadvertently absorbs — and some of those constructs independently drive WTP. The jump at the cutoff will then reflect both the badge effect and systematic differences in those confounded traits, with no way to separate them.
When the observed score is the assignment rule, classical random noise does not automatically destroy an RDD. The more fundamental threat is construct contamination: when the running variable loads on adjacent constructs, the cutoff is no longer a clean threshold on the intended construct.
If the audit score conflates environmental quality with firm size, lobbying capacity, or auditor relationships:
- Near-threshold units differ on those adjacent constructs as well as on the measured score
- The estimated jump captures the effect of crossing a composite threshold, not a clean sustainability threshold
- Covariate balance checks cannot detect this if the contaminating constructs were never measured
This is the discriminant validity failure from Module 1, applied to causal identification: a scale that fails discriminant validity cannot cleanly isolate its target construct, and a running variable that fails discriminant validity cannot cleanly define what crossing its threshold means.
How contamination enables gaming. When brand reputation or lobbying capacity is embedded in the audit score, high-reputation brands near the threshold can exploit that contamination to justify re-auditing or appealing borderline scores. Their resources — not their environmental quality — give them repeated attempts to cross 70. The Monte Carlo simulation below shows how this dynamic generates severe Type I error even when active manipulation is relatively uncommon.
The Module 1 checklist applied to your running variable. Before running an RDD, ask the measurement validity questions from Module 1 about the running variable itself:
- Content validity: Does the running variable actually cover the full domain of the construct it is supposed to measure? An environmental audit that focuses heavily on energy use and packaging but ignores supply chain labour practices, land-use impact, or end-of-life product handling has poor content validity. It measures some aspects of sustainability, but not the construct in full. A cutoff on this score is then a cutoff on a narrow slice of the construct, not on sustainability itself — and products that score well on the measured facets while performing poorly on the unmeasured ones receive the badge as if they were comprehensively sustainable.
- Construct validity — especially discriminant validity: Does the running variable pick up primarily its intended construct, or does it also load on adjacent ones? This is the key question. Construct validity requires both convergent validity (the score correlates with other measures of the same construct) and — above all — discriminant validity (the score does not correlate highly with measures of different constructs). In the audit example: if a product’s audit score is partially determined by firm size, lobbying capacity, or the brand’s pre-existing relationship with the certifying body, the score fails discriminant validity. It conflates environmental quality with political and economic power. Products near the threshold will then differ not just in sustainability — they will differ in size, connections, and resources. The RDD is no longer estimating the effect of crossing a sustainability threshold; it is estimating the effect of crossing a composite threshold that mixes sustainability, size, and influence. This is precisely the Module 1 problem: a scale with a high HTMT ratio to an adjacent construct cannot cleanly isolate its target, and neither can a running variable that picks up adjacent constructs.
- Measurement invariance: Are audit standards applied consistently across the types of products in your sample? If the same score of 70 means different things for small artisan producers vs. large multinational brands (a form of non-invariance from Module 1, Part 3), the threshold is not a uniform treatment assignment rule — it assigns badges based on different underlying quality levels for different types of firms.
The practical implication is uncomfortable but important: the interpretive precision of an RDD is bounded above by the discriminant validity of its running variable. An RDD on a construct-contaminated running variable can still estimate the causal effect of crossing that particular composite threshold — but it cannot cleanly estimate the effect of crossing a sustainability threshold. A high-quality regression discontinuity study begins with a running variable that demonstrably measures what it claims to measure and demonstrably does not measure what it claims not to measure.
Simulating an RDD
▶ Simulate 1,200 products near the 70-point certification threshold
set.seed(2024)
N_rdd <- 1200; cutoff <- 70; true_jump <- 0.80
# ── YOUR DATA: replace audit_score with your running variable (the continuous
# score that determines treatment assignment), cutoff with the actual policy
# threshold, and WTP_rdd with your outcome variable.
# df_rdd must contain: the running variable, a 0/1 treatment indicator
# (= 1 if running variable >= cutoff), the outcome, and running = running - cutoff.
audit_score <- runif(N_rdd, 30, 100)
treat_rdd <- as.integer(audit_score >= cutoff)
WTP_rdd <- 4.0 + 0.02*(audit_score-cutoff) - 0.0003*(audit_score-cutoff)^2 +
true_jump*treat_rdd + rnorm(N_rdd, 0, 0.60)
WTP_rdd <- pmax(1, pmin(10, WTP_rdd))
df_rdd <- tibble(audit_score, treat_rdd, WTP_rdd, running=audit_score-cutoff)
df_rdd |>
ggplot(aes(x=audit_score, y=WTP_rdd, colour=factor(treat_rdd))) +
geom_point(alpha=0.30, size=0.9) +
geom_vline(xintercept=cutoff, linewidth=1, colour="grey30") +
geom_smooth(method="lm", formula=y~poly(x,2), se=TRUE, alpha=0.2) +
scale_colour_manual(values=c("0"=clr_ctrl,"1"=clr_eco),
labels=c("Below 70 (no badge)","Above 70 (badge)")) +
annotate("text", x=71, y=9.2, label="Cutoff = 70", hjust=0, size=3.5, fontface="bold") +
annotate("segment", x=70, xend=70, y=3.85, yend=4.85, colour="firebrick",
arrow=arrow(ends="both", length=unit(.2,"cm")), linewidth=0.8) +
annotate("text", x=71, y=4.35,
label=sprintf("~$%.2f jump\n(badge effect)", true_jump),
hjust=0, size=3, colour="firebrick") +
labs(x="Audit Score (running variable)", y="Consumer WTP ($)", colour=NULL,
title="Regression Discontinuity: the jump at 70 estimates the badge's causal effect") +
theme_mod3()
Key Assumptions and How to Test Them
Assumption 1: No Manipulation
Code
# ── YOUR DATA: replace df_rdd$audit_score with your running variable column
# and c=cutoff with your policy threshold value.
dens_test <- rddensity(X=df_rdd$audit_score, c=cutoff)
summary(dens_test)
Manipulation testing using local polynomial density estimation.
Number of obs = 1200
Model = unrestricted
Kernel = triangular
BW method = estimated
VCE method = jackknife
c = 70 Left of c Right of c
Number of obs 677 523
Eff. Number of obs 256 209
Order est. (p) 2 2
Order bias (q) 3 3
BW est. (h) 13.827 12.101
Method T P > |T|
Robust 0.6254 0.5317
P-values of binomial tests (H0: p=0.5).
Window Length <c >=c P>|T|
1.276 + 1.276 28 20 0.3123
2.553 + 2.479 44 41 0.8284
3.829 + 3.682 64 57 0.5856
5.105 + 4.885 89 80 0.5384
6.382 + 6.087 118 101 0.2796
7.658 + 7.290 140 117 0.1698
8.934 + 8.493 165 138 0.1351
10.211 + 9.696 185 168 0.3945
11.487 + 10.898 213 190 0.2731
12.763 + 12.101 235 209 0.2354
Code
invisible(rdplotdensity(dens_test, X=df_rdd$audit_score,
title="Density test: checking for bunching at the 70-point cutoff",
xlabel="Audit score", ylabel="Density"))A significant density discontinuity is a serious warning that units may be sorting around the cutoff. A non-significant result is reassuring — but not proof of clean assignment. Density tests can have low power, especially when manipulation is subtle, spread over a wider window, or driven by construct contamination rather than sharp bunching. The Monte Carlo results below show Type I error rising faster than density-test power precisely because the gaming mechanism here is smooth enough to partially escape detection while still producing severe bias. With our simulated uniform running variable, the density is continuous and the test correctly finds no evidence of manipulation.
Assumption 2: Covariate Continuity
Code
# ── YOUR DATA: run this check for every observed pre-treatment covariate.
# Here we simulate baseline env. concern, smooth through the cutoff by construction.
env_rdd <- rnorm(N_rdd, 0 + 0.01*(audit_score-cutoff)/10, 1)
rdd_env <- rdrobust(y=env_rdd, x=df_rdd$audit_score, c=cutoff)
cat("Covariate balance — env. concern should NOT jump at the threshold:\n")
Covariate balance — env. concern should NOT jump at the threshold:
Code
summary(rdd_env)
Sharp RD estimates using local polynomial regression.
Number of Obs. 1200
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 677 523
Eff. Number of Obs. 136 120
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 7.411 7.411
BW bias (b) 11.566 11.566
rho (h/b) 0.641 0.641
Unique Obs. 677 523
=====================================================================
Point Robust Inference
Estimate z P>|z| [ 95% C.I. ]
---------------------------------------------------------------------
RD Effect -0.091 -0.122 0.903 [-0.604 , 0.533]
=====================================================================
Covariate continuity is the RDD analogue of a balance table from Module 2: it checks whether observed pre-treatment characteristics jump at the threshold. A significant discontinuity in a pre-treatment covariate is a red flag — something is sorting units at the cutoff beyond the treatment rule. A smooth covariate is reassuring, but it cannot demonstrate balance on unobserved characteristics, just as a Module 2 balance table cannot confirm exchangeability on unmeasured potential confounders. In an experiment, balance on observed covariates is reassuring because randomisation also balances unobserved ones (probabilistically); in RDD, that probabilistic guarantee is absent. Covariate continuity is necessary evidence, not sufficient proof.
Estimating the RDD Effect
Code
# ── YOUR DATA: replace y= with your outcome variable, x= with your running
# variable, and c= with your cutoff value.
# ── KEY ARGS: rdrobust() selects bandwidth automatically (MSE-optimal rule);
# you can override with h= to specify a fixed bandwidth, or kernel= to change
# from the default triangular kernel.
rdd_est <- rdrobust(y=df_rdd$WTP_rdd, x=df_rdd$audit_score, c=cutoff)
summary(rdd_est)
Sharp RD estimates using local polynomial regression.
Number of Obs. 1200
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 677 523
Eff. Number of Obs. 181 172
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 9.821 9.821
BW bias (b) 17.119 17.119
rho (h/b) 0.574 0.574
Unique Obs. 677 523
=====================================================================
Point Robust Inference
Estimate z P>|z| [ 95% C.I. ]
---------------------------------------------------------------------
RD Effect 1.042 6.435 0.000 [0.762 , 1.429]
=====================================================================
Code
# ── CHECK: use the "Robust" CI for inference — it is bias-corrected.
rdplot(y=df_rdd$WTP_rdd, x=df_rdd$audit_score, c=cutoff,
title="RDD estimate: causal effect of sustainability badge on WTP",
x.label="Audit Score", y.label="WTP ($)")Code
bw_results <- map_dfr(c(5,8,10,15,20,25,30), function(bw) {
est <- rdrobust(y=df_rdd$WTP_rdd, x=df_rdd$audit_score, c=cutoff, h=bw)
tibble(bandwidth=bw, estimate=est$coef["Conventional",1], se=est$se["Robust",1])
})
bw_results |>
mutate(lo=estimate-1.96*se, hi=estimate+1.96*se) |>
ggplot(aes(x=bandwidth, y=estimate, ymin=lo, ymax=hi)) +
geom_hline(yintercept=true_jump, linetype="dashed", colour="grey40", linewidth=1) +
geom_ribbon(alpha=0.15, fill=clr_eco) +
geom_line(colour=clr_eco, linewidth=1) + geom_point(colour=clr_eco, size=2.5) +
annotate("text", x=5.5, y=true_jump+0.05, label=sprintf("True = $%.2f", true_jump),
hjust=0, size=3.5) +
labs(x="Bandwidth (audit score units)", y="Estimated badge effect ($)",
title="RDD estimates stable across bandwidths h = 5 to 30",
subtitle="Instability would indicate sensitivity to bandwidth — a red flag") +
theme_mod3()
The bandwidth \(h\) controls which products are included in the RDD comparison — only those with audit scores in \([70-h,\ 70+h]\). This creates a fundamental tension:
| Narrow \(h\) | Wide \(h\) |
|---|---|
| Products are very similar to each other near the cutoff ✔ | More observations → tighter CI ✔ |
| Few observations → wide CI ✖ | Products far from 70 differ systematically from each other ✖ |
rdrobust selects the bandwidth automatically using a mean-squared-error optimal rule. The bandwidth-sensitivity plot above is a sanity check: stable estimates across a wide range of \(h\) suggest the result is not an artefact of a specific choice.
Type I Error from Running Variable Manipulation
The density test and covariate continuity checks are the right diagnostics — but how much Type I error accumulates before those tests catch the problem? Here we simulate a concrete manipulation scenario and track false-positive rates as gaming becomes more common.
The mechanism. Two independent variables matter: (1) the audit score — operational compliance across energy use, waste, and supply chain — and (2) latent brand eco-reputation Z, an unobserved dimension reflecting genuine brand prestige and loyal eco-customer base. A brand can have strong reputation yet score just below 70: a complex supply chain, an unfamiliar auditor, or a borderline factory can drag the score down through measurement noise. Z does two things: it motivates gaming (high-reputation brands near the threshold believe they deserve the badge and have marketing resources to appeal, re-audit, or make cosmetic improvements) and it independently drives WTP (eco-conscious customers already follow high-reputation brands regardless of the badge). The true badge effect is zero.
▶ Monte Carlo: Type I error and density-test power vs. gaming rate (200 sims)
set.seed(2025)
N_SIM_RDD <- 200
N_RDD_MC <- 500
CUTOFF_MC <- 70
GAME_WINDOW <- 10
BW_RDD <- 10
rdd_mc_one <- function(p_game) {
mat <- replicate(N_SIM_RDD, {
Z_rep <- rnorm(N_RDD_MC, 0, 1)
audit <- pmin(100, pmax(30, rnorm(N_RDD_MC, mean = CUTOFF_MC - 3, sd = 9)))
near_below <- audit >= (CUTOFF_MC - GAME_WINDOW) & audit < CUTOFF_MC
game_mask <- near_below & (Z_rep > 0) & (runif(N_RDD_MC) < p_game)
if (any(game_mask)) {
cross_gap <- CUTOFF_MC - audit[game_mask] + 0.5
audit[game_mask] <- pmin(100, audit[game_mask] + cross_gap + runif(sum(game_mask), 0, 7))
}
WTP_mc <- pmax(1, pmin(10, 5.5 + 1.5 * Z_rep + rnorm(N_RDD_MC, 0, 0.30)))
D <- as.integer(audit >= CUTOFF_MC)
in_bw <- abs(audit - CUTOFF_MC) <= BW_RDD
rdd_p <- tryCatch({
if (sum(in_bw) < 10 || !any(D[in_bw] == 0) || !any(D[in_bw] == 1)) NA_real_
else {
df_bw <- data.frame(y=WTP_mc[in_bw], xc=audit[in_bw]-CUTOFF_MC, D=D[in_bw])
m <- lm(y ~ D * xc, data = df_bw)
2 * pt(-abs(summary(m)$coefficients["D","t value"]), df = m$df.residual)
}
}, error = function(e) NA_real_)
dens_p <- tryCatch({
rd <- rddensity(X = audit, c = CUTOFF_MC)
p <- rd$test$p_jk
if (is.null(p) || !is.finite(p)) 1.0 else as.numeric(p)[1]
}, error = function(e) 1.0)
c(rdd_p, dens_p)
})
data.frame(p_game=p_game, rdd_pval=mat[1,], dens_pval=mat[2,])
}
game_levels <- c(0, 0.10, 0.20, 0.30, 0.50, 0.70)
sim_rdd_mc <- map_dfr(game_levels, rdd_mc_one)
cat(sprintf("RDD simulation complete: %d conditions x %d sims = %d total.\n",
length(game_levels), N_SIM_RDD, nrow(sim_rdd_mc)))
RDD simulation complete: 6 conditions x 200 sims = 1200 total.
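The table below tabulates these runs. A sketch of the summary code, using the column names created in rdd_mc_one:
▶ Sketch: tabulating the RDD Monte Carlo
sim_rdd_mc |>
  group_by(p_game) |>
  summarise(`RDD Type I error` = mean(rdd_pval < 0.05, na.rm = TRUE),
            `Density test power` = mean(dens_pval < 0.05, na.rm = TRUE))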
| Gaming rate | RDD Type I error | Density test power |
|---|---|---|
| 0% | 6% | 5% |
| 10% | 9% | 2% |
| 20% | 12% | 6% |
| 30% | 37% | 13% |
| 50% | 70% | 24% |
| 70% | 96% | 40% |
As gaming becomes common, the density of audit scores just above 70 swells and the density just below depletes. The RDD treats this Z-sorted composition as a real treatment effect — Type I error climbs steeply toward 100%. The density test eventually picks up the bunching, but it has substantially less power than the spurious RDD finding itself. A researcher who checks rddensity, sees a borderline p-value, and proceeds will still produce highly confident false positives.
Why does Z go undetected? The audit score is observable; brand reputation is not. Standard covariate balance checks only test observed pre-treatment variables. If no measure of brand reputation was collected, the sorting is completely invisible — yet it inflates Type I error just as severely.
The link to the measurement discussion above. This is the discriminant validity failure made concrete. A running variable with strong discriminant validity would measure environmental quality and only environmental quality — cleanly separable from brand reputation, firm size, and lobbying capacity. When it fails discriminant validity, those adjacent constructs are embedded in the score itself, and near-threshold units differ not just on the measured dimension but on every construct the score inadvertently absorbs. The “covariate balance” check — the RDD’s primary diagnostic — then tests the wrong thing: it checks balance on the variables you measured, but it cannot detect imbalance on the latent constructs the running variable conflates with its target. An audit score that picks up brand reputation will produce near-threshold imbalance on brand reputation whether or not any active gaming occurred — simply because the score already sorted units by a contaminated composite. Better discriminant validity of the running variable is not just a measurement virtue; it is a precondition for the covariate-balance check to be meaningful at all.
Researcher Checklist: Regression Discontinuity Design
- Plot the raw data around the cutoff before estimating anything — the jump should be visible in the scatter, not manufactured by a flexible fit
- Run the density test (rddensity) for bunching at the threshold, remembering its limited power against smooth gaming
- Check covariate continuity for every observed pre-treatment variable
- Report bandwidth sensitivity across a wide range of \(h\)
- Interrogate the running variable’s measurement validity: content validity, discriminant validity, and measurement invariance (see above)
- Interpret the estimate as a LATE for units near the cutoff, not a population-wide effect
From RDD to Difference-in-Differences
RDD requires a sharp threshold in a single continuous score. When no such threshold exists — but you have panel data spanning a policy change that affected some units and not others — difference-in-differences uses time as the source of identification.
The identifying assumption shifts: instead of “units just above and below the threshold are exchangeable” (RDD), DiD assumes “treated and control units would have followed the same trend absent the treatment” (parallel trends). Both assumptions are forms of local exchangeability — in RDD it is spatial (near the cutoff); in DiD it is temporal (in the pre-treatment period). Both are empirically testable in part, and both can fail in ways that the available tests cannot detect.
Difference-in-Differences
Scott Cunningham’s Causal Inference: The Mixtape has a fantastic, and more thorough, deep dive into difference-in-differences — including staggered adoption and the recent critiques of TWFE — at mixtape.scunning.com/09-difference_in_differences.
The Dutch government introduces a mandatory eco-labelling law in 2022. Belgian supermarkets are not subject to it. We observe mean WTP in 15 Dutch stores (treated) and 20 Belgian stores (control).
Simulating Panel Data
Code
set.seed(2024)
n_dutch <- 15; n_belgian <- 20; n_stores_did <- n_dutch+n_belgian
years <- 2018:2024; T_treat <- 2022
# ── YOUR DATA: replace n_dutch / n_belgian with your counts of treated and
# control units; replace years with your time periods; replace T_treat with
# the first period when treatment took effect.
panel_did <- expand.grid(store=1:n_stores_did, year=years) |>
as_tibble() |>
mutate(
country = ifelse(store<=n_dutch,"Netherlands","Belgium"),
treated = (country=="Netherlands"),
post = (year>=T_treat),
did_indicator= treated & post,
store_fe = rnorm(n_stores_did, 0, 0.5)[store],
time_trend = 0.10*(year-2018),
country_fe = ifelse(country=="Netherlands", 0.30, 0),
treat_effect = did_indicator * 0.70,
WTP_did = 5.20 + store_fe + time_trend + country_fe + treat_effect + rnorm(n(), 0, 0.35)
)
panel_did |>
group_by(country, year) |>
summarise(mean_WTP=mean(WTP_did), .groups="drop") |>
ggplot(aes(x=year, y=mean_WTP, colour=country, group=country)) +
geom_vline(xintercept=T_treat-0.5, linetype="dashed", colour="grey50") +
geom_line(linewidth=1.2) + geom_point(size=2.5) +
annotate("text", x=T_treat+0.1, y=5.1,
label="Dutch eco-label\nlaw takes effect", hjust=0, size=3.2) +
scale_colour_manual(values=c("Netherlands"=clr_eco,"Belgium"=clr_ctrl)) +
labs(x="Year", y="Mean WTP ($)", colour=NULL,
title="DiD: parallel pre-trends are the key identifying assumption",
subtitle="Dutch and Belgian stores track each other in slope before 2022; Dutch stores jump after") +
theme_mod3()
DiD hinges on one unanswerable question: what would Dutch stores have done after 2022 if the eco-label law had never passed? We can’t observe this counterfactual. The parallel trends assumption supplies the answer: Belgian stores tell us.
More precisely, the assumption states that absent treatment, the Dutch–Belgian WTP gap would have remained constant — both series drifting up or down by identical amounts each year. Under this assumption:
\[\underbrace{\Delta Y_{\text{Dutch}}}_{\text{observed}} - \underbrace{\Delta Y_{\text{Belgian}}}_{\text{counterfactual proxy}} = \delta_{\text{DiD}}\]
What makes this assumption fail? Anything that makes Dutch and Belgian stores diverge for reasons unrelated to the law — e.g., Dutch shoppers becoming greener faster, or a Dutch-only economic shock. The parallel trends violation section below shows exactly this scenario.
The parallel trends assumption and Module 2: Parallel trends is to DiD what exchangeability is to experimentation. In Module 2, you saw that even carefully randomised experiments can fail exchangeability — attrition, demand effects, and differential compliance all erode it. In DiD, parallel trends is unverifiable for the post-treatment period. Pre-treatment trend tests provide some diagnostic evidence, but just as a successful manipulation check does not prove the exclusion restriction (Module 2), clean pre-trends do not prove that post-treatment trends would have remained parallel.
A Module 1 parallel: Fixed effects eliminate time-invariant confounders — but only if the measurement of the outcome variable is consistent across time and units. If the WTP question wording, scale, or survey protocol changes between periods (a measurement artefact from Module 1), the DiD estimate conflates treatment effects with measurement change.
Two-Way Fixed Effects (TWFE)
\[Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}\]
Code
# ── YOUR DATA: replace WTP_did with your outcome, did_indicator with the
# column you created as treated * post, and replace factor(store)/factor(year)
# with your unit and time identifiers.
did_twfe <- lm_robust(WTP_did ~ did_indicator + factor(store) + factor(year),
data=panel_did, clusters=store)
did_coef <- tidy(did_twfe) |> filter(term=="did_indicatorTRUE")
cat(sprintf(
"True DiD effect = 0.70\nEstimated DiD = %.3f (SE = %.3f, p = %.4f)\n95%% CI: [%.3f, %.3f]\n",
did_coef$estimate, did_coef$std.error, did_coef$p.value,
did_coef$conf.low, did_coef$conf.high
))
True DiD effect = 0.70
Estimated DiD = 0.631 (SE = 0.088, p = 0.0000)
95% CI: [0.452, 0.811]
| Term | What it controls for | Example here |
|---|---|---|
| \(\alpha_i\) — store fixed effects | Time-invariant differences between stores | Store size, neighbourhood demographics, pre-existing customer base |
| \(\lambda_t\) — year fixed effects | Year-level shocks that hit all stores equally | Euro-area inflation, global supply chain costs, macro consumer sentiment |
| \(\delta\) — the DiD estimator | How much more the treated group changed after treatment vs. the control group | Causal effect of the Dutch eco-label law |
The critical residual \(\varepsilon_{it}\): whatever the fixed effects can’t absorb — differential trends, time-varying store-specific shocks — ends up here. This is exactly where parallel-trends violations hide.
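A sanity check worth running: in a balanced two-group panel with a single adoption date, the TWFE coefficient equals the simple difference of group-by-period means. Computing it by hand from panel_did (this should reproduce the point estimate above, about $0.63):
▶ Sketch: the 2×2 DiD by hand
gm <- panel_did |>
  group_by(treated, post) |>
  summarise(m = mean(WTP_did), .groups = "drop")
# (post - pre) change for Dutch stores minus the same change for Belgian stores
with(gm, (m[treated & post] - m[treated & !post]) -
         (m[!treated & post] - m[!treated & !post]))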
Testing Parallel Trends: Event Study
Code
panel_did <- panel_did |>
mutate(year_rel=year-T_treat,
year_rel_fac=relevel(factor(year_rel), ref="-1"))
event_formula <- as.formula("WTP_did ~ treated:year_rel_fac + factor(store) + factor(year)")
did_event <- lm_robust(event_formula, data=panel_did, clusters=store)
event_tbl <- tidy(did_event) |>
filter(str_detect(term,"year_rel_fac")) |>
mutate(yr=as.integer(str_extract(term,"-?\\d+$")),
period=if_else(yr<0,"Pre-treatment","Post-treatment")) |>
add_row(yr=-1L, estimate=0, conf.low=0, conf.high=0, period="Reference (−1)")
event_tbl |>
ggplot(aes(x=yr, y=estimate, ymin=conf.low, ymax=conf.high, colour=period)) +
geom_hline(yintercept=0, linetype="dashed", colour="grey50") +
geom_vline(xintercept=-0.5, linetype="dotted", colour="grey60") +
geom_pointrange(size=0.7) +
scale_colour_manual(values=c("Pre-treatment"=clr_ctrl,"Post-treatment"=clr_eco,
"Reference (−1)"="grey40")) +
labs(x="Years relative to Dutch eco-label law (0 = 2022)", y="DiD coefficient ($)",
colour=NULL,
title="Event study: pre-treatment coefficients near zero — parallel trends supported",
subtitle="Post-treatment estimates rise consistently — evidence of a sustained law effect") +
theme_mod3()
Synthetic Difference-in-Differences
Synthetic DiD — like classical synthetic controls — constructs a weighted combination of untreated units designed to reproduce the treated unit’s pre-treatment trajectory. The pre-period fit can look compelling, but there is no way to verify that the synthetic world it generates is a sufficiently representative stand-in for the real counterfactual post-treatment. Use this approach when you have a credible pool of donor units and a long pre-treatment window — and always report sensitivity analyses varying the donor pool.
Code
sdid_wide <- panel_did |>
dplyr::select(store, year, WTP_did, treated) |>
pivot_wider(names_from=year, values_from=WTP_did) |>
arrange(treated)
N0 <- sum(!sdid_wide$treated); T0 <- sum(years < T_treat)
Y_matrix <- sdid_wide |> dplyr::select(-store,-treated) |> as.matrix()
sdid_est <- synthdid_estimate(Y_matrix, N0, T0)
sc_est <- sc_estimate(Y_matrix, N0, T0)
did_est <- did_estimate(Y_matrix, N0, T0)
tibble(
Method = c("Standard DiD","Synthetic Control","Synthetic DiD"),
Estimate = round(c(as.numeric(did_est),as.numeric(sc_est),as.numeric(sdid_est)), 3),
`True effect`= 0.70,
Bias = round(c(as.numeric(did_est),as.numeric(sc_est),as.numeric(sdid_est))-0.70, 3)
) |> knitr::kable(caption="All three estimators vs. the true effect of 0.70")| Method | Estimate | True effect | Bias |
|---|---|---|---|
| Standard DiD | 0.631 | 0.7 | -0.069 |
| Synthetic Control | 0.618 | 0.7 | -0.082 |
| Synthetic DiD | 0.618 | 0.7 | -0.082 |
Code
plot(sdid_est, se.method="placebo")
When Parallel Trends Fails — and Pre-Testing Cannot Always Save You
Parallel trends are not directly testable — you never observe what Dutch stores would have done without the eco-label law. What you can test is whether pre-treatment trajectories were parallel. The event study above is exactly this test. But two practical problems cripple it in most real datasets:
- Too few pre-treatment periods. Most DiD studies in marketing and management observe units for only 2–4 years before treatment. With so few periods, the event study has very low statistical power to detect violations.
- You collect whatever data exists. Pre-treatment archives are often limited — the number of pre-periods is determined by data availability, not statistical power considerations.
The Module 2 power connection: In Module 2, you saw how underpowered studies inflate apparent effect sizes. The same logic applies here in reverse: a pre-trend test with only two pre-periods has low power to detect a violation, so a non-significant result tells you very little. A failure to find pre-trend divergence with few observations is not evidence that trends were parallel — it is evidence that your test had insufficient power.
The event study tests whether observed pre-treatment outcomes moved in parallel. It does not test:
- Whether unmeasured confounders were evolving differently for treated and control units
- Whether the differential dynamic would have continued into the post-period absent treatment
- Whether a violation is simply too small to detect given the available pre-periods
Passing the pre-trend test is necessary but far from sufficient for causal identification.
A Realistic Parallel Trends Violation
Suppose Dutch stores were already attracting more environmentally conscious shoppers whose WTP grew $0.15/year faster than Belgian shoppers — not because of any law, but because of who was already shopping there. The true treatment effect is zero.
▶ Same violation: unmistakeable with 8 pre-periods, invisible with 2
set.seed(2025)
n_dutch_v <- 15; n_belgian_v <- 20; n_stores_v <- n_dutch_v + n_belgian_v
T_treat_v <- 2022; n_post_v <- 3
diff_slope_v <- 0.15
store_fes_v <- rnorm(n_stores_v, 0, 0.6)
make_violation_panel <- function(yrs, true_eff = 0) {
expand.grid(store = 1:n_stores_v, year = yrs) |> as_tibble() |>
mutate(
treated = as.numeric(store <= n_dutch_v),
country = ifelse(treated == 1, "Netherlands", "Belgium"),
post = as.numeric(year >= T_treat_v),
did_indicator = treated * post,
store_fe = store_fes_v[store],
diff_trend = treated * diff_slope_v * (year - T_treat_v),
WTP = 5.20 + store_fe + 0.08 * (year - T_treat_v) +
diff_trend + true_eff * did_indicator + rnorm(n(), 0, 0.40)
)
}
panel_many_v <- make_violation_panel(2014:2024)
panel_few_v <- make_violation_panel(2020:2024)
plot_pt_panel <- function(panel, subtitle) {
panel |> group_by(country, year) |>
summarise(WTP = mean(WTP), .groups = "drop") |>
ggplot(aes(x = year, y = WTP, colour = country, group = country)) +
geom_vline(xintercept = T_treat_v - 0.5, linetype = "dashed",
colour = "grey50", linewidth = 0.9) +
geom_line(linewidth = 1.2) + geom_point(size = 2.5) +
scale_colour_manual(values = c("Netherlands" = clr_eco, "Belgium" = clr_ctrl)) +
scale_x_continuous(breaks = unique(panel$year)) +
labs(x = NULL, y = "Mean WTP ($)", colour = NULL, subtitle = subtitle) +
theme_mod3()
}
p_many_v <- plot_pt_panel(panel_many_v,
"8 pre-periods: the differential trend is unmistakeable — you would never trust DiD here")
p_few_v <- plot_pt_panel(panel_few_v,
"2 pre-periods: SAME DGP — the violation is invisible and DiD proceeds unchallenged")
(p_many_v / p_few_v) +
plot_annotation(
title = "Parallel trends violation: Dutch WTP grows $0.15/yr faster than Belgian (true effect = $0)",
subtitle = "How many pre-periods you happen to collect determines whether you can even see the problem",
theme = theme(plot.title = element_text(size = 13, face = "bold"),
plot.subtitle = element_text(size = 11))
)
▶ DiD on 2-pre-period data: statistically significant, entirely spurious
did_viol <- lm_robust(WTP ~ did_indicator + factor(store) + factor(year),
data = panel_few_v, clusters = store)
dv <- tidy(did_viol) |> dplyr::filter(term == "did_indicator")
cat(sprintf(
"True treatment effect = $0.000\nDiD estimate = $%.3f (SE = %.3f, p = %.4f)\n95%% CI: [$%.3f, $%.3f]\n\nThis looks like a meaningful, significant eco-label effect.\nIt is entirely spurious. The violation was undetectable with 2 pre-periods.\n",
dv$estimate, dv$std.error, dv$p.value, dv$conf.low, dv$conf.high
))
True treatment effect = $0.000
DiD estimate = $0.376 (SE = 0.109, p = 0.0017)
95% CI: [$0.153, $0.599]
This looks like a meaningful, significant eco-label effect.
It is entirely spurious. The violation was undetectable with 2 pre-periods.
Monte Carlo: Pre-Test Power and Type I Error
▶ Monte Carlo: 300 sims × 6 pre-period counts (true effect = 0)
set.seed(2025)
N_SIM_PT <- 300
pt_sim_one <- function(n_pre) {
years_s <- c(seq(T_treat_v - n_pre, T_treat_v - 1),
T_treat_v:(T_treat_v + n_post_v - 1))
n_yrs <- length(years_s)
mat <- replicate(N_SIM_PT, {
n_obs <- n_stores_v * n_yrs
store_s <- rep(1:n_stores_v, each = n_yrs)
year_s <- rep(years_s, times = n_stores_v)
treated_s <- as.numeric(store_s <= n_dutch_v)
post_s <- as.numeric(year_s >= T_treat_v)
did_s <- treated_s * post_s
diff_s <- treated_s * diff_slope_v * (year_s - T_treat_v)
fe_s <- rnorm(n_stores_v, 0, 0.6)[store_s]
WTP_s <- 5.20 + fe_s + 0.08 * (year_s - T_treat_v) + diff_s + rnorm(n_obs, 0, 0.40)
df_s <- data.frame(WTP=WTP_s, did=did_s, store=factor(store_s),
year=factor(year_s), treated=treated_s, post=post_s, year_num=year_s)
fit_d <- lm(WTP ~ did + store + year, data = df_s)
pval_d <- coeftest(fit_d, vcovCL(fit_d, cluster = ~store))["did", 4]
if (n_pre >= 2) {
pre_s <- df_s[df_s$post == 0, ]
pre_s$yr_c <- pre_s$year_num - mean(pre_s$year_num)
fit_p <- lm(WTP ~ treated * yr_c + store, data = pre_s)
pval_p <- tryCatch(
coeftest(fit_p, vcovCL(fit_p, cluster = ~store))["treated:yr_c", 4],
error = function(e) 1.0)
} else {
pval_p <- 1.0
}
c(pval_d, pval_p)
})
data.frame(n_pre=n_pre, did_pval=mat[1,], pre_pval=mat[2,])
}
sim_all_pt <- map_dfr(1:6, pt_sim_one)
cat(sprintf("Simulation complete: %d total runs across 6 pre-period conditions.\n", nrow(sim_all_pt)))Simulation complete: 1800 total runs across 6 pre-period conditions.
| Pre-periods | Pre-test power (violation detected) | Sims passing pre-test | Overall Type I error | Conditional Type I error (given pass) |
|---|---|---|---|---|
| 1 | 0% | 300 / 300 | 34% | 34% |
| 2 | 4% | 288 / 300 | 78% | 77% |
| 3 | 23% | 230 / 300 | 97% | 98% |
| 4 | 59% | 122 / 300 | 100% | 100% |
| 5 | 92% | 24 / 300 | 100% | 100% |
| 6 | 98% | 5 / 300 | 100% | 100% |
- With 1 pre-period: the pre-trend test cannot be run at all, and the overall Type I error is already 34%, roughly seven times the nominal 5%.
- With 2–3 pre-periods: most violations slip through; conditional Type I error remains far above 5%.
- With 5–6 pre-periods: power improves substantially (92–98%), but the few violations that do slip through still produce a conditional Type I error of essentially 100%.
What to do: Report the number of pre-treatment periods and be honest about the test’s power. Run placebo DiDs. Use Rambachan & Roth’s (2023) HonestDiD sensitivity analysis, which reframes the question from “is there a violation?” to “how large a violation would matter?” Treat a passing event study as weak evidence, not proof.
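To make the placebo-DiD suggestion concrete, here is a minimal sketch on the 8-pre-period panel from above: restrict to pre-treatment years and pretend the law passed two years earlier. No real treatment has occurred in this window, so a significant “effect” is evidence against parallel trends. The placebo date T_treat_v − 2 is an illustrative choice, not part of the original analysis.
▶ Placebo DiD on pre-treatment years only
placebo <- panel_many_v |>
  dplyr::filter(year < T_treat_v) |>                      # pre-treatment window only
  mutate(placebo_post = as.integer(year >= T_treat_v - 2),
         placebo_did  = treated * placebo_post)
did_placebo <- lm_robust(WTP ~ placebo_did + factor(store) + factor(year),
                         data = placebo, clusters = store)
tidy(did_placebo) |> dplyr::filter(term == "placebo_did")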
How an Unobserved Variable Inflates Type I Error
A more dangerous violation involves an unobserved variable whose effect on the outcome switches on only after treatment. Because it leaves no pre-treatment footprint, pre-trend tests are completely blind to it.
▶ Monte Carlo: Type I error vs. omitted variable strength γ (300 sims, 2 pre-periods)
set.seed(2026)
N_SIM_OV <- 300
gamma_grid <- seq(0, 0.80, by = 0.10)
pt_ov_one <- function(gamma) {
years_ov <- c(2020, 2021, 2022, 2023, 2024)
T_ov <- 2022
mat <- replicate(N_SIM_OV, {
U_i <- c(rnorm(n_dutch_v, +0.5, 1), rnorm(n_belgian_v, -0.5, 1))
fe <- rnorm(n_stores_v, 0, 0.5)
panel <- expand.grid(store = seq_len(n_stores_v), year = years_ov) |>
as_tibble() |>
mutate(
treated = as.integer(store <= n_dutch_v),
post = as.integer(year >= T_ov),
did = treated * post,
U = U_i[store],
WTP = 5.20 + fe[store] + 0.08 * (year - T_ov) +
gamma * U * post + rnorm(n(), 0, 0.40)
)
fit_d <- lm(WTP ~ did + factor(store) + factor(year), data = panel)
pval_d <- tryCatch(
coeftest(fit_d, vcovCL(fit_d, cluster = ~store))["did", 4],
error = function(e) NA_real_)
pre <- panel[panel$post == 0, ]
pre$yr_c <- pre$year - mean(pre$year)
fit_p <- lm(WTP ~ treated * yr_c + factor(store), data = pre)
pval_p <- tryCatch(
coeftest(fit_p, vcovCL(fit_p, cluster = ~store))["treated:yr_c", 4],
error = function(e) 1.0)
c(pval_d, pval_p)
})
data.frame(gamma=gamma, did_pval=mat[1,], pre_pval=mat[2,])
}
sim_ov <- map_dfr(gamma_grid, pt_ov_one)
cat(sprintf("Omitted variable simulation: %d conditions × %d sims = %d total.\n",
length(gamma_grid), N_SIM_OV, nrow(sim_ov)))
Omitted variable simulation: 9 conditions × 300 sims = 2700 total.
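The summarisation chunk for these 2,700 runs was likewise not shown; a minimal sketch that computes the rejection rates described below:
▶ Rejection rates by confounder strength γ
sim_ov |>
  group_by(gamma) |>
  summarise(
    did_type1   = mean(did_pval < 0.05, na.rm = TRUE),  # false-positive DiD estimates
    pretest_rej = mean(pre_pval < 0.05),                # stays near the nominal 5%
    .groups = "drop"
  )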
DiD Type I error rises steeply as γ increases, reaching near-certainty of a false positive by γ = 0.70. The pre-trend test's rejection rate, meanwhile, stays flat at the nominal 5%: it has no power whatsoever against this violation. In the DGP, U enters the outcome only through the post-treatment interaction (the gamma * U * post term in the code), so it leaves no pre-treatment footprint; even a level effect of U would be absorbed by the store fixed effects. Only when U × post switches on after the treatment date does the confounder do damage, and by then the event study has nothing left to test.
The Parallel Trends Thought Experiment
The formal diagnostics above — event studies, omitted-variable simulations, HonestDiD sensitivity — are essential tools. But before reaching for any of them, it is worth stepping back and asking the foundational question with brutal honesty: how plausible is it that parallel trends holds at all?
Here is a thought experiment that makes the implausibility concrete.
The asteroid analogy. Think of the outcome variable — consumer WTP — as an asteroid moving through space. Time is the x-axis: the asteroid moves forward through time, and its trajectory encodes the trend in WTP. Now think of every factor that shapes WTP — consumer income growth, eco-consciousness trends, media coverage of sustainability, competitive dynamics, supply chain costs, regulatory climate, brand lifecycle — as additional dimensions. The asteroid is moving through a \(k\)-dimensional space, where \(k\) equals the number of forces acting on WTP.
Now a collision happens. The Dutch government passes the eco-label law. This collision affects Dutch stores but not Belgian stores. The collision splits our single asteroid into two pieces that separate along one dimension — the treated group (Dutch stores) receives the treatment shock while the control group (Belgian stores) does not.
The parallel trends assumption asks: after the collision, would the two pieces have continued moving on exactly parallel trajectories through this \(k\)-dimensional space?
In three-dimensional physics — where \(k = 3\) — this would almost never happen. A collision that separates two objects along one dimension will almost certainly impart different forces along the other dimensions as well. The pieces move apart not just in the y-direction but in x and z too. The probability that two colliding objects continue on perfectly parallel paths in three-dimensional space is essentially zero, because that would require the off-axis force components to be exactly equal and opposite.
Now move to \(k\) dimensions. Every additional dimension multiplies the implausibility. Parallel movement requires equal rates of change along all \(k - 1\) non-treatment dimensions simultaneously. If the probability of parallelism along each dimension is \(p < 1\), the joint probability of parallelism across all \(k - 1\) dimensions is \(p^{k-1} \to 0\) as \(k\) grows. The assumption becomes exponentially more implausible as the outcome variable is richer and more multi-determined.
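To put illustrative numbers on this (the per-dimension probability p = 0.9 is an assumption chosen for illustration, not an estimate):
▶ Joint probability of parallelism across k − 1 dimensions
p <- 0.9                 # assumed per-dimension probability of parallelism
k <- c(3, 10, 25, 50)    # number of forces acting on WTP
setNames(round(p^(k - 1), 4), paste0("k=", k))
#    k=3   k=10   k=25   k=50
# 0.8100 0.3874 0.0798 0.0057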
Think about what DiD is actually claiming: that the only thing that made Dutch and Belgian stores diverge after 2022 was the eco-label law — and that, absent the law, every other force acting on WTP (income shocks, consumer attitude trends, competitive retailer behaviour, media effects, seasonal patterns, product mix evolution) would have moved Dutch and Belgian stores at exactly the same rate.
This is not a modest claim. It requires that every dimension of the data-generating process for WTP was either (a) equally shared by Dutch and Belgian stores, or (b) perfectly cancelled by other forces. In reality, the Dutch law was passed because Dutch consumers were already moving in a particular direction. The very political economy that produced the treatment is almost certainly correlated with the trajectory of WTP that would have occurred without the treatment.
The connection to Module 2, Part 2. In Module 2, Part 2, you saw that even in a carefully controlled laboratory experiment, treatments almost never input only a single latent construct. A persuasive message designed to target environmental concern will simultaneously activate social norms, identity, guilt, and product-quality inferences. A pricing manipulation targeting perceived value also inputs feelings of financial constraint and quality signalling. Module 2, Part 2 demonstrated this through the exclusion restriction: a treatment variable is like an instrument — it is valid only if it operates through a single pathway, and that is extremely difficult to guarantee even under tight experimental control.
Now consider what a natural policy shock does. The Dutch eco-label law was not designed by researchers to isolate a single causal pathway. It was passed by a legislature responding to voter sentiment, industry lobbying, coalition negotiations, and macroeconomic conditions. That single event simultaneously changed the information environment for eco-conscious consumers, the competitive dynamics among Dutch retailers, the media and advertising landscape around sustainability, and the social-norm signal about what kind of purchasing is expected. Each of these changes operates on WTP through a different causal pathway. The probability that a policy shock in a complex, endogenous social system inputs only one dimension of the outcome is, practically speaking, zero.
This is the deepest reason to distrust parallel trends: it is not just that treated and control groups might have had different trends — it is that the treatment itself almost certainly influenced multiple dimensions of WTP simultaneously, and each of those dimensions evolves differently in Dutch and Belgian stores. Parallel trends requires that all of these multi-dimensional influences happened to cancel out perfectly in the counterfactual, leaving only the badge-receipt pathway as the “true” effect. That is an extraordinarily demanding assumption.
A secondary connection to Module 2, Part 3. The exchangeability argument from Module 2, Part 3 reinforces the point from a different angle: even deliberate random assignment does not guarantee exchangeability as the number of dimensions of the outcome construct grows. DiD is in a harder position still — the “assignment mechanism” (which government passed a law, and when) was driven by the same forces that shape WTP trajectories, not by a randomisation procedure. The dimensions of WTP and the determinants of policy timing are not independent. An unrandom shuffle with a large deck is not going to produce anything close to exchangeability.
What this means for practice. This is not an argument that DiD is never useful — it is an argument for intellectual honesty about the assumption that underlies it.
- Name the dimensions explicitly. For your specific context, list the forces that could plausibly drive your outcome differently in treated and control groups over time. Each one is a dimension along which the parallel trends assumption could fail.
- Ask: why was this policy passed here and not there? The answer will almost always reveal correlates with the outcome trajectory. Those correlates are the potential violations.
- Treat parallel trends as a maintained assumption, not a verified fact. No test can establish it; it can only be falsified. Reporting a non-significant event study with two pre-periods as evidence of parallel trends is like reporting one shuffle of a 52-card deck as evidence of randomness — the test had no power to detect the problem.
- Use DiD for what it is: a productive disciplining device, not a causal proof machine. The DiD framework forces you to be explicit about what the counterfactual is, who the control group is, and what parallel trends means for your context. That discipline has value even when the assumption is imperfect. But the conclusions should be presented as conditional on the assumption — not as established causal facts.