Part 3: Secondary Data Tools for Causal Identification
This part covers three secondary-data tools for causal identification — instrumental variables (IV), regression discontinuity design (RDD), and difference-in-differences (DiD). They are not three unrelated methods. They share a common logic: find variation in the treatment that is as good as random — variation that is unrelated to the unobserved confounders that make simple comparisons misleading. IV finds that variation through an external lever; RDD finds it at a sharp threshold; DiD finds it in a policy timing that differs across groups. Understanding IV first makes RDD and DiD easier to see as extensions of the same core idea.
Instrumental Variables
Scott Cunningham’s Causal Inference: The Mixtape has a fantastic, and more thorough, deep dive into instrumental variables at mixtape.scunning.com/07-instrumental_variables.
The core problem
Part 2 showed that when an unobserved variable \(U\) causes both treatment \(X\) and outcome \(Y\), no amount of regression adjustment on observed covariates can recover the causal effect of \(X\) on \(Y\). The path \(X \leftarrow U \rightarrow Y\) is a backdoor that remains open whenever \(U\) is unmeasured.
In the eco-label context: suppose stores that display eco-labels (\(X = 1\)) tend to be in neighbourhoods with higher baseline environmental orientation (\(U\)). \(U\) independently raises consumer WTP (\(Y\)). A naive comparison of WTP between labelling and non-labelling stores overstates the label’s causal effect, because part of the WTP gap is driven by neighbourhood composition, not the label.
The IV solution
An instrument \(Z\) provides a second source of variation in \(X\) — one that is unrelated to \(U\) and therefore free of confounding. Instead of using all variation in \(X\) to estimate its effect on \(Y\), IV uses only the variation in \(X\) that comes from \(Z\). Because \(Z\) is unrelated to \(U\), this slice of variation in \(X\) is clean.
For \(Z\) to be a valid instrument, three conditions must hold:
| Condition | What it requires | Testable? |
|---|---|---|
| Relevance | \(Z\) predicts \(X\) (strong first stage) | Yes — run the first-stage regression and check \(F > 10\) |
| Independence | \(Z\) is unrelated to \(U\) (and to everything else that affects \(Y\) except through \(X\)) | No — requires theory or design |
| Exclusion | \(Z\) affects \(Y\) only through \(X\), not through any direct path | No — requires theory |
Relevance is the only condition you can test directly. Independence and exclusion are assumptions — they must be argued from the design, not demonstrated from the data.
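In practice, the relevance check takes two lines of code. A minimal sketch, assuming a data frame `df` containing your treatment `X` and instrument `Z` (the simulation below works through a complete example):
▶ Sketch: testing relevance via the first-stage F
fs <- lm(X ~ Z, data = df) # first-stage regression of treatment on instrument
summary(fs)$fstatistic[1] # rule of thumb: F above 10 signals a usable instrument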
The eco-label pilot: a concrete instrument
The government randomly selects stores to participate in a subsidized eco-labelling pilot program (\(Z = 1\) if selected). Not every selected store actually installs the label — some lack the shelf space, staff time, or interest — so \(X \neq Z\). The instrument works as follows:
- Relevance: being selected for the pilot substantially raises the probability of displaying a label (first stage)
- Independence: pilot selection is random, so \(Z\) is unrelated to neighbourhood eco-orientation \(U\)
- Exclusion: being selected for the pilot affects consumer WTP only through the store actually displaying the label, not through any other route
This is precisely the ITT / LATE framework from Part 1. Pilot selection is the encouragement (\(Z\)), label display is compliance (\(X\)), and the Wald estimator gives the causal effect for compliers — stores that display the label because they were selected for the pilot, not those that would have displayed it anyway.
The Wald estimator
\[\hat{\beta}_{IV} = \frac{\widehat{\text{Cov}}(Y, Z)}{\widehat{\text{Cov}}(X, Z)} = \frac{\text{Effect of } Z \text{ on } Y}{\text{Effect of } Z \text{ on } X} = \frac{\text{ITT}}{\text{First stage}} = \text{LATE}\]
This is also called two-stage least squares (2SLS): first regress \(X\) on \(Z\) to get predicted values \(\hat{X}\); then regress \(Y\) on \(\hat{X}\). The second stage coefficient is \(\hat{\beta}_{IV}\).
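The two-stage recipe can be verified by hand. The sketch below uses self-contained toy data (not the eco-label simulation that follows) and shows that the manual second-stage coefficient equals the Wald ratio. Note that the manual second stage produces incorrect standard errors; use a dedicated routine such as iv_robust() or AER::ivreg() for inference.
▶ Sketch: 2SLS by hand on toy data
set.seed(1)
Z <- rbinom(5000, 1, 0.5) # instrument
U <- rnorm(5000) # unobserved confounder
X <- rbinom(5000, 1, plogis(-1 + 2 * Z + U)) # endogenous treatment
Y <- 1 + 0.5 * X + 0.7 * U + rnorm(5000) # outcome; true effect = 0.5
X_hat <- fitted(lm(X ~ Z)) # stage 1: keep only the Z-driven part of X
coef(lm(Y ~ X_hat))["X_hat"] # stage 2: close to 0.5, unlike the biased OLS
cov(Y, Z) / cov(X, Z) # identical to the Wald ratio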
Simulation
▶ Simulate IV setting: U confounds X → Y; Z is a valid instrument
set.seed(2024)
N_iv <- 600
# ── YOUR DATA: U is the unobserved confounder (not in your dataset);
# Z is your instrument (must be in your dataset); X is your endogenous treatment;
# Y is your outcome. The structural equations below define the DGP.
U <- rnorm(N_iv) # neighbourhood eco-orientation: UNOBSERVED
Z <- rbinom(N_iv, 1, 0.5) # random pilot selection: instrument
# Compliance: high-U stores adopt even without the pilot (always-takers);
# pilot selection tilts borderline stores into adoption (compliers)
X <- rbinom(N_iv, 1, plogis(-2.0 + 2.5 * Z + 1.0 * U)) # label adoption
# True causal effect of the label = $0.80; U creates confounding
Y <- 5.00 + 0.80 * X + 0.60 * U + rnorm(N_iv, 0, 0.50) # WTP
df_iv <- tibble(Y, X, Z, U_true = U)
# Compliance breakdown from observed (X, Z) pairs. Under monotonicity (no
# defiers), X = 1 & Z = 0 identifies always-takers and X = 0 & Z = 1 identifies
# never-takers, but the X == Z cells mix compliers with always-/never-takers,
# so the "Complier" count below overstates the true complier share (compare
# the 43.8% first-stage estimate printed afterwards)
compliance <- case_when(
X == 1 & Z == 0 ~ "Always-taker",
X == 0 & Z == 1 ~ "Never-taker",
X == Z ~ "Complier",
TRUE ~ "Defier"
)
cat("Compliance breakdown:\n")Compliance breakdown:
print(table(compliance))
compliance
Always-taker Complier Never-taker
50 434 116
cat(sprintf("\nFirst-stage compliance rate: %.1f%%\n", 100 * (mean(X[Z==1]) - mean(X[Z==0]))))
First-stage compliance rate: 43.8%
▶ OLS vs. IV: OLS is biased; IV recovers the true effect
# ── YOUR DATA: replace Y ~ X with your outcome ~ treatment; replace | Z with
# | your_instrument. The pipe syntax means "instrument for X using Z".
# Add covariates on both sides of | to control for observed confounders
# (e.g., Y ~ X + age + income | Z + age + income).
# OLS: biased — uses all variation in X, including the U-driven part
ols_iv <- lm_robust(Y ~ X, data = df_iv)
# IV (2SLS): uses only the Z-driven variation in X
iv_est <- iv_robust(Y ~ X | Z, data = df_iv)
# ── CHECK: first-stage F-statistic should exceed 10 (rule of thumb for strong
# instruments). Weak instruments (F < 10) produce severely biased IV estimates
# and wide confidence intervals.
fs_F <- summary(lm(X ~ Z, data = df_iv))$fstatistic[1] # first-stage F-statistic
tibble(
Method = c("OLS (biased — ignores U)", "IV / 2SLS (valid — uses Z only)"),
Estimate = round(c(coef(ols_iv)["X"], coef(iv_est)["X"]), 3),
SE = round(c(ols_iv$std.error["X"], iv_est$std.error["X"]), 3),
`95% CI` = sprintf("[%.3f, %.3f]",
c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
c(ols_iv$conf.high["X"], iv_est$conf.high["X"])),
`True effect` = 0.80
) |> knitr::kable(caption = sprintf(
"OLS vs. IV: true eco-label effect = $0.80 | first-stage F = %.1f | N = %d", fs_F, N_iv))| Method | Estimate | SE | 95% CI | True effect |
|---|---|---|---|---|
| OLS (biased — ignores U) | 1.151 | 0.060 | [1.033, 1.270] | 0.8 |
| IV / 2SLS (valid — uses Z only) | 0.734 | 0.138 | [0.462, 1.006] | 0.8 |
▶ First stage: does Z predict X?
df_iv |>
group_by(Z) |>
summarise(P_X = mean(X), .groups="drop") |>
ggplot(aes(x = factor(Z, labels=c("Not selected\n(Z = 0)", "Selected\n(Z = 1)")),
y = P_X, fill = factor(Z))) +
geom_col(width = 0.5, alpha = 0.85) +
geom_text(aes(label = sprintf("%.0f%%", 100 * P_X)), vjust = -0.4, size = 4) +
scale_fill_manual(values = c("0" = clr_ctrl, "1" = clr_eco), guide = "none") +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
labs(x = "Pilot selection (instrument Z)", y = "Proportion implementing label (X)",
title = "First stage: pilot selection strongly predicts label adoption",
subtitle = sprintf("F = %.1f — well above the weak-instrument threshold of 10", fs_F)) +
theme_mod3()
▶ OLS vs. IV coefficient comparison
tibble(
Method = factor(c("OLS (biased)", "IV / 2SLS (valid)"),
levels = c("OLS (biased)", "IV / 2SLS (valid)")),
Estimate = c(coef(ols_iv)["X"], coef(iv_est)["X"]),
lo = c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
hi = c(ols_iv$conf.high["X"], iv_est$conf.high["X"])
) |>
ggplot(aes(y = Method, x = Estimate, xmin = lo, xmax = hi, colour = Method)) +
geom_vline(xintercept = 0.80, linetype = "dashed", colour = "grey40", linewidth = 0.9) +
geom_pointrange(size = 1, linewidth = 1.2) +
scale_colour_manual(values = c("OLS (biased)" = clr_ctrl, "IV / 2SLS (valid)" = clr_eco),
guide = "none") +
scale_x_continuous(limits = c(0.5, 1.6),
labels = function(x) sprintf("$%.2f", x)) +
labs(y = NULL, x = "Estimated eco-label effect",
title = "OLS overstates the effect; IV recovers the truth",
subtitle = "Dashed line = true effect ($0.80) | OLS biased upward by U") +
theme_mod3()
The Danger of a Weak Instrument
The simulation above uses a strong instrument — pilot selection has a large, reliable first-stage relationship with label adoption. Real instruments are rarely so clean. A weak instrument is one where \(Z\) barely predicts \(X\): the first-stage \(F\) is low, and only a small fraction of the variation in \(X\) can be traced to \(Z\).
The consequence is severe and somewhat counterintuitive. Recall the Wald estimator:
\[\hat{\beta}_{IV} = \frac{\text{Cov}(Y,\, Z)}{\text{Cov}(X,\, Z)}\]
When the denominator — the first stage — is close to zero, two things go wrong simultaneously. First, the estimator’s variance explodes: dividing by a small number amplifies every source of noise, producing confidence intervals that are orders of magnitude wider than OLS. Second, any slight violation of the exclusion restriction gets amplified by the same factor. Even if \(Z\) has a tiny direct effect on \(Y\) that would normally be negligible — say, the announcement of pilot selection itself slightly changed consumer expectations — dividing that small numerator violation by a near-zero denominator turns it into a large bias in the IV estimate. The result is an estimate that can be more biased than OLS, not less, with confidence intervals whose nominal coverage can no longer be trusted.
▶ Simulate a weak instrument: low first-stage F leads to biased, imprecise IV
set.seed(2024)
N_wk <- 600
U_wk <- rnorm(N_wk)
Z_wk <- rbinom(N_wk, 1, 0.5)
# Weak compliance: pilot selection barely shifts adoption probability
X_wk <- rbinom(N_wk, 1, plogis(-1.5 + 0.30 * Z_wk + 1.2 * U_wk))
Y_wk <- 5.00 + 0.80 * X_wk + 0.60 * U_wk + rnorm(N_wk, 0, 0.50)
df_wk <- tibble(Y = Y_wk, X = X_wk, Z = Z_wk)
ols_wk <- lm_robust(Y ~ X, data = df_wk)
iv_wk <- iv_robust(Y ~ X | Z, data = df_wk)
fs_wk <- summary(lm(X ~ Z, data = df_wk))$fstatistic[1]
cat(sprintf("Weak first-stage F = %.1f (well below the threshold of 10)\n", fs_wk))Weak first-stage F = 0.2 (well below the threshold of 10)
▶ Strong vs. weak instrument: estimates, SEs, and confidence intervals
bind_rows(
tibble(Instrument = sprintf("Strong (F = %.0f)", fs_F),
Method = c("OLS", "IV / 2SLS"),
Estimate = c(coef(ols_iv)["X"], coef(iv_est)["X"]),
SE = c(ols_iv$std.error["X"], iv_est$std.error["X"]),
lo = c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
hi = c(ols_iv$conf.high["X"], iv_est$conf.high["X"])),
tibble(Instrument = sprintf("Weak (F = %.0f)", fs_wk),
Method = c("OLS", "IV / 2SLS"),
Estimate = c(coef(ols_wk)["X"], coef(iv_wk)["X"]),
SE = c(ols_wk$std.error["X"], iv_wk$std.error["X"]),
lo = c(ols_wk$conf.low["X"], iv_wk$conf.low["X"]),
hi = c(ols_wk$conf.high["X"], iv_wk$conf.high["X"]))
) |>
mutate(`95% CI` = sprintf("[%.2f, %.2f]", lo, hi),
Estimate = round(Estimate, 3),
SE = round(SE, 3)) |>
select(Instrument, Method, Estimate, SE, `95% CI`) |>
knitr::kable(caption = "Strong vs. weak instrument: true eco-label effect = $0.80. Weak IV is imprecise and biased.")| Instrument | Method | Estimate | SE | 95% CI |
|---|---|---|---|---|
| Strong (F = 154) | OLS | 1.151 | 0.060 | [1.03, 1.27] |
| Strong (F = 154) | IV / 2SLS | 0.734 | 0.138 | [0.46, 1.01] |
| Weak (F = 0) | OLS | 1.359 | 0.061 | [1.24, 1.48] |
| Weak (F = 0) | IV / 2SLS | -1.077 | 6.737 | [-14.31, 12.15] |
▶ Visualise strong vs. weak IV: coefficient estimates and CIs
bind_rows(
tibble(Instrument = factor("Strong instrument", levels = c("Strong instrument","Weak instrument")),
Method = factor(c("OLS","IV / 2SLS"), levels = c("OLS","IV / 2SLS")),
Estimate = c(coef(ols_iv)["X"], coef(iv_est)["X"]),
lo = c(ols_iv$conf.low["X"], iv_est$conf.low["X"]),
hi = c(ols_iv$conf.high["X"], iv_est$conf.high["X"])),
tibble(Instrument = factor("Weak instrument", levels = c("Strong instrument","Weak instrument")),
Method = factor(c("OLS","IV / 2SLS"), levels = c("OLS","IV / 2SLS")),
Estimate = c(coef(ols_wk)["X"], coef(iv_wk)["X"]),
lo = c(ols_wk$conf.low["X"], iv_wk$conf.low["X"]),
hi = c(ols_wk$conf.high["X"], iv_wk$conf.high["X"]))
) |>
ggplot(aes(y = Method, x = Estimate, xmin = lo, xmax = hi, colour = Method)) +
geom_vline(xintercept = 0.80, linetype = "dashed", colour = "grey40", linewidth = 0.9) +
geom_pointrange(size = 0.9, linewidth = 1.1) +
facet_wrap(~Instrument) +
scale_colour_manual(values = c("OLS" = clr_ctrl, "IV / 2SLS" = clr_eco), guide = "none") +
labs(y = NULL, x = "Estimated eco-label effect ($)",
title = "Weak instruments: the CI explodes and the estimate may be worse than OLS",
subtitle = "Dashed line = true effect ($0.80) | Weak IV amplifies noise and exclusion-restriction violations") +
theme_mod3()
The weak instrument problem has a direct parallel in Module 1’s concept of discriminant validity. Recall that discriminant validity requires a scale to clearly differentiate its target construct from adjacent constructs. A scale with poor discriminant validity picks up variance from multiple constructs simultaneously — it cannot cleanly isolate what it is supposed to measure.
A weak instrument fails for the same structural reason, but at the causal level. The instrument \(Z\) is supposed to isolate a specific pathway — \(Z \rightarrow X \rightarrow Y\) — and exclude all others. When \(Z\) weakly predicts \(X\), it fails to adequately differentiate the intended causal channel from background noise. Just as a scale with a high HTMT ratio cannot clearly separate itself from an adjacent construct, a weak instrument cannot clearly separate the X-pathway from the confounding pathways. The clean variation in \(X\) that \(Z\) provides is so small that any contamination from unmeasured channels dominates the estimate.
The Wald estimator makes this explicit: whatever tiny amount of \(Z\)-driven variation exists in \(X\), the second stage amplifies it to recover the causal effect — but the amplification is indiscriminate. It magnifies the causal signal and any violation of independence or exclusion equally. A negligible exclusion-restriction violation that would be safely ignorable with a strong instrument becomes the dominant source of bias when the instrument is weak. This mirrors what happens in Module 1 when a scale has poor discriminant validity: contamination from adjacent constructs overwhelms the target signal.
Practical takeaway: Test the first stage before trusting your IV estimate, just as you would run HTMT and CFA before trusting scale scores. An F-statistic below 10 is a warning that your instrument does not isolate its intended pathway clearly enough for IV to be informative.
The IV estimate is the LATE: the causal effect of the eco-label for complier stores — those that installed the label because they were selected for the pilot, not those that would have installed it regardless. Always-takers (high-U stores that adopt with or without the pilot) and never-takers (stores that never adopt) do not contribute to the IV estimate.
This is identical to the LATE from Part 1 of this module. The instrument is a different mechanism — a government lottery rather than a laboratory randomisation — but the estimand is the same: the average effect for units whose treatment status was actually changed by the instrument.
Practical limits of IV:
- Weak instruments (\(F < 10\)) produce severely biased estimates that can be worse than OLS
- Instrument validity is untestable in full — independence and exclusion rest on argument, not data
- LATE may not generalise — compliers near the instrument may differ systematically from the full population of treated units
From IV to Regression Discontinuity
The eco-label pilot above exploited randomness that a government deliberately created. In most secondary data settings, no one ran a lottery. But sometimes nature or policy creates an instrument-like discontinuity: a threshold that sharply determines treatment for units near it, and that is effectively random in a narrow window around the cutoff.
Regression discontinuity is local IV. The instrument is \(Z_i = \mathbf{1}[\text{audit score}_i \geq 70]\) — being above the threshold. Near the cutoff, audit scores have measurement noise, so whether a product lands at 69 or 71 is close to random. This near-randomness makes \(Z\) approximately independent of unobserved product quality \(U\) in that local window. The exclusion restriction is that the audit score crossing 70 affects WTP only through badge receipt, not through any other route. And relevance is guaranteed — the badge is awarded precisely at 70.
The RDD estimate is therefore a Wald estimator applied locally. In a sharp RDD — where the threshold deterministically assigns treatment (every product at ≥ 70 receives the badge; none below do) — the jump in badge receipt at the cutoff is exactly 1, and the Wald ratio simplifies to the raw jump in WTP: a local average treatment effect (LATE) for products near the threshold.
In a fuzzy RDD, crossing the threshold changes the probability of receiving treatment but does not guarantee it. Some products above 70 may not display the badge; some below may receive it through other means. Here the threshold itself serves as an instrument — crossing 70 predicts badge receipt without perfectly determining it — and the Wald ratio identifies the LATE for cutoff compliers: products whose badge status would change if their observed score crossed the threshold. This is the same logic as the standard IV LATE: the effect is identified for the sub-population whose treatment was moved by the instrument (the threshold crossing), not the full population.
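A sketch of the fuzzy case in code, using the fuzzy= argument of rdrobust (toy data; all variable names here are illustrative, not from the simulation below):
▶ Sketch: fuzzy RDD as local IV with rdrobust
library(rdrobust)
set.seed(1)
score <- runif(2000, 40, 100) # running variable
# Fuzzy take-up: crossing 70 raises badge probability from 20% to 80%
badge <- rbinom(2000, 1, ifelse(score >= 70, 0.8, 0.2))
wtp <- 4 + 0.01 * (score - 70) + 0.6 * badge + rnorm(2000, 0, 0.5)
# fuzzy= makes rdrobust report the local Wald ratio:
# (jump in wtp at 70) / (jump in Pr(badge) at 70), here close to 0.6
summary(rdrobust(y = wtp, x = score, c = 70, fuzzy = badge))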
Regression Discontinuity Design
Scott Cunningham’s Causal Inference: The Mixtape has a fantastic, and more thorough, deep dive into regression discontinuity design at mixtape.scunning.com/06-regression_discontinuity.
AlterEco’s retailer awards a “Certified Sustainable” badge to any coffee product scoring 70 or above on a third-party environmental audit (scored 0–100). Products at 69 miss the badge; products at 71 get it. Close to the cutoff, the small score differences that separate receiving from not receiving the badge are plausibly driven more by auditing variability than by genuine differences in underlying quality — making near-threshold assignment close to random. We exploit this local near-randomness.
The identification logic rests on a continuity assumption: in expectation, products just above and just below 70 would have had similar WTP absent the badge. This is an assumption about the smoothness of potential outcomes near the threshold, not a claim that any two adjacent products are physically identical. It must be argued from context and verified with diagnostics.
This local near-randomness is what makes the jump causal: in expectation, products just above and just below 70 are approximately exchangeable in everything except badge receipt, so the jump in WTP is attributable to the badge.
The cost of this elegance: the estimate is a local average treatment effect (LATE). It tells you the causal effect of receiving the badge for products hovering around a 70-point score — not for top-scoring 90s or low-scoring 40s, which may respond very differently. This LATE is conceptually identical to the LATE from Part 1 of this module and from the IV section above — all three arise because the causal estimate is anchored to a specific sub-population near the cutoff or instrument, not the full population.
A Module 2 connection: The “as-good-as-random near the cutoff” logic is local randomisation — a naturally occurring version of the randomised experiment studied in Module 2. The same assumption that makes experiments valid (exchangeability of treated and control) holds here, but only locally. In Module 2, randomisation made exchangeability a design guarantee; here, it is an empirical claim that must be verified through the diagnostics below.
A Module 1 connection: The running variable itself is a measurement, and the Module 1 validity framework applies directly. But the key threat is not simply random noise. If the observed audit score is the rule that assigns the badge, then auditor imprecision in the score does not automatically attenuate the estimated jump — it shifts the estimand (the effect is for products near 70 on the observed scale, not necessarily on the latent sustainability scale). The more fundamental threat is construct contamination: if the audit score reflects sustainability plus firm size, lobbying capacity, or brand reputation, crossing 70 is not a clean sustainability threshold but a composite one. Units just above and below may differ on those adjacent constructs in ways that independently drive WTP. This is developed in the measurement section below.
A practical check to run first: If brands can game the audit score to land just above 70, the continuity assumption breaks down. The density test below checks for suspicious bunching at the cutoff.
The Running Variable as a Measurement
The running variable is itself a measurement, and the Module 1 validity framework applies to it directly. But the measurement threats to RDD are more specific than the general idea that a noisy running variable is bad. It helps to distinguish three situations.
Recall the Classical Test Theory decomposition from Module 1, Part 1:
\[X_{\text{obs}} = T_{\text{true}} + \varepsilon\]
Applied to the audit score: the observed score equals true environmental quality plus auditor noise. Whether that noise creates problems for RDD depends on how the running variable is used and what it is actually measuring.
Case A — Observed score is the assignment rule (the typical situation). The badge is awarded to every product scoring ≥ 70 on the observed audit score. Here, random auditor noise does not automatically attenuate the estimated discontinuity or invalidate the design. The sharp step in treatment assignment exists at the observed threshold, and the RDD estimates the causal effect of crossing that threshold. What noise does change is the estimand: the effect is identified for products near 70 on the observed scale, which may include products with a range of true sustainability levels due to auditor imprecision. This is a consideration for interpretation — the LATE is for “products that scored near 70” rather than “products whose true sustainability is near 70” — but it is not a design failure.
Case B — True score determines treatment, but the researcher observes a noisy proxy. If the certifying body awards badges based on a latent quality standard and the observed audit score is an imperfect proxy for that standard, noise can genuinely blur the discontinuity. Misclassification in both directions near the cutoff — some high-quality products score below 70, some lower-quality products score above — produces a gradual slope rather than a sharp step, attenuating the estimated jump. This is the analogue of attenuation bias in OLS when predictors are measured with error (Module 1, Part 2). In practice, Case A is more common: the observed score is the actual assignment rule.
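A quick sketch of Case B on toy data (an assumed setup, not the main simulation): the badge follows the latent score, the researcher observes a noisy proxy, and the step in treatment probability at the observed cutoff becomes a slope.
▶ Sketch: Case B, a noisy proxy of the true assignment score
set.seed(1)
true_score <- runif(5000, 40, 100)
obs_score <- true_score + rnorm(5000, 0, 4) # CTT: observed = true + noise
D <- as.integer(true_score >= 70) # badge follows the TRUE score
# Pr(badge) by observed-score bin: a gradual slope through 70, not a 0-to-1 jump
round(tapply(D, cut(obs_score, breaks = c(60, 65, 70, 75, 80)), mean), 2)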
Case C — Construct contamination (the principal Module 1 threat). This is the most important case and the one most directly connected to Module 1’s concept of discriminant validity. It has nothing to do with random noise. If the audit score systematically reflects sustainability plus firm size, lobbying capacity, auditor relationships, or brand reputation, the threshold is a composite threshold rather than a clean sustainability threshold. Units just above and just below 70 will differ not only in their observed score but on every adjacent construct the score inadvertently absorbs — and some of those constructs independently drive WTP. The jump at the cutoff will then reflect both the badge effect and systematic differences in those confounded traits, with no way to separate them.
When the observed score is the assignment rule, classical random noise does not automatically destroy an RDD. The more fundamental threat is construct contamination: when the running variable loads on adjacent constructs, the cutoff is no longer a clean threshold on the intended construct.
If the audit score conflates environmental quality with firm size, lobbying capacity, or auditor relationships:
- Near-threshold units differ on those adjacent constructs as well as on the measured score
- The estimated jump captures the effect of crossing a composite threshold, not a clean sustainability threshold
- Covariate balance checks cannot detect this if the contaminating constructs were never measured
This is the discriminant validity failure from Module 1, applied to causal identification: a scale that fails discriminant validity cannot cleanly isolate its target construct, and a running variable that fails discriminant validity cannot cleanly define what crossing its threshold means.
How contamination enables gaming. When brand reputation or lobbying capacity is embedded in the audit score, high-reputation brands near the threshold can exploit that contamination to justify re-auditing or appealing borderline scores. Their resources — not their environmental quality — give them repeated attempts to cross 70. The Monte Carlo simulation below shows how this dynamic generates severe Type I error even when active manipulation is relatively uncommon.
The Module 1 checklist applied to your running variable. Before running an RDD, ask the measurement validity questions from Module 1 about the running variable itself:
- Content validity: Does the running variable actually cover the full domain of the construct it is supposed to measure? An environmental audit that focuses heavily on energy use and packaging but ignores supply chain labour practices, land-use impact, or end-of-life product handling has poor content validity. It measures some aspects of sustainability, but not the construct in full. A cutoff on this score is then a cutoff on a narrow slice of the construct, not on sustainability itself — and products that score well on the measured facets while performing poorly on the unmeasured ones receive the badge as if they were comprehensively sustainable.
- Construct validity — especially discriminant validity: Does the running variable pick up primarily its intended construct, or does it also load on adjacent ones? This is the key question. Construct validity requires both convergent validity (the score correlates with other measures of the same construct) and — above all — discriminant validity (the score does not correlate highly with measures of different constructs). In the audit example: if a product’s audit score is partially determined by firm size, lobbying capacity, or the brand’s pre-existing relationship with the certifying body, the score fails discriminant validity. It conflates environmental quality with political and economic power. Products near the threshold will then differ not just in sustainability — they will differ in size, connections, and resources. The RDD is no longer estimating the effect of crossing a sustainability threshold; it is estimating the effect of crossing a composite threshold that mixes sustainability, size, and influence. This is precisely the Module 1 problem: a scale with a high HTMT ratio to an adjacent construct cannot cleanly isolate its target, and neither can a running variable that picks up adjacent constructs.
- Measurement invariance: Are audit standards applied consistently across the types of products in your sample? If the same score of 70 means different things for small artisan producers vs. large multinational brands (a form of non-invariance from Module 1, Part 3), the threshold is not a uniform treatment assignment rule — it assigns badges based on different underlying quality levels for different types of firms.
The practical implication is uncomfortable but important: the interpretive precision of an RDD is bounded above by the discriminant validity of its running variable. An RDD on a construct-contaminated running variable can still estimate the causal effect of crossing that particular composite threshold — but it cannot cleanly estimate the effect of crossing a sustainability threshold. A high-quality regression discontinuity study begins with a running variable that demonstrably measures what it claims to measure and demonstrably does not measure what it claims not to measure.
Simulating an RDD
▶ Simulate 1,200 products near the 70-point certification threshold
set.seed(2024)
N_rdd <- 1200; cutoff <- 70; true_jump <- 0.80
# ── YOUR DATA: replace audit_score with your running variable (the continuous
# score that determines treatment assignment), cutoff with the actual policy
# threshold, and WTP_rdd with your outcome variable.
# df_rdd must contain: the running variable, a 0/1 treatment indicator
# (= 1 if running variable >= cutoff), the outcome, and running = running - cutoff.
audit_score <- runif(N_rdd, 30, 100)
treat_rdd <- as.integer(audit_score >= cutoff)
WTP_rdd <- 4.0 + 0.02*(audit_score-cutoff) - 0.0003*(audit_score-cutoff)^2 +
true_jump*treat_rdd + rnorm(N_rdd, 0, 0.60)
WTP_rdd <- pmax(1, pmin(10, WTP_rdd))
df_rdd <- tibble(audit_score, treat_rdd, WTP_rdd, running=audit_score-cutoff)
df_rdd |>
ggplot(aes(x=audit_score, y=WTP_rdd, colour=factor(treat_rdd))) +
geom_point(alpha=0.30, size=0.9) +
geom_vline(xintercept=cutoff, linewidth=1, colour="grey30") +
geom_smooth(method="lm", formula=y~poly(x,2), se=TRUE, alpha=0.2) +
scale_colour_manual(values=c("0"=clr_ctrl,"1"=clr_eco),
labels=c("Below 70 (no badge)","Above 70 (badge)")) +
annotate("text", x=71, y=9.2, label="Cutoff = 70", hjust=0, size=3.5, fontface="bold") +
annotate("segment", x=70, xend=70, y=3.85, yend=4.85, colour="firebrick",
arrow=arrow(ends="both", length=unit(.2,"cm")), linewidth=0.8) +
annotate("text", x=71, y=4.35,
label=sprintf("~$%.2f jump\n(badge effect)", true_jump),
hjust=0, size=3, colour="firebrick") +
labs(x="Audit Score (running variable)", y="Consumer WTP ($)", colour=NULL,
title="Regression Discontinuity: the jump at 70 estimates the badge's causal effect") +
theme_mod3()
Key Assumptions and How to Test Them
Assumption 1: No Manipulation
Code
# ── YOUR DATA: replace df_rdd$audit_score with your running variable column
# and c=cutoff with your policy threshold value.
dens_test <- rddensity(X=df_rdd$audit_score, c=cutoff)
summary(dens_test)
Manipulation testing using local polynomial density estimation.
Number of obs = 1200
Model = unrestricted
Kernel = triangular
BW method = estimated
VCE method = jackknife
c = 70 Left of c Right of c
Number of obs 677 523
Eff. Number of obs 256 209
Order est. (p) 2 2
Order bias (q) 3 3
BW est. (h) 13.827 12.101
Method T P > |T|
Robust 0.6254 0.5317
P-values of binomial tests (H0: p=0.5).
Window Length <c >=c P>|T|
1.276 + 1.276 28 20 0.3123
2.553 + 2.479 44 41 0.8284
3.829 + 3.682 64 57 0.5856
5.105 + 4.885 89 80 0.5384
6.382 + 6.087 118 101 0.2796
7.658 + 7.290 140 117 0.1698
8.934 + 8.493 165 138 0.1351
10.211 + 9.696 185 168 0.3945
11.487 + 10.898 213 190 0.2731
12.763 + 12.101 235 209 0.2354
Code
invisible(rdplotdensity(dens_test, X=df_rdd$audit_score,
title="Density test: checking for bunching at the 70-point cutoff",
xlabel="Audit score", ylabel="Density"))A significant density discontinuity is a serious warning that units may be sorting around the cutoff. A non-significant result is reassuring — but not proof of clean assignment. Density tests can have low power, especially when manipulation is subtle, spread over a wider window, or driven by construct contamination rather than sharp bunching. The Monte Carlo results below show Type I error rising faster than density-test power precisely because the gaming mechanism here is smooth enough to partially escape detection while still producing severe bias. With our simulated uniform running variable, the density is continuous and the test correctly finds no evidence of manipulation.
Assumption 2: Covariate Continuity
Code
# ── YOUR DATA: run this check for every observed pre-treatment covariate.
# Here we simulate baseline env. concern, smooth through the cutoff by construction.
env_rdd <- rnorm(N_rdd, 0 + 0.01*(audit_score-cutoff)/10, 1)
rdd_env <- rdrobust(y=env_rdd, x=df_rdd$audit_score, c=cutoff)
cat("Covariate balance — env. concern should NOT jump at the threshold:\n")
Covariate balance — env. concern should NOT jump at the threshold:
Code
summary(rdd_env)
Sharp RD estimates using local polynomial regression.
Number of Obs. 1200
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 677 523
Eff. Number of Obs. 136 120
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 7.411 7.411
BW bias (b) 11.566 11.566
rho (h/b) 0.641 0.641
Unique Obs. 677 523
=====================================================================
Point Robust Inference
Estimate z P>|z| [ 95% C.I. ]
---------------------------------------------------------------------
RD Effect -0.091 -0.122 0.903 [-0.604 , 0.533]
=====================================================================
Covariate continuity is the RDD analogue of a balance table from Module 2: it checks whether observed pre-treatment characteristics jump at the threshold. A significant discontinuity in a pre-treatment covariate is a red flag — something is sorting units at the cutoff beyond the treatment rule. A smooth covariate is reassuring, but it cannot demonstrate balance on unobserved characteristics, just as a Module 2 balance table cannot confirm exchangeability on unmeasured potential confounders. In an experiment, balance on observed covariates is reassuring because randomisation also balances unobserved ones (probabilistically); in RDD, that probabilistic guarantee is absent. Covariate continuity is necessary evidence, not sufficient proof.
Estimating the RDD Effect
Code
# ── YOUR DATA: replace y= with your outcome variable, x= with your running
# variable, and c= with your cutoff value.
# ── KEY ARGS: rdrobust() selects bandwidth automatically (MSE-optimal rule);
# you can override with h= to specify a fixed bandwidth, or kernel= to change
# from the default triangular kernel.
rdd_est <- rdrobust(y=df_rdd$WTP_rdd, x=df_rdd$audit_score, c=cutoff)
summary(rdd_est)
Sharp RD estimates using local polynomial regression.
Number of Obs. 1200
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 677 523
Eff. Number of Obs. 181 172
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 9.821 9.821
BW bias (b) 17.119 17.119
rho (h/b) 0.574 0.574
Unique Obs. 677 523
=====================================================================
Point Robust Inference
Estimate z P>|z| [ 95% C.I. ]
---------------------------------------------------------------------
RD Effect 1.042 6.435 0.000 [0.762 , 1.429]
=====================================================================
Code
# ── CHECK: use the "Robust" CI for inference — it is bias-corrected.
rdplot(y=df_rdd$WTP_rdd, x=df_rdd$audit_score, c=cutoff,
title="RDD estimate: causal effect of sustainability badge on WTP",
x.label="Audit Score", y.label="WTP ($)")Code
bw_results <- map_dfr(c(5,8,10,15,20,25,30), function(bw) {
est <- rdrobust(y=df_rdd$WTP_rdd, x=df_rdd$audit_score, c=cutoff, h=bw)
tibble(bandwidth=bw, estimate=est$coef["Conventional",1], se=est$se["Robust",1])
})
bw_results |>
mutate(lo=estimate-1.96*se, hi=estimate+1.96*se) |>
ggplot(aes(x=bandwidth, y=estimate, ymin=lo, ymax=hi)) +
geom_hline(yintercept=true_jump, linetype="dashed", colour="grey40", linewidth=1) +
geom_ribbon(alpha=0.15, fill=clr_eco) +
geom_line(colour=clr_eco, linewidth=1) + geom_point(colour=clr_eco, size=2.5) +
annotate("text", x=5.5, y=true_jump+0.05, label=sprintf("True = $%.2f", true_jump),
hjust=0, size=3.5) +
labs(x="Bandwidth (audit score units)", y="Estimated badge effect ($)",
title="RDD estimates stable across bandwidths h = 5 to 30",
subtitle="Instability would indicate sensitivity to bandwidth — a red flag") +
theme_mod3()
The bandwidth \(h\) controls which products are included in the RDD comparison — only those with audit scores in \([70-h,\ 70+h]\). This creates a fundamental tension:
| Narrow \(h\) | Wide \(h\) |
|---|---|
| Products are very similar to each other near the cutoff ✔ | More observations → tighter CI ✔ |
| Few observations → wide CI ✖ | Products far from 70 differ systematically from each other ✖ |
rdrobust selects the bandwidth automatically using a mean-squared-error optimal rule. The bandwidth-sensitivity plot above is a sanity check: stable estimates across a wide range of \(h\) suggest the result is not an artefact of a specific choice.
Type I Error from Running Variable Manipulation
The density test and covariate continuity checks are the right diagnostics — but how much Type I error accumulates before those tests catch the problem? Here we simulate a concrete manipulation scenario and track false-positive rates as gaming becomes more common.
The mechanism. Two independent variables matter: (1) the audit score — operational compliance across energy use, waste, and supply chain — and (2) latent brand eco-reputation Z, an unobserved dimension reflecting genuine brand prestige and loyal eco-customer base. A brand can have strong reputation yet score just below 70: a complex supply chain, an unfamiliar auditor, or a borderline factory can drag the score down through measurement noise. Z does two things: it motivates gaming (high-reputation brands near the threshold believe they deserve the badge and have marketing resources to appeal, re-audit, or make cosmetic improvements) and it independently drives WTP (eco-conscious customers already follow high-reputation brands regardless of the badge). The true badge effect is zero.
▶ Monte Carlo: Type I error and density-test power vs. gaming rate (200 sims)
set.seed(2025)
N_SIM_RDD <- 200
N_RDD_MC <- 500
CUTOFF_MC <- 70
GAME_WINDOW <- 10
BW_RDD <- 10
rdd_mc_one <- function(p_game) {
mat <- replicate(N_SIM_RDD, {
Z_rep <- rnorm(N_RDD_MC, 0, 1)
audit <- pmin(100, pmax(30, rnorm(N_RDD_MC, mean = CUTOFF_MC - 3, sd = 9)))
near_below <- audit >= (CUTOFF_MC - GAME_WINDOW) & audit < CUTOFF_MC
game_mask <- near_below & (Z_rep > 0) & (runif(N_RDD_MC) < p_game)
if (any(game_mask)) {
cross_gap <- CUTOFF_MC - audit[game_mask] + 0.5
audit[game_mask] <- pmin(100, audit[game_mask] + cross_gap + runif(sum(game_mask), 0, 7))
}
WTP_mc <- pmax(1, pmin(10, 5.5 + 1.5 * Z_rep + rnorm(N_RDD_MC, 0, 0.30)))
D <- as.integer(audit >= CUTOFF_MC)
in_bw <- abs(audit - CUTOFF_MC) <= BW_RDD
rdd_p <- tryCatch({
if (sum(in_bw) < 10 || !any(D[in_bw] == 0) || !any(D[in_bw] == 1)) NA_real_
else {
df_bw <- data.frame(y=WTP_mc[in_bw], xc=audit[in_bw]-CUTOFF_MC, D=D[in_bw])
m <- lm(y ~ D * xc, data = df_bw)
2 * pt(-abs(summary(m)$coefficients["D","t value"]), df = m$df.residual)
}
}, error = function(e) NA_real_)
dens_p <- tryCatch({
rd <- rddensity(X = audit, c = CUTOFF_MC)
p <- rd$test$p_jk
if (is.null(p) || !is.finite(p)) 1.0 else as.numeric(p)[1]
}, error = function(e) 1.0)
c(rdd_p, dens_p)
})
data.frame(p_game=p_game, rdd_pval=mat[1,], dens_pval=mat[2,])
}
game_levels <- c(0, 0.10, 0.20, 0.30, 0.50, 0.70)
sim_rdd_mc <- map_dfr(game_levels, rdd_mc_one)
cat(sprintf("RDD simulation complete: %d conditions x %d sims = %d total.\n",
length(game_levels), N_SIM_RDD, nrow(sim_rdd_mc)))
RDD simulation complete: 6 conditions x 200 sims = 1200 total.
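The table below tabulates these runs. A sketch of the summary code, using the column names created in rdd_mc_one:
▶ Sketch: tabulating the RDD Monte Carlo
sim_rdd_mc |>
  group_by(p_game) |>
  summarise(`RDD Type I error` = mean(rdd_pval < 0.05, na.rm = TRUE),
            `Density test power` = mean(dens_pval < 0.05, na.rm = TRUE))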
| Gaming rate | RDD Type I error | Density test power |
|---|---|---|
| 0% | 6% | 5% |
| 10% | 9% | 2% |
| 20% | 12% | 6% |
| 30% | 37% | 13% |
| 50% | 70% | 24% |
| 70% | 96% | 40% |
As gaming becomes common, the density of audit scores just above 70 swells and the density just below depletes. The RDD treats this Z-sorted composition as a real treatment effect — Type I error climbs steeply toward 100%. The density test eventually picks up the bunching, but it has substantially less power than the spurious RDD finding itself. A researcher who checks rddensity, sees a borderline p-value, and proceeds will still produce highly confident false positives.
Why does Z go undetected? The audit score is observable; brand reputation is not. Standard covariate balance checks only test observed pre-treatment variables. If no measure of brand reputation was collected, the sorting is completely invisible — yet it inflates Type I error just as severely.
The link to the measurement discussion above. This is the discriminant validity failure made concrete. A running variable with strong discriminant validity would measure environmental quality and only environmental quality — cleanly separable from brand reputation, firm size, and lobbying capacity. When it fails discriminant validity, those adjacent constructs are embedded in the score itself, and near-threshold units differ not just on the measured dimension but on every construct the score inadvertently absorbs. The “covariate balance” check — the RDD’s primary diagnostic — then tests the wrong thing: it checks balance on the variables you measured, but it cannot detect imbalance on the latent constructs the running variable conflates with its target. An audit score that picks up brand reputation will produce near-threshold imbalance on brand reputation whether or not any active gaming occurred — simply because the score already sorted units by a contaminated composite. Better discriminant validity of the running variable is not just a measurement virtue; it is a precondition for the covariate-balance check to be meaningful at all.
Researcher Checklist: Regression Discontinuity Design
- Plot the raw data around the cutoff before estimating anything — the jump should be visible in the scatter, not manufactured by a flexible fit
- Run the density test (rddensity) for bunching at the threshold, remembering its limited power against smooth gaming
- Check covariate continuity for every observed pre-treatment variable
- Report bandwidth sensitivity across a wide range of \(h\)
- Interrogate the running variable’s measurement validity: content validity, discriminant validity, and measurement invariance (see above)
- Interpret the estimate as a LATE for units near the cutoff, not a population-wide effect
From RDD to Difference-in-Differences
RDD requires a sharp threshold in a single continuous score. When no such threshold exists — but you have panel data spanning a policy change that affected some units and not others — difference-in-differences uses time as the source of identification.
The identifying assumption shifts: instead of “units just above and below the threshold are exchangeable” (RDD), DiD assumes “treated and control units would have followed the same trend absent the treatment” (parallel trends). Both assumptions are forms of local exchangeability — in RDD it is spatial (near the cutoff); in DiD it is temporal (in the pre-treatment period). Both are empirically testable in part, and both can fail in ways that the available tests cannot detect.
Difference-in-Differences
Scott Cunningham’s Causal Inference: The Mixtape has a fantastic, and more thorough, deep dive into difference-in-differences — including staggered adoption and the recent critiques of TWFE — at mixtape.scunning.com/09-difference_in_differences.
The Dutch government introduces a mandatory eco-labelling law in 2022. Belgian supermarkets are not subject to it. We observe mean WTP in 15 Dutch stores (treated) and 20 Belgian stores (control).
Simulating Panel Data
Code
set.seed(2024)
n_dutch <- 15; n_belgian <- 20; n_stores_did <- n_dutch+n_belgian
years <- 2018:2024; T_treat <- 2022
# ── YOUR DATA: replace n_dutch / n_belgian with your counts of treated and
# control units; replace years with your time periods; replace T_treat with
# the first period when treatment took effect.
panel_did <- expand.grid(store=1:n_stores_did, year=years) |>
as_tibble() |>
mutate(
country = ifelse(store<=n_dutch,"Netherlands","Belgium"),
treated = (country=="Netherlands"),
post = (year>=T_treat),
did_indicator= treated & post,
store_fe = rnorm(n_stores_did, 0, 0.5)[store],
time_trend = 0.10*(year-2018),
country_fe = ifelse(country=="Netherlands", 0.30, 0),
treat_effect = did_indicator * 0.70,
WTP_did = 5.20 + store_fe + time_trend + country_fe + treat_effect + rnorm(n(), 0, 0.35)
)
panel_did |>
group_by(country, year) |>
summarise(mean_WTP=mean(WTP_did), .groups="drop") |>
ggplot(aes(x=year, y=mean_WTP, colour=country, group=country)) +
geom_vline(xintercept=T_treat-0.5, linetype="dashed", colour="grey50") +
geom_line(linewidth=1.2) + geom_point(size=2.5) +
annotate("text", x=T_treat+0.1, y=5.1,
label="Dutch eco-label\nlaw takes effect", hjust=0, size=3.2) +
scale_colour_manual(values=c("Netherlands"=clr_eco,"Belgium"=clr_ctrl)) +
labs(x="Year", y="Mean WTP ($)", colour=NULL,
title="DiD: parallel pre-trends are the key identifying assumption",
subtitle="Dutch and Belgian stores track each other in slope before 2022; Dutch stores jump after") +
theme_mod3()
DiD hinges on one unanswerable question: what would Dutch stores have done after 2022 if the eco-label law had never passed? We can’t observe this counterfactual. The parallel trends assumption supplies the answer: Belgian stores tell us.
More precisely, the assumption states that absent treatment, the Dutch–Belgian WTP gap would have remained constant — both series drifting up or down by identical amounts each year. Under this assumption:
\[\underbrace{\Delta Y_{\text{Dutch}}}_{\text{observed}} - \underbrace{\Delta Y_{\text{Belgian}}}_{\text{counterfactual proxy}} = \delta_{\text{DiD}}\]
What makes this assumption fail? Anything that makes Dutch and Belgian stores diverge for reasons unrelated to the law — e.g., Dutch shoppers becoming greener faster, or a Dutch-only economic shock. The parallel trends violation section below shows exactly this scenario.
The parallel trends assumption and Module 2: Parallel trends is to DiD what exchangeability is to experimentation. In Module 2, you saw that even carefully randomised experiments can fail exchangeability — attrition, demand effects, and differential compliance all erode it. In DiD, parallel trends is unverifiable for the post-treatment period. Pre-treatment trend tests provide some diagnostic evidence, but just as a successful manipulation check does not prove the exclusion restriction (Module 2), clean pre-trends do not prove that post-treatment trends would have remained parallel.
A Module 1 parallel: Fixed effects eliminate time-invariant confounders — but only if the measurement of the outcome variable is consistent across time and units. If the WTP question wording, scale, or survey protocol changes between periods (a measurement artefact from Module 1), the DiD estimate conflates treatment effects with measurement change.
Two-Way Fixed Effects (TWFE)
\[Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}\]
Code
# ── YOUR DATA: replace WTP_did with your outcome, did_indicator with the
# column you created as treated * post, and replace factor(store)/factor(year)
# with your unit and time identifiers.
did_twfe <- lm_robust(WTP_did ~ did_indicator + factor(store) + factor(year),
data=panel_did, clusters=store)
did_coef <- tidy(did_twfe) |> filter(term=="did_indicatorTRUE")
cat(sprintf(
"True DiD effect = 0.70\nEstimated DiD = %.3f (SE = %.3f, p = %.4f)\n95%% CI: [%.3f, %.3f]\n",
did_coef$estimate, did_coef$std.error, did_coef$p.value,
did_coef$conf.low, did_coef$conf.high
))
True DiD effect = 0.70
Estimated DiD = 0.631 (SE = 0.088, p = 0.0000)
95% CI: [0.452, 0.811]
| Term | What it controls for | Example here |
|---|---|---|
| \(\alpha_i\) — store fixed effects | Time-invariant differences between stores | Store size, neighbourhood demographics, pre-existing customer base |
| \(\lambda_t\) — year fixed effects | Year-level shocks that hit all stores equally | Euro-area inflation, global supply chain costs, macro consumer sentiment |
| \(\delta\) — the DiD estimator | How much more the treated group changed after treatment vs. the control group | Causal effect of the Dutch eco-label law |
The critical residual \(\varepsilon_{it}\): whatever the fixed effects can’t absorb — differential trends, time-varying store-specific shocks — ends up here. This is exactly where parallel-trends violations hide.
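A sanity check worth running: in a balanced two-group panel with a single adoption date, the TWFE coefficient equals the simple difference of group-by-period means. Computing it by hand from panel_did (this should reproduce the point estimate above, about $0.63):
▶ Sketch: the 2×2 DiD by hand
gm <- panel_did |>
  group_by(treated, post) |>
  summarise(m = mean(WTP_did), .groups = "drop")
# (post - pre) change for Dutch stores minus the same change for Belgian stores
with(gm, (m[treated & post] - m[treated & !post]) -
         (m[!treated & post] - m[!treated & !post]))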
Testing Parallel Trends: Event Study
Code
panel_did <- panel_did |>
mutate(year_rel=year-T_treat,
year_rel_fac=relevel(factor(year_rel), ref="-1"))
event_formula <- as.formula("WTP_did ~ treated:year_rel_fac + factor(store) + factor(year)")
did_event <- lm_robust(event_formula, data=panel_did, clusters=store)
event_tbl <- tidy(did_event) |>
filter(str_detect(term,"year_rel_fac")) |>
mutate(yr=as.integer(str_extract(term,"-?\\d+$")),
period=if_else(yr<0,"Pre-treatment","Post-treatment")) |>
add_row(yr=-1L, estimate=0, conf.low=0, conf.high=0, period="Reference (−1)")
event_tbl |>
ggplot(aes(x=yr, y=estimate, ymin=conf.low, ymax=conf.high, colour=period)) +
geom_hline(yintercept=0, linetype="dashed", colour="grey50") +
geom_vline(xintercept=-0.5, linetype="dotted", colour="grey60") +
geom_pointrange(size=0.7) +
scale_colour_manual(values=c("Pre-treatment"=clr_ctrl,"Post-treatment"=clr_eco,
"Reference (−1)"="grey40")) +
labs(x="Years relative to Dutch eco-label law (0 = 2022)", y="DiD coefficient ($)",
colour=NULL,
title="Event study: pre-treatment coefficients near zero — parallel trends supported",
subtitle="Post-treatment estimates rise consistently — evidence of a sustained law effect") +
theme_mod3()
Synthetic Difference-in-Differences
Synthetic DiD — like classical synthetic controls — constructs a weighted combination of untreated units designed to reproduce the treated unit’s pre-treatment trajectory. The pre-period fit can look compelling, but there is no way to verify that the synthetic world it generates is a sufficiently representative stand-in for the real counterfactual post-treatment. Use this approach when you have a credible pool of donor units and a long pre-treatment window — and always report sensitivity analyses varying the donor pool.
Code
sdid_wide <- panel_did |>
dplyr::select(store, year, WTP_did, treated) |>
pivot_wider(names_from=year, values_from=WTP_did) |>
arrange(treated)
N0 <- sum(!sdid_wide$treated); T0 <- sum(years < T_treat)
Y_matrix <- sdid_wide |> dplyr::select(-store,-treated) |> as.matrix()
sdid_est <- synthdid_estimate(Y_matrix, N0, T0)
sc_est <- sc_estimate(Y_matrix, N0, T0)
did_est <- did_estimate(Y_matrix, N0, T0)
tibble(
Method = c("Standard DiD","Synthetic Control","Synthetic DiD"),
Estimate = round(c(as.numeric(did_est),as.numeric(sc_est),as.numeric(sdid_est)), 3),
`True effect`= 0.70,
Bias = round(c(as.numeric(did_est),as.numeric(sc_est),as.numeric(sdid_est))-0.70, 3)
) |> knitr::kable(caption="All three estimators vs. the true effect of 0.70")| Method | Estimate | True effect | Bias |
|---|---|---|---|
| Standard DiD | 0.631 | 0.7 | -0.069 |
| Synthetic Control | 0.618 | 0.7 | -0.082 |
| Synthetic DiD | 0.618 | 0.7 | -0.082 |
Code
plot(sdid_est, se.method="placebo")
When Parallel Trends Fails — and Pre-Testing Cannot Always Save You
Parallel trends are not directly testable — you never observe what Dutch stores would have done without the eco-label law. What you can test is whether pre-treatment trajectories were parallel. The event study above is exactly this test. But two practical problems cripple it in most real datasets:
- Too few pre-treatment periods. Most DiD studies in marketing and management observe units for only 2–4 years before treatment. With so few periods, the event study has very low statistical power to detect violations.
- You collect whatever data exists. Pre-treatment archives are often limited — the number of pre-periods is determined by data availability, not statistical power considerations.
The Module 2 power connection: In Module 2, you saw how underpowered studies inflate apparent effect sizes. The same logic applies here in reverse: a pre-trend test with only two pre-periods has low power to detect a violation, so a non-significant result tells you very little. A failure to find pre-trend divergence with few observations is not evidence that trends were parallel — it is evidence that your test had insufficient power.
The event study tests whether observed pre-treatment outcomes moved in parallel. It does not test:
- Whether unmeasured confounders were evolving differently for treated and control units
- Whether the differential dynamic would have continued into the post-period absent treatment
- Whether a violation is simply too small to detect given the available pre-periods
Passing the pre-trend test is necessary but far from sufficient for causal identification.
A Realistic Parallel Trends Violation
Suppose Dutch stores were already attracting more environmentally conscious shoppers whose WTP grew $0.15/year faster than Belgian shoppers — not because of any law, but because of who was already shopping there. The true treatment effect is zero.
▶ Same violation: unmistakeable with 8 pre-periods, invisible with 2
set.seed(2025)
n_dutch_v <- 15; n_belgian_v <- 20; n_stores_v <- n_dutch_v + n_belgian_v
T_treat_v <- 2022; n_post_v <- 3
diff_slope_v <- 0.15
store_fes_v <- rnorm(n_stores_v, 0, 0.6)
make_violation_panel <- function(yrs, true_eff = 0) {
expand.grid(store = 1:n_stores_v, year = yrs) |> as_tibble() |>
mutate(
treated = as.numeric(store <= n_dutch_v),
country = ifelse(treated == 1, "Netherlands", "Belgium"),
post = as.numeric(year >= T_treat_v),
did_indicator = treated * post,
store_fe = store_fes_v[store],
diff_trend = treated * diff_slope_v * (year - T_treat_v),
WTP = 5.20 + store_fe + 0.08 * (year - T_treat_v) +
diff_trend + true_eff * did_indicator + rnorm(n(), 0, 0.40)
)
}
panel_many_v <- make_violation_panel(2014:2024)
panel_few_v <- make_violation_panel(2020:2024)
plot_pt_panel <- function(panel, subtitle) {
panel |> group_by(country, year) |>
summarise(WTP = mean(WTP), .groups = "drop") |>
ggplot(aes(x = year, y = WTP, colour = country, group = country)) +
geom_vline(xintercept = T_treat_v - 0.5, linetype = "dashed",
colour = "grey50", linewidth = 0.9) +
geom_line(linewidth = 1.2) + geom_point(size = 2.5) +
scale_colour_manual(values = c("Netherlands" = clr_eco, "Belgium" = clr_ctrl)) +
scale_x_continuous(breaks = unique(panel$year)) +
labs(x = NULL, y = "Mean WTP ($)", colour = NULL, subtitle = subtitle) +
theme_mod3()
}
p_many_v <- plot_pt_panel(panel_many_v,
"8 pre-periods: the differential trend is unmistakeable — you would never trust DiD here")
p_few_v <- plot_pt_panel(panel_few_v,
"2 pre-periods: SAME DGP — the violation is invisible and DiD proceeds unchallenged")
(p_many_v / p_few_v) +
plot_annotation(
title = "Parallel trends violation: Dutch WTP grows $0.15/yr faster than Belgian (true effect = $0)",
subtitle = "How many pre-periods you happen to collect determines whether you can even see the problem",
theme = theme(plot.title = element_text(size = 13, face = "bold"),
plot.subtitle = element_text(size = 11))
)
▶ DiD on 2-pre-period data: statistically significant, entirely spurious
did_viol <- lm_robust(WTP ~ did_indicator + factor(store) + factor(year),
data = panel_few_v, clusters = store)
dv <- tidy(did_viol) |> dplyr::filter(term == "did_indicator")
cat(sprintf(
"True treatment effect = $0.000\nDiD estimate = $%.3f (SE = %.3f, p = %.4f)\n95%% CI: [$%.3f, $%.3f]\n\nThis looks like a meaningful, significant eco-label effect.\nIt is entirely spurious. The violation was undetectable with 2 pre-periods.\n",
dv$estimate, dv$std.error, dv$p.value, dv$conf.low, dv$conf.high
))
True treatment effect = $0.000
DiD estimate = $0.376 (SE = 0.109, p = 0.0017)
95% CI: [$0.153, $0.599]
This looks like a meaningful, significant eco-label effect.
It is entirely spurious. The violation was undetectable with 2 pre-periods.
Monte Carlo: Pre-Test Power and Type I Error
▶ Monte Carlo: 300 sims × 6 pre-period counts (true effect = 0)
set.seed(2025)
N_SIM_PT <- 300
pt_sim_one <- function(n_pre) {
years_s <- c(seq(T_treat_v - n_pre, T_treat_v - 1),
T_treat_v:(T_treat_v + n_post_v - 1))
n_yrs <- length(years_s)
mat <- replicate(N_SIM_PT, {
n_obs <- n_stores_v * n_yrs
store_s <- rep(1:n_stores_v, each = n_yrs)
year_s <- rep(years_s, times = n_stores_v)
treated_s <- as.numeric(store_s <= n_dutch_v)
post_s <- as.numeric(year_s >= T_treat_v)
did_s <- treated_s * post_s
diff_s <- treated_s * diff_slope_v * (year_s - T_treat_v)
fe_s <- rnorm(n_stores_v, 0, 0.6)[store_s]
WTP_s <- 5.20 + fe_s + 0.08 * (year_s - T_treat_v) + diff_s + rnorm(n_obs, 0, 0.40)
df_s <- data.frame(WTP=WTP_s, did=did_s, store=factor(store_s),
year=factor(year_s), treated=treated_s, post=post_s, year_num=year_s)
fit_d <- lm(WTP ~ did + store + year, data = df_s)
pval_d <- coeftest(fit_d, vcovCL(fit_d, cluster = ~store))["did", 4]
if (n_pre >= 2) {
pre_s <- df_s[df_s$post == 0, ]
pre_s$yr_c <- pre_s$year_num - mean(pre_s$year_num)
fit_p <- lm(WTP ~ treated * yr_c + store, data = pre_s)
pval_p <- tryCatch(
coeftest(fit_p, vcovCL(fit_p, cluster = ~store))["treated:yr_c", 4],
error = function(e) 1.0)
} else {
pval_p <- 1.0
}
c(pval_d, pval_p)
})
data.frame(n_pre=n_pre, did_pval=mat[1,], pre_pval=mat[2,])
}
sim_all_pt <- map_dfr(1:6, pt_sim_one)
cat(sprintf("Simulation complete: %d total runs across 6 pre-period conditions.\n", nrow(sim_all_pt)))Simulation complete: 1800 total runs across 6 pre-period conditions.
| Pre-periods | Pre-test power (violation detected) | Sims passing pre-test | Overall Type I error | Conditional Type I error (given pass) |
|---|---|---|---|---|
| 1 | 0% | 300 / 300 | 34% | 34% |
| 2 | 4% | 288 / 300 | 78% | 77% |
| 3 | 23% | 230 / 300 | 97% | 98% |
| 4 | 59% | 122 / 300 | 100% | 100% |
| 5 | 92% | 24 / 300 | 100% | 100% |
| 6 | 98% | 5 / 300 | 100% | 100% |
- With 1 pre-period: the pre-trend test cannot be run at all, and the overall Type I error is already 34%, roughly seven times the nominal 5%.
- With 2–3 pre-periods: most violations slip through; conditional Type I error remains far above 5%.
- With 5–6 pre-periods: power improves substantially (92–98%), but the few violations that do slip through still produce a conditional Type I error of essentially 100%.
What to do: Report the number of pre-treatment periods and be honest about the test’s power. Run placebo DiDs. Use Rambachan & Roth’s (2023) HonestDiD sensitivity analysis, which reframes the question from “is there a violation?” to “how large a violation would matter?” Treat a passing event study as weak evidence, not proof.
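To make the placebo-DiD suggestion concrete, here is a minimal sketch on the 8-pre-period panel from above: restrict to pre-treatment years and pretend the law passed two years earlier. No real treatment has occurred in this window, so a significant “effect” is evidence against parallel trends. The placebo date T_treat_v − 2 is an illustrative choice, not part of the original analysis.
▶ Placebo DiD on pre-treatment years only
placebo <- panel_many_v |>
  dplyr::filter(year < T_treat_v) |>                      # pre-treatment window only
  mutate(placebo_post = as.integer(year >= T_treat_v - 2),
         placebo_did  = treated * placebo_post)
did_placebo <- lm_robust(WTP ~ placebo_did + factor(store) + factor(year),
                         data = placebo, clusters = store)
tidy(did_placebo) |> dplyr::filter(term == "placebo_did")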
How an Unobserved Variable Inflates Type I Error
A more dangerous violation involves an unobserved variable whose effect on the outcome switches on only after treatment. Because it leaves no pre-treatment footprint, pre-trend tests are completely blind to it.
▶ Monte Carlo: Type I error vs. omitted variable strength γ (300 sims, 2 pre-periods)
set.seed(2026)
N_SIM_OV <- 300
gamma_grid <- seq(0, 0.80, by = 0.10)
pt_ov_one <- function(gamma) {
years_ov <- c(2020, 2021, 2022, 2023, 2024)
T_ov <- 2022
mat <- replicate(N_SIM_OV, {
U_i <- c(rnorm(n_dutch_v, +0.5, 1), rnorm(n_belgian_v, -0.5, 1))
fe <- rnorm(n_stores_v, 0, 0.5)
panel <- expand.grid(store = seq_len(n_stores_v), year = years_ov) |>
as_tibble() |>
mutate(
treated = as.integer(store <= n_dutch_v),
post = as.integer(year >= T_ov),
did = treated * post,
U = U_i[store],
WTP = 5.20 + fe[store] + 0.08 * (year - T_ov) +
gamma * U * post + rnorm(n(), 0, 0.40)
)
fit_d <- lm(WTP ~ did + factor(store) + factor(year), data = panel)
pval_d <- tryCatch(
coeftest(fit_d, vcovCL(fit_d, cluster = ~store))["did", 4],
error = function(e) NA_real_)
pre <- panel[panel$post == 0, ]
pre$yr_c <- pre$year - mean(pre$year)
fit_p <- lm(WTP ~ treated * yr_c + factor(store), data = pre)
pval_p <- tryCatch(
coeftest(fit_p, vcovCL(fit_p, cluster = ~store))["treated:yr_c", 4],
error = function(e) 1.0)
c(pval_d, pval_p)
})
data.frame(gamma=gamma, did_pval=mat[1,], pre_pval=mat[2,])
}
sim_ov <- map_dfr(gamma_grid, pt_ov_one)
cat(sprintf("Omitted variable simulation: %d conditions × %d sims = %d total.\n",
length(gamma_grid), N_SIM_OV, nrow(sim_ov)))
Omitted variable simulation: 9 conditions × 300 sims = 2700 total.
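The summarisation chunk for these 2,700 runs was likewise not shown; a minimal sketch that computes the rejection rates described below:
▶ Rejection rates by confounder strength γ
sim_ov |>
  group_by(gamma) |>
  summarise(
    did_type1   = mean(did_pval < 0.05, na.rm = TRUE),  # false-positive DiD estimates
    pretest_rej = mean(pre_pval < 0.05),                # stays near the nominal 5%
    .groups = "drop"
  )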
DiD Type I error rises steeply as γ increases, reaching near-certainty of a false positive by γ = 0.70. The pre-trend test's rejection rate, meanwhile, stays flat at the nominal 5%: it has no power whatsoever against this violation. In the DGP, U enters the outcome only through the post-treatment interaction (the gamma * U * post term in the code), so it leaves no pre-treatment footprint; even a level effect of U would be absorbed by the store fixed effects. Only when U × post switches on after the treatment date does the confounder do damage, and by then the event study has nothing left to test.
The Parallel Trends Thought Experiment
The formal diagnostics above — event studies, omitted-variable simulations, HonestDiD sensitivity — are essential tools. But before reaching for any of them, it is worth stepping back and asking the foundational question with brutal honesty: how plausible is it that parallel trends holds at all?
Here is a thought experiment that makes the implausibility concrete.
The asteroid analogy. Think of the outcome variable — consumer WTP — as an asteroid moving through space. Time is the x-axis: the asteroid moves forward through time, and its trajectory encodes the trend in WTP. Now think of every factor that shapes WTP — consumer income growth, eco-consciousness trends, media coverage of sustainability, competitive dynamics, supply chain costs, regulatory climate, brand lifecycle — as additional dimensions. The asteroid is moving through a \(k\)-dimensional space, where \(k\) equals the number of forces acting on WTP.
Now a collision happens. The Dutch government passes the eco-label law. This collision affects Dutch stores but not Belgian stores. The collision splits our single asteroid into two pieces that separate along one dimension — the treated group (Dutch stores) receives the treatment shock while the control group (Belgian stores) does not.
The parallel trends assumption asks: after the collision, would the two pieces have continued moving on exactly parallel trajectories through this \(k\)-dimensional space?
In three-dimensional physics — where \(k = 3\) — this would almost never happen. A collision that separates two objects along one dimension will almost certainly impart different forces along the other dimensions as well. The pieces move apart not just in the y-direction but in x and z too. The probability that two colliding objects continue on perfectly parallel paths in three-dimensional space is essentially zero, because that would require the off-axis force components to be exactly equal and opposite.
Now move to \(k\) dimensions. Every additional dimension multiplies the implausibility. Parallel movement requires equal rates of change along all \(k - 1\) non-treatment dimensions simultaneously. If the probability of parallelism along each dimension is \(p < 1\), the joint probability of parallelism across all \(k - 1\) dimensions is \(p^{k-1} \to 0\) as \(k\) grows. The assumption becomes exponentially more implausible as the outcome variable is richer and more multi-determined.
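To put illustrative numbers on this (the per-dimension probability p = 0.9 is an assumption chosen for illustration, not an estimate):
▶ Joint probability of parallelism across k − 1 dimensions
p <- 0.9                 # assumed per-dimension probability of parallelism
k <- c(3, 10, 25, 50)    # number of forces acting on WTP
setNames(round(p^(k - 1), 4), paste0("k=", k))
#    k=3   k=10   k=25   k=50
# 0.8100 0.3874 0.0798 0.0057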
Think about what DiD is actually claiming: that the only thing that made Dutch and Belgian stores diverge after 2022 was the eco-label law — and that, absent the law, every other force acting on WTP (income shocks, consumer attitude trends, competitive retailer behaviour, media effects, seasonal patterns, product mix evolution) would have moved Dutch and Belgian stores at exactly the same rate.
This is not a modest claim. It requires that every dimension of the data-generating process for WTP was either (a) equally shared by Dutch and Belgian stores, or (b) perfectly cancelled by other forces. In reality, the Dutch law was passed because Dutch consumers were already moving in a particular direction. The very political economy that produced the treatment is almost certainly correlated with the trajectory of WTP that would have occurred without the treatment.
The connection to Module 2, Part 2. In Module 2, Part 2, you saw that even in a carefully controlled laboratory experiment, treatments almost never input only a single latent construct. A persuasive message designed to target environmental concern will simultaneously activate social norms, identity, guilt, and product-quality inferences. A pricing manipulation targeting perceived value also inputs feelings of financial constraint and quality signalling. Module 2, Part 2 demonstrated this through the exclusion restriction: a treatment variable is like an instrument — it is valid only if it operates through a single pathway, and that is extremely difficult to guarantee even under tight experimental control.
Now consider what a natural policy shock does. The Dutch eco-label law was not designed by researchers to isolate a single causal pathway. It was passed by a legislature responding to voter sentiment, industry lobbying, coalition negotiations, and macroeconomic conditions. That single event simultaneously changed the information environment for eco-conscious consumers, the competitive dynamics among Dutch retailers, the media and advertising landscape around sustainability, and the social-norm signal about what kind of purchasing is expected. Each of these changes operates on WTP through a different causal pathway. The probability that a policy shock in a complex, endogenous social system inputs only one dimension of the outcome is, practically speaking, zero.
This is the deepest reason to distrust parallel trends: it is not just that treated and control groups might have had different trends — it is that the treatment itself almost certainly influenced multiple dimensions of WTP simultaneously, and each of those dimensions evolves differently in Dutch and Belgian stores. Parallel trends requires that all of these multi-dimensional influences happened to cancel out perfectly in the counterfactual, leaving only the badge-receipt pathway as the “true” effect. That is an extraordinarily demanding assumption.
A secondary connection to Module 2, Part 3. The exchangeability argument from Module 2, Part 3 reinforces the point from a different angle: even deliberate random assignment does not guarantee exchangeability as the number of dimensions of the outcome construct grows. DiD is in a harder position still — the “assignment mechanism” (which government passed a law, and when) was driven by the same forces that shape WTP trajectories, not by a randomisation procedure. The dimensions of WTP and the determinants of policy timing are not independent. An unrandom shuffle with a large deck is not going to produce anything close to exchangeability.
What this means for practice. This is not an argument that DiD is never useful — it is an argument for intellectual honesty about the assumption that underlies it.
- Name the dimensions explicitly. For your specific context, list the forces that could plausibly drive your outcome differently in treated and control groups over time. Each one is a dimension along which the parallel trends assumption could fail.
- Ask: why was this policy passed here and not there? The answer will almost always reveal correlates with the outcome trajectory. Those correlates are the potential violations.
- Treat parallel trends as a maintained assumption, not a verified fact. No test can establish it; it can only be falsified. Reporting a non-significant event study with two pre-periods as evidence of parallel trends is like reporting one shuffle of a 52-card deck as evidence of randomness — the test had no power to detect the problem.
- Use DiD for what it is: a productive disciplining device, not a causal proof machine. The DiD framework forces you to be explicit about what the counterfactual is, who the control group is, and what parallel trends means for your context. That discipline has value even when the assumption is imperfect. But the conclusions should be presented as conditional on the assumption — not as established causal facts.