18  Part 4: Difference-in-Differences

Tip: How to use the code in this section

The code blocks throughout this section contain inline annotations explaining what data structures go in, what the key function arguments control, and what to look for in the output. Click any “▶ Code” triangle to expand a block and read the annotations alongside the code. Treat them as a guided checklist: wherever you see a comment starting with # ──, it marks a decision point you would need to revisit with your own data.

18.1 The Setup

The Dutch government introduces a mandatory eco-labelling law in 2022. Belgian supermarkets are not subject to it. We observe mean WTP in 15 Dutch stores (treated) and 20 Belgian stores (control).

18.2 Simulating Panel Data

Code
set.seed(2024)
n_dutch <- 15; n_belgian <- 20; n_stores_did <- n_dutch+n_belgian
years <- 2018:2024; T_treat <- 2022
# ── YOUR DATA: replace n_dutch / n_belgian with your counts of treated and
#    control units; replace years with your time periods; replace T_treat with
#    the first period when treatment took effect. Your data frame needs at minimum:
#    a unit ID, a time period, a 0/1 treated column, a 0/1 post column, and your
#    outcome variable.

panel_did <- expand.grid(store=1:n_stores_did, year=years) |>
  as_tibble() |>
  mutate(
    country      = ifelse(store<=n_dutch,"Netherlands","Belgium"),
    treated      = (country=="Netherlands"),
    post         = (year>=T_treat),
    did_indicator= treated & post,
    store_fe     = rnorm(n_stores_did, 0, 0.5)[store],
    time_trend   = 0.10*(year-2018),
    country_fe   = ifelse(country=="Netherlands", 0.30, 0),
    treat_effect = did_indicator * 0.70,
    WTP_did      = 5.20 + store_fe + time_trend + country_fe + treat_effect + rnorm(n(), 0, 0.35)
  )
# ── CHECK: plot group-mean outcomes by year (as below) — the two groups should
#    run near-parallel BEFORE T_treat; a visible gap opening only after T_treat
#    is consistent with parallel trends. Any pre-treatment divergence is a red flag.

panel_did |>
  group_by(country, year) |>
  summarise(mean_WTP=mean(WTP_did), .groups="drop") |>
  ggplot(aes(x=year, y=mean_WTP, colour=country, group=country)) +
  geom_vline(xintercept=T_treat-0.5, linetype="dashed", colour="grey50") +
  geom_line(linewidth=1.2) + geom_point(size=2.5) +
  annotate("text", x=T_treat+0.1, y=5.1,
           label="Dutch eco-label\nlaw takes effect", hjust=0, size=3.2) +
  scale_colour_manual(values=c("Netherlands"=clr_eco,"Belgium"=clr_ctrl)) +
  labs(x="Year", y="Mean WTP ($)", colour=NULL,
       title="DiD: parallel pre-trends are the key identifying assumption",
       subtitle="Dutch and Belgian stores track each other in slope before 2022; Dutch stores jump after") +
  theme_mod3()

Note: What “parallel trends” actually means — and why Belgium is the key

DiD hinges on one unanswerable question: what would Dutch stores have done after 2022 if the eco-label law had never passed? We can’t observe this counterfactual. The parallel trends assumption supplies the answer: Belgian stores tell us.

More precisely, the assumption states that absent treatment, the Dutch–Belgian WTP gap would have remained constant — both series drifting up or down by identical amounts each year. Under this assumption:

\[\underbrace{\Delta Y_{\text{Dutch}}}_{\text{observed}} - \underbrace{\Delta Y_{\text{Belgian}}}_{\text{counterfactual proxy}} = \delta_{\text{DiD}}\]

Reading the plot above: The two lines run nearly parallel from 2018–2021 (same slope, constant gap). After 2022 the Dutch line separates from the Belgian. DiD attributes that separation — roughly $0.70 — to the eco-label law, after subtracting whatever movement Belgium showed over the same period (which would have happened anyway).
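This arithmetic can be checked directly on the simulated panel: the 2×2 difference of group means is the DiD estimate in its rawest form. A sketch using `panel_did` from above (no new assumptions beyond the tidyverse functions already loaded):

```r
# Raw 2x2 DiD: (Dutch post - Dutch pre) - (Belgian post - Belgian pre)
means_2x2 <- panel_did |>
  group_by(treated, post) |>
  summarise(m = mean(WTP_did), .groups = "drop")

delta_dutch   <- with(means_2x2, m[treated & post]  - m[treated & !post])
delta_belgian <- with(means_2x2, m[!treated & post] - m[!treated & !post])
delta_dutch - delta_belgian   # should land near the simulated effect of 0.70
```

In a balanced panel with two groups and a common treatment date, this difference of differences should coincide with the TWFE regression coefficient estimated in the next section.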

What makes this assumption fail? Anything that makes Dutch and Belgian stores diverge for reasons unrelated to the law — e.g., Dutch shoppers becoming greener faster, or a Dutch-only economic shock. The parallel trends violation section below shows exactly this scenario.

The parallel trends assumption and Module 2: Parallel trends is to DiD what exchangeability is to experimentation. In Module 2, you saw that even carefully randomised experiments can fail exchangeability — attrition, demand effects, and differential compliance all erode it. In DiD, parallel trends is unverifiable for the post-treatment period (we never observe what Dutch stores would have done without the law). Pre-treatment trend tests provide some diagnostic evidence, but just as a successful manipulation check does not prove the exclusion restriction (Module 2), clean pre-trends do not prove that post-treatment trends would have remained parallel. It is an assumption, not a result.

A Module 1 parallel: Fixed effects eliminate time-invariant confounders — but only if the measurement of the outcome variable is consistent across time and units. If the WTP question wording, scale, or survey protocol changes between periods (a measurement artefact from Module 1), the DiD estimate conflates treatment effects with measurement change. In practice, always verify that outcome measurement is held constant across waves.

18.3 Two-Way Fixed Effects (TWFE)

\[Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}\]

Code
# ── YOUR DATA: replace WTP_did with your outcome, did_indicator with the
#    column you created as treated * post, and replace factor(store)/factor(year)
#    with your unit and time identifiers. The formula adds one dummy per unit
#    and one dummy per time period — this is the two-way fixed effects (TWFE) model.
# ── KEY ARGS: clusters=store clusters standard errors at the unit level —
#    change "store" to whatever your unit identifier column is called.
#    If your units are nested (e.g., stores within chains), cluster at the
#    higher level (chains) to account for within-cluster correlation.
did_twfe <- lm_robust(WTP_did ~ did_indicator + factor(store) + factor(year),
                      data=panel_did, clusters=store)
did_coef <- tidy(did_twfe) |> filter(term=="did_indicatorTRUE")
# ── CHECK: the term for did_indicator should be "did_indicatorTRUE" if the
#    column is logical; it will be "did_indicator" if numeric (0/1).
#    The p-value and CI are cluster-robust — rely on these, not the naive SEs.
cat(sprintf(
  "True DiD effect  = 0.70\nEstimated DiD    = %.3f  (SE = %.3f, p = %.4f)\n95%% CI: [%.3f, %.3f]\n",
  did_coef$estimate, did_coef$std.error, did_coef$p.value,
  did_coef$conf.low, did_coef$conf.high
))
True DiD effect  = 0.70
Estimated DiD    = 0.631  (SE = 0.088, p = 0.0000)
95% CI: [0.452, 0.811]
Note: What the two sets of fixed effects absorb

\[Y_{it} = \underbrace{\alpha_i}_{\text{store FE}} + \underbrace{\lambda_t}_{\text{year FE}} + \delta \cdot D_{it} + \varepsilon_{it}\]

Each type of fixed effect removes a specific source of confounding:

| Term | What it controls for | Example here |
|---|---|---|
| \(\alpha_i\) — store fixed effects | Time-invariant differences between stores | Store size, neighbourhood demographics, pre-existing customer base |
| \(\lambda_t\) — year fixed effects | Year-level shocks that hit all stores equally | Euro-area inflation, global supply chain costs, macro consumer sentiment |
| \(\delta\) — the DiD estimator | How much more the treated group changed after treatment vs. the control group | Causal effect of the Dutch eco-label law |

The critical residual \(\varepsilon_{it}\): whatever the fixed effects can’t absorb — differential trends (treated and control groups moving at different rates), time-varying store-specific shocks — ends up here. This is exactly where parallel-trends violations hide. Adding pre-treatment event-study coefficients is a way to interrogate whether \(\varepsilon_{it}\) contains systematic pre-treatment divergence.
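The absorption argument has a mechanical counterpart in the Frisch–Waugh–Lovell theorem: for a balanced panel like this one, applying the two-way within transformation \(x_{it} - \bar{x}_{i} - \bar{x}_{t} + \bar{x}\) to both outcome and treatment, then regressing one on the other, reproduces the TWFE point estimate. A sketch using base R's `ave()`:

```r
# Two-way within transformation (valid for balanced panels like this one)
tw_demean <- function(x, i, t) x - ave(x, i) - ave(x, t) + mean(x)

y_dd <- with(panel_did, tw_demean(WTP_did, store, year))
d_dd <- with(panel_did, tw_demean(as.numeric(did_indicator), store, year))

coef(lm(y_dd ~ 0 + d_dd))   # point estimate matches did_twfe above
```

The standard errors from this shortcut are wrong — it ignores the absorbed degrees of freedom and the clustering — so use it only to see what the fixed effects are doing, not for inference.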

18.4 Testing Parallel Trends: Event Study

Code
# ── YOUR DATA: year_rel = your_time_var - T_treat (e.g., calendar year minus
#    first treatment year). The reference period is set to -1 (the year before
#    treatment); change ref="-1" if your calendar doesn't align with this convention.
panel_did <- panel_did |>
  mutate(year_rel=year-T_treat,
         year_rel_fac=relevel(factor(year_rel), ref="-1"))

# ── KEY ARGS: the formula treated:year_rel_fac interacts the treated indicator
#    with each relative-year factor — this produces one coefficient per period
#    relative to treatment. factor(store) and factor(year) absorb unit and time
#    fixed effects. clusters=store clusters SEs at the unit level.
event_formula <- as.formula("WTP_did ~ treated:year_rel_fac + factor(store) + factor(year)")
did_event     <- lm_robust(event_formula, data=panel_did, clusters=store)
# ── CHECK: in the event study plot, pre-treatment coefficients (yr < 0) should
#    be near zero with CIs crossing zero — this is evidence for parallel trends.
#    Any significant pre-trend (p < .05 before treatment) is a red flag that
#    the treatment and control groups were already diverging before the policy.

event_tbl <- tidy(did_event) |>
  filter(str_detect(term,"year_rel_fac")) |>
  mutate(yr=as.integer(str_extract(term,"-?\\d+$")),
         period=if_else(yr<0,"Pre-treatment","Post-treatment")) |>
  add_row(yr=-1L, estimate=0, conf.low=0, conf.high=0, period="Reference (−1)")

event_tbl |>
  ggplot(aes(x=yr, y=estimate, ymin=conf.low, ymax=conf.high, colour=period)) +
  geom_hline(yintercept=0, linetype="dashed", colour="grey50") +
  geom_vline(xintercept=-0.5, linetype="dotted", colour="grey60") +
  geom_pointrange(size=0.7) +
  scale_colour_manual(values=c("Pre-treatment"=clr_ctrl,"Post-treatment"=clr_eco,
                               "Reference (−1)"="grey40")) +
  labs(x="Years relative to Dutch eco-label law (0 = 2022)", y="DiD coefficient ($)",
       colour=NULL,
       title="Event study: pre-treatment coefficients near zero — parallel trends supported",
       subtitle="Post-treatment estimates rise consistently — evidence of a sustained law effect") +
  theme_mod3()

Note: How to read an event study plot

Each point is the DiD coefficient for that relative year — how much the Dutch–Belgian gap in that year differed from the gap in the reference year (here: year \(-1\), the year just before treatment).

Pre-treatment years (left of the dotted line): If parallel trends holds, these should be near zero with overlapping CIs. A systematic upward or downward trend here is evidence of a pre-existing differential trend — the most direct threat to identification.

Post-treatment years (right of the dotted line): These estimate the treatment effect over time. A sustained positive trajectory (as here) suggests a genuine, growing effect of the law rather than a one-off jump.

What to worry about: (1) Any pre-period estimate far from zero and statistically significant; (2) CIs so wide they can’t rule out large violations — this happens when you have very few pre-periods (see the next section).
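Eyeballing individual CIs invites cherry-picking; a joint Wald test that all pre-treatment coefficients are simultaneously zero gives a single summary. A sketch built on the cluster-robust variance matrix stored in `did_event` (the `grep` pattern assumes coefficient names like `treatedTRUE:year_rel_fac-3`; inspect `names(coef(did_event))` on your own fit first):

```r
# Joint Wald test: H0 = all pre-treatment event-study coefficients are zero
pre_terms <- grep("treatedTRUE:year_rel_fac-", names(coef(did_event)), value = TRUE)
pre_terms <- pre_terms[!is.na(coef(did_event)[pre_terms])]   # drop aliased terms

b <- coef(did_event)[pre_terms]
V <- vcov(did_event)[pre_terms, pre_terms]
W <- as.numeric(t(b) %*% solve(V, b))                        # Wald statistic

pchisq(W, df = length(b), lower.tail = FALSE)   # small p flags a pre-trend
```

A large p-value here carries the same caveat as the plot: with few pre-periods the test has little power, as Section 18.6 demonstrates.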

18.5 Synthetic Difference-in-Differences

Warning: The synthetic counterfactual is unverifiable

Synthetic DiD — like classical synthetic controls — constructs a weighted combination of untreated units designed to reproduce the treated unit’s pre-treatment trajectory. The pre-period fit can look compelling, but there is no way to verify that the synthetic world it generates is a sufficiently representative stand-in for the real counterfactual post-treatment. The method assumes the same weighting that matched pre-treatment trends would have continued to match post-treatment trends in the absence of the law. Unobserved shocks, structural breaks, or compositional changes that affect treated and synthetic units differently after treatment will bias the estimate without any diagnostic flag. Use this approach when you have a credible pool of donor units and a long pre-treatment window — and always report sensitivity analyses varying the donor pool.

Code
# ── YOUR DATA: synthdid_estimate() requires a (units × time) outcome matrix
#    arranged so that control units come FIRST (rows 1:N0) and pre-treatment
#    periods come FIRST (columns 1:T0). pivot_wider() here creates that layout;
#    replicate with your own unit ID, time, and outcome columns.
# ── KEY ARGS: N0 = number of control units (rows before treated units);
#    T0 = number of pre-treatment time periods (columns before treatment onset).
#    These must exactly match the row/column layout of Y_matrix.
sdid_wide <- panel_did |>
  dplyr::select(store, year, WTP_did, treated) |>
  pivot_wider(names_from=year, values_from=WTP_did) |>
  arrange(treated)

N0 <- sum(!sdid_wide$treated); T0 <- sum(years < T_treat)
Y_matrix <- sdid_wide |> dplyr::select(-store,-treated) |> as.matrix()

sdid_est <- synthdid_estimate(Y_matrix, N0, T0)
sc_est   <- sc_estimate(Y_matrix, N0, T0)
did_est  <- did_estimate(Y_matrix, N0, T0)

tibble(
  Method       = c("Standard DiD","Synthetic Control","Synthetic DiD"),
  Estimate     = round(c(as.numeric(did_est),as.numeric(sc_est),as.numeric(sdid_est)), 3),
  `True effect`= 0.70,
  Bias         = round(c(as.numeric(did_est),as.numeric(sc_est),as.numeric(sdid_est))-0.70, 3)
) |> knitr::kable(caption="All three estimators vs. the true effect of 0.70")
All three estimators vs. the true effect of 0.70

| Method | Estimate | True effect | Bias |
|---|---|---|---|
| Standard DiD | 0.631 | 0.7 | -0.069 |
| Synthetic Control | 0.618 | 0.7 | -0.082 |
| Synthetic DiD | 0.618 | 0.7 | -0.082 |
Code
# ── CHECK: compare the three estimates — large divergence between standard DiD
#    and Synthetic DiD suggests the parallel-trends assumption is strained and
#    the synthetic approach is finding a materially different counterfactual.
#    Use plot(sdid_est) to visualise the synthetic control trajectory.
Code
plot(sdid_est, se.method="placebo")
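One concrete form of the donor-pool sensitivity analysis recommended in the warning above is a leave-one-out loop: drop each control unit in turn and re-estimate. A sketch assuming `Y_matrix`, `N0`, and `T0` from the block above (controls occupy rows 1 to N0):

```r
# Leave-one-out donor sensitivity: re-estimate dropping each control unit
loo_sdid <- sapply(seq_len(N0), function(i) {
  as.numeric(synthdid_estimate(Y_matrix[-i, , drop = FALSE], N0 - 1L, T0))
})

range(loo_sdid)   # a wide range means the estimate leans on particular donors
```

If removing a single Belgian store moves the estimate materially, the synthetic counterfactual is being carried by that store, and the headline number deserves less weight.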

18.6 When Parallel Trends Fails — and Pre-Testing Cannot Always Save You

The parallel trends assumption is not directly testable — you never observe what Dutch stores would have done without the eco-label law. What you can test is whether pre-treatment trajectories were parallel. The event study above is exactly this test. But two practical problems cripple it in most real datasets:

  1. Too few pre-treatment periods. Most DiD studies in marketing and management observe units for only 2–4 years before treatment. With so few periods, the event study has very low statistical power to detect violations.
  2. You collect whatever data exists. Pre-treatment archives are often limited — the number of pre-periods is determined by data availability, not statistical power considerations.

The Module 2 power connection: In Module 2, you saw how underpowered studies inflate apparent effect sizes — only the most extreme estimates cross the significance threshold, leading to an inflated published literature. The same logic applies here in reverse: a pre-trend test with only two pre-periods has low power to detect a violation, so a non-significant result tells you very little. A failure to find pre-trend divergence with few observations is not evidence that trends were parallel — it is evidence that your test had insufficient power to catch the divergence that may have been there. This is one reason Callaway–Sant’Anna and other modern DiD estimators require researchers to be explicit about the number of clean pre-periods available.
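For reference, a minimal Callaway–Sant'Anna fit on the simulated panel, assuming the `did` package is installed. With a single treatment cohort, as here, it mainly illustrates the interface; its real value shows up with staggered adoption across cohorts:

```r
library(did)   # Callaway & Sant'Anna (2021) estimator

cs_dat <- panel_did |>
  mutate(store = as.integer(store),
         gname = ifelse(treated, T_treat, 0))   # 0 = never treated

cs_fit <- att_gt(yname = "WTP_did", tname = "year", idname = "store",
                 gname = "gname", data = cs_dat)

aggte(cs_fit, type = "dynamic")   # event-study aggregation of group-time ATTs
```

The `gname` column encodes the first treatment period per unit, which is how the estimator tracks how many clean pre-periods each cohort contributes.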

Important: What the pre-trend test does — and does not — establish

The event study tests whether observed pre-treatment outcomes moved in parallel. It does not test:

  • Whether unmeasured confounders were evolving differently for treated and control units
  • Whether the differential dynamic would have continued into the post-period absent treatment
  • Whether a violation is simply too small to detect given the available pre-periods

Passing the pre-trend test is necessary but far from sufficient for causal identification.

18.6.1 A Realistic Parallel Trends Violation

Suppose Dutch stores were already attracting more environmentally conscious shoppers whose WTP grew $0.15/year faster than Belgian shoppers — not because of any law, but because of who was already shopping there (selection). The true treatment effect is zero. The eco-label law is irrelevant. But this ongoing differential growth will trick DiD into finding one.

▶ Same violation: unmistakeable with 8 pre-periods, invisible with 2
set.seed(2025)
n_dutch_v  <- 15; n_belgian_v <- 20; n_stores_v <- n_dutch_v + n_belgian_v
T_treat_v  <- 2022; n_post_v <- 3
diff_slope_v <- 0.15   # Dutch stores grow $0.15/yr faster throughout (selection, not treatment)
store_fes_v  <- rnorm(n_stores_v, 0, 0.6)

make_violation_panel <- function(yrs, true_eff = 0) {
  expand.grid(store = 1:n_stores_v, year = yrs) |> as_tibble() |>
    mutate(
      treated       = as.numeric(store <= n_dutch_v),
      country       = ifelse(treated == 1, "Netherlands", "Belgium"),
      post          = as.numeric(year >= T_treat_v),
      did_indicator = treated * post,
      store_fe      = store_fes_v[store],
      diff_trend    = treated * diff_slope_v * (year - T_treat_v),
      WTP           = 5.20 + store_fe + 0.08 * (year - T_treat_v) +
                      diff_trend + true_eff * did_indicator + rnorm(n(), 0, 0.40)
    )
}

panel_many_v <- make_violation_panel(2014:2024)   # 8 pre-periods
panel_few_v  <- make_violation_panel(2020:2024)   # 2 pre-periods

plot_pt_panel <- function(panel, subtitle) {
  panel |> group_by(country, year) |>
    summarise(WTP = mean(WTP), .groups = "drop") |>
    ggplot(aes(x = year, y = WTP, colour = country, group = country)) +
    geom_vline(xintercept = T_treat_v - 0.5, linetype = "dashed",
               colour = "grey50", linewidth = 0.9) +
    geom_line(linewidth = 1.2) + geom_point(size = 2.5) +
    scale_colour_manual(values = c("Netherlands" = clr_eco, "Belgium" = clr_ctrl)) +
    scale_x_continuous(breaks = unique(panel$year)) +
    labs(x = NULL, y = "Mean WTP ($)", colour = NULL, subtitle = subtitle) +
    theme_mod3()
}

p_many_v <- plot_pt_panel(panel_many_v,
  "8 pre-periods: the differential trend is unmistakeable — you would never trust DiD here")
p_few_v  <- plot_pt_panel(panel_few_v,
  "2 pre-periods: SAME DGP — the violation is invisible and DiD proceeds unchallenged")

(p_many_v / p_few_v) +
  plot_annotation(
    title    = "Parallel trends violation: Dutch WTP grows $0.15/yr faster than Belgian (true effect = $0)",
    subtitle = "How many pre-periods you happen to collect determines whether you can even see the problem",
    theme    = theme(plot.title    = element_text(size = 13, face = "bold"),
                     plot.subtitle = element_text(size = 11))
  )

With 8 pre-periods, the violation is glaringly obvious — you would never run DiD here. With 2 pre-periods, the same violation is invisible. Yet the DiD estimate is equally spurious in both cases:

▶ DiD on 2-pre-period data: statistically significant, entirely spurious
did_viol <- lm_robust(WTP ~ did_indicator + factor(store) + factor(year),
                      data = panel_few_v, clusters = store)
dv <- tidy(did_viol) |> dplyr::filter(term == "did_indicator")
cat(sprintf(
  "True treatment effect  = $0.000\n\nDiD estimate           = $%.3f  (SE = %.3f,  p = %.4f)\n95%% CI: [$%.3f, $%.3f]\n\nThis looks like a meaningful, significant eco-label effect.\nIt is entirely spurious. The violation was undetectable with 2 pre-periods.\n",
  dv$estimate, dv$std.error, dv$p.value, dv$conf.low, dv$conf.high
))
True treatment effect  = $0.000

DiD estimate           = $0.376  (SE = 0.109,  p = 0.0017)
95% CI: [$0.153, $0.599]

This looks like a meaningful, significant eco-label effect.
It is entirely spurious. The violation was undetectable with 2 pre-periods.

18.6.2 Monte Carlo: Pre-Test Power and Type I Error

We now formalise this with 300 simulations per condition. True effect = $0 throughout. We vary the number of pre-treatment periods from 1 to 6 and track three quantities:

| Quantity | What it means |
|---|---|
| Pre-test power | How often does the pre-trend test detect the violation? (p < .05) |
| Overall Type I error | How often does DiD falsely reject H₀? |
| Conditional Type I error | Given the pre-test passes (p ≥ .05), how often does DiD still falsely reject? |

The conditional rate is what matters most: it is the false-positive rate faced by researchers who did their due diligence with an event study and got clean-looking pre-trends.

▶ Monte Carlo: 300 sims × 6 pre-period counts (true effect = 0)
set.seed(2025)
N_SIM_PT <- 300

pt_sim_one <- function(n_pre) {
  years_s <- c(seq(T_treat_v - n_pre, T_treat_v - 1),
               T_treat_v:(T_treat_v + n_post_v - 1))
  n_yrs <- length(years_s)

  mat <- replicate(N_SIM_PT, {
    n_obs     <- n_stores_v * n_yrs
    store_s   <- rep(1:n_stores_v, each = n_yrs)
    year_s    <- rep(years_s, times = n_stores_v)
    treated_s <- as.numeric(store_s <= n_dutch_v)
    post_s    <- as.numeric(year_s >= T_treat_v)
    did_s     <- treated_s * post_s
    diff_s    <- treated_s * diff_slope_v * (year_s - T_treat_v)
    fe_s      <- rnorm(n_stores_v, 0, 0.6)[store_s]
    WTP_s     <- 5.20 + fe_s + 0.08 * (year_s - T_treat_v) + diff_s + rnorm(n_obs, 0, 0.40)

    df_s <- data.frame(
      WTP     = WTP_s,
      did     = did_s,
      store   = factor(store_s),
      year    = factor(year_s),
      treated = treated_s,
      post    = post_s,
      year_num= year_s
    )

    # DiD p-value with clustered SEs
    fit_d  <- lm(WTP ~ did + store + year, data = df_s)
    pval_d <- coeftest(fit_d, vcovCL(fit_d, cluster = ~store))["did", 4]

    # Pre-trend test: treated x linear year interaction in pre-period only
    if (n_pre >= 2) {
      pre_s      <- df_s[df_s$post == 0, ]
      pre_s$yr_c <- pre_s$year_num - mean(pre_s$year_num)
      fit_p  <- lm(WTP ~ treated * yr_c + store, data = pre_s)
      ct_p   <- coeftest(fit_p, vcovCL(fit_p, cluster = ~store))
      pval_p <- tryCatch(ct_p["treated:yr_c", 4], error = function(e) 1.0)
    } else {
      pval_p <- 1.0   # Cannot test with only 1 pre-period — always "passes"
    }

    c(pval_d, pval_p)
  })  # 2 × N_SIM_PT matrix

  data.frame(n_pre    = n_pre,
             did_pval = mat[1, ],
             pre_pval = mat[2, ])
}

sim_all_pt <- map_dfr(1:6, pt_sim_one)
cat(sprintf("Simulation complete: %d total runs across 6 pre-period conditions.\n",
            nrow(sim_all_pt)))
Simulation complete: 1800 total runs across 6 pre-period conditions.

Monte Carlo summary (300 sims per condition, true effect = $0, differential trend = $0.15/yr). ‘Conditional’ = false positive rate among studies where the pre-trend test passed.

| Pre-periods | Pre-test power | Sims passing | Overall Type I | Conditional Type I |
|---|---|---|---|---|
| 1 | 0% | 300 / 300 | 34% | 34% |
| 2 | 4% | 288 / 300 | 78% | 77% |
| 3 | 23% | 230 / 300 | 97% | 98% |
| 4 | 59% | 122 / 300 | 100% | 100% |
| 5 | 92% | 24 / 300 | 100% | 100% |
| 6 | 98% | 5 / 300 | 100% | 100% |
Warning: What this means for practice

The key pattern:

  • With 1 pre-period: The test is impossible — it always “passes”. The false-positive rate here is 34%, nearly seven times the nominal 5% — a researcher following standard practice would be badly misled.
  • With 2–3 pre-periods: Pre-test power is low. Most violations slip through. Conditional Type I error is still far above the nominal 5%.
  • With 5–6 pre-periods: Power improves substantially. But violations that slip through the test still produce inflated Type I error.

The fundamental problem is not the test — it is the data. Researchers often have no choice but to use available archives, which may span only 1–3 pre-treatment periods. This is structurally similar to mediation: just as researchers rarely measure all possible confounders of the M→Y path, they rarely collect enough pre-treatment history to reliably test parallel trends.

What to do:

  • Report the number of pre-treatment periods and be honest about the test’s power
  • Run placebo DiDs — apply the treatment label to earlier years or non-treated units and check that “effects” are near zero
  • Use Rambachan & Roth’s (2023) HonestDiD sensitivity analysis, which asks: how large a pre-trend violation would be required to explain away your result? This reframes the question from “is there a violation?” (often unknowable) to “how large a violation would matter?” (estimable and reportable)
  • Treat a passing event study as weak evidence, not proof, of parallel trends
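The placebo-DiD recommendation above can be sketched with the panel from Section 18.2: restrict to pre-treatment years and pretend the law arrived in 2020 (a hypothetical placebo date). A near-zero, non-significant “effect” is the reassuring outcome:

```r
# Placebo-in-time DiD: pre-treatment years only, fake treatment date of 2020
placebo_dat <- panel_did |>
  filter(year < T_treat) |>
  mutate(placebo_did = treated & (year >= 2020))

placebo_fit <- lm_robust(WTP_did ~ placebo_did + factor(store) + factor(year),
                         data = placebo_dat, clusters = store)

tidy(placebo_fit) |> filter(str_detect(term, "placebo_did"))
# The placebo estimate should sit near zero with a CI comfortably crossing zero
```

The same logic works in space: assign the placebo label to a subset of Belgian stores and re-run. A “significant” placebo effect in either direction is a warning that the design, not the law, is generating the headline estimate.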