Part 2: Re-establishing Exchangeability

The thread from Module 2: exchangeability, now without randomisation

Every method in this section is trying to achieve the same thing as a randomised experiment — but without actually randomising. In Module 2 you saw that randomisation achieves exchangeability: treated and control groups are, in expectation, identical on every measured and unmeasured covariate. The methods here — regression adjustment, IPW, matching, doubly-robust estimation — all attempt to restore exchangeability after the fact, using only the measured covariates. Module 2 showed that even inside a well-designed experiment, exchangeability is more fragile than it looks (demand effects, attrition, compliance failures). In observational designs, the gap between “we measured a lot of covariates” and “we have achieved exchangeability” is far larger and entirely unverifiable. Every assumption below is a bet that the measured set is complete.

Module 1 raised the same issue from a different angle: if covariates are measured with error — if environmental concern is a noisy proxy for true eco-mindedness — then even conditioning on the observed score leaves residual confounding from the measurement error. Poor construct validity (Module 1) directly undermines the conditional independence assumption (Module 3). Getting the covariates right, in both selection and measurement quality, is a prerequisite for all the methods that follow.

What this chapter covers — and the type of selection each method addresses

This chapter covers a family of methods for drawing causal inferences from observational data. They are organized by the type of selection problem they are designed for:

Method	Selection type	Key assumption
Heckman selection model	Selection on unobservables	Valid instrument for participation
Tobit / censored regression	Censoring or corner solutions	Latent normal DGP, no genuine zeros
Regression adjustment	Selection on observables	Conditional ignorability (no hidden confounders)
IPW / matching / doubly robust	Selection on observables	Conditional ignorability + overlap
Synthetic control	Aggregate panel data	Pre-treatment parallel-trend fit

“Selection on observables” strictly refers to the regression adjustment, IPW, matching, and doubly robust rows — these require only that conditioning on measured covariates achieves exchangeability. The Heckman and synthetic control methods relax that assumption in different ways. Tobit addresses a related but distinct problem (censoring of the outcome, not selection into treatment). Each section opens with the assumption that licenses that particular approach.

Heckman Selection Models

The critical caveat — read this before the example

The Heckman model works only if you have a valid instrument for selection — a variable that (1) predicts whether someone participates, but (2) has no direct effect on the outcome Y. Finding such an instrument is the hardest part of applying a Heckman correction. When the instrument is weak or its exclusion restriction is implausible, the Heckman correction can produce estimates that are more biased than naive OLS.

What makes a good vs. bad instrument?

	Good instrument	Bad instrument
Example	Random financial incentive to complete the survey ($0–$4 bonus)	Whether the participant’s friend also signed up
Predicts participation?	Yes — higher bonus → more likely to complete	Yes — social influence → more likely to participate
Affects WTP directly?	No — the cash bonus is for completing the survey, not for the product	Yes — friends share consumption preferences, so friend’s sign-up correlates with eco-mindedness, which affects WTP
Why it works / fails	The incentive shifts who bothers to participate without changing how they evaluate the eco-label once in the study	The instrument correlates with the unobservable (eco-mindedness) you are trying to control for — it violates the exclusion restriction

A further technical requirement: the instrument should be continuous (or at least multi-valued). A binary instrument (e.g., reminder email yes/no) creates only a handful of unique covariate combinations in Stage 1, making the inverse Mills ratio (IMR) nearly a step-function of treatment — which causes severe collinearity in Stage 2 and destabilizes the coefficient estimates.

Always report the first-stage F-statistic (or probit z-score) for the instrument as evidence of its relevance. Notice the irony: the statistical fix for a broken exclusion restriction (Heckman) itself requires an exclusion restriction. The problems multiply as Y becomes more complex.

The exclusion restriction and Module 2: In Module 2, the exclusion restriction surfaced in a different guise — the idea that a randomised nudge or manipulation affects the outcome only through the intended treatment path. There, the manipulation check was your practical tool for verifying this. Here, the same logic applies but there is no experimental lever to check: you must argue, on substantive grounds, that your instrument (e.g., the financial incentive) has no plausible direct route to WTP. Module 2’s emphasis on careful manipulation design is precisely what makes Heckman instruments credible in practice — an instrument that was designed to shift participation without affecting evaluation is far more defensible than one selected post-hoc from available observational data.

The problem: suppose eco-label study participants were recruited via an environmental blog (self-selection). People who click through are systematically more environmentally conscious. The treatment effect you estimate confounds “eco-label response” with “environmentally conscious person response.”

The Heckman correction (Heckman, 1979) models the selection process explicitly, using a variable called the inverse Mills ratio (IMR) to account for the selection into the study:

Stage 1 (Selection equation): Regress participation (1 = participated, 0 = declined) on all variables that predict selection but do not directly affect Y. This requires an exclusion restriction in the selection model — a variable that predicts participation but not WTP. In the simulation below, we use a continuous financial incentive ($0–$4 bonus) randomly assigned to potential participants.
Stage 2 (Outcome equation): Regress Y on X and the IMR from Stage 1. The IMR absorbs the selection-induced correlation between treatment assignment and unobserved factors.

▶ Simulate Heckman correction for self-selection

set.seed(42)
N_hk  <- 1000   # pool of potential participants (larger N → more stable Heckman)

# ── Unobserved confounder ─────────────────────────────────────────────────────
# theta = latent eco-mindedness: affects BOTH survey completion AND WTP.
# It is NOT in the data (hence "unobserved"), so naive OLS cannot control for it.
theta <- rnorm(N_hk)

# ── Treatment assignment (random) ─────────────────────────────────────────────
# Eco vs. control randomly assigned to all N_hk potential participants.
eco   <- sample(rep(0:1, N_hk / 2))

# ── Instrument ────────────────────────────────────────────────────────────────
# incentive: a continuous financial incentive ($0–$4 bonus) randomly assigned
# to potential participants.  Higher incentive → more likely to complete the
# survey, but the incentive itself has NO direct effect on WTP for the product.
# This is the EXCLUSION RESTRICTION — a continuous instrument is essential here
# because a binary instrument (e.g., reminder email yes/no) would create only
# 4 unique (eco × reminder) combinations in Stage 1, making the IMR nearly a
# step-function of eco in Stage 2 and causing collinearity that destabilises
# the coefficient estimates.
incentive <- runif(N_hk, 0, 4)   # continuous $0–$4 bonus

# ── Selection into completion ──────────────────────────────────────────────────
# Key mechanism: the eco condition is intrinsically interesting to eco-minded
# (high-theta) people BUT ALSO to some low-theta "curious" people — so the eco
# arm selects in more low-theta completers than the control arm (where only
# truly motivated people bother).  This creates DOWNWARD bias in naive OLS.
# Keeping theta and eco effects moderate ensures the Heckman correction is
# tractable and does not overshoot due to finite-sample IMR estimation error.
sel_latent <- -0.30 +
  0.55 * theta     +    # eco-minded people more likely to complete
  0.65 * eco       +    # eco topic attracts notably more lower-theta completers → bigger OVB
  0.55 * incentive +    # strong continuous instrument for identification
  rnorm(N_hk, 0, 1)
selected <- as.integer(sel_latent > 0)
N_sel    <- sum(selected)

# ── WTP (observed only for completers; theta is the unobserved driver) ─────────
# theta effect on WTP = 0.85: large enough to create detectable downward bias
# in naive OLS, small enough that Heckman correction is stable (doesn't overshoot).
wtp_true <- 5.0 +
  0.50 * eco[selected == 1] +      # TRUE eco-label effect = $0.50
  0.85 * theta[selected == 1] +    # theta drives WTP but is UNOBSERVED in df_hk
  rnorm(N_sel, 0, 1.0)
wtp_true <- pmax(wtp_true, 0)      # floor at 0 (no negative WTP)

# ── Analysis datasets ──────────────────────────────────────────────────────────
# df_hk:   observed data for completers (theta NOT included — it's unobserved)
# df_full: full pool used for Stage 1 probit (includes instrument)
df_hk   <- data.frame(wtp       = wtp_true,
                       eco       = eco[selected == 1],
                       incentive = incentive[selected == 1])
df_full <- data.frame(selected, eco, incentive)   # theta excluded — unobserved

# Naive OLS: ignores selection → eco coefficient is DOWNWARD BIASED because
# eco arm selects in more low-theta participants (lower average WTP)
naive <- lm(wtp ~ eco, data = df_hk)

# Heckman Stage 1: probit of completion on eco + incentive (instrument)
# Note: theta is excluded (unobserved); incentive is the identifying instrument.
# The continuous incentive creates a smooth IMR function → no collinearity.
stage1 <- glm(selected ~ eco + incentive, data = df_full,
              family = binomial(link = "probit"))

# Inverse Mills ratio (IMR) for completers — absorbs the selection-on-theta
xb  <- predict(stage1)[selected == 1]
imr <- dnorm(xb) / pnorm(xb)

# Heckman Stage 2: includes IMR to correct for selective attrition
heckman <- lm(wtp ~ eco + imr, data = cbind(df_hk, imr = imr))

cat(sprintf("Completion rate: %.1f%% overall  |  Eco: %.1f%%  |  Control: %.1f%%\n",
            100 * mean(selected),
            100 * mean(selected[eco == 1]),
            100 * mean(selected[eco == 0])))

Completion rate: 81.2% overall  |  Eco: 86.4%  |  Control: 76.0%

▶ Simulate Heckman correction for self-selection

cat(sprintf("Mean theta among completers — Eco: %.2f  |  Control: %.2f  (gap creates bias)\n",
            mean(theta[selected == 1 & eco == 1]),
            mean(theta[selected == 1 & eco == 0])))

Mean theta among completers — Eco: 0.07  |  Control: 0.16  (gap creates bias)

▶ Simulate Heckman correction for self-selection

heck_tab <- data.frame(
  Model = c("Naive OLS (ignores selection)", "Heckman 2-step (corrects attrition)"),
  `Eco-label coefficient` = c(round(coef(naive)["eco"], 3),
                               round(coef(heckman)["eco"], 3)),
  `95% CI (low)` = c(round(confint(naive)["eco", 1], 3),
                      round(confint(heckman)["eco", 1], 3)),
  `95% CI (high)` = c(round(confint(naive)["eco", 2], 3),
                       round(confint(heckman)["eco", 2], 3)),
  `True value` = 0.50,
  check.names = FALSE
)

kable(heck_tab,
      caption = paste0("Heckman correction: differential attrition creates downward bias in naive OLS. ",
                       "Eco arm attracts low-theta completers; control arm retains only high-theta ones. ",
                       "True eco-label effect = $0.50."))

Heckman correction: differential attrition creates downward bias in naive OLS. Eco arm attracts low-theta completers; control arm retains only high-theta ones. True eco-label effect = $0.50.
Model	Eco-label coefficient	95% CI (low)	95% CI (high)	True value
Naive OLS (ignores selection)	0.452	0.270	0.634	0.5
Heckman 2-step (corrects attrition)	0.474	0.285	0.663	0.5

Notice how the naive OLS estimate is downward biased: the eco condition attracted some low-theta (“curious but not eco-minded”) completers who would not have bothered with the control condition. These participants bring down the mean WTP in the eco arm, understating the true eco-label effect. The control arm retained only high-theta (genuinely motivated) completers with high baseline WTP regardless. The Heckman correction uses the IMR to account for this differential attrition and recovers an estimate closer to the true $0.50 value.

When Instruments Fail: The Bad Instrument Case

The exclusion restriction — that the instrument affects the outcome only through selection — is the most critical and least verifiable assumption in a Heckman model. It cannot be tested from the data alone. The example below shows what happens when this assumption is violated.

▶ Simulate: Heckman with a bad (invalid) instrument

# All dataset objects from the heckman-demo chunk above are already defined:
# N_hk, theta, eco, selected, df_hk, df_full, naive, heckman

# ── Bad instrument: whether the participant's friend also signed up ─────────────
# friend_signed_up is correlated with theta (eco-mindedness) because
# environmentally conscious people tend to have eco-conscious friends.
# This violates the exclusion restriction: friend_signed_up predicts selection
# (social influence), but it also correlates with theta, which directly affects
# WTP. The instrument is therefore NOT excludable from the outcome equation.
friend_signed_up <- rbinom(N_hk, 1, plogis(0.8 * theta))

# ── Stage 1 probit with the bad instrument ────────────────────────────────────
stage1_bad <- glm(selected ~ eco + friend_signed_up,
                  data = data.frame(selected, eco, friend_signed_up),
                  family = binomial(link = "probit"))

# ── IMR from the bad Stage 1 ──────────────────────────────────────────────────
# Because friend_signed_up correlates with theta, the IMR it generates is itself
# confounded: it absorbs some of theta's variance but also carries its own
# theta-contaminated signal into Stage 2, biasing the eco coefficient.
xb_bad  <- predict(stage1_bad)[selected == 1]
imr_bad <- dnorm(xb_bad) / pnorm(xb_bad)

# ── Stage 2 with bad IMR ──────────────────────────────────────────────────────
heckman_bad <- lm(wtp ~ eco + imr_bad, data = cbind(df_hk, imr_bad = imr_bad))

# ── Comparison table ──────────────────────────────────────────────────────────
bad_tab <- data.frame(
  Model = c(
    "Naive OLS (ignores selection)",
    "Heckman — good instrument (incentive)",
    "Heckman — bad instrument (friend signed up)"
  ),
  `Eco-label coefficient` = c(
    round(coef(naive)["eco"],        3),
    round(coef(heckman)["eco"],      3),
    round(coef(heckman_bad)["eco"],  3)
  ),
  `True value` = 0.50,
  check.names = FALSE
)

kable(bad_tab,
      caption = paste0("Bad instrument demo: friend_signed_up correlates with theta (eco-mindedness), ",
                       "violating the exclusion restriction. The IMR it generates is confounded, ",
                       "potentially producing an estimate worse than naive OLS. True eco effect = $0.50."))

Bad instrument demo: friend_signed_up correlates with theta (eco-mindedness), violating the exclusion restriction. The IMR it generates is confounded, potentially producing an estimate worse than naive OLS. True eco effect = $0.50.
Model	Eco-label coefficient	True value
Naive OLS (ignores selection)	0.452	0.5
Heckman — good instrument (incentive)	0.474	0.5
Heckman — bad instrument (friend signed up)	-0.291	0.5

A bad instrument can produce estimates that are worse than naive OLS — the Heckman correction amplifies bias when the exclusion restriction fails rather than correcting it. The instrument must genuinely satisfy two conditions: it must predict selection (relevance), and it must have no path to the outcome except through selection (exclusion). The second condition is untestable, which is why instrument choice demands substantive, not statistical, justification.

Censored Data Models (Tobit)

In the eco-coffee WTP study, participants could not state WTP below $1 (floor) or above $10 (ceiling). In many studies, the outcome is censored — truncated at a boundary because the measurement instrument or the context prevents observing values beyond a threshold.

Consequence of ignoring censoring: OLS on a censored outcome is biased and inconsistent. Observations piled at the ceiling (e.g., everyone stating exactly $10.00 WTP) look identical to OLS but represent unobserved heterogeneity (some would have stated $12 or $15 if allowed).

The Tobit model explicitly models the censoring threshold and moves estimates substantially closer to the latent distribution:

\[Y^* = X\beta + \varepsilon, \quad Y = \max(Y_{\text{lower}}, \min(Y^*, Y_{\text{upper}}))\]

▶ Simulate: OLS vs. Tobit on censored WTP data

if (!requireNamespace("AER", quietly = TRUE)) install.packages("AER")
library(AER)

set.seed(2026)
N_tb   <- 1200   # large N for stable Tobit estimates (Tobit MLE is less efficient than OLS)
eco_tb <- rep(0:1, each = N_tb / 2)

# ── Latent (true) WTP ─────────────────────────────────────────────────────────
# True eco-label effect = $0.90 on the latent scale.
# Mean = 1.2 (control arm) with sd = 1.8 places ~25% of latent WTPs below $0.
# High censoring creates clear OLS attenuation; N = 1200 makes Tobit converge
# reliably to 0.90 without overshoot.
wtp_lat <- 1.2 + 0.90 * eco_tb + rnorm(N_tb, 0, 1.8)

# Observed WTP: censored at $0 (participants state "I would not pay anything")
wtp_obs      <- pmax(wtp_lat, 0)
pct_censored <- mean(wtp_obs == 0)

cat(sprintf("Floor-censored observations: %d / %d  (%.1f%%)\n",
            sum(wtp_obs == 0), N_tb, 100 * pct_censored))

Floor-censored observations: 221 / 1200  (18.4%)

▶ Simulate: OLS vs. Tobit on censored WTP data

# OLS on the censored observed values: attenuated toward zero
ols_tb <- lm(wtp_obs ~ eco_tb)

# Tobit: models the latent WTP as normal, recovering the true coefficient
tobit_tb <- tobit(wtp_obs ~ eco_tb, left = 0)

cens_tab <- data.frame(
  Model    = c("OLS (ignores censoring)", "Tobit (corrects for censoring)"),
  `Eco coefficient` = c(round(coef(ols_tb)["eco_tb"], 3),
                         round(coef(tobit_tb)["eco_tb"], 3)),
  `True value` = 0.90,
  `% censored` = paste0(round(pct_censored * 100, 1), "%"),
  check.names = FALSE
)

kable(cens_tab,
      caption = paste0("OLS vs. Tobit on floor-censored WTP (true eco effect = $0.90). ",
                       "~20% floor censoring attenuates OLS; Tobit moves materially closer to the latent coefficient."))

OLS vs. Tobit on floor-censored WTP (true eco effect = $0.90). ~20% floor censoring attenuates OLS; Tobit moves materially closer to the latent coefficient.
Model	Eco coefficient	True value	% censored
OLS (ignores censoring)	0.619	0.9	18.4%
Tobit (corrects for censoring)	0.781	0.9	18.4%

The Problem: Consumers Choose to Buy Eco-Labelled Products

Outside the lab, consumers freely choose whether to pick up the eco-labelled bag of coffee. This creates selection bias. The DGP below adds two realistic complications beyond Part 1:

Non-linear confounding: environmental concern has a quadratic effect on baseline WTP and interacts with income. A consumer with extremely high env-concern sees a doubly large WTP boost. Simple linear regression will miss this.
Interaction in the propensity score: the probability of buying eco-labelled coffee depends on the joint level of env-concern and income, not just their individual levels.

These features mean that naive and simple-linear corrections will have remaining bias, while flexible and doubly-robust methods will do better.

▶ Simulate observational data with non-linear confounding

set.seed(2024)
N <- 2000
# ── YOUR DATA: replace N with nrow(your_df); the columns env, price_sens,
#    income, and age map to your pre-treatment covariates (standardise continuous
#    ones with scale() so distance-based methods work correctly)

env        <- rnorm(N, 0, 1)
price_sens <- rnorm(N, 0, 1)
income     <- rnorm(N, 0, 1)
age        <- round(rnorm(N, 38, 12))

# Non-linear outcome model: env has quadratic effect + env:income interaction
# Linear regression misses these → remaining bias after covariate adjustment
Y0 <- pmax(1, pmin(10,
  5.20 + 0.80*env - 0.50*price_sens + 0.30*income +
  0.25*env^2 + 0.18*env*income +
  rnorm(N, 0, 0.85)
))

ITE <- 0.55 + 0.65*env - 0.38*price_sens + rnorm(N, 0, 0.35)
Y1  <- pmax(1, pmin(10, Y0 + ITE))

# True PS includes a strong env:income interaction
# Standard logit (no interaction) will substantially misspecify the PS
ps_true   <- plogis(-0.55 + 0.95*env + 0.42*income - 0.32*price_sens + 0.55*env*income)
treat_obs <- rbinom(N, 1, ps_true)
Y_obs     <- ifelse(treat_obs==1, Y1, Y0)

# Pre-compute nonlinear features for use in flexible models
env_sq      <- env^2
env_income  <- env * income

df_obs <- tibble(id=1:N, treat=treat_obs, Y_obs, Y0, Y1, ITE,
                 env, price_sens, income, age, env_sq, env_income, ps_true)
# ── YOUR DATA: df_obs should be your real data frame with columns:
#    treat (0/1 treatment indicator), Y_obs (observed outcome),
#    and all pre-treatment covariates. Drop Y0, Y1, ITE, ps_true —
#    those are unobservable in real data; they exist here only for
#    ground-truth evaluation.

ATE_true_obs <- mean(Y1 - Y0)
ATT_true_obs <- mean(ITE[treat_obs==1])
cat(sprintf("True ATE = %.3f  |  True ATT = %.3f  |  Treatment rate = %.1f%%\n",
            ATE_true_obs, ATT_true_obs, 100*mean(treat_obs)))

True ATE = 0.530  |  True ATT = 0.910  |  Treatment rate = 38.7%

▶ Simulate observational data with non-linear confounding

# ── CHECK: treatment rate far from 50% (< 10% or > 90%) signals severe overlap
#    problems; near-zero or near-1 PS values will produce extreme IPW weights.

▶ Show why naive comparison overestimates the eco-label effect

naive_diff <- mean(Y_obs[treat_obs==1]) - mean(Y_obs[treat_obs==0])
cat(sprintf("True ATE             = %.3f\nNaive diff-in-means  = %.3f\nBias                 = +%.3f\n",
            ATE_true_obs, naive_diff, naive_diff - ATE_true_obs))

True ATE             = 0.530
Naive diff-in-means  = 1.700
Bias                 = +1.170

▶ Show why naive comparison overestimates the eco-label effect

bind_rows(
  tibble(group="Eco-label buyers",     env=env[treat_obs==1]),
  tibble(group="Non-eco-label buyers", env=env[treat_obs==0])
) |>
  ggplot(aes(x=env, fill=group)) +
  geom_density(alpha=0.55) +
  scale_fill_manual(values=c("Eco-label buyers"=clr_eco,"Non-eco-label buyers"=clr_ctrl)) +
  labs(x="Environmental Concern (standardised)", y="Density",
       title="Eco-label buyers have higher environmental concern AND higher baseline WTP",
       subtitle="Positive selection on a cause of the outcome inflates the naive comparison",
       fill=NULL) +
  theme_mod3()

Regression Adjustment

What it does: Controls for confounders in the outcome regression, comparing eco-label buyers and non-buyers who are similar on measured covariates.

Intuition: Among consumers with the same env-concern, income, and price-sensitivity, any remaining WTP difference is more plausibly causal. But if the true relationship between covariates and WTP is non-linear, a linear regression will not fully remove the confounding.

▶ Linear vs. flexible regression adjustment

# ── YOUR DATA: replace Y_obs with your outcome variable and treat with your
#    treatment indicator; add your own covariates in place of env, price_sens,
#    income. Flexible adjustment (m_flex) adds squared and interaction terms —
#    include these whenever you suspect non-linear covariate effects.
m_naive   <- lm(Y_obs ~ treat, data=df_obs)
m_linear  <- lm(Y_obs ~ treat + env + price_sens + income, data=df_obs)
m_flex    <- lm(Y_obs ~ treat + env + price_sens + income + env_sq + env_income, data=df_obs)
# ── KEY ARGS: the formula after ~ lists covariates to adjust for; env_sq and
#    env_income are pre-computed non-linear terms — create analogous columns in
#    your data frame if you suspect curvilinear or interactive confounding.

bind_rows(
  tidy(m_naive)  |> mutate(model="1. Unadjusted"),
  tidy(m_linear) |> mutate(model="2. Linear adjustment"),
  tidy(m_flex)   |> mutate(model="3. Flexible adjustment\n(+ env² + env×income)")
) |>
  filter(term=="treat") |>
  transmute(Model=model, Estimate=round(estimate,3), SE=round(std.error,3),
            `95% CI`=sprintf("[%.3f, %.3f]",estimate-1.96*std.error,estimate+1.96*std.error)) |>
  add_row(Model="True ATE", Estimate=round(ATE_true_obs,3), SE=NA, `95% CI`="—") |>
  knitr::kable(caption="Flexible regression (with non-linear terms) gets closer to the true ATE than linear")

Flexible regression (with non-linear terms) gets closer to the true ATE than linear
Model	Estimate	SE	95% CI
1. Unadjusted	1.700	0.069	[1.564, 1.836]
2. Linear adjustment	0.671	0.052	[0.569, 0.773]
3. Flexible adjustment
(+ env² + env×income)	0.557	0.048	[0.462, 0.651]
True ATE	0.530	NA	—

Tip

The flexible model including env² and env × income tracks the true ATE noticeably better than the linear adjustment. The lesson: the form of your covariate adjustment matters. If you don’t know the true functional form (you never do in practice), doubly robust estimation provides insurance.

Inverse Probability Weighting (IPW)

What it does: Reweights each observation by the inverse of its probability of being in its actual group, creating a pseudo-population where eco-label buyers and non-buyers look like random draws from the same population.

Concrete weights: - An eco-conscious, high-income consumer (PS = 0.85) who buys eco-labelled coffee: weight = $1/0.85 = 1.18$ — not surprising, downweighted. - A price-sensitive, low-income consumer (PS = 0.08) who buys eco-labelled coffee: weight = $1/0.08 = 12.5$ — very surprising, heavily upweighted.

Entropy balancing (via WeightIt) takes this further: instead of estimating a parametric PS model, it directly finds weights that exactly balance the moments (means, variances, and cross-products) of the covariate distribution between groups. No model to misspecify.

▶ Standard IPW vs. entropy-balanced IPW

# Standard IPW: logistic regression on linear terms (misses interaction)
# ── YOUR DATA: replace treat ~ env + price_sens + income with your treatment
#    variable and pre-treatment covariates. Use the same covariate set throughout
#    all methods so comparisons are meaningful.
ps_logit <- glm(treat ~ env + price_sens + income, data=df_obs, family=binomial)
ps_hat   <- fitted(ps_logit)

df_obs <- df_obs |>
  mutate(ps_hat  = ps_hat,
         ipw_wt  = ifelse(treat==1, 1/ps_hat, 1/(1-ps_hat)),
         # Stabilised weights (reduces variance from extreme PS values)
         ipw_stab= ifelse(treat==1, mean(treat)/ps_hat, mean(1-treat)/(1-ps_hat)))

ATE_ipw_raw   <- with(df_obs,
  weighted.mean(Y_obs[treat==1], ipw_wt[treat==1]) -
  weighted.mean(Y_obs[treat==0], ipw_wt[treat==0]))
ATE_ipw_stab  <- lm(Y_obs ~ treat, data=df_obs, weights=ipw_stab) |>
  tidy() |> filter(term=="treat") |> pull(estimate)
# ── CHECK: inspect summary(df_obs$ipw_wt) — any weights > 20–30 signal near-
#    zero or near-one PS values (positivity violation); use stabilised weights
#    or entropy balancing instead to reduce variance inflation.

# Entropy balancing: directly balance means + variances + covariances
# Includes quadratic and interaction terms for thorough moment matching
# ── KEY ARGS: method="ebal" finds minimum-variance weights that exactly match
#    covariate moments; estimand="ATE" targets the population ATE (use "ATT" if
#    you only care about the effect among treated units).
wb_ebal <- weightit(treat ~ env + price_sens + income + env_sq + env_income,
                    data=df_obs, method="ebal", estimand="ATE")
ATE_ebal <- lm(Y_obs ~ treat, data=df_obs, weights=wb_ebal$weights) |>
  tidy() |> filter(term=="treat") |> pull(estimate)
# ── CHECK: run summary(wb_ebal) to see effective sample size (ESS) after
#    weighting — large weight loss (ESS much smaller than N) means poor overlap.

cat(sprintf(
  "True ATE              = %.3f\nIPW (standard logit)  = %.3f\nIPW (stabilised)      = %.3f\nEntropy balancing     = %.3f\n",
  ATE_true_obs, ATE_ipw_raw, ATE_ipw_stab, ATE_ebal
))

True ATE              = 0.530
IPW (standard logit)  = 0.390
IPW (stabilised)      = 0.390
Entropy balancing     = 0.437

▶ Propensity score overlap and entropy-balanced weight distribution

p_overlap <- df_obs |>
  mutate(group=ifelse(treat==1,"Eco-label buyers","Non-buyers")) |>
  ggplot(aes(x=ps_hat, fill=group)) +
  geom_histogram(bins=40, alpha=0.65, position="identity") +
  geom_vline(xintercept=c(0.05,0.95), linetype="dashed", colour="firebrick") +
  scale_fill_manual(values=c("Eco-label buyers"=clr_eco,"Non-buyers"=clr_ctrl)) +
  labs(x="Estimated PS", y="Count", title="PS overlap", fill=NULL,
       subtitle="Red lines = extreme weights zone") +
  theme_mod3()

p_wts <- tibble(wt=wb_ebal$weights, group=ifelse(treat_obs==1,"Treated","Control")) |>
  ggplot(aes(x=wt, fill=group)) +
  geom_histogram(bins=40, alpha=0.7, position="identity") +
  scale_fill_manual(values=c("Treated"=clr_eco,"Control"=clr_ctrl)) +
  labs(x="Entropy-balanced weight", y="Count",
       title="Entropy-balanced weights",
       subtitle="Well-behaved: no extreme spikes", fill=NULL) +
  theme_mod3()

p_overlap + p_wts

IPW can be unstable with extreme propensity scores

Unstabilised IPW weights become enormous near PS = 0 or 1. Entropy balancing avoids this by finding the minimum-variance weights that achieve balance — it never requires extreme weights to correct for outlier PS values.

Covariate Matching (Mahalanobis)

What it does: For each eco-label buyer, finds the most similar non-buyer based on raw covariate values (Mahalanobis distance accounts for covariate correlations). The matched sample mimics a randomised experiment — but only for the treated group (estimates ATT, not ATE).

▶ 1:1 Mahalanobis nearest-neighbour matching

# ── YOUR DATA: replace df_obs with your data frame; treat ~ env + price_sens +
#    income should list your pre-treatment covariates (continuous or binary).
#    Mahalanobis distance accounts for correlations between covariates —
#    useful when covariates are on very different scales.
# ── KEY ARGS: method="nearest" does greedy nearest-neighbour matching;
#    distance="mahalanobis" can be changed to "gower" (handles mixed types) or
#    replaced by a propensity-score distance (see psm-comparison chunk);
#    ratio=1 means 1 control matched per treated unit — increase (e.g., ratio=2)
#    to gain precision at the cost of match quality.
m_match_maha <- matchit(treat ~ env + price_sens + income, data=df_obs,
                        method="nearest", distance="mahalanobis", ratio=1)
df_matched_maha <- match.data(m_match_maha)
ATT_match_maha  <- lm(Y_obs ~ treat, data=df_matched_maha, weights=weights) |>
  tidy() |> filter(term=="treat") |> pull(estimate)
# ── CHECK: run summary(m_match_maha) and inspect SMD (Std. Mean Difference)
#    for each covariate — SMD < 0.1 means good balance; large SMD means the
#    matched sample still has systematic differences on that covariate.

cat(sprintf("True ATT (treated group)              = %.3f\nMatching (Mahalanobis, 1:1 NN) est.   = %.3f\n",
            ATT_true_obs, ATT_match_maha))

True ATT (treated group)              = 0.910
Matching (Mahalanobis, 1:1 NN) est.   = 1.358

Propensity Score Matching (PSM)

The core idea is elegant: if we knew every consumer’s probability of buying the eco-labelled product given their characteristics, we could match eco-label buyers to non-buyers who had the same probability but chose differently. Within such matched pairs, the remaining outcome difference is plausibly causal — because the two consumers were equally likely to buy, any remaining WTP difference cannot be attributed to pre-existing differences in characteristics. Rosenbaum and Rubin (1983) proved that the propensity score $e(X) = P(D=1 \mid X)$ is a balancing score: conditioning on it is sufficient to remove all confounding from measured covariates, just as conditioning on $X$ itself would be.

Why it is harder than it looks: The PS is unknown and must be estimated. Every choice in that estimation — and in the matching procedure that follows — encodes assumptions and trade-offs. PSM is not a single method; it is a family of methods defined by a cascade of researcher decisions. Getting any one of them wrong can produce estimates that are worse than a simple regression adjustment.

Below we walk through the eight key decisions in roughly the order a researcher encounters them.

Decision 1: Which variables belong in the PS model?

This is the most consequential choice and it is almost entirely driven by theory, not statistics.

Include all pre-treatment variables that simultaneously predict treatment assignment and the outcome — i.e., confounders. In our coffee setting that means environmental concern, income, and price sensitivity. More is generally safer here: omitting a true confounder biases the ATT; including an irrelevant covariate only inflates variance slightly.

Do not include any of the following — including them creates problems that are harder to fix than omitting a confounder:

Post-treatment variables (anything that could have been affected by the treatment) — conditioning on them opens a collider path and introduces bias.
Pure instruments — variables that predict treatment but have no direct effect on the outcome (e.g., a randomised nudge used in a separate substudy). Including instruments in the PS deflates the denominator for high-PS units without reducing confounding, producing inflated variance and sometimes severe numerical instability.
Colliders — variables caused by both the treatment and an unmeasured confounder. Including them, again, opens a backdoor path. The DAG from Part 1 is your guide.

The fundamental rule of variable selection for PSM

A variable should enter the PS model if and only if it is a pre-treatment cause of treatment assignment — and you are not certain it is causally downstream of the treatment. When in doubt about a variable’s position in the DAG, include it; the cost of accidentally including a near-irrelevant covariate (slightly inflated variance) is lower than the cost of omitting a true confounder (bias that is undetectable from the data).

A Module 1 reminder: including a covariate is only as good as its measurement. A construct measured with poor reliability or low construct validity (Module 1 — HTMT, CFA fit, convergent validity) acts partly as a noise variable. Conditioning on a noisy proxy for the true confounder leaves residual bias proportional to the measurement error. Perfect adjustment for a badly measured covariate is not possible; improving measurement quality at the design stage is more valuable than any estimation trick applied after the fact.

Decision 2: How to estimate the PS?

Given the covariate set, you still need a model. Common choices:

Estimator	Strengths	Weaknesses
Logistic regression (logit)	Interpretable; stable with moderate N; well-understood	Relies on correct functional form specification
Probit	Similar to logit; sometimes preferred when treatment has natural latent index interpretation	Nearly identical to logit in practice for balanced PS
Logit + polynomials/interactions	Captures non-linearities within a parametric framework	Still requires knowing which terms to add; researcher degrees of freedom
GAM / splines	Automatically flexes to non-linear marginal relationships	Does not automatically model interactions; can overfit
Gradient boosting (GBM)	Highly flexible; handles interactions and non-linearities automatically	Can overfit in small samples; optimised for prediction, not balance — may not produce a well-calibrated PS
Random forest	Same strengths and weaknesses as GBM	Tends to shrink PS estimates toward the mean, improving overlap but sometimes underfit

The practitioner’s honest answer: A carefully specified logit (with theory-guided interaction and polynomial terms) often outperforms machine-learning approaches in the moderate-N settings typical of academic research, because ML models are optimised for prediction accuracy, not for covariate balance. If you do use ML for PS estimation, use a targeted learner (e.g., from the SuperLearner or tmle packages) that is designed to minimise bias in the causal effect estimate, not just cross-validated AUC.

Regardless of the estimator, always check: (a) overlap histograms (both groups should have support across the full PS range), and (b) standardised mean differences after matching (see Decision 7).

Decision 3: Nearest-neighbour greedy vs. optimal vs. subclassification

Once you have PS estimates in hand, how do you form the matched pairs?

Greedy nearest-neighbour (NN) is the default in most software. Each treated unit is paired with the closest available control, sequentially. It is fast and intuitive, but “closest available” deteriorates as the pool of unused controls shrinks — units matched later may be poor matches, and the order of processing affects the result.

Optimal matching solves a global assignment problem: it finds the set of pairs that minimises total within-pair distance across all treated units simultaneously. This produces better average balance than greedy NN at no extra data cost, but requires more computation and the optmatch package in R.

Subclassification divides the estimated PS range into K strata and compares treated and control units within each stratum. Every observation is retained — no data is ever discarded — and the ATT is estimated by weighting each stratum’s estimate by the number of treated units it contains. Subclassification is straightforward to implement and interpret, and it illustrates how PS-based adjustment can proceed without any pair-matching at all.

Coarsened exact matching (CEM) temporarily coarsens each covariate into bins, performs exact matching within those coarsened cells, then uses the original values for outcome estimation. It gives iron-clad balance on each covariate (within the chosen bin width) and does not rely on a PS model at all. The cost is that many treated units may be discarded if no control falls in the same coarsened cell.

Decision 4: How many controls per treated unit (match ratio)?

With 1:1 matching each treated unit gets exactly one control: best match quality, smallest effective sample size, widest confidence intervals. With 1:k matching each treated unit gets k controls: more data, narrower CIs, but the 2nd through kth controls are progressively worse matches, introducing more bias with each step.

The bias–variance trade-off is roughly:

k = 1: Lowest bias, highest variance
k = 3–5: Moderate gains in variance, modest increase in bias — often a good practical compromise when controls outnumber treated units
k → ∞: Converges to a weighting estimator (IPW), where every control unit eventually gets used

There is no universally optimal k. A useful heuristic: compute the ATT estimate and its SE at k = 1, 2, 3, 5 and check how sensitive the estimate is. If estimates barely move but SEs shrink substantially, increasing k is clearly worthwhile. If estimates shift as k grows, the additional matches are introducing bias from poor-quality pairs.

Decision 5: Matching with or without replacement?

Without replacement (the default): once a control unit is matched to a treated unit, it is removed from the pool. This ensures each control is used only once, preserving a clean comparison sample. The downside: treated units matched later get lower-quality matches from the diminishing pool.

With replacement: a control can serve as the match for multiple treated units. This dramatically improves match quality — every treated unit always gets its best available match — but introduces correlation across matched pairs (the same control appears in multiple pairs) that naive standard error formulas ignore. With-replacement matching requires a correction: use the Abadie–Imbens (2016) heteroskedasticity-robust SE or a bootstrap clustered on matched pairs. Failing to do so typically understates uncertainty by a factor of 2 or more.

Standard errors after PSM are not what OLS reports

After matching, the treated and control units in the matched sample are not independent: they were selected based on similarity. A naive lm() on the matched sample treats them as independent, producing overconfident SEs. Preferred alternatives: (1) use estimatr::lm_robust() with clustered SEs by subclass/pair ID; (2) bootstrap the entire pipeline (PS estimation + matching + outcome model) as a block; (3) use MatchIt’s built-in variance estimation.

Decision 6: Should you use a caliper?

A caliper discards any matched pair whose PS difference exceeds a threshold. Without a caliper, greedy NN will always make a match — even if the nearest available control differs by 0.4 in PS. Such poor matches inflate bias more than they help.

Cochran and Rubin’s rule of thumb: caliper width = 0.2 × SD(logit-PS). This is widely used and performs well in simulations across many settings.

The trade-off: a tight caliper removes the worst matches and improves average match quality, but it also drops treated units for whom no good control exists. This changes the estimand — you are now estimating the ATT only among units in the region of common support. Whether this is acceptable depends on whether you care about the full treated population or only the “matchable” subset. Always report how many treated units were dropped by the caliper.

Decision 7: How do you check whether matching worked?

The answer is NOT a significance test. Running a t-test comparing means before vs. after matching is almost universally misleading: with a large matched sample, even tiny imbalances will be statistically significant; with a small matched sample, large imbalances may not be. Balance is a property of the sample, not a null hypothesis test.

The correct tool is the standardised mean difference (SMD): \[ \text{SMD} = \frac{\bar{X}_{\text{treated}} - \bar{X}_{\text{control}}}{\text{SD}_{\text{pre-matching}}} \]

A common threshold is |SMD| < 0.1. Check SMD for:

Every covariate in the PS model
Squared terms and interaction terms — even if main effects are balanced, a poor PS model can leave interactions badly imbalanced
Variables not in the PS model that might still confound the outcome

A Love plot (from the cobalt package) visualises SMDs before and after matching for all covariates at once and is the standard balance diagnostic in applied work.

Decision 8: What do you do after matching?

Once you have a matched sample, you face one more choice: outcome analysis.

Simple mean difference on the matched sample is unbiased if matching achieved perfect balance (SMD = 0 everywhere). In practice balance is imperfect, so this leaves some residual confounding.

Regression adjustment on the matched sample — regressing the outcome on treatment and the covariates — provides additional bias correction (“double adjustment” or “bias-corrected matching”). This is almost always a good idea. The key point: the regression is doing local covariate adjustment within the already-balanced matched sample, not global extrapolation like in a naive regression. Abadie and Imbens (2011) show this approach is semiparametrically efficient.

What not to do: do not re-run the regression on the full unmatched sample after matching; do not use matching-adjusted weights with outcome models that do not account for them.

Putting it together: PS model specification matters most

The chart below demonstrates that getting the PS model specification right is the single most consequential decision. To isolate the effect of specification quality alone, no calipers are used here — calipers are a separate design choice (covered in the matching design section below) that drops different units across models and muddies the comparison. Three models are compared: one that simply misses the key interaction, one that adds an interaction but on the wrong covariate pair, and one correctly specified model.

▶ Compare misspecified vs. correctly specified PS models (no caliper)

# ── No calipers in this chunk — we want PS model quality alone to drive
#    differences. Calipers drop different units across models and conflate
#    specification quality with estimand shifts. For comparing specifications,
#    match all treated units and let the PS estimate quality determine bias.
# ── YOUR DATA: replace treat ~ ... with your treatment indicator and
#    pre-treatment covariates. Vary only the formula; keep method="nearest"
#    and no caliper so that specification quality is the sole variable.

# ── Misspecified 1: main effects only — misses the env×income interaction
#    entirely. High env + high income consumers get the same predicted PS as
#    high env + low income consumers, creating systematically bad matches.
m_ps_mis1 <- matchit(treat ~ env + price_sens + income,
                     data=df_obs, method="nearest", distance="logit")

# ── Misspecified 2: adds an interaction, but on the WRONG pair (env×price_sens
#    instead of env×income). Price sensitivity has a negative effect on PS
#    (-0.32), so env×price_sens pulls predictions in the wrong direction for
#    highly eco-conscious but price-sensitive consumers — biased differently
#    from model 1 but still structurally wrong.
m_ps_mis2 <- matchit(treat ~ env + price_sens + income + env:price_sens,
                     data=df_obs, method="nearest", distance="logit")

# ── Correctly specified: includes env×income matching the true PS DGP.
#    Note: env² appears in the *outcome* model (Y0), NOT in the true PS —
#    adding env_sq to the PS formula would distort it (both high and low env
#    get inflated scores, producing non-monotonic PS and bad matches).
#    All treated units are matched; the quality of each match is high because
#    the PS correctly ranks who is likely to buy eco-labelled coffee.
m_ps_cor  <- matchit(treat ~ env + price_sens + income + env_income,
                     data=df_obs, method="nearest", distance="logit")

results_psm_spec <- map_dfr(
  list("1. Logit — main effects only\n(misspecified)"               = m_ps_mis1,
       "2. Logit + env×price_sens\n(wrong interaction)"             = m_ps_mis2,
       "3. Logit + env×income\n(correctly specified)"               = m_ps_cor),
  function(m) {
    d   <- match.data(m)
    est <- lm(Y_obs ~ treat, data=d, weights=weights) |>
      tidy() |> filter(term=="treat")
    tibble(estimate=est$estimate, se=est$std.error,
           n_matched=nrow(d[d$treat==1,]))
  }, .id="PS model"
)

results_psm_spec |>
  mutate(lo=estimate-1.96*se, hi=estimate+1.96*se,
         `N retained`=n_matched,
         correct = str_detect(`PS model`, "correctly")) |>
  ggplot(aes(y=`PS model`, x=estimate, xmin=lo, xmax=hi,
             colour=correct, shape=correct)) +
  geom_vline(xintercept=ATT_true_obs, linetype="dashed", colour=clr_eco,  linewidth=1) +
  geom_vline(xintercept=ATE_true_obs, linetype="dotted", colour=clr_ctrl, linewidth=1) +
  geom_pointrange(size=0.9) +
  geom_text(aes(label=sprintf("N=%d", `N retained`), x=hi+0.04), hjust=0, size=3.2) +
  annotate("text", x=ATT_true_obs+0.015, y=3.45,
           label="True ATT", hjust=0, size=3.2, colour=clr_eco) +
  annotate("text", x=ATE_true_obs+0.015, y=3.1,
           label="True ATE", hjust=0, size=3.2, colour=clr_ctrl) +
  scale_colour_manual(values=c("FALSE"="grey50","TRUE"="#b45309"),
                      labels=c("Misspecified PS","Correctly specified PS")) +
  scale_shape_manual(values=c("FALSE"=16,"TRUE"=18),
                     labels=c("Misspecified PS","Correctly specified PS")) +
  labs(x="ATT estimate — eco-label WTP premium ($)", y=NULL,
       colour=NULL, shape=NULL,
       title="PS model specification determines whether matching recovers the true ATT",
       subtitle="No calipers: all treated units matched — differences driven solely by PS model quality") +
  theme_mod3()

The two misspecified models land in different places but both miss the true ATT, for distinct structural reasons:

Model 1 (main effects only): Treats high-env + high-income consumers identically to high-env + low-income consumers in PS space. These two groups have very different true treatment probabilities (because of the env×income interaction), so the model matches high-env + high-income treated units to controls who share the same env level but not the same income. The matched controls have systematically lower true PS — and lower baseline WTP — overstating the eco-label effect.
Model 2 (wrong interaction — env×price_sens): Adds interaction complexity but on the wrong pairing. Since price sensitivity has a negative PS coefficient (−0.32), the env×price_sens term suppresses PS estimates for high-env + price-sensitive consumers — a group that exists but is not the structural driver of selection. The model distorts the PS surface in the wrong direction, producing a different pattern of bad matches.
Model 3 (correct): The env×income term correctly captures who is most likely to buy eco-labelled coffee (high env and high income), so matched controls are genuinely comparable to treated units across both dimensions. The ATT estimate lands near the true ATT.

Why adding a caliper to misspecified models does not fix the problem

A caliper prevents the worst individual matches (pairs that differ by more than X SDs in estimated PS), but it cannot fix the underlying structural error. With a misspecified PS, “close in estimated PS space” does not mean “close in true PS space.” The caliper removes some outlier pairs but the systematic directional bias from the wrong functional form remains — and an added side-effect is that the caliper drops different sets of treated units across models, making a clean apples-to-apples comparison impossible.

Matching can look balanced and still be biased

A matched sample from the misspecified logit will show excellent marginal balance — similar means on env, income, and price_sens across groups. But env × income (not explicitly checked) will still be badly imbalanced. Always inspect interaction terms and polynomial terms in your Love plot, not just the main effects you included in the PS model.

Matching design trade-offs: 1:1 vs. 1:k vs. subclassification

With the correctly-specified PS in hand, the remaining design choices — match ratio and procedure — trade bias against variance. All five approaches below use the same PS formula; what differs is how the matched sample is constructed.

▶ Compare 1:1, 1:3, 1:5, with-replacement, and subclassification (correct PS)

# ── All approaches use the correctly-specified PS (env×income interaction;
#    note env² is in the outcome model, NOT in the true PS).
#    The variation illustrates the bias–variance trade-off in matching design,
#    not PS misspecification.
# ── YOUR DATA: keep the same PS formula throughout; change method=, ratio=,
#    and replace= to explore the design space for your application.

# 1:1 nearest-neighbour without replacement (baseline)
m_11  <- matchit(treat ~ env + price_sens + income + env_income,
                 data=df_obs, method="nearest", distance="logit",
                 ratio=1, replace=FALSE, caliper=0.15, std.caliper=TRUE)

# 1:3 nearest-neighbour without replacement (more controls, wider caliper tolerated)
m_13  <- matchit(treat ~ env + price_sens + income + env_income,
                 data=df_obs, method="nearest", distance="logit",
                 ratio=3, replace=FALSE, caliper=0.15, std.caliper=TRUE)

# 1:5 nearest-neighbour without replacement
m_15  <- matchit(treat ~ env + price_sens + income + env_income,
                 data=df_obs, method="nearest", distance="logit",
                 ratio=5, replace=FALSE, caliper=0.15, std.caliper=TRUE)

# 1:1 WITH replacement — each control can serve as best match for multiple
# treated units; improves per-pair quality but inflates SE if uncorrected
# ── NOTE: SEs after with-replacement matching require robust/clustered
#    corrections (Abadie-Imbens); the naive lm() SE used here is illustrative.
m_wr  <- matchit(treat ~ env + price_sens + income + env_income,
                 data=df_obs, method="nearest", distance="logit",
                 ratio=1, replace=TRUE, caliper=0.15, std.caliper=TRUE)

# Subclassification — PS range divided into K subclasses; all units retained
# Estimates ATT by weighting subclass estimates by treated-unit counts
m_sub <- matchit(treat ~ env + price_sens + income + env_income,
                 data=df_obs, method="subclass", distance="logit",
                 subclass=5)

results_psm_approach <- map_dfr(
  list("1:1 without replacement\n(caliper 0.15 SD)"  = m_11,
       "1:3 without replacement\n(caliper 0.15 SD)"  = m_13,
       "1:5 without replacement\n(caliper 0.15 SD)"  = m_15,
       "1:1 with replacement\n(caliper 0.15 SD)"     = m_wr,
       "Subclassification\n(5 PS strata)"            = m_sub),
  function(m) {
    d   <- match.data(m)
    est <- lm(Y_obs ~ treat, data=d, weights=weights) |>
      tidy() |> filter(term=="treat")
    tibble(estimate=est$estimate, se=est$std.error,
           n_matched=sum(d$treat==1))
  }, .id="Approach"
)

results_psm_approach |>
  mutate(lo=estimate-1.96*se, hi=estimate+1.96*se,
         `N treated`=n_matched,
         Approach=fct_inorder(Approach)) |>
  ggplot(aes(y=Approach, x=estimate, xmin=lo, xmax=hi)) +
  geom_vline(xintercept=ATT_true_obs, linetype="dashed", colour=clr_eco, linewidth=1) +
  geom_pointrange(colour="#16a085", size=0.85) +
  geom_text(aes(label=sprintf("N treated=%d", `N treated`), x=hi+0.04),
            hjust=0, size=3.1) +
  annotate("text", x=ATT_true_obs+0.015, y=5.45,
           label="True ATT", hjust=0, size=3.3, colour=clr_eco) +
  labs(x="ATT estimate — eco-label WTP premium ($)", y=NULL,
       title="All five approaches use the correctly specified PS — what varies is the matching design",
       subtitle="With-replacement and subclassification use more data; 1:1 without replacement keeps the cleanest pairs") +
  theme_mod3()

The key takeaway: once the PS is correctly specified, the matching design choices affect precision more than they affect bias. With-replacement matching and subclassification typically use more of the data and therefore yield narrower confidence intervals, but they require more care in outcome analysis (correct SE formulas for with-replacement; subclass weights for subclassification). For exploratory work, 1:1 without replacement plus a caliper is the cleanest starting point.

▶ Love plot: balance before and after correctly-specified PSM

# ── Love plots are the standard diagnostic for PSM balance. Look for:
#    (1) All post-matching SMDs inside the |0.1| threshold band.
#    (2) Interaction and squared terms balanced — not just main effects.
#    (3) No covariate with pre-matching SMD > 0.5; large pre-matching
#        imbalance on a key covariate means matching will struggle.
love.plot(m_ps_cor,
          thresholds = c(m = 0.1),
          stars      = "std",
          var.order  = "unadjusted",
          abs        = TRUE,
          title = "Covariate balance: correctly-specified PSM (1:1 NN, env² + env×income + caliper)",
          colors = c(clr_ctrl, clr_eco))

Doubly Robust Estimation

What it does: Combines an outcome model and a propensity score model. The estimate is consistent if either model is correctly specified — it only needs one to be right.

Here both the outcome model and PS model include the non-linear terms (env², env × income), so both are correctly specified. The DR estimate should perform best.

▶ AIPW with correctly-specified outcome and PS models

# ── YOUR DATA: replace Y_obs, treat, and the covariate list with your own
#    variable names. Fit mu1_model on treated rows only, mu0_model on control
#    rows only — these are separate outcome regressions for each arm.
# ── KEY ARGS: include all covariates you believe affect the outcome in the
#    outcome models (mu1/mu0) AND all covariates that predict treatment in the
#    PS model. Adding non-linear terms (env_sq, env_income) reduces bias when
#    the true relationship is non-linear. The AIPW formula is consistent if
#    EITHER the outcome model OR the PS model is correctly specified.
# Outcome models (correctly specified: include env^2 and env:income)
mu1_model <- lm(Y_obs ~ env + price_sens + income + env_sq + env_income,
                data=df_obs[df_obs$treat==1,])
mu0_model <- lm(Y_obs ~ env + price_sens + income + env_sq + env_income,
                data=df_obs[df_obs$treat==0,])
mu1_hat   <- predict(mu1_model, newdata=df_obs)
mu0_hat   <- predict(mu0_model, newdata=df_obs)

# PS model (correctly specified)
ps_hat_dr <- fitted(glm(treat ~ env + price_sens + income + env_sq + env_income,
                        data=df_obs, family=binomial))

# AIPW estimator
Y_v <- df_obs$Y_obs; D_v <- df_obs$treat
tau_dr <- (mu1_hat - mu0_hat) +
          D_v*(Y_v - mu1_hat)/ps_hat_dr -
          (1-D_v)*(Y_v - mu0_hat)/(1-ps_hat_dr)

ATE_dr <- mean(tau_dr)
SE_dr  <- sd(tau_dr)/sqrt(N)
# ── CHECK: compare ATE_dr to IPW and regression adjustment estimates —
#    if DR diverges sharply from both, a model may be badly misspecified.
#    For inference with real data use a sandwich SE or bootstrap rather than
#    the naive sd(tau_dr)/sqrt(N) used here for illustration.

cat(sprintf(
  "True ATE              = %.3f\nNaive estimate        = %.3f\nLinear reg. adj.      = %.3f\nFlexible reg. adj.    = %.3f\nIPW (stabilised)      = %.3f\nEntropy balancing     = %.3f\nDR/AIPW (correct)     = %.3f  (SE = %.3f)\n",
  ATE_true_obs, naive_diff,
  coef(m_linear)["treat"], coef(m_flex)["treat"],
  ATE_ipw_stab, ATE_ebal,
  ATE_dr, SE_dr
))

True ATE              = 0.530
Naive estimate        = 1.700
Linear reg. adj.      = 0.671
Flexible reg. adj.    = 0.557
IPW (stabilised)      = 0.390
Entropy balancing     = 0.437
DR/AIPW (correct)     = 0.428  (SE = 0.060)

How Well Do All Section 2 Methods Recover the True ATE?

What the gradient shows

Method class	Why it performs that way
Naive	No correction — pure selection bias
Linear regression	Removes linear confounding; missing `env²` and `env×income` leaves residual bias
Flexible regression	Includes non-linear terms → less bias
IPW (stabilised)	PS model also misspecified (linear logit) → some remaining bias
Entropy balancing	Directly balances covariate moments — robust to PS misspecification
Matching/PSM	Targets ATT (higher than ATE because treated have stronger responses)
Doubly Robust	Both outcome model AND PS model correctly specified → lowest bias

The doubly robust and entropy balancing approaches consistently outperform simpler methods when the true confounding is non-linear. The correctly specified PSM recovers the ATT accurately; the misspecified PSM does less well.

Synthetic Controls

Synthetic controls: validity depends on an unverifiable assumption

A synthetic control constructs a weighted combination of untreated units to mimic the counterfactual for the treated unit. The approach is transparent and often produces compelling-looking pre-period fits — but there is no way to verify that the synthetic world it generates is a sufficiently representative stand-in for the real counterfactual post-treatment. Synthetic controls assume that the same weighting that matched pre-treatment trends would have continued to match post-treatment trends in the absence of treatment. This is an untestable assumption, and deviations — due to unobserved shocks, compositional changes, or structural breaks — will bias the estimated treatment effect without any diagnostic flag.

Use synthetic controls when you have a single treated unit, a long pre-treatment window, and a credible pool of potential donor units. But always report sensitivity analyses varying the donor pool and be explicit about what would have to be true for the synthetic counterfactual to be valid.

What if you only have one treated unit — say, AlterEco’s flagship Rotterdam store introduced eco-labels?

▶ Synthetic control for Rotterdam flagship store

set.seed(2024)
n_stores <- 21; n_pre <- 20; n_post <- 10; treat_store <- 1
store_fe <- rnorm(n_stores, 0, 0.8)

panel_df <- expand.grid(store=1:n_stores, period=1:(n_pre+n_post)) |>
  as_tibble() |>
  mutate(treated=(store==treat_store & period>n_pre),
         WTP=5.5 + store_fe[store] + 0.02*(period-n_pre) +
             treated*0.80 + rnorm(n(), 0, 0.40))

pre_means <- panel_df |> filter(period<=n_pre) |>
  group_by(store) |> summarise(pre_mean=mean(WTP), .groups="drop")
treated_pre_mean <- pre_means$pre_mean[pre_means$store==treat_store]

control_stores <- pre_means |> filter(store!=treat_store) |>
  mutate(dist=abs(pre_mean-treated_pre_mean), w_raw=1/(dist+0.01), w=w_raw/sum(w_raw))

synth_ts <- panel_df |> filter(store!=treat_store) |>
  left_join(control_stores[,c("store","w")], by="store") |>
  group_by(period) |> summarise(WTP_synth=weighted.mean(WTP,w), .groups="drop") |>
  mutate(series="Synthetic Rotterdam")

treated_ts <- panel_df |> filter(store==treat_store) |>
  dplyr::select(period, WTP) |> rename(WTP_synth=WTP) |>
  mutate(series="Rotterdam (actual)")

bind_rows(synth_ts, treated_ts) |>
  ggplot(aes(x=period, y=WTP_synth, colour=series, linetype=series)) +
  geom_line(linewidth=1) +
  geom_vline(xintercept=n_pre+0.5, linetype="dashed", colour="grey50") +
  annotate("text", x=n_pre+1, y=5.0, label="Eco-label\nintroduced", hjust=0, size=3.5, colour="grey40") +
  scale_colour_manual(values=c("Rotterdam (actual)"=clr_eco,"Synthetic Rotterdam"=clr_ctrl)) +
  scale_linetype_manual(values=c("Rotterdam (actual)"="solid","Synthetic Rotterdam"="dashed")) +
  labs(x="Period", y="Mean WTP ($)", colour=NULL, linetype=NULL,
       title="Synthetic Control: Rotterdam vs. its synthetic twin built from control stores",
       subtitle="Good pre-period fit validates the synthetic match; gap after dashed line = causal effect") +
  theme_mod3()

Researcher Checklist: Selection on Observables

Key questions about your matching or weighting strategy

Have you measured all the confounders that drive both selection into treatment and the outcome? Conditional independence is the core assumption — it cannot be tested, only argued. Name the confounders explicitly and acknowledge any you could not measure.
Is there sufficient overlap? Treated and control units with propensity scores near 0 or 1 have no good comparison group. Check the PS distributions by treatment status; units in regions of poor overlap should be trimmed or flagged.
Does the PS model include only pre-treatment covariates? Including post-treatment variables (mediators, proxies for Y) blocks the causal pathway and biases the estimate.
Did you verify covariate balance after matching? The goal is balance on individual covariates, not just on the PS. A Love plot with |SMD| < 0.10 for all covariates is the standard check.