15  Part 1: What Is a Causal Effect?

16 Part 1: What Do We Mean by “Causal Effect”?

16.1 Causal Inference is a Deductive Process

Important: What this exercise assumes

The 64 structures shown below represent the complete space of possible causal relationships among three variables under ideal conditions: (1) proper measurement — X, M, and Y are each measured without discriminant validity failures or systematic error (Module 1), and (2) no omitted variables — these three variables are the only players in the system (Module 2). In real research, measurement error and hidden confounders expand this space considerably.

Causal inference is not induction — you cannot prove a causal structure from data. It is deduction by elimination: you bring every tool in your methodological toolkit to bear and see which structures survive. The exercise below makes this concrete.

The grid shows all 64 structurally distinct relationships that can exist among three variables X, M, and Y — arranged as a triangle with X at the lower-left, Y at the lower-right, and M at the top. Each cell shows one possible causal structure. Arrows between any two variables can point forward (A → B), backward (A ← B), bidirectionally (A ↔︎ B, representing correlated unmeasured causes or feedback), or not exist at all.
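The count of 64 follows from three variable pairs with four possible arrow states each. A minimal sketch in base R; the state labels and column names are illustrative choices, not anything defined by the exercise:

```r
# Four possible states for the link between any two variables
states <- c("none", "forward", "backward", "bidirectional")

# One row per structure: every combination of states across the three pairs
structures <- expand.grid(
  X_M = states,
  M_Y = states,
  X_Y = states,
  stringsAsFactors = FALSE
)

n_structures <- nrow(structures)  # 4^3 combinations
print(n_structures)
```

Each row of `structures` corresponds to one cell of the grid, under whatever ordering the grid happens to use.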

Click the buttons below to apply research findings. Each finding rules out the structures inconsistent with it — those cells will be crossed out in red. After exhausting your toolkit, notice how many structures remain. That irreducible remainder is why causal inference is hard.

16.1.1 What if There Is an Unobserved Confounder?

The exercise above assumed no hidden variables. Now introduce U — an unobserved variable that may have arrows to X, M, and/or Y but is never measured. No research tool can constrain U directly.

Each of the three U-pairs (U–X, U–M, U–Y) can be absent, forward, backward, or bidirectional — 4³ = 64 U configurations for every one of the 64 X–M–Y structures, giving 4,096 total states. Rather than draw all 4,096 cells, the grid below preserves the familiar 64 X–M–Y layout. Each cell’s blue shading encodes how many of the 64 U configurations remain consistent with your findings: deeper blue = U is highly ambiguous for that structure; pale = almost no U configuration survives; crossed out = the X–M–Y structure is ruled out for every possible U configuration. The small dashed circle labelled U in each cell is a reminder that the confounder is always present but unobserved.

Notice in particular: pressing Manipulate X eliminates U→X (since X is now exogenous), but U’s connections to M and Y remain entirely free — experimental control of X does not eliminate unmeasured confounding of the M–Y path.
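To see the counting at work, the sketch below enumerates the 64 U configurations and drops those with an arrow from U into X. It assumes that both the forward (U → X) and the bidirectional state count as containing such an arrow; the interactive grid may encode this differently, and the labels are illustrative.

```r
states <- c("none", "forward", "backward", "bidirectional")

# "forward" is read here as U -> variable, "backward" as variable -> U
u_configs <- expand.grid(U_X = states, U_M = states, U_Y = states,
                         stringsAsFactors = FALSE)
n_before <- nrow(u_configs)

# Assumption: manipulating X removes every state with an arrow into X
into_x    <- u_configs$U_X %in% c("forward", "bidirectional")
survivors <- u_configs[!into_x, ]
n_after   <- nrow(survivors)

# The U-M and U-Y links are untouched: every combination of them remains
n_my_combos <- nrow(unique(survivors[, c("U_M", "U_Y")]))
```

Under that reading, half of the U configurations survive for any given X–M–Y structure, and nothing has been learned about U's links to M and Y.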


16.2 The Fundamental Problem of Causal Inference

Every participant in the AlterEco eco-label study had two potential outcomes:

  • \(Y_i(0)\): their WTP if they did not see the eco-label
  • \(Y_i(1)\): their WTP if they did see the eco-label

Their individual treatment effect is \(\tau_i = Y_i(1) - Y_i(0)\).

The problem? We can only observe one of these for any individual. This is called the Fundamental Problem of Causal Inference (Holland, 1986). Everything that follows is a strategy for getting around it.

▶ Build the data-generating system
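set.seed(2024)
N <- 2000

# Standardised consumer traits that drive willingness to pay (WTP)
env        <- rnorm(N, 0, 1)   # environmental concern
price_sens <- rnorm(N, 0, 1)   # price sensitivity
income     <- rnorm(N, 0, 1)

# Potential outcome without the label, clipped to the range [1, 10]
Y0 <- 5.50 + 0.80*env - 0.50*price_sens + 0.30*income + rnorm(N, 0, 1.10)
Y0 <- pmax(1, pmin(10, Y0))

# Latent individual effect of the label, then the clipped potential outcome with the label
ITE <- 0.60 + 0.80*env - 0.55*price_sens + rnorm(N, 0, 0.40)
Y1  <- pmax(1, pmin(10, Y0 + ITE))

# Because Y1 is clipped, the realised effect Y1 - Y0 can differ from the latent ITE
ATE_true <- mean(Y1 - Y0)
cat(sprintf("True ATE  = %.3f\nTrue ITE SD = %.3f\n", ATE_true, sd(Y1 - Y0)))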
True ATE  = 0.586
True ITE SD = 0.967
The fundamental problem: exactly ONE potential outcome per person is observed.
| Person | Env concern | Y(0) No label | Y(1) With label | True ITE | Assigned to | What we observe |
|---|---|---|---|---|---|---|
| Sophie (high concern) | 1.3 | 8.79 | 10.00 | 1.21 | Control | Y(0)=$8.79 Y(1)=??? |
| Rens (moderate+) | 1.2 | 6.42 | 6.57 | 0.15 | Treatment | Y(0)=??? Y(1)=$6.57 |
| Lotte (neutral) | -0.1 | 5.16 | 5.87 | 0.70 | Control | Y(0)=$5.16 Y(1)=??? |
| Daan (low concern) | -1.2 | 6.73 | 6.62 | -0.11 | Treatment | Y(0)=??? Y(1)=$6.62 |
| Kim (price-sensitive) | -1.7 | 5.12 | 5.46 | 0.34 | Control | Y(0)=$5.12 Y(1)=??? |
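The masking in the table can be reproduced in a few lines. This is a minimal sketch that re-enters the five rows above by hand; the data frame and column names are illustrative.

```r
# Potential outcomes and assignment for the five consumers in the table
people <- data.frame(
  person  = c("Sophie", "Rens", "Lotte", "Daan", "Kim"),
  Y0      = c(8.79, 6.42, 5.16, 6.73, 5.12),
  Y1      = c(10.00, 6.57, 5.87, 6.62, 5.46),
  treated = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)

# What the researcher records: the outcome under the assigned condition only
people$Y_obs <- ifelse(people$treated, people$Y1, people$Y0)

# The unassigned potential outcome is missing for every person
people$Y0_seen <- ifelse(people$treated, NA, people$Y0)
people$Y1_seen <- ifelse(people$treated, people$Y1, NA)

# So the individual effect cannot be computed for anyone
people$ITE_seen <- people$Y1_seen - people$Y0_seen
print(people[, c("person", "Y_obs", "ITE_seen")])
```

Every entry of `ITE_seen` is `NA`, which is the Fundamental Problem in miniature.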

16.3 Individual Treatment Effects (ITE)

The individual treatment effect for person \(i\) is:

\[\tau_i = Y_i(1) - Y_i(0)\]

This is the only truly assumption-free definition of causality: it is the effect on this person, in this moment, caused by this treatment. Did the eco-label change this consumer’s willingness to pay? That is the ITE.

The ITE is simultaneously the most interpretable estimand and the one we can never directly observe — because every person appears in only one condition at one point in time. The histogram below shows the true ITE distribution that exists in reality but is entirely invisible to any researcher. Notice how wide it is: some consumers gain over $2 in WTP from seeing the label; others gain nothing; a few are actually put off by it. This heterogeneity is the central fact that all other estimands must grapple with.

▶ Plot the distribution of individual treatment effects
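# Needs the tidyverse; clr_eco, clr_ctrl and theme_mod3() are assumed to come from a setup chunk not shown here
tibble(ITE=ITE,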
       env_grp=cut(env, c(-Inf,-0.5,0.5,Inf), labels=c("Low","Moderate","High"))) |>
  ggplot(aes(x=ITE, fill=env_grp)) +
  geom_histogram(bins=50, alpha=0.75, position="identity") +
  geom_vline(xintercept=ATE_true, linetype="dashed", linewidth=1.1, colour="grey20") +
  annotate("text", x=ATE_true+0.06, y=82,
           label=sprintf("ATE = $%.2f", ATE_true), hjust=0, size=3.5, fontface="bold") +
  scale_fill_manual(values=c("High"=clr_eco,"Moderate"="#74b9a0","Low"=clr_ctrl)) +
  labs(x="Individual Treatment Effect  Y(1)−Y(0)  ($)", y="Count", fill=NULL,
       title="The eco-label effect is wildly different across individuals",
       subtitle="In real research you never see this distribution — only the ATE is estimable") +
  theme_mod3()

16.4 Average Treatment Effect (ATE)

\[\text{ATE} = E[\tau_i] = E[Y_i(1) - Y_i(0)]\]

The ATE is the expectation of the ITE across the entire target population — the answer to “what would happen on average if this treatment were applied to everyone?”

The assumption needed to use ATE as a proxy for ITE: Treatment effect homogeneity — the belief that every unit responds the same way (\(\tau_i = \tau\) for all \(i\)). If the eco-label added exactly the same WTP boost to every consumer, the ATE and every ITE would be identical, and knowing the ATE would fully characterise all individual responses.

In practice, homogeneity is almost never plausible. The ITE histogram above makes this plain: effects range from negative to strongly positive. This means the ATE answers a different question than the ITE. The ATE is the right estimand for population-level policy decisions — “should supermarkets mandate eco-labels on all products?” — because its answer applies in aggregate even when individual effects vary wildly. But it is a poor guide to predicting any individual consumer’s response, because it averages over enormous heterogeneity.

Why the ATE is still the gold standard for experiments: In a randomised trial, random assignment makes the treated and control groups exchangeable in expectation, so the observed mean difference is an unbiased estimate of the ATE without assuming homogeneity. The experiment recovers the ATE honestly; it is the subsequent inference from ATE to individual effect that requires the homogeneity assumption.

That said, the exchangeability assumption is more fragile than it appears — and more fragile than is often acknowledged. Random assignment guarantees balance in expectation across infinite replications, but any single study can produce unlucky covariate imbalance. More importantly, the behavioural conditions for exchangeability extend beyond the draw of assignment: if participants in the treatment condition behave differently because they know they are treated — through demand characteristics, Hawthorne effects, or awareness of experimental purpose — or if the sample differs systematically from the target population, then the observed mean difference estimates something other than the population ATE. Module 2 explored these threats in detail: the design features that protect exchangeability (pre-registration, cover stories, unobtrusive measures, stimulus sampling, representative recruitment) are precisely what make an ATE estimate credible rather than merely randomised.
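Both points, unbiasedness across replications and bad luck in a single draw, can be illustrated by re-randomising one fixed sample many times. This is a minimal sketch with hypothetical potential outcomes; the sample size, coefficients, and variable names are assumptions made for illustration and are separate from the AlterEco simulation.

```r
set.seed(1)
n <- 500

# Hypothetical potential outcomes with heterogeneous effects
env <- rnorm(n)
y0  <- 5.5 + 0.8 * env + rnorm(n)
y1  <- y0 + 0.6 + 0.8 * env + rnorm(n, 0, 0.4)
ate_true <- mean(y1 - y0)

# Re-randomise the same sample and record the estimate and the covariate gap
one_trial <- function() {
  treat <- rbinom(n, 1, 0.5)
  y_obs <- ifelse(treat == 1, y1, y0)
  c(estimate = mean(y_obs[treat == 1]) - mean(y_obs[treat == 0]),
    env_gap  = mean(env[treat == 1]) - mean(env[treat == 0]))
}
trials <- t(replicate(2000, one_trial()))

mean_estimate <- mean(trials[, "estimate"])      # averages out near ate_true
sd_estimate   <- sd(trials[, "estimate"])        # spread of single-study estimates
max_env_gap   <- max(abs(trials[, "env_gap"]))   # worst covariate imbalance seen
```

The average of the estimates sits close to `ate_true`, while `sd_estimate` and `max_env_gap` show how far any one randomisation can stray from it.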

Code
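# Assign each consumer to the label with probability 0.5; only the assigned potential outcome is observed
treat_rct <- rbinom(N, 1, 0.5)
Y_obs_rct <- ifelse(treat_rct==1, Y1, Y0)
df_rct    <- tibble(id=1:N, treat=treat_rct, Y_obs=Y_obs_rct, Y0, Y1, ITE, env, price_sens, income)

# Difference in means; the regression below gives the same point estimate plus an SE (tidy() is from broom)
ATE_est <- mean(Y_obs_rct[treat_rct==1]) - mean(Y_obs_rct[treat_rct==0])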
lm(Y_obs ~ treat, data=df_rct) |> tidy() |>
  filter(term=="treat") |>
  transmute(Estimand="ATE", `True value`=round(ATE_true,3), Estimate=round(estimate,3),
            SE=round(std.error,3),
            `95% CI`=sprintf("[%.3f, %.3f]", estimate-1.96*std.error, estimate+1.96*std.error)) |>
  knitr::kable(caption="ATE estimate from randomised eco-label experiment")
ATE estimate from randomised eco-label experiment
| Estimand | True value | Estimate | SE | 95% CI |
|---|---|---|---|---|
| ATE | 0.586 | 0.72 | 0.08 | [0.563, 0.878] |

16.5 Conditional Average Treatment Effect (CATE)

\[\text{CATE}(x) = E[\tau_i \mid X_i = x]\]

The CATE partitions the population by pre-treatment characteristics \(X\) and estimates a separate average treatment effect within each subgroup. The eco-label may add $1.40 WTP for consumers with high environmental concern but only $0.15 for low-concern consumers — the ATE buries this story.

What CATE relaxes: The global homogeneity assumption. CATE allows different types of consumers to respond differently; reading a CATE as an individual's effect then requires homogeneity only within each subgroup (everyone with the same covariate profile \(x\) is assumed to respond identically to their group’s average).

What CATE still assumes — and why the gap to ITE remains:

  1. You have measured the right moderators — and measured them well. You can only condition on variables you have collected. If the strongest driver of heterogeneity is something unmeasured — say, a consumer’s specific shopping occasion or the brand’s existing equity — your CATEs are averages over that unobserved heterogeneity and may not correspond to any real subgroup’s true effect. But even when you have collected the right moderator, measurement quality matters enormously: a construct that suffers from discriminant validity failures, cross-group non-invariance, or systematic response bias (see Module 1) produces distorted CATE estimates even if the causal heterogeneity is real. A CATE model built on “environmental concern” items that actually blend concern with general pro-social identity will attribute the wrong heterogeneity to the wrong mechanism. The entire CATE machinery is predicated on the measurement infrastructure being sound.

  2. Correct functional form when extrapolating. Estimating CATE for a continuous moderator (e.g., environmental concern score) requires assuming a relationship between the moderator and the treatment effect — linearity, correct interactions, no omitted non-linearities. Even a well-estimated linear CATE model will give wrong predictions for individuals whose true effect relationship is non-linear.

  3. Within-cell homogeneity. Even if you carve out fine-grained cells, predicting an individual’s response from their cell’s CATE still assumes every person in that cell responds identically to the cell average. As cells get finer this becomes a stronger claim about fewer people, eventually reducing to the same problem as the ITE itself.

The plot below shows CATE estimates versus the (unobservable) true CATEs across four environmental concern bins. The estimates track the truth well here — but each bar still summarises a distribution of individual effects within the bin.

Code
df_rct |>
  mutate(env_grp=cut(env, c(-Inf,-1,0,1,Inf), labels=c("Very Low","Low","High","Very High"))) |>
  group_by(env_grp) |>
  summarise(CATE_true=mean(Y1-Y0), CATE_est=mean(Y_obs[treat==1])-mean(Y_obs[treat==0]),
            n=n(), .groups="drop") |>
  pivot_longer(starts_with("CATE"), names_to="type", values_to="val") |>
  ggplot(aes(x=env_grp, y=val, fill=type)) +
  geom_col(position="dodge", alpha=0.85, width=0.65) +
  geom_hline(yintercept=ATE_true, linetype="dashed", colour="grey30") +
  annotate("text", x=0.65, y=ATE_true+0.05, label="Overall ATE", size=3.5) +
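  # Name the labels so they match the right series: unnamed labels follow the
  # alphabetical break order (CATE_est, CATE_true) and would be swapped
  scale_fill_manual(values=c(CATE_true=clr_eco, CATE_est=clr_ctrl),
                    labels=c(CATE_true="True CATE", CATE_est="Estimated CATE")) +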
  labs(x="Environmental Concern", y="Eco-label effect ($)", fill=NULL,
       title="CATE: the label works far better for environmentally engaged consumers",
       subtitle="Reporting only the ATE hides this story completely") +
  theme_mod3()
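For a continuous moderator, the functional-form assumption in point 2 becomes explicit once the CATE is written as a regression with a treatment-by-moderator interaction. This is a minimal sketch on freshly simulated data in which the true effect really is linear in concern; the coefficients and names are assumptions for illustration, and a non-linear truth would not be recovered by this model.

```r
set.seed(1)
n     <- 4000
env   <- rnorm(n)
treat <- rbinom(n, 1, 0.5)

# Hypothetical truth: the label effect rises linearly with concern
tau <- 0.6 + 0.8 * env
y   <- 5.5 + 0.8 * env + treat * tau + rnorm(n, 0, 1.1)

# treat * env expands to treat + env + treat:env
fit <- lm(y ~ treat * env)
b   <- coef(fit)

# Model-implied CATE at concern score x: b[treat] + b[treat:env] * x
cate_at <- function(x) unname(b["treat"] + b["treat:env"] * x)
cate_at(c(-1, 0, 1))
```

The interaction coefficient estimates how much the label effect changes per unit of concern, so `cate_at()` extrapolates along a straight line whether or not the data support one.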

16.6 Compliance, ITT, LATE, ATT, and ATC

When the treatment of interest cannot be directly assigned — only encouraged — the population stratifies into four compliance types based on how each unit would respond to assignment under both conditions:

| Type | \(D_i(Z{=}0)\) | \(D_i(Z{=}1)\) | Description |
|---|---|---|---|
| Complier | 0 | 1 | Takes treatment only when assigned |
| Always-taker | 1 | 1 | Takes treatment regardless |
| Never-taker | 0 | 0 | Never takes treatment |
| Defier | 1 | 0 | Takes treatment only when not assigned |

In the eco-label context: a store may be assigned to display eco-labels (\(Z=1\)), but individual consumers may or may not actually look at them (\(D=1\)). Consumers who would read any label they see are always-takers; those who read the label only because the store displays it prominently are compliers; those who never engage with shelf labels are never-takers.

Each estimand below targets a different slice of the ITE distribution — and each translation back to the ITE requires different assumptions:

Intent-to-treat (ITT): The effect of assignment to the treatment condition regardless of actual receipt. Under imperfect compliance the ITT is diluted, because some assigned units are never actually treated: when the LATE assumptions below hold, the ITT equals the complier effect multiplied by the compliance rate. ITT is the most conservative and assumption-light estimand here, but it answers the wrong question if you care about the treatment’s biological or psychological mechanism rather than the logistical reality of assignment.

Local Average Treatment Effect (LATE): The ITT divided by the first-stage compliance rate (the Wald estimator). This recovers the average ITE for compliers only: those who take the treatment because they were assigned to, not because they always would have. The LATE is the right estimand when you care about the effect of the treatment itself among the people an assignment policy can actually move, and it rests on two additional assumptions beyond randomisation and a non-zero first stage:

  • Monotonicity: No defiers. Every unit’s take-up is weakly higher under assignment than without it (\(D_i(1) \geq D_i(0)\) for all \(i\)). This rules out consumers who deliberately avoid eco-labels when stores promote them.
  • Exclusion restriction: Assignment \(Z\) affects the outcome only through take-up \(D\) — not through any direct path. If displaying eco-labels changes store atmosphere and affects WTP independently of whether any consumer reads them, the exclusion restriction fails. This is a structural version of the manipulation validity problem from Module 2: just as a laboratory manipulation can inadvertently activate demand characteristics or alter the psychological context beyond the intended treatment, an instrument can influence outcomes through channels the researcher neither intended nor anticipated. The same design rigour that Module 2 prescribed for protecting internal validity — careful operationalisation, manipulation checks, ruling out alternative accounts of the treatment — applies equally to establishing that an instrument is truly excludable.

Even with these assumptions satisfied, LATE still describes an average over compliers’ ITEs, requiring within-complier homogeneity to predict any individual complier’s response.
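Written out, the Wald logic runs as follows. The notation follows the compliance table above; the labels ITT and FS (first stage) are shorthand introduced here. With random assignment, the two quantities estimable from data are

\[
\text{ITT} = E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0], \qquad
\text{FS} = E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0].
\]

Under the exclusion restriction, assignment changes the outcome only for units whose take-up it changes. Under monotonicity those units are exactly the compliers, so

\[
\text{ITT} = \Pr\big(D_i(1) > D_i(0)\big) \cdot E\big[\tau_i \mid D_i(1) > D_i(0)\big], \qquad
\text{FS} = \Pr\big(D_i(1) > D_i(0)\big),
\]

and therefore

\[
\text{LATE} = \frac{\text{ITT}}{\text{FS}} = E\big[\tau_i \mid D_i(1) > D_i(0)\big].
\]

If defiers exist, the first stage becomes the share of compliers minus the share of defiers, and the ratio no longer isolates the complier average.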

ATT and ATC: The average treatment effect for units that did (ATT) or did not (ATC) actually receive the treatment. Because treatment receipt may be self-selected — consumers who seek out eco-labels are likely those who care most about sustainability and therefore respond most strongly — ATT \(\neq\) ATE \(\neq\) ATC. Translating from ATT to ITE for a specific treated unit requires assuming that unit is representative of the treated group, which is a homogeneity claim within the treated subpopulation.
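The ordering of these three averages under self-selection can be checked in a small simulation. This is a sketch with hypothetical individual effects and a hypothetical selection rule in which take-up rises with environmental concern; it is not the AlterEco data-generating process.

```r
set.seed(1)
n   <- 5000
env <- rnorm(n)

# Hypothetical individual effects that rise with environmental concern
tau <- 0.6 + 0.8 * env + rnorm(n, 0, 0.4)

# Self-selection: high-concern consumers are more likely to seek out the label
d <- rbinom(n, 1, plogis(env))

ate <- mean(tau)            # everyone
att <- mean(tau[d == 1])    # those who took the treatment
atc <- mean(tau[d == 0])    # those who did not
round(c(ATT = att, ATE = ate, ATC = atc), 2)
```

Because the same trait drives both take-up and responsiveness, the treated group's average effect exceeds the population average, which in turn exceeds the untreated group's.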

Note: The chain of assumptions back to ITE

Every estimand is a different answer to a different version of the causal question. The further you move from the population-wide ATE toward more targeted estimands, the more specific the claim — but also the more assumptions required to bridge back to the individual level:

| Estimand | Population | Additional assumption(s) to reach ITE |
|---|---|---|
| ATE | All units | Treatment effect homogeneity |
| CATE | Subgroup \(X = x\) | Correct moderator specification; within-cell homogeneity |
| LATE | Compliers | Monotonicity; exclusion restriction; within-complier homogeneity |
| ATT | Treated units | Representativeness of the treated for the target individual |
| ATC | Control units | Representativeness of controls for the target individual |

None of these estimands is the ITE. All of them are averages over a group of individuals. The ITE is the only true unit-level causal effect, and it is the only one that requires no assumptions about other units — but it is also the only one we can never observe. Applied causal inference is the art of choosing the estimand whose assumptions are most defensible given your design and question, while being explicit that a gap always remains between your estimate and the individual-level effect that treatment ultimately acts upon.

Code
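set.seed(2024)
Z <- treat_rct                              # random assignment doubles as the instrument
p_comply_if_Z1 <- plogis(0.8 + 0.6*env)     # P(take-up | assigned), rising with concern
p_always_taker <- plogis(-2.5 + 0.5*env)    # P(take-up | not assigned)
# The two potential take-ups are drawn independently, so some units get
# D(1)=0 and D(0)=1: this simulation contains defiers (monotonicity fails)
D_if_Z1 <- rbinom(N, 1, p_comply_if_Z1)
D_if_Z0 <- rbinom(N, 1, p_always_taker)
D       <- ifelse(Z==1, D_if_Z1, D_if_Z0)   # observed take-up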
compliance_type <- case_when(
  D_if_Z1==1 & D_if_Z0==0 ~ "Complier",
  D_if_Z1==1 & D_if_Z0==1 ~ "Always-taker",
  D_if_Z1==0 & D_if_Z0==0 ~ "Never-taker",
  D_if_Z1==0 & D_if_Z0==1 ~ "Defier"
)
knitr::kable(table(compliance_type), col.names=c("Compliance type","Count"),
             caption="True compliance breakdown (observable only in simulation)")
True compliance breakdown (observable only in simulation)
| Compliance type | Count |
|---|---|
| Always-taker | 133 |
| Complier | 1244 |
| Defier | 44 |
| Never-taker | 579 |
Code
Y_obs_iv <- ifelse(D==1, Y1, Y0)
df_iv    <- tibble(Z, D, Y_obs_iv, Y0, Y1, ITE, compliance_type, env, price_sens)

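ITT       <- mean(Y_obs_iv[Z==1]) - mean(Y_obs_iv[Z==0])   # effect of assignment on the outcome
FS        <- mean(D[Z==1]) - mean(D[Z==0])                 # first stage: effect of assignment on take-up
LATE_est  <- ITT / FS                                      # Wald estimator
# The three quantities below use the true ITEs, which only a simulation can see
LATE_true <- mean(ITE[compliance_type=="Complier"])
ATT_est   <- mean(ITE[D==1])
ATC_est   <- mean(ITE[D==0])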

tibble(
  Estimand    = c("ATE","ITT","LATE (Wald)","ATT","ATC"),
  Description = c(
    "Effect averaged over all 2,000 consumers",
    "Effect of *assigning* the label regardless of compliance",
    "Effect for consumers who complied *because* they were assigned",
    "Effect for consumers who *actually saw* the label",
    "Effect for those who *did not see* it"
  ),
  `True value` = round(c(ATE_true, ITT, LATE_true, ATT_est, ATC_est), 3),
  Estimated    = round(c(ATE_true, ITT, LATE_est,  ATT_est, ATC_est), 3)
) |> knitr::kable(caption="Five ways to summarise 'the effect' of the eco-label")
Five ways to summarise ‘the effect’ of the eco-label
| Estimand | Description | True value | Estimated |
|---|---|---|---|
| ATE | Effect averaged over all 2,000 consumers | 0.586 | 0.586 |
| ITT | Effect of assigning the label regardless of compliance | 0.520 | 0.520 |
| LATE (Wald) | Effect for consumers who complied because they were assigned | 0.716 | 0.889 |
| ATT | Effect for consumers who actually saw the label | 0.792 | 0.792 |
| ATC | Effect for those who did not see it | 0.499 | 0.499 |
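The compliance simulation above draws the two potential take-ups independently, which is how 44 defiers arise even though the LATE interpretation assumes there are none. The sketch below is a hypothetical variant, not the module's code: it uses one shared uniform draw per person so that take-up is monotone by construction, and it simplifies the outcome model (no clipping, fewer covariates). All names and coefficients are assumptions for illustration.

```r
set.seed(1)
n   <- 20000
env <- rnorm(n)
z   <- rbinom(n, 1, 0.5)

# One shared uniform draw per person makes take-up monotone: D(1) >= D(0)
u  <- runif(n)
p0 <- plogis(-2.5 + 0.5 * env)            # take-up probability if not assigned
p1 <- pmax(p0, plogis(0.8 + 0.6 * env))   # take-up probability if assigned
d0 <- as.integer(u < p0)
d1 <- as.integer(u < p1)
d  <- ifelse(z == 1, d1, d0)

# Hypothetical outcomes; z affects y only through d (exclusion restriction)
tau <- 0.6 + 0.8 * env + rnorm(n, 0, 0.4)
y0  <- 5.5 + 0.8 * env + rnorm(n, 0, 1.1)
y   <- y0 + d * tau

n_defiers <- sum(d1 == 0 & d0 == 1)
wald      <- (mean(y[z == 1]) - mean(y[z == 0])) /
             (mean(d[z == 1]) - mean(d[z == 0]))
late_true <- mean(tau[d1 == 1 & d0 == 0])   # average effect among compliers
round(c(Wald = wald, `True LATE` = late_true), 3)
```

With no defiers and no direct path from assignment to outcome, the Wald ratio should differ from the complier average only by sampling error.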