Part 2: What Does Random Assignment Require?

▶ Load required packages

# install.packages(c("tidyverse","ggplot2","knitr","scales","patchwork"))

library(tidyverse)
library(ggplot2)
library(knitr)
library(scales)
library(patchwork)
if (!requireNamespace("lavaan", quietly = TRUE)) install.packages("lavaan")
library(lavaan)

set.seed(2025)

Part 2: What Does “Random Assignment” Actually Require?

The central message of Part 2

Never assume that random assignment has achieved actual randomization. Random assignment is a procedure — clicking “Randomize” in Qualtrics, running sample() in R. What that procedure is supposed to deliver is exchangeability — the latent property that treatment and control groups are interchangeable with respect to every potential cause of Y, so that any difference in outcomes can be attributed solely to the treatment. The procedure does not guarantee the property. Whether it does depends on (1) how complex your outcome variable Y is and (2) how many participants you have. This section develops the gap between the two and equips you to evaluate — not assume — whether exchangeability was achieved.

Three failure modes that can produce imbalanced groups

Not all imbalance has the same cause or the same remedy:

A flawed assignment mechanism — the procedure itself was not truly random (e.g., participants self-selected into conditions, blocks were allocated predictably, or the algorithm had a bug). The imbalance is structural; better analysis cannot fix a broken procedure.
A valid randomization with unlucky realized imbalance — the procedure was correct, but any finite-sample draw will produce some imbalance by chance. The diagnostic reflects the realized sample, not the quality of the procedure. Failed balance is not the same thing as failed randomization.
Latent imbalance on unmeasured dimensions — measured covariates balance out, but unobserved antecedents of Y do not. This is undetectable from the balance table and is the primary threat this section addresses.

The right language for cases 2 and 3 is “realized imbalance” or “failed balance diagnostic” — not “randomization failed.” A valid randomization can produce an unlucky balance outcome in a finite sample. The question is always whether that imbalance is small enough, relative to the covariate’s effect on Y, that the treatment contrast remains substantively credible.

The Measurement Analogy: Observable vs. Latent

In Module 1, we introduced a distinction that runs through all of empirical research:

	Example from Module 1
Observable (indicator)	A participant’s response on a 7-point Likert scale
Latent construct	Their true level of Green Purchase Intention

The observable is what you can see. The latent construct is what you actually want to know. The entire Module 1 was about the gap between these two things: your observed scale picks up variance from the wrong latent constructs (discriminant validity), from unequal measurement properties across groups (non-invariance), and from subpopulations that behave differently (latent classes).

Exactly the same distinction applies to randomization:

	In Randomization
Observable (indicator)	Pressing “Randomize” in Qualtrics, running `sample()` in R, using a random number table
Latent construct	Exchangeability — the property that treatment and control groups are interchangeable with respect to all potential causes of Y; no systematic pre-existing difference remains between conditions

Just as a 7-point scale can look like it measures Green Purchase Intention while actually reflecting Environmental Concern, a randomization procedure can look random (you clicked the button — the observable is there) while failing to achieve exchangeability — the latent property that the groups truly have no pre-existing differences on anything that affects Y.

The construct validity question for randomization

Ask yourself the question from Module 1, applied to your treatment variable:

“Does your observable act of randomization have construct validity as an indicator of latent exchangeability?”

Exchangeability is not checked by confirming that a randomization algorithm was run — it must be verified empirically. Its validity depends entirely on how complex the outcome variable Y is and how many participants you have. We will now build the intuition for why.

A Concrete Anchor: The Playing Cards Analogy

Imagine you have a sorted deck of cards — you know exactly what order the cards are in. Before a game, you need to make the deck random. So you shuffle it once.

Did one shuffle make the deck random?

The answer depends on the size of the deck.

5 cards: There are 5! = 120 possible orderings. One riffle shuffle covers a substantial fraction of these. The deck is close to random.
52 cards: There are 52! ≈ 8 × 10⁶⁷ possible orderings. One riffle shuffle covers an astronomically small fraction of these. Significant structure from the original sorted order survives.

This is why casino dealers shuffle 7 times — a mathematical result (Bayer & Diaconis, 1992) showing that 7 GSR riffle shuffles are sufficient for a 52-card deck to approach uniformity. One shuffle is not enough.

We can visualize this directly. We simulate one-through-ten riffle shuffles for decks of different sizes and measure how much structure (correlation with the original sorted order) survives.

▶ Simulate riffle shuffles for different deck sizes

one_riffle <- function(deck) {
  n <- length(deck)
  k <- rbinom(1, n, 0.5)
  if (k == 0L || k == n) return(deck[sample(n)])
  pile_a <- deck[1:k]; pile_b <- deck[(k + 1):n]
  result <- integer(n); idx <- 1L
  while (length(pile_a) > 0L && length(pile_b) > 0L) {
    p_a <- length(pile_a) / (length(pile_a) + length(pile_b))
    if (runif(1) < p_a) { result[idx] <- pile_a[1]; pile_a <- pile_a[-1]
    } else               { result[idx] <- pile_b[1]; pile_b <- pile_b[-1] }
    idx <- idx + 1L
  }
  if (length(pile_a) > 0L) result[idx:n] <- pile_a
  if (length(pile_b) > 0L) result[idx:n] <- pile_b
  result
}

set.seed(42)
shuffle_sim <- expand.grid(deck_size = c(5L, 10L, 20L, 52L),
                           n_shuffles = 1:10, sim = 1:400) |>
  dplyr::mutate(tau = mapply(function(n, k) {
    deck <- 1:n
    for (i in seq_len(k)) deck <- one_riffle(deck)
    cor(1:n, deck, method = "kendall")
  }, deck_size, n_shuffles))

shuffle_summary <- shuffle_sim |>
  group_by(deck_size, n_shuffles) |>
  summarise(mean_tau = mean(tau), lo = quantile(tau, 0.10),
            hi = quantile(tau, 0.90), .groups = "drop") |>
  mutate(deck_label = paste0(deck_size, " cards"))

ggplot(shuffle_summary,
       aes(x = n_shuffles, y = mean_tau,
           color = factor(deck_size), fill = factor(deck_size), group = deck_size)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.14, color = NA) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40", linewidth = 0.8) +
  annotate("text", x = 10.2, y = 0.03, label = "Random\n(\u03c4 \u2248 0)",
           size = 3.5, color = "gray40", hjust = 0) +
  scale_x_continuous(breaks = 1:10) +
  scale_color_manual(values = c("5"  = "#4a90d9", "10" = "#52b788",
                                "20" = "#f4a261", "52" = "#e63946"),
                     name = "Deck size") +
  scale_fill_manual(values  = c("5"  = "#4a90d9", "10" = "#52b788",
                                "20" = "#f4a261", "52" = "#e63946"),
                    name = "Deck size") +
  labs(x = "Number of riffle shuffles",
       y = "Kendall's \u03c4 with original sorted order\n(1 = perfectly sorted  \u00b7  0 = random)",
       title = "How Many Shuffles Are Needed to Achieve Randomness?",
       subtitle = "Ribbon = 10th\u201390th percentile across 400 simulations per deck size") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank())

The analogy to experiments maps directly:

Playing Cards	Research Experiment
Cards in the deck	Participants in the study
Order before shuffling	Values on potential confounders
One shuffle	Clicking “Randomize” once in Qualtrics
“Truly random” deck	Treatment and control groups balanced on ALL confounders
Size of the deck	Complexity (number of dimensions) of the outcome Y

The key takeaway: just as a larger deck needs more shuffles to achieve randomness, a more complex outcome variable Y needs more observations to achieve true independence between treatment and all potential confounders.

Exchangeability: The Latent Property Random Assignment Is Supposed to Deliver

The permutation logic in Part 1 rested on exchangeability: if treatment had no effect, we could have assigned any participant to either condition without changing what we would observe. In formal terms, exchangeability means:

\[Y^{(0)}, Y^{(1)} \perp\!\!\!\perp T\]

That is, potential outcomes (what Y would be under control, what Y would be under treatment) are independent of which condition a participant actually received. This is the statistical expression of “the groups have no pre-existing differences that matter.”

Exchangeability holds when treatment assignment is truly independent of all potential causes of Y. In practice, researchers assume this is achieved by the act of randomization — but you should never make this assumption without empirical verification. The gap between the observable and the latent is stark:

What the researcher observes: Qualtrics assigned participants to conditions using a random number generator. ✓
What the researcher needs (latent): Exchangeability — treatment assignment is uncorrelated with all potential causal antecedents of Y — every dimension, every subdimension, every latent moderator in the entire data-generating process. ?

Whether the randomization procedure achieves the exchangeability property depends on two things: 1. How many antecedents Y has (the complexity of the construct) 2. How many participants you have (the sample size)

Case 1 — A Narrow Construct: Visual Fixation Time

Suppose your outcome variable Y is total gaze duration (in milliseconds) on an eco-label during a product display, measured via eye-tracking. This is a narrow, physiological measure driven by a small number of simple perceptual factors.

Think of dimensions as all the hidden moderators in the data-generating process — every property of the participant or their environment that makes them more or less likely to fixate on the label, regardless of the experimental manipulation. These are the variables that, if correlated with treatment assignment by chance, would confound your estimate of the treatment effect. In the entire data-generating process for visual fixation, suppose there are only 2 such binary dimensions:

Dimension	Low (0)	High (1)
Screen luminance	Dim display	Bright display
Viewing distance	Far from screen	Close to screen

These are simple physical factors. They are not set by the researcher and could vary unsystematically across participants assigned to each condition.

Because both dimensions are binary, the complete “design space” of all possible hidden-moderator combinations is a 2 × 2 matrix of 4 cells:

▶ Visualize: 2×2 cell matrix for narrow construct

cells <- expand.grid(
  Luminance = c("Dim display\n(Low)", "Bright display\n(High)"),
  Distance  = c("Far from screen\n(Low)", "Close to screen\n(High)")
)
cells$label <- c("Cell A", "Cell B", "Cell C", "Cell D")

ggplot(cells, aes(x = Luminance, y = Distance)) +
  geom_tile(fill = "#d6ecff", color = "#4a90d9", linewidth = 1.2,
            width = 0.92, height = 0.92) +
  geom_text(aes(label = label), size = 5.5, fontface = "bold", color = "#1a3a6b") +
  geom_text(aes(label = "n \u2265 1"), size = 3.8, color = "#555", vjust = 3.2) +
  labs(title = "Design Space for Visual Fixation (Narrow Construct)",
       subtitle = "2 binary hidden moderators \u2192 2\u00b2 = 4 cells  \u00b7  Minimum N to populate matrix = 4",
       x = "Screen Luminance", y = "Viewing Distance") +
  theme_minimal(base_size = 13) +
  theme(panel.grid = element_blank(), axis.text = element_text(size = 10))

Why 4 cells?

With k binary dimensions in the data-generating process, the total number of unique hidden-moderator combinations is 2^k. Each cell represents one possible “type” of participant in terms of everything that drives Y. For randomization to succeed, treatment and control groups must be approximately balanced across all 2^k cells — the smaller k is, the easier this is to achieve.

For k = 2: 2² = 4 cells. A minimal experiment of N = 4 would cover the space; N = 40 gives 10 per cell with very reliable balance.

With only 4 meaningful combinations in the design space, even a modest experiment (e.g., N = 40, 10 per cell) will very likely achieve approximate balance between treatment and control. One act of randomization is almost certainly sufficient.

Case 2 — A Broad Construct: Sustainable Purchase Intention

Now suppose your outcome Y is sustainable product purchase intention — the kind of broad, latent, multi-faceted attitude that appears throughout consumer and organizational research, and that you saw in Module 1 struggling to maintain discriminant validity across measurement attempts.

In the data-generating process, suppose there are 10 binary determinants of this construct:

▶ Table: 10 binary dimensions of sustainable purchase intention

dims_table <- data.frame(
  `#`       = 1:10,
  Dimension = c("Environmental values", "Household income", "Social norm salience",
                "Brand trust", "Perceived product quality", "Health motivation",
                "Peer influence", "Prior green behavior",
                "Price sensitivity", "Emotional response to sustainability messaging"),
  Low       = c("Weak", "Low income", "Low salience", "Low trust",
                "Low quality perception", "Low", "Weak influence",
                "Little prior behavior", "High sensitivity", "Low emotional response"),
  High      = c("Strong", "High income", "High salience", "High trust",
                "High quality perception", "High", "Strong influence",
                "Extensive prior behavior", "Low sensitivity", "High emotional response"),
  check.names = FALSE
)
kable(dims_table,
      col.names = c("#", "Dimension", "Level: Low (0)", "Level: High (1)"),
      caption   = "10 binary antecedents of Sustainable Purchase Intention")

10 binary antecedents of Sustainable Purchase Intention
#	Dimension	Level: Low (0)	Level: High (1)
1	Environmental values	Weak	Strong
2	Household income	Low income	High income
3	Social norm salience	Low salience	High salience
4	Brand trust	Low trust	High trust
5	Perceived product quality	Low quality perception	High quality perception
6	Health motivation	Low	High
7	Peer influence	Weak influence	Strong influence
8	Prior green behavior	Little prior behavior	Extensive prior behavior
9	Price sensitivity	High sensitivity	Low sensitivity
10	Emotional response to sustainability messaging	Low emotional response	High emotional response

With k = 10 binary dimensions, the total number of unique participant profiles in the design space is 2^10 = 1,024 cells. Notice how quickly this grows compared to the narrow construct:

▶ Plot: combinatorial growth of design space

expl_df <- data.frame(k = 1:15, cells = 2^(1:15)) |>
  mutate(label = ifelse(k == 2,  "Recall\nAccuracy\n(k = 2)",
                 ifelse(k == 10, "Sustainable\nPurchase\nIntention\n(k = 10)", NA)))

ggplot(expl_df, aes(x = k, y = cells)) +
  geom_line(color = "#4a90d9", linewidth = 1.4) +
  geom_point(size = 3, color = "#4a90d9") +
  geom_point(data = filter(expl_df, k %in% c(2, 10)),
             size = 5, shape = 21, fill = "white", color = "#d7301f", stroke = 2) +
  geom_text(data = filter(expl_df, k %in% c(2, 10)),
            aes(label = label),
            nudge_x = 0.9, nudge_y = 0, hjust = 0, size = 3.2,
            color = "#d7301f", lineheight = 0.9) +
  geom_vline(xintercept = c(2, 10), linetype = "dotted", color = "#d7301f", linewidth = 0.7) +
  scale_y_log10(labels = label_comma(),
                breaks  = c(4, 32, 256, 1024, 8192, 32768)) +
  scale_x_continuous(breaks = c(1, 2, 5, 10, 15)) +
  labs(x = "Number of binary dimensions (k)",
       y = "Design space: number of unique profiles (log scale)",
       title = "How Quickly Does the Design Space Grow?",
       subtitle = "Each additional binary dimension doubles the number of distinct participant profiles (2\u1d4f)") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank())

Visualizing the 1,024-Cell Matrix

With 10 binary dimensions, the full design matrix has 1,024 rows. We can visualize this as a heatmap where each column is a unique participant profile and each row is one dimension:

▶ Plot: all 1,024 unique profiles in the 10-dimension design space

all_profiles <- as.data.frame(
  do.call(expand.grid, replicate(10, 0:1, simplify = FALSE)))
colnames(all_profiles) <- c("Env. Values", "Income", "Social Norms",
                             "Brand Trust", "Perc. Quality", "Health Motiv.",
                             "Peer Influence", "Prior Behavior",
                             "Price Sensitiv.", "Emotional Resp.")
all_profiles$profile_id <- seq_len(nrow(all_profiles))

plot_df_m <- all_profiles |>
  pivot_longer(-profile_id, names_to = "Dimension", values_to = "Level") |>
  mutate(Dimension = factor(Dimension, levels = rev(colnames(all_profiles)[-11])))

ggplot(plot_df_m, aes(x = profile_id, y = Dimension, fill = factor(Level))) +
  geom_raster() +
  scale_fill_manual(values = c("0" = "#d6ecff", "1" = "#1a3a6b"),
                    labels = c("Low (0)", "High (1)"), name = "Level") +
  labs(x = "Profile (1 of 1,024 unique combinations)", y = NULL,
       title = "The Full Design Space for Sustainable Purchase Intention",
       subtitle = "Each column = one participant profile  \u00b7  Each row = one binary dimension  \u00b7  1,024 cells total") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid = element_blank(), legend.position = "top")

To exactly fill this matrix — one observation per cell — you would need N = 1,024 participants. That is just to have a single observation per cell, with no statistical power to spare.

From Exact Coverage to Approximate Orthogonality

Here is the key insight: you do not need to fill every cell exactly.

In a randomized experiment, what you need is not exact coverage, but approximate independence — the correlation between treatment assignment and each of the 10 dimensions should be close to zero.

How many observations do you need for that? With random assignment and N participants, the expected correlation between treatment T and any single confounder C is zero. The actual correlation in any given sample will deviate from zero by chance, with standard error ≈ 1 / √N. If you want ALL D correlations to stay below some threshold $|r_{\max}|$, you must account for D simultaneous comparisons (Bonferroni):

\[N \geq \left(\frac{z_{1-\alpha/(2D)}}{r_{\max}}\right)^2\]

What does $r_{\max}$ actually mean?

Before looking at the numbers, it is worth anchoring $r_{\max}$ concretely. If your treatment T accidentally correlates r with omitted confounder C, and C has effect β (in SD units of Y — this is identical to Cohen’s d when Y is standardized) on the outcome, then the omitted variable bias in your estimated treatment effect is approximately β × r.

A quick reminder from the power table above: Cohen’s d expresses the treatment effect in the same SD-unit currency. An effect of d = 0.40 means the group means differ by 0.40 standard deviations of Y. The β values below use that same ruler — so a β of 0.50 means the confounder shifts Y by half a standard deviation.

▶ Plot: how r_max tolerance maps to omitted-variable bias

rmax_vals <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60)
beta_vals <- seq(0.1, 1.5, by = 0.1)

bias_df <- expand.grid(r_max = rmax_vals, beta = beta_vals) |>
  mutate(bias   = r_max * beta,
         r_label = sprintf("r = %.2f", r_max))

# Cohen's d reference lines
cohen_refs <- data.frame(
  d     = c(0.20, 0.50, 0.80),
  label = c("Small effect\n(d = 0.20)", "Medium effect\n(d = 0.50)", "Large effect\n(d = 0.80)")
)

ggplot(bias_df, aes(x = beta, y = bias,
                    color = factor(r_max), group = factor(r_max))) +
  geom_hline(data = cohen_refs, aes(yintercept = d),
             linetype = "dotted", color = "gray60", linewidth = 0.6) +
  geom_text(data = cohen_refs, aes(x = 1.52, y = d + 0.02, label = label),
            inherit.aes = FALSE, color = "gray50", size = 2.8, hjust = 0, lineheight = 0.85) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 2.5) +
  scale_color_manual(
    values = c("0.05" = "#08519c", "0.1" = "#4a90d9", "0.2" = "#52b788",
               "0.3"  = "#f4a261", "0.4" = "#e07b39", "0.6" = "#e63946"),
    labels = sprintf("r_max = %.2f", rmax_vals),
    name   = "Correlation\ntolerance (r_max)"
  ) +
  scale_x_continuous(breaks = seq(0.2, 1.4, 0.2),
                     labels = function(x) sprintf("\u03b2 = %.1f SD", x)) +
  scale_y_continuous(labels = function(x) sprintf("%.2f SD", x),
                     limits = c(0, 0.95)) +
  coord_cartesian(clip = "off") +
  labs(x = "Effect of omitted confounder on Y (\u03b2 = Cohen's d equivalent; SD units)",
       y = "Omitted-variable bias in your treatment estimate",
       title = "How Much Bias Does Residual Imbalance Create?",
       subtitle = paste0("Bias \u2248 r_max \u00d7 \u03b2  \u00b7  Dotted lines = small/medium/large effect benchmarks\n",
                         "Lower r_max = less bias once achieved, but requires more N to verify")) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "bottom",
        legend.direction = "horizontal",
        plot.margin = margin(5, 50, 5, 5))

In the coffee WTP context (SD ≈ $1.50):

r = 0.05 with β = 0.8 SD: bias ≈ 0.04 SD ≈ $0.06 — negligible
r = 0.10 with β = 0.8 SD: bias ≈ 0.08 SD ≈ $0.12 — small but detectable
r = 0.20 with β = 0.8 SD: bias ≈ 0.16 SD ≈ $0.24 — now as large as a small true effect
r = 0.40 with β = 1.0 SD → bias ≈ 0.40 SD ≈ $0.60 — larger than many published effects
r = 0.60 with β = 1.0 SD → bias ≈ 0.60 SD ≈ $0.90 — could reverse the sign of your treatment estimate entirely

A Violation of Exchangeability Is Omitted Variable Bias

Before turning to sample size requirements, it is worth being precise about what a violation of exchangeability actually is — because the answer connects directly to Module 1, Case 2 on omitted variable bias.

Recall the OVB formula. When a variable C is omitted from a regression, the estimated effect of treatment T is:

\[\hat{\beta}_T = \beta_T + \underbrace{\beta_C \cdot r_{T,C}}_{\text{omitted variable bias}}\]

where $\beta_C$ is C’s true effect on Y and $r_{T,C}$ is the correlation between T and C. In a well-designed experiment, randomization drives $r_{T,C} \approx 0$ for every confounder C — making the OVB term vanish by construction. When exchangeability fails — when treatment assignment becomes accidentally correlated with some confounder — $r_{T,C} \neq 0$, and OVB enters the experimental estimate through exactly the same channel as in observational research.

A concrete example. Suppose your eco-label WTP study recruited participants through an environmental forum, creating a systematic imbalance on environmental values:

Group	High env. values (%)	Observed WTP
Control (N = 95)	38%	$4.20
Eco-label (N = 105)	61%	$5.50

This imbalance corresponds to r(T, env_values) ≈ 0.23. Environmental values have a direct positive effect on WTP of β ≈ 0.8 SD (≈ $1.20) even without any eco-label stimulus present.

What you observe: Eco-label group mean WTP = $5.50, control mean = $4.20. Observed difference = $1.30.

The bias: $\beta_C \cdot r_{T,C} = 1.20 \times 0.23 \approx \$0.28$.

True eco-label effect: $1.30 − $0.28 = $1.02 — your estimate is inflated by roughly 27%.

The groups are no longer exchangeable. You are not comparing “eco-label vs. no eco-label” — you are comparing “eco-label + environmentally minded participants” against “no eco-label + less environmentally minded participants.” The imbalance on environmental values correlates with WTP independently of the eco-label, injecting bias into what was supposed to be a clean experimental estimate.

Reading the direction of bias from the balance table

Because the bias formula is $\beta_C \cdot r_{T,C}$, you can reason about direction and magnitude for any confounder you measured:

Treated group has more of a positive confounder (r > 0, β > 0): bias inflates the treatment estimate — your effect looks larger than it is
Treated group has less of a positive confounder (r < 0, β > 0): bias deflates the estimate — your effect looks smaller than it is
Confounder has a negative effect on Y (β < 0): these directions flip

For confounders you did not measure, the direction is unknown — which is why preventing imbalance through design (blocked randomization, large N) is always preferable to correcting for it post hoc.

This connection is not incidental. It is the reason why the entire framework for evaluating randomization in this section is built around $r_{\max}$: that tolerance parameter is the limit you are placing on permissible OVB. A looser $r_{\max}$ allows more OVB to enter through treatment-covariate correlations; a stricter one requires more observations to achieve.

Required N for Approximate Orthogonality

▶ Calculate: N required for approximate orthogonality

required_N <- function(D, r_thresh = 0.10, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / (2 * D))
  ceiling((z_crit / r_thresh)^2)
}

dims_check   <- c(1, 2, 3, 5, 8, 10, 15, 20)
thresh_check <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60)

ortho_table <- expand.grid(D = dims_check, r_max = thresh_check) |>
  mutate(N_req = mapply(required_N, D, r_max),
         exact_cells = 2^D) |>
  pivot_wider(names_from = r_max, values_from = N_req,
              names_prefix = "r_max = ") |>
  rename(`Dimensions (D)` = D, `Exact cells (2^D)` = exact_cells)

kable(ortho_table,
      caption = "Minimum N for approximate orthogonality at \u03b1 = .05 (Bonferroni-corrected)",
      format.args = list(big.mark = ","))

Minimum N for approximate orthogonality at α = .05 (Bonferroni-corrected)
Dimensions (D)	Exact cells (2^D)	r_max = 0.05	r_max = 0.1	r_max = 0.2	r_max = 0.3	r_max = 0.4	r_max = 0.6
1	2	1,537	385	97	43	25	11
2	4	2,010	503	126	56	32	14
3	8	2,293	574	144	64	36	16
5	32	2,654	664	166	74	42	19
8	256	2,991	748	187	84	47	21
10	1,024	3,152	788	197	88	50	22
15	32,768	3,447	862	216	96	54	24
20	1,048,576	3,657	915	229	102	58	26

▶ Plot: N required vs. dimensions for different tolerances

ortho_plot_df <- expand.grid(D = 1:20, r_max = thresh_check) |>
  mutate(N_required = mapply(required_N, D, r_max),
         exact_N    = 2^D,
         r_label    = paste0("r_max = ", r_max))

ggplot(ortho_plot_df, aes(x = D)) +
  geom_line(aes(y = N_required, color = r_label, group = r_label), linewidth = 1.2) +
  geom_line(aes(y = exact_N), color = "black", linetype = "dashed", linewidth = 1.0) +
  geom_vline(xintercept = c(2, 10), linetype = "dotted", color = "gray50") +
  annotate("text", x = 2.3, y = 40000, label = "Recall\nAccuracy\n(k=2)",
           size = 2.8, color = "gray40", hjust = 0) +
  annotate("text", x = 10.3, y = 40000, label = "Sust. Purchase\nIntention\n(k=10)",
           size = 2.8, color = "gray40", hjust = 0) +
  annotate("text", x = 14.5, y = 2^14 * 1.8, label = "Exact fill\n(2^D cells)",
           size = 2.8, color = "black") +
  scale_y_log10(labels = label_comma()) +
  scale_x_continuous(breaks = seq(1, 20, 2), limits = c(1, 21)) +
  scale_color_manual(
    values = c("r_max = 0.05" = "#08519c", "r_max = 0.1"  = "#4a90d9",
               "r_max = 0.2"  = "#52b788", "r_max = 0.3"  = "#f4a261",
               "r_max = 0.4"  = "#e07b39", "r_max = 0.6"  = "#e63946"),
    name = "Correlation\ntolerance"
  ) +
  coord_cartesian(clip = "off") +
  labs(x = "Number of binary dimensions of Y (D)",
       y = "Observations required (log scale)",
       title = "How Many Observations for Approximate Orthogonality?",
       subtitle = "Dashed = exact fill (2^D)  \u00b7  Colored lines = Bonferroni-corrected independence threshold") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "right",
        plot.margin = margin(5, 30, 5, 5))

Reading the table: Why does r_max = 0.05 at D = 10 require more N than the Exact cells column?

At first glance, a row like D = 10 looks paradoxical: the Exact cells (2^D) = 1,024 column is smaller than the r_max = 0.05 column (≈ 3,152). Doesn’t that mean structured assignment needs fewer people than random assignment — even while achieving perfect orthogonality?

Yes — and that is exactly right, for a subtle reason. The two columns are measuring different things:

Exact cells (2^D): This is the total number of cells in a fully structured full-factorial design — every possible combination of D binary dimensions gets exactly one observation. By construction, all D factors are perfectly balanced and mutually uncorrelated (r = 0 exactly). No statistical test is needed; orthogonality is guaranteed by design. The 1,024 figure is simply the number of cells, not a sample size adequate for inference.
r_max columns: These are the N required to verify that a random, unstructured assignment achieved approximate orthogonality at the stated tolerance, using a Bonferroni-corrected independence test. With random assignment, there is no structural guarantee — you must check empirically, and the check requires enough power to detect any correlation exceeding r_max.

The full factorial is more efficient per cell because its structure eliminates random correlation entirely. But that efficiency evaporates quickly in practice: with only 1 observation per cell you cannot estimate variance or make inferential statements. A realistic full factorial at D = 10 needs at least 10–30 observations per cell, putting the total at 10,240–30,720 — far more than the 3,152 needed under random assignment with r_max = 0.05.

In short: the exact-cells column tells you the minimum to fill the design; the r_max columns tell you the minimum to trust random assignment. At moderate D, random assignment wins on cost; at small D and large cells, full factorial wins on precision.

The key tradeoff — and why most experiments are implicitly operating at high $r_{\max}$

The table above reveals an uncomfortable tension that sits at the heart of experimental practice:

Lower $r_{\max}$ = a stricter standard = less confounding once achieved = better causal inference.

But lower $r_{\max}$ also requires far more observations to reliably verify. N scales as $1/r_{\max}^2$, so halving your tolerance quadruples the required sample.

The core punchline

This is the central tension in designing experiments for complex constructs. The researcher faces a dilemma:

Set a strict $r_{\max}$ (e.g., 0.05–0.10): Your causal inferences will be clean — but you need sample sizes that few lab studies can afford.
Accept a lenient $r_{\max}$ (e.g., 0.30–0.40): Your sample size is achievable — but residual confounding may be large enough to bias your estimates substantially, especially when individual confounders have strong effects (β ≈ 0.5–1.0 SD) on Y.

A typical psychology or marketing experiment with N = 60–120 participants is, at best, achieving $r_{\max} \approx 0.30$–$0.40$ for a construct with D = 5–10 dimensions — regardless of whether Qualtrics randomized. Researchers who report a balance table with N = 80 and declare “randomization succeeded” are not wrong about the observable; they are implicitly operating at a very lenient latent $r_{\max}$, where substantial confounding remains undetected.

In Module 1’s language: their randomization observable has low construct validity as an indicator of latent randomization, because the gap between the observable and the latent property is too wide to close at typical sample sizes.

With a narrow outcome like Visual Fixation Time (D = 2 dimensions): - To ensure $|r| < 0.10$ with 95% confidence: N ≈ 503 — feasible - A study with N = 80 comfortably satisfies even a strict $r_{\max} = 0.20$

With a broad outcome like Sustainable Purchase Intention (D = 10 dimensions): - To ensure $|r| < 0.10$: N ≈ 788 — rarely achieved in practice - To ensure $|r| < 0.20$: N ≈ 197 — still demanding for most lab studies - A study with N = 200 barely meets the r_max < 0.20 threshold (required N = 197). A typical lab study with N = 80, however, is operating at an effective $r_{\max} \approx 0.31$ — where a single strong confounder (β ≈ 0.8 SD) can inject bias of 0.24 SD or more into your treatment estimate

This is the direct experimental parallel to Module 1’s core finding about discriminant validity. A scale with many correlated sub-dimensions is hard to measure cleanly; a construct with many latent antecedents is hard to randomize cleanly. Both problems grow with construct breadth, and both require you to map the latent space before claiming clean measurement or clean identification.

Simulation: Correlated Dimensions — Two Countervailing Forces

The analysis so far assumed that the 10 latent dimensions of Y are independent of one another in the data-generating process. But is that realistic?

Think about the constructs driving Sustainable Purchase Intention. Environmental values, health motivation, social norm salience, and peer influence tend to cluster together — the “conscious consumer” profile. Similarly, price sensitivity, income, and habitual spending patterns cluster together — the “pragmatic buyer” profile. The latent dimensions are correlated, not orthogonal.

Two forces pull in opposite directions

Before running any simulation, consider the logic:

Force 1 — fewer independent checks (sometimes helps). If two dimensions are highly correlated (say, environmental values and social norm salience move together), then for the purpose of counting how many independent things can go wrong, you effectively have fewer than 10 independent dimensions. Fewer independent checks = lower probability that any one of them exceeds the threshold by chance = slightly easier to achieve orthogonality for a given N.

Force 2 — cascade failures (always hurts). The flip side is severe. When randomization does fail on one correlated dimension, it fails on all the correlated ones simultaneously. Imbalance on “environmental values” drags along imbalance on “social norms,” “health motivation,” and “peer influence” all at once. Each of these has its own independent β effect on Y, so the bias compounds. Instead of one confounder contributing β × r bias, you get four correlated confounders each contributing, with their effects accumulating.

When does correlation help vs. hurt?

The key variable is why the dimensions are correlated:

Scenario	Structure	Effect on randomization
Helpful: dimensions cluster around a few latent “super-factors”	e.g., all “values-based” dimensions move together as one bloc	Fewer effective checks; orthogonality easier to achieve at smaller N
Hurtful: dimensions are correlated and each carries independent variance in Y	e.g., environmental values, health focus, and social norms each have separate β effects on WTP	Cascade failures: one imbalanced assignment fails simultaneously on 3 dimensions with additive bias
Worst case: high correlation and strong independent effects on Y	The dims are correlated with each other AND each has a large β	One randomization failure contaminates every correlated dimension, compounding bias substantially

The concrete intuition: imagine you are a researcher measuring balance, and you check environmental values and find it balanced (r ≈ 0). If that dimension was highly correlated with health focus and social norms in your sample, you might infer those are balanced too — and you might be right. That is the helpful version. But if each of those three dimensions has an independent effect on WTP that wasn’t captured by environmental values alone, then any failure of balance cascades to all three simultaneously — that is the hurtful version.

This maps directly onto Module 1’s concept of correlated subscales. When subscales are highly correlated, you gain efficiency in measurement (fewer items needed to capture the same information). But if they are discriminantly invalid — correlated and capturing separate variance — then mismeasuring one means mismeasuring all.

Simulation

We run two contrasting scenarios, each featuring correlated dimensions but differing in whether the failure mode is primarily helpful or hurtful:

Scenario A — “Helpful” correlation (ρ = 0.75, uniform small β): Dimensions are very strongly correlated — effectively 10 measured dimensions but only ~2–3 independent blocs. Each dimension has a similar, small β effect on Y (β = 0.15). When the correlation structure is this tight, the effective number of independent checks drops sharply, making orthogonality easier to achieve.
Scenario B — “Hurtful” correlation (ρ = 0.35, heterogeneous large β): Dimensions have moderate correlation but each carries a meaningfully different and large independent effect on Y (ranging from β = 0.05 to β = 1.50). When randomization fails on one dimension, several correlated ones fail simultaneously — and because each carries its own large β, the cumulative bias explodes.

▶ Simulate: three DGP scenarios + Type-1 error rate (may take ~2 min)

# ── Helper: correlated binary covariates via Cholesky ────────────────────────
corr_binary <- function(N, D, rho) {
  Sigma <- matrix(rho, D, D); diag(Sigma) <- 1
  L     <- chol(Sigma)
  z     <- matrix(rnorm(N * D), nrow = N) %*% L
  (z > 0L) * 1L
}

# ── Simulation: P(max |r| < threshold) ───────────────────────────────────────
sim_orthog <- function(N, D, rho = 0, threshold = 0.10, n_sim = 1500) {
  mean(replicate(n_sim, {
    X        <- if (rho == 0) matrix(rbinom(N * D, 1, 0.5), nrow = N)
                else          corr_binary(N, D, rho)
    T_assign <- rbinom(N, 1, 0.5)
    cors     <- apply(X, 2, function(col) abs(cor(col, T_assign)))
    max(cors) < threshold
  }))
}

# ── Simulation: cumulative bias when any failure occurs ──────────────────────
sim_bias <- function(N = 200, D = 10, rho = 0, betas, threshold = 0.10, n_sim = 1000) {
  replicate(n_sim, {
    X        <- if (rho == 0) matrix(rbinom(N * D, 1, 0.5), nrow = N)
                else          corr_binary(N, D, rho)
    T_assign <- rbinom(N, 1, 0.5)
    cors     <- apply(X, 2, function(col) cor(col, T_assign))
    sum(abs(cors[abs(cors) > threshold]) * betas[abs(cors) > threshold])
  })
}

# ── Simulation: Type-1 and Type-2 error rates, conditional on orthogonality failing ──
# The key insight: under PERFECT randomization, Type-1 is always ~5% regardless of betas.
# The interesting question is: given your experiment FAILED the orthogonality check
# (as many real experiments do), how bad are your error rates?
# Large betas (Scenario B) turn small residual imbalance into large bias → inflated Type-1.
# Tiny betas (Scenario A) produce negligible bias even when imbalance exists → Type-1 near 5%.
sim_error_rates <- function(N, D, rho, betas, r_max = 0.10,
                            true_eff = 0.40, n_sim = 2000) {
  results <- t(replicate(n_sim, {
    X        <- if (rho == 0) matrix(rbinom(N * D, 1, 0.5), nrow = N)
                else          corr_binary(N, D, rho)
    T_assign <- rbinom(N, 1, 0.5)
    cors     <- apply(X, 2, function(x) abs(cor(x, T_assign)))
    orth_ok  <- max(cors) <= r_max          # did this experiment pass the balance check?
    Y_null   <- as.vector(X %*% betas) + rnorm(N, 0, 1)
    p_null   <- t.test(Y_null[T_assign == 1], Y_null[T_assign == 0])$p.value
    Y_alt    <- true_eff * T_assign + as.vector(X %*% betas) + rnorm(N, 0, 1)
    p_alt    <- t.test(Y_alt[T_assign == 1], Y_alt[T_assign == 0])$p.value
    c(t1 = as.integer(p_null < 0.05), t2 = as.integer(p_alt > 0.05),
      orth_ok = as.integer(orth_ok))
  }))
  failed <- results[, "orth_ok"] == 0L
  passed <- results[, "orth_ok"] == 1L
  # Type-1: false-positive rate CONDITIONAL ON orthogonality FAILING (violations present)
  t1_cond  <- if (sum(failed) > 10) mean(results[failed, "t1"]) else NA_real_
  # Type-2: miss rate CONDITIONAL ON orthogonality PASSING (no violations; p-values are valid here)
  t2_valid <- if (sum(passed) > 10) mean(results[passed, "t2"]) else NA_real_
  data.frame(t1_rate = t1_cond, t2_rate = t2_valid,
             pct_fail = mean(failed))
}

set.seed(42)

N_vals <- c(50, 100, 200, 400, 800)
D      <- 10

# Scenario betas — made more extreme to amplify the contrast
betas_helpful <- rep(0.15, D)                    # uniform tiny betas
betas_hurtful <- c(1.50, 1.20, 1.00, 0.90, 0.70,
                   0.50, 0.30, 0.20, 0.10, 0.05) # large heterogeneous betas

# P(orthogonal) across N, for three DGP types
prob_df <- bind_rows(
  data.frame(N   = N_vals, DGP = "Independent (rho = 0)",
             prob = sapply(N_vals, sim_orthog, D = D, rho = 0.00)),
  data.frame(N   = N_vals, DGP = "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)",
             prob = sapply(N_vals, sim_orthog, D = D, rho = 0.75)),
  data.frame(N   = N_vals, DGP = "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)",
             prob = sapply(N_vals, sim_orthog, D = D, rho = 0.35))
)

scen_colors <- c(
  "Independent (rho = 0)"                                  = "#4a90d9",
  "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)" = "#52b788",
  "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)"  = "#e63946"
)

p_prob <- ggplot(prob_df, aes(x = N, y = prob, color = DGP, group = DGP)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3.5) +
  geom_hline(yintercept = 0.95, linetype = "dashed", color = "gray40") +
  annotate("text", x = 820, y = 0.957, label = "95% target",
           size = 3.0, color = "gray40", hjust = 0) +
  scale_x_continuous(breaks = N_vals) +
  scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_color_manual(values = scen_colors, name = NULL) +
  labs(x = "Sample size (N)",
       y = "P(max |r| < 0.10 across all 10 dimensions)",
       title = "Probability of Achieving Orthogonality (D = 10)",
       subtitle = paste0("Scenario A (strong clustering, tiny betas): easier — correlation reduces effective checks\n",
                         "Scenario B (moderate correlation, large varied betas): hardest — and most dangerous when it fails")) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "top",
        legend.text = element_text(size = 9))

# ── Bias when randomization fails ─────────────────────────────────────────────
set.seed(42)
bias_indep   <- sim_bias(N = 200, D = D, rho = 0.00, betas = betas_hurtful)
bias_helpful <- sim_bias(N = 200, D = D, rho = 0.75, betas = betas_helpful)
bias_hurtful <- sim_bias(N = 200, D = D, rho = 0.35, betas = betas_hurtful)

bias_df <- bind_rows(
  data.frame(bias = bias_indep,   DGP = "Independent (rho = 0)"),
  data.frame(bias = bias_helpful, DGP = "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)"),
  data.frame(bias = bias_hurtful, DGP = "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)")
) |>
  mutate(DGP = factor(DGP, levels = c("Independent (rho = 0)",
                                       "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)",
                                       "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)")))

p_bias <- ggplot(bias_df |> filter(bias > 0),
                 aes(x = bias, fill = DGP)) +
  geom_histogram(bins = 35, alpha = 0.80, color = "white", linewidth = 0.2) +
  geom_vline(data = bias_df |> filter(bias > 0) |>
               group_by(DGP) |> summarise(med = median(bias), .groups = "drop"),
             aes(xintercept = med, color = DGP), linewidth = 1.1, linetype = "dashed") +
  facet_wrap(~DGP, nrow = 3, scales = "free") +
  scale_fill_manual(values = scen_colors, guide = "none") +
  scale_color_manual(values = scen_colors, guide = "none") +
  scale_x_continuous(labels = function(x) sprintf("%.2f SD", x)) +
  labs(x = "Cumulative OVB when randomization fails (SD units of Y)",
       y = "Frequency across 1,000 experiments",
       title = "When Randomization Fails: How Large Is the Bias? (N = 200)",
       subtitle = "Dashed = median cumulative bias  \u00b7  Only experiments with \u2265 1 failing dimension shown") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        strip.text = element_text(face = "bold", size = 8.5))

# ── Type-1 error rates across N (Type-2 intentionally excluded — see note below) ──
set.seed(42)
err_df <- bind_rows(
  lapply(N_vals, function(n) {
    cbind(N = n, DGP = "Independent (rho = 0)",
          sim_error_rates(n, D, rho = 0.00, betas = betas_hurtful))
  }),
  lapply(N_vals, function(n) {
    cbind(N = n,
          DGP = "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)",
          sim_error_rates(n, D, rho = 0.75, betas = betas_helpful))
  }),
  lapply(N_vals, function(n) {
    cbind(N = n,
          DGP = "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)",
          sim_error_rates(n, D, rho = 0.35, betas = betas_hurtful))
  })
) |>
  mutate(N        = as.integer(N),
         t1_rate  = as.numeric(t1_rate),
         pct_fail = as.numeric(pct_fail),
         DGP      = as.character(DGP))

# Show Type-1 error only (Type-2 is not reported — see callout below)
p_err <- ggplot(err_df |> filter(DGP != "Independent (rho = 0)"),
                aes(x = N, y = t1_rate, color = DGP, group = DGP)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3.0) +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "gray40") +
  annotate("text", x = max(N_vals) * 1.02, y = 0.055, label = "5% nominal",
           color = "gray40", size = 3.2, hjust = 0) +
  scale_x_continuous(breaks = N_vals) +
  scale_y_continuous(labels = percent_format(accuracy = 1),
                     limits = c(0, NA)) +
  scale_color_manual(values = scen_colors[names(scen_colors) != "Independent (rho = 0)"],
                     name = NULL) +
  labs(x = "Sample size (N)",
       y = "Type-1 error rate (among failed experiments)",
       title = "False-Positive Rate Rises with N When Orthogonality Fails",
       subtitle = paste0(
         "Conditional on |r(T, X)| > 0.10  \u00b7  ",
         "Dashed = 5% nominal  \u00b7  ",
         "Scenario B's large \u03b2s amplify residual imbalance into detectable bias")) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(), legend.position = "top",
        legend.text = element_text(size = 8.5),
        plot.margin = margin(5, 50, 5, 5))

p_prob / p_bias / p_err

Reading the three plots together

Top plot — P(achieving orthogonality): Scenario A (ρ = 0.75, tiny uniform betas) is the easiest case. The very strong correlation among dimensions collapses the effective number of independent checks, so a smaller N achieves the 95% target. This is the helpful force at work. Scenario B (ρ = 0.35, large varied betas) offers little advantage over independent dimensions on this metric.

Middle plot — bias when randomization fails: This is where the scenarios diverge dramatically. When Scenario A fails, it fails on a highly clustered bloc with tiny β values — cumulative bias remains small. When Scenario B fails, it fails on moderately correlated dimensions each with large independent β effects — the bias distribution has a fat right tail, with individual experiments easily incurring > 1.0 SD of cumulative confound. Notice how much wider the Scenario B distribution is compared to A.

Bottom plot — Type-1 error rate (given orthogonality failed): Conditional on at least one covariate being imbalanced (|r| > 0.10), Scenario B’s large betas turn even small residual correlations into substantial bias — false positive rates rise well above 5% and continue rising as N increases. Scenario A stays near 5% because its tiny betas keep the omitted-variable bias small even when imbalance exists. This plot conditions on experiments with bad realized balance — it is a stylized sensitivity demonstration showing how dangerous residual imbalance can become when covariates strongly predict Y. It is not the unconditional false-positive rate of a properly randomized design, which remains near 5% regardless of N.

Why is Type-2 error (power) not shown?

Because the question of power is irrelevant when exchangeability is violated.

Power analysis asks: “If there is a true effect, how often will my test correctly detect it?” That question is only valid when the p-value is a valid test statistic — i.e., when treatment is exchangeable with control. When orthogonality fails, the p-value is not a valid measure of the treatment effect. A “detection” (p < 0.05) in an experiment where orthogonality failed could mean either (a) you correctly found a real effect, or (b) the confound inflated the test statistic into significance independently of whether an effect exists. You cannot distinguish them. Power therefore has no valid interpretation under a violated exchangeability assumption — the right question is how to restore valid inference, not how often the broken test produces p < 0.05.

Counterintuitive danger: larger N makes invalid tests more dangerous, not less

Most researchers assume larger samples make experiments safer. The bottom plot reveals the opposite: among experiments that fail the orthogonality check, the false-positive rate rises with N — the very opposite of what researchers intuitively expect.

The mechanism: the plot conditions on experiments where |r(T, X)| > 0.10. Among those “failing” experiments, the residual bias (r × β) is roughly constant across N — the failing correlation barely exceeds 0.10 in all cases. What does change is the standard error of the t-test, which shrinks as 1/√N. The same small bias that was invisible noise at small N becomes a statistically detectable “effect” at large N.

The compounding trap visualized: The plot below shows — for Scenario B (the dangerous case) — how Type-1 error (among experiments with violations) and Type-2 error/miss rate (among experiments without violations) evolve as N grows. Point size is proportional to the frequency of orthogonality violations. Read it as a race: the red line (false-positive danger) climbs faster than the blue line (miss rate) falls.

▶ Side-by-side: Type-1 error (violations) vs power (no violations) as N grows

# Extract Scenario B only; compute power = 1 - miss rate; reshape to long for facets
danger_df <- err_df |>
  filter(DGP == "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)") |>
  dplyr::select(N, t1_rate, t2_rate, pct_fail) |>
  mutate(power = 1 - t2_rate) |>                     # power = 1 - miss rate
  tidyr::pivot_longer(c(t1_rate, power),
                      names_to  = "metric",
                      values_to = "rate") |>
  mutate(
    panel = dplyr::case_when(
      metric == "t1_rate" ~
        "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)",
      metric == "power" ~
        "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"
    ),
    panel = factor(panel, levels = c(
      "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)",
      "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"
    )),
    color_val = ifelse(metric == "t1_rate", "#e63946", "#457b9d")
  )

# Reference line data: 5% for Type-1 panel, 80% for power panel
ref_lines <- data.frame(
  panel = factor(c(
    "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)",
    "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"
  ), levels = levels(danger_df$panel)),
  yref  = c(0.05, 0.80),
  label = c("5% nominal \u03b1", "80% power")
)

ggplot(danger_df, aes(x = N, y = rate, group = panel)) +
  geom_hline(data = ref_lines, aes(yintercept = yref),
             linetype = "dashed", color = "gray50", linewidth = 0.8) +
  geom_text(data = ref_lines, aes(x = 820, y = yref + 0.025, label = label),
            color = "gray40", size = 3.0, hjust = 0, inherit.aes = FALSE) +
  geom_line(aes(color = panel), linewidth = 1.4) +
  geom_point(aes(size = pct_fail, color = panel), alpha = 0.88) +
  facet_wrap(~panel, nrow = 1, scales = "free_y") +
  scale_color_manual(values = c(
    "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)" = "#e63946",
    "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"      = "#457b9d"
  ), guide = "none") +
  scale_size_continuous(
    name   = "Proportion of experiments\nwith |r(T,X)| > 0.10",
    range  = c(2, 8),
    labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = N_vals) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    x        = "Sample size (N)",
    y        = NULL,
    title    = "Larger N: Type-1 Error Climbs Faster Than Power Grows — Among Experiments That Fail Orthogonality",
    subtitle = paste0(
      "Scenario B (rho = 0.35, large heterogeneous \u03b2s)  \u00b7  ",
      "Point size = proportion of experiments with at least one |r(T,X)| > 0.10\n",
      "Left: false-positive danger rises with N (invalid experiments)  \u00b7  ",
      "Right: power in valid experiments rises with N (as expected)")) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor  = element_blank(),
    legend.position   = "bottom",
    legend.direction  = "horizontal",
    legend.text       = element_text(size = 8),
    strip.text        = element_text(face = "bold", size = 9.5),
    plot.margin       = margin(5, 20, 5, 5)
  )

Reading the panels: The left panel shows Type-1 error (false-positive rate) conditional on orthogonality having failed — rising from ~6% at N = 50 to 20–30% at N = 800, far above the 5% nominal level. The right panel shows power in the experiments that passed the orthogonality check — rising as expected, because larger N does increase power for valid experiments. The critical asymmetry: point size (proportion of experiments with violations) shrinks as N grows, so the situation looks increasingly safe to a researcher monitoring only the balance check pass rate. Yet the rare failures that remain become dramatically more dangerous. The compounding trap: sample size DOES reduce the frequency of orthogonality violations — but it reduces violations slower than it amplifies the danger of the violations that remain.

Larger N does not license looser balance checking — it demands stricter checking. Tighten your orthogonality criterion (lower r_max) as N grows. Never assume you have achieved actual randomization; random assignment is a procedure, not a guarantee, and its validity must be verified empirically every time.

The practical lesson: correlated dimensions are not uniformly good or bad. The key question is what their β values look like. If correlated dimensions all carry similar, modest effects, correlation genuinely helps by reducing effective checks. If correlated dimensions carry large, heterogeneous effects, correlation is a liability — and that liability is invisible unless you examine conditional bias and error rates, not just P(orthogonality).

This mirrors a core lesson from Module 1: a tight subscale structure looks efficient and reliable, but if the subscales carry separate predictive variance (discriminant invalidity), a failure in one cascades to all. The structural parallel is exact.

Checking the Observables: Randomization Diagnostics

A fundamental limitation — you can only check what you measured

You can only discover failures of randomization on dimensions you thought to measure.

This is the direct experimental parallel to Module 1’s measurement problem. In Module 1, we showed that unmeasured sub-dimensions of Y produce a biased estimate of your latent construct — you cannot diagnose measurement failure on dimensions you never operationalized. The same logic applies here: if a potential confounder was never measured as a covariate, no balance table, no Love plot, and no omnibus test will catch it.

This constraint is not a limitation of statistics — it is a fundamental epistemological constraint. Its implication is that randomization checking is only as good as the covariate measurement you did before the study began. The tools from Module 1 — CFA to identify sub-dimensions of Y, pilot text analysis to map what’s in X and Y (see Part 4), discriminant validity checks — are not just measurement tools. They are pre-registration scaffolding that determines whether your eventual randomization check has the construct coverage to be meaningful.

The practical lesson: Before collecting data, use the construct-mapping tools from Module 1 to enumerate every plausible antecedent of Y. Measure them all. Then, after randomization, check balance on every one of them. The diagnostics below are only valuable if you did that groundwork first.

Even though you cannot directly observe the latent dimensions, you can check the observable proxies you did measure. A balance table checks whether measured characteristics are distributed similarly across conditions — a necessary but not sufficient condition for latent randomization.

Here we simulate a study where randomization partially failed (treatment assignment was inadvertently correlated with one confounder) and show how to detect it.

▶ Simulate study with imperfect randomization

set.seed(42)
N_study <- 200

observed_chars <- data.frame(
  env_values     = rbinom(N_study, 1, 0.50),
  income_high    = rbinom(N_study, 1, 0.45),
  social_norms   = rbinom(N_study, 1, 0.50),
  brand_trust    = rbinom(N_study, 1, 0.55),
  quality_percep = rbinom(N_study, 1, 0.50),
  health_motiv   = rbinom(N_study, 1, 0.40),
  peer_influence = rbinom(N_study, 1, 0.50),
  prior_green    = rbinom(N_study, 1, 0.45),
  price_sensitiv = rbinom(N_study, 1, 0.50),
  emotion_resp   = rbinom(N_study, 1, 0.50)
)

# "Random" assignment — inadvertently correlated with env_values and income
# (e.g., participants self-selected into a study they saw on an eco-forum)
treatment_prob <- 0.35 +
  0.25 * observed_chars$env_values +
  0.15 * observed_chars$income_high
treatment_prob <- pmin(pmax(treatment_prob, 0.05), 0.95)
treatment <- rbinom(N_study, 1, treatment_prob)

study_df <- cbind(observed_chars,
                  treatment = factor(treatment, labels = c("Control", "Eco-label")))

Balance Table

▶ Compute balance table with chi-square tests

balance_results <- lapply(names(observed_chars), function(var) {
  ctrl <- observed_chars[[var]][treatment == 0]
  trt  <- observed_chars[[var]][treatment == 1]
  chi  <- chisq.test(table(observed_chars[[var]], treatment), correct = FALSE)
  data.frame(
    Characteristic  = gsub("_", " ", var) |> tools::toTitleCase(),
    `Control (%)`   = round(mean(ctrl) * 100, 1),
    `Eco-label (%)` = round(mean(trt)  * 100, 1),
    Difference      = round((mean(trt) - mean(ctrl)) * 100, 1),
    p_value         = round(chi$p.value, 3),
    Flag            = ifelse(chi$p.value < 0.05, "\u26a0\ufe0f Imbalanced", "")
  )
}) |> bind_rows()

kable(balance_results,
      col.names = c("Characteristic", "Control (%)", "Eco-label (%)",
                    "Diff (pp)", "p-value (\u03c7\u00b2)", ""),
      caption = "Balance table: observed characteristics by condition")

Balance table: observed characteristics by condition
Characteristic	Control (%)	Eco-label (%)	Diff (pp)	p-value (χ²)
Env Values	42.0	66.1	24.0	0.001	⚠️ Imbalanced
Income High	29.5	42.9	13.3	0.053
Social Norms	43.2	45.5	2.4	0.740
Brand Trust	61.4	53.6	-7.8	0.269
Quality Percep	48.9	43.8	-5.1	0.471
Health Motiv	39.8	35.7	-4.1	0.556
Peer Influence	46.6	54.5	7.9	0.269
Prior Green	53.4	36.6	-16.8	0.017	⚠️ Imbalanced
Price Sensitiv	43.2	50.9	7.7	0.278
Emotion Resp	54.5	60.7	6.2	0.380

Visual Balance Check (Love Plot)

▶ Plot: standardized differences (love plot)

smd_df <- data.frame(
  Characteristic = gsub("_", " ", names(observed_chars)) |> tools::toTitleCase(),
  SMD = sapply(names(observed_chars), function(var) {
    ctrl <- observed_chars[[var]][treatment == 0]
    trt  <- observed_chars[[var]][treatment == 1]
    pooled_sd <- sqrt((var(trt) + var(ctrl)) / 2)
    if (pooled_sd == 0) return(0)
    (mean(trt) - mean(ctrl)) / pooled_sd
  })
) |>
  arrange(desc(abs(SMD))) |>
  mutate(Characteristic = factor(Characteristic, levels = Characteristic),
         flag           = abs(SMD) > 0.10)

ggplot(smd_df, aes(x = SMD, y = Characteristic, color = flag)) +
  geom_vline(xintercept =  0,    color = "gray40", linewidth = 0.8) +
  geom_vline(xintercept =  0.10, linetype = "dashed", color = "#de2d26", linewidth = 0.7) +
  geom_vline(xintercept = -0.10, linetype = "dashed", color = "#de2d26", linewidth = 0.7) +
  geom_segment(aes(x = 0, xend = SMD, y = Characteristic, yend = Characteristic),
               linewidth = 1.0) +
  geom_point(size = 4) +
  scale_color_manual(values = c("FALSE" = "#52b788", "TRUE" = "#de2d26"),
                     labels = c("|SMD| \u2264 0.10", "|SMD| > 0.10"), name = NULL) +
  labs(x = "Standardized Mean Difference (Treatment \u2212 Control)",
       y = NULL,
       title = "Love Plot: Covariate Balance by Condition",
       subtitle = "Dashed lines = |SMD| = 0.10 threshold  \u00b7  Red = potential imbalance") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "top")

Omnibus Randomization Test

For a single omnibus test, we use logistic regression predicting treatment from all characteristics. If randomization succeeded, no characteristic should significantly predict assignment, and the model R² should be near zero.

▶ Omnibus randomization test via logistic regression

treat_int <- as.integer(treatment) - 0L

omnibus_model <- glm(treat_int ~ ., data = cbind(observed_chars, treat_int),
                     family = binomial)

null_ll <- logLik(glm(treat_int ~ 1, data = cbind(observed_chars, treat_int),
                      family = binomial))
full_ll <- logLik(omnibus_model)
n       <- N_study

r2_nag <- (1 - exp(2 * (null_ll - full_ll) / n)) / (1 - exp(2 * null_ll / n))

lrt_p  <- anova(glm(treat_int ~ 1, data = cbind(observed_chars, treat_int),
                    family = binomial),
                omnibus_model, test = "LRT")$`Pr(>Chi)`[2]

cat(sprintf("Omnibus LRT: p = %.4f\n",  lrt_p))

Omnibus LRT: p = 0.0041

▶ Omnibus randomization test via logistic regression

cat(sprintf("Nagelkerke R\u00b2: %.3f\n", r2_nag))

Nagelkerke R²: 0.162

▶ Omnibus randomization test via logistic regression

cat(sprintf("Interpretation: Covariates explain %.1f%% of variance in treatment assignment.\n",
            r2_nag * 100))

Interpretation: Covariates explain 16.2% of variance in treatment assignment.

▶ Omnibus randomization test via logistic regression

cat("Good randomization should produce R\u00b2 \u2248 0 and p well above .05.\n")

Good randomization should produce R² ≈ 0 and p well above .05.

Handling mixed covariate types

In practice, your covariates will rarely all be the same type. You may have binary indicators (gender, prior exposure), ordered categories (education level, income bracket), unordered categories (nationality, recruitment channel), and continuous measures (age, prior attitude score). The omnibus logistic regression approach handles all of these — you just need to encode each type correctly before passing them to glm:

▶ Template: omnibus randomization test for mixed covariate types (not run)

# ── Step 1: Encode each covariate type correctly ──────────────────────────────
# Suppose your covariate data frame is called `covariates` with columns:
#   gender       : binary (0/1 or factor with 2 levels)
#   edu_level    : ordinal (e.g., 1=high school, 2=bachelor, 3=master, 4=PhD)
#   country      : nominal (unordered categories → MUST be factor)
#   age          : continuous
#   prior_att    : continuous (e.g., prior attitude score on 1–7 scale)

covariates$gender    <- factor(covariates$gender)         # binary → factor (1 df)
covariates$edu_level <- as.integer(covariates$edu_level)  # ordinal → numeric (linear trend)
covariates$country   <- factor(covariates$country)        # nominal → factor (K-1 dummies)
# age and prior_att stay as numeric — no conversion needed

# ── Step 2: Omnibus logistic regression ───────────────────────────────────────
# treatment must be a binary 0/1 variable
# glm automatically dummy-codes all factor variables
omnibus_fit  <- glm(treatment ~ ., data = covariates, family = binomial)
null_fit     <- glm(treatment ~ 1, data = covariates, family = binomial)

# ── Step 3: Likelihood-ratio test ─────────────────────────────────────────────
lrt <- anova(null_fit, omnibus_fit, test = "LRT")
cat(sprintf("Omnibus LRT p = %.4f\n", lrt$`Pr(>Chi)`[2]))

# Nagelkerke pseudo-R²
n_obs  <- nrow(covariates)
nag_r2 <- (1 - exp(2 * (logLik(null_fit) - logLik(omnibus_fit)) / n_obs)) /
           (1 - exp(2 * logLik(null_fit) / n_obs))
cat(sprintf("Nagelkerke R² = %.3f\n", nag_r2))

# ── Optional: non-linear ordinal check ────────────────────────────────────────
# If you suspect the ordinal variable has a non-linear relationship with
# treatment assignment, convert it to a factor (dummy-coded) instead:
#   covariates$edu_level <- factor(covariates$edu_level)
# This uses more degrees of freedom but makes no linearity assumption.

Key points:

Binary covariates: encode as factor — produces 1 dummy variable.
Ordinal covariates: encode as integer (assumes linear trend across levels). If the relationship might be non-linear, use factor() instead — this creates K-1 dummies but uses more degrees of freedom.
Nominal (unordered) covariates: must be factor — glm automatically creates K-1 dummy variables (one per category minus the reference level).
Continuous covariates: pass as-is (numeric). No transformation needed.

The omnibus LRT tests whether any covariate — regardless of type — predicts treatment assignment better than chance. If p < .05 or R² > .05, investigate which covariates are driving the imbalance using the individual balance table and Love plot above.

What to look for

Diagnostic	Well-balanced realized sample	Potential realized imbalance
Balance table p-values	All p > .05	Any p < .05
Love plot SMDs	All \|SMD\| < 0.10	Any \|SMD\| > 0.10
Omnibus LRT p-value	p well above .05	p < .05
Nagelkerke R²	R² ≈ 0	R² > 0.05

These diagnostics provide evidence about the realized sample — they are not certification that the randomization procedure did or did not work. A balance table with all p > .05 shows you got a well-balanced draw from your randomization procedure; it does not prove the procedure was valid, nor does a failed balance check prove it was not. The deeper issue is that these checks only evaluate the observable dimensions you measured. The latent dimensions — all the antecedents of Y you did not think to measure — remain unverified regardless of what the balance table says.

When You Can’t Check What You Can’t See: Diagnosing Latent Imbalance

The three diagnostics above — balance table, Love plot, omnibus test — are essential, but they share a fundamental limitation:

You can only check balance on dimensions you thought to measure.

If an unmeasured dimension of Y is correlated with treatment assignment, no balance table will catch it. And as the analysis in Section 3.6 showed, the broader your outcome construct, the more such unmeasured dimensions there are likely to be.

What can you do?

What you can do before data collection

1. Map the construct before running the study. Use Module 1’s toolkit — confirmatory factor analysis, convergent/discriminant validity checks, latent class analysis — on pilot data to identify all the major sub-dimensions of Y. Every dimension you identify is a confounder you can now measure and check balance on.

2. Blocked (stratified) randomization. If you know in advance that certain dimensions of Y are likely to be distributed unevenly in your sample, pre-stratify on them. Assign treatment within blocks defined by those dimensions. This guarantees balance on the dimensions you can observe — reducing the degrees of freedom for latent imbalance to exploit.

3. Collect more observations. As Section 3.6 showed quantitatively, the probability of achieving approximate orthogonality on all D dimensions grows with N. If your outcome is broad (large D), power calculations based on the treatment effect alone dramatically underestimate the required sample.

What you can do after data collection

4. Open-ended text as a nomological net scanner — and an identification strategy.

As developed in detail in Part 4 (What’s in X? What’s in Y?), open-ended text responses from a small pilot are the most powerful tool for mapping the construct space before collecting main-study data. The nomological net scan reveals the exhaustive list of intermediary variables in the causal chain from X to Y. Every concept the text reveals is a mediator you should measure quantitatively — and a confounder you should check balance on.

The key insight from Part 4 bears repeating here: without the complete mediator list, the treatment coefficient in a regression is an uninterpretable mixture of all active pathways. With the list, you gain a structural identification strategy: decompose the total treatment effect into its component channels using the measured mediators.

The complete structural modeling demonstration — quantitative mediator scoring, parallel mediation, SEM with latent factors, and DGP recovery comparison — is developed step by step in Part 4 of this module, where it serves as the practical illustration of how to handle exclusion restriction violations once the nomological net has been mapped.

Researcher Checklist: Exchangeability

Key questions about your randomization

Did you verify exchangeability empirically, not just procedurally? Running a randomization algorithm is the observable. Exchangeability — no pre-existing group differences on any cause of Y — is the latent property you need. Report a balance table; don’t just assert that you randomized.
What is the effective $r_{\max}$ your sample size delivers? With N = 80 and D = 10 antecedents of Y, you are operating at r_max ≈ 0.31. Any confounder with β > 0.5 SD on Y can inject 0.15 SD of OVB — check whether that is acceptable.
Do any imbalanced covariates predict Y? Imbalance is only dangerous when the imbalanced variable also affects the outcome. Estimate the implied OVB: $\hat{\text{bias}} = \hat{\beta}_C \cdot r_{T,C}$.
Is your Y narrow or broad? Broader constructs have more antecedents and require larger samples or blocked randomization to achieve exchangeability. A power calculation based on effect size alone does not account for this.

Part 2: What Does “Random Assignment” Actually Require?

The Measurement Analogy: Observable vs. Latent

A Concrete Anchor: The Playing Cards Analogy

Exchangeability: The Latent Property Random Assignment Is Supposed to Deliver

Case 1 — A Narrow Construct: Visual Fixation Time

Case 2 — A Broad Construct: Sustainable Purchase Intention

Visualizing the 1,024-Cell Matrix

From Exact Coverage to Approximate Orthogonality

What does \(r_{\max}\) actually mean?

A Violation of Exchangeability Is Omitted Variable Bias

Required N for Approximate Orthogonality

Simulation: Correlated Dimensions — Two Countervailing Forces

Two forces pull in opposite directions

When does correlation help vs. hurt?

Simulation

Checking the Observables: Randomization Diagnostics

Balance Table

Visual Balance Check (Love Plot)

Omnibus Randomization Test

Handling mixed covariate types

When You Can’t Check What You Can’t See: Diagnosing Latent Imbalance

What you can do before data collection

What you can do after data collection

Researcher Checklist: Exchangeability