
▶ Load required packages
# install.packages(c("tidyverse","ggplot2","knitr","scales","patchwork"))

library(tidyverse)
library(ggplot2)
library(knitr)
library(scales)
library(patchwork)
if (!requireNamespace("lavaan", quietly = TRUE)) install.packages("lavaan")
library(lavaan)

set.seed(2025)

13 Part 3: What Does “Random Assignment” Actually Require?

Warning: The central message of Part 3

Never assume that random assignment has achieved actual randomization. Random assignment is a procedure — clicking “Randomize” in Qualtrics, running sample() in R. Actual randomization is a property — treatment is orthogonal to all potential confounders in the data-generating process. The procedure does not guarantee the property. Whether it does depends on (1) how complex your outcome variable Y is and (2) how many participants you have. This section develops the gap between the two and equips you to evaluate — not assume — whether randomization succeeded.

13.1 The Measurement Analogy: Observable vs. Latent

In Module 1, we introduced a distinction that runs through all of empirical research:

Example from Module 1
  • Observable (indicator): A participant’s response on a 7-point Likert scale
  • Latent construct: Their true level of Green Purchase Intention

The observable is what you can see. The latent construct is what you actually want to know. All of Module 1 was about the gap between these two things: your observed scale picks up variance from the wrong latent constructs (a discriminant validity failure), from unequal measurement properties across groups (non-invariance), and from subpopulations that behave differently (latent classes).

Exactly the same distinction applies to randomization:

In Randomization
  • Observable (indicator): Pressing “Randomize” in Qualtrics, running sample() in R, using a random number table
  • Latent construct: The property of true independence, i.e., treatment assignment is orthogonal to ALL potential confounders

Just as a 7-point scale can look like it measures Green Purchase Intention while actually reflecting Environmental Concern, a randomization procedure can look random (you clicked the button — the observable is there) while failing to achieve true independence (the latent property).

Important: The construct validity question for randomization

Ask yourself the question from Module 1, applied to your treatment variable:

“Does your observable act of randomization have construct validity as an indicator of latent randomization?”

The answer depends entirely on how complex the outcome variable Y is that you are trying to study. We will now build the intuition for why.

13.2 A Concrete Anchor: The Playing Cards Analogy

Imagine you have a sorted deck of cards — you know exactly what order the cards are in. Before a game, you need to make the deck random. So you shuffle it once.

Did one shuffle make the deck random?

The answer depends on the size of the deck.

  • 5 cards: There are 5! = 120 possible orderings. One riffle shuffle covers a substantial fraction of these. The deck is close to random.
  • 52 cards: There are 52! ≈ 8 × 10⁶⁷ possible orderings. One riffle shuffle covers an astronomically small fraction of these. Significant structure from the original sorted order survives.

This is why casino dealers shuffle seven times: Bayer and Diaconis (1992) showed that seven Gilbert–Shannon–Reeds (GSR) riffle shuffles are sufficient for a 52-card deck to approach uniformity. One shuffle is not enough.

We can visualize this directly. We simulate one through ten riffle shuffles for decks of different sizes and measure how much structure (correlation with the original sorted order) survives.

▶ Simulate riffle shuffles for different deck sizes
one_riffle <- function(deck) {
  n <- length(deck)
  k <- rbinom(1, n, 0.5)
  if (k == 0L || k == n) return(deck)  # degenerate cut: the deck passes through unchanged
  pile_a <- deck[1:k]; pile_b <- deck[(k + 1):n]
  result <- integer(n); idx <- 1L
  while (length(pile_a) > 0L && length(pile_b) > 0L) {
    p_a <- length(pile_a) / (length(pile_a) + length(pile_b))
    if (runif(1) < p_a) { result[idx] <- pile_a[1]; pile_a <- pile_a[-1]
    } else               { result[idx] <- pile_b[1]; pile_b <- pile_b[-1] }
    idx <- idx + 1L
  }
  if (length(pile_a) > 0L) result[idx:n] <- pile_a
  if (length(pile_b) > 0L) result[idx:n] <- pile_b
  result
}

set.seed(42)
shuffle_sim <- expand.grid(deck_size = c(5L, 10L, 20L, 52L),
                           n_shuffles = 1:10, sim = 1:400) |>
  dplyr::mutate(tau = mapply(function(n, k) {
    deck <- 1:n
    for (i in seq_len(k)) deck <- one_riffle(deck)
    cor(1:n, deck, method = "kendall")
  }, deck_size, n_shuffles))

shuffle_summary <- shuffle_sim |>
  group_by(deck_size, n_shuffles) |>
  summarise(mean_tau = mean(tau), lo = quantile(tau, 0.10),
            hi = quantile(tau, 0.90), .groups = "drop") |>
  mutate(deck_label = paste0(deck_size, " cards"))

ggplot(shuffle_summary,
       aes(x = n_shuffles, y = mean_tau,
           color = factor(deck_size), fill = factor(deck_size), group = deck_size)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.14, color = NA) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40", linewidth = 0.8) +
  annotate("text", x = 10.2, y = 0.03, label = "Random\n(\u03c4 \u2248 0)",
           size = 3.5, color = "gray40", hjust = 0) +
  scale_x_continuous(breaks = 1:10) +
  scale_color_manual(values = c("5"  = "#4a90d9", "10" = "#52b788",
                                "20" = "#f4a261", "52" = "#e63946"),
                     name = "Deck size") +
  scale_fill_manual(values  = c("5"  = "#4a90d9", "10" = "#52b788",
                                "20" = "#f4a261", "52" = "#e63946"),
                    name = "Deck size") +
  labs(x = "Number of riffle shuffles",
       y = "Kendall's \u03c4 with original sorted order\n(1 = perfectly sorted  \u00b7  0 = random)",
       title = "How Many Shuffles Are Needed to Achieve Randomness?",
       subtitle = "Ribbon = 10th\u201390th percentile across 400 simulations per deck size") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank())

The analogy to experiments maps directly:

Playing Cards → Research Experiment
Cards in the deck → Participants in the study
Order before shuffling → Values on potential confounders
One shuffle → Clicking “Randomize” once in Qualtrics
“Truly random” deck → Treatment and control groups balanced on ALL confounders
Size of the deck → Complexity (number of dimensions) of the outcome Y

The key takeaway: just as a larger deck needs more shuffles to achieve randomness, a more complex outcome variable Y needs more observations to achieve true independence between treatment and all potential confounders.

13.3 The Exchangeability Assumption: When It Holds and When It Doesn’t

The permutation logic in Part 1 rested on exchangeability: the assignment of participants to conditions could have been any other permutation without changing the distributional properties of the data.

Exchangeability holds when treatment assignment is truly independent of all potential causes of Y. In practice, researchers assume this is achieved by the act of randomization — but you should never make this assumption without empirical verification. Consider the gap:

  • What the researcher observes: Qualtrics assigned participants to conditions using a random number generator. ✓
  • What the researcher needs (latent): Treatment assignment is uncorrelated with all potential causal antecedents of Y — every dimension, every subdimension, every latent mediator in the entire data-generating process. ?

Whether the observable achieves the latent property depends on two things: 1. How many antecedents Y has (the complexity of the construct) 2. How many participants you have (the sample size)
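A few lines of R make this gap tangible. In the sketch below (an illustration of our own, using a single hypothetical binary antecedent of Y), treatment is assigned by a perfectly executed random procedure, yet a chance correlation with the confounder always remains, shrinking only at the rate 1/√N:

```r
# Sketch (our own illustration): even a flawless random-assignment procedure
# leaves a chance correlation with any one antecedent of Y; the average
# absolute imbalance shrinks only at the rate 1 / sqrt(N).
set.seed(1)

mean_abs_imbalance <- function(N, n_sim = 2000) {
  mean(replicate(n_sim, {
    confounder <- rbinom(N, 1, 0.5)  # hypothetical binary antecedent of Y
    treatment  <- rbinom(N, 1, 0.5)  # perfectly random assignment
    abs(cor(confounder, treatment))
  }))
}

sapply(c(20, 80, 320, 1280), mean_abs_imbalance)
```

Each fourfold increase in N roughly halves the average chance imbalance, the same 1/√N standard error that Section 13.6 turns into a sample-size formula.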


13.4 Case 1 — A Narrow Construct: Visual Fixation Time

Suppose your outcome variable Y is total gaze duration (in milliseconds) on an eco-label during a product display, measured via eye-tracking. This is a narrow, physiological measure driven by a small number of simple perceptual factors.

Think of dimensions as all the hidden moderators in the data-generating process — every property of the participant or their environment that makes them more or less likely to fixate on the label, regardless of the experimental manipulation. These are the variables that, if correlated with treatment assignment by chance, would confound your estimate of the treatment effect. In the entire data-generating process for visual fixation, suppose there are only 2 such binary dimensions:

Dimension Low (0) High (1)
Screen luminance Dim display Bright display
Viewing distance Far from screen Close to screen

These are simple physical factors. They are not set by the researcher and could vary unsystematically across participants assigned to each condition.

Because both dimensions are binary, the complete “design space” of all possible hidden-moderator combinations is a 2 × 2 matrix of 4 cells:

▶ Visualize: 2×2 cell matrix for narrow construct
cells <- expand.grid(
  Luminance = c("Dim display\n(Low)", "Bright display\n(High)"),
  Distance  = c("Far from screen\n(Low)", "Close to screen\n(High)")
)
cells$label <- c("Cell A", "Cell B", "Cell C", "Cell D")

ggplot(cells, aes(x = Luminance, y = Distance)) +
  geom_tile(fill = "#d6ecff", color = "#4a90d9", linewidth = 1.2,
            width = 0.92, height = 0.92) +
  geom_text(aes(label = label), size = 5.5, fontface = "bold", color = "#1a3a6b") +
  geom_text(aes(label = "n \u2265 1"), size = 3.8, color = "#555", vjust = 3.2) +
  labs(title = "Design Space for Visual Fixation (Narrow Construct)",
       subtitle = "2 binary hidden moderators \u2192 2\u00b2 = 4 cells  \u00b7  Minimum N to populate matrix = 4",
       x = "Screen Luminance", y = "Viewing Distance") +
  theme_minimal(base_size = 13) +
  theme(panel.grid = element_blank(), axis.text = element_text(size = 10))

Note: Why 4 cells?

With k binary dimensions in the data-generating process, the total number of unique hidden-moderator combinations is 2^k. Each cell represents one possible “type” of participant in terms of everything that drives Y. For randomization to succeed, treatment and control groups must be approximately balanced across all 2^k cells — the smaller k is, the easier this is to achieve.

For k = 2: 2² = 4 cells. A minimal experiment of N = 4 would cover the space; N = 40 gives 10 per cell with very reliable balance.

With only 4 meaningful combinations in the design space, even a modest experiment (e.g., N = 40, 10 per cell) will very likely achieve approximate balance between treatment and control. One act of randomization is almost certainly sufficient.
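A short simulation (our own sketch, reusing the two moderators from the table above) shows what approximate balance looks like at N = 40:

```r
# Sketch: k = 2 hypothetical binary moderators define 2^2 = 4 cells.
# With N = 40 and a single act of randomization, both conditions
# typically land in every cell.
set.seed(7)
n <- 40
luminance <- rbinom(n, 1, 0.5)               # 0 = dim display, 1 = bright display
distance  <- rbinom(n, 1, 0.5)               # 0 = far from screen, 1 = close
treatment <- sample(rep(0:1, each = n / 2))  # 20 treated, 20 control

tab <- table(cell = interaction(luminance, distance, sep = "-"),
             treatment = treatment)
tab
```

The counts will not be exactly equal, but with only four cells the chance imbalances stay modest, which is the sense in which one act of randomization suffices here.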


13.5 Case 2 — A Broad Construct: Sustainable Purchase Intention

Now suppose your outcome Y is sustainable product purchase intention — the kind of broad, latent, multi-faceted attitude that appears throughout consumer and organizational research, and that you saw in Module 1 struggling to maintain discriminant validity across measurement attempts.

In the data-generating process, suppose there are 10 binary determinants of this construct:

▶ Table: 10 binary dimensions of sustainable purchase intention
dims_table <- data.frame(
  `#`       = 1:10,
  Dimension = c("Environmental values", "Household income", "Social norm salience",
                "Brand trust", "Perceived product quality", "Health motivation",
                "Peer influence", "Prior green behavior",
                "Price sensitivity", "Emotional response to sustainability messaging"),
  Low       = c("Weak", "Low income", "Low salience", "Low trust",
                "Low quality perception", "Low", "Weak influence",
                "Little prior behavior", "High sensitivity", "Low emotional response"),
  High      = c("Strong", "High income", "High salience", "High trust",
                "High quality perception", "High", "Strong influence",
                "Extensive prior behavior", "Low sensitivity", "High emotional response"),
  check.names = FALSE
)
kable(dims_table,
      col.names = c("#", "Dimension", "Level: Low (0)", "Level: High (1)"),
      caption   = "10 binary antecedents of Sustainable Purchase Intention")
10 binary antecedents of Sustainable Purchase Intention
# Dimension Level: Low (0) Level: High (1)
1 Environmental values Weak Strong
2 Household income Low income High income
3 Social norm salience Low salience High salience
4 Brand trust Low trust High trust
5 Perceived product quality Low quality perception High quality perception
6 Health motivation Low High
7 Peer influence Weak influence Strong influence
8 Prior green behavior Little prior behavior Extensive prior behavior
9 Price sensitivity High sensitivity Low sensitivity
10 Emotional response to sustainability messaging Low emotional response High emotional response

With k = 10 binary dimensions, the total number of unique participant profiles in the design space is 2^10 = 1,024 cells. Notice how quickly this grows compared to the narrow construct:

▶ Plot: combinatorial growth of design space
expl_df <- data.frame(k = 1:15, cells = 2^(1:15)) |>
  mutate(label = ifelse(k == 2,  "Visual\nFixation\n(k = 2)",
                 ifelse(k == 10, "Sustainable\nPurchase\nIntention\n(k = 10)", NA)))

ggplot(expl_df, aes(x = k, y = cells)) +
  geom_line(color = "#4a90d9", linewidth = 1.4) +
  geom_point(size = 3, color = "#4a90d9") +
  geom_point(data = filter(expl_df, k %in% c(2, 10)),
             size = 5, shape = 21, fill = "white", color = "#d7301f", stroke = 2) +
  geom_text(data = filter(expl_df, k %in% c(2, 10)),
            aes(label = label),
            nudge_x = 0.9, nudge_y = 0, hjust = 0, size = 3.2,
            color = "#d7301f", lineheight = 0.9) +
  geom_vline(xintercept = c(2, 10), linetype = "dotted", color = "#d7301f", linewidth = 0.7) +
  scale_y_log10(labels = label_comma(),
                breaks  = c(4, 32, 256, 1024, 8192, 32768)) +
  scale_x_continuous(breaks = c(1, 2, 5, 10, 15)) +
  labs(x = "Number of binary dimensions (k)",
       y = "Design space: number of unique profiles (log scale)",
       title = "How Quickly Does the Design Space Grow?",
       subtitle = "Each additional binary dimension doubles the number of distinct participant profiles (2\u1d4f)") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank())

13.5.1 Visualizing the 1,024-Cell Matrix

With 10 binary dimensions, the full design matrix has 1,024 rows. We can visualize this as a heatmap where each column is a unique participant profile and each row is one dimension:

▶ Plot: all 1,024 unique profiles in the 10-dimension design space
all_profiles <- as.data.frame(
  do.call(expand.grid, replicate(10, 0:1, simplify = FALSE)))
colnames(all_profiles) <- c("Env. Values", "Income", "Social Norms",
                             "Brand Trust", "Perc. Quality", "Health Motiv.",
                             "Peer Influence", "Prior Behavior",
                             "Price Sensitiv.", "Emotional Resp.")
all_profiles$profile_id <- seq_len(nrow(all_profiles))

plot_df_m <- all_profiles |>
  pivot_longer(-profile_id, names_to = "Dimension", values_to = "Level") |>
  mutate(Dimension = factor(Dimension, levels = rev(colnames(all_profiles)[-11])))

ggplot(plot_df_m, aes(x = profile_id, y = Dimension, fill = factor(Level))) +
  geom_raster() +
  scale_fill_manual(values = c("0" = "#d6ecff", "1" = "#1a3a6b"),
                    labels = c("Low (0)", "High (1)"), name = "Level") +
  labs(x = "Profile (1 of 1,024 unique combinations)", y = NULL,
       title = "The Full Design Space for Sustainable Purchase Intention",
       subtitle = "Each column = one participant profile  \u00b7  Each row = one binary dimension  \u00b7  1,024 cells total") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid = element_blank(), legend.position = "top")

To exactly fill this matrix — one observation per cell — you would need N = 1,024 participants. That is just to have a single observation per cell, with no statistical power to spare.


13.6 From Exact Coverage to Approximate Orthogonality

Here is the key insight: you do not need to fill every cell exactly.

In a randomized experiment, what you need is not exact coverage, but approximate independence — the correlation between treatment assignment and each of the 10 dimensions should be close to zero.

How many observations do you need for that? With random assignment and N participants, the expected correlation between treatment T and any single confounder C is zero. The actual correlation in any given sample will deviate from zero by chance, with standard error ≈ 1 / √N. If you want ALL D correlations to stay below some threshold \(|r_{\max}|\), you must account for D simultaneous comparisons (Bonferroni):

\[N \geq \left(\frac{z_{1-\alpha/(2D)}}{r_{\max}}\right)^2\]
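As a worked check, take the broad construct from the previous section with D = 10 antecedents and a tolerance of \(r_{\max} = 0.10\) at \(\alpha = .05\):

\[z_{1-.05/(2 \cdot 10)} = z_{.9975} \approx 2.81, \qquad N \geq \left(\frac{2.81}{0.10}\right)^2 \approx 788.\]

This matches the D = 10 row of the table in Section 13.6.2.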

13.6.1 What does \(r_{\max}\) actually mean?

Before looking at the numbers, it is worth anchoring \(r_{\max}\) concretely. If your treatment T accidentally correlates r with omitted confounder C, and C has effect β (in SD units of Y — this is identical to Cohen’s d when Y is standardized) on the outcome, then the omitted variable bias in your estimated treatment effect is approximately β × r.

A quick reminder from the power table above: Cohen’s d expresses the treatment effect in the same SD-unit currency. An effect of d = 0.40 means the group means differ by 0.40 standard deviations of Y. The β values below use that same ruler — so a β of 0.50 means the confounder shifts Y by half a standard deviation.

▶ Plot: how r_max tolerance maps to omitted-variable bias
rmax_vals <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60)
beta_vals <- seq(0.1, 1.5, by = 0.1)

bias_df <- expand.grid(r_max = rmax_vals, beta = beta_vals) |>
  mutate(bias   = r_max * beta,
         r_label = sprintf("r = %.2f", r_max))

# Cohen's d reference lines
cohen_refs <- data.frame(
  d     = c(0.20, 0.50, 0.80),
  label = c("Small effect\n(d = 0.20)", "Medium effect\n(d = 0.50)", "Large effect\n(d = 0.80)")
)

ggplot(bias_df, aes(x = beta, y = bias,
                    color = factor(r_max), group = factor(r_max))) +
  geom_hline(data = cohen_refs, aes(yintercept = d),
             linetype = "dotted", color = "gray60", linewidth = 0.6) +
  geom_text(data = cohen_refs, aes(x = 1.52, y = d + 0.02, label = label),
            inherit.aes = FALSE, color = "gray50", size = 2.8, hjust = 0, lineheight = 0.85) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 2.5) +
  scale_color_manual(
    values = c("0.05" = "#08519c", "0.1" = "#4a90d9", "0.2" = "#52b788",
               "0.3"  = "#f4a261", "0.4" = "#e07b39", "0.6" = "#e63946"),
    labels = sprintf("r_max = %.2f", rmax_vals),
    name   = "Correlation\ntolerance (r_max)"
  ) +
  scale_x_continuous(breaks = seq(0.2, 1.4, 0.2),
                     labels = function(x) sprintf("\u03b2 = %.1f SD", x)) +
  scale_y_continuous(labels = function(x) sprintf("%.2f SD", x),
                     limits = c(0, 0.95)) +
  coord_cartesian(clip = "off") +
  labs(x = "Effect of omitted confounder on Y (\u03b2 = Cohen's d equivalent; SD units)",
       y = "Omitted-variable bias in your treatment estimate",
       title = "How Much Bias Does Residual Imbalance Create?",
       subtitle = paste0("Bias \u2248 r_max \u00d7 \u03b2  \u00b7  Dotted lines = small/medium/large effect benchmarks\n",
                         "Lower r_max = less bias once achieved, but requires more N to verify")) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "bottom",
        legend.direction = "horizontal",
        plot.margin = margin(5, 50, 5, 5))

In the coffee WTP context (SD ≈ $1.50):

  • r = 0.05 with β = 0.8 SD → bias ≈ 0.04 SD ≈ $0.06 — negligible
  • r = 0.10 with β = 0.8 SD → bias ≈ 0.08 SD ≈ $0.12 — small but detectable
  • r = 0.20 with β = 0.8 SD → bias ≈ 0.16 SD ≈ $0.24 — now as large as a small true effect
  • r = 0.40 with β = 1.0 SD → bias ≈ 0.40 SD ≈ $0.60 — larger than many published effects
  • r = 0.60 with β = 1.0 SD → bias ≈ 0.60 SD ≈ $0.90 — could reverse the sign of your treatment estimate entirely

13.6.2 Required N for Approximate Orthogonality

▶ Calculate: N required for approximate orthogonality
required_N <- function(D, r_thresh = 0.10, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / (2 * D))
  ceiling((z_crit / r_thresh)^2)
}

dims_check   <- c(1, 2, 3, 5, 8, 10, 15, 20)
thresh_check <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60)

ortho_table <- expand.grid(D = dims_check, r_max = thresh_check) |>
  mutate(N_req = mapply(required_N, D, r_max),
         exact_cells = 2^D) |>
  pivot_wider(names_from = r_max, values_from = N_req,
              names_prefix = "r_max = ") |>
  rename(`Dimensions (D)` = D, `Exact cells (2^D)` = exact_cells)

kable(ortho_table,
      caption = "Minimum N for approximate orthogonality at \u03b1 = .05 (Bonferroni-corrected)",
      format.args = list(big.mark = ","))
Minimum N for approximate orthogonality at α = .05 (Bonferroni-corrected)
Dimensions (D) Exact cells (2^D) r_max = 0.05 r_max = 0.1 r_max = 0.2 r_max = 0.3 r_max = 0.4 r_max = 0.6
1 2 1,537 385 97 43 25 11
2 4 2,010 503 126 56 32 14
3 8 2,293 574 144 64 36 16
5 32 2,654 664 166 74 42 19
8 256 2,991 748 187 84 47 21
10 1,024 3,152 788 197 88 50 22
15 32,768 3,447 862 216 96 54 24
20 1,048,576 3,657 915 229 102 58 26
▶ Plot: N required vs. dimensions for different tolerances
ortho_plot_df <- expand.grid(D = 1:20, r_max = thresh_check) |>
  mutate(N_required = mapply(required_N, D, r_max),
         exact_N    = 2^D,
         r_label    = paste0("r_max = ", r_max))

ggplot(ortho_plot_df, aes(x = D)) +
  geom_line(aes(y = N_required, color = r_label, group = r_label), linewidth = 1.2) +
  geom_line(aes(y = exact_N), color = "black", linetype = "dashed", linewidth = 1.0) +
  geom_vline(xintercept = c(2, 10), linetype = "dotted", color = "gray50") +
  annotate("text", x = 2.3, y = 40000, label = "Visual\nFixation\n(k=2)",
           size = 2.8, color = "gray40", hjust = 0) +
  annotate("text", x = 10.3, y = 40000, label = "Sust. Purchase\nIntention\n(k=10)",
           size = 2.8, color = "gray40", hjust = 0) +
  annotate("text", x = 14.5, y = 2^14 * 1.8, label = "Exact fill\n(2^D cells)",
           size = 2.8, color = "black") +
  scale_y_log10(labels = label_comma()) +
  scale_x_continuous(breaks = seq(1, 20, 2), limits = c(1, 21)) +
  scale_color_manual(
    values = c("r_max = 0.05" = "#08519c", "r_max = 0.1"  = "#4a90d9",
               "r_max = 0.2"  = "#52b788", "r_max = 0.3"  = "#f4a261",
               "r_max = 0.4"  = "#e07b39", "r_max = 0.6"  = "#e63946"),
    name = "Correlation\ntolerance"
  ) +
  coord_cartesian(clip = "off") +
  labs(x = "Number of binary dimensions of Y (D)",
       y = "Observations required (log scale)",
       title = "How Many Observations for Approximate Orthogonality?",
       subtitle = "Dashed = exact fill (2^D)  \u00b7  Colored lines = Bonferroni-corrected independence threshold") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "right",
        plot.margin = margin(5, 30, 5, 5))

Note (reading the table): Why does r_max = 0.05 at D = 10 require more N than the Exact cells column?

At first glance, a row like D = 10 looks paradoxical: the Exact cells (2^D) = 1,024 column is smaller than the r_max = 0.05 column (≈ 3,152). Doesn’t that mean structured assignment needs fewer people than random assignment — even while achieving perfect orthogonality?

Yes — and that is exactly right, for a subtle reason. The two columns are measuring different things:

  • Exact cells (2^D): This is the total number of cells in a fully structured full-factorial design — every possible combination of D binary dimensions gets exactly one observation. By construction, all D factors are perfectly balanced and mutually uncorrelated (r = 0 exactly). No statistical test is needed; orthogonality is guaranteed by design. The 1,024 figure is simply the number of cells, not a sample size adequate for inference.

  • r_max columns: These are the N required to verify that a random, unstructured assignment achieved approximate orthogonality at the stated tolerance, using a Bonferroni-corrected independence test. With random assignment, there is no structural guarantee — you must check empirically, and the check requires enough power to detect any correlation exceeding r_max.

The full factorial is more efficient per cell because its structure eliminates random correlation entirely. But that efficiency evaporates quickly in practice: with only 1 observation per cell you cannot estimate variance or make inferential statements. A realistic full factorial at D = 10 needs at least 10–30 observations per cell, putting the total at 10,240–30,720 — far more than the 3,152 needed under random assignment with r_max = 0.05.

In short: the exact-cells column tells you the minimum to fill the design; the r_max columns tell you the minimum to trust random assignment. At moderate D, random assignment wins on cost; at small D and large cells, full factorial wins on precision.

Warning: The key tradeoff — and why most experiments are implicitly operating at high \(r_{\max}\)

The table above reveals an uncomfortable tension that sits at the heart of experimental practice:

Lower \(r_{\max}\) = a stricter standard = less confounding once achieved = better causal inference.

But lower \(r_{\max}\) also requires far more observations to reliably verify. N scales as \(1/r_{\max}^2\), so halving your tolerance quadruples the required sample.
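To see the quadrupling concretely, here is the required_N() helper from the chunk above, restated so the snippet runs on its own:

```r
# Halving the tolerance r_max quadruples the required N (N scales as 1/r_max^2).
# Same helper as in the "Calculate" chunk above, restated for self-containment.
required_N <- function(D, r_thresh = 0.10, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / (2 * D))
  ceiling((z_crit / r_thresh)^2)
}

required_N(D = 10, r_thresh = 0.20)  # 197
required_N(D = 10, r_thresh = 0.10)  # 788, roughly four times as many
```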

Important: The core punchline

This is the central tension in designing experiments for complex constructs. The researcher faces a dilemma:

  • Set a strict \(r_{\max}\) (e.g., 0.05–0.10): Your causal inferences will be clean — but you need sample sizes that few lab studies can afford.
  • Accept a lenient \(r_{\max}\) (e.g., 0.30–0.40): Your sample size is achievable — but residual confounding may be large enough to bias your estimates substantially, especially when individual confounders have strong effects (β ≈ 0.5–1.0 SD) on Y.

A typical psychology or marketing experiment with N = 60–120 participants is, at best, achieving \(r_{\max} \approx 0.30\text{–}0.40\) for a construct with D = 5–10 dimensions — regardless of whether Qualtrics randomized. Researchers who report a balance table with N = 80 and declare “randomization succeeded” are not wrong about the observable; they are implicitly operating at a very lenient latent \(r_{\max}\), where substantial confounding remains undetected.

In Module 1’s language: their randomization observable has low construct validity as an indicator of latent randomization, because the gap between the observable and the latent property is too wide to close at typical sample sizes.

With a narrow outcome like Visual Fixation Time (D = 2 dimensions):

  • To ensure \(|r| < 0.10\) with 95% confidence: N ≈ 503 — feasible
  • To ensure \(|r| < 0.20\): N ≈ 126 — well within reach of a moderately sized study; a study with N = 80 operates at an effective \(r_{\max} \approx 0.25\), just short of that stricter threshold

With a broad outcome like Sustainable Purchase Intention (D = 10 dimensions):

  • To ensure \(|r| < 0.10\): N ≈ 788 — rarely achieved in practice
  • To ensure \(|r| < 0.20\): N ≈ 197 — still demanding for most lab studies
  • A study with N = 200 barely meets the \(r_{\max} < 0.20\) threshold (required N = 197). A typical lab study with N = 80, however, is operating at an effective \(r_{\max} \approx 0.31\) — where a single strong confounder (β ≈ 0.8 SD) can inject bias of 0.24 SD or more into your treatment estimate

This is the direct experimental parallel to Module 1’s core finding about discriminant validity. A scale with many correlated sub-dimensions is hard to measure cleanly; a construct with many latent antecedents is hard to randomize cleanly. Both problems grow with construct breadth, and both require you to map the latent space before claiming clean measurement or clean identification.


13.7 Simulation: Correlated Dimensions — Two Countervailing Forces

The analysis so far assumed that the 10 latent dimensions of Y are independent of one another in the data-generating process. But is that realistic?

Think about the constructs driving Sustainable Purchase Intention. Environmental values, health motivation, social norm salience, and peer influence tend to cluster together — the “conscious consumer” profile. Similarly, price sensitivity, income, and habitual spending patterns cluster together — the “pragmatic buyer” profile. The latent dimensions are correlated, not orthogonal.

13.7.1 Two forces pull in opposite directions

Before running any simulation, consider the logic:

Force 1 — fewer independent checks (sometimes helps). If two dimensions are highly correlated (say, environmental values and social norm salience move together), then for the purpose of counting how many independent things can go wrong, you effectively have fewer than 10 independent dimensions. Fewer independent checks = lower probability that any one of them exceeds the threshold by chance = slightly easier to achieve orthogonality for a given N.

Force 2 — cascade failures (always hurts). The flip side is severe. When randomization does fail on one correlated dimension, it fails on all the correlated ones simultaneously. Imbalance on “environmental values” drags along imbalance on “social norms,” “health motivation,” and “peer influence” all at once. Each of these has its own independent β effect on Y, so the bias compounds. Instead of one confounder contributing β × r bias, you get four correlated confounders each contributing, with their effects accumulating.
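The cascade can be demonstrated directly. In this sketch of our own (ρ = 0.8 and N = 100 are illustrative choices, not values from the scenarios below), two correlated binary confounders are drawn, and we compare imbalance on the second dimension overall versus in the runs where the first dimension already failed an |r| < 0.10 check:

```r
# Sketch: two binary confounders built from normals correlated at rho = 0.8.
# When treatment is imbalanced on dimension 1 by chance, dimension 2
# tends to be imbalanced in the same runs -- the cascade failure.
set.seed(3)
n_sim <- 4000; N <- 100; rho <- 0.8

res <- t(replicate(n_sim, {
  z1 <- rnorm(N)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(N)
  x1 <- as.integer(z1 > 0)          # e.g., environmental values
  x2 <- as.integer(z2 > 0)          # e.g., social norm salience
  tr <- rbinom(N, 1, 0.5)           # random treatment assignment
  c(r1 = cor(x1, tr), r2 = cor(x2, tr))
}))

failed <- abs(res[, "r1"]) > 0.10   # runs where dimension 1 is imbalanced
vals <- c(overall  = mean(abs(res[, "r2"])),
          cascaded = mean(abs(res[, "r2"])[failed]))
vals
```

The conditional imbalance is noticeably larger than the unconditional one: failing the check on one dimension is evidence of failure on everything correlated with it.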

13.7.2 When does correlation help vs. hurt?

The key variable is why the dimensions are correlated:

  • Helpful (dimensions cluster around a few latent “super-factors”; e.g., all “values-based” dimensions move together as one bloc): fewer effective checks, so orthogonality is easier to achieve at smaller N
  • Hurtful (dimensions are correlated and each carries independent variance in Y; e.g., environmental values, health focus, and social norms each have separate β effects on WTP): cascade failures — one imbalanced assignment fails simultaneously on three dimensions, with additive bias
  • Worst case (high correlation and strong independent effects on Y: the dimensions are correlated with each other AND each has a large β): one randomization failure contaminates every correlated dimension, compounding bias substantially

The concrete intuition: imagine you are a researcher measuring balance, and you check environmental values and find it balanced (r ≈ 0). If that dimension was highly correlated with health focus and social norms in your sample, you might infer those are balanced too — and you might be right. That is the helpful version. But if each of those three dimensions has an independent effect on WTP that wasn’t captured by environmental values alone, then any failure of balance cascades to all three simultaneously — that is the hurtful version.

This maps directly onto Module 1’s concept of correlated subscales. When subscales are highly correlated, you gain efficiency in measurement (fewer items needed to capture the same information). But if they are discriminantly invalid — correlated and capturing separate variance — then mismeasuring one means mismeasuring all.

13.7.3 Simulation

We run two contrasting scenarios, each featuring correlated dimensions but differing in whether the failure mode is primarily helpful or hurtful:

  • Scenario A — “Helpful” correlation (ρ = 0.75, uniform small β): Dimensions are very strongly correlated — effectively 10 measured dimensions but only ~2–3 independent blocs. Each dimension has a similar, small β effect on Y (β = 0.15). When the correlation structure is this tight, the effective number of independent checks drops sharply, making orthogonality easier to achieve.

  • Scenario B — “Hurtful” correlation (ρ = 0.35, heterogeneous large β): Dimensions have moderate correlation but each carries a meaningfully different and large independent effect on Y (ranging from β = 0.05 to β = 1.50). When randomization fails on one dimension, several correlated ones fail simultaneously — and because each carries its own large β, the cumulative bias explodes.
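Scenario A’s claim that strong correlation collapses the effective number of checks can be made concrete with a small eigenvalue sketch. The participation-ratio summary below is one convention for counting effective dimensions (our illustration, not a formula used by the simulation that follows):

```r
# Effective number of independent dimensions in an equicorrelated block,
# summarized by the participation ratio of the eigenvalue spectrum.
# (One convention among several; other effective-dimension measures exist.)
eff_dims <- function(rho, D) {
  Sigma <- matrix(rho, D, D); diag(Sigma) <- 1
  ev <- eigen(Sigma, only.values = TRUE)$values
  sum(ev)^2 / sum(ev^2)
}

round(c(indep = eff_dims(0.00, 10),
        B     = eff_dims(0.35, 10),
        A     = eff_dims(0.75, 10)), 2)
#> indep     B     A
#> 10.00  4.76  1.65
```

Under this summary, ρ = 0.75 leaves fewer than two effective checks out of ten, while ρ = 0.35 retains roughly five, which is why Scenario A reaches orthogonality so much more easily than Scenario B.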

▶ Simulate: three DGP scenarios + Type-1 error rate (may take ~2 min)
# ── Helper: correlated binary covariates via Cholesky ────────────────────────
corr_binary <- function(N, D, rho) {
  Sigma <- matrix(rho, D, D); diag(Sigma) <- 1
  L     <- chol(Sigma)
  z     <- matrix(rnorm(N * D), nrow = N) %*% L
  (z > 0L) * 1L
}

# ── Simulation: P(max |r| < threshold) ───────────────────────────────────────
sim_orthog <- function(N, D, rho = 0, threshold = 0.10, n_sim = 1500) {
  mean(replicate(n_sim, {
    X        <- if (rho == 0) matrix(rbinom(N * D, 1, 0.5), nrow = N)
                else          corr_binary(N, D, rho)
    T_assign <- rbinom(N, 1, 0.5)
    cors     <- apply(X, 2, function(col) abs(cor(col, T_assign)))
    max(cors) < threshold
  }))
}

# ── Simulation: cumulative bias when any failure occurs ──────────────────────
sim_bias <- function(N = 200, D = 10, rho = 0, betas, threshold = 0.10, n_sim = 1000) {
  replicate(n_sim, {
    X        <- if (rho == 0) matrix(rbinom(N * D, 1, 0.5), nrow = N)
                else          corr_binary(N, D, rho)
    T_assign <- rbinom(N, 1, 0.5)
    cors     <- apply(X, 2, function(col) cor(col, T_assign))
    sum(abs(cors[abs(cors) > threshold]) * betas[abs(cors) > threshold])
  })
}

# ── Simulation: Type-1 and Type-2 error rates, conditional on orthogonality failing ──
# The key insight: under PERFECT randomization, Type-1 is always ~5% regardless of betas.
# The interesting question is: given your experiment FAILED the orthogonality check
# (as many real experiments do), how bad are your error rates?
# Large betas (Scenario B) turn small residual imbalance into large bias → inflated Type-1.
# Tiny betas (Scenario A) produce negligible bias even when imbalance exists → Type-1 near 5%.
sim_error_rates <- function(N, D, rho, betas, r_max = 0.10,
                            true_eff = 0.40, n_sim = 2000) {
  results <- t(replicate(n_sim, {
    X        <- if (rho == 0) matrix(rbinom(N * D, 1, 0.5), nrow = N)
                else          corr_binary(N, D, rho)
    T_assign <- rbinom(N, 1, 0.5)
    cors     <- apply(X, 2, function(x) abs(cor(x, T_assign)))
    orth_ok  <- max(cors) <= r_max          # did this experiment pass the balance check?
    Y_null   <- as.vector(X %*% betas) + rnorm(N, 0, 1)
    p_null   <- t.test(Y_null[T_assign == 1], Y_null[T_assign == 0])$p.value
    Y_alt    <- true_eff * T_assign + as.vector(X %*% betas) + rnorm(N, 0, 1)
    p_alt    <- t.test(Y_alt[T_assign == 1], Y_alt[T_assign == 0])$p.value
    c(t1 = as.integer(p_null < 0.05), t2 = as.integer(p_alt > 0.05),
      orth_ok = as.integer(orth_ok))
  }))
  failed <- results[, "orth_ok"] == 0L
  passed <- results[, "orth_ok"] == 1L
  # Type-1: false-positive rate CONDITIONAL ON orthogonality FAILING (violations present)
  t1_cond  <- if (sum(failed) > 10) mean(results[failed, "t1"]) else NA_real_
  # Type-2: miss rate CONDITIONAL ON orthogonality PASSING (no violations; p-values are valid here)
  t2_valid <- if (sum(passed) > 10) mean(results[passed, "t2"]) else NA_real_
  data.frame(t1_rate = t1_cond, t2_rate = t2_valid,
             pct_fail = mean(failed))
}

set.seed(42)

N_vals <- c(50, 100, 200, 400, 800)
D      <- 10

# Scenario betas — made more extreme to amplify the contrast
betas_helpful <- rep(0.15, D)                    # uniform tiny betas
betas_hurtful <- c(1.50, 1.20, 1.00, 0.90, 0.70,
                   0.50, 0.30, 0.20, 0.10, 0.05) # large heterogeneous betas

# P(orthogonal) across N, for three DGP types
prob_df <- bind_rows(
  data.frame(N   = N_vals, DGP = "Independent (rho = 0)",
             prob = sapply(N_vals, sim_orthog, D = D, rho = 0.00)),
  data.frame(N   = N_vals, DGP = "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)",
             prob = sapply(N_vals, sim_orthog, D = D, rho = 0.75)),
  data.frame(N   = N_vals, DGP = "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)",
             prob = sapply(N_vals, sim_orthog, D = D, rho = 0.35))
)

scen_colors <- c(
  "Independent (rho = 0)"                                  = "#4a90d9",
  "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)" = "#52b788",
  "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)"  = "#e63946"
)

p_prob <- ggplot(prob_df, aes(x = N, y = prob, color = DGP, group = DGP)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3.5) +
  geom_hline(yintercept = 0.95, linetype = "dashed", color = "gray40") +
  annotate("text", x = 820, y = 0.957, label = "95% target",
           size = 3.0, color = "gray40", hjust = 0) +
  scale_x_continuous(breaks = N_vals) +
  scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_color_manual(values = scen_colors, name = NULL) +
  labs(x = "Sample size (N)",
       y = "P(max |r| < 0.10 across all 10 dimensions)",
       title = "Probability of Achieving Orthogonality (D = 10)",
       subtitle = paste0("Scenario A (strong clustering, tiny betas): easier — correlation reduces effective checks\n",
                         "Scenario B (moderate correlation, large varied betas): hardest — and most dangerous when it fails")) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "top",
        legend.text = element_text(size = 9))

# ── Bias when randomization fails ─────────────────────────────────────────────
set.seed(42)
bias_indep   <- sim_bias(N = 200, D = D, rho = 0.00, betas = betas_hurtful)
bias_helpful <- sim_bias(N = 200, D = D, rho = 0.75, betas = betas_helpful)
bias_hurtful <- sim_bias(N = 200, D = D, rho = 0.35, betas = betas_hurtful)

bias_df <- bind_rows(
  data.frame(bias = bias_indep,   DGP = "Independent (rho = 0)"),
  data.frame(bias = bias_helpful, DGP = "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)"),
  data.frame(bias = bias_hurtful, DGP = "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)")
) |>
  mutate(DGP = factor(DGP, levels = c("Independent (rho = 0)",
                                       "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)",
                                       "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)")))

p_bias <- ggplot(bias_df |> filter(bias > 0),
                 aes(x = bias, fill = DGP)) +
  geom_histogram(bins = 35, alpha = 0.80, color = "white", linewidth = 0.2) +
  geom_vline(data = bias_df |> filter(bias > 0) |>
               group_by(DGP) |> summarise(med = median(bias), .groups = "drop"),
             aes(xintercept = med, color = DGP), linewidth = 1.1, linetype = "dashed") +
  facet_wrap(~DGP, nrow = 3, scales = "free") +
  scale_fill_manual(values = scen_colors, guide = "none") +
  scale_color_manual(values = scen_colors, guide = "none") +
  scale_x_continuous(labels = function(x) sprintf("%.2f SD", x)) +
  labs(x = "Cumulative OVB when randomization fails (SD units of Y)",
       y = "Frequency across 1,000 experiments",
       title = "When Randomization Fails: How Large Is the Bias? (N = 200)",
       subtitle = "Dashed = median cumulative bias  \u00b7  Only experiments with \u2265 1 failing dimension shown") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        strip.text = element_text(face = "bold", size = 8.5))

# ── Type-1 error rates across N (Type-2 intentionally excluded — see note below) ──
set.seed(42)
err_df <- bind_rows(
  lapply(N_vals, function(n) {
    cbind(N = n, DGP = "Independent (rho = 0)",
          sim_error_rates(n, D, rho = 0.00, betas = betas_hurtful))
  }),
  lapply(N_vals, function(n) {
    cbind(N = n,
          DGP = "Scenario A: Helpful (rho = 0.75, uniform small \u03b2)",
          sim_error_rates(n, D, rho = 0.75, betas = betas_helpful))
  }),
  lapply(N_vals, function(n) {
    cbind(N = n,
          DGP = "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)",
          sim_error_rates(n, D, rho = 0.35, betas = betas_hurtful))
  })
) |>
  mutate(N        = as.integer(N),
         t1_rate  = as.numeric(t1_rate),
         pct_fail = as.numeric(pct_fail),
         DGP      = as.character(DGP))

# Show Type-1 error only (Type-2 is not reported — see callout below)
p_err <- ggplot(err_df |> filter(DGP != "Independent (rho = 0)"),
                aes(x = N, y = t1_rate, color = DGP, group = DGP)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3.0) +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "gray40") +
  annotate("text", x = max(N_vals) * 1.02, y = 0.055, label = "5% nominal",
           color = "gray40", size = 3.2, hjust = 0) +
  scale_x_continuous(breaks = N_vals) +
  scale_y_continuous(labels = percent_format(accuracy = 1),
                     limits = c(0, NA)) +
  scale_color_manual(values = scen_colors[names(scen_colors) != "Independent (rho = 0)"],
                     name = NULL) +
  labs(x = "Sample size (N)",
       y = "Type-1 error rate (among failed experiments)",
       title = "False-Positive Rate Rises with N When Orthogonality Fails",
       subtitle = paste0(
         "Conditional on |r(T, X)| > 0.10  \u00b7  ",
         "Dashed = 5% nominal  \u00b7  ",
         "Scenario B's large \u03b2s amplify residual imbalance into detectable bias")) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(), legend.position = "top",
        legend.text = element_text(size = 8.5),
        plot.margin = margin(5, 50, 5, 5))

p_prob / p_bias / p_err

WarningReading the three plots together

Top plot — P(achieving orthogonality): Scenario A (ρ = 0.75, tiny uniform betas) is the easiest case. The very strong correlation among dimensions collapses the effective number of independent checks, so a smaller N achieves the 95% target. This is the helpful force at work. Scenario B (ρ = 0.35, large varied betas) offers little advantage over independent dimensions on this metric.

Middle plot — bias when randomization fails: This is where the scenarios diverge dramatically. When Scenario A fails, it fails on a highly clustered bloc with tiny β values — cumulative bias remains small. When Scenario B fails, it fails on moderately correlated dimensions each with large independent β effects — the bias distribution has a fat right tail, with individual experiments easily incurring > 1.0 SD of cumulative confound. Notice how much wider the Scenario B distribution is compared to A.

Bottom plot — Type-1 error rate (given orthogonality failed): Conditional on at least one covariate being imbalanced (|r| > 0.10), Scenario B’s large betas turn even small residual correlations into substantial bias — false positive rates rise well above 5% and continue rising as N increases. Scenario A stays near 5% because its tiny betas keep the omitted-variable bias small even when imbalance exists.

NoteWhy is Type-2 error (power) not shown?

Because the question of power is irrelevant when exchangeability is violated.

Power analysis asks: “If there is a true effect, how often will my test correctly detect it?” That question is only valid when the p-value is a valid test statistic — i.e., when treatment is exchangeable with control. When orthogonality fails, the p-value is not a valid measure of the treatment effect. A “detection” (p < 0.05) in an experiment where orthogonality failed could mean either (a) you correctly found a real effect, or (b) the confound inflated the test statistic into significance independently of whether an effect exists. You cannot distinguish them. Power therefore has no valid interpretation under a violated exchangeability assumption — the right question is how to restore valid inference, not how often the broken test produces p < 0.05.

ImportantCounterintuitive danger: larger N makes invalid tests more dangerous, not less

Most researchers assume larger samples make experiments safer. The bottom plot reveals the opposite: among experiments that fail the orthogonality check, the false-positive rate rises with N.

The mechanism: the plot conditions on experiments where |r(T, X)| > 0.10. Among those “failing” experiments, the residual bias (r × β) is roughly constant across N — the failing correlation barely exceeds 0.10 in all cases. What does change is the standard error of the t-test, which shrinks as 1/√N. The same small bias that was invisible noise at small N becomes a statistically detectable “effect” at large N.
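The mechanism can be checked with three lines of arithmetic. Assuming unit outcome variance and equal group sizes (our simplification, not quantities taken from the simulation), a fixed confound bias of 0.10 SD is pushed across the significance threshold purely by the shrinking standard error:

```r
# A fixed bias b meets a standard error that shrinks as 1/sqrt(N):
# SE of a two-group mean difference with unit variance and N/2 per group.
b  <- 0.10
N  <- c(50, 200, 800, 3200)
se <- sqrt(4 / N)
round(b / se, 2)   # t-statistic contributed by the bias alone
#> 0.35 0.71 1.41 2.83
```

The bias never changes; only the yardstick against which it is judged shrinks, so the same 0.10 SD confound goes from invisible at N = 50 to borderline significant by itself at N = 3200.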

The compounding trap visualized: The plot below shows — for Scenario B (the dangerous case) — how Type-1 error (among experiments with violations) and Type-2 error/miss rate (among experiments without violations) evolve as N grows. Point size is proportional to the frequency of orthogonality violations. Read it as a race: the red line (false-positive danger) climbs faster than the blue line (miss rate) falls.

▶ Side-by-side: Type-1 error (violations) vs power (no violations) as N grows
# Extract Scenario B only; compute power = 1 - miss rate; reshape to long for facets
danger_df <- err_df |>
  filter(DGP == "Scenario B: Hurtful (rho = 0.35, large varied \u03b2)") |>
  dplyr::select(N, t1_rate, t2_rate, pct_fail) |>
  mutate(power = 1 - t2_rate) |>                     # power = 1 - miss rate
  tidyr::pivot_longer(c(t1_rate, power),
                      names_to  = "metric",
                      values_to = "rate") |>
  mutate(
    panel = dplyr::case_when(
      metric == "t1_rate" ~
        "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)",
      metric == "power" ~
        "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"
    ),
    panel = factor(panel, levels = c(
      "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)",
      "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"
    )),
    color_val = ifelse(metric == "t1_rate", "#e63946", "#457b9d")
  )

# Reference line data: 5% for Type-1 panel, 80% for power panel
ref_lines <- data.frame(
  panel = factor(c(
    "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)",
    "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"
  ), levels = levels(danger_df$panel)),
  yref  = c(0.05, 0.80),
  label = c("5% nominal \u03b1", "80% power")
)

ggplot(danger_df, aes(x = N, y = rate, group = panel)) +
  geom_hline(data = ref_lines, aes(yintercept = yref),
             linetype = "dashed", color = "gray50", linewidth = 0.8) +
  geom_text(data = ref_lines, aes(x = 820, y = yref + 0.025, label = label),
            color = "gray40", size = 3.0, hjust = 0, inherit.aes = FALSE) +
  geom_line(aes(color = panel), linewidth = 1.4) +
  geom_point(aes(size = pct_fail, color = panel), alpha = 0.88) +
  facet_wrap(~panel, nrow = 1, scales = "free_y") +
  scale_color_manual(values = c(
    "Type-1 Error Rate\n(experiments WITH orthogonality violations — p-values INVALID)" = "#e63946",
    "Power (1 \u2212 Miss Rate)\n(experiments WITHOUT violations — p-values VALID)"      = "#457b9d"
  ), guide = "none") +
  scale_size_continuous(
    name   = "Proportion of experiments\nwith |r(T,X)| > 0.10",
    range  = c(2, 8),
    labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = N_vals) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    x        = "Sample size (N)",
    y        = NULL,
    title    = "Larger N: Type-1 Error Climbs Faster Than Power Grows — Among Experiments That Fail Orthogonality",
    subtitle = paste0(
      "Scenario B (rho = 0.35, large heterogeneous \u03b2s)  \u00b7  ",
      "Point size = proportion of experiments with at least one |r(T,X)| > 0.10\n",
      "Left: false-positive danger rises with N (invalid experiments)  \u00b7  ",
      "Right: power in valid experiments rises with N (as expected)")) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor  = element_blank(),
    legend.position   = "bottom",
    legend.direction  = "horizontal",
    legend.text       = element_text(size = 8),
    strip.text        = element_text(face = "bold", size = 9.5),
    plot.margin       = margin(5, 20, 5, 5)
  )

Reading the panels: The left panel shows Type-1 error (false-positive rate) conditional on orthogonality having failed — rising from ~6% at N = 50 to 20–30% at N = 800, far above the 5% nominal level. The right panel shows power in the experiments that passed the orthogonality check — rising as expected, because larger N does increase power for valid experiments. The critical asymmetry: point size (proportion of experiments with violations) shrinks as N grows, so the situation looks increasingly safe to a researcher monitoring only the balance check pass rate. Yet the rare failures that remain become dramatically more dangerous. The compounding trap: sample size DOES reduce the frequency of orthogonality violations — but it reduces violations more slowly than it amplifies the danger of the violations that remain.

Larger N does not license looser balance checking — it demands stricter checking. Tighten your orthogonality criterion (lower r_max) as N grows. Never assume you have achieved actual randomization; random assignment is a procedure, not a guarantee, and its validity must be verified empirically every time.
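One simple way to operationalize “tighten r_max as N grows” (our heuristic, not a rule stated above) is to scale the threshold with the null sampling variability of a correlation, which is roughly 1/√N:

```r
# Heuristic threshold: k null standard deviations of r under independence.
# k = 2 is an arbitrary choice; under H0, sd(r) is approximately 1/sqrt(N).
r_max_for <- function(N, k = 2) k / sqrt(N)
round(sapply(c(50, 200, 800), r_max_for), 3)
#> 0.283 0.141 0.071
```

Under this heuristic the fixed threshold of 0.10 used throughout the simulations is about right near N = 400 (2/√400 = 0.10), too strict at N = 50, and too lenient by N = 800.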

The practical lesson: correlated dimensions are not uniformly good or bad. The key question is what their β values look like. If correlated dimensions all carry similar, modest effects, correlation genuinely helps by reducing effective checks. If correlated dimensions carry large, heterogeneous effects, correlation is a liability — and that liability is invisible unless you examine conditional bias and error rates, not just P(orthogonality).

This mirrors a core lesson from Module 1: a tight subscale structure looks efficient and reliable, but if the subscales carry separate predictive variance (discriminant invalidity), a failure in one cascades to all. The structural parallel is exact.


13.8 Checking the Observables: Randomization Diagnostics

ImportantA fundamental limitation — you can only check what you measured

You can only discover failures of randomization on dimensions you thought to measure.

This is the direct experimental parallel to Module 1’s measurement problem. In Module 1, we showed that unmeasured sub-dimensions of Y produce a biased estimate of your latent construct — you cannot diagnose measurement failure on dimensions you never operationalized. The same logic applies here: if a potential confounder was never measured as a covariate, no balance table, no Love plot, and no omnibus test will catch it.

This constraint is not a limitation of statistics — it is a fundamental epistemological constraint. Its implication is that randomization checking is only as good as the covariate measurement you did before the study began. The tools from Module 1 — CFA to identify sub-dimensions of Y, pilot text analysis to map what’s in X and Y (see Section 2.5), discriminant validity checks — are not just measurement tools. They are pre-registration scaffolding that determines whether your eventual randomization check has the construct coverage to be meaningful.

The practical lesson: Before collecting data, use the construct-mapping tools from Module 1 to enumerate every plausible antecedent of Y. Measure them all. Then, after randomization, check balance on every one of them. The diagnostics below are only valuable if you did that groundwork first.

Even though you cannot directly observe the latent dimensions, you can check the observable proxies you did measure. A balance table checks whether measured characteristics are distributed similarly across conditions — a necessary but not sufficient condition for latent randomization.

Here we simulate a study where randomization partially failed (treatment assignment was inadvertently correlated with one confounder) and show how to detect it.

▶ Simulate study with imperfect randomization
set.seed(42)
N_study <- 200

observed_chars <- data.frame(
  env_values     = rbinom(N_study, 1, 0.50),
  income_high    = rbinom(N_study, 1, 0.45),
  social_norms   = rbinom(N_study, 1, 0.50),
  brand_trust    = rbinom(N_study, 1, 0.55),
  quality_percep = rbinom(N_study, 1, 0.50),
  health_motiv   = rbinom(N_study, 1, 0.40),
  peer_influence = rbinom(N_study, 1, 0.50),
  prior_green    = rbinom(N_study, 1, 0.45),
  price_sensitiv = rbinom(N_study, 1, 0.50),
  emotion_resp   = rbinom(N_study, 1, 0.50)
)

# "Random" assignment — inadvertently correlated with env_values and income
# (e.g., participants self-selected into a study they saw on an eco-forum)
treatment_prob <- 0.35 +
  0.25 * observed_chars$env_values +
  0.15 * observed_chars$income_high
treatment_prob <- pmin(pmax(treatment_prob, 0.05), 0.95)
treatment <- rbinom(N_study, 1, treatment_prob)

study_df <- cbind(observed_chars,
                  treatment = factor(treatment, labels = c("Control", "Eco-label")))

13.8.1 Balance Table

▶ Compute balance table with chi-square tests
balance_results <- lapply(names(observed_chars), function(var) {
  ctrl <- observed_chars[[var]][treatment == 0]
  trt  <- observed_chars[[var]][treatment == 1]
  chi  <- chisq.test(table(observed_chars[[var]], treatment), correct = FALSE)
  data.frame(
    Characteristic  = gsub("_", " ", var) |> tools::toTitleCase(),
    `Control (%)`   = round(mean(ctrl) * 100, 1),
    `Eco-label (%)` = round(mean(trt)  * 100, 1),
    Difference      = round((mean(trt) - mean(ctrl)) * 100, 1),
    p_value         = round(chi$p.value, 3),
    Flag            = ifelse(chi$p.value < 0.05, "\u26a0\ufe0f Imbalanced", "")
  )
}) |> bind_rows()

kable(balance_results,
      col.names = c("Characteristic", "Control (%)", "Eco-label (%)",
                    "Diff (pp)", "p-value (\u03c7\u00b2)", ""),
      caption = "Balance table: observed characteristics by condition")
Balance table: observed characteristics by condition

Characteristic    Control (%)   Eco-label (%)   Diff (pp)   p-value (χ²)
Env Values            42.0          66.1           24.0        0.001      ⚠️ Imbalanced
Income High           29.5          42.9           13.3        0.053
Social Norms          43.2          45.5            2.4        0.740
Brand Trust           61.4          53.6           -7.8        0.269
Quality Percep        48.9          43.8           -5.1        0.471
Health Motiv          39.8          35.7           -4.1        0.556
Peer Influence        46.6          54.5            7.9        0.269
Prior Green           53.4          36.6          -16.8        0.017      ⚠️ Imbalanced
Price Sensitiv        43.2          50.9            7.7        0.278
Emotion Resp          54.5          60.7            6.2        0.380

13.8.2 Visual Balance Check (Love Plot)

▶ Plot: standardized differences (love plot)
smd_df <- data.frame(
  Characteristic = gsub("_", " ", names(observed_chars)) |> tools::toTitleCase(),
  SMD = sapply(names(observed_chars), function(var) {
    ctrl <- observed_chars[[var]][treatment == 0]
    trt  <- observed_chars[[var]][treatment == 1]
    pooled_sd <- sqrt((var(trt) + var(ctrl)) / 2)
    if (pooled_sd == 0) return(0)
    (mean(trt) - mean(ctrl)) / pooled_sd
  })
) |>
  arrange(desc(abs(SMD))) |>
  mutate(Characteristic = factor(Characteristic, levels = Characteristic),
         flag           = abs(SMD) > 0.10)

ggplot(smd_df, aes(x = SMD, y = Characteristic, color = flag)) +
  geom_vline(xintercept =  0,    color = "gray40", linewidth = 0.8) +
  geom_vline(xintercept =  0.10, linetype = "dashed", color = "#de2d26", linewidth = 0.7) +
  geom_vline(xintercept = -0.10, linetype = "dashed", color = "#de2d26", linewidth = 0.7) +
  geom_segment(aes(x = 0, xend = SMD, y = Characteristic, yend = Characteristic),
               linewidth = 1.0) +
  geom_point(size = 4) +
  scale_color_manual(values = c("FALSE" = "#52b788", "TRUE" = "#de2d26"),
                     labels = c("|SMD| \u2264 0.10", "|SMD| > 0.10"), name = NULL) +
  labs(x = "Standardized Mean Difference (Treatment \u2212 Control)",
       y = NULL,
       title = "Love Plot: Covariate Balance by Condition",
       subtitle = "Dashed lines = |SMD| = 0.10 threshold  \u00b7  Red = potential imbalance") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "top")

13.8.3 Omnibus Randomization Test

For a single omnibus test, we use logistic regression predicting treatment from all characteristics. If randomization succeeded, no characteristic should significantly predict assignment, and the model R² should be near zero.

▶ Omnibus randomization test via logistic regression
treat_int <- as.integer(treatment)   # treatment is already coded 0/1

omnibus_model <- glm(treat_int ~ ., data = cbind(observed_chars, treat_int),
                     family = binomial)

null_ll <- logLik(glm(treat_int ~ 1, data = cbind(observed_chars, treat_int),
                      family = binomial))
full_ll <- logLik(omnibus_model)
n       <- N_study

r2_nag <- (1 - exp(2 * (null_ll - full_ll) / n)) / (1 - exp(2 * null_ll / n))

lrt_p  <- anova(glm(treat_int ~ 1, data = cbind(observed_chars, treat_int),
                    family = binomial),
                omnibus_model, test = "LRT")$`Pr(>Chi)`[2]

cat(sprintf("Omnibus LRT: p = %.4f\n",  lrt_p))
cat(sprintf("Nagelkerke R\u00b2: %.3f\n", r2_nag))
cat(sprintf("Interpretation: Covariates explain %.1f%% of variance in treatment assignment.\n",
            r2_nag * 100))
cat("Good randomization should produce R\u00b2 \u2248 0 and p well above .05.\n")

Omnibus LRT: p = 0.0041
Nagelkerke R²: 0.162
Interpretation: Covariates explain 16.2% of variance in treatment assignment.
Good randomization should produce R² ≈ 0 and p well above .05.

Handling mixed covariate types

In practice, your covariates will rarely all be the same type. You may have binary indicators (gender, prior exposure), ordered categories (education level, income bracket), unordered categories (nationality, recruitment channel), and continuous measures (age, prior attitude score). The omnibus logistic regression approach handles all of these — you just need to encode each type correctly before passing them to glm:

▶ Template: omnibus randomization test for mixed covariate types (not run)
# ── Step 1: Encode each covariate type correctly ──────────────────────────────
# Suppose your covariate data frame is called `covariates` with columns:
#   gender       : binary (0/1 or factor with 2 levels)
#   edu_level    : ordinal (e.g., 1=high school, 2=bachelor, 3=master, 4=PhD)
#   country      : nominal (unordered categories → MUST be factor)
#   age          : continuous
#   prior_att    : continuous (e.g., prior attitude score on 1–7 scale)

covariates$gender    <- factor(covariates$gender)         # binary → factor (1 df)
covariates$edu_level <- as.integer(covariates$edu_level)  # ordinal → numeric (linear trend)
covariates$country   <- factor(covariates$country)        # nominal → factor (K-1 dummies)
# age and prior_att stay as numeric — no conversion needed

# ── Step 2: Omnibus logistic regression ───────────────────────────────────────
# treatment must be a binary 0/1 variable
# glm automatically dummy-codes all factor variables
omnibus_fit  <- glm(treatment ~ ., data = covariates, family = binomial)
null_fit     <- glm(treatment ~ 1, data = covariates, family = binomial)

# ── Step 3: Likelihood-ratio test ─────────────────────────────────────────────
lrt <- anova(null_fit, omnibus_fit, test = "LRT")
cat(sprintf("Omnibus LRT p = %.4f\n", lrt$`Pr(>Chi)`[2]))

# Nagelkerke pseudo-R²
n_obs  <- nrow(covariates)
nag_r2 <- (1 - exp(2 * (logLik(null_fit) - logLik(omnibus_fit)) / n_obs)) /
           (1 - exp(2 * logLik(null_fit) / n_obs))
cat(sprintf("Nagelkerke R² = %.3f\n", nag_r2))

# ── Optional: non-linear ordinal check ────────────────────────────────────────
# If you suspect the ordinal variable has a non-linear relationship with
# treatment assignment, convert it to a factor (dummy-coded) instead:
#   covariates$edu_level <- factor(covariates$edu_level)
# This uses more degrees of freedom but makes no linearity assumption.

Key points:

  • Binary covariates: encode as factor — produces 1 dummy variable.
  • Ordinal covariates: encode as integer (assumes linear trend across levels). If the relationship might be non-linear, use factor() instead — this creates K-1 dummies but uses more degrees of freedom.
  • Nominal (unordered) covariates: must be a factor; glm automatically creates K-1 dummy variables (one per category minus the reference level).
  • Continuous covariates: pass as-is (numeric). No transformation needed.

The omnibus LRT tests whether any covariate — regardless of type — predicts treatment assignment better than chance. If p < .05 or R² > .05, investigate which covariates are driving the imbalance using the individual balance table and Love plot above.
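Ranking the individual Wald p-values from the same logistic regression is a quick way to see which covariates drive a failed omnibus test. The sketch below is self-contained with simulated data; the variable names and the planted imbalance on prior are our illustration:

```r
# Simulate covariates with a planted imbalance: assignment leans on `prior`.
set.seed(99)
n_pilot <- 300
covs <- data.frame(
  age   = rnorm(n_pilot, 35, 10),
  edu   = sample(1:4, n_pilot, replace = TRUE),
  prior = rbinom(n_pilot, 1, 0.5)
)
trt_vec <- rbinom(n_pilot, 1, plogis(-0.2 + 0.8 * covs$prior))

# Rank covariates by their individual Wald p-values
drv_fit <- glm(trt_vec ~ ., data = cbind(covs, trt_vec), family = binomial)
wald    <- summary(drv_fit)$coefficients
wald[order(wald[, "Pr(>|z|)"]), ]   # `prior` should surface near the top
```

This ranking is diagnostic, not corrective: it tells you where to look, after which the balance table and Love plot above quantify the imbalance.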

WarningWhat to look for
Diagnostic               Good randomization    Potential problem
Balance table p-values   All p > .05           Any p < .05
Love plot SMDs           All |SMD| < 0.10      Any |SMD| > 0.10
Omnibus LRT p-value      p well above .05      p < .05
Nagelkerke R²            R² ≈ 0                R² > 0.05

These checks only evaluate the observable dimensions you measured. The latent dimensions — all the antecedents of Y you did not think to measure — remain unverified. The exchangeability assumption is always partly an act of faith, and that faith becomes more tenuous the broader and more complex Y is.


13.9 When You Can’t Check What You Can’t See: Diagnosing Latent Imbalance

The three diagnostics above — balance table, Love plot, omnibus test — are essential, but they share a fundamental limitation:

You can only check balance on dimensions you thought to measure.

If an unmeasured dimension of Y is correlated with treatment assignment, no balance table will catch it. And as the analysis in Section 3.6 showed, the broader your outcome construct, the more such unmeasured dimensions there are likely to be.

What can you do?

13.9.1 What you can do before data collection

1. Map the construct before running the study. Use Module 1’s toolkit — confirmatory factor analysis, convergent/discriminant validity checks, latent class analysis — on pilot data to identify all the major sub-dimensions of Y. Every dimension you identify is a confounder you can now measure and check balance on.

2. Blocked (stratified) randomization. If you know in advance that certain dimensions of Y are likely to be distributed unevenly in your sample, pre-stratify on them. Assign treatment within blocks defined by those dimensions. This guarantees balance on the dimensions you can observe — reducing the degrees of freedom for latent imbalance to exploit.

3. Collect more observations. As Section 3.6 showed quantitatively, the probability of achieving approximate orthogonality on all D dimensions grows with N. If your outcome is broad (large D), power calculations based on the treatment effect alone dramatically underestimate the required sample.
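Point 2 (blocked randomization) can be sketched in a few lines. The helper below is a minimal illustration; the function name and the single binary stratifier are our assumptions, and real designs often block on several variables jointly:

```r
# Assign treatment separately within each stratum: balance on the
# blocking variable is then guaranteed up to one unit per block.
block_randomize <- function(strata) {
  grp <- integer(length(strata))
  for (s in unique(strata)) {
    idx     <- which(strata == s)
    n_s     <- length(idx)
    # half treated within the block; the odd leftover unit goes either way
    n_treat <- floor(n_s / 2) + rbinom(1, 1, 0.5) * (n_s %% 2)
    grp[idx[sample.int(n_s, n_treat)]] <- 1L
  }
  grp
}

set.seed(1)
strata <- rbinom(500, 1, 0.4)      # e.g., a pre-measured antecedent of Y
trt    <- block_randomize(strata)
tapply(trt, strata, mean)          # ~0.50 in every block, by construction
```

Within each block the treated share deviates from 50% by at most one participant, so the blocking variable cannot be imbalanced no matter how unlucky the draw.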

13.9.2 What you can do after data collection

4. Open-ended text as a nomological net scanner — and an identification strategy.

As we developed in detail in Section 2.5 (What’s in X? What’s in Y?), open-ended text responses from a small pilot are the most powerful tool for mapping the construct space before collecting main-study data. The nomological net scan reveals the exhaustive list of intermediary variables in the causal chain from X to Y. Every concept the text reveals is a mediator you should measure quantitatively — and a confounder you should check balance on.

The key insight from Section 2.5 bears repeating here: without the complete mediator list, the treatment coefficient in a regression is an uninterpretable mixture of all active pathways. With the list, you gain a structural identification strategy: decompose the total treatment effect into its component channels using the measured mediators.

The complete structural modeling demonstration — quantitative mediator scoring, parallel mediation, SEM with latent factors, and DGP recovery comparison — is developed step by step in Section 2.5 above, where it serves as the practical illustration of how to handle exclusion restriction violations once the nomological net has been mapped.
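Since lavaan is already loaded, the decomposition idea can be compressed into a toy parallel-mediation sketch. The data-generating values and mediator names here are invented for illustration; the full, careful demonstration lives in Section 2.5:

```r
# Toy DGP: treatment works through two measured mediators plus a direct path.
set.seed(3)
n   <- 400
trt <- rbinom(n, 1, 0.5)
M1  <- 0.5 * trt + rnorm(n)
M2  <- 0.3 * trt + rnorm(n)
Y   <- 0.4 * M1 + 0.6 * M2 + 0.2 * trt + rnorm(n)
dat <- data.frame(trt, M1, M2, Y)

model <- '
  M1 ~ a1 * trt
  M2 ~ a2 * trt
  Y  ~ b1 * M1 + b2 * M2 + cp * trt
  ind1  := a1 * b1               # channel through M1
  ind2  := a2 * b2               # channel through M2
  total := cp + a1 * b1 + a2 * b2
'
fit <- lavaan::sem(model, data = dat)
pe  <- lavaan::parameterEstimates(fit)
pe[pe$label %in% c("cp", "ind1", "ind2", "total"),
   c("label", "est", "se", "pvalue")]
```

With the mediators measured, the total effect (0.58 in this DGP) separates into its channels; without them, a regression of Y on trt alone would report only the uninterpretable mixture.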