Module 2: Hypothesis Testing

The Logic of Randomization and What P-Values Actually Tell You

Overview

Null hypothesis significance testing is everywhere in management research, yet its logic is widely misunderstood. This module builds from the ground up: why does randomization justify causal claims? What does a p-value actually mean? And what are the hidden assumptions — about units, stimuli, and assignment mechanisms — that most researchers ignore?

We cover the permutation logic of p-values, Type I and Type II error, power, the multiple comparisons problem, researcher degrees of freedom, and the design features (between- vs. within-subjects designs, stimulus sampling, and Latin squares) that change what you can conclude.


Learning Goals

By the end of this module you should be able to:

  • Derive the sampling distribution of a test statistic from the randomization distribution, not from asymptotic theory
  • Explain the difference between a two-sided p-value and a posterior probability
  • Calculate and interpret statistical power, and explain why most social science studies are underpowered
  • Identify researcher degrees of freedom and explain how they inflate false-positive rates
  • Explain the rationale for stimulus sampling and design studies that treat stimuli as random effects
  • Construct a Latin square design and explain when it is preferable to a fully crossed factorial

What This Tutorial Is About

There is a single thread running through this tutorial and its predecessor:

Your observable may not accurately reflect the latent construct you need.

In Module 1, that problem lived in your outcome variable Y — your scale picked up variance from constructs you never intended to measure. Here, the same problem appears in two places at once: in your treatment variable X, and in the act of randomization that is supposed to make X interpretable.

This tutorial covers four ideas:

  • Part 1 — P-values: What they actually mean, where their inferential power comes from, and what assumptions they require — demonstrated by building null distributions from scratch using permutation.

  • Part 2 — Randomization: Why “clicking randomize” (the observable) is not the same as “achieving randomization” (the latent property), and why this gap grows as the construct Y you are studying becomes broader and more complex.

  • Part 3 — Selection Effects: Why some data-collection environments make exchangeability structurally impossible, regardless of how randomization is conducted. Non-representativeness, self-selection, survivorship, attrition, and exclusion bias are not failures of randomization — they are failures that occur before, during, or after data collection, and they cannot be fixed by collecting more data within the same design.

  • Part 4 — The Exclusion Restriction: Just as a measured outcome Y can fail discriminant validity by absorbing multiple constructs, a manipulated treatment X can inject multiple independent signals into the system simultaneously. When assignment is random, the treatment coefficient still estimates the causal effect of the bundle participants actually received. What becomes ambiguous is the narrower mechanistic claim: the coefficient cannot tell you whether the effect came from eco-certification, the premium packaging, demand cues, or their interaction — and that mechanism ambiguity is the experimental analogue of discriminant invalidity in measurement.
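The permutation logic in Part 1 can be sketched in a few lines of base R. This is an illustrative example with made-up outcome data, not data from the tutorial: under the null, treatment labels are arbitrary, so shuffling them many times yields the null distribution of the mean difference, and the p-value is simply the share of shuffled differences at least as extreme as the observed one.

```r
# Hypothetical outcomes for two randomized groups
treated <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)
control <- c(4.2, 5.0, 4.8, 4.4, 5.3, 4.6)
obs_diff <- mean(treated) - mean(control)

pooled <- c(treated, control)
n_t    <- length(treated)

# Build the null distribution by re-randomizing the labels
set.seed(1)
perm_diffs <- replicate(10000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:n_t]) - mean(shuffled[-(1:n_t)])
})

# Two-sided p-value: proportion of permuted differences as extreme as observed
p_val <- mean(abs(perm_diffs) >= abs(obs_diff))
```

No distributional assumptions enter anywhere: the only thing doing inferential work is the assignment mechanism itself, which is exactly the point Part 1 develops.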

Note: Learning Objectives

By the end of this tutorial, you will be able to:

  • Explain what a p-value is — and is not — using simulation rather than formulas
  • Build a null distribution from scratch using label permutation
  • Identify the three ingredients that give a p-value inferential power
  • Explain why a treatment manipulation can violate discriminant validity in the same way a measurement scale can
  • Use open-ended text to map the constructs injected by an experimental manipulation and those driving an outcome
  • Articulate the difference between observable randomization and latent randomization using the construct-validity language from Module 1
  • Calculate the minimum number of observations needed for approximate orthogonality given the complexity of Y
  • Run diagnostic checks to assess whether your observed randomization achieved approximate orthogonality
  • Identify the four classes of selection effect (non-representativeness, self-selection, survivorship/attrition, exclusion bias) and explain why each makes exchangeability structurally unachievable within the affected sample
  • Distinguish survivorship bias (missing observations never entered the sample) from attrition bias (observations entered the sample but left non-randomly)
  • Diagnose differential attrition across conditions and explain why conditioning on a post-randomization event breaks exchangeability even in a properly randomized experiment

▶ Load required packages
# install.packages(c("tidyverse","ggplot2","knitr","scales","patchwork"))

library(tidyverse)
library(ggplot2)
library(knitr)
library(scales)
library(patchwork)
# lavaan is installed on demand if missing
if (!requireNamespace("lavaan", quietly = TRUE)) install.packages("lavaan")
library(lavaan)

set.seed(2025)  # make the simulations in this tutorial reproducible
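As a preview of the diagnostic checks listed in the objectives, the sketch below (hypothetical data, base R only) shows the simplest orthogonality check: after random assignment, a pre-treatment covariate should be unrelated to condition, so regressing it on condition should produce a coefficient near zero.

```r
# Simulate a pre-treatment covariate and a random assignment
n         <- 200
age       <- rnorm(n, mean = 35, sd = 10)
condition <- sample(rep(c("control", "treat"), each = n / 2))

# Raw balance: difference in covariate means across conditions
balance <- tapply(age, condition, mean)
balance["treat"] - balance["control"]

# Regression form of the same check: the condition coefficient
# should be small with a large p-value if assignment is orthogonal to age
summary(lm(age ~ condition))$coefficients
```

One covariate checked once proves little; the point of the module is that the number of such checks you need grows with the complexity of the construct Y you are studying.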