9  Module 2: Hypothesis Testing

The Logic of Randomization and What P-Values Actually Tell You

9.1 Overview

Null hypothesis significance testing is everywhere in management research, yet its logic is widely misunderstood. This module builds from the ground up: why does randomization justify causal claims? What does a p-value actually mean? And what are the hidden assumptions — about units, stimuli, and assignment mechanisms — that most researchers ignore?

We cover the permutation logic of p-values, Type I and Type II error, statistical power, the multiple comparisons problem, researcher degrees of freedom, and the design features (between- vs. within-subjects designs, stimulus sampling, and Latin squares) that change what you can conclude.


9.2 Learning Goals

By the end of this module you should be able to:

  • Derive the sampling distribution of a test statistic from the randomization distribution, not from asymptotic theory
  • Explain the difference between a two-sided p-value and a posterior probability
  • Calculate and interpret statistical power, and explain why most social science studies are underpowered
  • Identify researcher degrees of freedom and explain how they inflate false-positive rates
  • Explain the rationale for stimulus sampling and design studies that treat stimuli as random effects
  • Construct a Latin square design and explain when it is preferable to a fully crossed factorial
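As a taste of the power-analysis goal above, here is a minimal sketch using base R's power.t.test; the effect sizes and sample sizes are hypothetical choices for illustration:

```r
# Power of a two-sample t-test with n = 50 per group and a
# "medium" standardized effect (Cohen's d = 0.5): roughly 0.70,
# i.e., a ~30% chance of missing a real medium-sized effect.
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)

# Sample size needed per group for 80% power to detect a
# small effect (d = 0.2): just under 400 per group.
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)
```

Numbers like these are behind the claim that most social science studies are underpowered: cells of 50 participants only reliably detect effects much larger than those typically reported.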

9.4 What This Tutorial Is About

There is a single thread running through this tutorial and its predecessor:

Your observable may not accurately reflect the latent construct you need.

In Module 1, that problem lived in your outcome variable Y — your scale picked up variance from constructs you never intended to measure. Here, the same problem appears in two places at once: in your treatment variable X, and in the act of randomization that is supposed to make X interpretable.

This tutorial covers three ideas:

  • Part 1 — P-values: What they actually mean, where their inferential power comes from, and what assumptions they require — demonstrated by building null distributions from scratch using permutation.

  • Part 2 — The Exclusion Restriction: Just as a measured outcome Y can fail discriminant validity by absorbing multiple constructs, a manipulated treatment X can inject multiple independent signals into the system simultaneously. When this happens, your “experiment” is really a quasi-experiment — and the single coefficient on treatment conflates several distinct causal pathways.

  • Part 3 — Randomization: Why “clicking randomize” (the observable) is not the same as “achieving randomization” (the latent property), and why this gap grows as the construct Y you are studying becomes broader and more complex.
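To make Part 1 concrete before we get there, here is a minimal sketch of building a null distribution by label permutation; the data are simulated and the effect size is an arbitrary illustrative choice:

```r
set.seed(42)

# Simulated two-group experiment with a modest true effect
n     <- 100
treat <- rep(c(0, 1), each = n / 2)
y     <- 0.3 * treat + rnorm(n)

obs_diff <- mean(y[treat == 1]) - mean(y[treat == 0])

# Null distribution: shuffle the treatment labels many times.
# Under the sharp null of no effect for any unit, labels are
# exchangeable, so each permuted difference is a draw from the
# distribution the observed statistic would follow by chance.
perm_diff <- replicate(5000, {
  shuffled <- sample(treat)
  mean(y[shuffled == 1]) - mean(y[shuffled == 0])
})

# Two-sided p-value: the share of permuted differences at least
# as extreme as the one we observed.
p_val <- mean(abs(perm_diff) >= abs(obs_diff))
p_val
```

Note that nothing here appeals to a t-distribution or asymptotic theory: the randomization itself generates the reference distribution.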

Note: Learning Objectives

By the end of this tutorial, you will be able to:

  • Explain what a p-value is — and is not — using simulation rather than formulas
  • Build a null distribution from scratch using label permutation
  • Identify the three ingredients that give a p-value inferential power
  • Explain why a treatment manipulation can violate discriminant validity in the same way a measurement scale can
  • Use open-ended text to map the constructs injected by an experimental manipulation and those driving an outcome
  • Articulate the difference between observable randomization and latent randomization using the construct-validity language from Module 1
  • Calculate the minimum observations needed for approximate orthogonality given the complexity of Y
  • Run diagnostic checks to assess whether your observed randomization achieved approximate orthogonality
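As a preview of the last objective, one simple diagnostic is to check whether assigned treatment is approximately orthogonal to pre-treatment covariates. This is a sketch on simulated data; the covariate names are hypothetical:

```r
set.seed(7)

n   <- 200
dat <- data.frame(
  treat  = sample(rep(0:1, each = n / 2)),  # randomized assignment
  age    = rnorm(n, 40, 10),                # hypothetical covariates
  tenure = rnorm(n, 5, 2)
)

# Correlation of treatment with each covariate: in a clean
# randomization these are near zero, shrinking as n grows.
sapply(dat[c("age", "tenure")], function(x) cor(dat$treat, x))

# A standard balance check: regress treatment on the covariates
# and inspect the joint F-test, which should be nonsignificant.
summary(lm(treat ~ age + tenure, data = dat))
```

With small samples these correlations can be sizable even when the assignment procedure is flawless, which is exactly the observable-vs.-latent gap this tutorial dwells on.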

▶ Load required packages
# Install any missing packages, then load them.
pkgs <- c("tidyverse", "knitr", "scales", "patchwork", "lavaan")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)

library(tidyverse)   # includes ggplot2
library(knitr)
library(scales)
library(patchwork)
library(lavaan)

set.seed(2025)