9  Module 2: Hypothesis Testing

The Logic of Randomization and What P-Values Actually Tell You

9.1 Overview

Null hypothesis significance testing is everywhere in management research, yet its logic is widely misunderstood. This module builds from the ground up: why does randomization justify causal claims? What does a p-value actually mean? And what are the hidden assumptions — about units, stimuli, and assignment mechanisms — that most researchers ignore?

We cover the permutation logic of p-values, Type I and Type II error, statistical power, the multiple comparisons problem, researcher degrees of freedom, and the design features (between- vs. within-subjects designs, stimulus sampling, and Latin squares) that change what you can conclude.


9.2 Learning Goals

By the end of this module you should be able to:

  • Derive the sampling distribution of a test statistic from the randomization distribution, not from asymptotic theory
  • Explain the difference between a two-sided p-value and a posterior probability
  • Calculate and interpret statistical power, and explain why most social science studies are underpowered
  • Identify researcher degrees of freedom and explain how they inflate false-positive rates
  • Explain the rationale for stimulus sampling and design studies that treat stimuli as random effects
  • Construct a Latin square design and explain when it is preferable to a fully crossed factorial
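As a taste of the power-analysis goal above, here is a minimal sketch using base R's power.t.test; the effect sizes and sample sizes are hypothetical choices for illustration:

```r
# Power of a two-sample t-test with n = 50 per group and a
# "medium" standardized effect (Cohen's d = 0.5): roughly 0.70,
# i.e., a ~30% chance of missing a real medium-sized effect.
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)

# Sample size needed per group for 80% power to detect a
# small effect (d = 0.2): just under 400 per group.
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)
```

Numbers like these are behind the claim that most social science studies are underpowered: cells of 50 participants only reliably detect effects much larger than those typically reported.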

9.4 What This Tutorial Is About

There is a single thread running through this tutorial and its predecessor:

Your observable may not accurately reflect the latent construct you need.

In Module 1, that problem lived in your outcome variable Y — your scale picked up variance from constructs you never intended to measure. Here, the same problem appears in two places at once: in your treatment variable X, and in the act of randomization that is supposed to make X interpretable.

This tutorial covers three ideas:

  • Part 1 — P-values: What they actually mean, where their inferential power comes from, and what assumptions they require — demonstrated by building null distributions from scratch using permutation.

  • Part 2 — The Exclusion Restriction: Just as a measured outcome Y can fail discriminant validity by absorbing multiple constructs, a manipulated treatment X can inject multiple independent signals into the system simultaneously. When this happens, your “experiment” is really a quasi-experiment — and the single coefficient on treatment conflates several distinct causal pathways.

  • Part 3 — Randomization: Why “clicking randomize” (the observable) is not the same as “achieving randomization” (the latent property), and why this gap grows as the construct Y you are studying becomes broader and more complex.
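To make Part 1 concrete before we get there, here is a minimal sketch of building a null distribution by label permutation; the data are simulated and the effect size is an arbitrary illustrative choice:

```r
set.seed(42)

# Simulated two-group experiment with a modest true effect
n     <- 100
treat <- rep(c(0, 1), each = n / 2)
y     <- 0.3 * treat + rnorm(n)

obs_diff <- mean(y[treat == 1]) - mean(y[treat == 0])

# Null distribution: shuffle the treatment labels many times.
# Under the sharp null of no effect for any unit, labels are
# exchangeable, so each permuted difference is a draw from the
# distribution the observed statistic would follow by chance.
perm_diff <- replicate(5000, {
  shuffled <- sample(treat)
  mean(y[shuffled == 1]) - mean(y[shuffled == 0])
})

# Two-sided p-value: the share of permuted differences at least
# as extreme as the one we observed.
p_val <- mean(abs(perm_diff) >= abs(obs_diff))
p_val
```

Note that nothing here appeals to a t-distribution or asymptotic theory: the randomization itself generates the reference distribution.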

Note: Learning Objectives

By the end of this tutorial, you will be able to:

  • Explain what a p-value is — and is not — using simulation rather than formulas
  • Build a null distribution from scratch using label permutation
  • Identify the three ingredients that give a p-value inferential power
  • Explain why a treatment manipulation can violate discriminant validity in the same way a measurement scale can
  • Use open-ended text to map the constructs injected by an experimental manipulation and those driving an outcome
  • Articulate the difference between observable randomization and latent randomization using the construct-validity language from Module 1
  • Calculate the minimum observations needed for approximate orthogonality given the complexity of Y
  • Run diagnostic checks to assess whether your observed randomization achieved approximate orthogonality
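As a preview of the last objective, one simple diagnostic is to check whether assigned treatment is approximately orthogonal to pre-treatment covariates. This is a sketch on simulated data; the covariate names are hypothetical:

```r
set.seed(7)

n   <- 200
dat <- data.frame(
  treat  = sample(rep(0:1, each = n / 2)),  # randomized assignment
  age    = rnorm(n, 40, 10),                # hypothetical covariates
  tenure = rnorm(n, 5, 2)
)

# Correlation of treatment with each covariate: in a clean
# randomization these are near zero, shrinking as n grows.
sapply(dat[c("age", "tenure")], function(x) cor(dat$treat, x))

# A standard balance check: regress treatment on the covariates
# and inspect the joint F-test, which should be nonsignificant.
summary(lm(treat ~ age + tenure, data = dat))
```

With small samples these correlations can be sizable even when the assignment procedure is flawless, which is exactly the observable-vs.-latent gap this tutorial dwells on.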

▶ Load required packages
# Install any missing packages, then load them.
pkgs <- c("tidyverse", "knitr", "scales", "patchwork", "lavaan")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)

library(tidyverse)   # includes ggplot2
library(knitr)
library(scales)
library(patchwork)
library(lavaan)

set.seed(2025)