Part 1: What Does a P-value Mean?

▶ Load required packages

# install.packages(c("tidyverse","ggplot2","knitr","scales","patchwork"))

library(tidyverse)
library(ggplot2)
library(knitr)
library(scales)
library(patchwork)
if (!requireNamespace("lavaan", quietly = TRUE)) install.packages("lavaan")
library(lavaan)

set.seed(2025)

Part 1: What Does a P-value Actually Mean?

The Core Idea

A p-value answers one very specific question:

“How surprising is my observed result, if the null hypothesis were true?”

That’s it. The answer is a probability: the proportion of results at least as extreme as mine that would occur if I repeated the experiment infinitely many times under the null hypothesis.

Before we compute a single number, we need three things:

A null hypothesis — a concrete statement about what the world would look like if there were no effect
A null distribution — the distribution of our test statistic under that null world
A way to locate our observed result within that distribution

We will build all three by simulation.

The Study: Does Eco-Labeling Change Willingness to Pay?

A researcher studies whether eco-certification labels on coffee packages increase consumers’ willingness to pay (WTP). Sixty participants were recruited and randomly assigned to one of two conditions:

Condition	n	Stimulus
Control	30	Standard coffee package
Eco-label	30	Same package with eco-certification badge

After viewing the package, each participant stated their maximum WTP in dollars (on a continuous $1–$10 scale). The researcher observed that the eco-label group was willing to pay $0.80 more than the control group, on average.

Is this difference real — or could it be explained by chance?

The Null World: What “No Effect” Actually Means

The null hypothesis states:

H₀: Eco-labeling has no effect on WTP. Any observed difference between groups is due to sampling variability alone.

If H₀ is true, then all 60 participants are draws from the same underlying population — and the group labels (“Control” vs. “Eco-label”) are arbitrary. We could have assigned those labels any other way without changing anything meaningful about the data.

This is the crucial insight:

Under H₀, the treatment labels are exchangeable.

We can exploit this directly: shuffle the labels thousands of times, recompute the group difference each time, and build up the full distribution of differences that arise purely from chance. This is the null distribution.

▶ Simulate WTP data and compute observed difference

N    <- 60
mu   <- 5.50    # True population mean WTP ($)
sig  <- 1.50    # True population SD ($)

# All 60 participants are drawn from the same population (the null is true).
# We search for a draw where the eco-label group happened to score ~$0.80 higher
# purely by chance — matching the study description above.
# With N = 60 and SD = $1.50 the SE of the difference is ~$0.39, so a gap of $0.80
# corresponds to a t-statistic of ~2.1 and occurs in roughly 4% of random samples.
# That is the pedagogical point: this seemingly large difference can arise under H₀.
seed_use <- NA_integer_
for (s in seq_len(50000)) {
  set.seed(s)
  wtp_try <- pmax(pmin(round(rnorm(N, mu, sig), 2), 10), 1)
  d_try   <- mean(wtp_try[31:60]) - mean(wtp_try[1:30])
  if (d_try >= 0.75 && d_try <= 0.85) { seed_use <- s; break }
}
if (is.na(seed_use)) stop("No seed found in 50,000 tries — adjust target range")

set.seed(seed_use)
wtp_all   <- pmax(pmin(round(rnorm(N, mu, sig), 2), 10), 1)
group     <- factor(rep(c("Control", "Eco-label"), each = N / 2))
df_coffee <- data.frame(wtp = wtp_all, group = group)

obs_diff <- mean(df_coffee$wtp[df_coffee$group == "Eco-label"]) -
            mean(df_coffee$wtp[df_coffee$group == "Control"])

cat(sprintf("Observed mean difference: $%.2f\n", obs_diff))

Observed mean difference: $0.78

▶ Plot: raw WTP by condition

ggplot(df_coffee, aes(x = group, y = wtp, fill = group, color = group)) +
  geom_jitter(width = 0.15, alpha = 0.45, size = 2.5, shape = 16) +
  geom_boxplot(alpha = 0.20, outlier.shape = NA, width = 0.45, linewidth = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 5,
               fill = "white", color = "black", stroke = 1.5) +
  annotate("segment", x = 1, xend = 2, y = 10.4, yend = 10.4,
           color = "gray30", linewidth = 0.8) +
  annotate("text", x = 1.5, y = 10.7,
           label = sprintf("Observed \u0394 = $%.2f", obs_diff),
           size = 4, color = "gray20", fontface = "bold") +
  scale_fill_manual(values  = c("Control" = "#4a90d9", "Eco-label" = "#2d6a4f")) +
  scale_color_manual(values = c("Control" = "#2b6cb0", "Eco-label" = "#1b4332")) +
  scale_y_continuous(limits = c(1, 11), breaks = 1:10,
                     labels = function(x) paste0("$", x)) +
  labs(x = NULL, y = "Willingness to Pay ($)",
       title = "Eco-Coffee WTP by Condition",
       subtitle = "Diamonds show group means  \u00b7  Points are individual participants") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none", panel.grid.minor = element_blank())

What this plot is telling you — before we touch a single formula.

Each dot is one participant’s stated WTP. The box spans the middle 50% of values within each condition; the diamond marks the group mean. The bracket at the top labels the observed mean difference: Δ = $0.78$.

A few things worth noticing right now:

The distributions heavily overlap. The vast majority of the WTP values in both conditions occupy the same range. There is no clean separation — only a slight shift in the center of one distribution relative to the other.
Individual variation swamps the group difference. The spread within each group (roughly $1–$10) is many times larger than the $0.78$ gap between them. This is almost always true in behavioral experiments: people differ from one another far more than any treatment moves them.
This is precisely why “significant” requires more than a single number. The gap of $0.78$ could be real — or it could have appeared by chance simply because of which 30 people happened to land in each condition. The p-value we compute next turns that intuition into a probability.

Building the Null Distribution by Shuffling Labels

To ask “how often would chance alone produce a difference this large?”, we:

Pool all 60 WTP values
Randomly re-assign 30 to “Eco-label” and 30 to “Control” (shuffle the labels)
Record the new group difference
Repeat 10,000 times

Each iteration is a parallel universe in which the null is true and the labels are assigned differently. The collection of differences across all these universes is the null distribution.

▶ Run permutation test (10,000 label shuffles)

set.seed(42)
n_perm      <- 10000
n_per_group <- N / 2

null_dist <- replicate(n_perm, {
  shuffled <- sample(df_coffee$wtp)
  mean(shuffled[1:n_per_group]) - mean(shuffled[(n_per_group + 1):N])
})

p_val <- mean(abs(null_dist) >= abs(obs_diff))

cat(sprintf("Two-sided permutation p-value: %.3f\n", p_val))

Two-sided permutation p-value: 0.052

▶ Run permutation test (10,000 label shuffles)

cat(sprintf("95%% CI of null distribution: [$%.2f, $%.2f]\n",
            quantile(null_dist, 0.025), quantile(null_dist, 0.975)))

95% CI of null distribution: [$-0.78, $0.79]

Watching the Shuffle in Action

Before looking at the summary of 10,000 shuffles, it helps to watch what a single shuffle actually does. The nine panels below show the original data plus eight individual shuffles. In each panel the WTP values are exactly the same — only the group labels (the colors) are reassigned. Notice how the group means (crossbars) bounce around at random, sometimes positive, sometimes negative, sometimes almost zero.

▶ Plot: 9 panels — observed data + 8 individual shuffles

set.seed(77)
n_show <- 8

# Build the observed panel
obs_pnl <- data.frame(
  wtp       = df_coffee$wtp,
  group     = df_coffee$group,
  panel_id  = 0L,
  diff_k    = obs_diff
)

# Build 8 shuffle panels
shuf_list <- lapply(seq_len(n_show), function(k) {
  lbl  <- sample(df_coffee$group)
  dk   <- mean(df_coffee$wtp[lbl == "Eco-label"]) -
          mean(df_coffee$wtp[lbl == "Control"])
  data.frame(wtp = df_coffee$wtp, group = lbl,
             panel_id = k, diff_k = dk)
})

all_pnls <- bind_rows(obs_pnl, shuf_list)

# Build ordered panel labels
panel_meta <- all_pnls |>
  distinct(panel_id, diff_k) |>
  arrange(panel_id) |>
  mutate(panel_label = c(
    sprintf("\u2605 Observed  \u0394 = $%+.2f", diff_k[1]),
    sprintf("Shuffle %d  \u0394 = $%+.2f", seq_len(n_show), diff_k[-1])
  ))

all_pnls <- left_join(all_pnls, panel_meta |> dplyr::select(panel_id, panel_label),
                      by = "panel_id") |>
  mutate(panel_label = factor(panel_label, levels = panel_meta$panel_label))

ggplot(all_pnls, aes(x = group, y = wtp, color = group)) +
  geom_jitter(width = 0.14, alpha = 0.45, size = 1.6) +
  stat_summary(fun = mean, geom = "crossbar", width = 0.45,
               fatten = 2.5, linewidth = 0.7, color = "black") +
  facet_wrap(~panel_label, nrow = 3, ncol = 3) +
  scale_color_manual(values = c("Control" = "#4a90d9", "Eco-label" = "#2d6a4f")) +
  scale_y_continuous(limits = c(1, 10), breaks = c(2, 5, 8),
                     labels = function(x) paste0("$", x)) +
  labs(x = NULL, y = "WTP ($)",
       title = "The Labels Change — the Data Do Not",
       subtitle = "Each panel re-assigns the same 60 WTP values to groups  \u00b7  Crossbar = group mean") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        strip.text = element_text(size = 8.5, face = "bold"),
        axis.text.x = element_text(size = 8),
        panel.grid.minor = element_blank(),
        panel.spacing = unit(0.8, "lines"))

The starred panel (★ Observed) is the only one that reflects the actual group assignment. All other panels show what chance alone could have produced. The null distribution is nothing more than the collection of Δ values across all possible such panels.

The Full Null Distribution

▶ Plot: null distribution with observed difference highlighted

null_df <- data.frame(
  diff    = null_dist,
  extreme = abs(null_dist) >= abs(obs_diff)
)

q025 <- quantile(null_dist, 0.025)
q975 <- quantile(null_dist, 0.975)

# Shared break-points so both histogram layers align perfectly
null_breaks <- seq(
  floor(min(null_dist) * 20) / 20 - 0.025,
  ceiling(max(null_dist) * 20) / 20 + 0.025,
  by = 0.05
)

ggplot(null_df, aes(x = diff, fill = extreme)) +
  geom_histogram(aes(y = after_stat(density)),
                 breaks = null_breaks, color = "white", linewidth = 0.15) +
  geom_density(aes(x = diff), data = null_df, inherit.aes = FALSE,
               color = "#08519c", linewidth = 1.0) +
  geom_vline(xintercept =  obs_diff, color = "#d7301f", linewidth = 1.6) +
  geom_vline(xintercept = -obs_diff, color = "#d7301f", linewidth = 1.0,
             linetype = "dashed") +
  geom_vline(xintercept = q025, color = "#525252", linewidth = 0.7,
             linetype = "dotted") +
  geom_vline(xintercept = q975, color = "#525252", linewidth = 0.7,
             linetype = "dotted") +
  annotate("label", x = 0, y = max(density(null_dist)$y) * 0.82,
           label = sprintf("p = %.3f", p_val),
           fill = "#deebf7", color = "#08306b", fontface = "bold", size = 5.5,
           label.size = 0.4, label.padding = unit(0.35, "lines")) +
  annotate("text",
           x = obs_diff + sign(obs_diff) * 0.08,
           y = max(density(null_dist)$y) * 0.52,
           label = sprintf("Observed\n\u0394 = $%.2f", obs_diff),
           color = "#d7301f", fontface = "bold", size = 3.5,
           hjust = ifelse(obs_diff >= 0, 0, 1)) +
  annotate("text", x = q025 - 0.03, y = 0.16,
           label = "2.5th\npercentile", color = "#525252", size = 2.9, hjust = 1) +
  annotate("text", x = q975 + 0.03, y = 0.16,
           label = "97.5th\npercentile", color = "#525252", size = 2.9, hjust = 0) +
  scale_fill_manual(
    values = c("FALSE" = "#9ecae1", "TRUE" = "#de2d26"),
    labels = c(expression(paste("Consistent with ", H[0])),
               "As or more extreme than observed"),
    name   = NULL
  ) +
  scale_x_continuous(breaks = seq(-1.5, 1.5, 0.5),
                     labels = function(x) sprintf("$%.1f", x)) +
  labs(
    x = "Mean difference (Eco-label \u2212 Control) across 10,000 shuffles",
    y = "Density",
    title = "Null Distribution: What Chance Alone Produces",
    subtitle = sprintf(
      "Blue = consistent with H0  \u00b7  Red = as or more extreme (p = %.3f)  \u00b7  Dotted = 95%% CI of null",
      p_val)
  ) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.minor  = element_blank(),
    panel.grid.major  = element_line(color = "#eeeeee"),
    legend.position   = "bottom",
    plot.title        = element_text(face = "bold"),
    plot.subtitle     = element_text(color = "#555555", size = 10.5)
  )

The p-value of 0.052 tells us: 5.2% of shuffled datasets produced a difference as large as $0.78 or larger — meaning this gap arises by chance roughly 1 time in 19 under H₀. Look at where the red line lands relative to the tails of the null distribution. The conventional α = .05 cutoff is not a wall separating “real” effects from chance: it is a decision rule, and results near the boundary carry substantial uncertainty regardless of which side they fall on.

What the p-value is and is NOT

It IS: The probability of a result at least this extreme, given that H₀ is true.

It is NOT: - The probability that H₀ is true - The probability that the effect is real - A measure of the size or importance of the effect - A guarantee that you have not made an error

The p-value is a statement about a hypothetical distribution — the null world — not a statement about reality.

Where Does a P-value Get Its Power?

A p-value becomes small (informative) only when the observed result falls far into the tail of the null distribution. Three things push results into the tail:

A large true effect (δ) — the signal is genuinely far from zero
A large sample size (N) — random variability shrinks, so even moderate signals stand out
A low-noise outcome — tight distributions make small differences more visible

How big an effect can N = 60 actually detect?

Before running a study, it is worth asking: if the eco-label truly has effect δ, how many participants do I need to detect it 80% of the time?

▶ Compute minimum N for different true effect sizes

true_deltas <- c(0.20, 0.40, 0.60, 0.80, 1.00, 1.50)

power_tab <- data.frame(
  true_eff  = sprintf("$%.2f", true_deltas),
  cohens_d  = round(true_deltas / sig, 2),
  n_per_grp = sapply(true_deltas, function(d) {
    ceiling(power.t.test(delta = d, sd = sig, sig.level = 0.05,
                         power = 0.80, type = "two.sample")$n)
  }),
  check.names = FALSE
) |>
  mutate(total_n    = n_per_grp * 2,
         detectable = ifelse(total_n <= 60, "\u2713 Yes", "\u2717 No"))

names(power_tab) <- c("True effect (\u03b4)", "Cohen\u2019s d",
                       "N per group (80% power)", "Total N", "Detectable at N = 60?")

kable(power_tab,
      caption = sprintf(
        "Minimum N for 80%% power at \u03b1 = .05 (SD = $%.2f, two-sided t-test)", sig))

Minimum N for 80% power at α = .05 (SD = $1.50, two-sided t-test)
True effect (δ)	Cohen’s d	N per group (80% power)	Total N	Detectable at N = 60?
$0.20	0.13	884	1768	✗ No
$0.40	0.27	222	444	✗ No
$0.60	0.40	100	200	✗ No
$0.80	0.53	57	114	✗ No
$1.00	0.67	37	74	✗ No
$1.50	1.00	17	34	✓ Yes

▶ Plot: p-value distributions for different true effect sizes

sim_pval <- function(effect, N = 60, mu = 5.5, sig = 1.5) {
  y_c <- rnorm(N / 2, mu,          sig)
  y_t <- rnorm(N / 2, mu + effect, sig)
  t.test(y_t, y_c)$p.value
}

effects <- c(0, 0.30, 0.60, 1.20)
labels  <- c("No effect (\u03b4 = 0)",
             "Small (\u03b4 = $0.30)",
             "Medium (\u03b4 = $0.60)",
             "Large (\u03b4 = $1.20)")

set.seed(42)
pval_df <- map2_dfr(effects, labels, function(eff, lbl) {
  data.frame(effect = lbl, p = replicate(5000, sim_pval(eff)))
}) |>
  mutate(effect = factor(effect, levels = labels),
         sig    = p < 0.05)

pval_df |>
  group_by(effect) |>
  summarise(power = mean(sig), .groups = "drop") |>
  mutate(power = percent(power, accuracy = 0.1)) |>
  kable(col.names = c("True Effect", "% experiments with p < .05"),
        caption = "Statistical power at \u03b1 = .05 (N = 30 per group, total N = 60, SD = $1.50)")

Statistical power at α = .05 (N = 30 per group, total N = 60, SD = $1.50)
True Effect	% experiments with p < .05
No effect (δ = 0)	5.2%
Small (δ = $0.30)	12.5%
Medium (δ = $0.60)	33.9%
Large (δ = $1.20)	85.8%

▶ Plot: p-value histograms by effect size

ggplot(pval_df, aes(x = p, fill = sig)) +
  geom_histogram(breaks = seq(0, 1, by = 0.025), color = "white", alpha = 0.9) +
  geom_vline(xintercept = 0.05, linetype = "dashed", color = "black", linewidth = 0.9) +
  scale_fill_manual(values = c("FALSE" = "#9ecae1", "TRUE" = "#de2d26"),
                    labels = c("p \u2265 .05", "p < .05"), name = NULL) +
  facet_wrap(~effect, nrow = 1, scales = "free_y") +
  labs(x = "p-value", y = "Count (5,000 simulated experiments)",
       title = "Distribution of P-values Across Effect Sizes",
       subtitle = "Under H\u2080 (\u03b4 = 0): uniform  \u00b7  As \u03b4 grows: pile-up near zero") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom", panel.grid.minor = element_blank(),
        strip.text = element_text(size = 9.5))

The uniform distribution under H₀ is not a coincidence

When the null is exactly true, p-values are uniformly distributed between 0 and 1. This means that 5% of null experiments will produce p < .05, 1% will produce p < .01, and so on.

This is the false positive rate, not a coincidence. α is not a threshold for truth — it is the rate at which you agree to be wrong when nothing is actually going on.

A practical implication: across a literature of 100 experiments where H₀ is true in all of them, you would still expect ~5 published papers with p < .05.

The Assumptions Behind Every P-value

The permutation logic above works because of one foundational assumption:

Exchangeability: Under H₀, the labels are arbitrary — any assignment of observations to groups is equally likely.

This assumption is satisfied when: - Participants are a random sample from the population - Treatment was assigned randomly and independently

If either fails, the null distribution we construct by shuffling labels does not correspond to the distribution under the true null. The p-value becomes meaningless — or worse, systematically biased.

The act of randomization is what makes exchangeability plausible. Which brings us to Parts 2 and 3.

Researcher Checklist: P-Values and Statistical Inference

Key questions before reporting a p-value

Do you know what your p-value is testing? It measures how surprising your data would be if the null were true — not the probability the effect is real, not the probability the null is true.
Are the three ingredients in place? (1) The same data-generating process in both conditions. (2) Exchangeability under H₀ — groups are interchangeable if there is no effect. (3) A test statistic chosen before seeing the data. If any one fails, the p-value is invalid.
Is your observed effect large enough to matter, not just to be significant? Report effect size in interpretable units alongside the p-value. Large samples produce significant p-values for trivially small effects.
Did you pre-register the analysis? Any flexibility in stopping rules, exclusion criteria, or choice of test statistic inflates the false-positive rate beyond the nominal α.