Part 1: Discriminant Validity

▶ Load required packages
# Uncomment to install if needed:
# install.packages(c("lavaan", "semTools", "MASS", "ggplot2",
#                    "dplyr", "tidyr", "corrplot", "psych", "knitr", "lmtest",
#                    "mclust", "dbscan"))

library(lavaan)       # CFA and SEM (Parts 1 and 3)
library(semTools)     # htmt() and auxiliary SEM tools (Parts 1 and 3)
library(MASS)         # mvrnorm(): generate multivariate normal data (all parts)
library(ggplot2)      # Visualizations (all parts)
library(dplyr)        # Data manipulation (all parts)
library(tidyr)        # Data reshaping (Parts 2 and 3)
library(corrplot)     # Correlation heatmap (Part 1)
library(psych)        # EFA (Part 1)
library(knitr)        # Nicely formatted tables (all parts)
library(lmtest)       # Breusch-Pagan heteroskedasticity test (Part 2)
library(mclust)       # Gaussian mixture models / latent class analysis (Part 4)
library(dbscan)       # Local outlier factor and DBSCAN clustering (Part 5)


The Classical Test Theory Starting Point

Every measurement model in the social sciences begins from the same core assumption: the observed score you record is a sum of the true score and measurement error.

\[X_\text{observed} = T_\text{true} + \varepsilon\]

Classical test theory (CTT) assumes the error term \(\varepsilon\) is random and symmetric — equally likely to push the observed score up or down, averaging to zero across many observations. Under this assumption, observable scores are noisy but unbiased indicators of the latent construct you care about.
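This assumption is easy to see in a few lines of simulation. The sketch below uses illustrative values (a true-score mean of 5, error SD of 0.8), not the tutorial's dataset: the observed mean stays on target while the observed spread is inflated by the error.

```r
# Minimal sketch of the CTT assumption (illustrative values, not the
# tutorial's dataset): random error makes individual scores noisy, but it
# averages out, so the observed mean tracks the true-score mean.
set.seed(123)
T_true <- rnorm(1000, mean = 5, sd = 1)    # latent true scores
eps    <- rnorm(1000, mean = 0, sd = 0.8)  # random, symmetric error
X_obs  <- T_true + eps                     # observed = true + error

round(c(mean_true = mean(T_true), mean_obs = mean(X_obs)), 2)  # nearly equal
round(c(sd_true   = sd(T_true),   sd_obs   = sd(X_obs)),   2)  # SD inflated by error
```

The error adds variance but no systematic shift, which is exactly what the rest of this tutorial shows being violated.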

But this assumption is almost always violated in practice. The measurement error is not symmetric — it is contaminated by something else. The five parts in this tutorial are five different ways that “something else” creeps into your observed score:

  • When your scale picks up a second construct (Part 1), ε includes that construct’s variance — and is no longer random.
  • When an omitted variable drives both your predictor and outcome (Part 2), the “error” in your regression coefficient is systematic, not random.
  • When scale items shift their baseline across groups (Part 3), ε differs systematically between conditions.
  • When your sample mixes two populations (Part 4), ε has a bimodal structure that standard models cannot represent.
  • When your data contain influential outliers (Part 5), ε includes extreme values that disproportionately distort regression coefficients — collective outliers are a version of the latent-subgroup problem.

Discriminant validity failure is the case where the measurement error ε in your observed scale is not random noise — it is structured variance from a second latent construct. Suppose we are trying to measure Green Purchase Intentions (GPI), but our measurement also captures a related but separable construct, Environmental Concern (EC). The CTT equation then becomes:

\[X_\text{GPI} = T_\text{true GPI} + \lambda \cdot T_\text{EC} + \varepsilon_\text{random}\]

where \(\lambda\) captures how much of Environmental Concern bleeds into your GPI items. This is no longer a pure measure of one construct.
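A few lines of simulation show what this contamination does to the numbers. The values below (a latent correlation of .50 and lambda = 0.6) are illustrative, not the tutorial's simulation settings: the contaminated GPI score correlates with EC well above the latent correlation.

```r
# Hedged sketch of the contaminated-measurement equation (lambda and the
# latent correlation are illustrative): when lambda > 0, the observed GPI
# score inherits EC variance, inflating the observed EC-GPI correlation.
set.seed(42)
n      <- 10000
EC     <- rnorm(n)                              # latent Environmental Concern
GPI    <- 0.5 * EC + sqrt(1 - 0.25) * rnorm(n)  # latent GPI, cor(EC, GPI) = .50
lambda <- 0.6                                   # contamination strength
X_GPI  <- GPI + lambda * EC + rnorm(n)          # observed GPI picks up EC variance

round(cor(EC, GPI),   2)  # latent correlation, about .50
round(cor(EC, X_GPI), 2)  # observed correlation, noticeably higher
```

The observed correlation overstates how related the constructs are, because the "GPI" score is partly an EC score in disguise.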

Note: The same problem in other disciplines

The discriminant validity problem is not unique to scale-based research. The same underlying issue — an observable index picking up multiple latent sources of variance — shows up across many fields under different names:

  • Econometrics / identification: A regression coefficient is said to be unidentified (or not identified) when the predictor of interest cannot be separated from another variable that moves with it. In structural equation modeling, “identification” requires that each latent variable has its own distinct set of indicators — exactly the discriminant validity condition.
  • Epidemiology / biomedical research: When a biomarker (e.g., C-reactive protein) is elevated by both inflammation and metabolic syndrome simultaneously, researchers say it lacks specificity. A test that responds to multiple conditions cannot clearly indicate which one is present — the biomedical equivalent of discriminant validity failure.
  • Psychometrics: The problem is described as construct contamination — the scale measures variance beyond the intended construct. The closely related concept of construct deficiency is the mirror image: the scale misses important facets of the construct it is supposed to measure.
  • Political science / social measurement: Composite indices (e.g., democracy scores, corruption indices) frequently collapse multiple distinct dimensions into a single number. Researchers debate whether “democracy” as measured is really one thing or whether it conflates civil liberties, electoral competitiveness, and rule of law — each with different causes and consequences.

The common thread: whenever you use a single observed variable (or a scale of items) to proxy a latent construct, you need to verify that the observable is specific to that construct and not a blend of several.

The Study Scenario

We are simulating data from a green marketing experiment with 400 participants. The study was designed to answer: does exposure to eco-friendly brand messaging increase consumers’ green purchase intentions?

Experimental design:

  • Roughly half the participants saw standard product advertising (Control)
  • The rest saw the same ads emphasising eco-friendly credentials (Green Marketing)

Constructs measured (all on 7-point Likert scales):

Construct Abbreviation # Items Role
Environmental Concern EC 4 Covariate
Green Purchase Intention GPI 4 Dependent Variable
Brand Attitude BA 3 Covariate

The hidden problem we planted in the data: The GPI scale was designed to measure purchase intention — but its items ended up being nearly indistinguishable from items measuring environmental concern. In real research, this often happens when constructs are conceptually close (caring about the environment vs. intending to buy green products). The two scales fail discriminant validity.

In all five parts in this tutorial, the core problem is the same: the observed Y you are working with is an imperfect, contaminated proxy for the latent Y you are trying to study.

Warning: What “failing discriminant validity” means here

The GPI (DV) scale picks up variation from two latent constructs — Green Purchase Intention and Environmental Concern — rather than cleanly measuring just one. This means any effect of the marketing campaign on GPI is hard to interpret: are participants more willing to buy, or just more environmentally concerned?

Important: Remember, these methods only apply in specific situations

Discriminant validity is always a concern whenever you have two related constructs in your study — but the methods below (HTMT and DVI) only work in a specific setting:

  • You need multi-item scales (e.g., 4-item Likert scales for GPI and EC)
  • You need at least 4 Likert-type items per construct — with fewer items, the statistics become unreliable
  • These methods do NOT apply to single-item outcomes, willingness-to-pay (WTP), choice data, behavioral measures (e.g., actual purchases), or any non-Likert response format

If your DV is WTP or a behavioral measure, you cannot use HTMT or DVI to check discriminant validity — but that doesn’t mean discriminant validity isn’t a problem! It just means the problem is harder to detect.


Simulating the Data

We use MASS::mvrnorm() to generate correlated latent factor scores, then create item scores from those factors plus random measurement error. The items are rounded to a 7-point Likert scale.

The key feature we’re building in: Environmental Concern (EC) and Green Purchase Intention (GPI) are correlated at 0.93 in the population — extremely high, and above the 0.90 threshold commonly used in HTMT-based discriminant validity tests (Henseler et al., 2015). When the HTMT ratio approaches or exceeds 0.90, two constructs are considered statistically indistinguishable and cannot be treated as separate in a model.

▶ Simulate green marketing dataset (n=400)
set.seed(2025)
n <- 400

# ── Step 1: Assign participants to conditions ──────────────────────────────────
treatment <- sample(c(0L, 1L), size = n, replace = TRUE)

# ── Step 2: Define the true factor correlation matrix ─────────────────────────
# EC and GPI are correlated at .93 — too high for discriminant validity
# EC–BA and GPI–BA are at more typical, moderate levels
phi_pop <- matrix(
  c(1.00, 0.93, 0.45,   # EC row
    0.93, 1.00, 0.50,   # GPI row
    0.45, 0.50, 1.00),  # BA row
  nrow = 3, byrow = TRUE,
  dimnames = list(c("EC", "GPI", "BA"), c("EC", "GPI", "BA"))
)

# ── Step 3: Generate latent factor scores ─────────────────────────────────────
latent_base <- MASS::mvrnorm(n = n, mu = c(0, 0, 0), Sigma = phi_pop)

EC_lat  <- latent_base[, 1]
GPI_lat <- latent_base[, 2] + 0.40 * treatment  # Green marketing raises GPI by .40 SD
BA_lat  <- latent_base[, 3]

# ── Step 4: Define item loadings ──────────────────────────────────────────────
# These represent how strongly each item reflects its latent construct
lambda_EC  <- c(0.78, 0.82, 0.74, 0.76)
lambda_GPI <- c(0.80, 0.76, 0.82, 0.78)
lambda_BA  <- c(0.72, 0.76, 0.70)

# ── Step 5: Generate continuous item scores (latent score + measurement error) ─
gen_items <- function(latent, loadings) {
  sapply(loadings, function(lam) {
    lam * latent + sqrt(1 - lam^2) * rnorm(length(latent))
  })
}

EC_cont  <- gen_items(EC_lat,  lambda_EC)
GPI_cont <- gen_items(GPI_lat, lambda_GPI)
BA_cont  <- gen_items(BA_lat,  lambda_BA)

# ── Step 6: Round to 7-point Likert scale ─────────────────────────────────────
# Cut the continuous distribution into 7 ordered categories
to_likert7 <- function(x) {
  z <- (x - mean(x)) / sd(x)   # standardise each item
  breaks <- c(-Inf, -1.5, -0.75, -0.25, 0.25, 0.75, 1.5, Inf)
  as.integer(cut(z, breaks = breaks, labels = 1:7))
}

EC_lik  <- apply(EC_cont,  2, to_likert7)
GPI_lik <- apply(GPI_cont, 2, to_likert7)
BA_lik  <- apply(BA_cont,  2, to_likert7)

# ── Step 7: Assemble the final data frame ─────────────────────────────────────
df <- data.frame(
  id        = 1:n,
  condition = factor(treatment, levels = c(0, 1),
                     labels = c("Control", "Green Marketing")),
  EC1  = EC_lik[, 1], EC2  = EC_lik[, 2],
  EC3  = EC_lik[, 3], EC4  = EC_lik[, 4],
  GPI1 = GPI_lik[, 1], GPI2 = GPI_lik[, 2],
  GPI3 = GPI_lik[, 3], GPI4 = GPI_lik[, 4],
  BA1  = BA_lik[, 1], BA2  = BA_lik[, 2], BA3  = BA_lik[, 3]
)

# Quick look at the data
head(df, 5)
id condition EC1 EC2 EC3 EC4 GPI1 GPI2 GPI3 GPI4 BA1 BA2 BA3
1 Control 6 5 4 3 4 3 3 2 2 6 6
2 Green Marketing 7 7 7 7 7 7 7 7 5 7 7
3 Green Marketing 5 6 4 4 5 3 5 4 5 5 6
4 Green Marketing 4 4 1 2 5 3 4 4 7 3 6
5 Control 4 4 3 3 3 3 4 2 5 3 3
▶ Descriptive statistics by condition
# Count participants per condition and compute scale means
df |>
  group_by(condition) |>
  summarise(
    n          = n(),
    Mean_EC    = round(rowMeans(across(EC1:EC4))  |> mean(), 2),
    Mean_GPI   = round(rowMeans(across(GPI1:GPI4)) |> mean(), 2),
    Mean_BA    = round(rowMeans(across(BA1:BA3))  |> mean(), 2)
  ) |>
  kable(col.names = c("Condition", "N", "Mean EC", "Mean GPI", "Mean BA"),
        caption = "Sample sizes and scale means by condition")
Sample sizes and scale means by condition
Condition N Mean EC Mean GPI Mean BA
Control 187 4.12 3.77 4.00
Green Marketing 213 3.92 4.19 4.01
Note

The green marketing manipulation works as intended: participants in the green marketing condition score higher on GPI (Green Purchase Intention) than those in the control condition.


First Look: Convergent and Discriminant Validity in the Raw Data

Before running any formal tests, let’s look at the raw correlation structure of the 11 items. This heatmap tells you two things at once:

  • Convergent validity: Items within the same scale should correlate strongly with each other. If they don’t, the items aren’t all measuring the same construct.
  • Discriminant validity: Items from different scales should correlate noticeably less than items within the same scale. If they don’t, the two scales cannot be told apart.

A healthy pattern: dark red within-scale blocks (strong convergent validity) and lighter between-scale blocks (good discriminant validity). The warning sign: when the between-scale block looks just as dark as the within-scale blocks.

▶ Plot: item correlation heatmap
# Extract just the item columns (no ID or condition)
items_df <- df |> select(EC1:BA3)

# Compute correlation matrix
item_cors <- cor(items_df)

# Color-coded heatmap
# Warm colors = high positive correlation; cool colors = low/negative
corrplot(item_cors,
         method   = "color",
         type     = "lower",
         order    = "original",     # keep original order so EC, GPI, BA stay grouped
         tl.col   = "black",
         tl.cex   = 0.85,
         addCoef.col = "white",     # print correlation values
         number.cex  = 0.65,
         col      = colorRampPalette(c("#313695", "#74add1", "#e0f3f8",
                                       "#fee090", "#f46d43", "#a50026"))(200),
         title    = "Item Correlation Matrix",
         mar      = c(0, 0, 1.5, 0))

Warning: What to look for in the heatmap

Compare the red blocks in the heatmap:

  • The EC–EC block (top-left): high correlations ✓ (items measure the same thing)
  • The GPI–GPI block (middle): high correlations ✓
  • The EC–GPI block (the rectangle connecting EC and GPI items): also very high correlations ⚠️

When between-scale correlations (EC–GPI) are as strong as within-scale correlations (EC–EC, GPI–GPI), the two scales cannot be told apart. This is the discriminant validity problem in visual form.

The table below makes the problem even clearer by summarizing the average within-scale versus between-scale correlations. Think of it this way:

  • Within-scale (e.g., EC1 with EC2, EC3, EC4): These are correlations among items that are supposed to be measuring the same thing. They should be high — that’s convergent validity working.
  • Between-scale (e.g., EC1 with GPI1, GPI2, GPI3, GPI4): These are correlations among items from different constructs. They should be noticeably lower than the within-scale correlations. If they’re not, the two scales are too similar to distinguish — that’s a discriminant validity failure.
▶ Compute average within- and between-scale correlations
# Compute average correlations within each scale and between scales
cors <- cor(items_df)

within_EC  <- mean(cors[1:4, 1:4][lower.tri(cors[1:4, 1:4])])
within_GPI <- mean(cors[5:8, 5:8][lower.tri(cors[5:8, 5:8])])
within_BA  <- mean(cors[9:11, 9:11][lower.tri(cors[9:11, 9:11])])

between_EC_GPI <- mean(abs(cors[1:4, 5:8]))
between_EC_BA  <- mean(abs(cors[1:4, 9:11]))
between_GPI_BA <- mean(abs(cors[5:8, 9:11]))

avg_cor_summary <- data.frame(
  Type  = c("Within EC",  "Within GPI", "Within BA",
            "Between EC–GPI ⚠️", "Between EC–BA", "Between GPI–BA"),
  Avg_r = round(c(within_EC, within_GPI, within_BA,
                  between_EC_GPI, between_EC_BA, between_GPI_BA), 3)
)
kable(avg_cor_summary, col.names = c("Correlation Type", "Average |r|"),
      caption = "Average within-scale and between-scale correlations")
Average within-scale and between-scale correlations
Correlation Type Average |r|
Within EC 0.587
Within GPI 0.583
Within BA 0.494
Between EC–GPI ⚠️ 0.529
Between EC–BA 0.265
Between GPI–BA 0.293
Important: What the numbers are telling you

Look at the table and ask yourself: How different are the within-scale and between-scale numbers for EC and GPI?

Imagine you’re a researcher who has no idea the data have a problem. You see:

  • EC items correlate with each other at around 0.59
  • GPI items correlate with each other at around 0.58
  • But EC items and GPI items correlate with each other at around 0.53

Those three numbers are strikingly close. Knowing someone’s EC score tells you almost as much about their GPI items as their own GPI scores do. The two scales are effectively measuring the same thing, just with different item wording. That’s the discriminant validity failure, visible in plain numbers before you run any formal test.

The HTMT ratio formalises exactly this comparison. If within-scale and between-scale correlations are similar, HTMT will be close to 1.0.
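To demystify that ratio before running the packaged version, here is the original (arithmetic-mean) HTMT computed by hand. Toy two-factor data are generated here so the snippet stands alone (the data-generating values are illustrative); with the tutorial's data you would start from cor(items_df) instead.

```r
# Hedged sketch: the original (arithmetic-mean) HTMT computed by hand.
# Two 4-item scales are simulated from factors correlated at .90.
set.seed(7)
n  <- 500
F1 <- rnorm(n)
F2 <- 0.9 * F1 + sqrt(1 - 0.81) * rnorm(n)   # factors correlated at .90

# four items per factor: loading .8 plus unique error
itemize <- function(f) sapply(1:4, function(i) 0.8 * f + 0.6 * rnorm(n))
items   <- cbind(itemize(F1), itemize(F2))   # cols 1:4 = scale A, 5:8 = scale B

cors   <- cor(items)
hetero <- mean(cors[1:4, 5:8])                            # between-scale item pairs
mono_A <- mean(cors[1:4, 1:4][lower.tri(cors[1:4, 1:4])]) # within scale A
mono_B <- mean(cors[5:8, 5:8][lower.tri(cors[5:8, 5:8])]) # within scale B

htmt_hand <- hetero / sqrt(mono_A * mono_B)  # values near 1 = indistinguishable scales
round(htmt_hand, 2)
```

The ratio divides the average between-scale correlation by the geometric mean of the average within-scale correlations, which is exactly the within-versus-between comparison made in the table above.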


Factor Analysis and Convergent Validity

Visual inspection of the correlation heatmap gives a first diagnostic pass. Before turning to formal discriminant validity tests, we need to establish convergent validity more formally — confirming that items designed to measure the same construct actually cluster together as theory predicts. Factor analysis is the primary tool for this.

Factor analysis starts from the same classical test theory logic introduced at the beginning of this section: each observed item score is a noisy reflection of a latent construct plus error. Factor analysis inverts the question — given the pattern of correlations among many observed items, it estimates how many latent constructs (factors) are needed to explain those correlations, and how strongly each item loads on each factor.

Two related approaches are used at different stages of scale development:

  • Exploratory Factor Analysis (EFA): Lets the data reveal the latent structure with minimal constraints. EFA is your first look — use it to check whether the factor structure you expected actually emerges before committing to a specific confirmatory model.
  • Confirmatory Factor Analysis (CFA): Tests a specific, theory-driven structure. You specify which items should load on which factors; CFA evaluates how well that constrained model fits the data and reports formal fit indices.

Both tools answer the same convergent validity question: do the items meant to measure the same construct actually cluster together?

Exploratory Factor Analysis

EFA begins with no assumptions about which items belong to which factor. Two diagnostics guide how many factors to retain:

Eigenvalues measure how much total item variance each factor explains. The Kaiser criterion (eigenvalue > 1.0) is a rough rule of thumb: factors that explain more variance than a single observed item are worth retaining.

Parallel analysis is more rigorous: it compares observed eigenvalues to eigenvalues extracted from random data with the same dimensions. Retain only factors where the observed eigenvalue clearly exceeds the random baseline — those factors are capturing real structure, not chance variation.

▶ Scree plot with parallel analysis
# fa.parallel() overlays observed eigenvalues on eigenvalues from 200 randomly
# generated datasets of the same size. Points where the observed line (solid)
# sits clearly above the random baseline (dashed) indicate real factors.
fa.parallel(
  items_df,
  fa          = "fa",    # factor analysis (not principal components)
  fm          = "ml",    # maximum likelihood extraction
  n.iter      = 200,     # random datasets for the parallel baseline
  main        = "Scree Plot with Parallel Analysis",
  show.legend = TRUE
)

Parallel analysis suggests that the number of factors =  3  and the number of components =  NA 
Note: Reading the scree plot
  • Solid line / triangles: eigenvalues from your actual data
  • Dashed line: eigenvalues from randomly generated data with no factor structure

Retain the number of factors where the solid line is clearly above the dashed line. Where the two lines meet or cross, the factor is explaining no more than chance variation.

What to expect here: With 11 items designed as 3 distinct constructs, we hope to see the plot clearly suggest 3 factors. But because EC and GPI are simulated with a 0.93 correlation, the data may only clearly support 2 factors — treating EC and GPI as an undifferentiated block. That ambiguity between 2 and 3 factors is itself a warning signal about discriminant validity.

Now run the EFA requesting 3 factors. Because psychological constructs measured in the same survey are typically correlated with each other, we use oblique rotation (rotate = "oblimin"), which allows the extracted factors to correlate — a more realistic assumption than the orthogonality imposed by varimax.

▶ EFA with 3 factors (oblique rotation)
efa_fit <- fa(
  items_df,
  nfactors = 3,           # number of factors to extract
  rotate   = "oblimin",   # oblique rotation — allows correlated factors
  fm       = "ml",        # maximum likelihood estimation
  scores   = "regression"
)

# Print loadings; values below 0.30 suppressed to highlight the pattern
print(loadings(efa_fit), cutoff = 0.30, sort = FALSE)

Loadings:
     ML1    ML2    ML3   
EC1   0.775              
EC2   0.782              
EC3   0.782              
EC4   0.787              
GPI1  0.691              
GPI2  0.698              
GPI3  0.636              
GPI4  0.613              
BA1          0.586       
BA2          0.783       
BA3          0.724       

                 ML1   ML2   ML3
SS loadings    4.193 1.505 0.300
Proportion Var 0.381 0.137 0.027
Cumulative Var 0.381 0.518 0.545
Note: How to read the EFA loadings table

Each row is an item; each column is a factor. A loading shows how strongly that item is associated with that factor — values above 0.40 are typically considered meaningful, values above 0.70 are strong.

Signs of good convergent validity: All EC items (EC1–EC4) load strongly on the same factor, all GPI items on a different factor, and all BA items on a third factor.

What to watch for here: Because EC and GPI are simulated with a 0.93 latent correlation, the EFA may not cleanly separate them. GPI items may show substantial loadings on the same factor as EC items — a pattern called cross-loading. This is the factor-analytic signature of discriminant validity failure, and it directly foreshadows what HTMT and DVI will quantify in the next section.

▶ Variance explained by each factor
efa_var <- data.frame(
  Factor      = paste0("Factor ", 1:3),
  SS_Loadings = round(efa_fit$Vaccounted["SS loadings",     ], 3),
  Prop_Var    = round(efa_fit$Vaccounted["Proportion Var",  ], 3),
  Cum_Var     = round(efa_fit$Vaccounted["Cumulative Var",  ], 3)
)
kable(efa_var,
      col.names = c("Factor", "SS Loadings",
                    "Proportion of Variance", "Cumulative Variance"),
      caption   = "EFA: variance explained by each factor")
EFA: variance explained by each factor
Factor SS Loadings Proportion of Variance Cumulative Variance
Factor 1 4.319 0.393 0.393
Factor 2 1.592 0.145 0.537
Factor 3 0.353 0.032 0.569

Confirmatory Factor Analysis for Convergent Validity

EFA is exploratory — it finds factors, but does not test whether a specific theoretically motivated structure fits. CFA imposes the structure you designed: EC items load only on EC, GPI items only on GPI, BA items only on BA — no cross-loadings permitted. The question is how well this constrained model fits the observed data.

Key fit indices:

Index What it measures Good fit
CFI (Comparative Fit Index) How much better the model fits than a null model ≥ 0.95
TLI (Tucker–Lewis Index) Similar to CFI; penalises model complexity ≥ 0.95
RMSEA Absolute discrepancy per degree of freedom ≤ 0.06
SRMR Average residual correlation across all item pairs ≤ 0.08

Standardised factor loadings are the central convergent validity evidence. Each loading tells you how strongly an item reflects its intended latent construct — values above 0.50 are acceptable; above 0.70 indicate strong item–construct correspondence.

▶ CFA: test the hypothesised 3-factor structure
cfa_model_conv <- '
  EC  =~ EC1 + EC2 + EC3 + EC4
  GPI =~ GPI1 + GPI2 + GPI3 + GPI4
  BA  =~ BA1  + BA2  + BA3
'

# std.lv = TRUE: fixes factor variances to 1 so loadings and factor
# correlations are directly interpretable on a standardised scale
fit_conv <- cfa(cfa_model_conv, data = items_df, std.lv = TRUE)
summary(fit_conv, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-21 ended normally after 24 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        25

  Number of observations                           400

Model Test User Model:
                                                      
  Test statistic                                45.994
  Degrees of freedom                                41
  P-value (Chi-square)                           0.273

Model Test Baseline Model:

  Test statistic                              2071.349
  Degrees of freedom                                55
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.998
  Tucker-Lewis Index (TLI)                       0.997

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -7482.985
  Loglikelihood unrestricted model (H1)      -7459.988
                                                      
  Akaike (AIC)                               15015.971
  Bayesian (BIC)                             15115.757
  Sample-size adjusted Bayesian (SABIC)      15036.430

Root Mean Square Error of Approximation:

  RMSEA                                          0.017
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.040
  P-value H_0: RMSEA <= 0.050                    0.996
  P-value H_0: RMSEA >= 0.080                    0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.027

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  EC =~                                                                 
    EC1               1.246    0.073   16.967    0.000    1.246    0.754
    EC2               1.273    0.071   17.856    0.000    1.273    0.781
    EC3               1.287    0.073   17.621    0.000    1.287    0.774
    EC4               1.267    0.075   17.008    0.000    1.267    0.755
  GPI =~                                                                
    GPI1              1.316    0.073   18.104    0.000    1.316    0.788
    GPI2              1.262    0.074   16.984    0.000    1.262    0.753
    GPI3              1.318    0.073   17.990    0.000    1.318    0.784
    GPI4              1.245    0.077   16.253    0.000    1.245    0.730
  BA =~                                                                 
    BA1               1.100    0.087   12.616    0.000    1.100    0.648
    BA2               1.150    0.083   13.802    0.000    1.150    0.704
    BA3               1.270    0.084   15.028    0.000    1.270    0.762

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  EC ~~                                                                 
    GPI               0.905    0.020   45.112    0.000    0.905    0.905
    BA                0.489    0.051    9.577    0.000    0.489    0.489
  GPI ~~                                                                
    BA                0.539    0.049   11.091    0.000    0.539    0.539

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .EC1               1.182    0.100   11.773    0.000    1.182    0.432
   .EC2               1.035    0.092   11.309    0.000    1.035    0.390
   .EC3               1.109    0.097   11.441    0.000    1.109    0.401
   .EC4               1.213    0.103   11.754    0.000    1.213    0.430
   .GPI1              1.060    0.094   11.249    0.000    1.060    0.380
   .GPI2              1.215    0.103   11.836    0.000    1.215    0.433
   .GPI3              1.087    0.096   11.316    0.000    1.087    0.385
   .GPI4              1.361    0.112   12.145    0.000    1.361    0.467
   .BA1               1.668    0.152   10.960    0.000    1.668    0.580
   .BA2               1.347    0.139    9.713    0.000    1.347    0.505
   .BA3               1.163    0.145    8.015    0.000    1.163    0.419
    EC                1.000                               1.000    1.000
    GPI               1.000                               1.000    1.000
    BA                1.000                               1.000    1.000
▶ Standardised factor loadings table
std_loads <- standardizedSolution(fit_conv) |>
  filter(op == "=~") |>
  select(lhs, rhs, est.std, se, pvalue) |>
  mutate(across(where(is.numeric), \(x) round(x, 3)))

kable(std_loads,
      col.names = c("Construct", "Item", "Std. Loading", "SE", "p-value"),
      caption   = "CFA standardised factor loadings — convergent validity evidence")
CFA standardised factor loadings — convergent validity evidence
Construct Item Std. Loading SE p-value
EC EC1 0.754 0.025 0
EC EC2 0.781 0.024 0
EC EC3 0.774 0.024 0
EC EC4 0.755 0.025 0
GPI GPI1 0.788 0.023 0
GPI GPI2 0.753 0.025 0
GPI GPI3 0.784 0.023 0
GPI GPI4 0.730 0.027 0
BA BA1 0.648 0.039 0
BA BA2 0.704 0.037 0
BA BA3 0.762 0.035 0
▶ Fit indices summary
fi <- fitMeasures(fit_conv,
      c("cfi","tli","rmsea","rmsea.ci.lower","rmsea.ci.upper","srmr"))

fit_summary <- data.frame(
  Index     = c("CFI","TLI","RMSEA","RMSEA 90% CI","SRMR"),
  Value     = c(round(fi["cfi"],  3),
                round(fi["tli"],  3),
                round(fi["rmsea"],3),
                paste0("[", round(fi["rmsea.ci.lower"],3),
                       ", ", round(fi["rmsea.ci.upper"],3), "]"),
                round(fi["srmr"], 3)),
  Benchmark = c("≥ 0.95","≥ 0.95","≤ 0.06","—","≤ 0.08")
)
kable(fit_summary,
      col.names = c("Fit Index","Value","Good Fit Benchmark"),
      caption   = "CFA model fit indices")
CFA model fit indices
Fit Index Value Good Fit Benchmark
CFI 0.998 ≥ 0.95
TLI 0.997 ≥ 0.95
RMSEA 0.017 ≤ 0.06
RMSEA 90% CI [0, 0.04] —
SRMR 0.027 ≤ 0.08
Warning: What the CFA results reveal — a two-part story

Good news — convergent validity: The standardised loadings are strong throughout: every EC and GPI loading exceeds 0.70, and the BA loadings range from 0.65 to 0.76. Each item reliably reflects its intended construct. The items cohere within each scale as theory predicted.

Emerging problem — discriminant validity preview: Now look at the factor correlations in the summary() output (the ~~ rows, Std.all column). The estimated correlation between the EC and GPI factors is 0.91 — essentially recovering the population value of 0.93 we simulated. Strong within-scale loadings combined with a near-unity factor correlation is the hallmark of discriminant validity failure.

This reveals the key insight: convergent validity and discriminant validity are distinct assessments that can give opposite verdicts simultaneously. Items can hang together strongly within each scale (good convergent validity) while those scales themselves remain statistically indistinguishable from each other (poor discriminant validity). The next section quantifies exactly how severe this problem is using formal discriminant validity tests.
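If you only want the factor correlations, you can pull them straight from the fitted lavaan object rather than scanning the full summary() output. The sketch below demonstrates lavInspect(fit, "cor.lv") on lavaan's built-in HolzingerSwineford1939 dataset so it runs standalone; the same call applies to the fit_conv model fitted above.

```r
# Sketch: extract just the model-implied latent correlation matrix from a
# fitted lavaan CFA. Uses the built-in HolzingerSwineford1939 data so the
# snippet is self-contained; replace toy_fit with fit_conv in the tutorial.
library(lavaan)

toy_model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'
toy_fit <- cfa(toy_model, data = HolzingerSwineford1939, std.lv = TRUE)

lavInspect(toy_fit, "cor.lv")  # off-diagonals = estimated factor correlations
```

Because std.lv = TRUE fixes factor variances to 1, the covariances reported under ~~ are already correlations, and "cor.lv" returns them as a compact matrix.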


Discriminant Validity Testing Methods

The factor analysis above established that items converge well within their intended scales. The question now is whether those scales can be distinguished from each other — discriminant validity. Two formal methods are available for multi-item Likert scales.

Method 1: HTMT Analysis

What HTMT Measures

HTMT stands for Heterotrait-Monotrait ratio of correlations — which sounds technical, but the idea is simple:

  • Monotrait = correlations between items from the same scale (within-scale)
  • Heterotrait = correlations between items from different scales (between-scale)

The HTMT ratio asks: “How large are the between-scale correlations relative to the within-scale correlations?” If this ratio is close to 1.0, your scales are not distinguishable. If it’s clearly below 1.0, the scales capture different things.

The threshold rules of thumb:

  • HTMT < 0.85 → discriminant validity supported (strict)
  • HTMT < 0.90 → discriminant validity supported (lenient)
  • HTMT ≥ 0.90 → discriminant validity violated

We will use HTMT2, the updated version of the index that replaces arithmetic means of the item correlations with geometric means. This version is more accurate and is now recommended over the original.
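To make the ratio concrete, here is an illustrative hand computation on a toy correlation matrix. This is a sketch of the original HTMT formula only (HTMT2 additionally swaps the remaining arithmetic means for geometric means); in practice, semTools::htmt() handles all of this for you:

```r
# Illustrative HTMT computation (sketch only; semTools::htmt() is the real tool).
# Numerator: average between-scale (heterotrait) correlation.
# Denominator: geometric mean of the average within-scale (monotrait) correlations.
htmt_by_hand <- function(R, idx_a, idx_b) {
  hetero <- mean(abs(R[idx_a, idx_b]))
  mono_a <- mean(abs(R[idx_a, idx_a][lower.tri(R[idx_a, idx_a])]))
  mono_b <- mean(abs(R[idx_b, idx_b][lower.tri(R[idx_b, idx_b])]))
  hetero / sqrt(mono_a * mono_b)
}

# Toy correlation matrix: items 1-2 form scale A, items 3-4 form scale B
R <- matrix(c(1.0, 0.6, 0.5, 0.5,
              0.6, 1.0, 0.5, 0.5,
              0.5, 0.5, 1.0, 0.6,
              0.5, 0.5, 0.6, 1.0), nrow = 4)

htmt_by_hand(R, 1:2, 3:4)   # 0.5 / 0.6, about 0.833
```

With every between-scale correlation at 0.5 and every within-scale correlation at 0.6, the ratio lands just under the strict 0.85 threshold.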

Running the HTMT Analysis

NoteData preparation for htmt()

What htmt() needs: A data frame containing only the scale items — no participant IDs, no condition variable, no demographics.

items_df <- df |> select(EC1:BA3)   # drop id and condition

If you accidentally include non-item columns, the function will try to treat them as scale items and give you meaningless results.

▶ Run HTMT2 analysis
# ── Model specification ────────────────────────────────────────────────────────
# This tells htmt() which items belong to which construct.
# It uses the same syntax as lavaan's CFA model specification.
cfa_model <- '
  EC  =~ EC1 + EC2 + EC3 + EC4
  GPI =~ GPI1 + GPI2 + GPI3 + GPI4
  BA  =~ BA1  + BA2  + BA3
'

# ── Run the HTMT analysis ──────────────────────────────────────────────────────
htmt_result <- semTools::htmt(
  model    = cfa_model,   # factor structure: which items belong to which scale
  data     = items_df,    # item-level data ONLY (no id, no condition variable)
  htmt2    = TRUE,        # use HTMT2 (geometric mean) — recommended
  absolute = TRUE         # use absolute values of correlations
)

print(htmt_result)
       EC   GPI    BA
EC  1.000            
GPI 0.902 1.000      
BA  0.491 0.538 1.000
NoteKey arguments to pay attention to
  • model: specifies which items belong to which construct. Use the same syntax as your CFA model.
  • data: the raw item data. Items only, no ID or grouping variables.
  • htmt2: switches between the original and updated HTMT formula. Set TRUE for HTMT2 (recommended).
  • absolute: whether to take absolute values of correlations. Keep TRUE (the default).
  • missing: how to handle missing data. "listwise" is fine for most cases.

Visualizing the HTMT Results

A table of numbers is fine, but a chart makes it much easier to see which pairs are problematic. The bar chart below marks the HTMT thresholds for easy comparison.

▶ Plot: HTMT2 bar chart with thresholds
# Extract the HTMT matrix and reshape for plotting
htmt_mat <- as.matrix(htmt_result)

# semTools fills only one triangle; we need a robust helper to extract a pair
get_htmt <- function(mat, r, c) {
  val <- mat[r, c]
  if (is.na(val)) val <- mat[c, r]   # try the other triangle
  as.numeric(val)
}

htmt_pairs <- data.frame(
  pair  = c("EC ↔ GPI", "EC ↔ BA", "GPI ↔ BA"),
  htmt  = c(get_htmt(htmt_mat, "EC", "GPI"),
            get_htmt(htmt_mat, "EC", "BA"),
            get_htmt(htmt_mat, "GPI", "BA")),
  label = c("Focal problem pair", "OK", "OK")
)

ggplot(htmt_pairs, aes(x = reorder(pair, -htmt), y = htmt, fill = label)) +
  geom_col(width = 0.55, colour = "white") +
  # Threshold lines
  geom_hline(yintercept = 0.90, linetype = "dashed",
             colour = "firebrick", linewidth = 0.9) +
  geom_hline(yintercept = 0.85, linetype = "dotted",
             colour = "darkorange", linewidth = 0.9) +
  # Value labels on bars
  geom_text(aes(label = round(htmt, 3)), vjust = -0.5,
            fontface = "bold", size = 4) +
  # Threshold annotations
  annotate("text", x = 3.45, y = 0.915,
           label = "Strict threshold (0.90)", colour = "firebrick",
           size = 3.3, hjust = 1) +
  annotate("text", x = 3.45, y = 0.865,
           label = "Lenient threshold (0.85)", colour = "darkorange",
           size = 3.3, hjust = 1) +
  scale_fill_manual(values = c("Focal problem pair" = "#d73027",
                               "OK"                 = "#4575b4"),
                    name = NULL) +
  scale_y_continuous(limits = c(0, 1.08), expand = c(0, 0)) +
  labs(
    x        = NULL,
    y        = "HTMT Value",
    title    = "HTMT2 Analysis: Discriminant Validity Check",
    subtitle = "Bars above the dashed line indicate discriminant validity concerns"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom",
        panel.grid.major.x = element_blank())

Interpreting the HTMT Results

WarningReading the chart
  • EC ↔︎ GPI: The HTMT value exceeds both thresholds (0.85 and 0.90). This is a clear discriminant validity violation. The GPI scale cannot be statistically distinguished from the EC scale.

  • EC ↔︎ BA and GPI ↔︎ BA: Both well below the thresholds. Brand Attitude discriminates fine from the other two constructs.

Conclusion from HTMT: The GPI (our DV) and EC scales are too similar. Before drawing any conclusions about the marketing campaign’s effect on GPI, we need to address this problem — or at least be transparent about it.

What HTMT doesn’t tell you: HTMT gives you a single number per pair and a rule-of-thumb threshold. It does not account for how reliable your scales are, and it doesn’t provide a formal statistical decision with confidence intervals around the estimate of construct overlap. That’s where the Pieters et al. (2025) method comes in.


Method 2: The DVI Method (Pieters et al., 2025)

Why We Need Something More

The HTMT is a great screening tool, but it has two gaps:

  1. No formal inference: The 0.85 and 0.90 thresholds are rules of thumb, not statistically principled tests.
  2. Ignores scale reliability: A pair of highly reliable scales should be able to discriminate even when their factor correlation is somewhat high. HTMT doesn’t account for this.

Pieters et al. (2025) propose the Discriminant Validity Index (DVI), which directly tests whether a scale’s reliability is high enough — relative to the factor correlation — to support discriminant validity. The method follows a clear two-step decision procedure.

The Two-Step Logic

Step 1 — The Phi Test: Is the factor correlation (φ) meaningfully less than 1.0?

\[\text{DVI}_1 = 1 - |\phi|\]

If DVI₁ is significantly greater than zero (the confidence interval doesn’t include zero), the two constructs aren’t perfectly correlated and at least some distinction exists. If DVI₁ isn’t significant, stop: discriminant validity has failed.

Step 2 — The CR Test: Even if φ < 1, is the scale reliability high enough to “rise above” the factor correlation?

\[\text{DVI}_{CR} = \sqrt{CR} - |\phi|\]

where CR is the Congeneric Reliability of the scale (sometimes called McDonald’s omega). If √CR > |φ| (i.e., DVI_CR > 0 and its CI excludes zero), the scale has enough internal consistency to distinguish itself from the other construct.

NoteAn intuitive way to think about Step 2

Imagine φ = 0.93 (constructs are highly correlated) and CR = 0.83 (pretty reliable scale). Then √CR ≈ 0.91. The question is: can √CR (0.91) rise above φ (0.93)? Here, 0.91 < 0.93 — so DVI_CR is negative. The scale isn’t reliable enough to pull ahead of the factor correlation. Discriminant validity fails.

The intuition: reliability sets an upper bound on how distinctly a scale can behave. If that upper bound is itself lower than the factor correlation, the scale is fundamentally compromised.
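The arithmetic behind this intuition takes two lines to verify, using the example values above:

```r
phi <- 0.93          # assumed factor correlation (from the example above)
CR  <- 0.83          # assumed congeneric reliability

1 - abs(phi)         # DVI_1: 0.07, positive but tiny
sqrt(CR) - abs(phi)  # DVI_CR: about -0.019, negative: the reliability ceiling sits below phi
```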

Step 1: Specifying the CFA Model with Defined Parameters

The key insight from Pieters et al. (2025) is to define CR and DVI inside the lavaan model syntax using the := operator. This lets lavaan automatically compute these values and their standard errors, giving us confidence intervals for free.

NoteData preparation for the CFA

Same as HTMT: The CFA model should be fit on the item-level data only. The condition variable is not part of the measurement model.

Additionally, you must label every loading and every error variance in the model. These labels are needed to compute CR and DVI using :=. Without them, lavaan won’t know which parameters to use in the formulas.

▶ Specify the labeled CFA/DVI model
dvi_model <- '
  # Measurement model
  EC  =~ lam_EC1*EC1 + lam_EC2*EC2 + lam_EC3*EC3 + lam_EC4*EC4
  GPI =~ lam_GPI1*GPI1 + lam_GPI2*GPI2 + lam_GPI3*GPI3 + lam_GPI4*GPI4
  BA  =~ lam_BA1*BA1 + lam_BA2*BA2 + lam_BA3*BA3

  # Error variances
  EC1  ~~ th_EC1*EC1
  EC2  ~~ th_EC2*EC2
  EC3  ~~ th_EC3*EC3
  EC4  ~~ th_EC4*EC4

  GPI1 ~~ th_GPI1*GPI1
  GPI2 ~~ th_GPI2*GPI2
  GPI3 ~~ th_GPI3*GPI3
  GPI4 ~~ th_GPI4*GPI4

  BA1  ~~ th_BA1*BA1
  BA2  ~~ th_BA2*BA2
  BA3  ~~ th_BA3*BA3

  # Factor correlations
  EC  ~~ phi_EC_GPI*GPI
  EC  ~~ phi_EC_BA*BA
  GPI ~~ phi_GPI_BA*BA

  # Congeneric reliability
  CR_EC  := ((lam_EC1 + lam_EC2 + lam_EC3 + lam_EC4)^2) / (((lam_EC1 + lam_EC2 + lam_EC3 + lam_EC4)^2) + th_EC1 + th_EC2 + th_EC3 + th_EC4)
  CR_GPI := ((lam_GPI1 + lam_GPI2 + lam_GPI3 + lam_GPI4)^2) / (((lam_GPI1 + lam_GPI2 + lam_GPI3 + lam_GPI4)^2) + th_GPI1 + th_GPI2 + th_GPI3 + th_GPI4)
  CR_BA  := ((lam_BA1 + lam_BA2 + lam_BA3)^2) / (((lam_BA1 + lam_BA2 + lam_BA3)^2) + th_BA1 + th_BA2 + th_BA3)

  # DVI
  DVI_1      := 1 - sqrt(phi_EC_GPI^2)
  DVI_CR_EC  := sqrt(CR_EC) - sqrt(phi_EC_GPI^2)
  DVI_CR_GPI := sqrt(CR_GPI) - sqrt(phi_EC_GPI^2)
'
NoteKey features of the model specification to pay attention to
  • Labels on every loading (lam_EC1*EC1): required to compute CR with :=
  • Labels on every error variance (th_EC1*EC1): required to compute CR with :=
  • Labelled factor correlations (phi_EC_GPI*GPI): the phi label feeds into the DVI formulas
  • sqrt(phi^2) instead of phi: takes the absolute value (in case phi is negative)
  • The := operator: defines derived parameters; lavaan computes their SEs automatically

Step 2: Fitting the CFA

▶ Fit CFA and get Wald confidence intervals
# Fit the CFA using the labelled model
# std.lv = TRUE: fixes factor variances to 1, so phi values ARE correlations
fit_wald <- cfa(
  model  = dvi_model,
  data   = items_df,
  std.lv = TRUE      # ← IMPORTANT: without this, phi is not a correlation
)

# Check that the model converged and fits reasonably
summary(fit_wald, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-21 ended normally after 24 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        25

  Number of observations                           400

Model Test User Model:
                                                      
  Test statistic                                45.994
  Degrees of freedom                                41
  P-value (Chi-square)                           0.273

Model Test Baseline Model:

  Test statistic                              2071.349
  Degrees of freedom                                55
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.998
  Tucker-Lewis Index (TLI)                       0.997

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -7482.985
  Loglikelihood unrestricted model (H1)      -7459.988
                                                      
  Akaike (AIC)                               15015.971
  Bayesian (BIC)                             15115.757
  Sample-size adjusted Bayesian (SABIC)      15036.430

Root Mean Square Error of Approximation:

  RMSEA                                          0.017
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.040
  P-value H_0: RMSEA <= 0.050                    0.996
  P-value H_0: RMSEA >= 0.080                    0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.027

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  EC =~                                                                 
    EC1    (l_EC1)    1.246    0.073   16.967    0.000    1.246    0.754
    EC2    (l_EC2)    1.273    0.071   17.856    0.000    1.273    0.781
    EC3    (l_EC3)    1.287    0.073   17.621    0.000    1.287    0.774
    EC4    (l_EC4)    1.267    0.075   17.008    0.000    1.267    0.755
  GPI =~                                                                
    GPI1  (l_GPI1)    1.316    0.073   18.104    0.000    1.316    0.788
    GPI2  (l_GPI2)    1.262    0.074   16.984    0.000    1.262    0.753
    GPI3  (l_GPI3)    1.318    0.073   17.990    0.000    1.318    0.784
    GPI4  (l_GPI4)    1.245    0.077   16.253    0.000    1.245    0.730
  BA =~                                                                 
    BA1    (l_BA1)    1.100    0.087   12.616    0.000    1.100    0.648
    BA2    (l_BA2)    1.150    0.083   13.802    0.000    1.150    0.704
    BA3    (l_BA3)    1.270    0.084   15.028    0.000    1.270    0.762

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  EC ~~                                                                 
    GPI   (p_EC_G)    0.905    0.020   45.112    0.000    0.905    0.905
    BA    (p_EC_B)    0.489    0.051    9.577    0.000    0.489    0.489
  GPI ~~                                                                
    BA      (p_GP)    0.539    0.049   11.091    0.000    0.539    0.539

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .EC1    (t_EC1)    1.182    0.100   11.773    0.000    1.182    0.432
   .EC2    (t_EC2)    1.035    0.092   11.309    0.000    1.035    0.390
   .EC3    (t_EC3)    1.109    0.097   11.441    0.000    1.109    0.401
   .EC4    (t_EC4)    1.213    0.103   11.754    0.000    1.213    0.430
   .GPI1  (t_GPI1)    1.060    0.094   11.249    0.000    1.060    0.380
   .GPI2  (t_GPI2)    1.215    0.103   11.836    0.000    1.215    0.433
   .GPI3  (t_GPI3)    1.087    0.096   11.316    0.000    1.087    0.385
   .GPI4  (t_GPI4)    1.361    0.112   12.145    0.000    1.361    0.467
   .BA1    (t_BA1)    1.668    0.152   10.960    0.000    1.668    0.580
   .BA2    (t_BA2)    1.347    0.139    9.713    0.000    1.347    0.505
   .BA3    (t_BA3)    1.163    0.145    8.015    0.000    1.163    0.419
    EC                1.000                               1.000    1.000
    GPI               1.000                               1.000    1.000
    BA                1.000                               1.000    1.000

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    CR_EC             0.850    0.012   69.436    0.000    0.850    0.850
    CR_GPI            0.848    0.012   68.598    0.000    0.848    0.849
    CR_BA             0.748    0.022   34.341    0.000    0.748    0.748
    DVI_1             0.095    0.020    4.727    0.000    0.095    0.095
    DVI_CR_EC         0.017    0.021    0.820    0.412    0.017    0.017
    DVI_CR_GPI        0.016    0.021    0.777    0.437    0.016    0.016
Notestd.lv = TRUE — why it matters

When you set std.lv = TRUE, lavaan fixes the variance of each latent factor to 1. This means the factor covariance between EC and GPI is directly interpretable as a correlation (values between –1 and +1). Without this, the raw covariance would be in an arbitrary scale, and computing DVI from it would not be valid.

Always use std.lv = TRUE when applying the DVI method.
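The rescaling that std.lv = TRUE performs can be seen in miniature with base R's cov2cor(), which turns any covariance matrix into a correlation matrix. The numbers below are hypothetical, not taken from the model:

```r
# Hypothetical 2x2 factor covariance matrix (NOT from the fitted model)
S <- matrix(c(2.0, 1.2,
              1.2, 1.5), nrow = 2)

# cov2cor() rescales by the variances, just as std.lv = TRUE rescales the
# latent factors: cor = cov / sqrt(var1 * var2)
cov2cor(S)   # off-diagonal: 1.2 / sqrt(2.0 * 1.5), about 0.693
```

Only after this rescaling is the off-diagonal entry bounded by ±1 and interpretable as a correlation.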

Step 3: Extracting and Interpreting the Wald Confidence Intervals

The summary() output above already contains the DVI values and their Wald confidence intervals. The code below pulls out just the parameters we care about.

▶ Extract DVI estimates (Wald CIs)
# Extract parameter estimates with 95% Wald confidence intervals
pe_wald <- parameterEstimates(fit_wald, level = 0.95)

# Filter to the derived parameters (DVI, CR, and the factor correlation)
key_params <- c("phi_EC_GPI", "CR_EC", "CR_GPI",
                "DVI_1", "DVI_CR_EC", "DVI_CR_GPI")

dvi_wald_table <- pe_wald |>
  filter(label %in% key_params) |>
  select(label, est, se, ci.lower, ci.upper, pvalue) |>
  mutate(across(where(is.numeric), \(x) round(x, 3)))

kable(
  dvi_wald_table,
  col.names = c("Parameter", "Estimate", "SE", "CI Lower", "CI Upper", "p-value"),
  caption   = "DVI Results with Wald 95% Confidence Intervals"
)
DVI Results with Wald 95% Confidence Intervals
Parameter Estimate SE CI Lower CI Upper p-value
phi_EC_GPI 0.905 0.020 0.866 0.944 0.000
CR_EC 0.850 0.012 0.826 0.874 0.000
CR_GPI 0.848 0.012 0.824 0.873 0.000
DVI_1 0.095 0.020 0.056 0.134 0.000
DVI_CR_EC 0.017 0.021 -0.023 0.057 0.412
DVI_CR_GPI 0.016 0.021 -0.024 0.056 0.437
NoteHow to read this table

Work through the rows in order:

  • phi_EC_GPI: The estimated correlation between the EC and GPI latent factors. Our simulation set this at 0.93, so you should see something close to that. Values close to 1.0 are the core of the problem.

  • CR_EC / CR_GPI: The reliability of each scale, essentially how well the items hang together. Here both are about 0.85, so √CR ≈ 0.92. For Step 2 to pass, that 0.92 must exceed the estimated factor correlation (0.905) by a statistically reliable margin. When the factor correlation is that high, even solid reliability of 0.85 leaves almost no room.

  • DVI_1: Step 1. The simulation set φ at 0.93, which would give 1 − 0.93 = 0.07; with the sample estimate of 0.905, you get 1 − 0.905 ≈ 0.095. The question is whether this small gap is statistically reliable (CI excludes zero). If the CI includes zero, the constructs are statistically indistinguishable from perfect overlap.

  • DVI_CR_EC / DVI_CR_GPI: Step 2. The estimate is √CR − |φ| ≈ 0.922 − 0.905 ≈ 0.017, barely above zero, with a CI that clearly includes zero. The scales' reliability does not reliably rise above the factor correlation. Discriminant validity fails.

A positive DVI value with a CI that doesn’t cross zero = evidence for discriminant validity. A near-zero or negative DVI value, or one whose CI crosses zero = discriminant validity is not supported.

Step 4: Bootstrap Confidence Intervals

Wald confidence intervals rely on asymptotic normality — they can be slightly off, especially for parameters that are constrained (like DVI_CR which can’t go below –1). Bootstrap confidence intervals make no such assumptions and are more trustworthy.

NoteHow bootstrap CIs work (in plain language)

Instead of a mathematical formula for the CI, we repeatedly resample the data (with replacement), refit the CFA each time, and compute DVI. After doing this thousands of times, we look at the 2.5th and 97.5th percentiles of the DVI values we got. Those percentiles form our 95% CI.

This is slower to compute but more honest about uncertainty.
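As a minimal sketch of that recipe, here is a hand-rolled percentile bootstrap for a simple correlation on toy data. (For the CFA itself, se = "bootstrap" in lavaan automates exactly this loop.)

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)           # toy bivariate data

# Resample rows with replacement, recompute the statistic each time
boot_r <- replicate(2000, {
  idx <- sample(n, replace = TRUE)
  cor(x[idx], y[idx])
})

# The 2.5th and 97.5th percentiles of the bootstrap distribution form the 95% CI
quantile(boot_r, probs = c(0.025, 0.975))
```

The same logic applies to DVI: each resample yields one DVI value, and the percentiles of those values give the CI without any normality assumption.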

▶ Re-fit with bootstrap SEs (~60 sec)
# Fit the model again with bootstrap standard errors
# bootstrap = 1000 is a reasonable minimum; use 2000+ for final reporting
set.seed(31415927)
fit_boot <- cfa(
  model     = dvi_model,
  data      = items_df,
  std.lv    = TRUE,
  se        = "bootstrap",   # switch from Wald to bootstrap
  bootstrap = 1000           # number of bootstrap resamples
)

# Extract parameter estimates with percentile bootstrap CIs
pe_boot <- parameterEstimates(
  fit_boot,
  boot.ci.type = "perc",   # percentile method (most commonly recommended)
  level        = 0.95
)

dvi_boot_table <- pe_boot |>
  filter(label %in% key_params) |>
  select(label, est, ci.lower, ci.upper) |>
  mutate(across(where(is.numeric), \(x) round(x, 3)))

kable(
  dvi_boot_table,
  col.names = c("Parameter", "Estimate", "Bootstrap CI Lower", "Bootstrap CI Upper"),
  caption   = "DVI Results with Percentile Bootstrap 95% Confidence Intervals (B = 1,000)"
)
DVI Results with Percentile Bootstrap 95% Confidence Intervals (B = 1,000)
Parameter Estimate Bootstrap CI Lower Bootstrap CI Upper
phi_EC_GPI 0.905 0.865 0.945
CR_EC 0.850 0.822 0.872
CR_GPI 0.848 0.819 0.872
DVI_1 0.095 0.055 0.135
DVI_CR_EC 0.017 -0.025 0.059
DVI_CR_GPI 0.016 -0.026 0.055
NoteBootstrap vs. Wald CIs — which to report?

In practice, report both. The Wald CIs are faster to compute and good for a first check. Use bootstrap CIs (with B ≥ 2,000) for the final analysis you report in a paper. If they tell the same story, you can be confident in your conclusions.

Step 5: Visualizing the DVI Results

A forest plot makes it easy to see all three DVI values and whether their CIs cross zero.

▶ Plot: DVI forest plot
# Build a plotting data frame from the bootstrap results
dvi_plot_df <- dvi_boot_table |>
  filter(label %in% c("DVI_1", "DVI_CR_EC", "DVI_CR_GPI")) |>
  mutate(
    label_nice = case_when(
      label == "DVI_1"      ~ "DVI₁: Phi criterion\n(1 – |φ|)",
      label == "DVI_CR_EC"  ~ "DVI_CR (EC): CR criterion\n(√CR_EC – |φ|)",
      label == "DVI_CR_GPI" ~ "DVI_CR (GPI): CR criterion\n(√CR_GPI – |φ|)"
    ),
    supported = ci.lower > 0   # TRUE if entire CI is above zero
  )

ggplot(dvi_plot_df, aes(x = est, y = reorder(label_nice, est), colour = supported)) +
  # CI bar
  geom_segment(aes(x = ci.lower, xend = ci.upper,
                   y = reorder(label_nice, est), yend = reorder(label_nice, est)),
               linewidth = 1.5) +
  # Point estimate
  geom_point(size = 4) +
  # Zero line (the null: no discriminant validity)
  geom_vline(xintercept = 0, linetype = "dashed", colour = "black", linewidth = 0.8) +
  # Colour: green = supported, red = not supported
  scale_colour_manual(values = c("TRUE"  = "#2c7bb6",
                                 "FALSE" = "#d7191c"),
                      labels = c("TRUE"  = "DV supported (CI > 0)",
                                 "FALSE" = "DV NOT supported (CI ≤ 0)"),
                      name   = NULL) +
  labs(
    x        = "DVI Estimate (with 95% Bootstrap CI)",
    y        = NULL,
    title    = "DVI Forest Plot: EC ↔ GPI Pair",
    subtitle = "If the CI bar crosses the dashed line (0), discriminant validity is not supported"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())

Step 6: The Decision Procedure

Pieters et al. (2025) propose a structured two-step decision procedure. Work through it in order:

▶ Two-step decision table
# Determine outcomes based on bootstrap CIs
dvi_results <- dvi_boot_table |>
  filter(label %in% c("DVI_1", "DVI_CR_EC", "DVI_CR_GPI")) |>
  mutate(
    significant = ci.lower > 0,
    verdict     = if_else(significant, "Supported ✓", "Not supported ✗")
  ) |>
  select(label, est, ci.lower, ci.upper, verdict)

kable(
  dvi_results,
  col.names = c("DVI Metric", "Estimate", "95% CI Lower", "95% CI Upper", "Verdict"),
  caption   = "Two-Step Decision Summary for the EC–GPI Pair"
)
Two-Step Decision Summary for the EC–GPI Pair
DVI Metric Estimate 95% CI Lower 95% CI Upper Verdict
DVI_1 0.095 0.055 0.135 Supported ✓
DVI_CR_EC 0.017 -0.025 0.059 Not supported ✗
DVI_CR_GPI 0.016 -0.026 0.055 Not supported ✗
WarningApplying the two-step decision rule

Step 1 — DVI₁ (Phi criterion): Is DVI₁ significantly > 0?

  • If YES: The factor correlation is meaningfully less than 1.0. Some distinction exists. Proceed to Step 2.
  • If NO: Discriminant validity has completely failed. EC and GPI are statistically indistinguishable — you cannot treat them as separate constructs in your analysis.

Step 2 — DVI_CR (CR criterion): Is DVI_CR significantly > 0 for BOTH scales?

  • If DVI_CR > 0 for both scales: Discriminant validity is fully supported. The scales are reliable enough to rise above the factor correlation.
  • If DVI_CR > 0 for only one scale: The problem is asymmetric. One scale is reliable enough to distinguish itself; the other isn’t. That weaker scale is where your attention should focus — its items may be too conceptually similar to the other construct.
  • If DVI_CR ≤ 0 for both scales: Even though the constructs aren’t perfectly correlated (Step 1 passed), neither scale is reliable enough to pull ahead of the factor correlation. In practical terms: you cannot draw clean conclusions about which construct your DV is measuring.

In our data: Because we set the EC–GPI correlation at 0.93, Step 1 passes (the gap from perfect overlap is tiny but statistically reliable) while Step 2 fails for both scales. The scales are simply too correlated to be treated as measuring distinct constructs, regardless of how the items were worded.

Comparing the Two Methods

Side-by-Side Summary

▶ HTMT vs. DVI side-by-side summary
# Assemble phi and HTMT values for the comparison table
phi_est     <- round(dvi_boot_table$est[dvi_boot_table$label == "phi_EC_GPI"], 3)
phi_ci_lo   <- round(dvi_boot_table$ci.lower[dvi_boot_table$label == "phi_EC_GPI"], 3)
phi_ci_hi   <- round(dvi_boot_table$ci.upper[dvi_boot_table$label == "phi_EC_GPI"], 3)
htmt_val    <- round(get_htmt(htmt_mat, "EC", "GPI"), 3)

comparison <- data.frame(
  Feature         = c("What it tests",
                      "Key output",
                      "Threshold / decision rule",
                      "Accounts for reliability?",
                      "Formal statistical test?",
                      "EC ↔ GPI result"),
  HTMT            = c("Between-scale vs. within-scale correlations",
                      paste0("HTMT2 = ", htmt_val),
                      "< 0.85 (strict) or < 0.90 (lenient)",
                      "No",
                      "No — rule of thumb only",
                      if_else(htmt_val >= 0.90, "FAILED ✗ (≥ 0.90)", "OK ✓")),
  `DVI (Pieters et al., 2025)` = c("Factor correlation and scale reliability",
                      paste0("φ = ", phi_est, " [", phi_ci_lo, ", ", phi_ci_hi, "]"),
                      "DVI > 0 with CI excluding zero",
                      "Yes — uses Congeneric Reliability",
                      "Yes — bootstrap CIs",
                      "See DVI table above")
)

kable(
  comparison,
  col.names = c("Feature", "HTMT", "DVI (Pieters et al., 2025)"),
  caption   = "Comparing the two discriminant validity methods"
)
Comparing the two discriminant validity methods
Feature HTMT DVI (Pieters et al., 2025)
What it tests Between-scale vs. within-scale correlations Factor correlation and scale reliability
Key output HTMT2 = 0.902 φ = 0.905 [0.865, 0.945]
Threshold / decision rule < 0.85 (strict) or < 0.90 (lenient) DVI > 0 with CI excluding zero
Accounts for reliability? No Yes — uses Congeneric Reliability
Formal statistical test? No — rule of thumb only Yes — bootstrap CIs
EC ↔︎ GPI result FAILED ✗ (≥ 0.90) See DVI table above

What to Do When Discriminant Validity Fails

If HTMT and DVI both flag the same pair of constructs, you have a genuine problem. Here are your options:

  1. Reconceptualise: Are EC and GPI genuinely different constructs, or have you (like many sustainability researchers) been treating two facets of the same underlying construct as if they were distinct?

  2. Revise the scale: Remove or rewrite GPI items that are too close in meaning to EC items. Re-collect data.

  3. Combine the scales: If the constructs truly cannot be separated, consider treating them as a single broader construct (e.g., “pro-environmental orientation”).

  4. Report transparently: If none of the above is feasible, at minimum report the discriminant validity failure and discuss what it implies for interpreting your effects.
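One further diagnostic sometimes used alongside options 1 and 3 is a constrained-model comparison: fit a CFA in which EC and GPI merge into a single factor and test whether keeping them separate improves fit. The sketch below assumes the items_df and cfa_model objects defined earlier; note that this chi-square difference test is known to be more lenient than HTMT or DVI, so passing it alone is weak evidence of discriminant validity.

```r
# Merged model: EC and GPI items load on one factor (in effect, phi fixed at 1)
merged_model <- '
  ECGPI =~ EC1 + EC2 + EC3 + EC4 + GPI1 + GPI2 + GPI3 + GPI4
  BA    =~ BA1 + BA2 + BA3
'

fit_3f <- cfa(cfa_model,    data = items_df, std.lv = TRUE)  # three factors
fit_2f <- cfa(merged_model, data = items_df, std.lv = TRUE)  # EC + GPI merged

# Likelihood-ratio (chi-square difference) test of the nested models:
# a significant result says the three-factor model fits better, i.e. EC and
# GPI are not literally the same factor -- a necessary but weak condition
anova(fit_2f, fit_3f)
```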

ImportantThe core take-home message

A marketing effect that appears on a GPI scale which fails discriminant validity from EC is not clearly interpretable. You cannot claim the campaign increased purchase intention if the scale measuring “purchase intention” is statistically indistinguishable from a scale measuring environmental concern. The effect could reflect either or both. Discriminant validity testing is not optional — it is a prerequisite for meaningful inference about relationships between constructs.

Other Methods for Assessing Discriminant Validity

The HTMT and DVI approaches covered here are the current best practice for multi-item Likert scales, but other methods exist:

  • Average Variance Extracted (AVE) criterion (Fornell & Larcker, 1981): Compare the square root of each construct’s AVE against its inter-construct correlations. Widely used but has known limitations — the criterion is often too lenient, and AVE itself is sensitive to the number of items. The Fornell–Larcker criterion was the dominant approach before HTMT.
  • Multitrait-multimethod (MTMM) analysis (Campbell & Fiske, 1959): The classical approach — measure multiple traits using multiple methods and examine the pattern of convergent and discriminant correlations. Resource-intensive but gold standard for construct validation.
  • Network psychometrics: Gaussian graphical models (GGMs) can reveal the partial correlation structure among items, making discriminant validity failures visible as dense cross-scale connections.
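For concreteness, the Fornell–Larcker check can be run by hand using the standardized EC loadings and the factor correlation reported in the CFA output above:

```r
# AVE = mean of squared standardized loadings
loadings_EC <- c(0.754, 0.781, 0.774, 0.755)   # Std.all loadings from the CFA above
AVE_EC <- mean(loadings_EC^2)                  # about 0.587

# Fornell-Larcker criterion: sqrt(AVE) should exceed the factor correlation
sqrt(AVE_EC)          # about 0.766
sqrt(AVE_EC) > 0.905  # FALSE: this criterion also flags the EC-GPI pair
```

Here even the lenient Fornell–Larcker criterion agrees with HTMT and DVI, but with a less extreme factor correlation it can easily pass where the other two methods fail.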

Validity Testing for Non-Likert Data

The Same Classical Test Theory Logic Applies to Every Data Type

The classical test theory equation introduced at the start of this section — observed score = true score + error — is not a property of Likert scales. It is a property of measurement itself. Every datum, regardless of format, carries both signal and noise, and that noise is not always random.

A 5-point Likert response of “4” might reflect a true belief of 3 plus a 1-point upward nudge from a positively framed question. A willingness-to-pay (WTP) response of $20 might reflect a true utility of $15 plus $5 of upward error from luxury packaging that inflated perceived market price. A text response of “fine” carries very different true sentiment depending on whether the speaker is British, exasperated, or genuinely content. In every case, the observed signal blends multiple latent sources:

\[X_\text{observed} = T_\text{true construct} + \lambda \cdot T_\text{other construct} + \varepsilon_\text{random}\]

The discriminant validity failure mode is identical regardless of measurement format: the \(\lambda\) term is non-zero, meaning the observed measure is blending the construct you care about with something else. What changes across data types is not the problem but the tools available to diagnose it. HTMT and DVI are specifically designed for multi-item Likert scales and do not directly apply to WTP, binary choice, or open-ended text. But the question they answer — is this measure specific enough to support meaningful causal inference? — applies universally.

ImportantThe core constraint for non-Likert data

For non-Likert outcomes, we typically have fewer off-the-shelf statistical tests. This makes critical thinking about the data-generating process the most valuable diagnostic tool. Before running any model, ask: what else, besides my intended construct, could plausibly move this observed measure? If you can identify more than one or two credible alternatives, discriminant validity is a genuine concern.

Willingness to Pay Data

The measurement problem: WTP is intended to capture a respondent’s personal utility for a product — how much value they derive from owning or using it. But WTP responses also reflect perceived market price: some respondents answer the question “how much is this worth in the market?” more than “how much is this worth to me?” These two latent sources — utility and price belief — are conceptually distinct but empirically entangled in the single observed number.

\[X_\text{WTP} = T_\text{eco utility} + \lambda \cdot T_\text{perceived price} + \varepsilon\]

In our green marketing scenario, suppose each participant gives open-ended WTP for six eco-cleaning products. Products 1–3 are evaluated in plain packaging; products 4–6 are presented with luxury metallic packaging — a cue that should raise perceived market price without changing the actual eco-utility of the product. If WTP for the luxury-packaged products rises sharply while independent liking ratings and forced-choice preferences barely move, your WTP measure is not purely measuring utility. It is blending utility with price beliefs — exactly the same failure mode as Likert scale items bleeding across construct boundaries.

▶ Simulate WTP data: eco utility vs. perceived price confound
set.seed(2025)
N_wtp <- 300

# Two latent constructs
eco_utility  <- rnorm(N_wtp, 0, 1)
# Price belief correlated with utility: eco-conscious consumers expect eco products to cost more
price_belief <- 0.40 * eco_utility + rnorm(N_wtp, 0, sqrt(1 - 0.40^2))

# Products 1–3: plain packaging — WTP driven primarily by eco utility
# Products 4–6: luxury packaging — WTP also inflated by price belief
# Log-scale WTP ensures positive values and stabilises variance (currency is right-skewed)
wtp_df <- data.frame(
  plain1 = round(exp(3.0 + 0.70 * eco_utility + 0.15 * price_belief + rnorm(N_wtp, 0, 0.30))),
  plain2 = round(exp(3.0 + 0.72 * eco_utility + 0.12 * price_belief + rnorm(N_wtp, 0, 0.30))),
  plain3 = round(exp(3.0 + 0.68 * eco_utility + 0.18 * price_belief + rnorm(N_wtp, 0, 0.30))),
  lux4   = round(exp(3.0 + 0.40 * eco_utility + 0.60 * price_belief + rnorm(N_wtp, 0, 0.30))),
  lux5   = round(exp(3.0 + 0.38 * eco_utility + 0.62 * price_belief + rnorm(N_wtp, 0, 0.30))),
  lux6   = round(exp(3.0 + 0.42 * eco_utility + 0.58 * price_belief + rnorm(N_wtp, 0, 0.30)))
)
log_wtp_df <- as.data.frame(log(wtp_df))
colnames(log_wtp_df) <- paste0("log_", names(wtp_df))

cat("Mean WTP by packaging type:\n")
Mean WTP by packaging type:
cat(sprintf("  Plain packaging (products 1–3): $%.1f\n", mean(unlist(wtp_df[,1:3]))))
  Plain packaging (products 1–3): $28.4
cat(sprintf("  Luxury packaging (products 4–6): $%.1f\n", mean(unlist(wtp_df[,4:6]))))
  Luxury packaging (products 4–6): $29.6

Convergent Validity of WTP: EFA on Log-WTP Judgments

Just as we ran EFA on Likert items to check whether items measuring the same construct cluster together, we can run EFA on multiple WTP judgments for products within the same category. The logic is identical: if WTP responses for plain-packaged products are driven by the same latent eco-utility factor, they should load together. If the luxury-packaged products tap a different latent source (price belief), they should load on a separate factor.

▶ EFA on log-WTP: do plain and luxury products cluster separately?
wtp_efa <- fa(log_wtp_df, nfactors = 2, rotate = "oblimin", fm = "ml")
print(loadings(wtp_efa), cutoff = 0.30)

Loadings:
           ML1    ML2   
log_plain1         0.955
log_plain2         0.964
log_plain3         0.833
log_lux4    0.939       
log_lux5    0.987       
log_lux6    0.857       

                 ML1   ML2
SS loadings    2.602 2.547
Proportion Var 0.434 0.424
Cumulative Var 0.434 0.858
Note: What the WTP-EFA reveals

If the EFA cleanly separates plain-packaging products (loading on Factor 1) from luxury-packaging products (loading on Factor 2), the two sets of WTP responses are being driven by different latent sources. That split is the discriminant validity failure made visible in factor-analytic terms — the same pattern as EC and GPI items cross-loading on the same factor in the Likert analysis.

Discriminant Validity of WTP: Comparing Latent Models

For Likert scales, HTMT and DVI are the formal discriminant validity tests. For WTP, the equivalent is confirmatory model comparison: fit competing CFA models that differ in how many latent sources of variance they assume, and evaluate which fits the data better.

▶ CFA model comparison: 1-factor vs. 2-factor WTP
# Model 1: All six WTP products driven by a single latent utility construct
wtp_1f <- '
  utility =~ log_plain1 + log_plain2 + log_plain3 +
             log_lux4   + log_lux5   + log_lux6
'

# Model 2: Two latent sources — eco utility (plain) and price-inflated utility (luxury)
wtp_2f <- '
  eco_util  =~ log_plain1 + log_plain2 + log_plain3
  price_wtp =~ log_lux4   + log_lux5   + log_lux6
'

fit_wtp_1f <- cfa(wtp_1f, data = log_wtp_df, std.lv = TRUE)
fit_wtp_2f <- cfa(wtp_2f, data = log_wtp_df, std.lv = TRUE)

fi_1f <- fitMeasures(fit_wtp_1f, c("cfi","rmsea","srmr"))
fi_2f <- fitMeasures(fit_wtp_2f, c("cfi","rmsea","srmr"))

wtp_compare <- data.frame(
  Model   = c("1-factor (single utility construct)",
               "2-factor (eco utility + price belief)"),
  CFI     = round(c(fi_1f["cfi"],   fi_2f["cfi"]),   3),
  RMSEA   = round(c(fi_1f["rmsea"], fi_2f["rmsea"]), 3),
  SRMR    = round(c(fi_1f["srmr"],  fi_2f["srmr"]),  3)
)
kable(wtp_compare,
      col.names = c("Model","CFI","RMSEA","SRMR"),
      caption   = "WTP model comparison: fit indices")
Table: WTP model comparison: fit indices

| Model                                 |   CFI | RMSEA |  SRMR |
|---------------------------------------|-------|-------|-------|
| 1-factor (single utility construct)   | 0.832 | 0.368 | 0.072 |
| 2-factor (eco utility + price belief) | 0.996 | 0.063 | 0.015 |
# Likelihood ratio test
print(lavTestLRT(fit_wtp_1f, fit_wtp_2f))

Chi-Squared Difference Test

           Df    AIC    BIC   Chisq Chisq diff  RMSEA Df diff Pr(>Chisq)    
fit_wtp_2f  8 2406.7 2454.8  17.453                                         
fit_wtp_1f  9 2761.8 2806.2 374.566     357.11 1.0895       1  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Warning: Reading the WTP model comparison

If the 2-factor model fits substantially better than the 1-factor model (higher CFI, lower RMSEA and SRMR; significant likelihood ratio test), the data are telling you that plain-packaging and luxury-packaging WTP are not measuring the same latent construct. Your WTP measure is blending utility with price beliefs — a discriminant validity failure.

The interpretive consequence is identical to the Likert case: if a green marketing treatment raises WTP, you cannot cleanly claim it raised eco-utility. The effect may instead reflect that the treatment made the product seem more premium or expensive. The ambiguity comes from measurement contamination, not from the manipulation itself.

Two Sharp Validity Checks Without a Full Model

When a full CFA comparison is not feasible, two targeted checks provide strong discriminant validity evidence and map directly onto the Likert logic:

Known-groups test (analogue of checking whether EC and GPI diverge in expected directions): Groups with high genuine eco-utility — self-identified environmental activists, members of eco-certification organisations — should show higher WTP for plain-packaged eco products. But they should show similar WTP to non-activists for luxury-packaged products if the luxury cue is moving price beliefs rather than utility. If the groups converge on luxury WTP while diverging on plain WTP, the plain-packaging measure is more construct-valid.

Experimental falsification (analogue of the manipulation check in scale development): Manipulate something that should move price beliefs but not utility — luxury packaging, an inflated “RRP” anchor, a prestige brand cue. If WTP shifts significantly more than liking ratings or forced-choice preferences in response to this manipulation, WTP is contaminated by the price-belief dimension. This is the direct WTP analogue of an EC item cross-loading onto the GPI factor: the measure moves for the wrong reason.
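Both checks can be run without any latent-variable machinery. The sketch below simulates the known-groups pattern under assumed effect sizes — the `activist` grouping variable and every coefficient are hypothetical illustrations, not estimates from real data — and tests it with a plain packaging × activist interaction in a simple linear model:

```r
# Hedged sketch: `activist` is a hypothetical known-groups variable; all
# effect sizes are assumptions chosen to illustrate the expected pattern.
set.seed(7)
n <- 200
activist     <- rbinom(n, 1, 0.5)            # 1 = environmental activist
eco_utility  <- 0.8 * activist + rnorm(n)    # activists carry higher eco-utility
price_belief <- rnorm(n)                     # orthogonal to activism here

wtp_long <- data.frame(
  packaging = rep(c("plain", "luxury"), each = n),
  activist  = rep(activist, times = 2),
  # plain WTP tracks utility; luxury WTP mixes in price beliefs
  wtp = c(exp(3 + 0.7 * eco_utility + rnorm(n, 0, 0.3)),
          exp(3 + 0.4 * eco_utility + 0.6 * price_belief + rnorm(n, 0, 0.3)))
)

# Known-groups logic: the activist gap should be larger for plain packaging.
# A significant packaging x activist interaction is the validity signal.
fit_kg <- lm(log(wtp) ~ packaging * activist, data = wtp_long)
round(coef(summary(fit_kg)), 3)
```

If the `packagingplain:activist` term is positive and significant, the activist gap widens under plain packaging, consistent with plain-packaging WTP being the more construct-valid measure.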

Tip: The MTMM principle for WTP

The gold standard is a multitrait-multimethod (MTMM) design — the same framework introduced earlier in this section for Likert scales. Collect multiple measures of eco-utility across multiple methods: open-ended WTP, liking ratings on a Likert scale, pairwise forced choice, an incentivised BDM-auction bid, or observed purchase. Convergent validity holds when the same eco-utility construct shows up consistently across methods. Discriminant validity holds when eco-utility measures are more similar to each other than to measures of a different construct (e.g., perceived market price, brand prestige). The methods differ from the Likert context; the principle is the same.

Textual Data

The measurement problem: “Fine” means something very different depending on context. In a British customer review, “fine” typically signals mild disappointment; in an American survey, neutrality; written under time pressure, disengagement. The word carries measurement error that is not random — it is structured by dialect, pragmatic convention, and respondent state. In CTT terms:

\[X_\text{text response} = T_\text{intended construct} + \lambda \cdot T_\text{context / register / fatigue} + \varepsilon\]

When you extract a “green purchase intention” signal from open-ended text, you need to be confident that signal is specific to purchase intention and not merely general environmental rhetoric, brand enthusiasm, or survey-response acquiescence. The \(\lambda\) contamination problem is as real for text as it is for a Likert scale — it is simply harder to see, because the “items” are words and phrases rather than numbered responses.

The factor analysis analogy: Topic modeling (LDA, STM) is to text what factor analysis is to Likert items. Both infer latent structure from observed indicators: in factor analysis the indicators are item scores; in topic modeling they are words and documents. Both ask: how many underlying dimensions are needed to explain the co-occurrence pattern? And both can be used to assess whether the constructs they uncover are specific enough to support causal claims.

In our green marketing scenario, suppose participants also wrote brief open-ended responses — what stood out about the product, whether and why they would consider buying it, and their overall brand impression. You fit a 3-topic STM model hoping to recover EC-text, GPI-text, and BA-text dimensions. The convergent and discriminant validity questions are then:

  • Convergent validity: Does the GPI-text topic score correlate most strongly with the GPI Likert scale, and less strongly with EC and BA? Does EC-text converge with EC-Likert?
  • Discriminant validity: Is the GPI-text topic’s vocabulary distinct from the EC-text topic’s vocabulary? Or is the supposed purchase-intention topic simply dominated by environmental language — “sustainable,” “eco,” “planet” — that belongs to the EC dimension?
▶ MTMM correlations: text topics vs. Likert scales (two scenarios)
# In practice you would:
#   1. Collect open-ended responses
#   2. Fit a 3-topic STM: stm(documents, vocab, K=3, prevalence=~condition, data=meta)
#   3. Extract per-participant topic proportions (the theta matrix)
#   4. Correlate theta columns with the corresponding Likert scale scores

# Below we simulate what those MTMM correlations look like
# under two scenarios: clean validity vs. discriminant validity failure

set.seed(42)
ec_score  <- rowMeans(df[, c("EC1","EC2","EC3","EC4")])
gpi_score <- rowMeans(df[, c("GPI1","GPI2","GPI3","GPI4")])
ba_score  <- rowMeans(df[, c("BA1","BA2","BA3")])

# Scenario A: text topics cleanly recover the three intended constructs
ec_text_A  <- 0.65 * ec_score  + 0.10 * gpi_score + 0.05 * ba_score + rnorm(400, 0, 0.5)
gpi_text_A <- 0.10 * ec_score  + 0.62 * gpi_score + 0.08 * ba_score + rnorm(400, 0, 0.5)
ba_text_A  <- 0.08 * ec_score  + 0.07 * gpi_score + 0.60 * ba_score + rnorm(400, 0, 0.5)

# Scenario B: GPI-text is dominated by environmental language (EC-contaminated)
ec_text_B  <- 0.65 * ec_score  + 0.10 * gpi_score + 0.05 * ba_score + rnorm(400, 0, 0.5)
gpi_text_B <- 0.58 * ec_score  + 0.20 * gpi_score + 0.07 * ba_score + rnorm(400, 0, 0.5) # ← failure
ba_text_B  <- 0.08 * ec_score  + 0.07 * gpi_score + 0.60 * ba_score + rnorm(400, 0, 0.5)

# MTMM correlation tables (text rows × Likert columns)
mtmm_A <- round(cor(
  data.frame(EC_Likert=ec_score, GPI_Likert=gpi_score, BA_Likert=ba_score,
             EC_text=ec_text_A, GPI_text=gpi_text_A, BA_text=ba_text_A)
)[4:6, 1:3], 2)

mtmm_B <- round(cor(
  data.frame(EC_Likert=ec_score, GPI_Likert=gpi_score, BA_Likert=ba_score,
             EC_text=ec_text_B, GPI_text=gpi_text_B, BA_text=ba_text_B)
)[4:6, 1:3], 2)

cat("Scenario A — Good discriminant validity (text topics recover constructs cleanly):\n")
Scenario A — Good discriminant validity (text topics recover constructs cleanly):
print(mtmm_A)
         EC_Likert GPI_Likert BA_Likert
EC_text       0.90       0.74      0.42
GPI_text      0.73       0.89      0.47
BA_text       0.50       0.52      0.85
cat("\nScenario B — Discriminant validity failure (GPI-text bleeds into EC):\n")

Scenario B — Discriminant validity failure (GPI-text bleeds into EC):
print(mtmm_B)
         EC_Likert GPI_Likert BA_Likert
EC_text       0.89       0.75      0.46
GPI_text      0.89       0.81      0.46
BA_text       0.47       0.52      0.85
Note: Reading the MTMM table

Each row is a text-derived measure; each column is a Likert-based measure. Diagonal entries (EC-text vs. EC-Likert, GPI-text vs. GPI-Likert) are convergent validity correlations — they should be the highest value in each row. Off-diagonal entries are discriminant validity checks — they should be noticeably lower.

Scenario A (valid): GPI-text correlates most strongly with GPI-Likert and clearly less with EC-Likert. The text topic is specific to purchase intention.

Scenario B (failure): GPI-text correlates almost as strongly with EC-Likert as with GPI-Likert. The topic model has identified a “purchase intention” topic dominated by environmental language. A treatment effect on this text-based GPI measure would be uninterpretable — you cannot tell whether the ad increased willingness to buy, or simply increased environmental-concern rhetoric in the responses.

Beyond MTMM correlations, two diagnostics are specific to topic models and map directly onto the Likert validity logic:

Topic exclusivity (analogue of item specificity in scale development): Does each topic use distinctive vocabulary, or do the EC-text and GPI-text topics share most of their high-probability words? In STM, the FREX metric combines exclusivity (words distinctive to this topic) with frequency (words common enough to matter). Low exclusivity for the GPI topic means its vocabulary is not separable from EC — the text analogue of GPI items cross-loading on the EC factor.

Topic correlations (analogue of the factor correlation φ): STM’s topicCorr() estimates correlations among topic proportions across respondents — the direct text analogue of the latent factor correlation in CFA. If the EC-text and GPI-text topic scores correlate at 0.93 across respondents, the discriminant validity failure is quantitatively the same as φ = 0.93 between the latent EC and GPI factors. Applying the Pieters et al. DVI logic, as the topic correlation approaches 1.0 the two topics become empirically interchangeable, and discriminant validity has failed in the same formal sense as φ approaching 1.0 between latent constructs.
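To make the topic-correlation check concrete without fitting a real topic model, the sketch below simulates a per-document topic-proportion matrix (the theta matrix an STM would return — with real data you would use the fitted model's proportions, or stm::topicCorr()) in which the EC and GPI topics ride on one shared rhetoric source, then computes their correlation, the text analogue of φ:

```r
# Hedged sketch: theta is simulated, not estimated -- all loadings are
# illustrative assumptions, not values from a fitted STM.
set.seed(99)
n_docs   <- 400
rhetoric <- rnorm(n_docs)                 # shared environmental-language source

raw <- cbind(
  ec_topic  = exp(1.0 * rhetoric + rnorm(n_docs, 0, 0.4)),
  gpi_topic = exp(0.9 * rhetoric + rnorm(n_docs, 0, 0.4)),  # contaminated topic
  ba_topic  = exp(rnorm(n_docs))
)
theta <- raw / rowSums(raw)               # each document's proportions sum to 1

# A high value here plays the same role as a high phi between latent factors
cor(theta[, "ec_topic"], theta[, "gpi_topic"])
```

Because both topics inherit most of their variance from the shared rhetoric source, the correlation comes out far above what distinct constructs would produce.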

Warning: The interpretive warning — same as Likert scales

In Scenario B, the GPI-text topic is dominated by environmental language. A treatment effect on this measure is not cleanly interpretable as an effect on purchase intention — it may equally reflect that the eco-friendly ad prompted more environmental concern language in the responses. This is precisely the same ambiguity as the Likert GPI scale failing discriminant validity from EC.

The data type changed. The problem did not.

Tip: Summary of validity assessment across data types

|                          | Likert scales                 | WTP                                  | Text                                     |
|--------------------------|-------------------------------|--------------------------------------|------------------------------------------|
| Convergent validity      | EFA loadings, CFA fit, α      | EFA/CFA on log-WTP                   | Text–Likert MTMM correlations            |
| Discriminant validity    | HTMT, DVI                     | CFA model comparison, falsification  | Topic exclusivity, topic correlations    |
| Gold standard design     | MTMM across methods           | MTMM (WTP + liking + choice)         | MTMM (text + Likert + behavior)          |
| Key failure mode         | EC–GPI cross-loading          | Price belief contamination           | Environmental language in GPI topic      |
| Interpretive consequence | Effect reflects EC and/or GPI | Effect reflects utility and/or price | Effect reflects concern and/or intention |

Regardless of data type, the critical thinking question is always the same: what else, besides my intended construct, could plausibly cause this observed measure to move?


Interlude: Reliability Is Not Validity

The Core Distinction

Case 1 showed how to detect discriminant validity failures using HTMT and DVI. But there is a related, subtler failure mode that trips up researchers constantly: a scale can be highly reliable and still be invalid.

Reliability measures consistency — do the items within a scale hang together? Cronbach’s alpha is the most common index. A high alpha (typically > 0.70) means the items are consistently measuring something, but says nothing about what that something is, or whether it can be distinguished from other constructs.

Discriminant validity asks a different question: is what you are measuring distinct from other constructs? A scale can achieve a Cronbach’s alpha of 0.95 and simultaneously have an HTMT of 1.20 — perfectly reliable at measuring something indistinguishable from an entirely different construct.

This distinction matters because most researchers check reliability and stop there. But:

  • A scale with high alpha and poor discriminant validity will produce inflated correlations between constructs — those correlations are partially measuring the same thing twice.
  • Mediation models and regression coefficients become uninterpretable: which construct is actually driving the effect?
  • Results replicate reliably — but are meaningless because the constructs were never distinct.

Why Cronbach’s Alpha Cannot Detect Discriminant Validity Failures

The reason is structural. Cronbach’s alpha is computed entirely within a scale:

\[\alpha = \frac{k\,\bar{\rho}}{1 + (k-1)\,\bar{\rho}}\]

where \(k\) is the number of items and \(\bar{\rho}\) is the average inter-item correlation within the scale. Notice what is absent: any reference to other constructs or between-scale correlations. Alpha is mathematically blind to discriminant validity by design.
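The blindness is visible if you type the formula in directly. This helper is a trivial restatement of the equation above, not a library function, and it has no argument through which a between-scale correlation could even enter:

```r
# Standardized alpha as a function of item count k and average within-scale
# inter-item correlation rho_bar -- nothing else can influence the result
alpha_std <- function(k, rho_bar) (k * rho_bar) / (1 + (k - 1) * rho_bar)

alpha_std(4, 0.50)   # ~0.80: four items, within-correlation 0.50
alpha_std(8, 0.50)   # ~0.89: same correlation, more items, higher alpha
```

Whatever the scale's overlap with a neighboring construct, these numbers do not change.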

The Item-Addition Problem (Spearman–Brown)

A compounding issue: adding items raises Cronbach’s alpha whenever the new items correlate with the existing ones, regardless of whether they tap your intended construct or something adjacent to it. This is the Spearman–Brown prophecy at work. Researchers often respond to low alpha by writing more items, and each new item pushes alpha higher. But if the new items bleed into an adjacent construct, discriminant validity erodes at exactly the same time that alpha improves.

The result: a scale that looks excellent by reliability standards (α = 0.92) while simultaneously failing every discriminant validity check.
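Under the simplifying assumption of uniform inter-item correlations, the divergence is easy to compute by hand: alpha depends on k, while the scale-level HTMT — the average between-scale correlation divided by the geometric mean of the two within-scale averages — does not contain k at all. A minimal sketch (the uniform-correlation case only; real HTMT averages over full item-level correlation matrices):

```r
alpha_std <- function(k, rho_within) (k * rho_within) / (1 + (k - 1) * rho_within)

# With uniform correlations and equal within-scale averages, HTMT reduces to
# the between/within ratio (the geometric mean of two equal numbers is itself)
htmt_uniform <- function(rho_within, rho_between) rho_between / rho_within

for (k in c(4, 8, 12)) {
  cat(sprintf("k = %2d   alpha = %.3f   HTMT = %.3f\n",
              k, alpha_std(k, 0.55), htmt_uniform(0.55, 0.50)))
}
```

With within-correlation 0.55 and between-correlation 0.50, alpha climbs from about 0.83 at k = 4 to about 0.94 at k = 12, while HTMT sits at 0.91 throughout.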

Interactive Simulator: Reliability vs. Discriminant Validity

Use the controls below to explore how Cronbach’s alpha and HTMT respond — independently — to changes in scale properties. The key insight: you can set alpha as high as you like without changing HTMT at all, and vice versa.

Tip: What to try
  1. Add items and watch alpha climb while HTMT stays flat. Increase items from 4 to 12. Alpha rises from moderate to excellent. HTMT does not move. Reliability and discriminant validity are measuring completely different things.
  2. The “looks great, is broken” scenario. Set: 8 items, within-correlation = 0.55, between-correlation = 0.50. Alpha = 0.91 (Excellent!). HTMT = 0.91 (Discriminant validity violated). This is not a contrived edge case — it is common in sustainability, well-being, and attitude research where adjacent constructs are inherently correlated.
  3. Set within-correlation = between-correlation. Alpha remains unchanged. HTMT reaches 1.0 — the two scales are statistically identical. High reliability, zero discriminant validity.
Figure 5.1: Cronbach’s alpha climbs with each additional item (blue). HTMT is completely unaffected (red). Dashed lines show common decision thresholds. The two metrics are measuring entirely different things.
Warning: The bottom line

A high Cronbach’s alpha confirms your items are consistently measuring something. It does not confirm that the something is your intended construct, or that it can be distinguished from adjacent constructs. Discriminant validity testing is not optional — it is a prerequisite for interpreting any correlation, regression, or mediation involving Likert-scale constructs.


Researcher Checklist: Discriminant Validity

NoteKey questions before using your scales