# Load required packages
library(lavaan); library(semTools); library(MASS); library(ggplot2)
library(dplyr); library(tidyr); library(knitr)

In Part 1 of this module we introduced the Classical Test Theory (CTT) decomposition that underlies all measurement in the behavioral sciences:
\[X_{\text{obs}} = T + \varepsilon\]
where \(X_{\text{obs}}\) is the observed score, \(T\) is the respondent’s true latent value on the construct of interest, and \(\varepsilon\) is measurement error. The entire edifice of psychometric measurement — reliability, validity, structural equation modeling — rests on one fundamental assumption about \(\varepsilon\): it is symmetric, centered at zero, with equal probability of positive and negative deviations. This symmetry is what allows the observed mean to function as an unbiased estimator of the true latent mean: the errors cancel in expectation.
This assumption is testable. In a well-specified measurement model, the estimated residuals \(\hat{\varepsilon}_i = X_{\text{obs},i} - \hat{T}_i\) should be symmetrically distributed — mean zero, skewness near zero. A simple Q-Q plot of residuals, D’Agostino’s skewness test, or the Jarque–Bera test can detect systematic asymmetry. If the residuals are skewed, the symmetry assumption is violated and the observed mean is no longer an unbiased proxy for the true latent mean — it will be pulled toward the center of the scale (ceiling effects bias means downward; floor effects bias them upward).
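As a starting point, here is a minimal diagnostic sketch in R. It assumes the estimated residuals are already available as a numeric vector (a placeholder is simulated below); the moments package supplies the two tests and is an extra dependency beyond the packages loaded above.

# Sketch: checking residual symmetry (assumes residuals are available as a
# numeric vector; the moments package is an extra dependency)
library(moments)
set.seed(42)
resid_hat <- rnorm(500)                # placeholder -- substitute your model residuals

qqnorm(resid_hat); qqline(resid_hat)   # visual check: asymmetry bends the tails
skewness(resid_hat)                    # should be near 0 under the symmetry assumption
agostino.test(resid_hat)               # D'Agostino's skewness test
jarque.test(resid_hat)                 # Jarque-Bera test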
The symmetry assumption is more often assumed than checked. One important way it fails is when participants are constrained in the range of responses they can express — most commonly because the scale itself runs out of room. The mechanism described below assumes a truncated latent-score model: the construct has a continuous latent distribution, and the observed score is that latent value clipped to the scale’s endpoints. This is the right model for many bounded continuous measures (sliders, 0–100 ratings, VAS scales), though different scale types — ordinal Likert items, count outcomes, or corner-solution variables with genuine mass at a bound — follow a different data-generating process and may require different diagnostics.
Consider a respondent whose true score is 90 on a 0–100 scale, with genuine response variability of ±15 points. Downward deviations are unconstrained: they can score 75, 80, 85. But upward deviations hit the ceiling: a genuine tendency to score 105 is recorded as 100. Under this truncated model, the error distribution is right-truncated. The respondent’s \(\varepsilon\) is no longer symmetric — it is negatively skewed, and every deviation toward the ceiling is suppressed.
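The asymmetry is easy to make concrete. A simulation sketch using the numbers from this paragraph (true score 90, SD 15, 0–100 bounds applied by clipping):

# Sketch: error asymmetry for a respondent with true score 90, SD 15
set.seed(1)
true_score <- 90
latent   <- true_score + rnorm(1e5, mean = 0, sd = 15)  # unconstrained tendency
observed <- pmin(pmax(latent, 0), 100)                  # scale clips at the bounds
errors   <- observed - true_score

mean(errors)                                         # negative: upward deviations suppressed
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3  # simple moment-based skewness
skew(errors)                                         # clearly negative, not symmetric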
This is not a psychometric failure or a respondent quirk. Under a truncated latent-score DGP, it is a structural consequence of bounded measurement that grows more severe the closer the true score sits to the ceiling or floor. The aggregate result — biased observed means, compressed variances, and skewed residuals — is what this part of the module explores. The takeaway: the symmetry of \(\varepsilon\) should be checked, not assumed, and any researcher using bounded continuous scales should inspect the residual distribution before treating observed means as unbiased.
Almost every scale used in behavioral research has bounds — rating scales run from 1 to 7, percentages from 0 to 100, frequency measures from “never” to “always.” These bounds are usually treated as a minor formatting detail. In practice, they introduce a systematic bias that most researchers never think about.
Bounded scales are ubiquitous across the social, health, economic, and statistical sciences — and so is the bias they create. But the problem appears under different names depending on the discipline: ceiling and floor effects in psychometrics, corner solutions and censored data in econometrics, bounded outcomes in health statistics, item difficulty effects in item response theory. Despite the different terminology, the underlying mathematics is identical: a hard limit on what a scale can record truncates the true distribution, and the truncated mean departs systematically from the truth.
This measurement-layer problem cascades upward through the rest of the inferential chain. Biased observed means corrupt hypothesis tests, because what is being tested is not the construct being theorized about. Those corrupted tests in turn compromise causal inference, because treatment-effect estimates compare biased observations rather than true latent values. Consider a control group with true mean 50 and a treated group with true mean 75, both on a 0–100 scale with SD 20: the treated group's distribution is clipped more severely because it sits closer to the ceiling, so the observed difference systematically understates the true effect, and the shortfall grows as the treated mean approaches the bound. In the extreme, when both groups press against the ceiling, a genuine effect can be masked entirely. The measurement layer is not preliminary housekeeping; it shapes every conclusion drawn above it.
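A quick simulation sketch of the two-group example above (true means 50 vs 75, SD 20, true effect of 25 points) shows the differential clipping directly:

# Sketch: differential truncation in a two-group comparison
set.seed(2)
n <- 1e5
control <- pmin(pmax(rnorm(n, 50, 20), 0), 100)
treated <- pmin(pmax(rnorm(n, 75, 20), 0), 100)

mean(control) - 50                    # ~0: 50 sits 2.5 SD from both bounds
mean(treated) - 75                    # negative: 75 sits only 1.25 SD from the ceiling
(mean(treated) - mean(control)) - 25  # observed effect falls short of the true 25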
Here is the mechanism under a truncated latent-score model. Suppose a participant’s true level of some construct is 85 on a 0–100 scale, and they have a genuine tendency to vary ±15 points around that true value due to measurement noise (mood, attention, question framing). On the low side, they can express that variation freely — they might score 70, 75, 80. On the high side, they hit a ceiling. Scores of 95, 100, 105 are impossible — those attempts to go higher just pile up at 100. The result: the observed distribution of scores is not centered at 85. It is pulled toward the center of the scale. The ceiling cuts off the upper tail, making the observed mean lower than the true mean.
The same thing happens at the lower bound: if the true value is close to the floor, the floor cuts off the lower tail and pulls the observed mean upward. Both effects push observations toward the middle of the scale. (Note: for ordinal Likert items, the response function is categorical rather than continuous-truncation, so the magnitude and direction of bias can differ.)
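For a normal latent score, the size of this pull has a closed form. A sketch using the ceiling example above (true 85, SD 15, ceiling 100); the floor at 0 is roughly 5.7 SD away and is ignored here:

# Sketch: closed-form mean of a normal latent score censored at a ceiling c
censored_mean <- function(mu, sigma, c) {
  a <- (c - mu) / sigma
  # E[min(X, c)] for X ~ N(mu, sigma^2)
  mu * pnorm(a) - sigma * dnorm(a) + c * (1 - pnorm(a))
}
censored_mean(mu = 85, sigma = 15, c = 100)   # ~83.7: observed mean below the true 85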
How severe the bias is, and in which direction it points, depends on the topology of the scale's bounds. Three cases:
Under a truncated latent-score DGP, scales bounded on both ends (0–100 slider, bounded continuous rating) produce symmetric pull toward the center: ceiling effects at the top and floor effects at the bottom, each biasing estimates toward the midpoint. Ordinal Likert scales share this topology but follow a different response function — the bias pattern is qualitatively similar but the magnitude depends on item thresholds, not just the raw bounds.
Scales bounded on one end only — like response time (lower bound at 0, no upper bound) or non-negative count data — produce asymmetric pull toward the bounded end only. The unbounded tail is free; the bounded tail is clipped. For corner-solution outcomes (variables with genuine probability mass at zero, such as dollars spent or number of purchases), the econometric treatment differs: a Tobit or hurdle model is more appropriate than a truncation correction.
For right-skewed continuous distributions common in bounded-at-zero measures (reaction time, income, sales), the floor effect is usually the active constraint, and under truncation the bias pushes mean estimates upward.
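The one-sided case is easy to verify. A minimal sketch, assuming a latent normal with genuine mass below zero recorded on a floor-at-zero scale:

# Sketch: a one-sided floor clips only the lower tail, pushing the mean upward
set.seed(3)
latent   <- rnorm(1e5, mean = 2, sd = 5)  # latent scores with genuine mass below 0
observed <- pmax(latent, 0)               # floor at 0, no ceiling
mean(observed) - mean(latent)             # positive: bias toward the bounded end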
The simulator below shows the observed (truncated) distribution — what researchers actually record after the scale bounds clip the tails of the true distribution. Vertical reference lines mark the true mean (dashed black), the observed mean (solid blue), and the observed median (dashed red). The statistics panel shows the true and observed mean, SD, and skewness — each with its bias relative to the true value. As you move the true mean toward a bound or increase the SD, watch how the observed distribution narrows (SD compression), shifts (mean bias), and tilts (skewness) — and how the mean–median gap signals the direction of the skew. This is the CTT symmetry assumption failing in real time.
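The interactive simulator cannot run in a static document, but a minimal ggplot2 sketch reproduces the same view; all numbers here are illustrative:

# Sketch: static version of the simulator view (ggplot2 is loaded above)
set.seed(4)
true_mean <- 85; true_sd <- 15
obs <- pmin(pmax(rnorm(5000, true_mean, true_sd), 0), 100)

ggplot(data.frame(score = obs), aes(score)) +
  geom_histogram(binwidth = 2, fill = "grey85", colour = "grey40") +
  geom_vline(xintercept = true_mean, linetype = "dashed", colour = "black") +  # true mean
  geom_vline(xintercept = mean(obs), colour = "blue") +                        # observed mean
  geom_vline(xintercept = median(obs), linetype = "dashed", colour = "red") +  # observed median
  labs(x = "Observed score (0-100)", y = "Count",
       title = "Observed distribution after the bounds clip the tails")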
Means are biased toward the scale midpoint. When comparing groups or conditions, if one group has a true mean closer to a bound than the other, the truncation bias differs between groups — the apparent treatment effect is distorted by the scale itself, not just the construct. This is particularly dangerous in pre–post designs, where an intervention may move participants toward a ceiling, making the post-treatment distribution more severely clipped and therefore making the estimated gain look smaller than it truly is.
The problem gets worse as variability increases. High within-person variability (which researchers often interpret as low reliability) amplifies truncation bias. Ironically, a noisy measure produces more bias, not just more noise. This also means that between-group comparisons of variance — common in psychometric analyses — are confounded whenever the groups sit at different distances from a bound.
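This amplification can be tabulated directly, reusing the censored_mean() helper sketched earlier (true mean fixed at 80 on a 0–100 scale; the distant floor is again ignored):

# Sketch: truncation bias grows with within-person SD
sds  <- c(5, 10, 15, 20, 25)
bias <- sapply(sds, function(s) censored_mean(mu = 80, sigma = s, c = 100) - 80)
round(data.frame(sd = sds, bias = bias), 2)   # more noise, more (not just noisier) bias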
Transformations can help but do not fully eliminate the problem. Logit-transforming proportions (e.g., log(p/(1-p))) or arcsine-transforming bounded scores can partially stabilize the bias. Tobit regression (Wooldridge, 2010) models the censoring explicitly and estimates the latent mean directly; it was developed in econometrics for exactly this structure, where household expenditure piles up at zero for non-buyers and is positive for buyers.
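A minimal Tobit sketch, assuming the AER package (an extra dependency, not loaded above); an intercept-only two-limit model recovers the latent mean from clipped scores:

# Sketch: two-limit Tobit on clipped 0-100 scores (AER is an extra dependency)
library(AER)
set.seed(5)
y <- pmin(pmax(rnorm(2000, mean = 85, sd = 15), 0), 100)

mean(y)                                    # biased toward the midpoint
fit <- tobit(y ~ 1, left = 0, right = 100)
coef(fit)                                  # intercept estimates the latent mean (~85)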
The problem has different names across disciplines — but the same solution. In econometrics it is censored data; in health research it is ceiling and floor effects in patient-reported outcome (PRO) instruments; in psychometrics it appears as item difficulty in item response theory; in statistics it is truncated distribution estimation. Each field has converged on similar remedies: model the latent distribution explicitly (Tobit / truncated regression), use response scales whose range substantially exceeds the likely spread of true scores, and — as a minimum — report the proportion of respondents at or near each bound before interpreting means and treatment comparisons.
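The minimum reporting step is a one-liner. Here bound_share() is a hypothetical helper, applied to the clipped scores y simulated in the Tobit sketch above; tol defines "near" as 2% of the scale range and is an illustrative choice:

# Sketch: share of responses at or near each bound
bound_share <- function(x, lo, hi, tol = 0.02 * (hi - lo)) {
  c(at_floor = mean(x <= lo + tol), at_ceiling = mean(x >= hi - tol))
}
bound_share(y, lo = 0, hi = 100)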
Pieters, R., Srivastava, J., & Bagchi, R. (2025). Improving the discriminant validation of multi-item scales. Journal of Marketing Research. https://doi.org/10.1177/00222437251322089
Henseler, J., Ringle, C. M., & Sarstedt, M. (2015). A new criterion for assessing discriminant validity in variance-based structural equation modeling. Journal of the Academy of Marketing Science, 43(1), 115–135.
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factor loading and intercept invariance on selection in two populations. Psychological Methods, 9(1), 93–115.
Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. ACM SIGMOD Record, 29(2), 93–104.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). MIT Press.