▶ Load required packages
library(lavaan); library(semTools); library(MASS); library(ggplot2)
library(dplyr); library(tidyr); library(knitr)

Almost every scale used in behavioral research has bounds — rating scales run from 1 to 7, percentages from 0 to 100, frequency measures from “never” to “always.” These bounds are usually treated as a minor formatting detail. In practice, they introduce a systematic bias that most researchers never think about.
Bounded scales are ubiquitous across the social, health, economic, and statistical sciences — and so is the bias they create. But the problem appears under different names depending on the discipline: ceiling and floor effects in psychometrics, corner solutions and censored data in econometrics, bounded outcomes in health statistics, item difficulty effects in item response theory. Despite the different terminology, the underlying mathematics is identical: a hard limit on what a scale can record truncates the true distribution, and the truncated mean departs systematically from the truth.
This measurement-layer problem cascades upward through the rest of the inferential chain. Biased observed means corrupt hypothesis tests — because what is being tested is not the construct being theorized about. Those corrupted tests in turn compromise causal inference — because treatment-effect estimates compare biased observations rather than true latent values. A researcher who finds a null result comparing a control group (true mean 50) to a treated group (true mean 75, both on a 0–100 scale with SD 20) may be observing a genuine effect masked by differential truncation bias: the treated group’s distribution is clipped more severely because it sits closer to the ceiling. The measurement layer is not preliminary housekeeping — it shapes every conclusion drawn above it.
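The masked-effect scenario above can be checked with a quick simulation — a minimal sketch using the text's illustrative numbers (control true mean 50, treated true mean 75, SD 20, 0–100 scale), and assuming out-of-range values pile up at the bounds:

```r
# Two groups on a 0-100 scale: control true mean 50, treated true mean 75,
# both SD 20. Draws beyond the bounds pile up at the bounds, clipping the
# treated group more severely because it sits closer to the ceiling.
set.seed(42)
n <- 1e5
control <- pmin(pmax(rnorm(n, mean = 50, sd = 20), 0), 100)
treated <- pmin(pmax(rnorm(n, mean = 75, sd = 20), 0), 100)
effect <- mean(treated) - mean(control)
effect  # noticeably smaller than the true 25-point difference
```

The observed gap falls short of 25 points even with a huge sample, because the shortfall comes from the scale, not from sampling error.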
Here is the mechanism. Suppose a participant’s true level of some construct is 85 on a 0–100 scale, and they have a genuine tendency to vary ±15 points around that true value due to measurement noise (mood, attention, question framing). On the low side, they can express that variation freely — they might score 70, 75, 80. On the high side, they hit a ceiling. Scores of 95, 100, 105 are impossible — those attempts to go higher just pile up at 100. The result: the observed distribution of scores is not centered at 85. It is pulled toward the center of the scale. The ceiling cuts off the upper tail, making the observed mean lower than the true mean.
The same thing happens at the lower bound: if the true value is close to the floor, the floor cuts off the lower tail and pulls the observed mean upward. Both effects push observations toward the middle of the scale.
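A short simulation makes the pull concrete — a sketch using the text's numbers (true score 85, the ±15 variation modeled as normal noise with SD 15), again assuming out-of-range draws pile up at the bounds:

```r
# True score 85 on a 0-100 scale, normal noise with SD 15. Draws beyond
# the bounds pile up at 0 and 100, so the observed mean is pulled below
# the true mean by the clipped upper tail.
set.seed(1)
true_scores <- rnorm(1e5, mean = 85, sd = 15)
observed <- pmin(pmax(true_scores, 0), 100)
m_true <- mean(true_scores)  # close to 85
m_obs  <- mean(observed)     # pulled toward the scale midpoint
```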
The direction and severity of this bias depend on how the scale is bounded:
Scales bounded on both ends (1–7 Likert, 0–100 slider) produce symmetric pull toward the center: ceiling effects at the top and floor effects at the bottom, each biasing estimates toward the midpoint.
Scales bounded on one end only — like response time (lower bound at 0, no upper bound), log-transformed variables, or count data — produce asymmetric pull toward the bounded end only. The unbounded tail is free; the bounded tail is clipped. This asymmetry means the direction of bias depends on which bound participants approach.
For right-skewed distributions common in bounded-at-zero measures (reaction time, income, sales), the floor effect is usually the active constraint, and the bias pushes mean estimates upward.
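The one-sided case can be sketched the same way — here with an illustrative true value of 2 and noise SD of 3 on a floor-at-zero measure, so only the lower tail is clipped:

```r
# One-sided bound: true value 2, noise SD 3, floor at 0, no ceiling.
# Only the lower tail piles up at the bound, so the observed mean is
# pulled upward, away from the floor.
set.seed(11)
x <- rnorm(1e5, mean = 2, sd = 3)
m_floor <- mean(pmax(x, 0))  # above the true value of 2
```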
The simulator below overlays two density curves. The gray shaded curve is the true distribution — what scores would look like if the scale had no bounds. The blue shaded curve is the observed distribution — what researchers actually record after the bounds clip the tails. Vertical reference lines mark the true mean (dashed black), the observed mean (solid blue), and the observed median (dashed red). Adjust the sliders to see how the gap between true and observed means grows as the true mean approaches a bound or as variability increases.
Means are biased toward the scale midpoint. When comparing groups or conditions, if one group has a true mean closer to a bound than the other, the truncation bias differs between groups — the apparent treatment effect is distorted by the scale itself, not just the construct. This is particularly dangerous in pre–post designs, where an intervention may move participants toward a ceiling, making the post-treatment distribution more severely clipped and therefore making the estimated gain look smaller than it truly is.
The problem gets worse as variability increases. High within-person variability (which researchers often interpret as low reliability) amplifies truncation bias. Ironically, a noisy measure produces more bias, not just more noise. This also means that between-group comparisons of variance — common in psychometric analyses — are confounded whenever the groups sit at different distances from a bound.
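The variability effect is easy to verify: hold the true mean fixed near the ceiling and vary the noise SD (a sketch with assumed values; the true mean of 85 matches the earlier example):

```r
# Fix the true mean at 85 on a 0-100 scale and vary the noise SD:
# the bias of the observed mean grows (more negative) with variability.
set.seed(7)
bias_at <- function(noise_sd, mu = 85, n = 1e5) {
  x <- rnorm(n, mean = mu, sd = noise_sd)
  mean(pmin(pmax(x, 0), 100)) - mu
}
biases <- sapply(c(5, 10, 15, 20), bias_at)
biases  # increasingly negative as SD grows
```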
Transformations can help but do not fully eliminate the problem. Logit-transforming proportions (e.g., log(p/(1-p))) or arcsine-transforming bounded scores can partially stabilize the bias. Tobit regression (Wooldridge, 2010) models the censored distribution explicitly and estimates the latent mean directly — it was developed in econometrics for exactly this structure, where household expenditure is zero for non-buyers and positive (and therefore bounded below) for buyers.
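The Tobit idea can be sketched without any specialized package by maximizing the censored-normal log-likelihood directly with `optim` — a hand-rolled illustration under assumed values (latent N(85, 15), ceiling at 100), not a substitute for packaged routines such as those in the AER or survival packages:

```r
# Ceiling-censored data: latent N(85, 15), recorded as min(latent, 100).
# Uncensored observations contribute the normal density; censored ones
# contribute the probability of exceeding the ceiling. Maximizing this
# likelihood recovers the latent mean that the naive mean underestimates.
set.seed(3)
latent   <- rnorm(5000, mean = 85, sd = 15)
observed <- pmin(latent, 100)            # right-censored at the ceiling

negll <- function(par) {
  mu <- par[1]
  sd <- exp(par[2])                      # log-scale keeps sd positive
  cens <- observed >= 100
  ll_uncens <- dnorm(observed[!cens], mu, sd, log = TRUE)
  ll_cens   <- pnorm(100, mu, sd, lower.tail = FALSE, log.p = TRUE)
  -(sum(ll_uncens) + sum(cens) * ll_cens)
}

fit <- optim(c(mean(observed), log(sd(observed))), negll)
mu_hat <- fit$par[1]           # close to the latent mean of 85
sd_hat <- exp(fit$par[2])      # close to the latent SD of 15
```

The naive `mean(observed)` sits below 85, while the censored-likelihood estimate recovers the latent mean — the same correction a packaged Tobit routine performs.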
The problem has different names across disciplines — but the same solution. In econometrics it is censored data; in health research it is ceiling and floor effects in patient-reported outcome (PRO) instruments; in psychometrics it appears as item difficulty in item response theory; in statistics it is truncated distribution estimation. Each field has converged on similar remedies: model the latent distribution explicitly (Tobit / truncated regression), use response scales whose range substantially exceeds the likely spread of true scores, and — as a minimum — report the proportion of respondents at or near each bound before interpreting means and treatment comparisons.
Pieters, R., Srivastava, J., & Bagchi, R. (2025). Improving the discriminant validation of multi-item scales. Journal of Marketing Research. https://doi.org/10.1177/00222437251322089
Henseler, J., Ringle, C. M., & Sarstedt, M. (2015). A new criterion for assessing discriminant validity in variance-based structural equation modeling. Journal of the Academy of Marketing Science, 43(1), 115–135.
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factor loading and intercept invariance on selection in two populations. Psychological Methods, 9(1), 93–115.
Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. ACM SIGMOD Record, 29(2), 93–104.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). MIT Press.