
▶ Load required packages
# Uncomment to install if needed:
# install.packages(c("lavaan", "semTools", "MASS", "ggplot2",
#                    "dplyr", "tidyr", "corrplot", "knitr", "lmtest",
#                    "mclust", "dbscan"))

library(lavaan)       # CFA and SEM (Cases 1 and 3)
library(semTools)     # htmt() and auxiliary SEM tools (Cases 1 and 3)
library(MASS)         # mvrnorm(): generate multivariate normal data (all cases)
library(ggplot2)      # Visualizations (all cases)
library(dplyr)        # Data manipulation (all cases)
library(tidyr)        # Data reshaping (Cases 2 and 3)
library(corrplot)     # Correlation heatmap (Case 1)
library(knitr)        # Nicely formatted tables (all cases)
library(lmtest)       # Breusch-Pagan heteroskedasticity test (Case 2)
library(mclust)       # Gaussian mixture models / latent class analysis (Case 4)
library(dbscan)       # Local outlier factor and DBSCAN clustering (Case 5)

8 Part 5: Outlier Detection and Robust Estimation

8.1 Connecting to the Earlier Cases

Part 4 showed what happens when your sample contains two fundamentally different types of respondents. Part 5 extends this logic to observational data — specifically, to the problem of outliers.

Not all outliers are the same. Two important distinctions that are easy to conflate:

Statistical outliers vs. conceptual outliers

A statistical outlier is any observation that is unusual relative to the rest of the sample — it has a high LOF score, a large residual, or sits far from the regression line. The statistical definition is purely data-driven and says nothing about whether the observation represents a problem.

A conceptual outlier is an observation that comes from a fundamentally different data-generating process than the rest of your sample. Luxury flagship stores in a dataset of convenience stores are conceptual outliers: they do not just deviate from the regression line, they operate on entirely different economics. The same statistical method (advertising spend → sales) applies, but the parameters are different.

Why the distinction matters: Statistical outliers that are not conceptual outliers (a data entry error, an extreme but real observation) should be checked, corrected if erroneous, and handled with robust methods. Conceptual outliers that are also statistical outliers are a sampling problem, not a statistical noise problem — they reveal that your sample is actually a mixture of populations. The right response is Case 4 logic (latent subgroup analysis), not just downweighting the extreme values.

Within the statistical outlier category, two types behave differently:

  • Collective outliers are a group of observations that form their own cluster, far from the rest of the data. They usually represent a different population — like luxury flagship stores in a dataset of regular convenience stores. These are a version of the latent-subgroup problem from Part 4: if you include them, your regression line tries to average over two different data-generating processes. These are the conceptual outliers described above.

  • Point outliers (or global outliers) are individual observations that are simply extreme on one or more variables. They may represent data errors, rare events, or genuinely unusual cases. They pull the regression line toward them and inflate standard errors. These are more likely to be pure statistical outliers — unusual, but not necessarily from a different population.

Both types distort your results — but in different ways, and the solutions are different.

Note: The connection to discriminant validity and latent subgroups

Collective outliers are, structurally, the same problem as latent subgroups. A cluster of luxury flagship stores isn’t “wrong data” — it’s a different population with a different data-generating process. Including them in your pooled model is like combining Green Champions and Price Skeptics: the resulting coefficient doesn’t accurately describe either group.

Point outliers, by contrast, are statistical problems rather than conceptual ones. A single data entry error or an unusually extreme store doesn’t imply a hidden population — it’s just a disruptive data point that needs to be handled carefully.

8.2 Simulating the Dataset

We create a store-level dataset with three groups:

  • Regular stores (n=500): Standard advertising–sales relationship
  • Luxury flagships (n=8): Collective outliers — much higher advertising and sales, forming their own cluster
  • Point outliers (n=5): Individual stores with implausibly extreme values
▶ Simulate store dataset with collective and point outliers
set.seed(2025)

# ── Regular stores ─────────────────────────────────────────────────────────────
n_reg <- 500
adv_reg   <- rnorm(n_reg, mean = 5, sd = 1.5)
sales_reg <- 0.4 * adv_reg + rnorm(n_reg, sd = 1.2)

# ── Luxury flagship stores (collective outliers) ───────────────────────────────
# They advertise much more AND have much higher sales — different economic model
n_lux <- 8
adv_lux   <- rnorm(n_lux, mean = 18, sd = 1.2)
sales_lux <- 0.3 * adv_lux + rnorm(n_lux, sd = 0.8) + 10   # elevated baseline

# ── Point outliers (individual extreme observations) ───────────────────────────
n_pt  <- 5
adv_pt   <- c(22, 1, 3, 19, 2)          # extreme advertising values
sales_pt <- c(2, 18, -1, 25, 15)        # extreme / inconsistent sales

# ── Combine into single dataset ────────────────────────────────────────────────
df5 <- data.frame(
  advertising = c(adv_reg,    adv_lux,    adv_pt),
  sales       = c(sales_reg,  sales_lux,  sales_pt),
  store_type  = c(rep("Regular",   n_reg),
                  rep("Luxury flagship", n_lux),
                  rep("Point outlier",   n_pt))
)

8.3 Step 1: Visualising the Three Types of Stores

▶ Plot: stores colored by true type
ggplot(df5, aes(x = advertising, y = sales, colour = store_type, shape = store_type)) +
  geom_point(aes(size = store_type), alpha = 0.7) +
  geom_smooth(data = filter(df5, store_type == "Regular"),
              method = "lm", se = TRUE, colour = "#4575b4",
              linewidth = 1.0, inherit.aes = FALSE,
              aes(x = advertising, y = sales)) +
  scale_colour_manual(
    values = c("Regular"         = "#4575b4",
               "Luxury flagship" = "#d73027",
               "Point outlier"   = "#1a9641"),
    name   = "Store type"
  ) +
  scale_shape_manual(
    values = c("Regular" = 16, "Luxury flagship" = 17, "Point outlier" = 8),
    name   = "Store type"
  ) +
  scale_size_manual(
    values = c("Regular" = 1.5, "Luxury flagship" = 3.5, "Point outlier" = 4),
    name   = "Store type"
  ) +
  labs(
    x        = "Advertising Spend (standardised)",
    y        = "Sales (standardised)",
    title    = "Three Types of Observations in the Dataset",
    subtitle = "Luxury flagships form their own cluster; point outliers are scattered extremes"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Warning: What to notice
  • Regular stores (blue): tight cluster with a clear advertising→sales relationship. This is the population you actually want to model.
  • Luxury flagships (red triangles): a separate cluster in the upper right. They form a coherent group — this is the collective outlier problem. Including them will rotate your regression line toward theirs.
  • Point outliers (green stars): scattered in unusual positions — very high advertising with low sales, or very low advertising with high sales. These are individually anomalous.

8.4 Step 2: Detecting Outliers with Local Outlier Factor (LOF)

Local Outlier Factor (LOF) is a density-based method: it compares how densely packed each observation’s neighbourhood is relative to its neighbours’ neighbourhoods. If a point is in a sparse region surrounded by denser regions, it gets a high LOF score.

  • LOF ≈ 1: Normal point (density similar to neighbours)
  • LOF > 2–3: Moderate outlier (noticeably sparser than surroundings)
  • LOF >> 3: Strong outlier
▶ Compute LOF scores and visualize
# Standardise the variables before computing LOF
df5_scaled <- data.frame(
  advertising = scale(df5$advertising),
  sales       = scale(df5$sales)
)

# Compute LOF scores (minPts = number of neighbours to consider)
df5$lof_score <- dbscan::lof(df5_scaled, minPts = 10)

# Visualise LOF scores
ggplot(df5, aes(x = advertising, y = sales,
                colour = lof_score, size = lof_score)) +
  geom_point(alpha = 0.8) +
  scale_colour_gradient2(low = "#4575b4", mid = "#ffffbf", high = "#d73027",
                         midpoint = 2, name = "LOF score") +
  scale_size_continuous(range = c(1, 5), name = "LOF score") +
  labs(
    x        = "Advertising Spend",
    y        = "Sales",
    title    = "Local Outlier Factor (LOF) Scores",
    subtitle = "Warmer colour and larger point = more outlier-like"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")

▶ Table: top outliers by LOF score
# Show the top outliers by LOF score
df5 |>
  arrange(desc(lof_score)) |>
  select(advertising, sales, store_type, lof_score) |>
  head(15) |>
  mutate(across(where(is.numeric), \(x) round(x, 2))) |>
  kable(
    col.names = c("Advertising", "Sales", "Store Type", "LOF Score"),
    caption   = "Top 15 Observations by LOF Score (highest = most outlier-like)"
  )
Top 15 Observations by LOF Score (highest = most outlier-like)

| Advertising | Sales | Store Type      | LOF Score |
|------------:|------:|-----------------|----------:|
|        1.00 | 18.00 | Point outlier   |     11.69 |
|        2.00 | 15.00 | Point outlier   |     10.76 |
|       22.00 |  2.00 | Point outlier   |      5.01 |
|        1.06 | -3.10 | Regular         |      2.40 |
|        8.16 |  6.12 | Regular         |      2.04 |
|        2.76 |  5.03 | Regular         |      1.90 |
|        3.77 |  5.29 | Regular         |      1.77 |
|       15.51 | 14.46 | Luxury flagship |      1.71 |
|       17.27 | 14.67 | Luxury flagship |      1.71 |
|       17.12 | 14.42 | Luxury flagship |      1.71 |
|       17.14 | 15.15 | Luxury flagship |      1.71 |
|       16.29 | 13.75 | Luxury flagship |      1.70 |
|       17.04 | 13.26 | Luxury flagship |      1.70 |
|        8.73 |  2.20 | Regular         |      1.68 |
|        1.52 | -1.56 | Regular         |      1.61 |
Note: Reading the LOF scores — and what to do with them

What LOF is telling you: Each store’s LOF score compares how densely packed its neighborhood is relative to its neighbors’ neighborhoods. A score near 1.0 means the store is in a region as dense as its surroundings — it looks like a typical store. A score of 2 or 3 means the store is in a region roughly 2–3 times sparser than its neighbors’ neighborhoods — it is noticeably isolated. Very high scores (5+) indicate severe outliers.

The critical point about classification: LOF scores do NOT give you clean “outlier” vs “not outlier” categories. There is no universal threshold that separates outliers from normal observations — you have to choose one, and that choice involves judgment. Common practice is to flag observations with LOF > 2 or LOF > 3 for investigation, but these cutoffs are conventions, not laws of nature.

Practical steps for each high-LOF observation:

  1. Look at the observation. What are its actual values? Does anything look unusual?
  2. Is it a data quality issue (entry error, unit mismatch)? Fix or remove.
  3. Is it a legitimately extreme but real observation (a genuinely exceptional store)? Keep and investigate — it may be telling you something important about your data.
  4. Does it form part of a coherent group of unusual observations? If yes, that’s a collective outlier — treat it as a Part 4 problem (latent subgroup), not a Part 5 problem.

Never delete observations just because they have high LOF scores. LOF is a signal to investigate, not an automatic deletion criterion.
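The investigation workflow above can be sketched in a few lines. This is a minimal example, assuming the `df5` data frame with its `lof_score` column from the chunk above; the 2.5 cutoff is purely illustrative, not a recommended threshold.

```r
# Flag observations above an (illustrative) LOF cutoff for manual review
flagged <- df5 |>
  filter(lof_score > 2.5) |>
  arrange(desc(lof_score))

# Inspect the actual values of each flagged store before deciding anything
print(flagged)

# Because this is simulated data, we can cross-check against the true type;
# in real data you would investigate each row substantively instead
table(flagged$store_type)
```

The point of the cross-tab is diagnostic, not decisive: a threshold that catches all point outliers here may behave differently on your data.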

8.5 Step 3: DBSCAN Clustering to Find the Collective Outlier Cluster

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of any shape in your data, and labels anything too sparse to belong to a cluster as noise (cluster 0). This is useful for finding collective outliers: they form their own cluster (separate from the main group), while true point outliers get labelled as noise.

▶ DBSCAN clustering — find collective outlier cluster
# Fit DBSCAN
# eps: maximum distance to be considered a "neighbour"
# minPts: minimum cluster size
dbscan_fit <- dbscan::dbscan(df5_scaled, eps = 0.4, minPts = 4)
df5$cluster <- factor(dbscan_fit$cluster)

ggplot(df5, aes(x = advertising, y = sales,
                colour = cluster, shape = cluster)) +
  geom_point(size = 2.5, alpha = 0.8) +
  scale_colour_manual(
    values = c("0" = "#d73027",   # noise = red (point outliers)
               "1" = "#4575b4",   # main cluster = blue
               "2" = "#1a9641",   # luxury cluster = green
               "3" = "#fdae61"),  # any additional cluster
    labels = c("0" = "Noise / point outlier",
               "1" = "Main cluster (regular stores)",
               "2" = "Collective outlier cluster (luxury)",
               "3" = "Additional cluster"),
    name   = "DBSCAN cluster",
    drop   = FALSE
  ) +
  scale_shape_manual(
    values = c("0" = 8, "1" = 16, "2" = 17, "3" = 15),
    labels = c("0" = "Noise / point outlier",
               "1" = "Main cluster (regular stores)",
               "2" = "Collective outlier cluster (luxury)",
               "3" = "Additional cluster"),
    name   = "DBSCAN cluster",
    drop   = FALSE
  ) +
  labs(
    x        = "Advertising Spend",
    y        = "Sales",
    title    = "DBSCAN Clustering: Finding Collective and Point Outliers",
    subtitle = "Cluster 0 = noise (point outliers); separate clusters = collective outlier groups"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Note: How to read the DBSCAN output — and the limits of inferred classification

What the clusters mean:

  • Cluster 0 (noise/red): Points too isolated to belong to any cluster. These are candidates for point outliers — individually anomalous observations. Check each one.
  • Cluster 1 (main mass): The main group of regular stores. This is your target population for most analyses.
  • Cluster 2+ (secondary clusters): Coherent groups separated from the main mass — collective outliers. These are a latent subgroup problem (Part 4), not a statistical noise problem.

Important: these are inferred, probabilistic classifications. DBSCAN does not know “the truth” about which stores are outliers any more than you do. It finds density-based patterns in the data, but:

  • The clusters depend on your choice of eps and minPts — different settings will produce different cluster assignments.
  • A store assigned to cluster 0 (noise) may simply be on the outskirts of a real cluster that DBSCAN missed with your parameter settings.
  • A store assigned to the main cluster may still be unusual in ways DBSCAN cannot see with just two variables.

The classifications are a starting point, not a verdict. Use them to flag observations for further investigation. Do not treat cluster assignment as definitive.

Choosing eps and minPts: There is no single correct answer. A practical approach: plot the k-nearest-neighbor distances (sorted) and look for an “elbow” — the point where the distance curve bends sharply. That elbow value is a good starting point for eps. For minPts, a common default is 2 × (number of dimensions in your data).
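The elbow heuristic can be sketched as follows — a rough example, assuming the `df5_scaled` data frame from Step 2. With two dimensions, the 2 × dimensions rule of thumb gives minPts = 4; `kNNdistplot()` is then typically called with k = minPts − 1 neighbours.

```r
# Sorted k-nearest-neighbour distances; look for the sharp bend ("elbow")
dbscan::kNNdistplot(df5_scaled, k = 3)

# Mark a candidate eps at the visual elbow (0.4 here is a judgment call)
abline(h = 0.4, lty = 2, col = "red")
```

Reading the plot: the flat left portion covers points sitting in dense regions; the steep right tail covers isolated points. A candidate eps sits where the curve bends.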

8.6 Step 4: Robust Regression — No Need to Delete Outliers

The traditional advice — delete the outliers and refit — creates its own problems: it introduces researcher degrees of freedom (which ones do you delete?), it loses information, and it’s post-hoc. A better approach is robust regression, which automatically downweights observations that don’t fit the main data pattern.

MASS::rlm() uses M-estimation: instead of minimising the sum of squared residuals (which gives disproportionate influence to large residuals), it minimises a loss function that grows more slowly in the tails, so extreme residuals contribute far less to the fit.

▶ OLS vs. robust regression (rlm) coefficient comparison
# ── Fit regression models ─────────────────────────────────────────────────────
m_ols_all <- lm(sales ~ advertising, data = df5)          # OLS with all data
m_robust  <- MASS::rlm(sales ~ advertising, data = df5)   # Robust on all data
# True DGP: sales = 0.4 * advertising (known from simulation — slope = 0.40)

# ── Compare coefficients against true DGP ────────────────────────────────────
coef_compare5 <- data.frame(
  Model = c("OLS — all stores (distorted by outliers)",
            "True DGP (known from simulation)",
            "Robust regression rlm — all stores"),
  `Advertising coefficient` = round(c(coef(m_ols_all)["advertising"],
                                      0.40,
                                      coef(m_robust)["advertising"]), 3)
)

kable(coef_compare5,
      col.names = c("Model", "Advertising Coefficient"),
      caption   = "OLS vs. Robust Regression: How Much Do Outliers Matter?")
OLS vs. Robust Regression: How Much Do Outliers Matter?

| Model                                    | Advertising Coefficient |
|------------------------------------------|------------------------:|
| OLS — all stores (distorted by outliers) |                   0.683 |
| True DGP (known from simulation)         |                   0.400 |
| Robust regression rlm — all stores       |                   0.602 |
▶ Plot: OLS vs. robust regression lines
# Generate prediction lines
x_seq <- seq(min(df5$advertising), max(df5$advertising), length.out = 300)
newdat <- data.frame(advertising = x_seq)

pred_lines <- data.frame(
  advertising = rep(x_seq, 3),
  sales       = c(predict(m_ols_all, newdata = newdat),
                  0.40 * x_seq,                          # True DGP: slope = 0.40
                  predict(m_robust,  newdata = newdat)),
  model       = rep(c("OLS — all stores (distorted)",
                      "True DGP (slope = 0.40)",
                      "Robust rlm — all stores"),
                    each = length(x_seq))
)

line_colors <- c(
  "OLS — all stores (distorted)" = "#d73027",
  "True DGP (slope = 0.40)"     = "#4575b4",
  "Robust rlm — all stores"     = "#1a9641"
)
line_types <- c(
  "OLS — all stores (distorted)" = "dotted",
  "True DGP (slope = 0.40)"     = "solid",
  "Robust rlm — all stores"     = "dashed"
)

ggplot() +
  geom_point(data = df5,
             aes(x = advertising, y = sales, colour = store_type),
             alpha = 0.45, size = 1.5) +
  geom_line(data = pred_lines,
            aes(x = advertising, y = sales, colour = model, linetype = model),
            linewidth = 1.2) +
  scale_colour_manual(
    values = c(
      "Regular"                      = "grey65",
      "Luxury flagship"              = "#e8998d",
      "Point outlier"                = "#a8c5a0",
      "OLS — all stores (distorted)" = "#d73027",
      "True DGP (slope = 0.40)"     = "#4575b4",
      "Robust rlm — all stores"     = "#1a9641"
    ),
    breaks = c("OLS — all stores (distorted)",
               "True DGP (slope = 0.40)",
               "Robust rlm — all stores"),
    name = "Regression line"
  ) +
  scale_linetype_manual(values = line_types, name = "Regression line") +
  guides(colour   = guide_legend(order = 1, override.aes = list(linewidth = 1.5)),
         linetype = guide_legend(order = 1)) +
  labs(
    x        = "Advertising Spend",
    y        = "Sales",
    title    = "OLS vs. Robust Regression: Effect of Outliers on Slope Estimates",
    subtitle = "Red dotted = OLS distorted by outliers  |  Blue solid = true DGP (slope 0.40)  |  Green dashed = robust rlm"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position  = "bottom",
        legend.direction = "horizontal",
        legend.key.width = unit(1.5, "cm"))

Important: The core lesson from robust regression

OLS (all stores): Pulled toward the luxury cluster and point outliers, producing a distorted slope well above the true value of 0.40. If you use this coefficient to predict sales for a regular store, you will be systematically wrong.

True DGP (slope = 0.40): The known data-generating process for regular stores — included in the plot because we simulated the data and know the ground truth. In real research you would not see this line.

Robust regression (rlm, all stores): Automatically downweights observations with large residuals (the outliers and luxury stores fit poorly under the regular-store model, so they receive low weights). The resulting coefficient is close to the true value of 0.40 — without having to decide in advance which observations to delete.

Practical advice:

  1. Always plot your data and check for visual outliers before running any regression.
  2. Use LOF and/or DBSCAN to systematically detect outliers you might miss visually.
  3. For collective outliers (like luxury stores), investigate whether they represent a distinct subgroup — if so, treat them as a Part 4 problem.
  4. For point outliers, verify the data and report results with and without them as a robustness check.
  5. Use robust regression as your primary estimator when you suspect outliers but are not certain which observations to remove.
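One way to see the downweighting in action is to inspect the final M-estimation weights that `rlm()` stores in its fitted object. A small sketch, assuming `m_robust` and `df5` from the chunk above (the `rlm_weight` column name is just a label added here):

```r
# Final robustness weights: near 1 = full influence, near 0 = heavily downweighted
df5$rlm_weight <- m_robust$w

df5 |>
  group_by(store_type) |>
  summarise(mean_weight = mean(rlm_weight),
            min_weight  = min(rlm_weight))
```

If the estimator is working as described, the point outliers (and, to a lesser extent, the luxury flagships) should show clearly lower average weights than the regular stores.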

8.7 Checking How Much Your Results Depend on Outliers

A simple sensitivity check: compare OLS and robust regression. If they give substantially different results, outliers are influencing your conclusions. If they agree, outliers aren’t driving your findings.

▶ Sensitivity table: OLS vs. rlm coefficients
# Extract and compare key summary stats
sens_df <- data.frame(
  Method = c("OLS (all stores)", "Robust rlm (all stores)"),
  `Intercept` = round(c(coef(m_ols_all)[1], coef(m_robust)[1]), 3),
  `Advertising slope` = round(c(coef(m_ols_all)[2], coef(m_robust)[2]), 3)
)

kable(sens_df,
      col.names = c("Method", "Intercept", "Advertising Slope"),
      caption   = "Sensitivity Check: OLS vs. Robust Regression Coefficients")
Sensitivity Check: OLS vs. Robust Regression Coefficients

| Method                  | Intercept | Advertising Slope |
|-------------------------|----------:|------------------:|
| OLS (all stores)        |    -1.272 |             0.683 |
| Robust rlm (all stores) |    -0.977 |             0.602 |
Note: Reporting guidance

When you run a robust regression as a sensitivity check and it tells a different story from OLS:

  • Report both estimates in your paper
  • Investigate which observations are driving the OLS estimates (use LOF scores, DBSCAN clusters, or Cook’s distance)
  • Describe these observations substantively: are they a separate population? A data quality issue?
  • Be transparent about the sensitivity of your conclusions to these observations

When OLS and robust regression agree: briefly note the robustness check as evidence that your results are not driven by a small number of influential observations.

8.8 Other Methods for Outlier Detection and Robust Estimation

LOF + DBSCAN and robust regression are practical entry points. The broader toolkit:

  • Mahalanobis distance: A multivariate distance metric that accounts for the covariances among variables. Points far from the centroid in Mahalanobis distance space are multivariate outliers. Computationally simple; assumes roughly elliptical distributions.
  • Cook’s distance: In regression, measures how much each observation influences the fitted values. Observations with Cook’s D > 4/n are conventionally flagged as influential. Complementary to LOF (which doesn’t know about the regression).
  • Minimum Covariance Determinant (MCD) — robustbase / rrcov packages: A robust covariance estimator that finds the subset of observations whose covariance matrix has the smallest determinant, effectively excluding outliers from the covariance estimation. Use it to get robust Mahalanobis distances.
  • Isolation Forest (isotree package in R): A tree-based outlier detection method. Anomalies are easier to isolate than normal points, so they end up with shorter average path lengths in the trees. Scales well to large datasets. No distance computation needed.
  • One-class SVM: Learns the boundary of “normal” observations and flags points outside it. Useful when you have a large clean training set and want to detect anomalies in new data.
  • Quantile regression (quantreg package): Instead of modeling the mean, model multiple quantiles of Y. Outliers that pull the mean line don’t affect quantile regression as severely. Useful when you care about the full distribution, not just the central tendency.
  • Sandwich standard errors (sandwich package): If heteroskedasticity (the fan pattern from Part 2) is due to outliers or model misspecification, heteroskedasticity-consistent (HC) standard errors give valid inference without reweighting observations.
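As a rough illustration of the first two items, both Cook’s distance and Mahalanobis distance are available in base R. The sketch below assumes `df5` and the `m_ols_all` model from earlier in this part; the cutoffs shown (4/n and the 99.9% chi-squared quantile) are conventions, not laws.

```r
# Cook's distance: influence of each observation on the OLS fit
cook_d  <- cooks.distance(m_ols_all)
flag_cd <- which(cook_d > 4 / nrow(df5))        # conventional 4/n cutoff

# Classical (non-robust) Mahalanobis distance on the two variables
X       <- df5[, c("advertising", "sales")]
md2     <- mahalanobis(X, colMeans(X), cov(X))
flag_md <- which(md2 > qchisq(0.999, df = 2))   # chi-squared cutoff, df = #vars

# Which (true) store types do the two methods flag?
table(df5$store_type[flag_cd])
table(df5$store_type[flag_md])
```

Note that the classical Mahalanobis distance is itself distorted by the very outliers it is meant to find (they inflate the covariance estimate); that masking problem is exactly what MCD-based robust distances address.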