Omitting a Relevant Variable
Statistical Methods for Causal Inference and Prediction (TPM039A)
Introduction
This lecture note simulates the consequences of omitting a relevant variable from the regression model. We specify a data generating process (DGP) and run two types of regressions on many samples drawn from the DGP. The first regression omits a relevant variable, while the second regression includes it. We visualize and compare the sampling distributions of the slope estimators from both types of regressions.
Refer to Wooldridge (2025, Chapter 3) for an explanation of the problem of omitted variable bias.
We re-purpose the code used in an earlier lecture note.
Setup
Clean the environment and load the required packages (the seed for the random number generator is set further below, right before we draw the samples):
rm(list = ls())    # Clear the workspace
library(tidyverse) # For tibble, ggplot2, etc.
library(broom)     # Loaded just for the tidy() command
Simulate Data
As in previous lecture notes, we write a function to generate a tibble with \(n\) data points. The function is named gen_exog().
# Function to generate sample of size n
gen_exog <- function(n = 100) {
sampl <- tibble(
z = rnorm(n), # Covariate
x = rnorm(n), # Treatment, exposure, main x-variable
v = rnorm(n), # Error term
y = 10 + 2*x + z + v # Population model for outcome variable
)
return(sampl)
}
The data generating process (DGP) is as follows:
z, a covariate that influences y, is normally distributed.
x is the treatment variable that also influences y. It is normally distributed.
v is a normally distributed error term.
y is the outcome variable. The population model does not treat x in any special way – x is just one of the determinants of y. But we, as researchers, treat x as the main variable of interest. We want to estimate its effect on y.
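To see what the function returns, we can call it once with a small sample size (a quick preview we add here; the exact numbers will differ from run to run because the seed is set only in the next code chunk):
# Preview: one small sample of 5 observations drawn from the DGP
gen_exog(n = 5)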
We generate r samples of size n each. Each sample is stored as an element of a new list.
# Generate r samples of n data points, calling gen_exog()
mydata <- list() # Initialize list
r <- 1000 # Number of samples
n <- 50 # Sample size
set.seed(12345) # Set seed for reproducibility
# Loop from 1 to r, storing r samples
for(j in 1:r) {
mydata[[j]] <- gen_exog(n)
}
Run Regressions
We run two types of regressions on each of the r samples. We run a simple regression of y on x only, omitting the variable z. We also run a multiple regression of y on both x and z. We store the regression results in two separate lists.
# Loops, storing simple and multiple regressions in separate lists
myregs_simpl <- list() # Initialize list
myregs_multi <- list() # Initialize list
# Loop from 1 to r, storing r simple regressions
for(j in 1:r) {
myregs_simpl[[j]] <- tidy(lm(y ~ x, data = mydata[[j]]))
}
# Loop from 1 to r, storing r multiple regressions
for(j in 1:r) {
myregs_multi[[j]] <- tidy(lm(y ~ x + z, data = mydata[[j]]))
}
We show the regression results for the first sample only:
myregs_simpl[[1]]
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 10.2 0.247 41.1 4.31e-39
2 x 2.03 0.211 9.65 8.13e-13
myregs_multi[[1]]
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9.95 0.169 58.8 1.14e-45
2 x 2.04 0.143 14.3 1.08e-18
3 z 1.13 0.148 7.60 1.02e- 9
For now, we do not interpret the output, but simply conclude that the code does what it is supposed to do. Note that the tibble holding the multiple regression results contains an extra row for z, storing the slope coefficient of the covariate that we omitted in the simple regression.
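Because tidy() returns an ordinary tibble, the row for x can be pulled out with a filter() call; this is the same idea we use below when analyzing all r regressions at once. A small illustration (not part of the original analysis):
# Extract only the slope coefficient on x from the first multiple regression
filter(myregs_multi[[1]], term == "x")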
We process the regression output further to make it easier to analyze later.
# Convert lists to long tibbles
myregs_simpl <- bind_rows(myregs_simpl, .id = "sim")
myregs_multi <- bind_rows(myregs_multi, .id = "sim")
# Combine tibbles
myregs <- bind_rows(simpl = myregs_simpl, multi = myregs_multi, .id = "regtype")
# Make simulation number numeric
myregs$sim <- as.numeric(myregs$sim)
Analyze Regression Results
We are interested in the estimates of the slope coefficient on x in both types of regressions. We create a density plot to visualize their distributions. The dashed vertical line indicates the true population parameter:
# Create density plot of the slope estimates
ggplot(filter(myregs, term == "x"), aes(x = estimate, fill = regtype)) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of slope coefficient estimates",
x = "Estimated slope coefficient",
y = "Density") +
geom_vline(xintercept = 2, linetype = "dashed", color = "red") 
In case the central tendency of the estimators is not obvious, we also calculate the mean of the estimates over all r samples for both types of regressions:
# Calculate mean of estimates
myregs |>
filter(term == "x") |>
group_by(regtype) |>
summarize(mean_estimate = mean(estimate))
# A tibble: 2 × 2
regtype mean_estimate
<chr> <dbl>
1 multi 1.99
2 simpl 1.99
For both types of regression, the mean of the estimates is very close to the true population value of 2.
z is a relevant variable that influences y. Yet omitting z does not bias the estimator of the effect of x on y: both types of regression yield, on average, correct estimates of the effect of x on y.
Why does simple regression work? Because x and z are uncorrelated. The omitted variable z does not confound the effect of x on y. The true model, y = 10 + 2*x + z + v, can be written as y = 10 + 2*x + u, where u = z + v is a new error term. Since x and z are uncorrelated, x and u are also uncorrelated. Thus, the zero-conditional-mean assumption is satisfied, and the simple OLS estimator is unbiased.
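We can verify this reasoning in the simulated data. The check below (our addition) uses the first sample to confirm that x is essentially uncorrelated with z and with the combined error term u = z + v:
# Exogenous DGP: x is (up to sampling noise) uncorrelated with z and with u = z + v
with(mydata[[1]], cor(x, z))     # close to zero
with(mydata[[1]], cor(x, z + v)) # close to zero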
In the above DGP, x was an exogenous variable. We now simulate a different DGP where x is endogenous.
A Different DGP with Endogenous x
We now change the DGP such that x and z are correlated. We do this by generating x as a function of z plus a normally distributed error term.
# New function to generate sample of size n
gen_endo <- function(n = 100) {
sampl <- tibble(
z = rnorm(n), # Covariate
x = z + rnorm(n), # Only this part of the DGP has changed
v = rnorm(n), # Error term
y = 10 + 2*x + z + v # Population model for outcome variable
)
return(sampl)
}
Compare this DGP to the previous one. The only difference is that x is now generated as z + rnorm(), i.e., x depends on z.
We create new samples using the new DGP:
# Generate r samples of n data points, this time calling gen_endo()
mydata <- list() # Initialize list
# Values for r and n remain the same, as above
# Loop from 1 to r, storing r samples
for(j in 1:r) {
mydata[[j]] <- gen_endo(n)
}
The rest of the code remains unchanged. We run the same two types of regressions on each of the r samples, again visualize the sampling distributions of the slope estimators in both types of regressions, and calculate the mean of the estimates over all r samples, as sketched below.
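For reference, here is a sketch of those repeated steps, identical in structure to the code above and applied to the new samples; the first line additionally confirms that x and z are now strongly correlated:
# Under the endogenous DGP, x and z are strongly correlated (population value 1/sqrt(2) ≈ 0.71)
with(mydata[[1]], cor(x, z))
# Re-run the same regressions on the new samples
myregs_simpl <- list()
myregs_multi <- list()
for(j in 1:r) {
  myregs_simpl[[j]] <- tidy(lm(y ~ x, data = mydata[[j]]))
  myregs_multi[[j]] <- tidy(lm(y ~ x + z, data = mydata[[j]]))
}
# Combine into one long tibble, as before
myregs <- bind_rows(simpl = bind_rows(myregs_simpl, .id = "sim"),
                    multi = bind_rows(myregs_multi, .id = "sim"),
                    .id = "regtype")
myregs$sim <- as.numeric(myregs$sim)
# Density plot of the slope estimates
ggplot(filter(myregs, term == "x"), aes(x = estimate, fill = regtype)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 2, linetype = "dashed", color = "red") +
  labs(title = "Distribution of slope coefficient estimates",
       x = "Estimated slope coefficient",
       y = "Density")
# Mean of the slope estimates by regression type
myregs |>
  filter(term == "x") |>
  group_by(regtype) |>
  summarize(mean_estimate = mean(estimate))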

# A tibble: 2 × 2
regtype mean_estimate
<chr> <dbl>
1 multi 2.00
2 simpl 2.50
The results are very different this time. The mean of the estimates from the simple regression does not equal the true population value of 2. The simple regression estimator is biased. The mean of the estimates from the multiple regression is still very close to the true population value of 2. The multiple regression estimator is still unbiased.
Why does simple regression not work this time? Because x and z are correlated. The omitted variable z confounds the effect of x on y. The true model is y = 10 + 2*x + z + v. Omitting z means that z is part of the error term. In the model y = 10 + 2*x + u, the error term u is correlated with x. Thus, the zero-conditional-mean assumption is violated, and the simple OLS estimator is biased. In contrast, the multiple regression includes z as a control variable. When z is included, x and the error term are not correlated, the zero-conditional-mean assumption is satisfied, and the OLS estimator is unbiased.
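The size of the bias matches the omitted variable bias formula in Wooldridge (2025, Chapter 3): the bias equals the coefficient on the omitted variable (here 1) times the slope from regressing z on x, which is Cov(x, z)/Var(x) = 1/2 under this DGP. The simple-regression slope therefore centers on 2 + 0.5 = 2.5, just as in the simulation. A quick numerical check (a sketch we add here; the object name big and the sample size are arbitrary):
# Approximate the large-sample value of the simple-regression slope under the endogenous DGP
big <- gen_endo(100000)                 # one very large sample
coef(lm(y ~ x, data = big))["x"]        # close to 2.5
2 + 1 * cov(big$x, big$z) / var(big$x)  # beta_x + beta_z * Cov(x,z)/Var(x), also close to 2.5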