Placebo tests deserve a model, not just a glance

Using Bayesian multilevel models to correct bias and calibrate uncertainty
Author

Miha Gazvoda

Published

March 2, 2026

Code
import polars as pl
import numpy as np
import pymc as pm
import arviz as az
import plotnine as p9
p9.theme_set(p9.theme_bw())

rng = np.random.default_rng(42)
Z_CI90 = 1.645
METRICS = ['mae', 'bias', 'coverage']
APPROACHES = ['raw', 'posterior']

POSTERIOR_STATS_SCHEMA = pl.Struct([
    pl.Field(name, pl.Float64) for name in ["posterior_mean", "posterior_sd", "posterior_5pct", "posterior_95pct"]
])

SIMULATION_CONFIG = {
    'mu_b_values': [0, 1],
    'sigma_b_values': [0, 1],
    's': 1,
    'theta': 2,
    'n_tests': 20,
    'n_replicates': 100,
}


def fit_bias_adjusted_model(y_obs, s, intervention):
    n = len(y_obs)
    with pm.Model():
        mu_b = pm.Normal("mu_b", 0, 2)
        sigma_b = pm.Exponential('sigma_b', 0.5)
        theta = pm.Normal('theta', 0, 3)

        z = pm.Normal('z', mu=0, sigma=1, shape=n) 
        b = pm.Deterministic('b', mu_b + sigma_b * z)
        y = pm.Normal("y", mu=b + theta * intervention, sigma=s, observed=y_obs)
    
        return pm.sample(target_accept=0.99, progressbar=False)


def extract_theta_posterior_stats(idata):
    summary = az.summary(idata, var_names=['theta'], hdi_prob=0.9, kind='stats')
    summary = summary.rename(columns={
        'mean': 'posterior_mean', 
        'sd': 'posterior_sd', 
        'hdi_5%': 'posterior_5pct', 
        'hdi_95%': 'posterior_95pct'
    })
    return summary.iloc[0].to_dict()


def fit_model_from_row(row):
    return fit_bias_adjusted_model(
        y_obs=np.array(row['y']),
        s=np.array(row['s']),
        intervention=np.array(row['intervention'], dtype='int')
    )


def add_raw_estimate_columns(df):
    return df.with_columns(
        raw_mean=pl.col('y'),
        raw_sd=pl.col('s'),
        raw_5pct=pl.col('y') - Z_CI90 * pl.col('s'),
        raw_95pct=pl.col('y') + Z_CI90 * pl.col('s')
    )


def get_metric_expressions(approach: str):
    mean_col = f'{approach}_mean'
    pct_5_col = f'{approach}_5pct'
    pct_95_col = f'{approach}_95pct'
    
    return [
        (pl.col(mean_col) - pl.col('theta')).abs().alias(f'mae_{approach}'),
        (pl.col(mean_col) - pl.col('theta')).alias(f'bias_{approach}'),
        pl.col('theta').is_between(pl.col(pct_5_col), pl.col(pct_95_col)).alias(f'coverage_{approach}')
    ]


def format_facet_label(col_name: str):
    return pl.concat_str(
        [pl.lit(f'{col_name}:'), pl.col(col_name).cast(pl.Utf8)], 
        separator=" "
    ).alias(f"{col_name}_str")


def create_metric_plot(df, metric):
    df_metric = df.filter(pl.col('metric') == metric)
    return (
        p9.ggplot(df_metric, p9.aes(
            'approach', 'mean', 
            ymin=f'mean-{Z_CI90}*se', 
            ymax=f'mean+{Z_CI90}*se', 
            color='approach', 
            fill='approach'
        ))
        + p9.geom_point(alpha=0.5)
        + p9.geom_errorbar(position=p9.position_dodge(width=0.1))
        + p9.facet_grid('sigma_b_str', 'mu_b_str')
        + p9.labs(title=metric)
    )


# Create parameter grid
df_setup = (
    pl.DataFrame({
        'mu_b': [SIMULATION_CONFIG['mu_b_values']],
        'sigma_b': [SIMULATION_CONFIG['sigma_b_values']],
        's': [SIMULATION_CONFIG['s']],
        'theta': [SIMULATION_CONFIG['theta']],
        'n_tests': [SIMULATION_CONFIG['n_tests']],
        'n_replicates': [SIMULATION_CONFIG['n_replicates']]
    })
    .explode('mu_b')
    .explode('sigma_b')
    .with_row_index(name="setup_id")
)

# Expand to individual tests across replicates
df_tests = (
    df_setup
    .with_columns(
        pl.int_ranges(end=pl.col('n_replicates')).alias('replicate_id'),
        pl.int_ranges(end=pl.col('n_tests')).alias('test_id')
    )
    .explode('replicate_id')
    .explode('test_id')
)

# Simulate bias and observed values
df_simulated = (
    df_tests
    .with_columns(intervention=pl.col('test_id') == 0)
    .pipe(lambda d: d.with_columns(b=rng.normal(d['mu_b'], d['sigma_b'])))
    .pipe(lambda d: d.with_columns(y=rng.normal(d['b'] + d['intervention'] * d['theta'], d['s'])))
)

# Fit models per replicate
df_fitted = (
    df_simulated
    .group_by(['setup_id', 'replicate_id'])
    .agg(pl.col('y'), pl.col('intervention'), pl.col('b'), pl.col('s'))
    .with_columns(
        idata=pl.struct(['y', 'intervention', 's']).map_elements(fit_model_from_row, return_dtype=pl.Object)
    )
)
df_with_posteriors = (
    df_fitted
    .with_columns(
        pl.col('idata').map_elements(extract_theta_posterior_stats, return_dtype=POSTERIOR_STATS_SCHEMA).alias('theta_stats')
    )
    .unnest('theta_stats')
    .drop('idata')
)

# Extract treatment-only results
df_treatment_only = (
    df_with_posteriors
    .explode(['y', 'intervention', 'b'])
    .filter(pl.col('intervention'))
)

# Compute all metrics
df_evaluated = (
    df_setup
    .join(df_treatment_only, on=['setup_id'], how='left')
    .pipe(add_raw_estimate_columns)
    .with_columns(*[col for a in APPROACHES for col in get_metric_expressions(a)])
)

df_metrics_summary = (
    df_evaluated
    .unpivot(
        index=['mu_b', 'sigma_b'],
        on=[f"{m}_{a}" for m in METRICS for a in APPROACHES],
        variable_name='methodology',
        value_name='value'
    )
    .with_columns(
        metric=pl.col("methodology").str.split("_").list.get(0),
        approach=pl.col("methodology").str.split("_").list.get(1),
    )
    .group_by(['mu_b', 'sigma_b', 'metric', 'approach'])
    .agg(
        mean=pl.col('value').mean(),
        se=pl.col('value').std() / pl.len().sqrt(),
        n=pl.len()
    )
    .with_columns(format_facet_label('mu_b'), format_facet_label('sigma_b'))
)

# Generate plots
plots = {metric: create_metric_plot(df_metrics_summary, metric) for metric in METRICS}

Placebo tests are a common robustness check across causal inference: event studies, difference-in-differences, regression discontinuity, and more. You apply your method in a setting where the true effect should be zero and check whether you find one. If you detect an effect where none should exist, something is wrong with your assumptions (or you got unlucky). But in practice, these tests are usually interpreted as a pass/fail diagnostic — if nothing looks too large, move on.

It turns out that placebo tests contain distributional information that can directly improve your treatment estimate. By modeling the distribution of placebo effects, you can correct for systematic bias, properly calibrate your uncertainty, and get honest coverage. We follow the methodology of Gelman and Vákár (2021), who applied it in a setting with repeated sham and actual treatments.

In many setups you can run a handful of placebo versions of the same analysis alongside the real treatment. Each run gives a point estimate \(y_i\) and standard error \(s_i\). A typical result looks like the figure below. In this post, we show how to use the information from placebo tests to improve the treatment effect estimate.

Code
df_example = (
    df_simulated
    .filter((pl.col('setup_id') == 3) & (pl.col('replicate_id') == 0))
    .pipe(add_raw_estimate_columns)
    .with_columns(
        label=pl.when(pl.col('intervention'))
            .then(pl.lit('treatment'))
            .otherwise(pl.lit('placebo')),
    )
    .sort('intervention', descending=True)
    .with_row_index('order')
)

(
    p9.ggplot(df_example, p9.aes(x='order', y='y', color='label'))
    + p9.geom_point()
    + p9.geom_errorbar(p9.aes(ymin='raw_5pct', ymax='raw_95pct'), width=0.3)
    + p9.geom_hline(yintercept=0, linetype='dashed', alpha=0.5)
    + p9.labs(x='Test', y='Estimate', color='')
    + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
)

Modeling

Each placebo test returns an estimate \(y_i\) that should be zero. Any nonzero result reflects a combination of sampling noise and bias (from misspecification, residual confounding, etc.). Repeating placebo tests lets us learn the bias distribution.

Suppose that we conduct \(i=1,\dots,I\) tests, each yielding estimate \(y_i\) and standard error \(s_i\). We consider the following model:

\[\begin{align*} b_i &\sim \text{Normal}(\mu^b,\, \sigma^b) \\ y_i &\sim \text{Normal}(\theta I_i + b_i,\, s_i) \end{align*}\]

Here,

  • \(\theta\) is the treatment effect of interest (for simplicity we assume a single treated test; multiple \(\theta\)’s can be modeled hierarchically; see Gelman and Vákár (2021)).
  • \(b_i\) is the test-specific bias (present in both placebo and treatment tests).
  • \(I_i\) is an indicator variable: 1 for the real treatment, 0 for placebo.

We assume that \(b_i\) comes from a normal distribution with mean \(\mu^b\) and standard deviation \(\sigma^b\). If placebo effects are consistent with pure sampling noise, the posterior will concentrate near \(\mu^b \approx 0\) and \(\sigma^b \approx 0\).

We assume that the estimates and standard errors are given, with sample sizes large enough to justify normal likelihoods with known variances. If necessary, one could also model the raw data directly, or replace the normal likelihood with a \(t\) distribution to model outlier placebo effects.

We fit the model in PyMC using Bayesian inference, which naturally accounts for uncertainty in the hyperparameters and makes extensions straightforward. We use weakly informative priors, though informative priors could be used when such information is available.

Why the adjustment works

We can gain intuition by deriving the posterior analytically. For the treatment observation, we can integrate out \(b_i\) — averaging over all possible bias values — to get:

\[y \sim \text{Normal}(\theta + \mu^b,\; \sqrt{s^2 + (\sigma^b)^2})\]

So treating \(\mu^b\) and \(\sigma^b\) as known (in the full model they are estimated from the placebo tests), with a flat prior on \(\theta\), the posterior is:

\[\theta \mid y \sim \text{Normal}\!\Big(y - \mu^b,\; \sqrt{s^2 + (\sigma^b)^2}\Big)\]

The intuition: the bias-adjusted estimate shifts the observed effect by the estimated mean bias \(\mu^b\), and the uncertainty is inflated by \(\sigma^b\) to account for the variation in biases across tests. When \(\mu^b = 0\) and \(\sigma^b = 0\) — meaning placebo tests reveal no bias — we recover the raw estimate exactly.
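A quick numeric check of this closed form, with illustrative values for \(y\), \(s\), \(\mu^b\), and \(\sigma^b\) (treated as known here; in the full model they are estimated from the placebo tests):

```python
import numpy as np

# Illustrative observed treatment estimate and its standard error
y, s = 3.0, 1.0
# Illustrative bias parameters, treated as known for this check
mu_b, sigma_b = 1.0, 1.0

theta_mean = y - mu_b                  # shift by the mean bias
theta_sd = np.sqrt(s**2 + sigma_b**2)  # inflate by the bias variability

print(theta_mean, round(theta_sd, 2))  # 2.0 1.41
```

A raw 90% interval would be \(3.0 \pm 1.645\); the adjusted one is \(2.0 \pm 1.645\sqrt{2}\), both recentered and wider.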

This gives us two distinct channels through which placebo tests improve inference:

  1. Bias correction: If \(\mu^b \neq 0\), the raw estimate is systematically shifted. The adjustment subtracts this out.
  2. Uncertainty calibration: If \(\sigma^b > 0\), the raw confidence interval is too narrow because it ignores the additional variability from bias. The adjustment widens it.

The raw estimate is only correct if there’s no bias (\(\mu^b = 0\), \(\sigma^b = 0\)). If that’s not true, you either have a biased estimate, underestimate its uncertainty, or both.

Simulation

To verify these claims, we run a simulation study. We generate data under four scenarios by crossing \(\mu^b \in \{0, 1\}\) and \(\sigma^b \in \{0, 1\}\), with true treatment effect \(\theta = 2\), standard error \(s = 1\), and 20 tests per replicate (1 treatment + 19 placebo). For each scenario, we run 100 replicates and compare the raw treatment estimate against the posterior from the bias-adjusted model. The priors are \(\mu^b \sim \text{Normal}(0, 2)\), \(\sigma^b \sim \text{Exponential}(0.5)\), and \(\theta \sim \text{Normal}(0, 3)\). We compare on three metrics:

  • MAE (mean absolute error): How far is the point estimate from the truth?
  • Bias: Is the estimate systematically too high or too low?
  • Coverage: How often does the 90% interval contain the true \(\theta\)?
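Given point estimates and 90% intervals from a set of replicates, the three metrics reduce to simple averages. A minimal NumPy sketch with toy numbers (not taken from the simulation):

```python
import numpy as np

theta_true = 2.0
# Toy point estimates from five hypothetical replicates, each with a
# 90% interval of half-width 1.645 (i.e., standard error 1)
est = np.array([2.1, 1.8, 2.5, 1.9, 2.2])
lo, hi = est - 1.645, est + 1.645

mae = np.mean(np.abs(est - theta_true))                  # average distance
bias = np.mean(est - theta_true)                         # average signed error
coverage = np.mean((lo <= theta_true) & (theta_true <= hi))  # hit rate

print(round(mae, 2), round(bias, 2), coverage)  # 0.22 0.1 1.0
```

These are the same quantities that `get_metric_expressions` computes as polars expressions over the full simulation grid.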

Results

The plots below show the mean and 90% confidence interval of each metric across replicates, faceted by \(\mu^b\) (columns) and \(\sigma^b\) (rows).

Code
plots['mae']

When there is no bias (\(\mu^b = 0\), \(\sigma^b = 0\)), both approaches perform similarly — as expected, since there is nothing to correct. But as soon as bias is present, the posterior estimate dominates. With systematic bias (\(\mu^b = 1\)), the raw estimate is shifted away from the truth, while the model corrects for it. With variable bias (\(\sigma^b = 1\)), the raw MAE increases due to the additional noise that the raw approach ignores.

Code
plots['bias']

When \(\mu^b = 1\), the raw estimate is biased upward by approximately 1 unit, while the posterior successfully removes this shift. When \(\mu^b = 0\), both approaches are approximately unbiased regardless of \(\sigma^b\). Note that the posterior shows a small negative bias across all scenarios: because the prior on \(\theta\) is centered at 0 while the true effect is 2, the prior pulls the estimate slightly downward.
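The size of this prior pull can be gauged with conjugate-normal arithmetic, treating the bias-adjusted estimate and its spread as known (the numbers here are illustrative assumptions, not the simulation's posterior):

```python
# Precision-weighted average of prior and (bias-adjusted) data: the
# prior slightly shrinks the estimate toward its mean of 0.
prior_mean, prior_var = 0.0, 3.0 ** 2  # theta ~ Normal(0, 3)
y_adj, var = 2.0, 2.0                  # adjusted estimate, s^2 + (sigma_b)^2

w_prior, w_data = 1 / prior_var, 1 / var
post_mean = (w_prior * prior_mean + w_data * y_adj) / (w_prior + w_data)

print(round(post_mean, 2))  # 1.64
```

With \(\sigma^b = 0\) the data variance shrinks, the data weight grows, and the pull toward 0 weakens accordingly.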

Code
plots['coverage']

The raw 90% interval achieves nominal coverage only in the no-bias scenario. With \(\sigma^b = 1\), coverage drops because the raw interval is too narrow — it does not account for the additional uncertainty from variable bias. With \(\mu^b = 1\), coverage drops further because the interval is also centered in the wrong place. The posterior maintains close to nominal coverage across all scenarios because it estimates both \(\mu^b\) and \(\sigma^b\) from the placebo tests and incorporates them into the interval. It shows slight overcoverage due to the wide priors, which add extra uncertainty.

Back to the example

To make this concrete, let’s return to the opening example (true \(\theta\) = 2):

Approach                    Estimate   90% CI
--------------------------  ---------  ------------
Raw                         3.0        [1.4, 4.7]
Posterior (bias-adjusted)   1.6        [-0.3, 3.8]

The true bias parameters in this example are \(\mu^b = 1\) and \(\sigma^b = 1\). The model estimated a bias mean \(\hat{\mu}^b\) = 1.0 (90% CI: [0.5, 1.5]) and bias SD \(\hat{\sigma}^b\) = 0.8 (90% CI: [0.2, 1.4]). The remaining slight downward pull on the estimate comes from the prior on \(\theta\), which is centered at 0.

The raw estimate absorbed the systematic bias visible in the placebo tests, while the model estimated it and subtracted it out — pulling the estimate much closer to the truth.

Discussion

The key takeaway is that placebo tests contain far more information than a binary “pass/fail” eyeballing check. By modeling the distribution of placebo effects, we can:

  • Correct for systematic bias in the treatment estimate.
  • Properly calibrate uncertainty to account for sources of variation that the raw analysis ignores, yielding honest coverage of confidence intervals.
  • Use prior information about the treatment effect or bias.

The approach is straightforward to implement — a standard multilevel model that can be fit in PyMC or Stan in seconds. The main assumptions are that biases are exchangeable across tests and normally distributed, both of which can be relaxed if needed.

In practice, this is most useful when you have a reasonable number of placebo tests (say, 10 or more), though in principle the model works even with none: the priors alone then determine the adjustment. If your placebo tests all look like pure noise, the model will learn \(\mu^b \approx 0\) and \(\sigma^b \approx 0\), and you’ll recover the raw estimate — so there’s little downside to running the adjustment.

One natural extension is to apply partial pooling across multiple treatment effects if you run similar interventions repeatedly — especially if they are comparable. This connects to the broader idea from Gelman and Vákár (2021) (or, without modeling bias, Gelman, Hill, and Yajima (2012)) of using the hierarchical structure to improve all your estimates simultaneously.

So next time you run placebo tests, don’t eyeball them — model them.

References

Gelman, Andrew, Jennifer Hill, and Masanao Yajima. 2012. “Why We (Usually) Don’t Have to Worry about Multiple Comparisons.” Journal of Research on Educational Effectiveness 5 (2): 189–211.
Gelman, Andrew, and Matthijs Vákár. 2021. “Slamming the Sham: A Bayesian Model for Adaptive Adjustment with Noisy Control Data.” Statistics in Medicine 40 (15): 3403–24.