I’ve written a personal-view article, “A Reality Check on NHST in A/B Testing”, appearing in Eppo by Datadog’s Outperform magazine (print).
I highlight how NHST’s effectiveness is often explained using an overly simplistic scenario in which effects are either “null” or “true”. Instead, I suggest fitting a multilevel model to past experiments to estimate the distribution of unobserved true effects.
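As a minimal sketch of that step, assuming the simplest multilevel model (a normal-normal hierarchy with known per-experiment standard errors) and made-up experiment data, the mean and spread of the true effects can be estimated by maximizing the marginal likelihood:

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical archive of past experiments: estimated lifts and their
# standard errors (replace with your own data).
y = np.array([0.021, -0.003, 0.008, 0.015, -0.010, 0.002, 0.030, 0.001])
se = np.array([0.010, 0.008, 0.012, 0.009, 0.011, 0.007, 0.015, 0.010])

def neg_marginal_loglik(params):
    # Normal-normal model: theta_i ~ N(mu, tau^2), y_i ~ N(theta_i, se_i^2),
    # so marginally y_i ~ N(mu, se_i^2 + tau^2).
    mu, log_tau = params
    sd = np.sqrt(se**2 + np.exp(log_tau) ** 2)
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sd))

res = optimize.minimize(neg_marginal_loglik, x0=[0.0, np.log(0.01)])
mu_hat, tau_hat = res.x[0], np.exp(res.x[1])
print(f"estimated mean of true effects: {mu_hat:.4f}")
print(f"estimated sd of true effects:   {tau_hat:.4f}")
```

In practice you might fit this with a full Bayesian multilevel package instead, but the estimated spread of true effects (here `tau_hat`) is the key quantity either way.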
This reveals that, under assumptions that likely hold in many companies, NHST with conventional error rates:
- Sacrifices business impact by focusing excessively on minimizing potential losses at the expense of potential gains.
- Distorts learning due to low power against the “true” effects (not necessarily against the minimum detectable effects!). This results in exaggerated significant effect estimates, poor coverage, and low replication rates, though the estimated direction typically remains correct (see the simulation sketch after this list).
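To make the second point concrete, here is a small simulation sketch. All numbers are hypothetical; the key assumption is that true effects are drawn from a distribution whose spread is comparable to a typical standard error, the kind of regime a multilevel fit can reveal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, se = 200_000, 0.01, 0.01        # assumed: sd of true effects ~ typical SE

theta = rng.normal(0.0, tau, n)         # unobserved true effects
y = theta + rng.normal(0.0, se, n)      # experiment estimates
y_rep = theta + rng.normal(0.0, se, n)  # an identical replication run

sig = np.abs(y) > 1.96 * se             # conventional two-sided test, alpha = 0.05
# How much significant estimates overstate the true effects, on average:
exaggeration = np.abs(y[sig]).mean() / np.abs(theta[sig]).mean()
sign_err = (np.sign(y[sig]) != np.sign(theta[sig])).mean()
coverage = (np.abs(y[sig] - theta[sig]) < 1.96 * se).mean()  # CI coverage | sig
replication = (np.abs(y_rep[sig]) > 1.96 * se).mean()

print(f"share significant:                 {sig.mean():.1%}")
print(f"exaggeration among significant:    {exaggeration:.1f}x")
print(f"sign errors among significant:     {sign_err:.1%}")
print(f"95% CI coverage among significant: {coverage:.1%}")
print(f"replication rate:                  {replication:.1%}")
```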
To address these issues, a useful alternative I see is applying a shrinkage estimator based on past experiments (and experiment-specific details), combined with explicit cost-benefit analysis. As Andrew Gelman puts it, let’s make decisions in real-world units: dollars, customers.
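Here is a sketch of what that could look like for one new experiment, again assuming the normal-normal model (under which the shrinkage estimate is a precision-weighted average of the prior mean and the raw estimate) and entirely made-up dollar figures for the cost-benefit step:

```python
mu, tau = 0.002, 0.010        # hyperparameters from the multilevel fit above
y_new, se_new = 0.025, 0.012  # hypothetical new experiment: raw lift and SE

# Shrinkage (posterior mean under the normal-normal model): a
# precision-weighted average of the prior mean and the raw estimate.
post_prec = 1 / tau**2 + 1 / se_new**2
post_mean = (mu / tau**2 + y_new / se_new**2) / post_prec
post_sd = post_prec ** -0.5

# Explicit cost-benefit in real-world units (all figures assumed):
# expected dollar gain from the shrunk lift estimate vs. the cost to ship.
dollars_per_lift_point = 50_000  # $ per percentage point of lift
ship_cost = 20_000               # one-off cost of shipping the change
expected_value = post_mean * 100 * dollars_per_lift_point - ship_cost

print(f"raw estimate:    {y_new:.4f}")
print(f"shrunk estimate: {post_mean:.4f} (sd {post_sd:.4f})")
print(f"expected value of shipping: ${expected_value:,.0f}")
```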
Performing a similar analysis on your own experiments can help you explore the trade-offs in your decisions beyond Type I and Type II errors.