Most CRO wins are statistical noise
The typical A/B test in ecommerce runs for two weeks on insufficient traffic and declares a winner. The result is indistinguishable from chance.
A CRO team reports a 12% conversion lift from a button colour change. The test ran for 11 days. Sample size was 4,200 sessions.
In most ecommerce contexts, that result is meaningless.
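A back-of-the-envelope power calculation shows why. The sketch below assumes a 3% baseline conversion rate, an even 50/50 split, and an 80% power target, none of which is stated in the example; under those assumptions, the smallest lift a 4,200-session test can reliably detect is several times larger than the one being claimed.

```python
# Rough minimum-detectable-effect check for the example above. The 3%
# baseline, the even split, and the 80% power target are illustrative
# assumptions; only the 4,200-session total comes from the example.
from math import sqrt

from scipy.stats import norm

baseline = 0.03          # assumed baseline conversion rate
n_per_arm = 4200 // 2    # 2,100 sessions in each variant

# Standard error of the difference between two proportions under the null.
se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)

# Smallest absolute lift detectable with 80% power at a two-sided 5% alpha.
z = norm.ppf(0.975) + norm.ppf(0.80)
mde_abs = z * se
mde_rel = mde_abs / baseline

print(f"minimum detectable effect: {mde_abs:.4f} absolute ({mde_rel:.0%} relative)")
# Roughly 1.5 percentage points absolute, i.e. around a 49% relative lift.
# A 12% lift is far below what this test could credibly separate from noise.
```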
The common belief is that statistical significance at 95% confidence means the result is reliable. In practice, that threshold is routinely crossed on insufficient samples because testing tools declare a winner the moment p dips below 0.05, not after a pre-determined sample size has been collected.
This is called optional stopping, and it inflates false positive rates from the expected 5% to somewhere between 20% and 40%, depending on how frequently the test is checked.
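The inflation is easy to reproduce in simulation. The sketch below runs A/A tests, meaning both arms share the same true rate and there is nothing to find, and peeks after every batch of sessions, stopping the first time a two-proportion z-test dips below p = 0.05; the 3% rate, batch size, and number of looks are illustrative assumptions.

```python
# Minimal simulation of optional stopping on a test with no true effect.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeked_null_test(p=0.03, batch=200, looks=14):
    """Return True if an A/A test is ever declared significant mid-flight."""
    conv_a = conv_b = n = 0
    for _ in range(looks):
        conv_a += rng.binomial(batch, p)
        conv_b += rng.binomial(batch, p)
        n += batch
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0:
            z = abs(conv_a / n - conv_b / n) / se
            if 2 * norm.sf(z) < 0.05:   # two-sided p-value crosses 0.05
                return True              # stop early, declare a "winner"
    return False

runs = 2000
false_positives = sum(peeked_null_test() for _ in range(runs))
print(f"false positive rate with 14 looks: {false_positives / runs:.1%}")
# A fixed-sample test would sit near 5%; peeking pushes it well above that.
```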
The mechanism is straightforward. Conversion rates in ecommerce are noisy. They vary by day of week, by traffic source mix, by promotional calendar, by weather. A two-week test captures one or two cycles of this variance. It doesn't control for any of it.
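The size of that background movement is easy to underestimate. The sketch below simulates a store where nothing changes, using day-of-week multipliers and occasional promotion days (all illustrative assumptions), and then measures the conversion rate of every possible two-week window across the year.

```python
# How much a "flat" store's conversion rate moves with no change to the site.
import numpy as np

rng = np.random.default_rng(7)

DOW_FACTOR = np.array([1.0, 0.95, 0.95, 1.0, 1.1, 1.25, 1.15])  # Mon..Sun mix
BASE_RATE = 0.03            # assumed underlying conversion rate
DAILY_SESSIONS = 1500       # assumed traffic level
DAYS = 364

rates = BASE_RATE * DOW_FACTOR[np.arange(DAYS) % 7]
rates = rates * np.where(rng.random(DAYS) < 0.05, 1.5, 1.0)   # promo days
conversions = rng.binomial(DAILY_SESSIONS, rates)

# Conversion rate over every possible two-week window in the year.
window_rate = np.convolve(conversions, np.ones(14), "valid") / (14 * DAILY_SESSIONS)
print(f"two-week conversion rate: {window_rate.min():.2%} to {window_rate.max():.2%}")
# The spread between the best and worst window is comparable to the lifts
# most tests claim to detect, and none of it reflects anything the site did.
```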
What compounds the problem is survivorship bias in reporting. Teams run 20 tests. Three show significance. Those three get reported. The 17 that didn't move anything are forgotten. The narrative becomes "our CRO programme delivers consistent lifts" when the reality is closer to random fluctuation filtered through confirmation bias.
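The arithmetic behind that narrative is straightforward. Assuming the per-test false positive rate has been inflated by peeking to around 25% (with the textbook 5% shown for comparison), three "winners" out of twenty true-null tests is not a surprising outcome at all:

```python
# Expected chance "winners" from 20 tests where nothing actually worked.
# The 25% per-test false positive rate is an assumed peeking-inflated figure.
from scipy.stats import binom

for fpr in (0.05, 0.25):
    expected = 20 * fpr
    at_least_three = binom.sf(2, 20, fpr)   # P(3 or more tests "significant")
    print(f"per-test FPR {fpr:.0%}: expect {expected:.1f} winners, "
          f"P(>=3) = {at_least_three:.0%}")
# At a 25% per-test rate, three or more false "winners" out of 20 happens
# around nine times in ten.
```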
The stores that get value from experimentation typically do two things differently. They pre-register sample sizes based on a minimum detectable effect. And they measure revenue per session rather than conversion rate, because conversion rate ignores order value, which is where most of the actual variance sits.
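A minimal sketch of the first half of that, using the standard two-proportion sample size formula; the baseline rate, target lift, power, and alpha are assumptions, not numbers from any particular store:

```python
# Pre-registration step: fix the minimum detectable effect, derive the sample
# size, and don't evaluate the test until that many sessions have accrued.
from math import ceil

from scipy.stats import norm

def sessions_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Sessions needed in each arm to detect `relative_mde` on `baseline`."""
    delta = baseline * relative_mde
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * baseline * (1 - baseline) * (z / delta) ** 2)

# A 5% relative lift on a 3% baseline needs roughly 200,000 sessions per arm,
# which is months of traffic for most stores, not eleven days.
print(sessions_per_arm(baseline=0.03, relative_mde=0.05))

# The second half: score each variant on revenue per session, counting
# non-converting sessions as zero-revenue visits, so order value is included.
def revenue_per_session(total_revenue, total_sessions):
    return total_revenue / total_sessions
```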
Experimentation works. But most experimentation programmes measure their own activity, not their commercial impact.