CRO | By the Editorial Staff | June 1, 2025

A/B Testing for Practitioners: Statistical Significance and Why Most Tests Are Meaningless

Most A/B tests businesses run are statistically underpowered and produce false positives. Here is the math you need and the process that generates real insights.

A/B testing sounds simple: show version A to half your users, version B to the other half, measure which converts better. In practice, most tests businesses run are methodologically broken. They are called too early, run on insufficient traffic, test too many variables, or measure the wrong metric. The result: decisions based on noise are treated as signals.

This is not an academic problem. A business that ships "winning" A/B test variants that are actually false positives is building its product on flawed decisions. Twelve months of bad testing compounds into a meaningfully worse product and a lower conversion rate than when it started.

Statistical Significance Is Not What You Think It Is

The 95% confidence level is the industry default, meaning you accept a 5% chance of a false positive: concluding there is a difference when there is none. At 95% confidence, if you run 20 tests of changes that have no real effect, you should expect roughly one of them to come back "significant" by chance alone.

The problem: most growth teams run 20+ tests per quarter. They are in false positive territory constantly, and most of them do not know it.
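The scale of the problem is easy to work out directly. The sketch below (plain Python, no dependencies; the 20-test figure is the article's own example) computes the expected number of false positives and the probability of at least one, under the assumption that none of the tested changes has a real effect.

```python
# False positives from running many tests at a 5% significance level,
# assuming none of the tested changes has a real effect.
alpha = 0.05       # per-test false positive rate at 95% confidence
num_tests = 20     # tests run per quarter (article's example)

expected_false_positives = alpha * num_tests
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"Expected false positives: {expected_false_positives:.1f}")   # 1.0
print(f"P(at least one false positive): {prob_at_least_one:.0%}")    # ~64%
```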

More importantly, statistical significance does not tell you whether the result is meaningful. A 0.1% lift on a checkout conversion rate at 95% confidence is statistically significant with enough traffic. It is not practically significant. It is noise.

What you need: both statistical significance (ideally 95-99%) and practical significance (the lift is large enough to matter to your business). Define the minimum detectable effect (MDE) before you run the test, not after.

Calculating the Traffic You Actually Need

The most common testing error: running a test for a week with insufficient traffic and calling it. Here is how to calculate the required sample size:

You need to define:

  • Current baseline conversion rate (e.g., 3.2%)
  • Minimum detectable effect (the smallest improvement worth detecting, e.g., 15% relative improvement = 3.68% conversion rate)
  • Statistical significance level (95%)
  • Statistical power (80%, meaning 20% chance of missing a real effect)

With these inputs, use a sample size calculator (Evan Miller's is the standard reference). For a 3.2% baseline, 15% relative MDE, 95% significance, and 80% power, you need roughly 22,000-23,000 visitors per variant. If your test page gets 200 visitors per day (100 per variant), that is more than seven months of runtime. Most businesses are not willing to run a test for seven months, so they run it for two weeks and make a bad decision.
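If you prefer to compute the number in code rather than a web calculator, here is a minimal sketch of the same calculation using statsmodels (the library choice is an assumption; any two-proportion power calculation gives a comparable figure):

```python
# Required sample size per variant for a two-proportion A/B test.
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.032                           # current conversion rate (3.2%)
relative_mde = 0.15                        # smallest lift worth detecting (15% relative)
target = baseline * (1 + relative_mde)     # 3.68%

effect_size = proportion_effectsize(target, baseline)   # Cohen's h (positive, target > baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                # 95% significance, two-sided
    power=0.80,                # 80% chance of detecting a real effect of this size
    alternative="two-sided",
)

print(math.ceil(n_per_variant))   # roughly 22,600 visitors per variant
```

Note that required traffic scales with the inverse square of the effect: halving the MDE roughly quadruples the sample size, which is why setting a realistic MDE matters so much on lower-traffic pages.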

The solution is to test only on high-traffic pages where you can reach statistical power in a reasonable time, and to set a realistic MDE based on your traffic volume.
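One way to set that realistic MDE is to invert the calculation: fix the traffic you can actually collect in the time you are willing to wait, and solve for the smallest lift you could reliably detect. A sketch using the standard normal-approximation formula (scipy for the z-values; the traffic and duration figures below are placeholder assumptions):

```python
# Smallest relative lift detectable given the traffic you can realistically collect.
import math
from scipy.stats import norm

baseline = 0.032            # current conversion rate
daily_visitors = 200        # visitors per day to the test page (assumption)
test_days = 28              # longest acceptable test duration (assumption)
n_per_variant = daily_visitors * test_days / 2    # 50/50 split

z_alpha = norm.ppf(1 - 0.05 / 2)    # 95% significance, two-sided
z_power = norm.ppf(0.80)            # 80% power
detectable_h = (z_alpha + z_power) * math.sqrt(2 / n_per_variant)   # Cohen's h

# Convert Cohen's h back to a conversion rate, then to a relative lift.
target = math.sin(math.asin(math.sqrt(baseline)) + detectable_h / 2) ** 2
print(f"Detectable relative lift in {test_days} days: {target / baseline - 1:.0%}")
# At 200 visitors/day, four weeks only powers a test for roughly a 45% relative lift.
```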

What Constitutes a Valid Test

One variable at a time: Testing the headline AND button color AND page layout simultaneously is not A/B testing. It is a mess. You cannot attribute the result to a specific change. Multivariate testing (MVT) can separate the effects, but it requires far more traffic to reach significance, roughly in proportion to the number of variant combinations.

Traffic assignment must be random and consistent: Every visitor should see the same variant on return visits. Session-level randomization (re-assigning a variant on every session) creates contamination. Cookie-based or user ID-based assignment is correct.
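A common way to get consistent assignment is to hash a stable identifier (a first-party cookie ID or user ID) together with the experiment name, so the same visitor always lands in the same bucket and different experiments split traffic independently. A minimal sketch; the identifiers and experiment name are placeholders:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically map a stable user identifier to a variant.

    Hashing user_id together with the experiment name keeps assignment
    stable across sessions while keeping different experiments independent.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always sees the same variant on return visits.
assert assign_variant("user-1234", "checkout-headline-v2") == \
       assign_variant("user-1234", "checkout-headline-v2")
```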

No peeking: Checking results daily and stopping the test when you see a positive result inflates false positive rates dramatically. Pre-commit to the sample size and run the test until you reach it.
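The cost of peeking is easy to demonstrate with a simulation: run many A/A tests (no real difference between variants), check significance every "day", and stop at the first significant reading. The sketch below (numpy and statsmodels; the traffic numbers are arbitrary assumptions) typically reports a false positive rate several times the nominal 5%.

```python
# Simulate A/A tests to show how daily peeking inflates the false positive rate.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
true_rate = 0.032          # both variants convert at the same rate (A/A test)
daily_per_variant = 500    # arbitrary assumption
days = 30
alpha = 0.05

def peeking_test() -> bool:
    """Return True if the test is (wrongly) declared significant at any daily peek."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += rng.binomial(daily_per_variant, true_rate)
        conv_b += rng.binomial(daily_per_variant, true_rate)
        n += daily_per_variant
        _, p = proportions_ztest([conv_a, conv_b], [n, n])
        if p < alpha:
            return True
    return False

false_positives = sum(peeking_test() for _ in range(1000))
print(f"False positive rate with daily peeking: {false_positives / 1000:.0%}")
# Typically several times the nominal 5% when you peek this often.
```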

Segment the results: An overall 2% lift can hide a 10% lift on mobile and a 3% drop on desktop. Segment your results by device, traffic source, new vs. returning visitors, and any other dimension that might reveal differential effects before making a global decision.
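If your results live in an event-level table, the segmentation itself is a few lines of pandas; the file and column names below are placeholder assumptions. The point is to inspect each segment's lift alongside its sample size before trusting the blended number.

```python
# Per-segment conversion rates for an A/B test, assuming an event-level table
# with one row per visitor and columns: variant, device, source, converted (0/1).
import pandas as pd

df = pd.read_csv("experiment_results.csv")   # placeholder file name

by_segment = (
    df.groupby(["device", "variant"])["converted"]
      .agg(visitors="count", conversion_rate="mean")
      .unstack("variant")
)
print(by_segment)   # compare control vs. treatment within each device segment
```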

What to Test (And What to Skip)

High-impact test candidates:

  • Headlines and value propositions (direct impact on perceived value, high variance)
  • CTA button copy and positioning (high visibility, clear conversion action)
  • Pricing display (for e-commerce and SaaS)
  • Social proof elements (testimonials, review counts, logos)
  • Form length and field order (directly removes or adds friction)

Low-ROI test candidates with insufficient differentiation:

  • Button color (unless dramatically different)
  • Image choices that are visually similar
  • Footer copy
  • Minor copy tweaks on low-traffic pages

The highest-leverage tests are those that change something the visitor cares about at a critical decision point. The checkout page, pricing page, and landing page above the fold are where testing effort is worth it.

The Process That Generates Real Learning

  1. Hypothesis before test: "We believe that [change] will [outcome] because [rationale]." A test without a hypothesis is an exploration, not a test. Hypotheses force you to think about the mechanism.
  2. Calculate required sample size before launching: If you cannot reach it in a reasonable time, do not run the test.
  3. Implement QA: Before launching, verify that variant code is working correctly, conversion tracking fires on both variants, and the traffic split is correct.
  4. Set a calendar reminder for the end date: Do not check results until you have the required traffic.
  5. Analyze the full results: Overall conversion rate, segmented results, secondary metrics (bounce rate, time on page, revenue per visitor). A minimal example of the significance calculation follows this list.
  6. Document and share the learning: Win or lose, what did you learn about your users? This is the compounding value of a testing program.
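For step 5, the headline significance calculation on the final counts is a two-proportion z-test. Most experimentation platforms run this for you; the sketch below (statsmodels, with placeholder counts) simply makes the inputs explicit.

```python
# Final read on an experiment: two-proportion z-test on the overall counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [740, 820]      # control, treatment (placeholder counts)
visitors = [22600, 22600]     # visitors per variant (placeholder counts)

stat, p_value = proportions_ztest(conversions, visitors)
control_rate, treatment_rate = (c / n for c, n in zip(conversions, visitors))
lift = treatment_rate / control_rate - 1

print(f"control {control_rate:.2%}, treatment {treatment_rate:.2%}, "
      f"lift {lift:+.1%}, p = {p_value:.3f}")
```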

The Infrastructure to Do This Right

You need an experimentation platform that handles traffic splitting, variance reduction (CUPED), and significance calculations correctly. Google Optimize is gone. The current options: Optimizely, VWO, AB Tasty, and Statsig (for engineering-heavy teams). For e-commerce, Shopify's native A/B testing or Intelligems for pricing tests.

Do not roll your own experimentation infrastructure unless you have a team that can maintain it properly. The edge cases in experimentation tooling -- carryover effects, novelty effects, SRM (sample ratio mismatch) detection -- are numerous and non-obvious.
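Whatever platform you use, SRM is the check worth running yourself on every test: if you configured a 50/50 split, the observed visitor counts should be statistically consistent with it, and a chi-square goodness-of-fit test flags when they are not. A minimal sketch (scipy, with placeholder counts):

```python
# Sample ratio mismatch (SRM) check: does the observed split match the configured one?
from scipy.stats import chisquare

observed = [10135, 9865]        # visitors in control / treatment (placeholder counts)
expected_ratio = [0.5, 0.5]     # configured 50/50 split
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:             # a strict threshold is conventional for SRM checks
    print(f"Possible SRM (p = {p_value:.4f}): investigate before trusting results")
else:
    print(f"Split looks consistent with 50/50 (p = {p_value:.3f})")
```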

Running a proper testing program is a competitive advantage. Most competitors are running bad tests or no tests. Build the infrastructure and the discipline to run it correctly, and you will compound your conversion rate improvements over time.
