A/B test significance calculator · Post-test analysis

Was that A/B test actually a win?

Pooled + unpooled SE · Wilson per-arm CI · Post-hoc power check · Up to 6 arms

After the test ends, paste in visitor and conversion counts for control and each variant. The calculator returns a verdict, a confidence interval on the lift, achieved power, and a per-arm rate range. No "approaching significance" weasel words. No peeking-inflated p-values. The same fixed-horizon frequentist math testing platforms like Optimizely shipped for years, with the inputs and reasoning laid bare.

Verdicts are limited to three: significant win, significant loss, or inconclusive. If the test was undersized to detect the lift you observed, the achieved-power field tells you exactly that. Built so that "should we ship?" has a one-sentence answer backed by transparent math.

Inputs
Arms (2/6)
Control · Rate: 5.00%
Variant · Rate: 5.70%
Confidence: 95% (default). Alpha = 100% − confidence.
Methodology

Methods: pooled standard error for p-value, unpooled standard error for the lift CI, Wilson score interval for per-arm rate CIs. P-values via the Abramowitz-Stegun 26.2.17 normal CDF approximation. Verified against statsmodels.stats.proportion.proportions_ztest. Last reviewed 2026-05-11.
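
For hand-verification, here is a minimal Python sketch of the same methodology: the Abramowitz-Stegun 26.2.17 CDF approximation, the pooled SE for the p-value, and the unpooled SE for the lift CI. Function names are illustrative, not the calculator's internals, and the critical value is hard-coded at 1.96 (95% two-sided).

```python
import math

def norm_cdf(x: float) -> float:
    """Standard-normal CDF via Abramowitz-Stegun 26.2.17 (|error| < 7.5e-8)."""
    t = 1.0 / (1.0 + 0.2316419 * abs(x))
    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
             + t * (-1.821255978 + t * 1.330274429))))
    tail = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) * poly
    return 1.0 - tail if x >= 0 else tail

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """z, two-sided p, and 95% CI on the absolute lift (arm 2 minus arm 1)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # for the p-value
    z = (p2 - p1) / se_pool
    p_value = 2 * (1 - norm_cdf(abs(z)))
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # for the CI
    ci = ((p2 - p1) - 1.96 * se_unpooled, (p2 - p1) + 1.96 * se_unpooled)
    return z, p_value, ci

# Reproduces the Results panel below: 250/5,000 vs 285/5,000
z, p, ci = two_proportion_ztest(250, 5000, 285, 5000)
print(round(z, 3), round(p, 3), ci)  # 1.555 0.12 (-0.0018..., 0.0158...)
```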

Results
Inconclusive

Verdict, in three states.

Verdict = "win" or "loss" only if p-value < adjusted alpha and the lift sign matches. Otherwise "inconclusive". Confidence interval shows the range of plausible true lifts. Achieved power tells you whether the test had enough samples to find the effect it just observed.

Confidence: 95.0% (α = 5.00%)
Effective α: 5.00% (per-test)
Test tail: Two-sided

Variant A vs Control · Inconclusive
Lift (relative): +14.00%
p-value: 0.120
95% CI on lift: [-0.18pp, +1.58pp]
Achieved power: 34%
Heads up
  • Variant A is inconclusive with only 34% achieved power. The test was too small to detect the observed effect. Run longer or accept null.
Per-arm + comparison detail

Conversion rates with Wilson CIs

Arm       | Visitors | Conversions | Rate  | 95% Wilson CI
Control   | 5,000    | 250         | 5.00% | [4.43%, 5.64%]
Variant A | 5,000    | 285         | 5.70% | [5.09%, 6.38%]
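
The Wilson bounds in this table can be reproduced in a few lines. A sketch using the standard Wilson score formula, with z = 1.96 for 95% (function name illustrative):

```python
import math

def wilson_ci(conversions: int, visitors: int, z: float = 1.96):
    """Wilson score interval for a single conversion rate."""
    p = conversions / visitors
    denom = 1 + z * z / visitors
    center = (p + z * z / (2 * visitors)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / visitors
                                   + z * z / (4 * visitors * visitors))
    return center - half, center + half

print(wilson_ci(250, 5000))  # ≈ (0.0443, 0.0564) → [4.43%, 5.64%]
print(wilson_ci(285, 5000))  # ≈ (0.0509, 0.0638) → [5.09%, 6.38%]
```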

Comparison statistics

Variant A vs Control
Absolute lift: +0.700pp
z-score: 1.555
Pooled SE: 0.00450
Unpooled SE: 0.00450
Verdicts walked through

Six verdicts, with the math shown.

Clear win, clear loss, inconclusive, a false alarm caught by Bonferroni, a low-power null, and a tiny-sample mirage. Each case shows the inputs, the calculator output, and the takeaway for shipping or killing the variant.

Clear win on a healthy sample

Signup conversion test. Control: 5,000 visitors, 250 conversions (5.00%). Variant: 5,000 visitors, 320 conversions (6.40%). 95% confidence, two-sided.

Control 250/5,000 (5.00%) · Variant 320/5,000 (6.40%) · Confidence 95% · Two-sided
> Relative lift: +28.0% · z-score: 3.02 · p-value: 0.0025 · 95% CI on absolute lift: +0.49pp to +2.31pp · Achieved power: 86%. Verdict: significant win.

Textbook clean win. Lift is well above the noise floor, CI is comfortably above zero, achieved power is high. Ship the variant. Document the inputs and the sample size so the next test can be planned accordingly.

Inconclusive with low power

Pricing-page test. Control: 1,200 visitors, 36 conversions (3.00%). Variant: 1,200 visitors, 42 conversions (3.50%). 95% confidence, two-sided.

Control 36/1,200 (3.00%) · Variant 42/1,200 (3.50%) · Confidence 95% · Two-sided
> Relative lift: +16.7% · z-score: 0.69 · p-value: 0.49 · 95% CI: -0.92pp to +1.92pp · Achieved power: 10%. Verdict: inconclusive.

The lift looks decent on paper but the test is undersized. Power 10% means that even if the variant truly lifts by exactly what you observed, only 10 in 100 tests this size would call it. Run longer if you can; otherwise treat as null.

False alarm caught by Bonferroni

Four-variant homepage test, three variants vs control. Each ~5,000 visitors. Variant B shows a 6.0% rate vs control's 5.0%. Without correction, that lift trips significance. With Bonferroni for 3 comparisons, it does not.

Control 250/5,000 · A 245/5,000 · B 300/5,000 · C 255/5,000 · 95% conf · Two-sided · Bonferroni ON
> B vs Control: p-value 0.028 raw. Bonferroni-adjusted alpha: 1.67%. Verdict: inconclusive (was significant pre-correction). Other variants well above 1.67%.

Multi-variant tests need multi-comparison correction. Without it, you ship false winners. This is exactly the situation Bonferroni exists to prevent. A raw p-value of 0.04 across multiple comparisons is not the same evidence as 0.04 from a clean A/B.
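
The arithmetic behind that claim is one line each. A sketch for this test's three comparisons:

```python
alpha, m = 0.05, 3                   # per-test alpha, number of comparisons
fwer_naive = 1 - (1 - alpha) ** m    # ≈ 0.143: ~14% chance of ≥1 false winner
alpha_adjusted = alpha / m           # ≈ 0.0167: the 1.67% Bonferroni bar above
print(fwer_naive, alpha_adjusted)
```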

Significant loss

Onboarding flow redesign. Control: 10,000 visitors, 1,200 conversions (12%). Variant: 10,000 visitors, 1,080 conversions (10.8%). 95% confidence, two-sided.

Control 1,200/10,000 (12.00%) · Variant 1,080/10,000 (10.80%) · Confidence 95% · Two-sided
> Relative lift: -10.0% · z-score: -2.67 · p-value: 0.0076 · 95% CI on lift: -2.08pp to -0.32pp. Verdict: significant loss.

Test caught a real regression before it shipped. CI is fully below zero. Kill the variant. Significant losses are the most underrated outcome of a healthy testing program. They prevent ship-and-regret bugs.

Practically significant but statistically meh

Retention email test, very small sample. Control: 250 visitors, 15 conversions (6%). Variant: 250 visitors, 22 conversions (8.8%). 95% confidence, two-sided.

Control 15/250 (6.0%) · Variant 22/250 (8.8%) · Confidence 95% · Two-sided
> Relative lift: +46.7% · z-score: 1.20 · p-value: 0.23 · 95% CI: -1.78pp to +7.38pp · Achieved power: 22%. Verdict: inconclusive.

Massive observed lift but the sample is too small to call. The CI spans from a meaningful loss to a huge win. You cannot rule anything out. Run again at 5-10x the sample, or accept that low-traffic surfaces need bigger MDEs to be testable.

Bonferroni saved you from a false win

Five challenger variants vs control on a checkout flow. All ~8,000 visitors. Variant D shows a 4.2% rate vs control 3.5%. Raw p = 0.021 (significant at 5%). Adjusted alpha = 5% ÷ 5 comparisons = 1.00%.

6 arms · ~8,000 each · Variant D 336/8,000 (4.2%) vs Control 280/8,000 (3.5%) · Bonferroni ON
> D vs Control: p-value 0.021 raw. Bonferroni-adjusted alpha: 1.00%. Verdict: inconclusive. No variants cleared the corrected bar.

Aggressive multi-variant testing breeds false positives unless you adjust. With 5 challengers, you have a ~23% naive chance of at least one false win even with no true effects anywhere. Always apply Bonferroni, or a false-discovery-rate correction for high-arm tests.

Verdict workflow

Reading the verdict, step by step.

Run this workflow once the test reaches the pre-planned sample size. The verdict line tells you whether to ship. The confidence interval and achieved power tell you how confident you should be in that call.

01
Wait until the planned sample is hit
Before computing significance, make sure the test ran to the pre-committed sample size. Stopping early on a "looks significant" peek inflates false-positive rates dramatically and invalidates the p-value math. If you peeked, treat results as suggestive, not decisive.
02
Paste visitors and conversions per arm
For each arm, the visitor count (denominator) and the number of conversions (numerator). Make sure the visitor count is unique users at the assignment level, not pageviews or sessions. Confirm both arms got roughly equal traffic before reading any further.
03
Read the verdict, then the CI
The verdict is win, loss, or inconclusive. Then look at the confidence interval on the lift. A CI of [0.5%, 2%] gives totally different shipping confidence than [-1%, 3.5%] even if both are "significant". The CI is the more honest signal.
04
Check the achieved power on inconclusive results
If inconclusive, the achieved-power number tells you whether the test was too small to detect the lift you saw. Power below 50% means: extend the test if you can. Power above 80% with an inconclusive result means: the effect, if any, is small. (A sketch of this check follows the list.)
05
Apply Bonferroni for multi-variant
For 3+ arms, leave Bonferroni on. The per-comparison alpha tightens to keep family-wise error at your chosen confidence. Disable only if you are running a single planned comparison or using a different multi-comparison framework.
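
The step-04 check, sketched in code. The stdlib NormalDist stands in for the calculator's A&S approximation, and the 1.96 default assumes 95% two-sided:

```python
from statistics import NormalDist

def achieved_power(z_observed: float, z_crit: float = 1.96) -> float:
    """Post-hoc power: chance a test of this size flags the lift it observed."""
    return 1 - NormalDist().cdf(z_crit - abs(z_observed))

print(round(achieved_power(1.555), 2))  # 0.34 → the 34% in the Results panel
```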
Verdict pitfalls

Eight ways an A/B test verdict gets faked.

The math behind a p-value is rigorous but its interpretation is fragile. Every entry below is a real failure mode that produces fake winners or fake losers in indie SaaS testing programs.

Pitfall: Peeking and stopping early
Why it inflates errors: Stopping a test early because the lift "looks good" inflates the false-positive rate dramatically. A 5% alpha test that gets peeked daily can hit an actual false-positive rate of 15-30%.
The fix: Commit to a fixed sample size before launch. Run to it. Do not look at significance until the planned sample is reached. If you need interim checks, use sequential testing methods designed for it.

Pitfall: Fishing past the planned end
Why it inflates errors: Extending a test past the planned sample because the result is "almost significant" is the same fallacy in the other direction. The p-value math assumes a fixed-N stopping rule.
The fix: If you reach the planned sample and the test is inconclusive, accept null and move on. Extending introduces look-elastic significance and bakes in false-positive bias.

Pitfall: Ignoring multi-variant correction
Why it inflates errors: Comparing three challengers to control without correction nearly triples the false-positive risk. A naive 5% alpha across three comparisons yields ~14% family-wise error.
The fix: Apply Bonferroni (this calculator does it automatically) or switch to false-discovery-rate methods for 5+ variants. Multi-variant tests are not free.

Pitfall: Confusing significance with importance
Why it inflates errors: A statistically significant 0.3% lift on a 10,000,000-visitor test is still a tiny business effect. Significance answers "is the lift different from zero?", not "is the lift worth shipping?".
The fix: Always pair the p-value with the lift size and confidence interval. Decide shipping based on the lift magnitude and business cost, not just the p-value.

Pitfall: Novelty effect not accounted for
Why it inflates errors: A new variant often produces a short-term lift just because it is new. Tests run for less than a full business cycle (typically a week) can mistake novelty for genuine improvement.
The fix: Run tests for at least one full business cycle. For high-engagement product changes, run for two cycles. Watch the lift over time. If it shrinks toward null late in the test, you saw novelty.

Pitfall: Different denominators between arms
Why it inflates errors: If the bucketing is wrong (e.g. caching skew, bot traffic, mis-tagged sessions), one arm gets more or fewer visitors than the other. The math still computes a p-value but the result is meaningless.
The fix: Sanity-check sample-ratio mismatch before reading results. Each arm's traffic should sit within the range expected by chance for the planned split. If imbalanced, investigate the bucketing before trusting any number (see the sketch after this table).

Pitfall: Reading the wrong metric
Why it inflates errors: Significance was calculated on signup but the team optimized for revenue. Two different metrics, two different sample-size requirements, two different verdicts.
The fix: Pick the primary metric before launch. Report it. Treat secondary metrics as descriptive only, not as evidence for shipping or killing the variant.

Pitfall: Treating "p = 0.06" as "approaching significance"
Why it inflates errors: You picked alpha = 5% before the test. 0.06 is above 5%. The test is inconclusive at your committed threshold. Treating 0.06 as "almost significant" is exactly the post-hoc rationalization that inflates false-positive rates.
The fix: Be honest about the threshold you committed to. If 6% is acceptable to you, set alpha to 6% before the test, not after. Otherwise treat 0.06 as inconclusive and report it that way.
Quick read

The single most important habit for trustworthy A/B testing is committing to the sample size and stopping rule before launch. Almost every pitfall above traces back to that one discipline. Run the sample-size calculator first, then this one.

Math behind the verdict

Six lines of math behind the win-or-loss call.

Two-proportion z-test post-test math, reduced to six lines. The same formulas Optimizely, AB Tasty, and other platforms have shipped in their frequentist modes. Each line is paired with its example calculation so you can verify the calculator by hand.

01
Conversion rate
= p̂ = conversions ÷ visitors
The observed conversion rate for each arm. Sample-based estimate of the true (unobservable) underlying conversion probability. Confidence intervals on this rate use the Wilson score interval, which stays accurate near 0% and 100%.
320 conversions ÷ 5,000 visitors = 6.40%
02
Pooled standard error (for p-value)
= SE_pool = √(p_pool(1-p_pool)(1/n₁ + 1/n₂))
Standard error of the difference under the null hypothesis that both arms share the same underlying rate. p_pool = (x₁ + x₂) / (n₁ + n₂). Used for the z-score and p-value.
p_pool = 570 / 10,000 = 0.057. SE_pool = √(0.057·0.943·(1/5,000 + 1/5,000)) = 0.00464
03
z-score
= z = (p₂ - p₁) ÷ SE_pool
The observed lift expressed in pooled standard errors. A z-score of 1.96 corresponds to a two-sided p-value of 0.05, the classic significance threshold. Larger |z| = stronger evidence against the null.
(0.064 - 0.050) / 0.00464 = 3.02
04
p-value (two-sided)
= p = 2·(1 - Φ(|z|))
Probability of observing a lift this large (or larger, in either direction) under the null of no true effect. Compare to alpha. Below alpha = significant. Φ is the standard-normal CDF.
2·(1 - Φ(3.02)) = 0.0025
05
Unpooled SE + confidence interval
= CI = (p₂ - p₁) ± z_crit·√(p₁(1-p₁)/n₁ + p₂(1-p₂)/n₂)
CI on the absolute lift. Uses the unpooled standard error, which does NOT assume both arms share a rate (appropriate for estimation, unlike the pooled SE used for the null-hypothesis test). z_crit = 1.96 for 95% two-sided.
0.014 ± 1.96·0.00465 = [+0.49pp, +2.31pp]
06
Achieved (post-hoc) power
= power = 1 - Φ(z_crit - |effect| ÷ SE_pool), with z_crit = 1.96 for 95% two-sided
Probability the test would have detected the lift you actually observed, given the sample you actually ran. Low power on an inconclusive result means the test was too small. High power on an inconclusive result is solid evidence the true effect is small or zero.
1 - Φ(1.96 - 3.02) = 1 - Φ(-1.06) = 0.86 → 86% power
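
These six lines can be cross-checked against statsmodels, which the methodology note cites. A sketch for the worked example above (320/5,000 vs 250/5,000):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Two-sample z-test with pooled SE (statsmodels' default for two samples)
z, p = proportions_ztest(count=np.array([320, 250]),
                         nobs=np.array([5000, 5000]),
                         alternative='two-sided')
print(round(z, 2), round(p, 4))  # 3.02 0.0025 (matches steps 03 and 04)

# Wilson per-arm interval for the variant, as in step 01
print(proportion_confint(320, 5000, alpha=0.05, method='wilson'))
# ≈ (0.0575, 0.0711)
```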
Questions

Verdicts, p-values, and pitfalls.

Everything worth knowing about A/B test significance, p-values, confidence intervals, Bonferroni correction, achieved power, and the pitfalls that fake results.

Statistical significance is the formal answer to "could the lift I am seeing be due to chance alone?". The test produces a p-value: the probability of seeing a lift this large or larger if the variant has no real effect. If the p-value is below your significance threshold (alpha, typically 5%), you call the result significant. Significance does not mean the lift is large or meaningful. It only means the lift is statistically distinguishable from zero given the sample size. Always pair significance with the actual lift size and confidence interval to decide whether to ship.

For a two-proportion z-test (the standard A/B test): compute the conversion rate for each arm, then the pooled rate across both arms. Compute the pooled standard error = sqrt(p_pool × (1-p_pool) × (1/n1 + 1/n2)). The z-score is (variant rate - control rate) divided by the pooled SE. The p-value (two-sided) is 2 × (1 - Phi(|z|)), where Phi is the standard-normal CDF. A z-score of 1.96 corresponds to a two-sided p-value of 0.05, which is the classic significance threshold.

The confidence interval tells you the range of plausible true lifts consistent with your observed data. A 95% CI of [0.5%, 2.1%] absolute lift says: "If we repeated this experiment many times, 95% of the calculated intervals would cover the true effect." If the CI straddles zero, the test is inconclusive: the true effect could be a win, a loss, or zero. The width of the CI is the most honest single signal of how precise your estimate is, and it is what gives the business genuinely useful information (a 5% lift with CI [0.5%, 9.5%] is fundamentally different from one with CI [4.5%, 5.5%]).

Inconclusive means the data does not support claiming a real difference between control and variant at your chosen confidence level. It does not mean "the variant did nothing", only that you cannot statistically rule out chance as the explanation. Three honest next steps: (1) accept the null and move on if the test ran to its planned sample; (2) run longer if traffic allows and the planned sample was not reached; (3) abandon the hypothesis if the achieved power is high (>80%) yet the result is null. That is solid evidence the effect, if it exists, is smaller than you targeted.

For the per-arm conversion rate CIs (e.g. control rate with its 95% range), this calculator uses the Wilson score interval rather than the simpler normal approximation. Wilson is more accurate for small samples and for rates near 0 or 1, where the normal approximation breaks down. For the lift CI (difference between arms), the calculator uses an unpooled-SE normal interval, which is the standard frequentist choice and adequate for typical A/B test sample sizes.

Trust the result when: (1) the sample size was pre-planned and not adjusted mid-test; (2) you did not peek and stop early; (3) the test ran for at least one full business cycle; (4) bucketing is sanity-checked (control and variant got roughly equal traffic); (5) no exogenous events (price changes, ads, outages) disturbed the test window. If any of these are violated, the p-value math is technically invalid even if the result looks clean. The math assumes a fixed sample-size stopping rule and no fishing.

When you compare more than one variant to control, the family-wise Type-I error inflates. A naive 5% alpha across three challengers gives ~14% chance of at least one false-positive winner. The standard fix is Bonferroni correction: divide alpha by the number of comparisons. This calculator applies it automatically when multiple variants are present. The effect is to tighten each per-comparison significance threshold and widen each confidence interval. The alternative is FDR (false discovery rate) methods for larger variant counts.

Achieved power is the probability the test would have detected the lift you actually observed, given the sample you actually ran. It is computed after the fact. Low achieved power (e.g. 30%) on an inconclusive result means: "Even if your variant truly produced the lift I just measured, the test was too small to reliably catch it." That signals you should run the test longer (if possible) or accept that you cannot resolve effects this small with the traffic available. Power above ~80% on an inconclusive result is strong evidence the true effect is genuinely small or zero.

Significance depends on both effect size and sample size. A 1% relative lift with 100 visitors per arm has nowhere near enough data to call. The same lift can need hundreds of thousands of visitors per arm at a 50% baseline rate, and millions at a 5% baseline. Two common failure modes: (1) the test was undersized for the observed effect, so you have insufficient power; (2) the effect itself is small enough to be indistinguishable from sampling noise at any reasonable sample. Look at the confidence interval: if it straddles zero, the lift might be a win, but it might also be flat or a loss.

Use two-sided almost always. Two-sided tests for any difference (win OR loss). One-sided tests for one direction only. One-sided gives more power for the same sample but cannot tell you whether your variant actively hurt the metric. Use one-sided only when a result in one direction is logically inconsequential, for instance a strictly additive feature where a loss is impossible. For routine optimization, two-sided is the right default, even though it costs roughly 25% more samples for the same power.

Standard error (SE) is the expected variability in the lift estimate due to sampling. The pooled SE assumes both arms come from the same underlying rate (the null hypothesis) and is used to compute the p-value. The unpooled SE assumes the two arms have different rates and is used to compute the confidence interval. The z-score is the observed lift divided by the pooled SE: how many standard errors the observed lift is away from zero. A z-score above 1.96 (two-sided 5% alpha) puts you in the significant zone.

If you set alpha = 5% before the test and the p-value came back 0.06, the answer is no. The test is inconclusive at the threshold you committed to. Treating 0.06 as "close enough" is exactly the kind of post-hoc rationalization that inflates false-positive rates. That said, the business decision about whether to ship can use information beyond the test: the size of the observed lift, the CI width, the cost of being wrong, the cost of waiting, and qualitative evidence. The statistical test gives you one input. The decision is yours.

The calculator reports "loss" only when the p-value is below your chosen alpha AND the observed lift is negative. That means the variant performed worse than control at a statistically detectable level. Significant losses are useful: they kill bad ideas with evidence. Most A/B tests end inconclusive, some end with a significant win, and a smaller portion end with a significant loss. Significant losses are not failures of the testing program. They are exactly what testing exists to catch before you ship.

More than one variant is supported: the form takes up to six arms total (control + 5 variants). Each non-control variant is compared to control via its own two-proportion z-test, and Bonferroni correction adjusts the per-comparison alpha to hold family-wise error at your chosen rate. The calculator does not run all pairwise comparisons between variants (only against control), which is the standard A/B/n design. Comparing variants to each other usually does not serve a clear business question.

This calculator uses frequentist methods (two-proportion z-test), the same math that Optimizely Web, AB Tasty, and most testing platforms have historically used. Bayesian A/B testing uses posterior probabilities and credible intervals instead of p-values and confidence intervals. Bayesian methods allow for honest early-stopping rules and incorporate prior information, but require explicit priors. For indie SaaS doing 1-2 tests at a time without an analytics team, frequentist methods are more mechanical and harder to abuse. Use what your platform supports and stick to the math you committed to.

Lead with the verdict, then the lift, then the CI, then the caveats. "Variant A produced a 12% relative lift in signup conversion. 95% CI: 4% to 20%. p < 0.01. Achieved power 92%. Sample reached planned size; no peeking; no exogenous events. Recommend shipping." That format is reproducible, falsifiable, and skeptical. Avoid: percentage lifts without a CI, vague "approaches significance", or any phrasing that implies the test answered a question it did not actually ask.

Peeking invalidates frequentist p-values. The math you ran assumed a fixed sample-size stopping rule, so checking results mid-test and using what you see to decide whether to stop biases the result toward false positives. Honest options: (1) run to the planned sample anyway and treat the eventual p-value as approximate; (2) use sequential testing methods (alpha-spending functions, group-sequential designs) which permit interim looks with valid p-values; (3) switch to Bayesian inference where peeking does not invalidate the math. In a pinch, treat a peeked p-value as suggestive but report the peek explicitly.

Knowing your metrics
is the easy part.
Shipping is the hard part.

FoundStep is the project management tool that won't let indie devs procrastinate. Validate your idea in 7 questions. Lock your scope. Ship, or kill it.

Free trial available
Cancel anytime
No team required
Is My A/B Test Significant? P-Value, Confidence Interval & Verdict Calculator (2026) | FoundStep