Plan an A/B test that can actually finish.
Calculate how many visitors per variant your A/B test needs before you flip the switch. Power, significance, and the minimum detectable effect are treated as first-class inputs so the result reflects what you actually committed to detecting. Two-proportion z-test math, Bonferroni for multi-variant, duration projection from your daily traffic.
Switch between relative MDE (detect a 10% lift) and absolute MDE (detect a 0.5 percentage-point lift). Read sample per variant, total sample across arms, and projected days to reach the stopping rule. Built so that the test you launch is the test you can actually finish.
Methodology + last reviewed
Formula: two-proportion z-test, large-sample normal approximation. z critical values computed with Acklam's (2003) inverse-normal-CDF approximation. Bonferroni applied automatically for variantCount > 2. Last reviewed 2026-05-11.
Sample size, locked.
n per variant = (z_α + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) ÷ (p₂ − p₁)². Power answers "if there is a real lift this big, the test will detect it X% of the time." Sample size answers "what visitor count makes that promise true?"
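The whole calculation fits in a few lines. This is a minimal Python sketch, not the calculator's actual implementation (which uses Acklam's inverse-normal approximation): the function name and defaults are illustrative, `statistics.NormalDist` stands in for the z-critical lookup, and calculators that pool variances or add continuity corrections will land a few percent away.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80,
                            two_sided: bool = True) -> int:
    """Two-proportion z-test sample size per variant (unpooled variance)."""
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline, +10% relative lift (p2 = 5.5%), 80% power, 5% two-sided alpha
print(sample_size_per_variant(0.05, 0.055))  # ~31,000 visitors per variant
```

Note that halving `p2 - p1` roughly quadruples the result: the squared-MDE rule falls straight out of the denominator.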
Critical values + MDE sensitivity
| Sample / variant | Detectable absolute lift | Detectable relative lift |
|---|---|---|
| 1,000 | +2.731pp | +54.61% |
| 2,500 | +1.727pp | +34.54% |
| 5,000 | +1.221pp | +24.42% |
| 10,000 | +0.864pp | +17.27% |
| 25,000 | +0.546pp | +10.92% |
| 50,000 | +0.386pp | +7.72% |
| 100,000 | +0.273pp | +5.46% |
| 250,000 | +0.173pp | +3.45% |
| 500,000 | +0.122pp | +2.44% |
Reading the table: pick a sample-per-variant your traffic budget allows, then read across to see the smallest lift you could realistically detect. The table assumes a 5% baseline conversion rate, 80% power, and a 5% two-sided alpha. Remember the squared relationship: halving the MDE quadruples the required sample.
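You can reproduce a table row by inverting the sample-size formula numerically: there is no closed form for the detectable lift given n, but the required sample decreases monotonically as the lift grows, so bisection works. A sketch with illustrative names; it uses the exact two-term variance p₁(1−p₁) + p₂(1−p₂), while the table appears to use the common 2·p₁(1−p₁) approximation, so this version reads a few percent higher.

```python
from statistics import NormalDist

def detectable_lift(n: int, p1: float = 0.05,
                    alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest absolute lift detectable with n visitors per variant
    (two-sided test), found by bisection on the sample-size formula."""
    z = NormalDist()
    k = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

    def required_n(d: float) -> float:
        p2 = p1 + d
        return k * (p1 * (1 - p1) + p2 * (1 - p2)) / d ** 2

    lo, hi = 1e-6, 1 - p1 - 1e-6
    for _ in range(100):      # required_n is decreasing in the lift d
        mid = (lo + hi) / 2
        if required_n(mid) > n:
            lo = mid          # this lift needs more traffic than we have
        else:
            hi = mid
    return hi

print(f"{detectable_lift(10_000):.4%}")  # roughly +0.9pp on a 5% baseline
```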
Plan a real test in five steps.
Run this workflow before every A/B test goes live. Free, instant, browser-only. No login, no analytics seat, no signup. Sequence applies whether you are testing a landing page, an onboarding flow, or a pricing change.
Six formulas behind the sample-size number.
Two-proportion z-test math, broken into six pieces. Once you can run them in your head, you can plan tests without a stats consult. Each piece is shown with the formula, an example calculation, and the planning intuition behind it.
Relative vs absolute MDE, side by side.
"Detect a 10% lift" (relative) and "detect a 0.5pp lift" (absolute) describe the same test when the baseline is fixed. They diverge once baselines shift across tests. The framing you pick depends on whether your stakeholders think in percent or percentage points, but the sample requirement is identical.
| Line item | Relative MDE framing | Absolute MDE framing | Why it matters |
|---|---|---|---|
| Baseline rate | 5% (typical SaaS signup) | 5% baseline | Same baseline across both framings. |
| MDE framing | +10% relative lift | +0.5pp absolute lift | Both target the same p₂ = 5.5%, so sample size is identical. |
| Target treatment rate (p₂) | 5.5% | 5.5% | Identical. The framing is descriptive, not mathematical. |
| Sample per variant (80% power, 5% α two-sided) | ~30,560 | ~30,560 | Same answer, two ways to ask the question. |
| Same target on a 2% baseline | +10% relative = 2.2% | +0.5pp absolute = 2.5% | Relative and absolute diverge when the baseline shifts. Relative scales with baseline; absolute does not. |
| When to prefer relative MDE | Comparing tests across different surfaces or baselines. | n/a | PMs and CRO teams usually think in relative terms. |
| When to prefer absolute MDE | n/a | Fixed business targets (e.g. "lift checkout from 3% to 4%"). | Finance and revenue planning usually think in pp. |
| Power impact (80% → 95%) | +66% sample (~50,600 / variant) | +66% sample (~50,600 / variant) | Same effect in both framings. Higher power = more samples. |
| Alpha impact (5% → 1%) | +50% sample (~46,000 / variant) | +50% sample (~46,000 / variant) | Tighter alpha = lower false-positive rate = more samples. |
| Halving MDE (10% → 5% relative) | ~4x sample (~122,000 / variant) | ~4x sample (~122,000 / variant) | The MDE-squared rule. Single biggest lever on sample requirement. |
Use relative MDE for routine CRO work. Use absolute MDE only when the business has set a fixed percentage-point target. Toggle this calculator into the framing your test ticket uses so the sample number lines up with how your team will report the result.
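The framing difference is one line of arithmetic. A hypothetical helper, just to make the divergence concrete:

```python
from math import isclose

def target_rate(baseline: float, mde: float, relative: bool) -> float:
    """Translate an MDE into the target treatment rate p2."""
    return baseline * (1 + mde) if relative else baseline + mde

# On a 5% baseline, +10% relative and +0.5pp absolute name the same target...
assert isclose(target_rate(0.05, 0.10, relative=True),
               target_rate(0.05, 0.005, relative=False))

# ...but on a 2% baseline they diverge: 2.2% vs 2.5%
print(round(target_rate(0.02, 0.10, relative=True), 4),
      round(target_rate(0.02, 0.005, relative=False), 4))  # 0.022 0.025
```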
Six planning scenarios, run end-to-end.
High-traffic and low-traffic, single-variant and multi-variant, routine and high-stakes. Plug each set of inputs into the calculator above and watch how each lever changes the feasibility verdict.
Optimizing the home-page signup conversion. Baseline conversion 5%, you want to detect a 10% relative lift (going from 5% to 5.5%). Standard 80% power, 5% alpha, two-sided. 2,000 visitors per variant per day.
The textbook A/B test. 10% relative MDE on a healthy traffic surface finishes inside 3 weeks. Run this template every time you have a meaningful funnel and clear hypothesis.
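Scenario 1 checks out by hand. A sketch under the stated inputs (5% baseline, 5.5% target, 80% power, 5% two-sided alpha, 2,000 visitors per variant per day), using the unpooled-variance z-test formula; the calculator's own rounding may differ slightly:

```python
from math import ceil
from statistics import NormalDist

z = NormalDist()
k = (z.inv_cdf(1 - 0.05 / 2) + z.inv_cdf(0.80)) ** 2  # (z_alpha + z_beta)^2
p1, p2 = 0.05, 0.055
n = ceil(k * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

days = ceil(n / 2_000)  # 2,000 visitors per variant per day
print(n, days)          # ~31,000 per variant, ~16 days: inside 3 weeks
```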
Indie SaaS testing a new pricing page. Baseline checkout-completion 3%. Hoping for a 20% relative lift. 200 visitors per variant per day. 80% power, 5% alpha.
Low-traffic founders should not target sub-20% relative lifts on 3% baselines. Either redesign for a 40-50% relative lift (bigger creative changes), or skip A/B testing and rely on qualitative validation.
Testing three new homepage variants against the control. Same baseline (5%) and MDE (10% relative) as scenario 1. But now you have 3 comparisons against control, so Bonferroni applies.
Adding variants is not free. Each extra variant tightens the Bonferroni-corrected per-comparison alpha, which raises the sample size per arm and the total test duration. Four-variant tests are rarely worth it unless you genuinely believe all three challengers could win.
A complete pricing-page redesign. The decision is critical (you are committing to new pricing tiers in production). Tighter alpha (1%) and higher power (95%) to reduce both error types. Baseline 4%, MDE 15% relative.
Tightening alpha and power is the right move for high-stakes decisions, but the cost is real: roughly 2.3x the sample and 2.3x the duration. Reserve it for pricing, branding, and structural decisions you cannot easily undo.
Bigger product, established checkout. Baseline 8%. You want to detect a 5% relative lift (small but valuable at this volume). 80% power, 5% alpha. 10,000 visitors per variant per day.
The squared-MDE rule in action. Small effects need disproportionately more samples. Only chase a 5% MDE if you have very high traffic or the business impact justifies the test cost.
Launching an additive feature. You only care whether it harms conversion (the feature is shipping either way). One-sided test in the "loss" direction at 5% alpha, 80% power. Baseline 6%. Acceptable loss threshold: 10% relative drop.
One-sided tests are 20% cheaper. Use only when a result in one direction is logically inconsequential (you are shipping anyway). For routine optimization, stick to two-sided.
Tools that pair with sample-size math.
Sample-size is the pre-test side of A/B testing. The companion calculators below handle the post-test analysis, the metrics your tests are trying to move, and the runway that determines how long you can spend testing.
Sample-size questions, answered.
Everything worth knowing about A/B test sample-size math, power, MDE, multi-variant corrections, and the difference between pre-test planning and post-test analysis.
Sample size is the number of visitors (or users, or sessions) each variant of an A/B test needs to see before the test has a high enough statistical power to reliably detect the smallest lift you care about. Run an undersized test and you have two failure modes: either it never reaches a conclusive result, or worse, it lands on a chance result and you ship a change that does nothing. Run an oversized test and you waste traffic, lengthen the test cycle, and slow product velocity. Calculating sample size up front means you commit to a stopping rule before peeking at results, which is the single most important habit for trustworthy testing.
For a two-proportion z-test (the standard for conversion-rate A/B tests): n per variant = (z_α + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) ÷ (p₂ − p₁)². Here p₁ is your baseline conversion rate, p₂ is the treatment rate after applying the minimum detectable effect, z_α is the critical value from the standard normal for your significance level (1.96 for a two-sided 5% alpha), and z_β is the critical value for your statistical power (0.84 for 80% power). The square in the denominator is why halving the MDE quadruples the sample size required.
Statistical power is the probability that the test correctly detects a real lift when one exists. 80% power is the textbook industry default: if you ran the same test 100 times against a true 10% lift, 80 of those tests would correctly land on "significant" at your chosen alpha. Higher power (90 or 95%) cuts your false-negative rate but inflates sample size sharply: going from 80% to 95% increases the required visitors by roughly two-thirds. Use 80% for routine tests, 90% for important launches you cannot afford to call wrong, and 95% only for very high-stakes decisions where a missed positive is more costly than the larger test.
Alpha is the false-positive tolerance: the probability the test calls a win when nothing real changed. 5% alpha is the textbook default and by far the most common choice in practice. It means: if you ran the same null-effect test 100 times, you would falsely call 5 of them significant by chance. Tighter alphas (1% or 0.1%) cut false positives but inflate sample size. Looser alphas (10%) sometimes appear in exploratory tests but produce lots of fake wins. Unless you have a specific reason to deviate, use 5% alpha. The test math in this calculator treats your alpha as the per-test rate, then applies a Bonferroni correction internally when you specify more than two variants.
MDE is the smallest lift the test is powered to reliably detect. It is a planning input, not a result. Setting MDE = 10% relative says: "I want the test to find a 10% relative lift if one exists." A smaller MDE asks the test to spot a smaller signal, which requires more samples. MDE is the single biggest lever on sample size: cutting MDE in half quadruples the sample required, because the formula puts MDE squared in the denominator. Realistic MDE values for indie SaaS landing tests range from 5% to 25% relative. Anything below 5% relative on a normal conversion-rate test demands an unrealistic visitor count for most indie products.
Relative MDE is what most testing platforms and PMs talk about: "I want to detect a 10% lift", meaning a 10% relative change to the baseline (a 3% baseline becomes 3.3%). Absolute MDE is measured in percentage points: "I want to detect a 0.5pp lift", meaning a 3% baseline becomes 3.5%. Both produce the same sample size given the same target, just framed differently. Relative is more intuitive when comparing tests across different baselines. Absolute is more intuitive when you have a fixed business target (e.g. "we need to lift checkout from 3% to 4%"). This calculator supports both modes and shows the equivalent in the other framing.
A two-sided test tests for any difference (win OR loss). A one-sided test only tests for one direction (win only, or loss only). One-sided tests need ~20% fewer samples than two-sided to achieve the same power, but they cannot tell you if your variant actively hurt the metric, only whether it failed to win. Use two-sided tests for almost everything: routine A/B tests, landing-page redesigns, pricing changes, onboarding flows. Use one-sided only when a result in the wrong direction is logically impossible (e.g. a new feature that is strictly additive). The vast majority of indie A/B tests should be two-sided.
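The ~20% discount comes entirely from the smaller one-sided z-critical (1.645 vs 1.960). Since sample size scales with (z_α + z_β)², the exact figure at 80% power and 5% alpha works out to about 21%; a quick sketch:

```python
from statistics import NormalDist

z = NormalDist()
z_beta = z.inv_cdf(0.80)                          # 80% power
two_sided = (z.inv_cdf(1 - 0.05 / 2) + z_beta) ** 2   # z_alpha = 1.960
one_sided = (z.inv_cdf(1 - 0.05) + z_beta) ** 2       # z_alpha = 1.645

print(f"one-sided needs {1 - one_sided / two_sided:.0%} fewer samples")  # ~21%
```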
When you compare more than two variants against a control, you run multiple pairwise comparisons and the family-wise Type-I error rate inflates. A naive 5% alpha with three variants gives you roughly a 14% chance of seeing at least one false-positive winner. The standard fix is a Bonferroni correction: divide your overall alpha by the number of comparisons. This calculator applies it automatically: with a control plus three variants, your per-comparison alpha drops from 5% to ~1.67%, increasing the sample requirement substantially. For more than four variants, consider sequential testing or false-discovery-rate methods rather than Bonferroni, which becomes overly conservative.
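The 14% and 1.67% figures are quick to verify; a minimal sketch, assuming independent comparisons for the naive family-wise rate:

```python
variants = 3                  # challengers against one control
alpha = 0.05

fwer_naive = 1 - (1 - alpha) ** variants   # P(at least one false positive)
alpha_bonferroni = alpha / variants        # per-comparison alpha after correction

print(f"{fwer_naive:.1%}")         # ~14.3% family-wise error without correction
print(f"{alpha_bonferroni:.2%}")   # 1.67% per-comparison alpha with Bonferroni
```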
Run the test until you reach the calculated sample per variant, and ideally for at least one full business cycle (often one week, occasionally two) to absorb day-of-week effects. Never stop early because results "look significant". That is called peeking and inflates your false-positive rate dramatically. Never extend the test past the planned sample size hoping for significance. That is called fishing and is equally bad. Run the planned sample, stop, and report. If the sample requirement gives a duration over 8 weeks, redesign the test: chase a bigger MDE, accept lower power, or test a higher-traffic surface.
Sample size is a pre-test planning calculation: given your assumptions, how many visitors do you need? Statistical significance is a post-test result: given the data you actually observed, how unlikely was this result under the null hypothesis (no real effect)? They are flip sides of the same coin. The sample-size calculator answers "should I run this test?". The significance calculator answers "did this test prove something?". Run sample-size before launch, significance after the planned sample is hit. This page handles the pre-test side; pair it with the post-test significance calculator linked below.
Low-traffic indie products run into hard constraints: under roughly 200 conversions per variant, even a 30% relative lift can take months to detect at 80% power. You have three honest choices. First, increase your MDE: only test changes that would credibly produce a 25%+ relative lift (full redesigns, headline changes, pricing). Second, accept lower power (70% or even 60%). You will miss more real wins but you will at least finish the test. Third, switch to qualitative methods entirely: user interviews, session recordings, intent surveys. Do not run a 1,000-visitor A/B test and treat the result as decisive. It is not.
The sample-size formula has MDE squared in the denominator: n ∝ 1 / (p₂ − p₁)². Halving the absolute effect quarters the denominator, which quadruples n. This is the square-root scaling of statistical noise in action: the standard error of a proportion shrinks only as 1/√n, so detecting a smaller signal against the same noise floor needs disproportionately more data. It is also the single most underappreciated fact in A/B testing. If your team is debating "should we test for a 5% lift or a 10% lift?", choosing 5% costs 4x the visitors and 4x the test duration. Pick MDE based on what is realistically detectable in your traffic budget, then design test variants that aim for at least that effect.
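A quick check of the squared-MDE rule, under an assumed 5% baseline (the ratio lands slightly under 4 because the variance term also shrinks a little with the smaller target rate):

```python
from statistics import NormalDist

z = NormalDist()
k = (z.inv_cdf(0.975) + z.inv_cdf(0.80)) ** 2  # 80% power, 5% two-sided alpha
p1 = 0.05                                      # assumed baseline

def n_per_variant(d: float) -> float:
    """Required n for an absolute lift d over the baseline."""
    p2 = p1 + d
    return k * (p1 * (1 - p1) + p2 * (1 - p2)) / d ** 2

# Halving the absolute lift (+0.5pp -> +0.25pp) roughly quadruples n
print(n_per_variant(0.0025) / n_per_variant(0.005))  # ~3.9
```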
This calculator uses frequentist methods, specifically the two-proportion z-test, which is what platforms like Optimizely Web and Google Optimize historically used. Bayesian A/B testing has different math (posterior probabilities, credible intervals) and different stopping rules. Both are valid. Frequentist is dominant because it is older, has clean stopping rules, and is what most platforms default to. Bayesian shines when you want continuous decision-making (early stopping is honest) and when you can specify informative priors. For indie SaaS doing 1-2 tests at a time with no analytics team, frequentist is the right starting point because the stopping rules are mechanical and hard to abuse.
No. This calculator computes sample size for proportion-type metrics: conversion rate, click-through rate, signup-to-paid rate, anything where each visitor is binary (converted or not). For continuous metrics like average revenue per user, average session duration, or NPS, you need a t-test sample-size formula, which uses the standard deviation of the metric instead of p(1-p) variance. The math is different and the result will be different. If your primary KPI is revenue per visitor, test on conversion rate instead and treat revenue as a secondary read, or use a continuous-metric calculator.
Peeking means checking the test result before the planned sample is reached and deciding whether to stop or continue based on what you see. It feels harmless ("the variant is clearly winning, let me ship") but it dramatically inflates the false-positive rate. A pre-planned 5% alpha test that gets peeked at daily can have an actual false-positive rate of 15-30%. The fix is mechanical: pick a sample size, run to that sample, stop. If you must look mid-test for operational reasons (e.g. checking the test is correctly bucketing users), do not make a stop/continue decision based on the lift number. Sequential testing methods exist for legitimate early-stopping but they require their own math, not the standard z-test.
Use the most recent stable estimate of your control conversion rate from production data, not a rough guess. Pull the last 30 days of conversion data for the surface you will test on. Confirm the rate is stable (no recent product launches or marketing campaigns skewing it). Then plug that rate as baseline. If the baseline is wrong by even a few percentage points, the sample-size calculation can be off by 2-3x because p(1-p) variance scales nonlinearly. For new pages with no historical data, segment to a similar existing page or use a known industry benchmark, and revisit the sample size after the first week of traffic.
For typical conversion-rate testing at indie SaaS scale: 5,000 to 50,000 visitors per variant per test, at 80% power, 5% alpha, two-sided, baseline rates between 1-10% and MDE between 10-25% relative. Larger surfaces (e.g. checkout funnels with millions of users) routinely run tests at 100,000+ per variant for precision on small MDEs. Smaller indie surfaces (new pricing pages, niche landing pages) sometimes operate at 1,000-3,000 per variant but accept lower power or larger MDE. The right answer is whatever this calculator says given your baseline, MDE, and traffic.
Knowing your metrics
is the easy part.
Shipping is the hard part.
FoundStep is the project management tool that won't let indie devs procrastinate. Validate your idea in 7 questions. Lock your scope. Ship, or kill it.