A/B test sample size calculator · Pre-test planning

Plan an A/B test that can actually finish.

Calculate how many visitors per variant your A/B test needs before you flip the switch. Power, significance, and the minimum detectable effect are treated as first-class inputs so the result reflects what you actually committed to detecting. Two-proportion z-test math, Bonferroni for multi-variant, duration projection from your daily traffic.

Switch between relative MDE (detect a 10% lift) and absolute MDE (detect a 0.5 percentage-point lift). Read sample per variant, total sample across arms, and projected days to reach the stopping rule. Built so that the test you launch is the test you can actually finish.

Two-proportion z-test · Bonferroni multi-variant · Relative + absolute MDE · Duration projection
Inputs
Baseline conversion rate (%): Current control conversion rate. Pull from the last 30 days of production data, not a guess.
Minimum detectable effect (%): Detect a 10% relative lift, for example. Baseline 5% becomes 5.50%.
Statistical power (%): Default 80%. Probability of detecting a true effect.
Significance level α (%): Default 5%. False-positive tolerance.
Daily visitors per variant: Set to 0 to skip duration projection. Use total visitors split by variant.
Methodology + last reviewed

Formula: two-proportion z-test, large-sample normal approximation. z-criticals derived from Acklam (2003) inverse-normal CDF. Bonferroni applied automatically for variantCount > 2. Last reviewed 2026-05-11.

Results
Verdict: Slow / long

Sample size, locked.

n per variant = (z_α + z_β)² × (p₁(1-p₁) + p₂(1-p₂)) ÷ (p₂ - p₁)². Power answers "if there is a real lift this big, the test will detect it X% of the time." Sample size answers "what visitor count makes that promise true?"

Sample per variant: 31,231 (rounded up)
Total sample: 62,462 (across 2 variants)
Duration: 63 days (at 500/day per variant)
Baseline rate: 5.00% (p₁, control)
Treatment rate: 5.50% (p₂ at MDE, relative 10.00%)
Effective α: 5.00% (per-test)
Test setup detail

Critical values + MDE sensitivity

z_α (critical): 1.960
z_β (critical): 0.842
Test tail: Two-sided
Comparisons: 1
MDE sensitivity at this baseline (5.00%)
Sample / variant | Detectable absolute lift | Detectable relative lift
1,000 | +2.731pp | +54.61%
2,500 | +1.727pp | +34.54%
5,000 | +1.221pp | +24.42%
10,000 | +0.864pp | +17.27%
25,000 | +0.546pp | +10.92%
50,000 | +0.386pp | +7.72%
100,000 | +0.273pp | +5.46%
250,000 | +0.173pp | +3.45%
500,000 | +0.122pp | +2.44%

Reading the table: pick a sample-per-variant your traffic budget allows, then read across to see the smallest lift you could realistically detect at the chosen power and significance. Smaller MDE = quadratically more samples.
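The table above is generated by inverting the sample-size formula: fix n, solve for the detectable lift. A minimal Python sketch, assuming the usual large-sample shortcut of evaluating the variance at the baseline (2·p₁(1-p₁)); it reproduces the rows shown:

```python
from statistics import NormalDist

def detectable_lift(n_per_variant, baseline, power=0.80, alpha=0.05):
    """Smallest absolute lift detectable at a given per-variant sample size.

    Uses the large-sample approximation with variance evaluated at the
    baseline (2 * p * (1 - p)), the usual shortcut when the target rate
    is not yet known.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * (2 * baseline * (1 - baseline) / n_per_variant) ** 0.5

baseline = 0.05
for n in (1_000, 2_500, 5_000, 10_000, 25_000, 50_000, 100_000, 250_000, 500_000):
    lift = detectable_lift(n, baseline)
    print(f"{n:>9,}  +{lift * 100:.3f}pp  +{lift / baseline * 100:.2f}%")
```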

Planning workflow

Plan a real test in five steps.

Run this workflow before every A/B test goes live. Free, instant, browser-only. No login, no analytics seat, no signup. Sequence applies whether you are testing a landing page, an onboarding flow, or a pricing change.

01
Pull your baseline rate from real data
Open analytics and grab the last 30 days of control-arm conversions for the surface you will test on. Getting the baseline wrong by even a couple of percentage points can throw the sample size off by 2-3x, because both the p(1-p) variance term and, with a relative MDE, the absolute effect you are powering for shift with the baseline.
02
Pick a realistic MDE
Choose the smallest lift you actually need to detect to call this test a success. Indie SaaS landing tests rarely justify a sub-5% relative MDE. If you set the bar too tight, sample size explodes; too loose and your test will not move the needle anyway.
03
Set power and significance
Default to 80% power and 5% alpha for routine tests. Bump power to 90% or alpha to 1% when the decision is genuinely high stakes (pricing, branding, structural). Tightening both to 95% power and 1% alpha inflates the sample requirement by roughly 2.3x.
04
Add variants and traffic
Set total variants including control. Multi-variant tests trigger automatic Bonferroni correction so per-comparison alpha stays honest. Add daily visitors per variant to project test duration in days, not just total sample.
05
Commit to the sample before launch
Write the sample size into the test ticket as a hard stopping rule. Do not peek before reaching it; do not extend past it. The whole point of a sample-size calc is to prevent two failure modes: stopping early on noise, or fishing for significance after the planned end.
Math behind the plan

Six formulas behind the sample-size number.

Two-proportion z-test math, broken into six pieces. Once you can run them in your head, you can plan tests without a stats consult. Each piece is shown with the formula, an example calculation, and the planning intuition behind it.

01
Per-variant sample size
n = (z_α + z_β)² × (p₁(1-p₁) + p₂(1-p₂)) ÷ (p₂ - p₁)²
The two-proportion z-test sample-size formula. z_α uses α/2 for two-sided tests and α for one-sided. The denominator squared is why halving MDE quadruples sample. The variance term p(1-p) peaks at p=50%, so for the same percentage-point effect, tests near 50% baselines need more samples than tests near 1% or 99%; for the same relative lift, low baselines are the expensive ones, because the absolute effect shrinks with the baseline.
(1.96 + 0.842)² × (0.05·0.95 + 0.055·0.945) ÷ (0.005)² ≈ 31,240
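A minimal Python sketch of this formula (function name and defaults are illustrative, not the calculator's source). With full-precision z values it returns 31,231, the figure shown in the results panel above:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p1, p2, power=0.80, alpha=0.05, two_sided=True):
    """Two-proportion z-test sample size per variant (large-sample approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / (2 if two_sided else 1))
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline, 10% relative MDE -> p2 = 5.5%
print(sample_size_per_variant(0.05, 0.055))   # 31231
```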
02
z_α critical value
z_α = Φ⁻¹(1 − α / k)
Inverse standard-normal CDF at the alpha-side. k = 2 for two-sided tests, k = 1 for one-sided. The 5% two-sided value (1.96) is the most-cited number in A/B testing. One-sided 5% drops it to 1.645 and shaves ~20% off the sample requirement.
two-sided α = 5%: z_α = Φ⁻¹(0.975) = 1.960
03
z_β critical value (for power)
z_β = Φ⁻¹(power)
Inverse standard-normal CDF at the power. 80% power gives z_β = 0.842. Going to 95% power gives z_β = 1.645, which increases the sample by about two-thirds at a 5% two-sided alpha. Power is the probability the test correctly calls a real lift significant. Higher power means fewer false negatives at the cost of more visitors.
80% power: z_β = Φ⁻¹(0.80) = 0.842
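Both criticals are the same function evaluated at different points. The methodology note above cites Acklam's approximation for Φ⁻¹; Python's standard-library NormalDist.inv_cdf returns the same values, which is all a sanity check needs:

```python
from statistics import NormalDist

inv = NormalDist().inv_cdf            # Φ⁻¹, the inverse standard-normal CDF

print(round(inv(1 - 0.05 / 2), 3))    # 1.96  -> z_alpha, two-sided 5% alpha
print(round(inv(1 - 0.05), 3))        # 1.645 -> z_alpha, one-sided 5% alpha
print(round(inv(0.80), 3))            # 0.842 -> z_beta at 80% power
print(round(inv(0.95), 3))            # 1.645 -> z_beta at 95% power
```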
04
Treatment rate (p₂)
p₂ = p₁ × (1 + MDE_rel) or p₂ = p₁ + MDE_abs
Relative MDE: 10% on a 5% baseline gives p₂ = 5.5%. Absolute MDE: 0.5pp on a 5% baseline gives p₂ = 5.5%. Both produce the same sample for the same target rate. Choose the framing that matches how your team thinks about wins.
p₁ = 5%, MDE = 10% relative: p₂ = 5% × 1.10 = 5.5%
05
Bonferroni correction (multi-variant)
α_per_comparison = α / (k − 1)
When testing k variants (including control), you make k − 1 comparisons. Naive 5% alpha across three challengers gives ~14% chance of a false-positive winner. Bonferroni divides alpha across the comparisons to hold the family-wise error at the user-chosen rate. The trade-off is more samples per variant.
4 variants (3 comparisons), α = 5%: per-comparison α = 1.67%
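A sketch of how the correction propagates into sample size (illustrative Python, mirroring the four-variant scenario further down): the per-comparison alpha shrinks, z_α grows, and n per variant rises from roughly 31,200 to roughly 41,700.

```python
from math import ceil
from statistics import NormalDist

def bonferroni_alpha(alpha, variants):
    """Per-comparison alpha when each challenger is compared against control."""
    return alpha / (variants - 1)

alpha = bonferroni_alpha(0.05, variants=4)      # ~0.0167
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~2.394 (two-sided)
z_beta = NormalDist().inv_cdf(0.80)             # ~0.842
variance = 0.05 * 0.95 + 0.055 * 0.945          # 5% baseline, 10% relative MDE
print(ceil((z_alpha + z_beta) ** 2 * variance / 0.005 ** 2))   # ~41,660 per variant
```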
06
Test duration
days = ⌈n / daily_visitors_per_variant⌉
Once you know per-variant sample and daily traffic per variant, duration is straightforward division. Round up. Add at least one full business cycle (typically a week) to absorb day-of-week effects, even if the math says you would finish sooner.
n = 31,231 per variant, 2,000/day per variant: 16 days
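The six pieces chain into a planner you can run locally. A minimal sketch under the same assumptions as above (names and defaults are illustrative); for the scenario-1 inputs further down it reproduces ~31,200 per variant and a 16-day duration:

```python
from math import ceil
from statistics import NormalDist

def plan_test(baseline, mde_rel, power=0.80, alpha=0.05,
              variants=2, daily_per_variant=0, two_sided=True):
    """Per-variant n, total n, and projected days for a conversion-rate A/B test."""
    a = alpha / (variants - 1)                        # Bonferroni; no-op for 2 variants
    z_alpha = NormalDist().inv_cdf(1 - a / (2 if two_sided else 1))
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline * (1 + mde_rel)
    n = ceil((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    days = ceil(n / daily_per_variant) if daily_per_variant else None
    return n, n * variants, days

print(plan_test(0.05, 0.10, daily_per_variant=2000))  # (31231, 62462, 16)
```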
MDE framing

Relative vs absolute MDE, side by side.

"Detect a 10% lift" (relative) and "detect a 0.5pp lift" (absolute) describe the same test when the baseline is fixed. They diverge once baselines shift across tests. The framing you pick depends on whether your stakeholders think in percent or percentage points, but the sample requirement is identical.

Line item | Relative MDE framing | Absolute MDE framing | Why it matters
Baseline rate | 5% (typical SaaS signup) | 5% baseline | Same baseline across both framings.
MDE framing | +10% relative lift | +0.5pp absolute lift | Both target the same p₂ = 5.5%, so sample size is identical.
Target treatment rate (p₂) | 5.5% | 5.5% | Identical. The framing is descriptive, not mathematical.
Sample per variant (80% power, 5% α two-sided) | ~31,230 | ~31,230 | Same answer, two ways to ask the question.
Same target on a 2% baseline | +10% relative = 2.2% | +0.5pp absolute = 2.5% | Relative and absolute diverge when the baseline shifts. Relative scales with baseline; absolute does not.
When to prefer relative MDE | Comparing tests across different surfaces or baselines. | n/a | PMs and CRO teams usually think in relative terms.
When to prefer absolute MDE | n/a | Fixed business targets (e.g. "lift checkout from 3% to 4%"). | Finance and revenue planning usually think in pp.
Power impact (80% → 95%) | +66% sample (~51,700 / variant) | +66% sample (~51,700 / variant) | Same effect in both framings. Higher power = more samples.
Alpha impact (5% → 1%) | +49% sample (~46,500 / variant) | +49% sample (~46,500 / variant) | Tighter alpha = lower false-positive rate = more samples.
Halving MDE (10% → 5% relative) | ~4x sample (~122,000 / variant) | ~4x sample (~122,000 / variant) | The MDE-squared rule. Single biggest lever on sample requirement.
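The three lever rows above are easy to verify. A small sketch, assuming the same two-sided z-test as the rest of the page (the helper name is illustrative):

```python
from statistics import NormalDist

def n_per_variant(p1, p2, power, alpha):
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)               # two-sided
    return z_sum ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

base = n_per_variant(0.05, 0.055, 0.80, 0.05)         # ~31,230
print(n_per_variant(0.05, 0.055, 0.95, 0.05) / base)  # ~1.66: power 80% -> 95%
print(n_per_variant(0.05, 0.055, 0.80, 0.01) / base)  # ~1.49: alpha 5% -> 1%
print(n_per_variant(0.05, 0.0525, 0.80, 0.05) / base) # ~3.9:  MDE 10% -> 5% relative
```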
Quick read

Use relative MDE for routine CRO work. Use absolute MDE only when the business has set a fixed percentage-point target. Toggle this calculator into the framing your test ticket uses so the sample number lines up with how your team will report the result.

Planning scenarios

Six planning scenarios, run end-to-end.

High-traffic and low-traffic, single-variant and multi-variant, routine and high-stakes. Plug each set of inputs into the calculator above and watch how each lever changes the feasibility verdict.

High-traffic SaaS signup test

Optimizing the home-page signup conversion. Baseline conversion 5%, you want to detect a 10% relative lift (going from 5% to 5.5%). Standard 80% power, 5% alpha, two-sided. 2,000 visitors per variant per day.

Baseline 5% · MDE 10% relative · Power 80% · Alpha 5% · Two-sided · 2 variants · 2,000/day per variant
> Sample size: ~31,230 per variant · Total: ~62,460 · Duration: ~16 days. Doable if traffic is steady. Bonferroni N/A (single comparison).

The textbook A/B test. 10% relative MDE on a healthy traffic surface finishes inside 3 weeks. Run this template every time you have a meaningful funnel and clear hypothesis.

Low-traffic indie pricing test

Indie SaaS testing a new pricing page. Baseline checkout-completion 3%. Hoping for a 20% relative lift. 200 visitors per variant per day. 80% power, 5% alpha.

Baseline 3% · MDE 20% relative · Power 80% · Alpha 5% · Two-sided · 2 variants · 200/day per variant
> Sample size: ~13,910 per variant · Duration: ~70 days. Too long. The MDE is too small for the traffic budget.

Low-traffic founders should not target 20%-or-smaller relative lifts on 3% baselines. Either redesign for a 40-50% relative lift (bigger creative changes), or skip A/B testing and rely on qualitative validation.

Four-variant homepage test

Testing three new homepage variants against the control. Same baseline (5%) and MDE (10% relative) as scenario 1. But now you have 3 comparisons against control, so Bonferroni applies.

Baseline 5% · MDE 10% relative · Power 80% · Alpha 5% · Two-sided · 4 variants · 2,000/day per variant
> Per-comparison alpha drops from 5% to ~1.67% (Bonferroni). Sample size: ~41,700 per variant · Total: ~166,700 · Duration: ~21 days.

Adding variants is not free. Each extra variant tightens the per-comparison alpha, which raises sample size and total test duration. Four-variant tests are rarely worth it unless you genuinely believe all three challengers could win.

High-stakes pricing redesign

A complete pricing-page redesign. The decision is critical (you are committing to new pricing tiers in production). Tighter alpha (1%) and higher power (95%) to reduce both error types. Baseline 4%, MDE 15% relative.

Baseline 4% · MDE 15% relative · Power 95% · Alpha 1% · Two-sided · 2 variants · 1,500/day per variant
> Sample size: ~40,700 per variant · Duration: ~28 days. The tighter alpha (1%) and higher power (95%) multiply the sample by roughly 2.3x compared to the standard 5%/80% setup (~17,900 per variant).

Tightening alpha and power is the right move for high-stakes decisions but the cost is real: roughly 2.3x the sample and 2.3x the duration. Reserve for pricing, branding, and structural decisions you cannot easily undo.

Detecting a 5% relative lift on a checkout flow

Bigger product, established checkout. Baseline 8%. You want to detect a 5% relative lift (small but valuable at this volume). 80% power, 5% alpha. 10,000 visitors per variant per day.

Baseline 8% · MDE 5% relative · Power 80% · Alpha 5% · Two-sided · 2 variants · 10,000/day per variant
> Sample size: ~73,900 per variant · Duration: ~8 days. Halving MDE from 10% to 5% roughly quadrupled the sample from ~18,900 to ~73,900.

The squared-MDE rule in action. Small effects need disproportionately more samples. Only chase a 5% MDE if you have very high traffic or the business impact justifies the test cost.

One-sided safety check on a new feature

Launching an additive feature. You only care whether it harms conversion (the feature is shipping either way). One-sided test in the "loss" direction at 5% alpha, 80% power. Baseline 6%. Acceptable loss threshold: 10% relative drop.

Baseline 6% · MDE 10% relative · Power 80% · Alpha 5% · One-sided · 2 variants · 1,000/day per variant
> Sample size: ~18,500 per variant (about 20% smaller than the ~23,400 a two-sided test would need). Duration: ~19 days.

One-sided tests are 20% cheaper. Use only when a result in one direction is logically inconsequential (you are shipping anyway). For routine optimization, stick to two-sided.

Questions

Sample-size questions, answered.

Everything worth knowing about A/B test sample-size math, power, MDE, multi-variant corrections, and the difference between pre-test planning and post-test analysis.

Sample size is the number of visitors (or users, or sessions) each variant of an A/B test needs to see before the test has a high enough statistical power to reliably detect the smallest lift you care about. Run an undersized test and you have two failure modes: either it finishes "inconclusive" forever, or worse, it lands on a chance result and you ship a change that does nothing. Run an oversized test and you waste traffic, lengthen the test cycle, and slow product velocity. Calculating sample size up front means you commit to a stopping rule before peeking at results, which is the single most important habit for trustworthy testing.

For a two-proportion z-test (the standard for conversion-rate A/B tests): n per variant = (z_alpha + z_beta)² × (p₁(1-p₁) + p₂(1-p₂)) ÷ (p₂ - p₁)². Where p₁ is your baseline conversion rate, p₂ is the treatment rate after applying the minimum detectable effect, z_alpha is the critical value from the standard normal for your significance level (1.96 for a two-sided 5% alpha), and z_beta is the critical value for your statistical power (0.84 for 80% power). The square in the denominator is why halving the MDE quadruples the sample size required.

Statistical power is the probability that the test correctly detects a real lift when one exists. 80% power is the textbook industry default: if you ran the same test 100 times against a true lift of exactly the MDE size, 80 of those tests would correctly land on "significant" at your chosen alpha. Higher power (90% or 95%) cuts your false-negative rate but inflates sample size sharply: going from 80% to 95% adds roughly two-thirds more visitors at a 5% two-sided alpha. Use 80% for routine tests, 90% for important launches you cannot afford to call wrong, and 95% only for very high-stakes decisions where a missed positive is more costly than the larger test.

Alpha is the false-positive tolerance: the probability the test calls a win when nothing real changed. 5% alpha is the textbook default and by far the most common choice. It means: if you ran the same null-effect test 100 times, you would falsely call about 5 of them significant by chance. Tighter alphas (1% or 0.1%) cut false positives but inflate sample size. Looser alphas (10%) sometimes appear in exploratory tests but produce lots of fake wins. Unless you have a specific reason to deviate, use 5% alpha. The test math in this calculator treats your alpha as the per-test rate, then applies a Bonferroni correction internally when you specify more than two variants.

MDE is the smallest lift the test is powered to reliably detect. It is a planning input, not a result. Setting MDE = 10% relative says: "I want the test to find a 10% relative lift if one exists." A smaller MDE asks the test to spot a smaller signal, which requires more samples. MDE is the single biggest lever on sample size: cutting MDE in half quadruples the sample required, because the formula puts MDE squared in the denominator. Realistic MDE values for indie SaaS landing tests range from 5% to 25% relative. Anything below 5% relative on a normal conversion-rate test demands an unrealistic visitor count for most indie products.

Relative MDE is what most testing platforms and PMs talk about: "I want to detect a 10% lift", meaning a 10% relative change to the baseline (a 3% baseline becomes 3.3%). Absolute MDE is measured in percentage points: "I want to detect a 0.5pp lift", meaning a 3% baseline becomes 3.5%. Both produce the same sample size given the same target, just framed differently. Relative is more intuitive when comparing tests across different baselines. Absolute is more intuitive when you have a fixed business target (e.g. "we need to lift checkout from 3% to 4%"). This calculator supports both modes and shows the equivalent in the other framing.

A two-sided test tests for any difference (win OR loss). A one-sided test only tests for one direction (win only, or loss only). One-sided tests need ~20% fewer samples than two-sided to achieve the same power, but they cannot tell you if your variant actively hurt the metric, only whether it failed to win. Use two-sided tests for almost everything: routine A/B tests, landing-page redesigns, pricing changes, onboarding flows. Use one-sided only when a result in the wrong direction is logically impossible (e.g. a new feature that is strictly additive). The vast majority of indie A/B tests should be two-sided.

When you compare more than two variants against a control, you run multiple pairwise comparisons and the family-wise Type-I error rate inflates. A naive 5% alpha with three variants gives you roughly a 14% chance of seeing at least one false-positive winner. The standard fix is a Bonferroni correction: divide your overall alpha by the number of comparisons. This calculator applies it automatically: with a control plus three variants, your per-comparison alpha drops from 5% to ~1.67%, increasing the sample requirement substantially. For more than four variants, consider sequential testing or false-discovery-rate methods rather than Bonferroni, which becomes overly conservative.
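Bonferroni controls the family-wise error rate at planning time; the false-discovery-rate alternative mentioned above (Benjamini-Hochberg) is applied to the observed p-values after the test instead. A minimal sketch of the step-up procedure, with made-up p-values for illustration:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return which comparisons survive a false-discovery-rate threshold q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank                 # largest rank meeting the BH condition
    survives = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[i] = True
    return survives

print(benjamini_hochberg([0.003, 0.04, 0.20, 0.012]))    # [True, False, False, True]
```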

Run the test until you reach the calculated sample per variant, and ideally for at least one full business cycle (often one week, occasionally two) to absorb day-of-week effects. Never stop early because results "look significant". That is called peeking and inflates your false-positive rate dramatically. Never extend the test past the planned sample size hoping for significance. That is called fishing and is equally bad. Run the planned sample, stop, and report. If the sample requirement gives a duration over 8 weeks, redesign the test: chase a bigger MDE, accept lower power, or test a higher-traffic surface.

Sample size is a pre-test planning calculation: given your assumptions, how many visitors do you need? Statistical significance is a post-test result: given the data you actually observed, how unlikely was this result under the null hypothesis (no real effect)? They are flip sides of the same coin. The sample-size calculator answers "should I run this test?". The significance calculator answers "did this test prove something?". Run sample-size before launch, significance after the planned sample is hit. This page handles the pre-test side; pair it with the post-test significance calculator linked below.

Low-traffic indie products run into hard constraints: under roughly 200 conversions per variant, even a 30% relative lift can take months to detect at 80% power. You have three honest choices. First, increase your MDE: only test changes that would credibly produce a 25%+ relative lift (full redesigns, headline changes, pricing). Second, accept lower power (70% or even 60%). You will miss more real wins but you will at least finish the test. Third, switch to qualitative methods entirely: user interviews, session recordings, intent surveys. Do not run a 1,000-visitor A/B test and treat the result as decisive. It is not.

The sample-size formula has MDE squared in the denominator: n ∝ 1 / (p₂ - p₁)². Halving the absolute effect quarters the denominator, which quadruples n. This is the square-root law of sampling error in action: the noise floor shrinks only as 1/√n, so detecting a smaller signal against the same noise needs disproportionately more data. It is also the single most underappreciated fact in A/B testing. If your team is debating "should we test for a 5% lift or a 10% lift?", choosing 5% costs 4x the visitors and 4x the test duration. Pick MDE based on what is realistically detectable in your traffic budget, then design test variants that aim for at least that effect.

This calculator uses frequentist methods, specifically the two-proportion z-test, the classic fixed-horizon approach Optimizely used before its sequential Stats Engine (Google Optimize, by contrast, reported Bayesian probabilities). Bayesian A/B testing has different math (posterior probabilities, credible intervals) and different stopping rules. Both are valid. Frequentist remains the default in most A/B testing writeups because it is older and its fixed-horizon stopping rules are simple. Bayesian shines when you want continuous decision-making (early stopping can be handled honestly) and when you can specify informative priors. For indie SaaS doing 1-2 tests at a time with no analytics team, frequentist is the right starting point because the stopping rules are mechanical and hard to abuse.

No. This calculator computes sample size for proportion-type metrics: conversion rate, click-through rate, signup-to-paid rate, anything where each visitor is binary (converted or not). For continuous metrics like average revenue per user, average session duration, or NPS, you need a t-test sample-size formula, which uses the standard deviation of the metric instead of p(1-p) variance. The math is different and the result will be different. If your primary KPI is revenue per visitor, test on conversion rate instead and treat revenue as a secondary read, or use a continuous-metric calculator.
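For reference, the continuous-metric analogue swaps p(1-p) for the metric's variance. A sketch under the normal approximation (at these sample sizes the t and z criticals are practically identical); the sigma and target difference below are made-up numbers for illustration:

```python
from math import ceil
from statistics import NormalDist

def sample_size_continuous(sigma, mde_abs, power=0.80, alpha=0.05):
    """Per-variant n for comparing two means (normal approximation to the t-test).

    sigma:   standard deviation of the metric (e.g. revenue per visitor)
    mde_abs: smallest difference in means worth detecting, in the metric's units
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    return ceil(2 * sigma ** 2 * z_sum ** 2 / mde_abs ** 2)

# Example: revenue per visitor with sd = $12, detect a $0.50 difference in means
print(sample_size_continuous(sigma=12, mde_abs=0.50))   # ~9,000 per variant
```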

Peeking means checking the test result before the planned sample is reached and deciding whether to stop or continue based on what you see. It feels harmless ("the variant is clearly winning, let me ship") but it dramatically inflates the false-positive rate. A pre-planned 5% alpha test that gets peeked at daily can have an actual false-positive rate of 15-30%. The fix is mechanical: pick a sample size, run to that sample, stop. If you must look mid-test for operational reasons (e.g. checking the test is correctly bucketing users), do not make a stop/continue decision based on the lift number. Sequential testing methods exist for legitimate early-stopping but they require their own math, not the standard z-test.
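The inflation is easy to demonstrate with a short simulation. A minimal sketch with illustrative parameters (identical 5% true rates in both arms, 500 visitors per arm per day, one look per day for 40 days): even though nothing changed, stopping at the first "significant" daily look fires far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.05                      # same true rate in both arms: any "win" is a false positive
daily, days, z_crit = 500, 40, 1.96
experiments = 2_000

false_positives = 0
for _ in range(experiments):
    # cumulative conversions per arm, checked once per day ("peeking")
    a = np.cumsum(rng.binomial(daily, p, days))
    b = np.cumsum(rng.binomial(daily, p, days))
    n = daily * np.arange(1, days + 1)
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = np.abs(a / n - b / n) / se
    if np.any(z > z_crit):    # stop the moment any daily look crosses the threshold
        false_positives += 1

print(false_positives / experiments)   # typically ~0.2, not the nominal 0.05
```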

Use the most recent stable estimate of your control conversion rate from production data, not a rough guess. Pull the last 30 days of conversion data for the surface you will test on. Confirm the rate is stable (no recent product launches or marketing campaigns skewing it). Then plug that rate in as the baseline. If the baseline is wrong by even a couple of percentage points, the sample-size calculation can be off by 2-3x, because both the p(1-p) variance term and, with a relative MDE, the absolute effect you are powering for shift with the baseline. For new pages with no historical data, segment to a similar existing page or use a known industry benchmark, and revisit the sample size after the first week of traffic.

For typical conversion-rate testing at indie SaaS scale: 5,000 to 50,000 visitors per variant per test, at 80% power, 5% alpha, two-sided, baseline rates between 1-10% and MDE between 10-25% relative. Larger surfaces (e.g. checkout funnels with millions of users) routinely run tests at 100,000+ per variant for precision on small MDEs. Smaller indie surfaces (new pricing pages, niche landing pages) sometimes operate at 1,000-3,000 per variant but accept lower power or larger MDE. The right answer is whatever this calculator says given your baseline, MDE, and traffic.

Knowing your metrics
is the easy part.
Shipping is the hard part.

FoundStep is the project management tool that won't let indie devs procrastinate. Validate your idea in 7 questions. Lock your scope. Ship, or kill it.

Free trial available
Cancel anytime
No team required