---
name: ab-testing
description: Run email A/B tests with statistical rigor. Use when testing subject lines, content variants, send times, CTAs, or measuring experiment significance.
license: MIT
---
A/B Testing
Test email variations systematically to improve open rates, click rates, and conversions with statistical confidence.
When to use this skill
- Setting up your first email A/B test
- Open rates or click rates are flat and you want data-driven improvements
- Deciding between subject line variations, send times, or content approaches
- Determining if your test results are statistically significant or just noise
- Planning a testing program across campaigns or sequences
- Evaluating whether to use A/B testing, multivariate testing, or bandit algorithms
- Measuring the true incremental lift of your email program with holdout groups
Related skills
- email-copywriting - writing the actual content variations to test
- template-design - HTML template variations for layout and visual tests
- spam-filter-avoidance - ensure test variants don't accidentally trigger spam filters
- sender-reputation - monitor whether testing impacts your sending reputation
- email-sequences - testing within drip campaigns and automated sequences
What to test (in priority order)
Not all tests deliver equal value. Start with high-impact, easy-to-measure elements and work your way down.
Tier 1 - highest impact, test these first
| Element | What to vary | Primary metric | Why it matters |
|---|---|---|---|
| Subject line | Length, personalization, question vs statement, emoji, urgency | Open rate | The single biggest lever. A bad subject line means nobody sees anything else. |
| From name | Company name vs person name vs "Person at Company" | Open rate | Recipients decide to open based on who sent it as much as the subject. |
| Send time | Day of week, hour of day, timezone-adjusted vs fixed | Open rate | Same email sent at 6 AM vs 10 AM can see 20-40% open rate differences. |
Tier 2 - high impact, requires more setup
| Element | What to vary | Primary metric | Why it matters |
|---|---|---|---|
| CTA | Button text, color, placement, number of CTAs | Click rate | "Get started" vs "Start your free trial" can shift click rates by 10-30%. |
| Preview text | First 40-90 characters visible in inbox | Open rate | Often overlooked - many senders leave this as the default HTML boilerplate. |
| Content length | Short vs long, single-topic vs multi-topic | Click rate | Depends heavily on audience and email type. No universal "right" length. |
Tier 3 - incremental gains, test after you've optimized tiers 1-2
| Element | What to vary | Primary metric | Why it matters |
|---|---|---|---|
| Layout | Single column vs multi-column, image placement | Click rate | Visual hierarchy affects scanning behavior. |
| Personalization depth | Name only vs company vs role-specific content | Click rate, conversion | Diminishing returns - basic personalization matters most. |
| Tone | Formal vs casual, first person vs third person | Click rate, reply rate | Audience-dependent. B2B enterprise vs startup is a different world. |
Rule of thumb: If you're sending fewer than 50,000 emails per month, focus on tier 1. You probably don't have the volume to detect tier 3 differences.
Sample size and statistical significance
This is where most email A/B tests go wrong. People call winners based on gut feeling or tiny sample sizes.
Minimum sample sizes
The sample size you need depends on three things:
- Baseline rate - your current open/click rate
- Minimum detectable effect (MDE) - the smallest improvement worth detecting
- Statistical power - the probability of detecting a real effect (standard: 80%)
Here are practical minimums per variant for a 95% confidence level and 80% power:
| Baseline rate | MDE (relative) | Sample per variant | Total for 2 variants |
|---|---|---|---|
| 20% open rate | 20% (detect 24% vs 20%) | ~3,800 | ~7,600 |
| 20% open rate | 10% (detect 22% vs 20%) | ~15,000 | ~30,000 |
| 5% click rate | 20% (detect 6% vs 5%) | ~15,000 | ~30,000 |
| 5% click rate | 30% (detect 6.5% vs 5%) | ~6,700 | ~13,400 |
| 2% conversion | 50% (detect 3% vs 2%) | ~3,800 | ~7,600 |
Translation: If your open rate is 20% and you want to detect a 20% relative improvement (4 percentage point lift to 24%), you need about 3,800 recipients in each variant - roughly 7,600 total sends.
If you can only detect a 50%+ relative change, the test is probably not worth running. You'll only catch massive differences, and you won't learn anything about incremental improvements.
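To reproduce figures like these for your own baseline and MDE, you can compute the standard two-proportion sample size formula directly. The sketch below is illustrative Python, not part of this skill; the table above uses conservative, rounded figures, so an exact calculation under these parameters can come out lower for some rows.

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate recipients needed per variant for a two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)       # the rate you want to be able to detect
    z_alpha = norm.ppf(1 - alpha / 2)        # 1.96 for 95% confidence, two-sided
    z_beta = norm.ppf(power)                 # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Example: 2% baseline conversion, detect a 50% relative lift (2% -> 3%)
print(sample_size_per_variant(0.02, 0.50))   # ~3,800 per variant
```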
The two-proportion z-test
The standard significance test for email A/B testing is the two-proportion z-test. It compares two conversion rates and tells you whether the difference is statistically significant.
p1 = control conversions / control total
p2 = variant conversions / variant total
p_pool = (control conversions + variant conversions) / (control total + variant total)
standard_error = sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/variant_total))
z = (p2 - p1) / standard_error
A z-score above 1.96 (or below -1.96) means p < 0.05 - the result is significant at 95% confidence.
What 95% confidence actually means: If there were truly no difference between the variants, a gap this large would show up by chance less than 5% of the time. It does NOT mean there's a 95% chance the variant is better - that's a common misinterpretation.
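As a runnable illustration (a sketch, not code from any particular ESP), here is the same test applied to hypothetical counts from a 20% vs 24% open rate experiment at roughly the sample size from the table above:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(control_conv, control_n, variant_conv, variant_n):
    """Two-proportion z-test; returns the z-score and two-sided p-value."""
    p1 = control_conv / control_n
    p2 = variant_conv / variant_n
    p_pool = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
    z = (p2 - p1) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 760/3,800 opens (20%) vs 912/3,800 opens (24%)
z, p = two_proportion_z_test(760, 3800, 912, 3800)
print(round(z, 2), p)   # z ≈ 4.21, p well below 0.05 -> significant
```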
Confidence intervals matter more than p-values
A result can be "statistically significant" but practically meaningless. Always look at the confidence interval for the difference:
- CI: [+0.1%, +4.2%] - Significant, but the true lift might be as small as 0.1%. Probably not worth the effort to implement.
- CI: [+2.5%, +6.8%] - Significant, and even the low end is a meaningful improvement. Ship it.
- CI: [-0.3%, +3.1%] - NOT significant. The true effect could be negative. Don't call this a winner.
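A confidence interval for the difference between two rates uses the unpooled standard error. A minimal sketch, reusing the hypothetical counts from the z-test example above:

```python
from math import sqrt

def diff_confidence_interval(control_conv, control_n, variant_conv, variant_n, z_crit=1.96):
    """95% CI for (variant rate - control rate), unpooled standard error."""
    p1 = control_conv / control_n
    p2 = variant_conv / variant_n
    se = sqrt(p1 * (1 - p1) / control_n + p2 * (1 - p2) / variant_n)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

low, high = diff_confidence_interval(760, 3800, 912, 3800)
print(f"[{low:+.1%}, {high:+.1%}]")   # roughly [+2.1%, +5.9%]: significant and meaningful
```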
Test design and execution
Randomization and consistency
Good A/B tests require truly random, consistent assignment. A recipient who receives variant A should always be in variant A if they encounter the experiment again.
Hash-based deterministic assignment is the gold standard. Hash the experiment ID + recipient email to produce a stable bucket assignment:
bucket = SHA256(experimentId + ":" + contactEmail) -> normalize to [0, 1)
This approach:
- Guarantees the same recipient always gets the same variant
- Doesn't require storing assignments upfront (though logging them is still important)
- Works across distributed systems without coordination
- Supports weighted variants by dividing the [0, 1) range proportionally
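A minimal Python sketch of weighted, deterministic bucketing (the function name, email normalization, and 64-bit truncation are illustrative choices, not taken from any specific platform):

```python
import hashlib

def assign_variant(experiment_id, contact_email, weights):
    """Deterministically map a recipient to a weighted variant bucket."""
    key = f"{experiment_id}:{contact_email.strip().lower()}"    # normalize the email first
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    point = int(digest[:16], 16) / 16 ** 16                     # first 64 bits -> [0, 1)
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant                                              # guard against float rounding

# The same recipient and experiment always land in the same bucket:
assign_variant("subject-line-q1", "ana@example.com", {"A": 0.8, "B": 0.2})
```

Because the assignment is a pure function of the experiment ID and the recipient's email, any service that knows both values can recompute it without a shared lookup table.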
Random list splits in your ESP work for one-off campaigns, but break down for sequences or journeys where the same person should consistently see the same variant.
How long to run the test
Minimum: 48 hours. Email open behavior has strong day-of-week patterns. A test that runs only during Tuesday morning will miss the Thursday openers.
Recommended: 5-7 days. This captures a full weekly cycle and accounts for people who don't check email daily.
Maximum: 14 days. Beyond two weeks, external factors (seasonality, news events, list decay) start to contaminate your results.
Rules for when to stop:
- Don't peek and stop early. If you check results after 2 hours and see variant B winning by 30%, resist the urge to call it. Early results are extremely noisy. This is called the "peeking problem" and it inflates your false positive rate well above 5%.
- Pre-commit to your sample size. Calculate the required sample size before starting. Run until you reach it.
- Use a time-based cutoff as backup. If you haven't reached your sample size after 14 days, the test is inconclusive - not a win for whoever happens to be ahead.
Test one variable at a time
Change only one element per test. If you change the subject line AND the CTA AND the send time, and variant B wins, you have no idea which change caused the improvement. You can't apply what you learned.
Exception: multivariate testing (covered below) can test multiple variables simultaneously, but requires much larger sample sizes.
A/B testing vs multivariate testing
| Factor | A/B testing | Multivariate testing |
|---|---|---|
| Variables tested | 1 | 2+ simultaneously |
| Variants needed | 2-4 | Every combination (2x2=4, 2x3=6, 3x3=9...) |
| Sample size | Moderate (1,000+ per variant) | Large (1,000+ per combination) |
| What you learn | Which variant wins | Which combination wins AND which variables have the most impact |
| When to use | Most of the time | When you have high volume (100k+ sends) and want to understand variable interactions |
When multivariate testing makes sense
Only if ALL of these are true:
- You send 100,000+ emails per campaign (enough volume per combination)
- You suspect variables interact (e.g., a casual subject line works better with a casual CTA)
- You've already optimized individual variables through A/B tests
- You can set up and track all combinations reliably
For most email programs: stick with A/B tests. Run them sequentially. Subject line test in January, CTA test in February, send time test in March. You'll learn more from three clean A/B tests than one muddy multivariate test.
Bandit algorithms vs fixed-horizon tests
Traditional A/B tests run for a fixed duration, then you pick the winner and deploy. Bandit algorithms (multi-armed bandit, Thompson sampling) dynamically shift traffic toward the better-performing variant during the test.
When to use each
Use fixed-horizon A/B tests when:
- You need clean, defensible statistical results
- You're optimizing a template or strategy you'll reuse for months
- Learning is the priority (understanding WHY something works)
Use bandit algorithms when:
- You're sending a one-time campaign and want to maximize performance of that specific send
- Speed matters more than certainty
- The "explore" phase (testing suboptimal variants) has a real cost (e.g., revenue-critical transactional emails)
How bandit testing works for email
- Send the first 10-20% of the list split evenly across variants
- After initial results come in, shift more volume to the better-performing variant
- Continue adjusting allocation as more data arrives
- By the end, 70-80% of the list receives the winning variant
Tradeoff: You sacrifice statistical rigor for better aggregate performance. You may not know if variant B is truly better - but more people saw the better-performing option.
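This skill doesn't prescribe a specific algorithm; Thompson sampling is one common way to implement the batched reallocation described above. A minimal sketch with illustrative names:

```python
import random

def pick_variant(stats):
    """Thompson sampling: draw each variant's open rate from a Beta posterior
    and send this recipient the variant with the highest sampled rate."""
    draws = {
        name: random.betavariate(1 + s["opens"], 1 + s["sends"] - s["opens"])
        for name, s in stats.items()
    }
    return max(draws, key=draws.get)

# After the evenly split initial batch:
stats = {"A": {"sends": 1000, "opens": 200}, "B": {"sends": 1000, "opens": 240}}
next_batch = [pick_variant(stats) for _ in range(100)]   # mostly B, but A still gets some exploration
```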
Most ESPs that offer "auto-winner" selection are doing a basic version of this: send to a test portion, wait a fixed time, then send the winner to the remainder. This is better than nothing but is not a true bandit algorithm - it doesn't continuously adapt.
Holdout groups
A holdout group is a randomly selected subset of your audience that does NOT receive the email (or receives no email at all). Comparing the holdout against the recipients who did get the email measures the true incremental lift of your email program.
Why holdouts matter
A/B tests tell you which variant is better. Holdouts tell you whether sending email at all is better than not sending.
Without holdouts, you can't distinguish between:
- "Our welcome sequence drove 30% more activations" (real lift)
- "People who were going to activate anyway also happened to receive our welcome sequence" (selection bias)
How to implement holdouts
- Randomly select 5-10% of your eligible audience as the holdout group
- Suppress all email to the holdout group for the test period
- Compare conversion/revenue between the group that received email and the holdout
- Calculate incremental lift:
lift = (treatment_conversion_rate - holdout_conversion_rate) / holdout_conversion_rate
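A minimal sketch of that calculation with hypothetical numbers (a 10% holdout on a 100,000-person audience):

```python
def incremental_lift(treated_conv, treated_n, holdout_conv, holdout_n):
    """Relative lift of the treated group over the holdout."""
    treated_rate = treated_conv / treated_n
    holdout_rate = holdout_conv / holdout_n
    return (treated_rate - holdout_rate) / holdout_rate

# Hypothetical: 6.5% conversion among recipients vs 5% in the holdout
print(f"{incremental_lift(5850, 90000, 500, 10000):.0%}")   # 30% incremental lift
```

The same two-proportion z-test from the significance section applies here to check whether that lift is statistically distinguishable from zero.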
Holdout group sizing
| Audience size | Holdout % | Holdout size | Expected baseline conversion | Can detect lift of |
|---|---|---|---|---|
| 10,000 | 10% | 1,000 | 5% | ~50% relative |
| 50,000 | 10% | 5,000 | 5% | ~25% relative |
| 100,000 | 5% | 5,000 | 5% | ~25% relative |
| 500,000 | 5% | 25,000 | 5% | ~10% relative |
Larger audiences can use smaller holdout percentages (5%) because the absolute holdout size is still large enough.
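The "can detect lift of" column can be sanity-checked with an approximate power calculation for unequal group sizes. This sketch assumes 95% confidence and the normal approximation; it is illustrative, not the exact method behind the table:

```python
from math import sqrt
from scipy.stats import norm

def holdout_power(baseline, relative_lift, holdout_n, treated_n, z_alpha=1.96):
    """Approximate power to detect a given relative lift against a holdout."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = sqrt(p1 * (1 - p1) / holdout_n + p2 * (1 - p2) / treated_n)
    return norm.cdf((p2 - p1) / se - z_alpha)

# 10,000-person audience with a 10% holdout at a 5% baseline:
print(holdout_power(0.05, 0.50, 1000, 9000))   # ~0.9 power for a 50% relative lift
print(holdout_power(0.05, 0.25, 1000, 9000))   # ~0.4 power for a 25% lift - too small to detect reliably
```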
When to use holdouts
- Quarterly: Run a 2-4 week holdout on your main email programs to measure ongoing lift
- New sequences: Always run a holdout when launching a new email sequence to prove it works
- High-frequency sends: If you're sending daily or near-daily, holdouts reveal fatigue effects
Warning: Holdout results often show lower incrementality than you expect. An email program showing 200% ROI based on last-click attribution might show 30% incremental lift in a holdout test. That's normal - it means your email is capturing credit for conversions that would have happened anyway, plus generating real incremental value.
Metrics to optimize for
Choose your primary metric BEFORE running the test. Optimizing for multiple metrics simultaneously leads to cherry-picking results.
| Metric | When to optimize for it | Gotchas |
|---|---|---|
| Open rate | Subject line tests, from name tests, send time tests | Apple Mail Privacy Protection inflates opens by 30-60%. Unreliable as sole metric for Apple-heavy audiences. |
| Click rate | CTA tests, content tests, layout tests | More reliable than opens. Measures actual engagement. |
| Click-to-open rate (CTOR) | Content effectiveness independent of subject line | Combines the Apple MPP noise from opens with click data. Less useful than it was pre-2021. |
| Conversion rate | When you have clear downstream actions (signup, purchase) | Requires conversion tracking beyond the email. Longer attribution windows. |
| Revenue per email | E-commerce, when you can tie revenue to individual sends | Best metric for bottom-line impact but needs robust attribution. |
| Reply rate | Sales emails, cold outreach | Only relevant for emails that expect replies. |
| Unsubscribe rate | Safety metric - always monitor alongside your primary metric | A variant can win on clicks but lose subscribers. Check both. |
The Apple Mail Privacy Protection problem
Since iOS 15 (September 2021), Apple Mail pre-fetches images and tracking pixels for all emails, generating false "opens." This affects roughly 50-60% of consumer email audiences.
Impact on A/B testing:
- Open rate tests still work, but the signal is noisier
- You need larger sample sizes to detect real differences
- Consider using click rate as your primary metric if your audience skews Apple
- Never rely solely on open rate for cold or marketing email A/B tests
Testing programs (not just tests)
One-off tests are useful. A systematic testing program compounds learning.
Building a testing roadmap
Run tests in this order for maximum learning:
- Subject line framework (month 1-2) - Test 4-6 subject line approaches (question, number, personalized, curiosity, benefit, urgency). Find your top 2-3 frameworks.
- Send time optimization (month 2-3) - Test 3-4 send windows. This is audience-specific - there's no universal best time.
- CTA optimization (month 3-4) - Test button copy, placement, and number of CTAs.
- Content structure (month 4-5) - Test email length, format (text-heavy vs image-heavy), and content hierarchy.
- Personalization depth (month 5-6) - Test what level of personalization actually moves the needle.
Documenting and applying learnings
After each test, record:
- What you tested and why
- Sample size per variant
- Duration
- Results (with confidence intervals)
- Whether the result was statistically significant
- What you'll do differently going forward
Without documentation, you'll re-run the same tests or, worse, make changes that contradict what you've already learned.
Common mistakes
1. Calling a winner too early
The single most common mistake. After 200 sends, variant B has a 25% open rate vs variant A's 20%. "B wins!" No - with 200 sends, that 5-point difference is well within the margin of error. You need thousands of observations for open rate tests.
Fix: Calculate your required sample size before starting. Don't look at results until you've reached it.
2. Testing with too little volume
If your list is under 1,000 contacts, most A/B tests are statistically meaningless. You won't have enough data to distinguish a real effect from noise.
Fix: For small lists, skip formal A/B tests. Instead, make bigger, bolder changes between campaigns and observe trends over time. Or batch multiple campaigns together to accumulate sample size.
3. Testing too many variables at once
Changing the subject line, CTA, images, and send time simultaneously. When variant B wins, you don't know which change caused it.
Fix: One variable per test. Always.
4. Ignoring the "losing" variant's data
Variant A loses. You archive it. But variant A might have outperformed on a secondary metric (lower unsubscribes, higher reply rate) or performed better in a specific segment.
Fix: Analyze test results by segment (mobile vs desktop, new subscribers vs long-term, engagement level). A "loser" overall might be a winner for a subset.
5. Not accounting for Apple MPP in open rate tests
If 50% of your audience uses Apple Mail, your open rate data includes a large number of phantom opens. This dilutes real differences and makes tests harder to call.
Fix: Filter Apple Mail opens from your analysis if your ESP supports it, or use click rate as your primary metric.
6. Using "auto-winner" without understanding how it works
Most ESP "auto-winner" features send to a test subset (10-20%), wait a fixed time (often just 2-4 hours), and send the "winner" to the rest. Two hours is nowhere near enough time for reliable results.
Fix: If you use auto-winner, set the wait time to at least 24 hours. Better yet, set it to 48 hours. If your ESP doesn't allow a long enough wait, run the test manually.
7. Treating every campaign as a separate experiment
Testing "Sale ends today!" vs "Last chance - 24 hours left" is not a reusable learning. It's a one-off optimization.
Fix: Test frameworks and patterns, not specific copy. Test "urgency vs curiosity" as a subject line approach, then apply the winner to future campaigns with different specific copy.
8. Never running holdout tests
You're optimizing variant A vs B, but never asking "should we be sending this email at all?"
Fix: Run a holdout test on your main email programs at least once per quarter.
9. Ignoring send volume distribution across variants
If you send variant A to 1,000 people and variant B to 10,000 people, the test is not valid even if you set it up as 50/50. Technical issues (send failures, bounce spikes, ESP throttling) can create uneven distribution.
Fix: Always verify actual send counts per variant before analyzing results. If the split is more than 5% off from your target, investigate before drawing conclusions.
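A simple way to automate that check before analysis (the 5% tolerance mirrors the rule of thumb above; the names and the relative-tolerance interpretation are illustrative):

```python
def check_split(actual_counts, target_weights, tolerance=0.05):
    """Flag variants whose actual share of sends drifted more than `tolerance`
    (relative) from the intended split."""
    total = sum(actual_counts.values())
    problems = []
    for variant, target in target_weights.items():
        actual = actual_counts.get(variant, 0) / total
        if abs(actual - target) / target > tolerance:
            problems.append((variant, target, round(actual, 3)))
    return problems

# A 50/50 test where variant B was throttled by the ESP:
check_split({"A": 10000, "B": 8800}, {"A": 0.5, "B": 0.5})
# -> [('A', 0.5, 0.532), ('B', 0.5, 0.468)] - investigate before drawing conclusions
```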
10. P-hacking by choosing your metric after the test
Variant B didn't win on open rate, but it won on click-to-open rate! Let's call that the winner. This is cherry-picking and dramatically inflates false positives.
Fix: Declare your primary metric before the test starts. Secondary metrics are informational, not decision-making.
Platform implementation notes
Most email service providers (ESPs) have built-in A/B testing. When evaluating tools, look for:
- Deterministic assignment - Same recipient always gets same variant (hash-based, not random per-send)
- Weighted variant support - Ability to split traffic unevenly (e.g., 80/20 for risky changes)
- Holdout groups - Native support for suppressing a control group from all sends
- Statistical significance reporting - Confidence intervals and p-values, not just "winner" badges
- Configurable wait times - Auto-winner that lets you set 24-48 hour windows, not just 2-4 hours
- Segment-level results - Break down results by audience segment, not just aggregate
molted.email implements deterministic hash-based variant assignment with weighted buckets, holdout group support, and two-proportion z-test significance testing with 95% confidence intervals. Experiments are tied to journey steps, so variant assignment persists across a sequence rather than randomizing per-send.
References
- Evan Miller's Sample Size Calculator - The standard free tool for calculating required sample sizes
- Statsig A/B Test Calculator - Sample size and significance calculator
- Optimizely Sample Size Calculator - Another widely-used calculator
- CXL - 12 A/B Testing Mistakes - Common pitfalls with real examples
- Litmus - Email A/B Testing Guide - Email-specific testing best practices
- Braze - Multi-Armed Bandit vs A/B Testing - When to use adaptive algorithms
- Rejoiner - Measuring Email Lift with Holdout Tests - Holdout group methodology
- Apple Mail Privacy Protection FAQ - Impact on email tracking