83 points by Maro 9 months ago | 28 comments
vijayer 9 months ago
1. Tight targeting of your users in an AB test. This can be through proper exposure logging, or aiming at users down-funnel if you’re actually running a down-funnel experiment. If your new iOS and Android feature is going to be launched separately, then separate the experiments.
2. Making sure your experiment runs in 7-day increments. Averaging out weekly seasonality can be important in reducing variance but also ensures your results accurately predict the effect of a full rollout.
Everything mentioned in this article, including stratified sampling and CUPED, is available out of the box on Statsig. Disclaimer: I’m the founder, and this response was shared by our DS Lead.
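For anyone unfamiliar with CUPED, here's a rough sketch of the adjustment in a few lines of Python (illustrative only, with made-up numbers; not a description of Statsig's actual implementation):

    import numpy as np

    def cuped_adjust(metric, pre_metric):
        # Remove the part of the in-experiment metric explained by a
        # pre-experiment covariate; the mean is unchanged, the variance drops.
        theta = np.cov(pre_metric, metric)[0, 1] / np.var(pre_metric, ddof=1)
        return metric - theta * (pre_metric - pre_metric.mean())

    rng = np.random.default_rng(0)
    pre = rng.gamma(2.0, 10.0, size=10_000)            # revenue before the test
    post = 0.8 * pre + rng.normal(0, 5, size=10_000)   # revenue during the test
    adjusted = cuped_adjust(post, pre)
    print(post.var(), adjusted.var())                  # adjusted variance is much smaller

Both groups get the same adjustment, so the difference in means is unchanged in expectation while its standard error shrinks.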
wodenokoto 9 months ago
There are of course many seasonalities: day/night, weekly, monthly, yearly, so it can be difficult to decide how broad a window to collect data over. But I remember interviewing at a very large online retailer: they ran their A/B tests in an hour because they "would collect enough data points to be statistically significant", and that never sat right with me.
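To make that concrete with entirely made-up numbers: if the treatment effect itself varies over the day, a one-hour test estimates the effect in that hour, not the effect a full rollout would see, no matter how many data points it collects.

    import numpy as np

    rng = np.random.default_rng(1)
    hours = np.arange(24 * 7)                          # one week of hourly buckets
    daytime = (hours % 24 >= 9) & (hours % 24 <= 18)

    base = np.where(daytime, 0.07, 0.05)               # baseline conversion varies by hour
    lift = np.where(daytime, 0.01, 0.00)               # treatment only helps during the day

    n = 50_000                                         # users per hour per arm
    conv_a = rng.binomial(n, base) / n
    conv_b = rng.binomial(n, base + lift) / n

    print("lift in one daytime hour:", conv_b[10] - conv_a[10])
    print("lift over the whole week:", (conv_b - conv_a).mean())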
kqr 9 months ago
Note that outliers are often your most valuable data points[1]. I'd much rather stratify than cut them out.
By cutting them out you indeed get neater data, but it no longer represents the reality you are trying to model and learn from, and you run a large risk of drawing false conclusions.
chashmataklu 9 months ago
If you're a retailer or a gaming company, you probably care about your "whales" who'd get winsorized out. Depends on whether you're trying to move topline - or trying to move the "typical".
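For concreteness, winsorizing caps the tail at some percentile rather than dropping those users, so a whale's spend gets recorded as the cap. A rough sketch with invented numbers:

    import numpy as np

    def winsorize_upper(x, pct=99):
        # Clip everything above the given percentile to that percentile.
        cap = np.percentile(x, pct)
        return np.minimum(x, cap)

    rng = np.random.default_rng(2)
    revenue = rng.pareto(1.5, size=100_000) * 10   # heavy-tailed spend per user
    clipped = winsorize_upper(revenue)

    print(revenue.mean(), clipped.mean())          # the whales move the raw mean a lot
    print(revenue.max(), clipped.max())            # ...and get capped after winsorizing

If the whales are exactly what you're trying to move, that cap throws away the signal you care about.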
kqr 9 months ago
If this is an important difference, you should define the "typical" population prior to running the experiment.
If you take "typical" to mean "the users who didn't accidentally produce annoying data in this experiment" you will learn things that don't generalise because they only apply to an ill-defined fictional subsegment of your population that is impossible to recreate.
If you don't know up front how to recognise a "typical" user in the sense that matters to you, then that is the first experiment to run!
sunir 9 months ago
I had retargeting in a 24 month split by accident and found it didn’t matter after all the cost in the long term. We could bend the conversion curve but not change the people who would convert.
And yes, we did capture more revenue in the short term, but over the long term the cost of the ads netted it all to zero or less than zero. And yes, we turned off retargeting after conversion. The result was that customers who weren’t retargeted eventually bought anyway.
Has anyone else experienced the same?
kqr 9 months ago
I think this is very common. I talked to salespeople who claimed that customers on 2.0 are happier than those on 1.0, which they had determined by measuring satisfaction in the two groups and getting a statistically significant result.
What they didn't realise was that almost all of the customers on 2.0 had been those that willingly upgraded from 1.0. What sort of customer willingly upgrades? The most satisfied ones.
Again: they bent the curve, didn't change the people. I'm sure this type of confounding-by-self-selection is incredibly common.
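A quick made-up simulation of that effect: the upgrade does literally nothing here, yet the 2.0 group still measures as happier because satisfied customers are the ones who choose to upgrade.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    satisfaction = rng.normal(0, 1, n)              # baseline happiness, unaffected by 2.0
    p_upgrade = 1 / (1 + np.exp(-satisfaction))     # happier customers upgrade more often
    upgraded = rng.random(n) < p_upgrade            # self-selected "2.0 group"

    print("mean satisfaction on 2.0:", satisfaction[upgraded].mean())
    print("mean satisfaction on 1.0:", satisfaction[~upgraded].mean())
    # A large, "statistically significant" gap, entirely from who chose to upgrade.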
bdjsiqoocwk 9 months ago
Doesn't that just mean there's no difference? Why is that frustrating?
Does the frustration come from the expectation that any little variable might make a difference? Should I use red buttons or blue buttons? Maybe if the product is shit, the color of the buttons doesn't matter.
admax88qqq 9 months ago
This should really be on a poster in many offices.
tmoertel 9 months ago
> Stratification lowers variance by making sure that each sub-population is sampled according to its split in the overall population.
In common practice, the main way that stratification lowers variance is by computing a separate estimate for each sub-population and then computing an overall population estimate from the sub-population estimates. If the sub-populations are more uniform ("homogeneous") than is the overall population, the sub-populations will have smaller variances than the overall population, and a combination of the smaller variances will be smaller than the overall population's variance.
In short, you not only stratify the sample, but also correspondingly stratify the calculation of your wanted estimates.
The article does not seem to take advantage of the second part.
(P.S. This idea, taken to the limit, is what leads to importance sampling, where potentially every member of the population exists in its own stratum. Art Owen has a good introduction: https://artowen.su.domains/mc/Ch-var-is.pdf.)
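A minimal sketch of that second part, stratifying the calculation and not just the sample (strata and numbers invented):

    import numpy as np

    rng = np.random.default_rng(4)

    # Two sub-populations with very different metric levels:
    # 90% casual users, 10% power users.
    weights = np.array([0.9, 0.1])
    means = np.array([5.0, 50.0])
    sds = np.array([2.0, 10.0])

    # Sample each stratum according to its share of the population.
    n = 10_000
    samples = [rng.normal(m, s, int(w * n)) for w, m, s in zip(weights, means, sds)]

    # Naive estimate: pool everything into one mean and one standard error.
    pooled = np.concatenate(samples)
    naive_se = pooled.std(ddof=1) / np.sqrt(n)

    # Stratified estimate: per-stratum means combined by population weight,
    # so only the within-stratum variances reach the standard error.
    strat_mean = sum(w * s.mean() for w, s in zip(weights, samples))
    strat_se = np.sqrt(sum(w**2 * s.var(ddof=1) / len(s)
                           for w, s in zip(weights, samples)))

    print(pooled.mean(), naive_se)
    print(strat_mean, strat_se)   # same estimate in expectation, much smaller standard error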
ulf-77723 9 months ago
Some advertise with those things, but the big ones take it for granted. Usually, before a test is developed, the project manager will help raise critical questions about the test setup.
kqr 9 months ago
Statisticians have a lot of useful tricks for getting higher-quality data out of the same cost (i.e. sample size).
Another topic I want to learn properly is running multiple experiments in parallel in a systematic way to get faster results and be able to control for confounding. Fisher advocated for this as early as 1925, and I still think we're learning that lesson today in our field: sometimes the right strategy is not to try one thing at a time and keep everything else constant.
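As a toy illustration of what that can look like in practice, here's a 2x2 factorial analysis of two experiments run in parallel, using one regression with an interaction term (feature names and effect sizes invented):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    n = 20_000
    df = pd.DataFrame({
        "new_checkout": rng.integers(0, 2, n),   # assignment in experiment 1
        "new_ranking": rng.integers(0, 2, n),    # assignment in experiment 2
    })
    # Invented true effects: +0.5 from checkout, +0.3 from ranking, small interaction.
    df["spend"] = (10
                   + 0.5 * df.new_checkout
                   + 0.3 * df.new_ranking
                   + 0.2 * df.new_checkout * df.new_ranking
                   + rng.normal(0, 3, n))

    # Because assignments are independent, one model recovers both main effects
    # and tells us whether the two features interfere with each other.
    model = smf.ols("spend ~ new_checkout * new_ranking", data=df).fit()
    print(model.params)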
authorfly 9 months ago
I just feel intuitively that it's masking the variance by converting it into within-subjects variance arbitrarily.
Here's my layman-ish interpretation:
P-values are easier to obtain when the variance is reduced. But we established P-values and the 0.05 threshold before these techniques existed. With the new techniques reducing the SD that P-values directly depend on, you need to counteract that reduction with a harsher P-value threshold in order to obtain the same number of true positive experiments as when P-values were originally proposed. In other words, allowing more experiments to have less variance in group tests, and so come out statistically significant whenever there is an effect, is not necessarily advantageous. Especially if we consider the purpose of statistics and AB testing to be rejecting the null hypothesis, rather than showing significant effect sizes.
kqr 9 months ago
We can imagine two versions of this test. In both, we serve 12 cups of tea, six of which have had milk added first.
In one of the experiments, we keep everything else the same: same quantities of milk and tea, same steeping time, same type of tea, same source of water, etc.
In the other experiment, we randomly vary quantities of milk and tea, steeping time, type of tea etc.
Both of these experiments are valid, both have the same 5 % risk of false positives (given by the null hypothesis that any judgment by the Lady is a coinflip). But you can probably intuit that in one of the experiments, the Lady has a greater chance of proving her acumen, because there are fewer distractions. Maybe she is able to discern milk-first-or-last by taste, but this gets muddled up by all the variations in the second experiment. In other words, the cleaner experiment is more sensitive, but it is not at a greater risk of false positives.
The same can be said of sample unit engineering: it makes experiments more sensitive (i.e. we can detect a finer signal for the same cost) without increasing the risk of false positives (which is fixed by the type of test we run.)
----
Sometimes we only care about detecting a large effect, and a small effect is clinically insignificant. Maybe we are only impressed by the Lady if she can discern despite distractions of many variations. Then removing distractions is a mistake. But traditional hypothesis tests of that kind are designed from the perspective of "any signal, however small, is meaningful."
(I think this is even a requirement for using frequentist methods. They need an exact null hypothesis to compute probabilities from.)
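For concreteness, the false positive risk in the tea setup above is pure counting under the null, regardless of how messy the cups are:

    from math import comb

    cups, milk_first = 12, 6
    total = comb(cups, milk_first)   # ways to pick 6 "milk first" cups out of 12

    def p_at_least(k):
        # P(at least k of her 6 picks are truly milk-first) under pure guessing;
        # the count of correct picks is hypergeometric.
        return sum(comb(milk_first, i) * comb(cups - milk_first, milk_first - i)
                   for i in range(k, milk_first + 1)) / total

    print(p_at_least(6))   # all cups classified correctly: 1/924, about 0.001
    print(p_at_least(5))   # at most one pair swapped: about 0.040, still under 5%

Nothing about the second, messier experiment changes these numbers; it only changes how likely the Lady is to land in the rejection region when she really can tell the difference.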