83 points by Maro 9 months ago | 28 comments
vijayer 9 months ago
1. Tight targeting of your users in an AB test. This can be through proper exposure logging, or aiming at users down-funnel if you’re actually running a down-funnel experiment. If your new iOS and Android feature is going to be launched separately, then separate the experiments.
2. Making sure your experiment runs in 7-day increments. Averaging out weekly seasonality can be important in reducing variance but also ensures your results accurately predict the effect of a full rollout.
Everything mentioned in this article, including stratified sampling and CUPED, is available out of the box on Statsig. Disclaimer: I’m the founder, and this response was shared by our DS Lead.
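For anyone unfamiliar with CUPED, here's a rough sketch of the adjustment in a few lines of Python (illustrative only, with made-up numbers; not a description of Statsig's actual implementation):

    import numpy as np

    def cuped_adjust(metric, pre_metric):
        # Remove the part of the in-experiment metric explained by a
        # pre-experiment covariate; the mean is unchanged, the variance drops.
        theta = np.cov(pre_metric, metric)[0, 1] / np.var(pre_metric, ddof=1)
        return metric - theta * (pre_metric - pre_metric.mean())

    rng = np.random.default_rng(0)
    pre = rng.gamma(2.0, 10.0, size=10_000)            # revenue before the test
    post = 0.8 * pre + rng.normal(0, 5, size=10_000)   # revenue during the test
    adjusted = cuped_adjust(post, pre)
    print(post.var(), adjusted.var())                  # adjusted variance is much smaller

Both groups get the same adjustment, so the difference in means is unchanged in expectation while its standard error shrinks.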
wodenokoto 9 months ago
There are of course many seasonalities: day/night, weekly, monthly, yearly, so it can be difficult to decide how broad a window to collect data over. But I remember interviewing at a very large online retailer: they ran their A/B tests in an hour because they "would collect enough data points to be statistically significant", and that never sat right with me.
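To make that concrete with entirely made-up numbers: if the treatment effect itself varies over the day, a one-hour test estimates the effect in that hour, not the effect a full rollout would see, no matter how many data points it collects.

    import numpy as np

    rng = np.random.default_rng(1)
    hours = np.arange(24 * 7)                          # one week of hourly buckets
    daytime = (hours % 24 >= 9) & (hours % 24 <= 18)

    base = np.where(daytime, 0.07, 0.05)               # baseline conversion varies by hour
    lift = np.where(daytime, 0.01, 0.00)               # treatment only helps during the day

    n = 50_000                                         # users per hour per arm
    conv_a = rng.binomial(n, base) / n
    conv_b = rng.binomial(n, base + lift) / n

    print("lift in one daytime hour:", conv_b[10] - conv_a[10])
    print("lift over the whole week:", (conv_b - conv_a).mean())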
kqr 9 months ago
Note that outliers are often your most valuable data points[1]. I'd much rather stratify than cut them out.
By cutting them out you indeed get neater data, but it no longer represents the reality you are trying to model and learn from, and you run a large risk of drawing false conclusions.
chashmataklu 9 months ago
If you're a retailer or a gaming company, you probably care about your "whales" who'd get winsorized out. Depends on whether you're trying to move topline - or trying to move the "typical".
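For concreteness, winsorizing caps the tail at some percentile rather than dropping those users, so a whale's spend gets recorded as the cap. A rough sketch with invented numbers:

    import numpy as np

    def winsorize_upper(x, pct=99):
        # Clip everything above the given percentile to that percentile.
        cap = np.percentile(x, pct)
        return np.minimum(x, cap)

    rng = np.random.default_rng(2)
    revenue = rng.pareto(1.5, size=100_000) * 10   # heavy-tailed spend per user
    clipped = winsorize_upper(revenue)

    print(revenue.mean(), clipped.mean())          # the whales move the raw mean a lot
    print(revenue.max(), clipped.max())            # ...and get capped after winsorizing

If the whales are exactly what you're trying to move, that cap throws away the signal you care about.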
kqr 9 months ago
If this is an important difference, you should define the "typical" population prior to running the experiment.
If you take "typical" to mean "the users who didn't accidentally produce annoying data in this experiment" you will learn things that don't generalise because they only apply to an ill-defined fictional subsegment of your population that is impossible to recreate.
If you don't know up front how to recognise a "typical" user in the sense that matters to you, then that is the first experiment to run!
sunir 9 months ago
I had retargeting in a 24 month split by accident and found it didn’t matter after all the cost in the long term. We could bend the conversion curve but not change the people who would convert.
And yes, we did capture more revenue in the short term, but over the long term the cost of the ads netted it all to zero or less than zero. And yes, we turned off retargeting after conversion. The result was that customers who weren’t retargeted eventually bought anyway.
Has anyone else experienced the same?
kqr 9 months ago
I think this is very common. I talked to salespeople who claimed that customers on 2.0 are happier than those on 1.0, which they had determined by measuring satisfaction in the two groups and getting a statistically significant result.
What they didn't realise was that almost all of the customers on 2.0 had been those that willingly upgraded from 1.0. What sort of customer willingly upgrades? The most satisfied ones.
Again: they bent the curve, didn't change the people. I'm sure this type of confounding-by-self-selection is incredibly common.
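A quick made-up simulation of that effect: the upgrade does literally nothing here, yet the 2.0 group still measures as happier because satisfied customers are the ones who choose to upgrade.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    satisfaction = rng.normal(0, 1, n)              # baseline happiness, unaffected by 2.0
    p_upgrade = 1 / (1 + np.exp(-satisfaction))     # happier customers upgrade more often
    upgraded = rng.random(n) < p_upgrade            # self-selected "2.0 group"

    print("mean satisfaction on 2.0:", satisfaction[upgraded].mean())
    print("mean satisfaction on 1.0:", satisfaction[~upgraded].mean())
    # A large, "statistically significant" gap, entirely from who chose to upgrade.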
bdjsiqoocwk 9 months ago
Doesn't that just mean there's no difference? Why is that frustrating?
Does the frustration come from the expectation that any little variable might make a difference? Should I use red buttons or blue buttons? Maybe if the product is shit, the color of the buttons doesn't matter.
admax88qqq 9 months ago
This should really be on a poster in many offices.
tmoertel 9 months ago
> Stratification lowers variance by making sure that each sub-population is sampled according to its split in the overall population.
In common practice, the main way that stratification lowers variance is by computing a separate estimate for each sub-population and then computing an overall population estimate from the sub-population estimates. If the sub-populations are more uniform ("homogeneous") than is the overall population, the sub-populations will have smaller variances than the overall population, and a combination of the smaller variances will be smaller than the overall population's variance.
In short, you not only stratify the sample, but also correspondingly stratify the calculation of your wanted estimates.
The article does not seem to take advantage of the second part.
(P.S. This idea, taken to the limit, is what leads to importance sampling, where potentially every member of the population exists in its own stratum. Art Owen has a good introduction: https://artowen.su.domains/mc/Ch-var-is.pdf.)
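A minimal sketch of that second part, stratifying the calculation and not just the sample (strata and numbers invented):

    import numpy as np

    rng = np.random.default_rng(4)

    # Two sub-populations with very different metric levels:
    # 90% casual users, 10% power users.
    weights = np.array([0.9, 0.1])
    means = np.array([5.0, 50.0])
    sds = np.array([2.0, 10.0])

    # Sample each stratum according to its share of the population.
    n = 10_000
    samples = [rng.normal(m, s, int(w * n)) for w, m, s in zip(weights, means, sds)]

    # Naive estimate: pool everything into one mean and one standard error.
    pooled = np.concatenate(samples)
    naive_se = pooled.std(ddof=1) / np.sqrt(n)

    # Stratified estimate: per-stratum means combined by population weight,
    # so only the within-stratum variances reach the standard error.
    strat_mean = sum(w * s.mean() for w, s in zip(weights, samples))
    strat_se = np.sqrt(sum(w**2 * s.var(ddof=1) / len(s)
                           for w, s in zip(weights, samples)))

    print(pooled.mean(), naive_se)
    print(strat_mean, strat_se)   # same estimate in expectation, much smaller standard error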
ulf-77723 9 months ago
Some advertise with those things, but the big ones take it for granted. Usually, before a test is developed, the project manager will help raise critical questions about the test setup.
kqr 9 months ago
Statisticians have a lot of useful tricks for getting higher-quality data out of the same cost (i.e. sample size).
Another topic I want to learn properly is running multiple experiments in parallel in a systematic way to get faster results and be able to control for confounding. Fisher advocated for this as early as 1925, and I still think we're learning that lesson today in our field: sometimes the right strategy is not to try one thing at a time and keep everything else constant.
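As a toy illustration of what that can look like in practice, here's a 2x2 factorial analysis of two experiments run in parallel, using one regression with an interaction term (feature names and effect sizes invented):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    n = 20_000
    df = pd.DataFrame({
        "new_checkout": rng.integers(0, 2, n),   # assignment in experiment 1
        "new_ranking": rng.integers(0, 2, n),    # assignment in experiment 2
    })
    # Invented true effects: +0.5 from checkout, +0.3 from ranking, small interaction.
    df["spend"] = (10
                   + 0.5 * df.new_checkout
                   + 0.3 * df.new_ranking
                   + 0.2 * df.new_checkout * df.new_ranking
                   + rng.normal(0, 3, n))

    # Because assignments are independent, one model recovers both main effects
    # and tells us whether the two features interfere with each other.
    model = smf.ols("spend ~ new_checkout * new_ranking", data=df).fit()
    print(model.params)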
authorfly 9 months ago
I just feel intuitively that it's masking the variance by converting it into within-subjects variance arbitrarily.
Here's my layman-ish interpretation:
P-values are easier to obtain when the variance is reduced. But we established P-values and the 0.05 threshold before these techniques existed. With the new techniques reducing the SD that P-values directly depend on, you need to counteract that reduction with a harsher P-value threshold in order to obtain the same number of true positive experiments as when P-values were originally proposed. In other words, allowing more experiments to have less variance in group tests, and so come out statistically significant whenever there is an effect, is not necessarily advantageous. Especially if we consider the purpose of statistics and AB testing to be rejecting the null hypothesis, rather than showing significant effect sizes.
kqr 9 months ago
We can imagine two versions of this test. In both, we serve 12 cups of tea, six of which have had milk added first.
In one of the experiments, we keep everything else the same: same quantities of milk and tea, same steeping time, same type of tea, same source of water, etc.
In the other experiment, we randomly vary quantities of milk and tea, steeping time, type of tea etc.
Both of these experiments are valid, both have the same 5 % risk of false positives (given by the null hypothesis that any judgment by the Lady is a coinflip). But you can probably intuit that in one of the experiments, the Lady has a greater chance of proving her acumen, because there are fewer distractions. Maybe she is able to discern milk-first-or-last by taste, but this gets muddled up by all the variations in the second experiment. In other words, the cleaner experiment is more sensitive, but it is not at a greater risk of false positives.
The same can be said of sample unit engineering: it makes experiments more sensitive (i.e. we can detect a finer signal for the same cost) without increasing the risk of false positives (which is fixed by the type of test we run.)
----
Sometimes we only care about detecting a large effect, and a small effect is clinically insignificant. Maybe we are only impressed by the Lady if she can discern despite distractions of many variations. Then removing distractions is a mistake. But traditional hypothesis tests of that kind are designed from the perspective of "any signal, however small, is meaningful."
(I think this is even a requirement for using frequentist methods. They need an exact null hypothesis to compute probabilities from.)
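For concreteness, the false positive risk in the tea setup above is pure counting under the null, regardless of how messy the cups are:

    from math import comb

    cups, milk_first = 12, 6
    total = comb(cups, milk_first)   # ways to pick 6 "milk first" cups out of 12

    def p_at_least(k):
        # P(at least k of her 6 picks are truly milk-first) under pure guessing;
        # the count of correct picks is hypergeometric.
        return sum(comb(milk_first, i) * comb(cups - milk_first, milk_first - i)
                   for i in range(k, milk_first + 1)) / total

    print(p_at_least(6))   # all cups classified correctly: 1/924, about 0.001
    print(p_at_least(5))   # at most one pair swapped: about 0.040, still under 5%

Nothing about the second, messier experiment changes these numbers; it only changes how likely the Lady is to land in the rejection region when she really can tell the difference.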