The notifications examples make me wonder what fundamental mistakes they are making.
People respond to change. If you A/B test, say, a new email headline, the change usually wins. Even if it isn't better. Just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.
If you don't understand downsides like this, then A/B testing is going to have a lot of pitfalls that you won't even know that you fell into.
First, it seems impractical because you need high user growth to get enough samples to run your experiment to statistical significance in a reasonable time. And if you are seeing high growth, how many of those accounts are real people rather than bots or duplicate accounts? Maybe that isn’t too big of a deal for an email campaign, but for apps and websites it matters, because of the next reason.
Secondly, there’s no reason to believe that new users behave like old users; in fact, there are probably many reasons to believe that they do not. Because your A/B population isn’t representative of the actual population of users, there’s no reason to believe that whatever effect you measured on the neophyte users will carry over to the experienced users. Maybe it will, but it’s not guaranteed.
Random sampling of users is better, because the A/B population is drawn from the general population, and so the only confounding property is the novelty effect. Doing what you suggest eliminates the novelty effect, but now you can’t generalize your learnings at all.
I've seen a lot of really sophisticated data pipelines and testing frameworks at a lot of shops. I've seen precious few who were able to make well-considered product decisions based on the data.
I’m pretty skeptical of this. I’ve run a lot of ML based A/B tests over my career. I’ve talked to a lot of people that have also run ML A/B tests over their careers. And the one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics.
Seriously. A/B tests are kind of a crap shoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.
I’ve seen positive offline models perform flat. I’ve seen negative offline metrics perform positively. There’s just a lot of variance between offline and online performance.
Just run the test. Lower the friction for running the tests, and just run them. It’s the only way to be sure.
Maybe I am too cynical, but what we are really talking about here is causal inference on observational data, based on more or less structural statistical models.
Any researcher will tell you: this is really hard. It is more than an engineering problem. You need to know not only how to deal with problems, but also what problems may arise and what you can actually identify. Most importantly, you need to figure out what you cannot identify.
There are, at least here in academia, only a limited set of people who are really good at this.
Long story short: even if offline analysis is viable, I doubt every team has the right people for it, making it potentially not worthwhile.
It is infinitely easier to produce a statistical analysis that looks good but isn’t than one that is good. An overwhelming number of useless offline models would, statistically speaking, be expected ;)
Between the emojis in the headings and the 2009 era memes, this was a bit of a cringy read. Also, the author seems to avoid at all costs going in depth about the actual implementation of OPE and I still don't quite understand how I would go about implementing it. Machine learning based on past A/B tests that finds similarities between the UI changes???
My biggest question is: where do you get user data to run the simulation? Take the simple push example: if to date you’ve only sent pushes on day 1 and you want to explore days 2, 3, 4, 5, etc., where does that user response data come from? It seems like you need to get the data first, and then you can simulate various permutations of a policy. But then why not just run a multi-armed bandit?
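For what it's worth, the bandit alternative is simple to sketch. Here is a minimal Thompson-sampling loop over hypothetical push-timing arms (the arm names, counts, and binary-reward model are made up for illustration, not from the article):

```python
import random

def thompson_pick(arms):
    """Pick the arm with the highest draw from its Beta posterior.
    `arms` maps arm name -> [successes, failures]; Beta(s+1, f+1)
    corresponds to a uniform prior on each arm's conversion rate."""
    best, best_draw = None, -1.0
    for name, (s, f) in arms.items():
        draw = random.betavariate(s + 1, f + 1)
        if draw > best_draw:
            best, best_draw = name, draw
    return best

# Hypothetical push-timing arms with counts observed so far.
arms = {"day_1": [30, 70], "day_2": [5, 20], "day_3": [2, 25]}
chosen = thompson_pick(arms)
arms[chosen][0] += 1  # record a success (increment index 1 for a failure)
```

The appeal over a fixed A/B split is that the bandit gathers exactly the exploratory data the question asks about while it optimizes, instead of requiring the data up front.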
The challenge I've seen is combining good, small-scale Human-Centered Design research (watching people work, for instance) with good, large-scale testing. It can be really hard to learn the "why" from A/B tests otherwise.
When I was the A/B test guy for Travelocity I was fortunate to have an excellent team. The largest bias we discovered is that our tests were executed with amazing precision and durability. My dedicated QA was the shining star that made that happen. Unfortunately, when the resulting feature entered the site in production as a released feature, there was always some defect, or some conflict, or some oversight. The actual business results would then underperform compared to the team’s analyzed prediction.
What is your advice, or more details on the types of challenges you came across and how you handled this discrepancy? I would imagine the data shifts a bit, and that standard assumptions don’t hold up around how the difference you were measuring between your “A” and your “B” remain fixed after the testing period.
The biggest technical challenge we came across is that when I had to hire my replacement we couldn’t find competent developer talent. Everybody NEEDED everything to be jQuery and querySelectors but those were far too slow and buggy (jQuery was buggy cross browser). A good A/B test must not look like a test. It has to feel and look like the real thing. Some of our tests were extremely complex spanning multiple pages performing varieties of interactions. We couldn’t be dicking around with basic code literacy and fumbling through entry level beginner defects.
I was the team developer and not the team analyst so I cannot speak to business assumption variance. The business didn’t seem to care about this since the release cycle is slow and defects were common. They were more concerned with the inverse proportions of cheap tests bringing stellar business wins.
At my previous job when it was relevant, I wrote something in house.
A higher-order component would pick the variant at runtime (because we had problems with SSR, which would otherwise have been the more appropriate place). It cached the picks in cookies.
In-house charting and probability calculations determined P(X>Y) for each experiment pair. Then we'd just manually prune variants occasionally (since the bad ones weren't being displayed, timeliness didn't much matter), and periodically re-introduce old variants by hand if we thought it was worth it.
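For readers wondering what computing P(X>Y) looks like in practice: with binary conversions, one common approach is Monte Carlo sampling from each variant's Beta posterior. A rough sketch (this is not the parent's actual code; the uniform priors and draw count are my assumptions):

```python
import random

def prob_x_beats_y(succ_x, fail_x, succ_y, fail_y, draws=100_000):
    """Estimate P(X > Y) for two conversion rates by Monte Carlo:
    draw from each variant's Beta(successes+1, failures+1) posterior
    and count how often X's draw exceeds Y's."""
    wins = 0
    for _ in range(draws):
        x = random.betavariate(succ_x + 1, fail_x + 1)
        y = random.betavariate(succ_y + 1, fail_y + 1)
        if x > y:
            wins += 1
    return wins / draws
```

A variant with 120 conversions out of 1000 versus one with 100 out of 1000 would then get a P(X>Y) somewhere well above 0.5 but short of certainty, which is exactly the kind of number you'd prune on.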
>Now you might be thinking OPE is only useful if you have Facebook-level quantities of data. Luckily that’s not true. If you have enough data to A/B test policies with statistical significance, you probably have more than enough data to evaluate them offline.
Isn't there a multiple comparisons problem here? If you have just enough data to do a single A/B test, how can you do a hundred historical comparisons and still keep the same p-value?
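For context on why this matters: the standard family-wise corrections make the per-comparison threshold much stricter as the number of comparisons grows. A quick sketch of the two textbook formulas:

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold that keeps the family-wise
    error rate at `alpha` across `m` comparisons (conservative)."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Šidák correction: exact under independence, slightly less
    conservative than Bonferroni for m > 1."""
    return 1 - (1 - alpha) ** (1 / m)
```

With alpha = 0.05 and 100 historical comparisons, Bonferroni requires p < 0.0005 on each one, so each comparison needs far more data than a single A/B test at p < 0.05 would.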
Recently I've heard booking.com given as an example of a company that runs a lot of A/B tests. Anyone from Booking reading this? How does it look from the inside? Is it worth it to run hundreds at a time?