The notifications examples make me wonder what fundamental mistakes they are making.
People respond to change. If you A/B test, say, a new email headline, the change usually wins. Even if it isn't better. Just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.
If you don't understand downsides like this, then A/B testing is going to have a lot of pitfalls that you won't even know that you fell into.
First, it seems impractical because you need high user growth to get enough samples to run your experiment to statistical significance in a reasonable time. And if you are seeing high growth, how many of those accounts are real people rather than bots or duplicate accounts? Maybe that isn’t too big of a deal for an email campaign, but for apps and websites it matters, because of the next reason.
Secondly, there’s no reason to believe that new users behave like old users; in fact, there are probably many reasons to believe that they do not. Because your A/B population isn’t representative of the actual population of users, there’s no reason to believe that whatever effect you measured on the neophyte users will carry over to the experienced users. Maybe it will, but it’s not guaranteed.
Random sampling of users is better, because the A/B population is drawn from the general population, and so the only confounding property is the novelty effect. Doing what you suggest eliminates the novelty effect, but now you can’t generalize your learnings at all.
I've seen a lot of really sophisticated data pipelines and testing frameworks at a lot of shops. I've seen precious few who were able to make well-considered product decisions based on the data.
I’m pretty skeptical of this. I’ve run a lot of ML based A/B tests over my career. I’ve talked to a lot of people that have also run ML A/B tests over their careers. And the one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics.
Seriously. A/B tests are kind of a crap shoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.
I’ve seen positive offline models perform flat. I’ve seen negative offline metrics perform positively. There’s just a lot of variance between offline and online performance.
Just run the test. Lower the friction for running the tests, and just run them. It’s the only way to be sure.
Maybe I am too cynical, but what we are really talking about here is causal inference on observational data, based on more or less structural statistical models.
Any researcher will tell you: this is really hard. It is more than an engineering problem. You need to know not only how to deal with problems, but also what problems may arise and what you can actually identify. Most importantly, you need to figure out what you cannot identify.
There are, at least here in academia, only a limited set of people who are really good at this.
Long story short: even if offline analysis is viable, I doubt every team has the right people for it, making it potentially not worthwhile.
It is infinitely easier to produce a statistical analysis that looks good but isn’t than one that is good. An overwhelming number of useless offline models would, statistically speaking, be expected ;)
Between the emojis in the headings and the 2009 era memes, this was a bit of a cringy read. Also, the author seems to avoid at all costs going in depth about the actual implementation of OPE and I still don't quite understand how I would go about implementing it. Machine learning based on past A/B tests that finds similarities between the UI changes???
My biggest question is: where do you get user data to run the simulation? Take the simple push example: if to date you’ve only sent pushes on day 1 and you want to explore days 2, 3, 4, 5, etc., where does that user response data come from? It seems like you need to get the data first, and then you can simulate various permutations of a policy. But then why not just run a multi-armed bandit?
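For what it's worth, the bandit alternative is simple to sketch. Here is a minimal Thompson-sampling loop over hypothetical push-timing arms (the arm names, counts, and binary-reward model are made up for illustration, not from the article):

```python
import random

def thompson_pick(arms):
    """Pick the arm with the highest draw from its Beta posterior.
    `arms` maps arm name -> [successes, failures]; Beta(s+1, f+1)
    corresponds to a uniform prior on each arm's conversion rate."""
    best, best_draw = None, -1.0
    for name, (s, f) in arms.items():
        draw = random.betavariate(s + 1, f + 1)
        if draw > best_draw:
            best, best_draw = name, draw
    return best

# Hypothetical push-timing arms with counts observed so far.
arms = {"day_1": [30, 70], "day_2": [5, 20], "day_3": [2, 25]}
chosen = thompson_pick(arms)
arms[chosen][0] += 1  # record a success (increment index 1 for a failure)
```

The appeal over a fixed A/B split is that the bandit gathers exactly the exploratory data the question asks about while it optimizes, instead of requiring the data up front.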
The challenge I've seen is combining good, small-scale Human-Centered Design research (watching people work, for instance) with good, large-scale testing. It can be really hard to learn the "why" from A/B tests otherwise.
When I was the A/B test guy for Travelocity I was fortunate to have an excellent team. The largest bias we discovered is that our tests were executed with amazing precision and durability. My dedicated QA was the shining star that made that happen. Unfortunately, when the resulting feature entered the site in production as a released feature, there was always some defect, or some conflict, or some oversight. The actual business results would then underperform compared to the team’s analyzed prediction.
What is your advice, or more details on the types of challenges you came across and how you handled this discrepancy? I would imagine the data shifts a bit, and that standard assumptions don’t hold up around how the difference you were measuring between your “A” and your “B” remain fixed after the testing period.
The biggest technical challenge we came across is that when I had to hire my replacement we couldn’t find competent developer talent. Everybody NEEDED everything to be jQuery and querySelectors but those were far too slow and buggy (jQuery was buggy cross browser). A good A/B test must not look like a test. It has to feel and look like the real thing. Some of our tests were extremely complex spanning multiple pages performing varieties of interactions. We couldn’t be dicking around with basic code literacy and fumbling through entry level beginner defects.
I was the team developer and not the team analyst so I cannot speak to business assumption variance. The business didn’t seem to care about this since the release cycle is slow and defects were common. They were more concerned with the inverse proportions of cheap tests bringing stellar business wins.
At my previous job when it was relevant, I wrote something in house.
A higher-order component would pick the variant at runtime (because we had problems with SSR, which would otherwise have been the more appropriate place). It cached the picks in cookies.
In-house charting and probability calculations determined P(X>Y) for each experiment pair. Then we'd just manually prune variants occasionally (since the bad ones weren't being displayed, timeliness didn't much matter), and periodically re-introduce old variants by hand if we thought it was worth it.
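For readers wondering what computing P(X>Y) looks like in practice: with binary conversions, one common approach is Monte Carlo sampling from each variant's Beta posterior. A rough sketch (this is not the parent's actual code; the uniform priors and draw count are my assumptions):

```python
import random

def prob_x_beats_y(succ_x, fail_x, succ_y, fail_y, draws=100_000):
    """Estimate P(X > Y) for two conversion rates by Monte Carlo:
    draw from each variant's Beta(successes+1, failures+1) posterior
    and count how often X's draw exceeds Y's."""
    wins = 0
    for _ in range(draws):
        x = random.betavariate(succ_x + 1, fail_x + 1)
        y = random.betavariate(succ_y + 1, fail_y + 1)
        if x > y:
            wins += 1
    return wins / draws
```

A variant with 120 conversions out of 1000 versus one with 100 out of 1000 would then get a P(X>Y) somewhere well above 0.5 but short of certainty, which is exactly the kind of number you'd prune on.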
>Now you might be thinking OPE is only useful if you have Facebook-level quantities of data. Luckily that’s not true. If you have enough data to A/B test policies with statistical significance, you probably have more than enough data to evaluate them offline.
Isn't there a multiple comparisons problem here? If you have just enough data to do a single A/B test, how can you do a hundred historical comparisons and still keep the same p-value?
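For context on why this matters: the standard family-wise corrections make the per-comparison threshold much stricter as the number of comparisons grows. A quick sketch of the two textbook formulas:

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold that keeps the family-wise
    error rate at `alpha` across `m` comparisons (conservative)."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Šidák correction: exact under independence, slightly less
    conservative than Bonferroni for m > 1."""
    return 1 - (1 - alpha) ** (1 / m)
```

With alpha = 0.05 and 100 historical comparisons, Bonferroni requires p < 0.0005 on each one, so each comparison needs far more data than a single A/B test at p < 0.05 would.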
Recently I've heard booking.com given as an example of a company that runs a lot of A/B tests. Anyone from Booking reading this? How does it look from the inside? Is it worth it to run hundreds at a time?