I need answers now! Using simulation to jump-start an experiment (Part I)

Thursday, December 11, 2014 · Tom Yedwab


NOTICE: No users were harmed in the writing of this blog post.

Say for the moment you're building a website providing some kind of service to millions of users. You know all about progressive refinement using A/B testing methodologies, and you've found some easy wins by tweaking various features on the site and observing the results. Maybe you've even launched some completely new user interaction mechanics as an experiment and verified that they improve key metrics. But now you have a proposed feature that your lead engineer thinks will increase engagement by a wide margin and your lead designer thinks will hurt engagement irreparably. What do you do?

It is tempting to say "let's run the experiment anyway." You can roll it out to just a small fraction of the user base, and you can always pull the plug on the experiment if key metrics take a turn for the worse, right?



I'm not going to go into the details of why this is both counterproductive and, depending on your circumstances, unethical. Even if you aren't trying to provide a free, world-class education for all, running experiments that you suspect will have negative effects on your test subjects is Bad Science. If you can't imagine your proposal ever being approved by an institutional review board at a major university, then you should really reconsider. There is a better way.

Simulations: Same great taste, zero calories

Data modeling, statistics, and simulation have traditionally been tightly intertwined: they are often used together to make business decisions in the face of uncertainty about the future, or to train people who work with complex processes. It's also fairly obvious how to use simulation when designing the architecture of a building or the schematics for electronics. But in web design the principal components are users, so any simulation of how the website works invariably starts with a simulated user of the site.

So, what do we gain by shifting our analysis from real users to simulated ones?
  1. Instant feedback. The biggest gain is that we don't have to wait several weeks for results to come back. (Though let's be honest, we all get impatient and start checking the numbers as soon as data starts to come in!) A simulation can run in hours or even minutes, and we can quickly try many different alternatives in simulation before picking one or two to actually A/B test with real users.
  2. No ethical dilemmas. Simulated users can't be hurt by your changes, write angry reviews of your product or spread bad opinions about you.
  3. You can ignore edge cases. Even the simplest experiment in production requires handling legacy users with ancient accounts that always seem to have strange properties that break your logic in subtle ways. In simulation, we usually only need to implement logic for the general cases.
  4. Sample sizes as large as you have patience for. Need a few thousand more users? Run the simulation for a little while longer. Borrow a beefier machine and run the simulation in parallel to get better results.
  5. Repeatability. Sometimes we run an A/B test one week, then restart it the next week and get a completely different result. Simulations by definition ignore distortions from time, strange outliers, and so on, so if you see an effect you know it's real.
The downside, of course, is that the simulation is only as good as the model, so you can't simulate things for which there are no good correlates in the data you used to train the model. But when it does work it's a tremendous time-saver, as we'll see in the following case study, where we recently used this technique at Khan Academy.

Case study 1: Item reordering

This spring, I wanted to run an experiment within Khan Academy's learning dashboard having to do with the order in which individual items (questions) appear in an exercise (a collection of items practiced together). The theory was that applying a bias to move certain items earlier in the task would improve performance over the status quo, which presented the items completely at random. I had various heuristics to test for predicting which items would perform better (these could fill a whole blog post on their own), and the goal was to retain some randomness while increasing the likelihood that certain items appear earlier in the exercise.

One problem inherent in all A/B tests for functionality like this is that the possibility space is large, but only a few alternatives can actually be tested at any one time. While thinking about which heuristics I wanted to test, it dawned on me that I could take our existing anonymized performance data and, by aggregating it in different ways, run any of these as a virtual experiment.

The data we have represents a completely random item ordering, so we expect to find an even distribution among the orderings in the sample. If I were to change the recommender code to add a bias to the order in which items are shown to users, each ordering would still be represented (because of the randomness), but at a different probability. So, by calculating the probability of the actual ordering seen by each anonymized user in my dataset, I can assign them a weight for the simulated experiment and then use that to calculate a weighted average of any statistic I want to predict.
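To make that concrete, here's a minimal sketch of the reweighting step in Python. The data shapes and names here (UserRecord, simulate_metric) are hypothetical illustrations, not code from our actual pipeline:

```python
from collections import namedtuple

# One log record per (anonymized) user: the ordering they actually saw under
# the old, uniformly random scheme, and the statistic we care about.
UserRecord = namedtuple("UserRecord", ["ordering", "accuracy"])

def simulate_metric(records, new_ordering_probs):
    """Predict the average accuracy under a proposed, biased ordering scheme.

    records: UserRecords collected while orderings were uniformly random.
    new_ordering_probs: dict mapping an ordering like "CAB" to the probability
        the proposed recommender would show it.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r in records:
        # Because the original orderings were uniform, weighting each user by
        # the new probability of the ordering they happened to see reweights
        # the sample toward the proposed scheme.
        weight = new_ordering_probs[r.ordering]
        weighted_sum += weight * r.accuracy
        total_weight += weight
    return weighted_sum / total_weight
```

Calling this with the biased probabilities described below predicts what the average accuracy would have been under that scheme, without touching a single real user.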

For instance, suppose I have items A, B, and C in an exercise. The number of users who see each ordering of those items should be about the same, but a statistic that I might care about (in this made-up example, average overall accuracy) will not necessarily be the same:

Anonymized data from user logs (totally contrived example)

| Ordering | User count | Average accuracy |
|----------|-----------:|-----------------:|
| ABC      | 495        | 66%              |
| ACB      | 480        | 65%              |
| BAC      | 534        | 69%              |
| BCA      | 500        | 70%              |
| CAB      | 502        | 72%              |
| CBA      | 489        | 71%              |
| TOTAL    | 3,000      | 69%              |


This sort of data is really straightforward to pull from BigQuery and summarize. Now suppose I want to weight item C so that it appears in the first position 50% of the time, and in the second and third positions 25% of the time each. Given the new probabilities for each ordering, we can weight the user counts and get the predicted accuracy with this new ordering scheme:

Simulated data

| Ordering | Probability | User count | Average accuracy |
|----------|------------:|-----------:|-----------------:|
| ABC      | 12.5%       | 61.875     | 64%              |
| ACB      | 12.5%       | 60         | 71%              |
| BAC      | 12.5%       | 66.75      | 69%              |
| BCA      | 12.5%       | 62.5       | 78%              |
| CAB      | 25%         | 125.5      | 72%              |
| CBA      | 25%         | 122.25     | 73%              |
| TOTAL    | 100%        | 498.875    | 72%              |


(The probabilities follow from the constraints P(CAB) + P(CBA) = 0.5 and P(ACB) + P(BCA) = P(ABC) + P(BAC) = 0.25, split evenly within each pair. The modified user counts are the original counts multiplied by the ordering probabilities.)
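Here is the same arithmetic as a short sketch, using the contrived counts from the first table and the per-ordering accuracies shown in the simulated one (the variable names are just for illustration):

```python
counts = {"ABC": 495, "ACB": 480, "BAC": 534, "BCA": 500, "CAB": 502, "CBA": 489}
new_probs = {"ABC": 0.125, "ACB": 0.125, "BAC": 0.125, "BCA": 0.125,
             "CAB": 0.25, "CBA": 0.25}
accuracy = {"ABC": 0.64, "ACB": 0.71, "BAC": 0.69, "BCA": 0.78,
            "CAB": 0.72, "CBA": 0.73}

# Modified user count = original count for each ordering times its new probability.
weighted_counts = {o: counts[o] * new_probs[o] for o in counts}
total_weight = sum(weighted_counts.values())  # 498.875, the simulated TOTAL user count

# Weighted average of the per-ordering accuracies: the predicted overall accuracy.
predicted = sum(weighted_counts[o] * accuracy[o] for o in counts) / total_weight
```

On a real dataset you'd do the same thing per anonymized user rather than per aggregated ordering, as in the earlier sketch.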

There are a few things to notice here. One is that the overall accuracy is higher! That's a good sign for this experiment, as it suggests that seeing item C earlier rather than later is good for overall accuracy. Another is that the user counts are less evenly distributed: CAB now has twice as many "data points" as ACB! In fact, the more dramatic the reordering is, the less evenly distributed the probabilities will be, which undermines the validity of the result because a smaller and smaller subset of users' statistics dominates the average.

So, we have a conundrum: make the probabilities too even, and we won't see a statistically meaningful effect. Make the probabilities too uneven, and noise drowns out the effect. There are a few ways to get around this problem, such as incrementally increasing the bias to try to detect a trend, or dividing the original sample into subsamples to estimate error bounds for the simulated predictions, as sketched below.
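As a rough illustration of that last idea, here's a sketch that builds on the hypothetical simulate_metric above; splitting into k interleaved subsamples and the choice of k are my own assumptions, not a description of the analysis we actually ran:

```python
import random
import statistics

def simulated_error_bounds(records, new_ordering_probs, k=10, seed=0):
    """Run the reweighting on k disjoint subsamples to get a rough error bound."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    estimates = [
        simulate_metric(shuffled[i::k], new_ordering_probs)  # every k-th record
        for i in range(k)
    ]
    # The spread of the per-subsample estimates gives a sense of how noisy the
    # simulated prediction is for a given level of bias.
    return statistics.mean(estimates), statistics.stdev(estimates)
```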

Since the actual math here is pretty straightforward (pretty much just simple probabilities and weighted averages), the whole thing took just a few days to yield results, and I was able to test a number of different alternatives in simulation before implementing a few of the best candidates as an experiment to ship to real users.

To be continued...

This should give you a taste of how a very simple simulation of an experiment could work if the change you are making is incremental and predictable. Hopefully this is enough to get you started thinking about hypothetical questions of your own in terms of simulated outcomes. In the next installment I'll delve into a much more robust solution to the simulation problem: the “user simulator”.

If you enjoyed this post and you are interested in the software development side of things, check out my personal blog at arguingwithalgorithms.com.
