I need answers now! Using simulation to jump-start an experiment (Part II)
NOTICE: No users were harmed in the writing of this blog post.
In the last installment of this series I talked about a very light-touch version of user modeling - we just took the existing user population and distorted it to approximate what it might look like under our proposed experiment. This is a really useful trick to have up one’s sleeve, but it only works for a very limited set of experiments. So to follow up we took the next logical step and created a complete virtual online world to plug users into to see how they behave in different situations.
Just kidding! We don’t have the budget for that. (Yet.)
Case study 2: User simulator
Suppose we are introducing a completely new feature on the site for which there is no historical analogue. In our specific case, we might introduce a new type of practice task which presents the user with different math problems from those the normal tasks present. How will those users perform? How long will it take them to complete all the content? Are they better off using the new practice task or sticking with the old one?
In order to answer these questions, we need two components:
- A set of working models (corresponding to different types of users at different ability levels) of user performance on a task, drawn from historical data. A simulated user will take actions based on the probabilistic predictions of their user model.
- An automated test harness that can simulate the outputs and options presented to the simulated user at any given point and respond to their inputs appropriately, while reporting useful metrics such as overall accuracy and time-to-completion.
The simulated user
Even a very simple user model, one that answers every problem correctly with a fixed probability, can answer some basic questions:
- Can a “perfect” user (99% or 100% accuracy) actually complete a mission? (This is helpful for catching accidental circular dependencies or other bugs that block progress)
- How does the time-to-completion vary with overall accuracy? Do we over-penalize for silly mistakes?
- What is the minimal steady-state accuracy required to actually complete a mission?
It is easy to imagine various ways to improve the user model with real data: we could vary the accuracy according to the difficulty of the question, increase the accuracy monotonically to simulate learning, and so on. We need these more subtle models to be able to compare two treatments of a site feature with respect to questions like these:
- Does overall accuracy go up or down in alternative B compared to A?
- Do more users in alternative B actually complete the mission compared to alternative A?
- Do users who complete a mission in alternative B do so in less time than alternative A?
There is definitely a point of diminishing returns, though. Our user model can’t predict perceptual effects like what font or color users will pay attention to or how they will react to various intrinsic or extrinsic motivators.
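To make the idea of a probabilistic user model concrete, here is a minimal sketch in Python. It is not the simulator's actual code: the class body is an assumption, and only the parameter names (`starting_prob`, `learning_rate`, `max_prob`) are borrowed from the configuration file shown later in this post. The `seed` argument is added purely so runs are repeatable.

```python
import random


class ProbabilisticUser:
    """Toy user model: each problem is answered correctly with some probability."""

    def __init__(self, starting_prob, learning_rate=0.0, max_prob=1.0, seed=None):
        self.prob = starting_prob           # current chance of a correct answer
        self.learning_rate = learning_rate  # ability gained per problem attempted
        self.max_prob = max_prob            # ceiling on ability
        self.rng = random.Random(seed)      # private RNG (hypothetical convenience)

    def answer(self):
        """Return True if the simulated user gets the next problem right."""
        correct = self.rng.random() < self.prob
        # Simulate learning: ability rises monotonically, capped at max_prob.
        self.prob = min(self.max_prob, self.prob + self.learning_rate)
        return correct


user = ProbabilisticUser(starting_prob=0.5, learning_rate=0.01, max_prob=0.95, seed=1)
answers = [user.answer() for _ in range(200)]
print(f"accuracy over 200 problems: {sum(answers) / len(answers):.2f}")
print(f"final probability of a correct answer: {user.prob:.2f}")
```

With `learning_rate` set to 0.0 this collapses to the fixed-accuracy user, and `starting_prob=1.0` gives the "perfect" user that is handy for checking that a mission can be completed at all.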
The test harness
Once you have a simulated user, you need a simulated version of the site for them to interact with.
You: Hey, isn’t that a lot of work to build?
Me: Yes, it sure is!
However, we’ve already done all that work in order to get integration tests working! End-to-end integration tests create user entities in a test database, make API calls on their behalf, and perform other necessary functions like temporarily overriding the current date and time. They also run in parallel and clean up after themselves between tests, which is exactly what we need to run a bunch of simulated users through a set of tasks independently. The more we can leverage that existing work, the easier it becomes to create a functioning user simulator.
After delegating setup and teardown to the code shared with tests, the test harness is responsible for creating a user entity, switching to the designated mission, fetching the list of recommended tasks, and completing them one by one, delegating any decisions (order to attempt tasks, correct or incorrect on each problem) to the user model it was initialized with. Different experimental conditions can be enabled or disabled for different subsets of users to simulate multiple A/B test alternatives. When the harness detects that the mission was completed or an error occurred it will write statistics for each alternative to a log and exit.
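The loop described above can be sketched roughly as follows. Everything here is a stand-in: the real harness drives the production business logic through the test infrastructure, whereas this toy "mission" is a hypothetical list of skills, each mastered after a streak of correct answers, and the "control" and "treatment" alternatives are invented for illustration.

```python
import random


def run_user(accuracy, alternative, rng, num_skills=10, max_problems=5000):
    """Run one simulated user through a toy mission and return their stats."""
    # Hypothetical rule: a skill is mastered after a streak of correct answers,
    # and the "treatment" alternative requires a shorter streak than "control".
    streak_needed = 3 if alternative == "treatment" else 5
    problems = correct = 0
    for _ in range(num_skills):
        streak = 0
        while streak < streak_needed:
            if problems >= max_problems:
                # Give up: this user cannot finish the mission in reasonable time.
                return {"alternative": alternative, "completed": False,
                        "problems": problems, "correct": correct}
            problems += 1
            if rng.random() < accuracy:  # the user model decides the outcome
                correct += 1
                streak += 1
            else:
                streak = 0  # a mistake resets progress on this skill
    return {"alternative": alternative, "completed": True,
            "problems": problems, "correct": correct}


def run_experiment(accuracy, users_per_alternative=50, seed=0):
    """Simulate the same kind of user under each alternative."""
    rng = random.Random(seed)
    return [run_user(accuracy, alternative, rng)
            for alternative in ("control", "treatment")
            for _ in range(users_per_alternative)]


for row in run_experiment(accuracy=0.9)[:3]:
    print(row)
```

Running a perfect user (`accuracy=1.0`) through a loop like this is the quickest way to spot a mission that can never be completed, and the `max_problems` cap keeps a hopeless user from running forever.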
This scheme has turned out to work even better than expected. Aside from a few simulator-specific performance improvements the business logic is running the same code as in production. On a beefy machine with plenty of processors and memory, we can simulate hundreds of users in minutes, which can give us a quick sanity check that new features aren’t going to break or degrade the experience. The simulator has even caught a few regressions that could block a user from completing a mission. We now run it nightly as another continuous integration test.
Example
To see how one might use the simulator, here is an example. This fall we have been working intently on accelerating progress through math missions for users who already know the material. This can be beneficial for students starting at a level below their actual skill level, or wanting to review concepts they’ve already learned. We want to make the process of “catching up” to where you ought to be as quick and painless as possible, and one proxy for this is the time it takes to complete a mission for a user with high accuracy. We had already implemented an experiment to introduce “booster tasks”, which promote the user to a higher mastery level on a group of skills if he or she completes all the problems in the task at a high level of accuracy. The simulator allowed us to validate that the results of this experiment would be positive before actually shipping it to users.
The user simulator is highly configurable, and all I needed to do to simulate an already-implemented experiment was create a YAML file with the configuration I wanted:
```yaml
# the slug of the mission you'd like to simulate
mission: cc-third-grade-math

# simulated users are run in parallel to each other. you can
# specify the number of processes in order to maximise
# performance for your machine's cpu's.
num_processes: 4

# whether or not to use the test db specified in datastore_path
use_test_db: true

# path to your datastore.
datastore_path: ../current.sqlite

# specify the parameters of the simulated users and the experiment
# groups into which they are segmented.
experiment_groups:
  # name of the experiment group
  group_a:
    ProbabilisticUser:
      num_users: 50
      # A/B test alternatives
      bigbingo_alternatives:
        booster_tasks_v3: control
      # the parameters with which the users are initialised.
      params:
        # session time per day in seconds.
        max_time_per_day: 1200
        # initial probability of getting problems correct
        starting_prob: 0.9
        # rate at which the ability increases per problem.
        learning_rate: 0.0
        # maximum ability.
        max_prob: 1.0
  group_b:
    ProbabilisticUser:
      num_users: 50
      # A/B test alternatives
      bigbingo_alternatives:
        booster_tasks_v3: booster
        booster_task_length: length-6
        booster_task_min_problems: min-problems-12
      # the parameters with which the users are initialised.
      params:
        # session time per day in seconds.
        max_time_per_day: 1200
        # initial probability of getting problems correct
        starting_prob: 0.9
        # rate at which the ability increases per problem.
        learning_rate: 0.0
        # maximum ability.
        max_prob: 1.0
```
This will create 100 simulated users split into two groups. Both groups have the same internal model: they are highly accurate users (each problem is answered correctly with 90% probability, no matter the question) who are going to just tear through any problems we give them. They do, however, make mistakes, and the distribution of outcomes is going to reflect how the system reacts to those mistakes. The key difference between the two groups (the bigbingo_alternatives settings) is that one will be enrolled in the “booster_tasks_v3” experiment and the other won’t.
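As a rough illustration of how a configuration like this turns into 100 individual users, here is a small Python sketch. The `CONFIG` dictionary is a hand-written guess at what the parsed YAML would look like (the exact nesting is an assumption), and `expand_users` is a hypothetical helper, not a function from the real simulator.

```python
# Hypothetical parsed form of the YAML above (only the fields used here).
COMMON_PARAMS = {"max_time_per_day": 1200, "starting_prob": 0.9,
                 "learning_rate": 0.0, "max_prob": 1.0}
CONFIG = {
    "experiment_groups": {
        "group_a": {"ProbabilisticUser": {
            "num_users": 50,
            "bigbingo_alternatives": {"booster_tasks_v3": "control"},
            "params": dict(COMMON_PARAMS),
        }},
        "group_b": {"ProbabilisticUser": {
            "num_users": 50,
            "bigbingo_alternatives": {
                "booster_tasks_v3": "booster",
                "booster_task_length": "length-6",
                "booster_task_min_problems": "min-problems-12",
            },
            "params": dict(COMMON_PARAMS),
        }},
    }
}


def expand_users(config):
    """Turn the group definitions into one spec per simulated user."""
    users = []
    for group, models in config["experiment_groups"].items():
        for model_name, spec in models.items():
            for index in range(spec["num_users"]):
                users.append({
                    "id": f"{group}-{index}",
                    "group": group,
                    "model": model_name,
                    "alternatives": spec["bigbingo_alternatives"],
                    "params": spec["params"],
                })
    return users


users = expand_users(CONFIG)
print(len(users))  # prints 100
```

Each spec carries both the user model parameters and the A/B alternatives, so a harness can enable the right experimental condition before that user attempts any problems.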
The results are emitted in CSV form. Here is the distribution of the most important statistic - how many problems it takes to complete the mission - for 90% accurate users and 95% accurate users:
[Figure: As a reward for reading this far, here's a tasty graph!]
In the “without boosters” condition, the primary acceleration mechanic is “cascading challenge” exercises, which continue to fast-track the user through mastery levels on consecutive skills while they are getting answers correct. This works OK for 95% accuracy users, but when the user starts making careless errors those errors can have a huge effect on completion times, as we can see for the 90% accurate users - the range is 130-620 problems! (Note that this simple model assumes it’s equally likely the user errs on a simple problem as a complex problem, which is not true in practice.)
With boosters, the number of problems required is clearly much smaller, and the variability is also dramatically reduced. The worst case is now down to a manageable 190 problems. Of course, we have to validate that users who don’t actually know the material don’t also get promoted at a faster rate. We can just tweak the parameters and run the simulator again.
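Since the results come out as CSV, comparing the spread between groups takes only a few lines of Python. The sketch below is an assumption about what such a summary could look like: the column names `group` and `problems_to_complete` are hypothetical, and the sample rows are placeholder values, not actual simulator output.

```python
import csv
import io
import statistics
from collections import defaultdict

# Placeholder rows for illustration only; these are NOT real simulator results.
SAMPLE_CSV = """group,problems_to_complete
control,100
control,300
control,200
booster,90
booster,110
booster,100
"""


def summarize(csv_text):
    """Return min, median and max problems-to-completion for each group."""
    by_group = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_group[row["group"]].append(int(row["problems_to_complete"]))
    return {
        group: {
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for group, values in by_group.items()
    }


for group, stats in summarize(SAMPLE_CSV).items():
    print(group, stats)
```

Looking at the minimum and maximum alongside the median matters here, because the interesting effect is the change in variability and worst case, not just the typical case.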
Conclusion
Even without going down the rabbit hole of creating really sophisticated or realistic user models, we have already derived a huge benefit from the ability to roughly compare different treatments on the site and sanity check that increasingly complex systems work as designed. Now whenever we are considering a new improvement or feature and we want to know whether it will be effective, we can go down the list:
- Can we find evidence for it in our existing user data?
- Can we run a simulation based on existing user data to find evidence?
- Can we run an experiment with actual users to gather evidence?
If you enjoyed this post and you are interested in the software development side of things, check out my personal blog at arguingwithalgorithms.com.