A data-informed approach to community fundraising

At Khan Academy, our mission is to provide a free, world-class education for anyone, anywhere. With a staff of fewer than 100 people, the only way we can make good on the “free” part for our 30 million registered learners and millions of others is through the funding we receive from our generous donors. And as the educational needs of billions around the globe continue to evolve, for us to make good on the “world-class” part, we must creatively tap into new sources of funding. Our learners are the reason we come into work every day, and we’re committed to helping them learn regardless of what they can or cannot afford. But what more could we do if those learners (and their families) that are able to chip in actually contributed just a few dollars (or pesos, or rupees, or any other currency) to fuel our mission?

It may sound ambitious -- even crazy! -- to believe we can mobilize our community of users to help us continue to fund our mission. But organizations like Wikipedia and the Obama campaign have shown us how powerful a community of donors can be in the pursuit of an audacious vision. These efforts have relied heavily on experimentation and quantitative analysis to better understand what motivates different kinds of people to donate, and why.

In a similar vein, we’ve started to conduct several different kinds of experiments to better understand what motivates Khan Academy users to donate. Emails, social media, and even carrier pigeons are all channels through which we can reach potential donors (TBD on the carrier pigeons). But given that many of our users find value on the platform itself, how can we encourage giving when they’re on the site, without disrupting their learning experience?


Banner message tests

On-site banners (like the ones below) are one way to encourage users to become donors. Over the past several weeks, we’ve been running experiments to see which banner messages are the most compelling in motivating users to donate. These experiments entail sampling a subset of our users and showing them one of several banners when they’re on the site to investigate which banners lead to the greatest number (and total $ amount) of donations. Users can, of course, always dismiss the banner so they don’t have to see it anymore. Experimenting with different banner messages is one way to learn what users value about their Khan Academy experiences and what motivates them to give. In our latest experiment, we found a particularly interesting trend with two different messages, both of which are shown here:


Message A:

“Khan Academy is a small nonprofit with a big mission: a free, world-class education for anyone, anywhere. With fewer than 100 employees, Khan Academy serves more than 15 million users each month with over 100,000 videos and exercises.

If everyone on the site this week gave $20, it would fund all of our work for the next year. If you’ve gotten something out of our site, please take a second to pitch in and help us get back to teaching!”
 
Message B:

“Khan Academy is a small nonprofit with a big mission: a free, world-class education for anyone, anywhere.

If everyone on the site this week gave $3, it would fund all of our work for the next year. That’s right - the price of a cup of coffee is all we ask. We’re so lucky to be able to serve people like you, and we hope you’ll pitch in to help us continue to do our best work.”
 

Message B only asked for a $3 donation by default, since other (particularly higher) amounts might not make sense (unless you drink really expensive coffee!). Message A, however, varied in its default ask with the following amounts: $1, $2, $3, $19, and $20¹ (with the $20 version of the default ask shown above).

So, which message performed better? Well, it depends on what we mean by “better”. When evaluating message strings, all things equal (except for the string itself), we look at 2 key metrics: the % of banner viewers that actually end up contributing, and the total amount (in $) generated from each message. This is because we care about mobilizing the greatest number of users to give us as much money as they’re able to.

So, which message mobilized a higher fraction of our users to give - Message A, or Message B?

It turns out that users who saw Message B were significantly more likely to donate than those who saw Message A (p < 0.05). But was it really the “coffee”-related text that motivated people to donate, or was it the fact that this message had such a low ask amount of $3? If we look at the percentage of banner viewers who donated when they saw Message A at a $3 default ask, even though this percentage is lower than Message B’s, the difference is not statistically significant (p = 0.33). Moreover, we’ve learned that people tend to donate at lower amounts ($3 or under) significantly more often than higher ones. This suggests that many of our users are inclined to give smaller-sized gifts.
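To make that kind of comparison concrete, here's a minimal sketch of a two-proportion test like the one involved here. The counts are made up for illustration; they are not our actual numbers.

# Hypothetical comparison of donation rates between two banner messages.
# The counts below are invented for illustration; they are not our real data.
from statsmodels.stats.proportion import proportions_ztest

donors = [48, 75]          # viewers who donated after seeing Message A, Message B
viewers = [60000, 61000]   # total banner viewers in each group

z_stat, p_value = proportions_ztest(count=donors, nobs=viewers)
print("Message A donation rate: %.3f%%" % (100.0 * donors[0] / viewers[0]))
print("Message B donation rate: %.3f%%" % (100.0 * donors[1] / viewers[1]))
print("p-value: %.4f" % p_value)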

Message B might have brought in the highest number of donors - but did it also raise the most $?

In this case, no - Message A brought in approximately 1.4x more in total gifts than Message B². If we look at the distributions of gift sizes for each (not including projected revenue from recurring gifts), shown below, we can begin to understand why.


These charts suggest that those users who are asked for only $3 - for example, by equating it to a cup of coffee in Message B - tend not to deviate much from that amount. Therefore, even though many of them donated, they mostly donated at a relatively low amount, yielding lower total $ raised. Those asked for an amount that is not pegged to a specific item (i.e. an amount that fell in the range described earlier) tended to give at those amounts -- but also at amounts they weren't asked for (e.g., $10). It turns out, Professor Daniel Kahneman wasn't lying when he said anchoring is real!


Challenges


There are infinite levels of analyses we can perform to better understand what motivates our users to become donors. There are also several challenges that arise when running experiments like the one described above. Here are a few:

  • Inferring why. While the data can tell us “what” or “which” - e.g., which messages motivate users to donate -- it can’t tell us “why” these users actually donate. This is where supplementing these experimental efforts with qualitative user research is essential (more below). 
  • Sample sizes are often small. Even with the large number of users that visit Khan Academy, because a fraction of users are sampled to be shown banners³, and an even smaller fraction actually ends up donating, the total number of donors is small. This can make it challenging to obtain and evaluate valid results quickly, especially if the goal is to iterate on new message strings to learn as quickly as possible what resonates and what doesn't (see the power-calculation sketch after this list). 
  • No perfectly-consistent control group. Running experiments over time opens the door to several confounding factors like the seasonality of site usage or different propensities to give at different times of the year -- all of which could sway results. 
  • Deciding what to test is tough! With so many possible messages, banner styles, amounts, and a host of other variables, picking which variables to test is often half the battle. In these cases, a nice mix of data-informed insights and good ol’ fashioned creativity is crucial.
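To illustrate the sample-size point above, here's a rough, purely illustrative power calculation for detecting a small lift in donation rate; the rates and thresholds are assumptions, not our actual targets.

# Rough power calculation: how many banner viewers per arm would we need to
# detect a lift in donation rate from 0.10% to 0.15%? Rates are illustrative.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.0015, 0.0010)  # Cohen's h for the two rates
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided")
print("Viewers needed per banner: %d" % round(n_per_arm))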



Looking ahead

A world where every Khan Academy user chips in whatever they can to help us collectively achieve our mission is a world we are all very excited about. In addition to experiments, other research, including qualitative surveys, has taught us even more about the deep commitment many of our users have to Khan Academy. Here’s a powerful quote from a Khan Academy parent:

“...we live right above the poverty line but so appreciate everything khan does for our son and even us grown-ups in learn[ing] for free. We don't have much but if there was a way that we could donate in small amounts ($10 or less) it would make us feel better as a family that we are contributing to our education. We have a 12 year old hispanic son living in the middle of the ghetto but he knows how to write java script [sic] and reads at college level. Thanks for all you do.”

Donate today and help us deliver a free, world-class education for anyone, anywhere.


  • 1. The amounts were launched as a separate, “overlapping” experiment so that the effects of default ask amounts can also be explored independently. 
  • 2. If we account for recurring gifts at an average lifetime of 7 months, this multiplier shrinks to approximately 1.1x. It’s important to note that given the small number of donations, many of these values are highly sensitive to outliers (especially large gifts). 
  • 3. In an effort to incrementally improve our messaging, we’ve chosen to show only a subset of our users banners, with plans to widen the sample size as we learn more about what motivates people to give.  

The Experiment

Not every experiment goes the way you want it to. One of the most humbling things about dealing with data - and the one that I was least prepared for when I started running experiments - is the ever-present threat of wasting a lot of time (or worse, making an ill-informed decision) because of some really simple mistakes or unforeseen “gotchas” when constructing your experiment. However, it’s important to us as a team (and maybe even us as an industry *nudge*) to talk as openly about our mistakes as our successes. So in the spirit of avoiding publication bias, here is a post about a fun, lighthearted experiment that taught me a lot about the tricky business of data collection.


Remember this post? We wanted to advertise job openings on the Data Science team, and I had a lot of ideas about how to go about it. After putting some ideas down I realized I had two possible approaches to the pitch, which simply put were: “here are the people you’ll get to work with”, and “here are the projects you’ll get to work on”. We deal with these kinds of design decisions every day, and our go-to method for resolving them is the A/B test. This struck me as an amusing idea: could I run two versions of the post and compare their performance? This isn’t something I’d seen done before, so I decided to give it a try.

In any experiment the first decision to be made is: what are we optimizing for? In this particular case, the choice is obvious: we want to maximize the number of job applications received as a result of reading the blog post. However, I didn’t want to rely exclusively on this since the expected conversion rate was fairly low - thousands of people read any given blog post, but only a handful generally apply for a job. So I also wanted to track some other proxy for engagement, and I settled on the number of times the post was shared. More shares would mean the post struck a chord with readers and reached a wider audience.

While we run similar tests all the time, I had never run an A/B test on a blog post before. The blog itself is hosted on Blogger, so I didn’t have access to our amazing A/B test framework. I had to improvise ways to show the different alternatives and measure the variables I was interested in. Luckily, I had the ability to add some JavaScript to the page, and so I wrote up a snippet that uses both the location hash and browser cookies to assign a user to a persistent alternative:


$(function() {
  // "#___" in the URL hash marks alternative A and "#__" marks alternative B.
  var getHash = function() {
    return (window.location.hash === "#___") ? "contentA" : ((window.location.hash === "#__") ? "contentB" : null);
  }
  var setHash = function(value) {
    window.location.hash = (value === "contentA") ? "___" : ((value === "contentB") ? "__" : null);
  }
  var setCookie = function(value) { ... }
  var getCookie = function() { ... }

  if ($(".PostBody .contentA").length > 0) {
    // Reuse the alternative from the hash or the cookie if present; otherwise flip a coin.
    var alt = getHash() || getCookie() || ((Math.random() > 0.5) ? "contentA" : "contentB");

    setHash(alt);
    setCookie(alt);

    // Show the appropriate content
    $(".PostBody .loading").hide();
    $(".PostBody ." + alt).show();

    // Fix up page load tracker img links
    var $el = $(".PostBody ." + alt + " img.page-load");
    $el.attr({src: $el.data("src")});

    // Fix up Twitter/Facebook links
    $(".share-story .tips").each(function(idx, el) {
      if ($(el).data("title") === "Facebook" || $(el).data("title") === "Twitter") {
        $(el).attr("href", $(el).attr("href").replace(
          "\.html", ".html%23" + (alt === "contentB" ? "__" : "___")));
      }
    })
  }
});
There were two different divs with the “PostBody” class on the page, both initially hidden (I removed one of the alternatives when I ended the experiment). There are several moving parts to this experiment:
  • If the user hasn’t seen any alternative yet (as stored in either the location hash or the cookie) then we randomly assign them an alternative and rewrite both the hash (using “#__” and “#___” to be minimally obvious) and the cookie.
  • We show the appropriate content given the alternative the user has been assigned.
  • There is an <img> tag in each of the post body versions that contains a shortened link to a blank image, which serves as a web beacon. This isn’t visible on the page but the URL shortener service theoretically (see below) allows me to count the number of views in each alternative. However, if I just embed the shortened URL into the Blogger post, Blogger (helpfully) follows the link and replaces it with a static cached copy, so I have to set the “src” attribute from JavaScript to avoid that problem.
  • I want shares of the page, if at all possible, to point to the same alternative that the person sharing saw. This allows me to test whether one alternative gets shared more than the other: if this is true, then the more shared alternative will tend to have more views. So I rewrite the Facebook and Twitter share button links to include the alternative markers corresponding to the correct alternative.
  • Crucially, the referrer links in the page to the job postings in Greenhouse are different depending on the alternative so we can track which applicants were referred from which alternative.

I must admit, I was pretty proud of myself for working through all these issues. I published the post and moved on to other projects.


Fast forward to August. It’s been a fantastic summer here at Khan Academy: We’ve shipped a brand-new Official SAT Practice product to the site, hacked together prototypes in our annual Healthy Hackathon, and read about some interesting research in Statistics Club. And we’ve got some big challenges to look forward to in the fall. So I figured now is a good moment to take a break and evaluate this really quick experiment I launched about half a year ago.

First, here are some results:

                                 Total     Alternative A   Alternative B
Views (according to Blogger)     11,131
Views (according to goo.gl)      3,325     1,937           1,388
                                 (100%)    (58.26%)        (41.74%)
Résumés submitted                11        6               5


The first thing to note is the huge discrepancy (an order of magnitude difference) between total views according to Blogger’s built-in analytics and the number of views I was able to capture using the web beacon. Since I know that far more people than that browse with JavaScript enabled and therefore should have been seeing the post body (and I haven’t seen any reports of missing body text), all I really have are theories to explain the discrepancy, such as:
  • Ad blockers preventing the image from loading somehow
  • RSS readers fetching the post once and replacing the images with their own cached copies
  • Something particular to how Blogger counts page loads that inflates that number substantially

The RSS issue is potentially a big problem for the experiment: although I don’t have any particular evidence that this happened, it’s definitely possible for RSS readers to load the blog post itself (including, arbitrarily, one of the alternative bodies) and cache it for a large set of users. This could skew the results in a systematic way without any way for me to track it. It could happen with a regular web page too, but I would expect it to be more of a problem with a blog post.

One final indignity: I found out when I went to analyze the results that Greenhouse doesn’t track clicks on the referral links, so although I know how many resumes were submitted from each blog post - it was about 50/50 - I can’t see if significantly more viewers clicked through to the job posting. I should have verified that before launching the experiment.

Together, these issues dealt a big blow to my analysis of the experiment, as they diminished the hope that the underreporting bias was at least uncorrelated with the alternative.


So, what did I learn? Mostly that running controlled experiments on social media is really challenging. For example, a completely natural thing for people within Khan Academy to do was visit the blog post and then send out links on Twitter and Facebook. However, this would wildly skew the outcome if they didn’t know to strip the hash markers at the end! I had to go to great lengths to make sure everyone knew to post only the unhashed link and I still ended up pushing an emergency update to the JavaScript snippet the day the post was published after the official Khan Academy account accidentally tweeted out the wrong link.

Will I run an experiment like this again? Maybe, someday. However, it’s more of an undertaking than I initially thought to get right and very, very hard to verify that the resulting data is accurate, so I would reserve this trick for instances when there’s likely to be a much stronger effect on the final metric (in this case, submitted résumés).

Hope you learned something! And please share your own “sad trombone” moments in the comments.

So you want to build a recommender system?
Over the years, Khan Academy has built out recommendation engines for several parts of the platform, such as adaptive math ‘Missions’ and SAT practice, but our recommendations never touched one of our main types of content: the videos. For this summer’s Healthy Hackathon, several of us built a video recommendation system for Khan Academy. Using learners’ recent viewing history, we were able to predict what they would want to watch next. At the end of the hackathon, we demoed our model to the company by sending out personalized recommendation emails to all employees, and now we’re starting to test the system on the actual website. This post tells the story of how we got there:

Defining the Problem

Recommendations are everywhere: Netflix knows what you will want to watch next, and Spotify’s new Discover Weekly feature can open your eyes to songs you instantly fall in love with. With this variety of advanced systems comes an overwhelming body of literature, but unfortunately we could not implement all these techniques in four short days. Instead, we decided to focus on a clear and specific question and see how far some simple solutions got us. Our question:

Given the last video a user watched, what video will they want to watch next?

To start, we looked at our users’ video viewing history from the past month. Our model would take in a video a user watched and output several recommended videos. To train and test our system, we organized the viewing history into ordered pairs of videos. Part of these pairs would be used to construct the model; the other part would be used to test it by giving the model the first video in a pair and checking if the second video was one of the recommended output videos. With a clear problem in mind, we explored several approaches:

Model 1: Collaborative Filtering

One popular approach to recommendation systems is collaborative filtering - a method centered around the idea that you will like content similar to other content you’ve viewed. This similarity can be calculated based on other users’ patterns. For example, if many users buy both products A and B, collaborative filtering suggests that these products are related. When predictions are based on binary data, as opposed to ratings, the Slope One family of algorithms can be used. To understand this algorithm, consider the following table where a 1 represents that the person saw the video, and a 0 represents that they have not.

The rows in this table represent whether the user watched a video or not (1 or 0). To get a similarity rating between videos, we take the dot product between the columns. In this example, the dot products are:

Between A and B: [dot product shown as an image in the original post]

Between A and C: [image]

Between B and C: [image]
Then, for each video we recommend the video that it had the highest dot product with (other than itself). So, for video A we would recommend video C, for video B we would recommend video C, and for video C we would recommend video A.
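Here's a minimal sketch of that computation on a tiny, made-up binary watch matrix; it isn't our production code, just the idea.

import numpy as np

# Rows are users, columns are videos A, B, C; 1 = watched, 0 = not watched.
# The matrix is made up for illustration.
watched = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
])

# Dot products between columns give a co-occurrence score for each video pair.
similarity = watched.T @ watched
np.fill_diagonal(similarity, 0)  # never recommend a video as its own neighbor

videos = ["A", "B", "C"]
for i, video in enumerate(videos):
    best = videos[int(np.argmax(similarity[i]))]
    print("After video %s, recommend video %s" % (video, best))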

For this approach, we did not use the video pairs during training. Instead we just put a “1” for every video a user had watched during the month as shown in the example above. The model gave us 55% accuracy on the test data; not a bad initial result, as there are thousands of videos to choose recommendations from.
The main weakness of this approach is that it fails to capture one important aspect of our data: the order. While the order in which you buy products or watch movies may not matter, learning has some natural order to it: you can’t learn to add fractions until you’ve mastered adding natural numbers. The Slope One algorithm is agnostic to this property, so we had to search for a new solution.

Model 2: Markov Chains

Looking to capture the order of events, we turned to Markov chains. Markov chains encode the likelihood of transitioning between different states. Here is a simple example from the Wikipedia page:

In this diagram, there are two states A and E. Given that you are in state E, you have a 0.3 chance of staying in state E and a 0.7 chance of going to state A.  From state A, you have a 0.6 chance of remaining in state A, and a 0.4 chance of moving to state E.

In our case the states were different videos, and transitions represented watching one video after another. For this model, we recorded all the transitions in the training data. Then, we encoded the probability of a transition as the frequency of that transition divided by the total number of transitions out of that video. For example, if Video A appeared as the first video 10 times in the data, and 7 of those times the second video was B, then the probability of the A -> B transition is 0.7. Using these probabilities, we would pick the most probable transitions to return as recommendations. This approach was particularly appealing for its speed and compactness: once we pre-computed a matrix of transitions, we could quickly pick recommendations for all videos.
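A bare-bones version of this transition model might look something like the following sketch (simplified, with made-up pairs).

from collections import Counter, defaultdict

# Ordered (video watched, video watched next) pairs from the training data.
# These pairs are made up for illustration.
pairs = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("B", "C"), ("B", "D")]

transitions = defaultdict(Counter)
for first, second in pairs:
    transitions[first][second] += 1

def recommend(video, k=2):
    """Return up to k most probable next videos with their transition probabilities."""
    counts = transitions[video]
    total = sum(counts.values())
    return [(nxt, n / total) for nxt, n in counts.most_common(k)]

print(recommend("A"))  # roughly [('B', 0.67), ('C', 0.33)]
print(recommend("B"))  # roughly [('C', 0.67), ('D', 0.33)]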

The model produced 66% accuracy on the testing data, a solid improvement over the previous model. But the interesting result appeared when we separated our predictions for videos watched on the same day versus different days. The graph says it best:

For video views more than a day apart, the accuracy dropped dramatically. This result is not entirely surprising - most of Khan Academy’s content is organized into tutorials: short sequences of related videos and/or exercises.

Since the videos in a tutorial are related, students will often watch them in a row. Thus, if a student watched “Evaluating an algebraic expression in a word problem”, it is likely that they will watch “Evaluating an algebraic expression with exponents” on the same day. However, when a student finishes a tutorial or takes several days between watching videos, there are many different places they could go next. Our data showed that the variance in these cases was too large to be captured by the direct transitions.

Model 3: “All-pairs” Markov Chains

Although the direct transitions failed to capture the longer-term viewing patterns, we speculated that these trends may still be present in the data. To explore this idea, we expanded the model to capture all forward transitions for users. If a user watched Video A, then B, then C, we would capture not only the transitions A to B and B to C, but also A to C. The rest of the set-up remained the same. Our accuracy with this model was 62%, a drop from the previous attempt. We suspected that this happened because the additional pairs added noise to previously clear transitions.
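For reference, generating those extra forward pairs from an ordered viewing history is a small change; here's a sketch with made-up data.

from itertools import combinations

# One user's ordered viewing history (made-up data).
history = ["A", "B", "C"]

# All forward pairs, not just adjacent ones: (A, B), (A, C), (B, C).
all_pairs = list(combinations(history, 2))
print(all_pairs)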

Model 4: Markov Model + Time

To find a middle ground between the two previous models, we wanted to incorporate days between video views directly into the model. This dimension would potentially allow us to distinguish between short-term and long-term viewing trends. Unfortunately, in the span of the hackathon, we were not able to fully develop this last model, but in the future it can provide promising results.

Final Models, Results and Thoughts

For the final model, we combined the simple Markov Model (model 2) and the “all-pairs” Markov Model (model 3) with a basic voting scheme, resulting in a 64% accuracy. Although this approach did not account for specific time differences between video views in the prediction, it captured the general order videos were watched in. We decided to combine the two models, instead of just using model 2, because we wanted to keep some of the information about longer term viewing patterns. This decision was motivated by an important insight:

“A good prediction does not a good recommendation make...sometimes."
-Yoda on recommendation systems

Recommendations can serve two main purposes: “re-engagement” and “discovery”. Re-engagement means encouraging users to continue using Khan Academy in their learning process; discovery means showing users new content they may be interested in. Increasing the accuracy of predictions can help accomplish the first goal, because it shows how well we capture established trends in the learning process. However, when it comes to discovery, relying on prediction accuracy can fail because discovery is about what could happen, not what already did happen. By incorporating model 3, we started to incorporate this dimension. But in order to fully understand this dimension, we will need to find metrics that capture how student behavior changes with recommendations.

With our recommendation model, we were able to give pretty good answers to our original question (given the last video a user watched, what video will they want to watch next?) and gained a few insights for future models.

This post was written about one of the projects I did during my awesome summer internship at Khan Academy. If you want to know more about this project or my internship, you can reach me through my personal site.

New opening: Data Scientist
Yesterday we posted a new position on our careers page: Data Scientist. We're looking for talented folks, so if that's you please click on through and let us know!

For more about what it's like to be on the Data Science team at Khan Academy check out our recent post "Come be part of the team!".

Helping students learn at their level


At Khan Academy, our mission is to provide a free, world-class education for anyone, anywhere. One component of that is helping students find the content that will contribute the most to their learning.

Our “missions” — a guided path through a particular area of math content — are one way that students can discover new content to learn. Our recommendation system, which powers the mission dashboard, uses a model of a learner’s knowledge to predict what they already know and to suggest what they should work on next. (I won’t go into the details of our knowledge model, or how we’re working on improving it. This has been detailed elsewhere.)

Right now, it takes a while before our knowledge model has enough information to make good predictions of what a learner knows. As a result, some of the skills we recommend are things that the student has already learned. This isn’t ideal, because then learners spend too much of their limited time on review. On the other hand, we also don’t want to overcompensate and throw learners into content that they’re not ready for. This isn’t useful either, and it’s discouraging too.

Before we dive into how we’re going to address the problem, let’s take a quick look at how the system worked previously.

The mastery system

When a learner works on a particular skill on Khan Academy, we place them in one of five possible levels: unstarted, practiced, level one, level two, and mastered. To get from unstarted to practiced, the learner completes a “practice task,” a series of problems covering different aspects of a single skill. When the learner achieves five correct answers in a row, they’re promoted to practiced, and the practice task ends.

In order to further cement the skill, we use spaced repetition for the remaining levels: learners must wait at least 16 hours to advance from practiced to level one, and the same is true between level one and two, and between two and mastered.

Once the waiting period has elapsed, a student may see a single problem for the practiced skill in a mastery challenge. Unlike a practice task, which focuses on one skill, a mastery challenge is a series of problems from multiple skills. If the learner gets the problem correct, they’re promoted by one level (e.g. practiced to level one). In another 16 hours, the skill may appear in another mastery challenge, and another level may be gained.

For those keeping track, this is a minimum of 8 problems and 48 hours of waiting to get to the highest level of mastery. This often isn’t helpful for learners. For example, if I’m coming to Khan Academy to work on calculus, is it useful for me to do eight problems in single-digit addition (and everything in between)?
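As a quick sanity check on that 8-problem / 48-hour arithmetic, here's the fastest possible path through the levels spelled out in code.

# Fastest possible path through the mastery levels described above: 5 correct
# answers in a row to reach "practiced", then one mastery-challenge problem per
# remaining level, each preceded by a 16-hour wait.
remaining_levels = ["level one", "level two", "mastered"]

min_problems = 5 + len(remaining_levels)     # 5 + 3 = 8 problems
min_wait_hours = 16 * len(remaining_levels)  # 16 * 3 = 48 hours

print("%d problems, %d hours of waiting" % (min_problems, min_wait_hours))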

Fortunately this exact situation is not an issue because we group content into missions, which focus learners on certain groups of skills like calculus, algebra, or arithmetic. However, in less extreme situations — where the gap between what a learner knows and what we’re recommending for them to work on is within a mission (or where the learner has come to Khan Academy to discover what they don’t know) — the 8 problem / 48 hour system still uses up students’ precious time, which could be better spent learning new things or reinforcing previously practiced concepts.

In the past, we’ve tried several strategies to accelerate this process. Let’s briefly look at two of them: the pretest and challenge problems.

The pretest

Before Fall 2014, when someone started a math mission on Khan Academy for the first time, we’d show them a pretest. The pretest was a required task of 8 or so questions, selected adaptively based on a learner’s answers, designed to assess as quickly as possible what they already knew. After the test, we would recommend skills based on what we thought the student probably knew and what they still needed to learn.

The pretest helped many learners avoid days and days of unnecessary review, but it wasn’t perfect:

  1. It wasn’t complete enough. You can only assess a limited amount of knowledge in 8 questions, and for many learners, there was still a substantial amount of review to do afterwards.
  2. The way the questions were chosen was seemingly arbitrary and made the experience confusing for some learners. For example, the pretest might show calculus questions to someone who had just asked to work on third-grade math. It also felt like a barrier to many learners — something they needed to do before they could see the content they wanted to study.

Challenge problems

Challenge problems are a special type of problem that appear in mastery challenges when our knowledge model has learned that it’s very likely that a student already knows a given skill. They differ from normal mastery challenge problems in that they can appear even if a learner hasn’t practiced the skill on Khan Academy, they can bypass the 16-hour waiting period, and they can promote as far as mastery in just one problem. Challenge problems are the primary mechanism (other than selecting what content we present in the first place) by which our existing knowledge model can tune a learner’s path through a mission.

The problem remains

Especially now that we no longer have the pretest, there’s a substantial amount of time between when a learner starts out on Khan Academy and when there’s sufficient information for our knowledge model to start giving them challenge problems. At the same time, it’s critical for us to engage learners during this time — if a first-time visitor to Khan Academy feels bored by content that’s too easy, or feels like they’re not learning, then we aren’t succeeding at our mission of providing a world-class education to them.

One possible solution is to improve our knowledge model so that it better serves these learners; we’re also working on that. But that’s a complex and time-consuming process, and in the interest of helping our learners in the short-term, we turned to heuristics.

Designing an acceleration heuristic

In thinking about how to design a heuristic to push learners to new content faster, it’s helpful to consider why we think we can actually do this. That is, what do we think we’re going to be able to do that the knowledge model can’t? Then, we should focus on that, and let the knowledge model do its job for the rest.

Here are two things that we know and that our knowledge model may not fully capture:

  1. We know what mission a student selected to study. While students don’t always have an accurate picture of their own knowledge, a student who picked a certain mission is probably more likely to know the prerequisites for that mission than a randomly selected student would be.
  2. We have a graph of relationships between skills that has been curated by content experts. These relationships are things like “You need to have mastered addition of 1-digit numbers before you master addition of 2-digit numbers” and “If you’ve mastered multiplication with carrying, we’re pretty sure you know how to add.” We can use these curated relationships to help infer what a learner already knows. (Right now, you may be asking, “but won’t the knowledge model just learn these relationships if they’re valid?” You’re probably right; I’ll come back to this point at the end.)

And now… the heuristic!

Now that we’ve explicitly set out what information we can leverage, let’s look at the design of our heuristic, which we’re calling “cascading problems.”

In this heuristic, for each skill, a learner is in either a “cascading” (accelerated) state or a normal state. When a skill is in the cascading state, learners receive cascading problems for it in mastery challenges. When a learner first starts a mission, we always (even before the heuristic) show them “mission foundations,” a small set of prerequisite skills that we think you need to know before you can dive into the mission. Given the learner’s choice to work on the mission, we think it’s more likely they already know these foundations, and we start them in the cascading state. (This is how we incorporate our first potential advantage over the knowledge model, the learner’s choice of mission.)

Cascading problems appear in mastery challenges, but differ from normal mastery problems in four ways:

  1. They bypass practice tasks; the skill can immediately appear in a mastery challenge.
  2. There’s no 16-hour waiting period.
  3. Problems answered correctly promote by two levels (e.g. unstarted to level 1). This means it’s possible to get to mastery in only two problems.
  4. Achieving mastery using a cascading problem makes you eligible for cascading problems on that skill’s “postrequisites.”

Point 4 is why we call these problems cascading: so long as you keep answering problems correctly, the acceleration keeps cascading down the tree of prerequisites. This is also how we incorporate our curated content relationships; if you know a skill’s prerequisite, then you’re more likely to know that skill too.

If a cascading problem is answered incorrectly, it immediately moves that skill back to the normal state, and the mastery mechanics work normally. However, it’s possible to begin a cascade again by starting a practice task in a different skill for the first time and answering the first five problems correctly.

The diagram shows an example of how this might look in more concrete terms. Here, a learner mastered “multiplying 1-digit numbers” by answering two cascading problems correctly, and then incorrectly answered a cascading problem in “multiplying by tens.” They see normal mastery mechanics, including needing to complete a practice task, for “multiplying by tens” and its downstream skills. Sibling skills of “multiplying by tens” remain eligible for cascading problems because their prerequisite, “multiplying 1-digit numbers,” was mastered via cascading problems.
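Here's a simplified sketch of the promotion logic implied by those rules. The skill names, the postrequisite graph, and the level ordering are illustrative placeholders, and this is not our production code.

# Simplified sketch of the cascading-problems heuristic described above.
# Skill names and the postrequisite graph are illustrative placeholders.
LEVELS = ["unstarted", "practiced", "level one", "level two", "mastered"]

class SkillState:
    def __init__(self):
        self.level = 0          # index into LEVELS
        self.cascading = False  # eligible for cascading problems?

def answer_cascading_problem(skills, postrequisites, skill, correct):
    """Update state after a cascading problem on `skill` is answered."""
    state = skills[skill]
    if correct:
        # Cascading problems promote by two levels at a time.
        state.level = min(state.level + 2, len(LEVELS) - 1)
        if LEVELS[state.level] == "mastered":
            # Mastery via cascading unlocks cascading on the skill's postrequisites.
            for post in postrequisites.get(skill, []):
                skills[post].cascading = True
    else:
        # A miss drops this skill back to the normal mastery mechanics.
        state.cascading = False

# Example: mastering "mult-1-digit" with two cascading problems unlocks
# cascading problems on its postrequisite "mult-by-tens".
skills = {name: SkillState() for name in ["mult-1-digit", "mult-by-tens"]}
postrequisites = {"mult-1-digit": ["mult-by-tens"]}
skills["mult-1-digit"].cascading = True

answer_cascading_problem(skills, postrequisites, "mult-1-digit", correct=True)
answer_cascading_problem(skills, postrequisites, "mult-1-digit", correct=True)
print(LEVELS[skills["mult-1-digit"].level])  # mastered
print(skills["mult-by-tens"].cascading)      # True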


Does it work?

How do we evaluate whether the heuristic actually works? How do we make sure we’re not moving students on to new things before they’re ready?
To evaluate whether the heuristic is actually accelerating learners, we looked at mission completion — that is, what portion of learners finished a certain percentage of the mission they chose to work on. This has the advantage of being a pretty direct measurement of what we’re trying to achieve: helping learners quickly polish off content they already know and access harder content further into their mission.

Of course, the trivial way to increase this metric is just to award mastery in all skills to everyone! So, we also need a metric that tells us whether students are actually learning the content. For this, we looked at review accuracy. After a learner masters a skill, we periodically show them that skill in a mastery challenge for review. This helps keep old content fresh in the mind, but has the nice side effect of giving us another chance to measure whether that content is actually mastered. So, if we’re achieving our goal of accelerating learners through content they already know, we should be able to increase completion of missions without a substantial drop in review accuracy.
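In code, the two metrics boil down to something like this; the per-learner records below are hypothetical, not our real data pipeline.

# Hypothetical per-learner records for one experiment group: the fraction of
# the mission each learner completed plus their review-problem outcomes.
learners = [
    {"mission_completed": 0.85, "reviews_correct": 18, "reviews_total": 20},
    {"mission_completed": 0.40, "reviews_correct": 7,  "reviews_total": 10},
    {"mission_completed": 0.95, "reviews_correct": 19, "reviews_total": 20},
]

THRESHOLD = 0.80  # "completion" here means finishing at least 80% of the mission

completion_rate = sum(
    rec["mission_completed"] >= THRESHOLD for rec in learners) / len(learners)
review_accuracy = (sum(rec["reviews_correct"] for rec in learners)
                   / sum(rec["reviews_total"] for rec in learners))

print("Mission completion: %.0f%%" % (100 * completion_rate))
print("Review accuracy:    %.0f%%" % (100 * review_accuracy))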

We shipped the cascading problem heuristic (in several variants; I won’t go into the nitty gritty) to a portion of learners and then looked at mission completion and review accuracy for each group.

In the variant that we eventually decided on, we found that mission completion for learners using this heuristic was up by 62% compared to learners without it, and review accuracy was not significantly different. This confirmed that the heuristic was helping to accelerate learners through content that they already knew.

Wrap up

We’ve now shipped the best-performing variant of the cascading problems to everyone so that our students can spend less time proving to us what they already know and more time actually learning.

While this wasn’t exactly traditional data science with fancy models and advanced statistics, it was still a fun bit of creative problem solving. This is a good example of the kinds of projects many of us on the data science team at Khan Academy are working on — identifying a real problem facing our learners, hypothesizing a solution to that problem, employing controlled testing to evaluate that hypothesis, and then implementing a solution to ship to our millions of learners.

Finally, I promised that I’d come back to the question of why our knowledge models wouldn’t capture things like the curated prerequisite relationships we used in the cascading problem heuristic. Short answer: I imagine the models are capturing the curated prerequisite relationships at least to some extent. It’s pure speculation at this point, but I suspect that the reason the cascading problems nonetheless succeed is that they take a bigger risk! That is, they promote students even when we’re not as confident that they’ve mastered the skill. However, by using curated content relationships to decide which risks to take, we make that extra risk very directed. This might mitigate the issues associated with this extra risk and allow students to recover from overpromotion by spending more time on a related skill. Could we incorporate a bit more risk-taking into our models and get similar gains? Almost certainly. Perhaps when that happens, you’ll read about it here.




Come be part of the team!
At Khan Academy, we have a lofty mission: A free world-class education for anyone, anywhere. As you might imagine, this is a very broad mandate! In order to make an impact of this magnitude we’ll need all the help we can get to expand accessibility, provide educational value, and meet the learner where they are. On the Data Science team we’re working full tilt to measure key metrics and analyze or model user behavior to guide this important mission, but there are always more students to reach, more subjects to teach and better interventions to try. Do you have strong analytical skills to bring to the effort? If so, read on.

So, what does a data scientist at Khan Academy actually do?

As a functional specialty within the product development team, we bring our unique analytical skills and perspective to new site features (for example, the Learning Dashboard, introduced fall 2013) or to the development of new educational content (the 7,400+ videos and 2,200+ exercises that promote and assess learning on the site). We design and run experiments to test site improvements and then analyze the results. We collect and report data on how the site is currently being used to inform current work or future projects. We also build and maintain infrastructure that automates some of the work and makes deeper analysis possible.

Do we use fancy-pants probabilistic models and machine learning algorithms? You bet! However, much more often the clearest insights are gleaned from simple models and careful analysis of the vast amounts of data the site accumulates each day. With over 3 billion problems done on the site to date, even the subtlest of improvements can be demonstrably significant, allowing us to put the latest pedagogic theories from academia to the test. For example, we’ve taken Dr. Carol Dweck’s research on Mindset and tested it in the real world. And we continually improve our recommendations system to reduce frustration and accelerate learning.

Since Khan Academy is a non-profit, we can focus entirely on optimizing for individual learning and overall social impact rather than advertising clicks or revenue per user.

Sounds great! How can I help?

We’re looking for strong analytical skills, both in the analysis/metrics domain (for the Product Analyst role) and in modeling (for the Data Scientist and Engineer role). You’ll be working closely on projects with other developers and will need to be able to hold your own writing production-ready code. And above all you will need to communicate your findings with your teammates and others in the company so we can act on your insights together.


There is no shortage of big unknowns to explore, and your personal impact will be huge. So what are you waiting for? Apply today:

Apply for Data Scientist position (Full-time)
Apply for Data Scientist and Engineer position (Full-time)
Apply for Data Scientist and Engineer position (Intern)
Apply for Product Analyst position (Full-time)

I need answers now! Using simulation to jump-start an experiment (Part II)

NOTICE: No users were harmed in the writing of this blog post.

In the last installment of this series I talked about a very light-touch version of user modeling - we just took the existing user population and distorted it to approximate what it might look like under our proposed experiment. This is a really useful trick to have up one’s sleeve, but it only works for a very limited set of experiments. So to follow up we took the next logical step and created a complete virtual online world to plug users into to see how they behave in different situations.

Just kidding! We don’t have the budget for that. (Yet.)

Case study 2: User simulator

Suppose we are introducing a completely new feature on the site for which there is no historical analogue. In our specific case, we might introduce a new type of practice task which presents the user with different math problems than the normal tasks would. How will those users perform? How long will it take them to complete all the content? Are they better off using the new practice task or sticking with the old one?

In order to answer these questions, we need two components:
  1. A set of working models (corresponding to different types of users at different ability levels) of user performance on a task, drawn from historical data. A simulated user will take actions based on the probabilistic predictions of their user model.
  2. An automated test harness that can simulate the outputs and options presented to the simulated user at any given point and respond to their inputs appropriately, while reporting useful metrics such as overall accuracy and time-to-completion.

The simulated user

A simulated user can be as simple or as complex as we like, depending on what questions we want to answer. The simplest model for a student attempting math problems is a single number representing their overall accuracy level. At each problem we choose a random number from 0 to 1; if that number is under the threshold, we submit a correct answer, otherwise we submit an incorrect one (see the sketch after the list below). Even though this is an unrealistic model, it can help answer some important questions:
  • Can a “perfect” user (99% or 100% accuracy) actually complete a mission? (This is helpful for catching accidental circular dependencies or other bugs that block progress)
  • How does the time-to-completion vary with overall accuracy? Do we over-penalize for silly mistakes?
  • What is the minimal steady-state accuracy required to actually complete a mission?
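Here's roughly what that simplest accuracy-threshold user looks like in code. It's only a sketch of the idea behind the ProbabilisticUser referenced in the config later in this post, not the actual simulator class.

import random

class ProbabilisticUser:
    """Simplest simulated learner: a single accuracy level, no learning."""

    def __init__(self, accuracy):
        self.accuracy = accuracy  # probability of answering any problem correctly

    def answer(self, problem):
        # Draw a number in [0, 1); answer correctly if it falls under the
        # threshold, regardless of which problem is being asked.
        return random.random() < self.accuracy

# Example: a 90%-accurate user attempting 1,000 problems.
user = ProbabilisticUser(accuracy=0.9)
correct = sum(user.answer(problem=i) for i in range(1000))
print("Answered %d of 1000 problems correctly" % correct)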

It is easy to imagine various ways to improve the user model with real data: we vary the accuracy according to the difficulty of the question, increase the accuracy monotonically to simulate learning, etc. We need these more subtle models to be able to compare two treatments of a site feature with respect to:
  • Does overall accuracy go up or down in alternative B compared to A?
  • Do more users in alternative B actually complete the mission compared to alternative A?
  • Do users who complete a mission in alternative B do so in less time than alternative A?

There is definitely a point of diminishing returns, though. Our user model can’t predict perceptual effects like what font or color users will pay attention to or how they will react to various intrinsic or extrinsic motivators.

The test harness

Once you have a simulated user, you need a simulated version of the site for them to interact with.
You: Hey, isn’t that a lot of work to build?
Me: Yes, it sure is!
However, we’ve already done all that work in order to get integration tests working! End-to-end integration tests create user entities in a test database, make API calls on their behalf, and perform other necessary functions like temporarily overriding the current date and time. They also run in parallel and clean up after themselves between tests, which is exactly what we need to run a bunch of simulated users through a set of tasks independently. The more we can leverage that existing work, the easier it becomes to create a functioning user simulator.

After delegating setup and teardown to the code shared with tests, the test harness is responsible for creating a user entity, switching to the designated mission, fetching the list of recommended tasks, and completing them one by one, delegating any decisions (order to attempt tasks, correct or incorrect on each problem) to the user model it was initialized with. Different experimental conditions can be enabled or disabled for different subsets of users to simulate multiple A/B test alternatives. When the harness detects that the mission was completed or an error occurred it will write statistics for each alternative to a log and exit.

This scheme has turned out to work even better than expected. Aside from a few simulator-specific performance improvements the business logic is running the same code as in production. On a beefy machine with plenty of processors and memory, we can simulate hundreds of users in minutes, which can give us a quick sanity check that new features aren’t going to break or degrade the experience. The simulator has even caught a few regressions that could block a user from completing a mission. We now run it nightly as another continuous integration test.

Example

To see how one might use the simulator, here is an example. This fall we have been working intently on accelerating progress through math missions for users who already know the material. This can be beneficial for students starting at a level below their actual skill level, or wanting to review concepts they’ve already learned. We want to make the process of “catching up” to where you ought to be as quick and painless as possible, and one proxy for this is the time it takes to complete a mission for a user with high accuracy. We had already implemented an experiment to introduce “booster tasks”, which promote the user to a higher mastery level on a group of skills if he or she completes all the problems in the task at a high level of accuracy. The simulator allowed us to validate that the results of this experiment would be positive before actually shipping it to users.

The user simulator is highly configurable, and all I needed to do to run a simulated version of an already-implemented experiment was to create a YAML file with the configuration I want:

# the slug of the mission you'd like to simulate
mission: cc-third-grade-math

# simulated users are run in parallel to each other. you can
# specify the number of processes in order to maximise
# performance for your machine's cpu's.
num_processes: 4

# whether or not to use the test db specified in datastore_path
use_test_db: true

# path to your datastore.
datastore_path: ../current.sqlite

# specify the parameters of the simulated users and the experiment
# groups into which they are segmented.
experiment_groups:
    # name of the experiment group
    group_a:
        ProbabilisticUser:
            num_users: 50

            # A/B test alternatives
            bigbingo_alternatives:
                booster_tasks_v3: control

            # the parameters with which the users are initialised.
            params:
                # session time per day in seconds.
                max_time_per_day: 1200

                # initial probability of getting problems correct
                starting_prob: 0.9

                # rate at which the ability increases per problem.
                learning_rate: 0.0

                # maximum ability.
                max_prob: 1.0

    group_b:
        ProbabilisticUser:
            num_users: 50

            # A/B test alternatives
            bigbingo_alternatives:
                booster_tasks_v3: booster
                booster_task_length: length-6
                booster_task_min_problems: min-problems-12

            # the parameters with which the users are initialised.
            params:
                # session time per day in seconds.
                max_time_per_day: 1200

                # initial probability of getting problems correct
                starting_prob: 0.9

                # rate at which the ability increases per problem.
                learning_rate: 0.0

                # maximum ability.
                max_prob: 1.0

This will create 100 simulated users split into two groups. Both groups have the same internal model: they are highly accurate users (who get exactly 90% of problems correct no matter the question) who are going to just tear through any problems we give them. They do, however, make mistakes, and the distribution of outcomes is going to reflect how the system reacts to those mistakes. The key difference between the two groups (the bigbingo_alternatives settings) is that one will be enrolled in the “booster_tasks_v3” experiment and the other won’t.

The results are emitted in CSV form. Here is the distribution of the most important statistic - how many problems it takes to complete the mission - for 90% accurate users and 95% accurate users:

As a reward for reading this far, here's a tasty graph!

In the “without boosters” condition, the primary acceleration mechanic is “cascading challenge” exercises, which continue to fast-track the user through mastery levels on consecutive skills while they are getting answers correct. This works OK for 95% accuracy users, but when the user starts making careless errors those errors can have a huge effect on completion times, as we can see for the 90% accurate users - the range is 130-620 problems! (Note that this simple model assumes it’s equally likely the user errs on a simple problem as a complex problem, which is not true in practice.)

With boosters, not only is the number of problems required significantly smaller, but the variability is also dramatically reduced. The worst case is now down to a manageable 190 problems. Of course, we have to validate that users who don’t actually know the material don’t also get promoted at a faster rate. We can just tweak the parameters and run the simulator again.

Conclusion

Even without going down the rabbit hole of creating really sophisticated or realistic user models, we have already derived a huge benefit from the ability to roughly compare different treatments on the site and sanity check that increasingly complex systems work as designed. Now whenever we are considering a new improvement or feature and we want to know whether it will be effective, we can go down the list:
  1. Can we find evidence for it in our existing user data?
  2. Can we run a simulation based on existing user data to find evidence?
  3. Can we run an experiment with actual users to gather evidence?
Simulations add another form of fast, immediate feedback that doesn’t require shipping anything to users. Having a system in place that automates certain kinds of simulations makes the cost/benefit tradeoff even more favorable, so there’s absolutely no reason not to run a simulation of an experiment before shipping it.

If you enjoyed this post and you are interested in the software development side of things, check out my personal blog at arguingwithalgorithms.com.