The Experiment

Monday, October 05, 2015 · KADataScience


Not every experiment goes the way you want it to. One of the most humbling things about dealing with data - and the one I was least prepared for when I started running experiments - is the ever-present threat of wasting a lot of time (or worse, making an ill-informed decision) because of some really simple mistakes or unforeseen “gotchas” when constructing your experiment. However, it’s important to us as a team (and maybe even to us as an industry *nudge*) to talk as openly about our mistakes as our successes. So in the spirit of avoiding publication bias, here is a post about a fun, lighthearted experiment that taught me a lot about the tricky business of data collection.


Remember this post? We wanted to advertise job openings on the Data Science team, and I had a lot of ideas about how to go about it. After putting some ideas down, I realized I had two possible approaches to the pitch, which, simply put, were: “here are the people you’ll get to work with” and “here are the projects you’ll get to work on”. We deal with these kinds of design decisions every day, and our go-to method for resolving them is the A/B test. This struck me as an amusing idea: could I run two versions of the post and compare their performance? This isn’t something I’d seen done before, so I decided to give it a try.

In any experiment the first decision to be made is: what are we optimizing for? In this particular case, the choice is obvious: we want to maximize the number of job applications received as a result of reading the blog post. However, I didn’t want to rely exclusively on this since the expected conversion rate was fairly low - thousands of people read any given blog post, but only a handful generally apply for a job. So I also wanted to track some other proxy for engagement, and I settled on the number of times the post was shared. More shares would mean the post struck a chord with readers and reached a wider audience.

While we run similar tests all the time, I had never run an A/B test on a blog post before. The blog itself is hosted on Blogger, so I didn’t have access to our amazing A/B test framework. I had to improvise ways to show the different alternatives and measure the variables I was interested in. Luckily, I had the ability to add some JavaScript to the page, and so I wrote up a snippet that uses both the location hash and browser cookies to assign a user to a persistent alternative:


$(function() {
  // Map the (deliberately inconspicuous) location hash markers to alternatives.
  var getHash = function() {
    return (window.location.hash === "#___") ? "contentA" : ((window.location.hash === "#__") ? "contentB" : null);
  };
  var setHash = function(value) {
    window.location.hash = (value === "contentA") ? "___" : ((value === "contentB") ? "__" : "");
  };
  var setCookie = function(value) { ... }
  var getCookie = function() { ... }

  if ($(".PostBody .contentA").length > 0) {
    // Reuse a previously assigned alternative if there is one; otherwise flip a coin.
    var alt = getHash() || getCookie() || ((Math.random() > 0.5) ? "contentA" : "contentB");

    setHash(alt);
    setCookie(alt);

    // Show the appropriate content
    $(".PostBody .loading").hide();
    $(".PostBody ." + alt).show();

    // Fix up page load tracker img links (the web beacon)
    var $el = $(".PostBody ." + alt + " img.page-load");
    $el.attr({src: $el.data("src")});

    // Fix up Twitter/Facebook links so shares point at the same alternative
    $(".share-story .tips").each(function(idx, el) {
      if ($(el).data("title") === "Facebook" || $(el).data("title") === "Twitter") {
        $(el).attr("href", $(el).attr("href").replace(
          ".html", ".html%23" + (alt === "contentB" ? "__" : "___")));
      }
    });
  }
});
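
The setCookie and getCookie bodies are elided above. As a rough sketch of what such helpers might look like (the cookie name and details here are my guesses, not the actual implementation from the post):

// Hypothetical cookie helpers for persisting the assigned alternative;
// the real implementations are elided in the snippet above.
var COOKIE_NAME = "abAlternative";  // assumed name, not taken from the original post

var setCookie = function(value) {
  // Persist for roughly a year, scoped to the whole blog.
  document.cookie = COOKIE_NAME + "=" + encodeURIComponent(value) +
    "; max-age=" + (60 * 60 * 24 * 365) + "; path=/";
};

var getCookie = function() {
  var match = document.cookie.match(new RegExp("(?:^|; )" + COOKIE_NAME + "=([^;]*)"));
  return match ? decodeURIComponent(match[1]) : null;
};
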
There were two different versions of the post body on the page, both initially hidden (I removed one of the alternatives when I ended the experiment). There are several moving parts to this experiment:
  • If the user hasn’t seen an alternative yet (as stored in either the location hash or the cookie), we randomly assign them one and rewrite both the hash (using “#__” and “#___” so the markers are as inconspicuous as possible) and the cookie.
  • We show the appropriate content given the alternative the user has been assigned.
  • There is an <img> tag in each of the post body versions that contains a shortened link to a blank image, which serves as a web beacon. This isn’t visible on the page, but the URL shortener service theoretically (see below) allows me to count the number of views of each alternative. However, if I just embed the shortened URL into the Blogger post, Blogger (helpfully) follows the link and replaces it with a static cached copy, so I have to set the “src” attribute from JavaScript to avoid that problem. (The markup these pieces assume is sketched after this list.)
  • I want shares of the page, if at all possible, to point to the same alternative that the person sharing saw. This allows me to test whether one alternative gets shared more than the other: if this is true, then the more shared alternative will tend to have more views. So I rewrite the Facebook and Twitter share button links to include the alternative markers corresponding to the correct alternative.
  • Crucially, the referral links from the page to the job postings in Greenhouse are different depending on the alternative, so we can track which applicants were referred from which version of the post.
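
For reference, the markup the snippet expects looks roughly like the sketch below. The class names come from the selectors in the script, but the overall structure, the placeholder URLs, and the referral parameter are my reconstruction rather than the actual post source:

<div class="PostBody">
  <div class="loading">Loading…</div>

  <!-- Alternative A: the “people you’ll work with” pitch, hidden until the script picks it -->
  <div class="contentA" style="display: none">
    ... pitch A ...
    <!-- web beacon: a shortened link to a blank image; src is filled in by the script -->
    <img class="page-load" data-src="https://goo.gl/XXXXXX" alt="">
    <!-- hypothetical per-alternative referral link to Greenhouse -->
    <a href="https://boards.greenhouse.io/...?ref=contentA">Apply here</a>
  </div>

  <!-- Alternative B: the “projects you’ll work on” pitch -->
  <div class="contentB" style="display: none">
    ... pitch B ...
    <img class="page-load" data-src="https://goo.gl/YYYYYY" alt="">
    <a href="https://boards.greenhouse.io/...?ref=contentB">Apply here</a>
  </div>
</div>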

I must admit, I was pretty proud of myself for working through all these issues. I published the post and moved on to other projects.


Fast forward to August. It’s been a fantastic summer here at Khan Academy: we’ve shipped a brand-new Official SAT Practice product to the site, hacked together prototypes in our annual Healthy Hackathon, and read about some interesting research in Statistics Club. And we’ve got some big challenges to look forward to in the fall. So I figured now was a good moment to take a break and evaluate this really quick experiment I launched about half a year ago.

First, here are some results:

                                 Total      Alternative A   Alternative B
Views (according to Blogger)     11,131
Views (according to goo.gl)      3,325      1,937           1,388
                                 (100%)     (58.26%)        (41.74%)
Résumés submitted                11         6               5


The first thing to note is the huge discrepancy (more than a threefold difference) between the total views according to Blogger’s built-in analytics and the number of views I was able to capture using the web beacon. Since far more people than that browse with JavaScript enabled, and therefore should have been seeing the post body (and I haven’t seen any reports of missing body text), all I really have are theories to explain this gap, such as:
  • Ad blockers preventing the image from loading somehow
  • RSS readers fetching the post once and replacing the images with their own cached copies
  • Something particular to how Blogger counts page loads that inflates that number substantially

The RSS issue is potentially a big problem for the experiment: although I don’t have any particular evidence that this happened, it’s entirely possible for an RSS reader to fetch the blog post once (arbitrarily capturing one of the alternative bodies) and serve that cached copy to a large set of users. This could skew the results in a systematic way without any way for me to detect it. That can happen with any web page, of course, but I would expect it to be a much bigger problem for a blog post than for an ordinary page.

One final indignity: I found out when I went to analyze the results that Greenhouse doesn’t track clicks on the referral links, so although I know how many résumés were submitted from each alternative - it was about 50/50 - I can’t tell whether significantly more viewers of one alternative clicked through to the job postings. I should have verified that before launching the experiment.

Together, these issues dealt a big blow to my analysis of the experiment: they undermined any hope that the underreporting bias was at least uncorrelated with the alternative a reader saw.
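
To make the stakes concrete: if the goo.gl beacon counts could be taken at face value, the split between the alternatives would be far too lopsided to be chance. Here is a quick back-of-the-envelope check (my own aside, not an analysis from the original post):

// Back-of-the-envelope check: is a 1,937 / 1,388 split of 3,325
// beacon-counted views plausible under a fair 50/50 assignment?
var a = 1937, b = 1388, n = a + b;

// Normal approximation to the binomial: mean n/2, standard deviation sqrt(n)/2.
var z = (a - n / 2) / (Math.sqrt(n) / 2);

console.log("share of views seeing alternative A:", (100 * a / n).toFixed(2) + "%");  // 58.26%
console.log("z-score versus a fair coin:", z.toFixed(1));                             // about 9.5

// A deviation that large would normally be strong evidence that one alternative
// reached more readers - but only if the undercounting is uncorrelated with the
// alternative, which is exactly the assumption the issues above call into question.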


So, what did I learn? Mostly that running controlled experiments on social media is really challenging. For example, a completely natural thing for people within Khan Academy to do was to visit the blog post and then send out links on Twitter and Facebook. However, this could wildly skew the outcome if they didn’t know to strip the hash markers from the end of the URL! I had to go to great lengths to make sure everyone knew to post only the unhashed link, and I still ended up pushing an emergency update to the JavaScript snippet the day the post was published, after the official Khan Academy account accidentally tweeted out the wrong link.

Will I run an experiment like this again? Maybe, someday. However, getting it right is more of an undertaking than I initially thought, and it’s very, very hard to verify that the resulting data is accurate, so I would reserve this trick for cases where there’s likely to be a much stronger effect on the final metric (in this case, submitted résumés).

Hope you learned something! And please share your own “sad trombone” moments in the comments.

Comments:

  1. I've had many sad trombone moments, but never turned them around into a really interesting blog post! I love this, thank you for sharing your lessons.

  2. You have a Statistics Club at Khan Academy? I want to join.
