Category: A/B Experiments

  • The Truth about A/B Testing

    Last Thursday 100+ people crammed into a fireside chat on Growth and Product.**

    This blog post covers the conundrum of statistical significance in A/B experiments:

    1. why the size of the uplift is important

    2. how much data gives you statistical significance?

    3. how long you have to run an experiment (you will be shocked)

    4. are you better off tossing a coin?

    5. how what you pick may delay experiment cadence.

    Let’s start with this top-level discussion (** with Jordan from Deputy).

    Firstly, here are some links if you are interested in the basics and maths of A/B testing.

    Jordan illustrated the point with this chart. The curve shows that even large startups have volume challenges. Even with 1000 customers/week entering a 50:50 A/B test, if you are only looking for a 5% uplift on an existing 25% conversion rate, you would need to wait 35 weeks for statistical significance!

    Only then can you make a data-driven decision.

    Source: Jordan Lewis, Deputy
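    As a rough sanity check on that number, here is a minimal Python sketch of the standard two-proportion sample-size formula. The 95% confidence level and 80% power are my assumptions (the chart’s exact inputs aren’t stated), so treat this as an approximation of Jordan’s chart rather than its exact maths:

    ```python
    import math

    def weeks_to_significance(baseline=0.25, relative_uplift=0.05,
                              visitors_per_week=1000,
                              z_alpha=1.96,   # 95% confidence
                              z_beta=0.84):   # 80% power (assumed)
        """Approximate weeks for a 50:50 A/B test to reach significance."""
        p1 = baseline
        p2 = baseline * (1 + relative_uplift)        # 5% relative uplift on 25% -> 26.25%
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        n_per_arm = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
        return n_per_arm / (visitors_per_week / 2)   # 50:50 split -> 500 visitors per arm per week

    print(round(weeks_to_significance()))            # ~38 weeks, the same ballpark as the chart
    ```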

    You have a lot of things to agonise over when you are losing prospects in your funnel (whether it be registration, activation or getting a payment): which elements do you pick? Which fields do you remove? (Please refer to my earlier Deputy blog post on trialler incentives.) So you need to pick your A/B experiments carefully.

    Simply put: Jordan makes the point that you can only run one experiment (on a given page/process) at a time, until you have a “winner”. Only then can you start the second. This means that elapsed time will hamper your productive output (per year) as a Product or Growth team.

    Gut-Data-Gut – There is a wonderful talk from Stanford about how you need to optimise with the expertise and prudence of your team. Because of time constraints:

    A) you MUST make bets on the experiments with the biggest potential upward movement.

    B) you MUST make bets where the downward movement is smallest if the experiment fails (your failures are NOT glorious).

    Monitoring both impacts is critical to ensure you are converging on the best experiments and not doing damage in the process!

  • Deputy’s onboarding gamification – genius!

    In the upcoming full interview with Deputy.com’s Growth team leads (Francois Bondiguel – Head of Growth, and Jordan Lewis – Director of Growth) we cover a lot of great Onboarding/Growth experiences and ideas.

    But I’ve clipped this onboarding gamification example for special attention – it’s the most genius growth hack you are going to see this week!


    B2B SaaS Trials

    In B2B SaaS it’s super common for the prospect to ask for an extension to their trial.

    At Contextual we get this all the time because, whilst the Product Manager loves what they see, they need to schedule developers to integrate the mobile SDKs – so our 14-day trial is sometimes too short.

    Deputy recognized this as a reality of their business and their customer success, so they flipped it on its head and allowed customers to self-serve their trial extension by performing onboarding tasks.

    Let’s hear what Jordan and Francois have to say:

    Specific Activities that get an extension

    This video covers the goals that Deputy want their prospective customers to achieve. For example, getting on a screenshare call is a reliable predictor of a trialler becoming a paid customer.

    The extension activities are:

    1. Add your Business Name and details (gives you 6 days).
    2. Book a screenshare.
    3. Set up your mobile number and install the mobile App (earns 2 days).
    4. Add your employees (5 extra days).
    5. Publish a shift (unlocks more free days).
    6. Approve a timesheet (unlocks free days).
    7. Choose your plan (lets you keep your free days).

    The next video digs in and shows you the type of dialog and carousel content they show the user to explain and encourage.

    Growth is iteration

    Finally, Jordan mentions at the end of the video the number of times they have iterated this flow – have a listen, you will be surprised!

    Tools like Contextual allow you to iterate quickly on some of this content without disrupting the product roadmap. Not everything can be done “no-code”, but a lot of experiments can be tested and measured to see what works best!

    Aligned goals

    This is a great example of thinking about the customer’s goals and recognizing the overlap with the Product/Growth team’s goals. As I discussed in the post “Goals, Segments, User Activation”, don’t prioritize your goals above the user’s goals. They have a JTBD (job to be done) and that is what will drive activation and retention!

  • Onboarding A/B Tests – the math by example

    In the previous post I ran through why it makes sense to run onboarding experiments and measure them under an A/B or A/A/B methodology. I stuck to the qualitative principles and didn’t get “into the weeds” of the math. However, I promised to follow up with an explanation of statistical significance for the geek-minded.
    Because A/B testing has been around for a very long time in various “web” fields such as landing page optimisation, email blasts and advertising, this is by no means the first, last or most useful treatment of the topic. The purpose here is to:

    • tightly couple running onboarding and education to a purpose, namely to:
      • make onboarding less “spray and pray” and head towards more ordered, continuous improvement
      • deepen user engagement with your App’s features.
    • explain why the Contextual Dashboard presents these few metrics rather than a zillion pretty charts that don’t do anything other than befuddle your boss.

    In this case, we will consider a simple A/B test (or Champion vs Challenger).

    Confidence for statistical significance

    Back to that statistics lecture again: my 2nd-year engineering statistics class was in the evenings and usually preceded by a student’s meal of boiled rice, soy sauce and Guinness (the nutrition element), so I’ll rely more on Wikipedia than my lecture notes 🙂

    If you think about your A and B experiments, you should get a normal distribution of behaviour. Plotting it on a chart, you get the mean as the center point of the curve and the population spread either side of the center, yielding a chart like this.

    A Confidence Interval is the range of values in a normal distribution that contains a given percentage of the population. In the chart below, 95% of the population is in blue.

    Most commonly the confidence interval of 95% is used, here is what Wikipedia says about 95% and 1.96:

    95% of the area under a normal curve lies within roughly 1.96 standard deviations of the mean, and due to the central limit theorem, this number is therefore used in the construction of approximate 95% confidence intervals.
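    If you want to see where 1.96 comes from, rather than taking Wikipedia’s word for it, it is simply the two-tailed 95% critical value of the standard normal distribution. A quick check (assuming SciPy is available):

    ```python
    from scipy.stats import norm

    # A two-tailed 95% interval leaves 2.5% in each tail, so the critical value
    # is the 97.5th percentile of the standard normal distribution.
    print(norm.ppf(0.975))   # 1.9599639... -> the familiar "1.96"
    ```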

    The Math by Example

    Let’s take a simple example: an App is in its default state as the engineers delivered it, and a new feature has been delivered, but the Product Manager wants to increase the uptake and engagement of the feature. The goal is to split the audience and measure the uplift of the feature.

    We call usage of the new feature a “convert”, so a 10% conversion rate means that 10% of the total population of “split matches” converted.

    CHAMPION

    This is the App’s default state.

    • T = 1000 split matches
    • C = 100 convert (10% conversion rate).
    • 95% range ⇒ 1.96

    The standard error for the champion:

    SE = SQRT(0.1 * (1 - 0.1) / 1000) = 0.00949

    95% margin of error = 1.96 * SE = 1.96 * 0.00949 = 0.0186

    • C ± margin of error
    • 10% ± 1.9% = 8.1% to 11.9%

    CHALLENGER:

    This is the App’s default state PLUS the Product Manager’s tip/tour/modal to educate users about this awesome new feature.

    • T = 1000 split matches
    • C = 150 convert (15% conversion rate)
    • 95% range ⇒ 1.96

    The standard error for the challenger:

    SE = SQRT(0.15 * (1 - 0.15) / 1000) = 0.01129

    95% margin of error = 1.96 * SE = 1.96 * 0.01129 = 0.02213

    • C ± margin of error
    • 15% ± 2.2% = 12.8% to 17.2%

    Now let’s chart these two normal distributions to see the results. Since there is no overlap at the 95%/1.96 confidence level, the variation’s results are accepted as reliable. (I couldn’t figure out how to do the shading for the 95%!)

    In this case you can conclude that the A/B test has succeeded with a clear winner, which can be declared the new champion. If you refer back to the last post, iteration can then be part of your methodology to continuously improve.
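    If you would rather not punch this into a spreadsheet, here is a small Python sketch of the same arithmetic: it computes the 95% interval for each variant and checks for overlap (the function name is mine, purely for illustration):

    ```python
    import math

    def confidence_interval(converts, total, z=1.96):
        """Return the (low, high) 95% interval for a conversion rate."""
        p = converts / total
        se = math.sqrt(p * (1 - p) / total)   # standard error
        margin = z * se                       # 95% margin of error
        return p - margin, p + margin

    champion = confidence_interval(100, 1000)     # ~ (0.081, 0.119) -> 8.1% to 11.9%
    challenger = confidence_interval(150, 1000)   # ~ (0.128, 0.172) -> 12.8% to 17.2%

    # No overlap: the challenger's lower bound sits above the champion's upper bound.
    print("Clear winner" if challenger[0] > champion[1] else "Keep collecting data")
    ```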

    How long should an experiment run?

    Experiments should run to a statistical conclusion, rather than rubbing your chin and saying “let’s run it for 3 days” or “let’s run it in June”. Period-based decisions feel logical to humans, but that has nothing to do with the experiment**.
    So my example above would technically not be helpful if the data hadn’t provided a conclusive result – this is argued in a most excellent paper from 2010 by Evan Miller. Vendors of dashboard products like ours can encourage the wrong behaviour by tying the experiment to a time period.

    ** except for the behaviour of your human subjects – for example, if your demographic is all on summer holidays.

  • Mobile Onboarding A/B testing simply explained

    In earlier posts about Google’s and Twitter’s onboarding tips we mentioned they would absolutely be measuring the impact of Tips and Tours to get the maximum uplift of user understanding and engagement.

    One method is simply looking at your analytics and checking the click-through rate or whatever other CTA (call-to-action) outcome you desired. But two big questions loom:

    1. Is what I’m doing going to be a better experience for users?
    2. How do you “continuously improve”?

    In recent years, rather than a “spray and pray” approach, it has become favorable to test-and-learn on a subset of your users. Facebook famously runs many experiments per day, and because their audience size and demographic diversity is massive they can “continuously improve” to killer engagement. If they “burn a few people” along the way, it’s marginal collateral damage in the execution of their bigger goals.

    That sounds mercenary, but the “greater good” is that learning the effectiveness of your experiments will result in better user experiences across the entire user base and more retained users.

    What do I mean by Mobile Onboarding?

    Onboarding is the early phases of a user’s experience with your App. A wise Product Manager recently said to me “on-boarding doesn’t make a product… …but it can break the product”.

    If you are familiar with Dave McClure’s “startup metrics for pirates” – then the goal of Onboarding is to get the user to the “AR” in “AARRR”. To recap:

    • A – Acquisition
    • A – Activation
    • R – Retention
    • R – Referral
    • R – Revenue

    So Onboarding’s “job” is to get a user Activated and Retended or Retentioned (can I make those words up? OK, OK “Retained”).

    Because a user’s attention span is slightly worse than a goldfish’s, your best shot is to get the user Activated in the 1st visit. Once they are gone, they may forget you and move on to other tasks.

    Yes – but specifically what do you mean by Onboarding?

    Activation is learning how a user gets to the “ah-ha” moment and cognizes your App’s utility into their “problem solving” model. Specific onboarding actions are:

    • Get them some instant gratification
    • Get them some more instant gratification
    • Trade some gratification for a favour in return
      • User registration
      • Invite a friend
      • Push notification permission
    • Most importantly it is the education and execution of a task in the App that gets the “ah-ha” moment. This is often:
      • Carousels
      • Tips
      • Tours
      • Coachmarks
      • A guided set of tasks

    Progressive (or Feature) Onboarding

    Any App typically has more than one feature. Many retailers, banks, insurers, real-estate, telcos (and others) have Apps that have multiple nuggets of utility built into the App.

    This is because they have a deep, varied relationship with their customers and multiple features all need to be onboarded. We can’t decide what to call this yet – it’s “feature” driven – but the goal is to progressively deepen a user’s understanding of, and value extracted from, the App.

    So onboarding (and A/B testing) applies to more than the first “activation” stage of the App.

    What is A/B testing?

    A/B testing, or split testing, is a simple experiment to determine which option, A or B, produces a better outcome. It observes the effect of changing a single element, such as presenting a Tip or Tour to educate a user.

    Champion vs Challenger

    When the process of experimentation is ongoing, the process is known as champion/challenger. The current champion is tested against new challengers to continuously improve the outcome. This is how Contextual allows you to run experiments on an ongoing basis so you can continue to improve your Activation.
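    One practical detail when running champion/challenger on an ongoing basis is that the split must be stable per user: the same person should always see the same variant. Here is a minimal sketch of one common way to do that with a deterministic hash; it is an illustration only, not how Contextual implements its splits:

    ```python
    import hashlib

    def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
        """Deterministically assign a user to 'champion' or 'challenger'.

        Hashing the user id together with the experiment name means the same user
        always sees the same variant, and different experiments split independently.
        """
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF    # roughly uniform value in [0, 1]
        return "champion" if bucket < split else "challenger"

    print(assign_variant("user-42", "onboarding-tour"))
    ```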

    A/B Testing Process

    Step 1: Form a hypothesis around a question you would like to test. The “split” above  might be testing an experiment (based on a hypothesis) that running a Tip or Tour will influence a “Success Metric” of “Purchases”.

    The “Success Metric” does not need to be something so obvious; it may be testing the effectiveness of an experiment to alter “times opened in last 7 days” across the sample population.

    Here’s another example teaching a user how to update their profile and add a selfie.

    Step 2: Know you need statistical significance (or confidence). See the section below on this – it’s a bit statistical, but in summary it is the certainty you want that the outcome of your experiment reflects the truth. Do not simply compare absolute numbers unless the two numbers are so different that you can be sure just by looking at them, such as a difference in conversion rate between 20% and 35%.

    Step 3: Collect enough data to test your hypothesis. The more subtle the variation under experiment, the more data needs to be collected to make an unambiguous distinction at the level of statistical confidence decided in Step 2.

    Step 4: Analyse the data to draw conclusions. Contextual provides you with a comparison of performance for every campaign grouped by the same “success metric”. The chart below shows:

    • Blue is the Control Group (Champion)
    • Green is your Experiment (Challenger)
    • The last 30 days of history.

    “Contextual automatically captures screen visits and button clicks without you needing to a-priori think about it”
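    If you want to double-check the dashboard’s comparison yourself, a standard way to formalise Step 4 is a two-proportion z-test. A minimal sketch (the counts below are hypothetical):

    ```python
    import math

    def two_proportion_z(conv_a, n_a, conv_b, n_b):
        """z-score for the difference between two conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        return (p_b - p_a) / se

    # Control group (champion) vs experiment (challenger), hypothetical counts.
    z = two_proportion_z(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(round(z, 2), "significant at 95%" if abs(z) > 1.96 else "not significant yet")
    ```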

    Iterate

    Step 5: Build from the conclusions to continue further experiment iterations.

    Sometimes this might mean:

    • Declaring a new “Champion”
    • Refining a new “Challenger”
    • Or scrapping the hypothesis.

    The most impressive results come from having a culture of ongoing experiments. It will take some time but ultimately the Product Manager can recruit others in their team (developers, QA, growth hackers) to propose other experiments.

    Statistical Significance

    Picking the right metric

    Running experiments is only useful if you selected the correct “Success Metric” to examine. In Contextual we allow you to automatically chart your “Success Metric” comparisons, but we also allow you to “what-if” other metrics. Contextual:

    • automatically captures screen visits and button clicks without you needing to a-priori think about it.
    • allows you to sync data from your backend systems so you can measure other out-of-band data like purchases or loyalty points etc.

    A/A/B or A/A/B/B Testing

    It has become more common to also run a duplicate, identical cell of the experiment to eliminate any question of statistical bias introduced by the A/B tool itself. If the variation between A/A (or B/B) is “statistically significant”, then the experiment is invalidated and should be rejected.
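    One way to picture the A/A check: run the same significance test between the two identical A cells, and if it comes back “significant” then something about the assignment or instrumentation is biased. A minimal sketch using a two-proportion z-score (the counts are hypothetical):

    ```python
    import math

    def z_score(conv_1, n_1, conv_2, n_2):
        """Two-proportion z-score, here comparing the two identical A cells."""
        p1, p2 = conv_1 / n_1, conv_2 / n_2
        pooled = (conv_1 + conv_2) / (n_1 + n_2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
        return (p2 - p1) / se

    z_aa = z_score(102, 1000, 97, 1000)   # two cells that *should* behave identically
    if abs(z_aa) > 1.96:
        print("A/A cells differ significantly: assignment looks biased, invalidate the run")
    else:
        print("A/A cells agree: the A vs B comparison can be trusted")
    ```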

    Sample Size and Significance

    If you toss a coin 2 times it’s a lousy experiment. There is an awesome Derren Brown “10 heads in a row” show. Here’s the spoiler video! If you remember back to your statistics classes at College/University, the confidence intervals (built from the “standard error”, not the “standard deviation”) of both A and B need to NOT overlap in order to have significance.

    Where T = test group count, C = converts count and the 95% range is 1.96, the Standard Error is:

    SE = SQRT( (C/T) * (1 - C/T) / T ), and the 95% interval is C/T ± 1.96 * SE

    I’ll do a whole separate post on it for the geeks but using a calculator in the product is good enough for mortals 🙂
    UPDATE: The geek post is here!

    A/B testing vs multivariate testing

    A/A/B is a form of multivariate testing. But multivariate testing is usually a more complicated form of experimentation that tests changes to several elements of a single page or action at the same time. One example would be testing changes to the colour scheme, the picture used and the title font of a landing page.

    The main advantage is being able to see how changes to different elements interact with each other. It is easier to determine the most effective combination of elements using multivariate testing. This whole-picture view also allows smaller elements to be tested than with A/B testing, since these are more likely to be affected by other components.

    However, since testing multiple variables at once splits up the traffic stream, only sites with substantial amounts of daily traffic are able to conduct meaningful multivariate testing within a reasonable time frame. Each combination of variables must be separated out. For example, if you are testing changes to the colour, font and shape of a call to action button at the same time, each with two options, this results in 8 combinations (2 x 2 x 2) that must be tested at the same time.
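    To make the combinatorial explosion concrete, here is a tiny sketch that enumerates the cells for that colour/font/shape example and the share of traffic each one would receive:

    ```python
    from itertools import product

    colours = ["blue", "green"]
    fonts = ["serif", "sans"]
    shapes = ["rounded", "square"]

    cells = list(product(colours, fonts, shapes))
    print(len(cells), "combinations")                 # 2 x 2 x 2 = 8
    for cell in cells:
        # Each cell only receives 1/8th of the traffic, so it takes far longer to reach significance.
        print(cell, f"{100 / len(cells):.1f}% of traffic")
    ```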

    Generally, A/B testing is a better option because of its simplicity in design, implementation and analysis.

    Summary

    Experiments can be “spray-and-pray” or they can be run with a discipline that provides statistical certainty. I’m not saying it’s an essential step or the ONLY metric you want to apply to your App engagement – but as tools become available to make this testing possible, you have the foundations to make it part of your culture.