The science of good design
What is good design? Perhaps numbers, not creativity, hold the answer. Michel Ferreira takes a deep dive into A/B testing
You’ve probably heard or read about A/B tests. However, like most things we digest online, there’s a lot of misinformation out there. In this article, I’d like to take the time to look at best practices in A/B testing, common pitfalls, and how experimentation can fast-forward your skills to the next level.
Let’s forget everything we already know and start from scratch. An A/B test is a randomised comparison of two versions of a webpage. That means 50 per cent of your traffic is randomly presented with version A of a page and 50 per cent with version B. We use a control (version A), to compare with a variation (version B), which we anticipate will have an effect on any specific metric. The metrics can be anything from conversion rate or time spent on the page to the number of clicks or how long it takes to complete a task.
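To make that split concrete, here is a minimal sketch in Python of how visitors might be assigned to version A or B. The user IDs, the experiment name and the hashing approach are my own illustration, not a prescription; any A/B testing tool will handle assignment for you.

import hashlib

def assign_version(user_id, experiment="cta-colour"):
    # Hash the user ID so that each visitor always sees the same version
    # for the lifetime of the experiment, rather than flipping a coin on
    # every page view.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

print(assign_version("user-42"))  # the same visitor always gets the same answer
print(assign_version("user-43"))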
To illustrate this, I’ll start with a simple example. A landing page contains a blue call to action. This is the control against which our test will be measured. In the variation, the call to action is changed to green. An A/B test is run, after which the data will show us whether either version has a positive or negative effect.
However, if you ran this test with no hypothesis and simply observed the data, you’d see differences in lots of metrics. Unfortunately, they wouldn’t help you prove anything.
The best way to get good, reliable results is to develop your experiment with the exact metric you’re targeting in mind. In this case, your hypothesis could be that you expect the green button to receive a higher percentage of clickthroughs than the blue. Let’s dig into this even further.
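Before we do, here’s a tiny illustration of what targeting one exact metric looks like in practice. The click and impression counts are invented; in reality they would come from your analytics.

def clickthrough_rate(clicks, impressions):
    # The one metric this hypothesis targets: clicks divided by the
    # number of times the button was shown.
    return clicks / impressions

# Invented counts, for illustration only.
print(clickthrough_rate(clicks=480, impressions=9_600))  # version A (blue): 5.0 per cent
print(clickthrough_rate(clicks=552, impressions=9_700))  # version B (green): roughly 5.7 per cent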
Design of experiments
I work as a designer at Booking.com, where I sometimes joke our designers should be called ‘designentists’. That’s because we believe in testing absolutely everything that we build. We do this through something called ‘design of experiments’ (DOE). This is a systematic method to determine the relationship between the factors affecting a process and the output of that process.
In other words, it is used to find cause-and-effect relationships. Using DOE, we run experiments to test an idea and to make sure its effect is not caused by chance or external factors.
Since we’re designers, not data scientists, DOE may sound too complicated. But any experiment can be designed just by following these five steps:
1 Make observations
2 Formulate a hypothesis
3 Design and conduct an experiment to test the hypothesis
4 Evaluate the results of the experiment
5 Accept or reject the hypothesis
Let’s break it down and look at each of these steps in more detail.
Make observations
We start by observing user behaviour, either during user research or by looking at the data our website already collects. You can review historical data to see trends in your customers’ behaviour, or look at Google Analytics and any other tool you have at your disposal. Try to identify the users’ pain points or anything you believe could improve their overall experience.
Formulate a hypothesis
Hypotheses can be simple ideas like ‘If we modify the copy on the Register button, we expect more users will create accounts because of how much
simpler it is to understand the new message’ or ‘If we increase the size of a button, we will get more users to complete their purchase because it will improve readability.’
You can experiment with anything you like, as long as you can measure it. So how about a technical improvement? ‘If we remove the extra image calls on a page, we will reduce load time, move users through the shopping cart faster and increase conversion.’
When formulating ideas, it’s important to have a clear reason for the change. Your best bet is to test ‘SMART’ questions: those that are significant, measurable, achievable, results-oriented and time-bound. With SMART questions you’ll get better answers, and those answers are what will matter when it’s time to make a decision.
Design and build your idea
Now design, build, publish and run your test. I won’t teach you how to design, but I will point out that execution could make or break your experiment. Choosing something you can measure, with a high probability of impact, can really make a difference here. If you start by accepting that most experiments fail, you’ll be able to perform more tests, faster, and learn from your failures. Iterate, rinse, repeat.
Evaluate the results
The more traffic you can send to your test, the bigger your sample size and the less time you need to achieve confidence in the statistical significance of your hypothesis. If there is a small effect (say a 0.1 per cent increase in conversion rate) you will need a very large sample size to determine whether that difference is significant, or due to chance. Larger effects can be validated with a smaller sample size.
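To get a feel for those numbers, here’s a rough sketch using the standard two-proportion sample-size formula, assuming an invented baseline conversion rate of 5 per cent. This is the kind of calculation an online sample-size calculator does for you.

from statistics import NormalDist

def sample_size_per_version(p_base, lift, alpha=0.05, power=0.8):
    # Approximate visitors needed in each version to detect an absolute
    # change of `lift` in conversion rate at 95 per cent confidence and
    # 80 per cent statistical power.
    p_new = p_base + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_alpha + z_power) ** 2 * variance / lift ** 2) + 1

# A 0.1 percentage-point lift on a 5 per cent baseline needs a huge sample;
# a full percentage point needs far fewer visitors.
print(sample_size_per_version(0.05, 0.001))  # roughly 750,000 per version
print(sample_size_per_version(0.05, 0.01))   # roughly 8,000 per version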
But here’s a curve ball. You’ve checked your numbers, and everything indicates you only need one week – let’s call that a business cycle – to achieve your results with the necessary confidence (you can use online calculators to determine that: netm.ag/calculate-286).
But the calculator looks only at numbers. So let me ask you this: can you remember what happened last Friday? Now compare it to Friday 5 August 2016, the first day of the Summer Olympics in Rio. Do you think your website’s customers behaved the same way on both days? The short answer is: no.
Users’ behaviours are affected in unexpected ways, by planned and unplanned events. And because of that, you should run your experiments for at least two full business cycles. That way, you’ll not only get a bigger sample but you’ll also cover your bases if something completely unexpected happens on the week you run the test. Don’t stop the experiment before its cycle is complete, and always run it for full cycles (two full weeks or months).
For valid experimentation, you also need to make sure you run both versions of your page at the same time, with randomised users, and not version A for 100 per cent of the traffic and then B for 100 per cent of the traffic. This would mean you were testing your solution against two different user bases, and not getting real results.
Data does not speak
When developing your questions, consider how you will measure their success. Because in most cases, when the time comes to analyse the data, the answers won’t be descriptive. All you’ll see is ‘yes’, ‘no’ or ‘goodbye’ (inconclusive results).
Ask yourself, can your question be easily answered by these responses? For instance, let’s say you ask: Would a hamburger menu icon work better for my website than the word ‘menu’? Reviewing the data you see ‘No’. Can we make the question better so the answer is easier to understand? Let’s reformulate the same question using the SMART format and add some measurable goals.
How about this: Based on collected data, we believe the hamburger menu icon could be bad for our users and hinder the proper navigation of our website’s secondary actions. We will test this by assuming that a new version with the word ‘menu’ instead of the icon would be easier to understand and improve the overall menu engagement (clicks on menu and clicks in all links inside the menu). We also expect the overall number of users that finish their hotel reservations to be impacted. This will run on our mobile website for two weeks before we make a decision.
Yes, no, goodbye
Time to look at our data and accept or reject the hypothesis. Review the question and see if the answer is now obvious. Does it show an improvement or positive difference? Was there an obvious negative impact on the metric you were aiming for? Or does it simply show no significant result, leaving the experiment inconclusive?
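As a sketch of how those three answers can fall out of the numbers, here’s a simple two-proportion z-test. The visitor and conversion counts are invented, and your testing tool will normally run this kind of calculation for you.

from statistics import NormalDist

def evaluate(conversions_a, visitors_a, conversions_b, visitors_b, alpha=0.05):
    # Compare the conversion rates of A (the control) and B (the variation).
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return "goodbye (inconclusive)"
    return "yes (B did better)" if rate_b > rate_a else "no (B did worse)"

# With these made-up numbers the variation wins; shrink the difference
# between the two and the answer quickly becomes 'goodbye'.
print(evaluate(500, 10_000, 590, 10_000))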
A common mistake is to believe that an inconclusive result means there is a neutral effect, and therefore to consider it an acceptable change. Beware: the fact that you can’t see a difference doesn’t mean the feature is acceptable or better for your users. It just means you can’t measure the impact of what you’re testing. You will either want to review the solution and see if there are other ways to solve the problem, or abandon the idea entirely.
In some cases you’ll see a difference in metrics that are not your primary focus. Try to understand why this is happening. If there’s a negative impact, is it obvious why? For instance, let’s say you tried to improve sales by moving the disclaimers into a new tab on your product page, but then saw a huge number of cancellations or product returns.
Or is it something you can’t place? Let’s say you modified a navigation item on the header of the website, and now users are filling out review forms more often. In this case, your design intuition will be key to understanding what the data is saying. It’s like trying to understand what your users are saying just by looking at their body language.
More importantly, don’t accept any positive result just because it is positive, especially if it is not obvious or related to your hypothesis (‘There is no such thing as magic, Harry’). Sure, you want your results to be positive, but more importantly you want them to be true.
To conversion and beyond
At Booking.com we optimise our website in small steps. But not because we want to obsess over every small detail; we want to have measurable steps that, when validated, will lead our product to become better. Rather than improving one thing 10 per cent (which is really difficult in a high-performing website), we go out and find thousands of things to improve a fraction of a per cent. This is achievable and much simpler.
Don’t try to optimise more than one thing at a time. Not only does this often fail to produce results, but when it does produce them, it’s impossible to know why, or to learn from it. Say you ran a test in which you added an image and changed the colour of the button at the same time, and this generated positive results. You won’t know whether it was the image or the colour change that created the effect, which makes it impossible to learn anything from the test.
Considering most tests fail, if you had a negative result would you be able to say for sure that users prefer pages without images and without green buttons? As Colin McFarland says: ‘Design like you’re right. Test like you’re wrong.’
Why is testing the key to good design? Because ‘good’ is subjective, and trying to define ‘good design’ is even harder. Designers disagree on any number of things. Is Helvetica a good font or not? Don’t get me started on Comic Sans. These discussions are never-ending and no one is right.
A/B testing takes opinions out of the equation. You come in with an idea; an educated guess of what you believe is good for the users of your website. And the users show you the answer. It’s a democracy of good ideas. Ideas that you believe add value to your customer base and that they have the chance to accept or reject.
Your job then is to use your design and problem-solving skills to keep making your ideas better. The entire goal of your design process shifts to finding better solutions to customers’ issues, and refining those to give them the best experience possible. Doesn’t sound easy, does it? But who said good design was meant to be easy?
A/B basics: A/B testing is a randomised control test, where 50 per cent of your traffic is presented with a variation to test if that change has any measurable benefit
Facts: If this change had been implemented without using an A/B test, you’d never know if it had any effect
Building a hypothesis: My hypothesis is that increasing the size of a button will get more users to complete their purchase because it will improve readability
Hosted options: If you’re looking for a quick start into the world of A/B tests, there are some great hosted options available
Menu styles: We experimented with using the word ‘menu’ versus an icon to see which pattern suits our users
Just one change: If we just add an image to our landing page next to the call to action, and see a difference, we’ll be able to confidently attribute it to our change
Two changes: Would you be able to say with confidence that users prefer landing pages without images or green buttons?