Why you cannot validate hypotheses with usability tests
November 29, 2016
If you plan to conduct usability tests in order to validate your hypotheses, the best case scenario is that these tests go to waste. The worst case scenario is that you opt out of using great ideas.
Let me make one thing perfectly clear from the outset. Qualitative and quantitative methods both have a role to play in the toolbox. Don’t stop conducting usability tests.
At the end of the 1990s, Rolf Molich (Jakob Nielsen’s mate) came out with his CUE series. CUE stands for Comparative Usability Evaluations. Nine teams were tasked with evaluating various websites. They were allowed to use any UX method. Many chose to go with the usability test, and some chose to go with expert evaluations (sic).
They began to observe a surprising trend after the second test (CUE-2). The teams reported completely different usability problems:
“The CUE-2 teams reported 310 different usability problems. The most frequently reported problem was reported by seven of the nine teams. Only six problems were reported by more than half of the teams, while 232 problems (75 percent) were reported only once.”
The severity of the UX problems also turned out to be highly subjective:
“Many of the problems that were classified as “serious” were only reported by a single team. Even the tasks used by most or all teams produced very different results—around 70 percent of the findings for each of these common tasks were unique.”
And on it went. In CUE-4, 17 teams evaluated a hotel website and reported a total of 340 UX problems. Only nine (!) of these problems were reported by more than half the teams. Asked directly, “How can the team be sure that they are addressing the right UX problems?”, Molich replied:
"It’s quite simple. They can’t."
What does this tell us? Usability tests are far from scientific. Their reliability is often atrocious. They are subjective and hard to implement. But why is that?
- The test is performed on users who are not representative of the target group.
- The data require interpretation, which is subjective by nature.
- The team tests its own design, which may influence both the outcome and the interpretation of the data.
- Tests and interviews are conducted in an unrealistic environment. For example, detailed questions are asked about screen views that the user visits for only a fraction of a second.
- Too few people are tested for statistical significance to be possible in the validation itself.
- When it comes to questions about requirements, users are often not very good at analysing their own (future) behaviour.
Things go wrong even if you do everything right
You need to “validate” your design or hypothesis. Let’s say that you do everything right. As scientifically as possible.
- You manage to find test users who are representative of the target group.
- You have another team test your design.
- You explain the context to them perfectly.
- They, in turn, interpret user behaviour cleanly, primarily by seeing through interview responses, but also through behaviours that seem unrealistic for real-life situations.
- Users are as close to the “real” context as possible.
- They are not stressed about the situation.
- Test data and interview questions are realistic, and do not disrupt the flow of the application.
- Questions are not asked in unrealistic situations, such as having a user comment on a screen they have been made to look at for a long time.
All of this you get right.
And you still cannot be sure of whether a hypothesis holds water or not. You run the test on too few users to be certain. The outcome? Statistically insignificant anecdotes. The plural of anecdote is not data.
Even if you were to run the test on 20 users – something I have seen once in my 20-year UX career – you’re looking at a margin of error of ±19% when drawing conclusions about the entire population. If you want to bring the margin of error down to ±10% (generally regarded as the absolute minimum in the industry, although most professionals use ±5%), you will need 71 users. That is ridiculously expensive. It just isn’t done.
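To make the arithmetic concrete, here is a minimal sketch of how sample size drives the margin of error for a measured proportion (say, the share of users who complete a task). It rests on assumptions the text above does not state: the plain normal approximation, a worst-case proportion of 50%, and a 90% confidence level. The function names `margin_of_error` and `users_needed` are made up for this illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(n: int, confidence: float = 0.90, p: float = 0.5) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    Assumptions (not from the article): normal approximation, worst case p = 0.5.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    return z * sqrt(p * (1 - p) / n)


def users_needed(target_margin: float, confidence: float = 0.90, p: float = 0.5) -> int:
    """Smallest sample size whose margin of error does not exceed target_margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z / target_margin) ** 2 * p * (1 - p))


# Margin of error for a few typical test sizes at the assumed 90% confidence.
for n in (5, 20, 71):
    print(f"{n:>3} users: ±{margin_of_error(n):.1%}")

# Users required for ±5% at 95% confidence.
print("users needed:", users_needed(0.05, confidence=0.95))
```

Under these assumptions, 20 users come out at roughly ±18% and 71 users at just under ±10%, which is in the same range as the figures above. The exact numbers shift with the confidence level and the interval method you choose, but the conclusion does not: a handful of test users cannot support claims about a whole population.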
I’m not saying that you should stop running usability tests, not at all. What I mean is that usability tests are a poor method for validating hypotheses. Usability tests are good at lots of things. You can identify usability problems. You can discover problems completely unrelated to what you are testing. You can understand behaviours when you interpret data professionally. It’s reasonably affordable. But – and this is important to note – you cannot see how widespread a problem is. Or if a hypothesis holds water. Or if your observations are true of everyone.
As someone put it:
With usability tests, you risk opting out of good ideas
I can’t count how many times a hypothesis I sent off for usability testing came back from the test leader with the following feedback: “The users seemed not to like the solution.” I have had to nag to have A/B tests run on the hypothesis. In many cases I had to conduct my own tests under the radar. And data from these A/B tests, and from follow-up tests, often show actual usage that is completely different from what emerged in the usability test.
You want to know if your hypothesis holds water or not. Take my recommendation. A/B test your hypothesis instead of doing a usability test. But only if you have the possibility of doing so. And by possibility I mean:
- Metrics that are indisputable and not open to interpretation.
- Metrics that are easy to validate statistically using optimisation and analysis tools. Loyalty is an example of a measurement point that is difficult to follow up on using an A/B test.
- Sufficient volume of traffic.
- Reasonable cost for carrying out the test.
- Short or non-existent change curves (if your curves are long, you will need to follow up over time in order to validate).
- A/B testing expertise close at hand.
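As an illustration of what "easy to validate statistically" and "sufficient volume of traffic" can mean in practice, here is a minimal sketch of a pooled two-proportion z-test on conversion counts. This is one common way to evaluate an A/B test, not necessarily what your optimisation tool does under the hood. The function name `ab_test_p_value` and all the visitor and conversion numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ab_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test.

    conv_* are conversion counts and n_* are visitor counts per variant.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no variation at all, so no evidence of a difference
        return 1.0
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the same 4% vs 5% conversion rates at two traffic volumes.
p_low_traffic = ab_test_p_value(4, 100, 5, 100)
p_high_traffic = ab_test_p_value(200, 5000, 250, 5000)
print(f"100 visitors per variant:  p = {p_low_traffic:.3f}")
print(f"5000 visitors per variant: p = {p_high_traffic:.3f}")
```

With the made-up figures above, the identical one-percentage-point difference is indistinguishable from noise at 100 visitors per variant, while at 5,000 per variant it falls below the conventional 0.05 threshold. That is why traffic volume is on the list.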
This is the point I think many UXers should work to get to. You are an amazingly creative and methodologically grounded person who is already an ace at usability tests and other qualitative methods. Now understand the difference between anecdotes and empirical evidence.
You have power, so use the right word
And pretty please with cherries on top: Never generalise when you present qualitative data. Do not say “the users” or “they” when showing your neat video clips or your presentation (“They like this icon better than that one” / “The users understand that you can…”). If you do that, there is a risk that what you say will be perceived as being a quantitatively demonstrated truth. Of course, it is even worse to attach value judgments to your presentation of data: "It’s great that the users seemed to like the new feature".
Say this instead:
“This particular user (+ add any uncertainty as to method and context, for example stressed user) seems to like/use/dislike/not understand the content/feature. Now, this does not mean that we know that this is true of our target group. Not at all. It means that we may be on the right track to identifying a problem/solution. The question is whether enough people have the same problem/would behave the same way. What do you think is the smallest possible build we can make to find out?”
Do you conduct interviews or usability tests to gauge driving forces and behaviours? Be sure to also research (or validate using experiments) the market and the potential linked to that driving force. User research and market research. Otherwise you risk creating great solutions that nobody uses. Which is a waste of money. Both public and private.
Strive for architecture, tools and a culture that make it easy and inexpensive to conduct A/B tests. Make sure these factors are in place. Then let quantitative data be the basis of your knowledge. Hug it tenderly. Become an even better designer and then go out and save the world.
Until then, be very careful not to draw quantitative conclusions about a population larger than the cohort that participated in the usability test.
Top media by James Cridland under CC BY 2.0. Fluffy dog by y-a-n under CC BY-NC-ND 2.0