What We Can Learn Today from 20 Years of the CUE Studies

The CUE studies focused on researching user experience research itself. How meta!

Words by Molly Malsam, Visuals by Nicky Mazur

When I started in user experience almost 20 years ago, “CUE studies” were a hot topic. Most people I talked with in the field had heard of them and were familiar with their purpose. More recently, however, when I reference them I see blank faces staring back at me. This inspired me to write an article on the topic for those who are newer to the field and may not know about this important research.

Tomer Sharon, a leader in the UX field, said of the studies, “CUE was seminal work led by Rolf and is studied in academic halls globally to date. As a user research pro, it’s a must read. It makes you think, tighten[s] your skills, and helps you become better at what you do.”

The why

The study series had two primary purposes. First, the researchers wanted to determine how reproducible usability evaluations are. In other words, if two separate research teams evaluate the same experience, will they get similar results?

The answer to this question is an intriguing one for researchers. It gets to the heart of what we do: whether different approaches to answering research questions yield similar results, or whether varying methods and/or researchers will come up with completely unique usability findings. It also helps us understand whether one study is enough for a given problem set.

Second, the researchers wanted to understand the “state of the art” more generally in the UX research field. How do teams evaluate a website’s usability? What are their approaches, findings, and resulting reports? This is also valuable information for us to understand the industry more as a whole, and how others are approaching their research questions.

The who, what, when, and where

The CUE studies were a series of 10 studies conducted over 20 years, from 1998 to 2018. The idea came from Rolf Molich, who is the owner of a design consultancy in Denmark. Trained as a software developer and having worked in the user experience field since 1984, Rolf has won many awards and also co-invented the heuristic evaluation method with Jakob Nielsen.

In each of the first six studies, multiple teams were given a scenario and objectives for the same website and then told to evaluate the site using their company’s standard methods. Each team provided their results anonymously, and then all participants from each team met for a one-day workshop to discuss the similarities and differences in the combined findings.

Participants were not paid for their time and came from the US and Europe. After the first six studies, the researchers started to home in on more specific questions.

Some top findings

1. The number of issues found was very large

One consistent finding was that in an average modern website, the evaluators found a huge number of usability issues. For example, CUE-2 found 310 issues with Hotmail, while CUE-4 found 340 issues with a hotel website. What’s more, the overlap between findings was fairly low.

In CUE-2, 75% of the issues were found by only 1 of 9 teams, and only 4% were found by 40% of the participating teams. In CUE-9, where the focus was more on the evaluator effect, they used tighter controls on methodology, having researchers review video sessions of five usability tests. Even so, just 10% of all issues were found by 40% or more of the teams.

Jeff Sauro and others have theorized that some of the lack of overlap could be explained by the lack of control over tasks and methods, which was a deliberate choice by the CUE study designers to capture the range of approaches across teams. Sauro therefore reviewed six studies that had more control and compared them using the “any-2 agreement” measure: for each pair of evaluators, the number of problems both found divided by the total number of problems either of them found, averaged across all pairs. Agreement rates were 59% for the controlled studies, compared with 17% for uncontrolled studies.
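To make the any-2 agreement measure concrete, here is a minimal Python sketch. The team names and problem IDs are invented for illustration and are not CUE data, and scoring a pair that found no problems at all as zero agreement is my own assumption.

```python
from itertools import combinations


def any_two_agreement(findings):
    """Average, over every pair of evaluators, of
    (problems both found) / (problems either found)."""
    pairs = list(combinations(findings.values(), 2))
    if not pairs:
        raise ValueError("need at least two evaluators")

    ratios = []
    for a, b in pairs:
        either = a | b  # problems found by at least one of the two
        both = a & b    # problems found by both
        # Assumption: a pair with no findings at all counts as 0 agreement.
        ratios.append(len(both) / len(either) if either else 0.0)
    return sum(ratios) / len(ratios)


# Hypothetical findings: each team maps to the set of problem IDs it reported.
teams = {
    "team_a": {"P1", "P2", "P3"},
    "team_b": {"P2", "P3", "P4"},
    "team_c": {"P3", "P5"},
}

agreement = any_two_agreement(teams)
print(f"any-2 agreement: {agreement:.0%}")  # prints "any-2 agreement: 33%"
```

In this made-up example the three pairs agree on 50%, 25%, and 25% of their combined problems, which averages to about 33%. That is how teams can each report several issues and still show modest agreement.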

What should we conclude from this? I don’t find this to be a huge cause for concern, but rather a reflection of the scope of the research. It seems that each study focused on an entire website/product and the teams were allowed to determine their own tasks in many cases, so it’s not surprising that different evaluators found unique problems. Sauro’s analysis shows that when you add more controls, you find more agreement.

Moreover, every researcher has special experiences and backgrounds that allow them to discover certain problems more easily than others. For example, those well-trained in accessibility will notice those types of issues, while those with a strong background in semantic code might notice heading hierarchy issues more often.

Also, an “issue” can be anything from a minor annoyance to a severe showstopper, so perhaps in their eagerness to be thorough in this volunteer research, many of the cited issues were minor. It would have been nice to see the categorization of the issues as well, but judging from the reports I saw from CUE-1, that was not consistently reported by participating teams. Three studies appeared to have called out critical problems, which will be discussed in a later finding.

The takeaway Molich shared from this finding is that researchers should never claim or assume that they can do one exhaustive usability test of an entire experience. Exhaustive testing may only be possible within limited function areas of the site.

2. Expert reviews are useful

Given that Molich co-invented the heuristic evaluation method (and was instrumental in establishing that users should be kept informed of what is going on in a system through appropriate feedback in a reasonable amount of time), it’s no surprise that he was interested in examining the value of such methods.

In CUE-4, 17 teams evaluated the same website, with nine teams conducting usability testing and the others using the expert review method of their choice. The results demonstrated that both approaches found most of the same usability issues.

The benefit of expert review methods, of course, is they can be conducted more quickly. The most important part of this finding is that the evaluators must indeed be very experienced in the field. The CUE-3 study found that professionals with limited experience will have trouble conducting expert reviews. Molich also warns that he would be very cautious using these methods in organizations with a low level of usability maturity.

3. Many test reports are low quality

Another consistent finding in the studies was that the quality of reports varied dramatically, with many missing elements such as a summary of findings at the beginning of the report. They also found report length varying wildly, from five to 52 pages for the same study.

Based on the observed issues, they recommended including the following components in a test report so that it is usable by your audience:

  • Limit the length to 25 pages maximum. If you have to leave out some of the less important findings, that’s okay—25 is enough for the team to tackle!
  • Include a brief executive summary at the beginning of the report.
  • Explain your methodology and any other key supporting details.
  • Provide screenshots with callouts to support and explain the issues.
  • Have at least some positive findings along with the negative ones. It’s important for teams to know what is working along with what is not working so they keep the good and improve on the not-so-good.

4. There were very few “false alarms”

The researchers reported that they very carefully evaluated each issue that teams reported, paying special attention to those that only one team found. They confirmed that almost all were reasonable problems that are supported by usable design principles. The false alarms they did find came from some of the less experienced teams in the CUE-3 study.

What this means is that even though different teams of experienced participants may find different issues, the results themselves are reliable. So, even if based on this article you’re concerned that different teams don’t find the same issues, they’re all finding real problems to solve. That’s the point of usability practice: making things easier to use with fewer problems.

5. Five users is not enough

The combined studies found that many serious problems were not discovered by any particular moderator or task set, and that even 15 or more teams evaluating a website or product will find only a fraction of the problems.

This one is tough to summarize. I did a lot of reading to understand what this finding might mean. Molich states, “It is a widespread myth that five users are enough to find 85 percent of the usability problems in a product.” My understanding of this statement is that the math behind this recommendation to “test with five users” has been improperly stated or paraphrased.

The correct way to state this guideline is, “With five users, you have about an 85 percent chance of detecting a problem that affects 31 percent of users.” Problems that affect more users are even more likely to show up, and problems that affect fewer users are less likely. In other words, you have a pretty good chance of catching most of the widespread issues. Also, Jakob Nielsen, who first published “Why You Only Need to Test with Five Users,” shared some other important caveats to the five-user rule which are sometimes overlooked. Molich supported one of those in his analysis of the CUE findings: five users is likely enough for an iterative testing cycle.
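The arithmetic behind that statement can be sketched in a few lines of Python. This assumes the simple model usually used for the five-user guideline, in which every participant independently has the same chance of running into a given problem. The function name and the example frequencies other than 31 percent are mine, chosen for illustration.

```python
def chance_of_seeing(problem_frequency, participants):
    """Probability that at least one participant hits a problem that
    affects `problem_frequency` (0..1) of users, assuming independent
    participants with the same per-person chance."""
    if not 0 <= problem_frequency <= 1:
        raise ValueError("problem_frequency must be between 0 and 1")
    # Chance that nobody hits it is (1 - p)^n; take the complement.
    return 1 - (1 - problem_frequency) ** participants


# Illustrative frequencies: a widespread problem, a moderate one, a rare one.
for frequency in (0.31, 0.10, 0.01):
    chance = chance_of_seeing(frequency, 5)
    print(f"affects {frequency:.0%} of users -> "
          f"{chance:.1%} chance of seeing it with 5 participants")
```

Under this model, a problem that affects 31 percent of users has roughly an 84 to 85 percent chance of appearing in a five-person test. A problem that affects 10 percent has only about a 41 percent chance, and one that affects 1 percent has about a 5 percent chance. Five users catch the widespread problems well, not 85 percent of all problems.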

Nielsen states in his article that 15 users are ideal, but that it makes more sense to do three studies with five users each so that you can find some problems, fix them, retest, and potentially find additional problems. The other caveat to the five-user rule is that the users must be very similar to each other and represent the main audience. If there are distinct user groups, they should be treated separately.

I also go back to my reaction to the first finding: the lack of control provided for many of these studies meant that different methods and tasks were used on an entire website, so the scope of the findings was larger than it would typically be for a focused usability test.

Even with that, the results of the CUE studies still do not seem to fully jibe with the corrected five-user statement, given that there was little overlap even on serious problems. I am not totally sure how to reconcile this finding with the user-problem equation. What I do know is that multiple studies have demonstrated the value of usability testing, with a plethora of published studies as well as personal experience showing that design changes improve dependent variables such as the Single Ease Question (SEQ) and the System Usability Scale (SUS).

I also know that the goal of our work is to improve the user experience, and if we do that by fixing the issues we do uncover, we have done a good service. Maybe it’s less important for us to find the same issues than that we do find issues and fix them. Teams have other methods of learning about issues after products are released as well, so if big problems were somehow not uncovered prior to release, they will likely show themselves in another way down the road.

This is only a highlight of the findings from the CUE studies. This website will give you more background along with links to additional details.

Molly is a User Experience Research Manager in the financial services industry. She has a master’s degree in communication and has over 20 years of experience in the UX field. She loves learning more about how people think and behave, and off-work enjoys skiing, reading, and eating almost anything, but first and foremost ice cream.
