Statistics are an important part of many user researchers’ toolboxes, but knowing that you’re coming to the correct conclusions with data isn’t always so straightforward.
In my previous article, I explained the levels of measurement that researchers use to assign numbers or labels to units of analysis in order to represent categories.
I use the mnemonic “NOIR” to remember these data levels: nominal, ordinal, interval, and ratio (as in film or pinot, two of my three favorite noirs).
Shoutout to Ben Wiedmaier for reinforcing that mnemonic! In that article, I shared details about the characteristics of each data level and the associated descriptive statistics you can apply to each level when reporting and sharing results with your teammates.
I also mentioned that some of these levels afford the opportunity to apply other types of data analysis that can be more powerful or answer different questions—in particular, statistical tests of significance. If you have been a UX researcher for any amount of time, then you’ve probably been asked of your findings, “Is this statistically significant?”
Statistical significance explained
While there’s a lot to unpack in the question “Is this statistically significant?”, a more accurate way to phrase it is “Is this data significantly different from [something else]?”
Following the scientific method, we start by assuming the null hypothesis: whatever we think will have an effect actually has none. The test then either fails to reject that null hypothesis or rejects it.
For example, a test can potentially find that:
“Participants complete a given task faster using Option 1 than using Option 2.”
If that difference holds up once we account for the number of participants and report the confidence intervals, the finding would lead us to reject the null hypothesis. Otherwise, we can only conclude, “We found no evidence that this design makes users complete the task any faster than the other one.”
Some examples of how to rephrase the “statistical significance” question more specifically are:
→ If five of five test users could not complete a task, how sure can we be that most users will also fail?
In other words, what’s the likelihood that we just happened to randomly pick five users who failed the task when only a handful of users would actually do so, versus the potential conclusion from the test evidence that all of them will fail?
→ If the SUS score for Design Option 1 is 90 and Design Option 2 is 78, is the first option significantly more usable than the second?
In other words, what’s the likelihood that we just happened to find more users that assessed Option 1 as more usable, but if we tested with all users, we would see no real difference? If there was no real difference, but we concluded that a true difference was found, this would be known as a type I error, also known as a false positive.
→ If the SUS score for Design Option 1 is 88 and Design Option 2 is 87, can we conclude that the first option is not better than the second?
In other words, what’s the likelihood that we just happened to find an equal number of users that assessed both options similarly, but if we tested with all users, we would see an actual difference between the two? If there were an actual difference but we concluded that a true difference was not found, this would be known as a type II error, also called a false negative.
Note that in all of these questions, an underlying question is how many test participants you need. That’s a subject for another article, and plenty has been written on that topic, but I didn’t want to ignore that piece of the puzzle. Stay tuned!
Statistical tests for each data level
The following sections describe some of the tests you can apply to various data levels. Note that this is not an exhaustive list. Rather, I’ll discuss the tests that would most likely apply to the kind of data UX researchers deal with.
As I mentioned in Part 1, applying statistical tests to UX research data is not the norm in industry design research. This is because it requires specialized education and expensive software not commonly found in the average UX research department.
I confess that I’m not a statistics expert, so my goal is to introduce you to these statistical tests at a cursory level. I recommend that you read beyond this article before confidently applying any statistical tests described here. As mentioned in Part 1, many of these tests also require special software or a lot of patience with manual or Excel-supported calculations.
As a General Hospital soap opera actor said in an old commercial: “I’m not a doctor, but I play one on TV.” I would say of myself: “I’m not a statistician, but I play one at work.” I run my test recommendations by people with a deeper background in statistics and check them against their publications.
Nominal data
Nominal data is classified into categories in a qualitative way and can be analyzed with some nonparametric statistical tests. Nonparametric tests make no assumptions about the shape of the data’s distribution.
If you’ve taken any classes in statistics, you might recall talking about the “normal distribution”, the bell curve that many datasets approximate. With nominal data, you can't make assumptions about the distribution, which is why you can only apply nonparametric tests.
In the field of statistics, there’s some lack of clarity about the definition of nonparametric tests and when to use them. Check out Should You Use Nonparametric Methods to Analyze UX Data? for more background.
You can apply two nonparametric tests to nominal data, the Chi-square goodness of fit test and the Chi-square test of independence. Use the former if you’re only looking at one variable and you want to know how well it represents the whole population. Use the latter if you want to look at how two variables relate to each other.
The Chi-square goodness of fit test is one form of Pearson’s Chi-square test. For this test, you’ll need the variable data in question and a hypothesis about that data. For example, let’s say that you’ve done several user interviews and you have observed that most participants have a lot of technical knowledge. Therefore, your hypothesis is that most of the product’s users are tech-savvy, and you want to evaluate whether that is a valid conclusion. See this article for additional data requirements and information about the Chi-square goodness of fit test.
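To make the goodness of fit idea concrete, here is a minimal Python sketch for the simplest case of two categories (tech-savvy versus not). The interview counts, the 50/50 expected split, and the function name are all assumptions made up for illustration, and the p-value shortcut used here only holds for one degree of freedom.

```python
import math


def chi_square_gof_two_categories(observed, expected_proportions):
    """Chi-square goodness of fit for exactly two categories (df = 1).

    observed: counts per category, e.g. [tech_savvy, not_tech_savvy]
    expected_proportions: proportions under the null hypothesis (sum to 1)
    """
    if len(observed) != 2 or len(expected_proportions) != 2:
        raise ValueError("this sketch only handles two categories")
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, the upper tail of the chi-square
    # distribution can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical data: 14 of 20 interviewees were tech-savvy.
# Null hypothesis (an assumption for this example): users split 50/50.
statistic, p_value = chi_square_gof_two_categories([14, 6], [0.5, 0.5])
print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the observed split would be unlikely if the assumed population split were true. As a common rule of thumb, this approximation becomes unreliable when expected counts fall below about five.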
The Chi-square test of independence is also known as the Chi-square test of association. This lets you determine whether two variables are related or independent. Going back to the technical knowledge example, perhaps you hypothesize that the level of knowledge is related to whether the person lives in a major metropolitan area rather than a rural area. You could explore whether that relationship may exist using this test. See this article for additional data requirements and information about the Chi-square test of independence.
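As a rough illustration of the test of independence, the sketch below works through a 2x2 table by hand in Python. The counts (metro versus rural, tech-savvy versus not) and the function name are hypothetical, no continuity correction is applied, and the p-value shortcut again assumes one degree of freedom, which is what a 2x2 table has.

```python
import math


def chi_square_independence_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts (df = 1).

    table: [[a, b], [c, d]] where rows are one variable and columns the other.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = metro, rural; columns = tech-savvy, not.
statistic, p_value = chi_square_independence_2x2([[18, 7], [9, 16]])
print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
```

A small p-value here would suggest that location and technical knowledge are related in the sample, not that one causes the other.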
Ordinal data
As a refresher, ordinal data groups data into categories and also conveys the order of the variables, such as in a rating scale. Because much of the ordinal data we collect in usability tests is binary (yes/no, pass/fail), you have several test options depending on whether you are comparing two groups of data.
You can find out how to test more than three comparison groups online here.
If you’re not comparing groups, you can use the adjusted Wald confidence interval method, also called the modified Wald interval. This method would answer the aforementioned question about task failure with five of five users.
Assuming you’ve done a reasonable job of gathering a representative sample, you can apply confidence intervals to estimate the usage in the total population. This method should almost always be used when your sample size is less than 150. See this article for additional requirements and information about the adjusted Wald confidence interval.
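For the five-of-five question, the adjusted Wald interval can be sketched in a few lines of Python. This assumes the usual formulation, which adds z squared to the sample size and half of z squared to the event count before computing an ordinary Wald interval. The function name and the 95% default are my own choices for illustration.

```python
import math
from statistics import NormalDist


def adjusted_wald_interval(events, n, confidence=0.95):
    """Adjusted Wald confidence interval for a proportion.

    events: number of participants showing the outcome (e.g. task failures)
    n: total number of participants
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z ** 2
    p_adj = (events + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # A proportion cannot fall outside 0..1, so clip the bounds.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Five of five test users could not complete the task.
low, high = adjusted_wald_interval(5, 5)
print(f"Estimated failure rate in the population: {low:.0%} to {high:.0%}")
```

With so few participants the interval is wide, which is exactly the uncertainty the “how sure can we be” question is asking about.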
Let’s say you’re comparing two options with which you used the same test participants, also known as a within-subjects study. Use McNemar's exact test to determine whether a true difference between the two measures exists. This test can be conducted using statistical analysis software such as SPSS.
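If you don’t have SPSS on hand, the exact version of McNemar’s test is small enough to sketch in Python. It only looks at the participants whose results differed between the two options (the discordant pairs) and asks how likely such a lopsided split would be if both options were equally good. The counts and the function name below are hypothetical.

```python
import math


def mcnemar_exact(only_first, only_second):
    """Two-sided exact McNemar test on discordant pairs.

    only_first: participants who passed with Option 1 but failed with Option 2
    only_second: participants who failed with Option 1 but passed with Option 2
    Participants with the same result on both options do not enter the test.
    """
    n = only_first + only_second
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_first, only_second)
    # Binomial tail with success probability 0.5, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical within-subjects result: 7 passed only with Option 1,
# 1 passed only with Option 2.
p_value = mcnemar_exact(7, 1)
print(f"p = {p_value:.4f}")
```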
Finally, if you’re comparing two options that included different test participants, which is called a between-subjects study, use the N-1 two proportion test to answer the same question about whether a data point from each group is significantly different. Check out a free calculator for this test.
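For readers who would rather see the arithmetic than use a calculator, here is a hedged Python sketch of the N-1 two proportion test. It assumes the common formulation in which the standard two-proportion z statistic is multiplied by the square root of (N-1)/N, where N is the combined sample size. The completion counts and the function name are made up for illustration.

```python
import math
from statistics import NormalDist


def n_minus_1_two_proportion(x1, n1, x2, n2):
    """N-1 two proportion test for a between-subjects comparison.

    x1, n1: successes and sample size for group 1
    x2, n2: successes and sample size for group 2
    Returns (z, two-sided p-value).
    """
    total = n1 + n2
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / total
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        return 0.0, 1.0  # everyone passed or everyone failed in both groups
    z = (p1 - p2) * math.sqrt((total - 1) / total) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 11 of 12 completed the task with Option 1,
# 5 of 12 (different participants) completed it with Option 2.
z, p_value = n_minus_1_two_proportion(11, 12, 5, 12)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```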
Interval and ratio data
I will combine the last two levels of measurement when discussing available tests because they’re both continuous data: numeric values with equal intervals between them, where ratio data also has a true zero point, such as task time in seconds.
As with ordinal data, you have different test options based on whether you are comparing two groups.
If you’re not comparing groups, you have two options depending on whether the data you’re analyzing is time-on-task, also known as task time. For task time data, you can use this MeasuringU free calculator to get a confidence interval for your results so you know how precise those findings are.
If you’re not looking at task time, you can find your confidence interval using a single-sample confidence interval calculator. Here is a single-sample confidence interval calculator I found online, though you will have to use Excel to compute your mean and standard deviation first.
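If you prefer to skip Excel, both kinds of interval can be sketched in Python. The first function is the ordinary single-sample interval around the mean. The second applies a log transform before computing the interval, which is one common way to handle the skew in task times; I have not confirmed that the calculators mentioned above use exactly this method. The data, the function names, and the need to look up the critical t value yourself (2.262 for 10 participants at 95% confidence) are assumptions of this sketch.

```python
import math
from statistics import mean, stdev


def mean_confidence_interval(values, t_critical):
    """Confidence interval around the mean of a single sample.

    t_critical: critical t value for your confidence level and
    len(values) - 1 degrees of freedom, taken from a t table.
    """
    center = mean(values)
    margin = t_critical * stdev(values) / math.sqrt(len(values))
    return center - margin, center + margin


def task_time_confidence_interval(times, t_critical):
    """Interval for task times, computed on the log scale.

    The result brackets the geometric mean, which is less sensitive
    to a few very slow participants than the arithmetic mean.
    """
    logs = [math.log(t) for t in times]
    low, high = mean_confidence_interval(logs, t_critical)
    return math.exp(low), math.exp(high)


T_CRITICAL_DF9_95 = 2.262  # t table value for 9 degrees of freedom, 95%

# Hypothetical SUS scores and task times (seconds) from 10 participants.
sus_scores = [72.5, 85, 90, 67.5, 80, 77.5, 95, 70, 82.5, 87.5]
task_times = [34, 41, 29, 55, 38, 120, 47, 33, 62, 44]

print(mean_confidence_interval(sus_scores, T_CRITICAL_DF9_95))
print(task_time_confidence_interval(task_times, T_CRITICAL_DF9_95))
```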
If you have the knowledge and software, you can also look at relationships between variables. Methods I know some researchers use include Pearson’s r (also known as Pearson's correlation coefficient) and simple linear regression.
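As a small illustration of both ideas, the Python sketch below computes Pearson’s r and the least-squares line for two variables from their textbook definitions. The variables (weeks of product experience and task time in seconds), the data, and the function name are hypothetical.

```python
import math


def pearson_r_and_line(x, y):
    """Pearson's r plus the slope and intercept of the least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Hypothetical data: weeks of experience vs. task time in seconds.
weeks = [1, 2, 4, 6, 8, 10]
seconds = [95, 88, 70, 61, 52, 40]

r, slope, intercept = pearson_r_and_line(weeks, seconds)
print(f"r = {r:.2f}, task time = {intercept:.1f} + {slope:.1f} * weeks")
```

A value of r close to -1 or 1 indicates a strong linear relationship, while the slope estimates how much the second variable changes for each unit of the first.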
I hope that this article provides an understandable view into the mechanics and reasoning processes around usability research. Purposefully or not, this often overlaps with the world of statistics.
Molly is a User Experience Research Manager in the financial services industry. She has a master’s degree in communication and has over 20 years of experience in the UX field. She loves learning more about how people think and behave, and off-work enjoys skiing, reading, and eating almost anything, but first and foremost ice cream.