User Research for Machine Learning Systems: A Case Study Walkthrough
Michelle Carney talks us through tactics for building more human, more helpful, and more ethical AI + ML systems.
Words by Michelle Carney, Visuals by Allison Corr
Can you do UX Research on machine learning systems? Can you get feedback from real users before the AI has been built? Can you even test an ML system before you have a production-ready model?
As a researcher working in the increasingly crowded Venn-diagram overlap between UX and ML, I get these questions often.
And my answers are: yes, please, and you should!
The combination of ML and UX can create really powerful products, like Visual Discovery by Pinterest, or Google’s Smart Compose. And interest in the intersection is growing (our Machine Learning and User Experience Meetup has grown to more than 2,000 members).
This case study outlines my best practices for doing research on ML models before they’re production-ready. It involves using data science/ML techniques (like unsupervised learning) to make data-driven design decisions. And it harnesses UX tactics to make AI/ML systems more explainable and transparent to the end user, helping them understand the ways they can better control their experience.
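To make the "unsupervised learning for design decisions" idea concrete, here is a minimal sketch (not from the case study itself) of clustering participants' survey responses to surface rough traveler personas. The feature encoding, participant data, and the hand-rolled k-means are all illustrative assumptions; in practice you would likely reach for a library like scikit-learn.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns cluster centers and per-point labels."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
              for p in points]
    return centers, labels

# Hypothetical survey features per participant:
# (preference for large cities on a 0-1 scale, typical travel-group size)
responses = [(0.9, 1), (0.8, 2), (0.2, 4), (0.1, 5), (0.85, 1), (0.15, 4)]
centers, labels = kmeans(responses, k=2)
```

Clusters like these can anchor design decisions ("solo city travelers" vs. "group nature travelers") before any production model exists.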
Building a travel app with machine learning & user research
Imagine you’re developing a brand new personalized travel app, and of course, it is AI driven. This could take shape in a lot of exciting ways. Maybe it recommends places to travel based on your preferred mode of transit. Maybe it gives recommendations based on who you're traveling with, the length of your trip, or the type of activities available at your destination.
Let’s also imagine you’re working on the app before it’s been developed, so you have a real chance to make a meaningful impact on the product’s future.
When I get this type of opportunity, I break my research approach into three major stages: Generative research, concept evaluation (moderated), and prototype evaluation (unmoderated).
Generative research
The generative research stage is incredibly important for new domains, before you have concepts established. Here, we get to explore the potential of this AI-powered product.
Let’s say you decide the experience is going to have a voice assistant. We might test this by having folks sit in a room and recommend trips to each other. What do we notice about their recommendations, and how do they challenge our assumptions?
Maybe you’d think travel dates would be important—that people would give recommendations based on events or potential weather. But you find that when people recommend trips to another person, they are more likely to ask questions like, “Who are you traveling with?” or “What are some special activities you want to do?” Or maybe you notice that the most popular trips are based on the recommender's past trips.
This feedback is crucial to making sure that you’re meeting your customer’s mental models—how they conceptualize and understand your would-be system—and designing for their expectations about how the system might behave. Maybe you have a really interesting data set that could map the customer’s favorite music to where they should travel—but that isn’t what they are expecting, so it might not be the best approach. In the process of generating, we're setting the strategy for the hows and whys of our experience.
Another way of testing out very early ideas is via the TripTech Method. To summarize this process: Create a three panel storyboard that covers your product’s problem, solution, and resolution. Next, present that storyboard to your target market, one panel at a time, to get feedback on which needs and features are most important. You can also invite users to co-create solutions based on the storyboards you present.
This process allows you to test a bunch of early designs rapidly and understand which ones resonate with customers. You’ll also uncover pain points you might expect to encounter down the line.
Concept evaluation (moderated)
After I have a rough idea of the type of product our customers are expecting, how it would benefit them, and a prioritized list of user needs, I work with the team to create an early concept for evaluation via moderated usability sessions.
It’s important to note that at this stage no ML is required. As a UX researcher, I'm hoping to test the North Star experience and ask, “Is this the right direction?” With a concept (or set of concepts) in hand, this is easier for participants.
This can be done in parallel while your ML team works on building the model. Hopefully, from generative research, you’ve discovered the primary expected inputs and outputs of the system.
Say, for the sake of example, that in this case, we continue with the voice-assistant only system. A good method for evaluation would be Wizard of Oz prototyping. Here’s how it works: Ask your users to go through a flow with a “prototype,” where they ask whatever they might naturally ask a voice assistant. Meanwhile you (the Wizard) play the audio files that the system might reply with. In this way, we're capturing real-time, organic feedback for a would-be feature: the hallmark of "good" UX research.
If we decide to take the approach of testing a truly custom travel experience, we might ask the participants to fill out a survey—as they would for app onboarding. While your sample size may vary based on the product you’re testing, 6-10 participants will generally get you a sufficiently rich data set. Let your participants know before the start of the survey that their input will inform the design.
At this point they’d share how often they travel, where they traveled last, their top and bottom three travel locations, who they travel with, etc. From there, I would mock up custom prototypes for each participant—akin to the output my ML models will eventually produce. The major components and flow for these prototypes should be the same across participants. It’s just the content they’re seeing that is “personalized” to them based on their feedback.
It’s also okay to include recommendations here that would objectively be “wrong.” Machine learning is probabilistic, so it won’t always make recommendations that are 100% correct. What’s important at this stage is to capture those errors so we can design a system that can fail gracefully and get feedback for future iterations.
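The hand-built personalization described above, including a deliberately wrong recommendation, can be sketched in a few lines. Everything here is hypothetical: the destination cards, the `cards_for` helper, and the survey shape are illustrative stand-ins for whatever your prototype actually shows.

```python
# Hypothetical catalog of destination cards, keyed by trip type.
CARDS = {
    "large cities": ["Tokyo", "New York", "Berlin"],
    "nature": ["Banff", "Patagonia", "Yosemite"],
    "beaches": ["Maui", "Algarve", "Phuket"],
}

def cards_for(survey):
    """Pick 'personalized' cards for a static prototype from survey answers."""
    liked = [d for t in survey["top_types"] for d in CARDS.get(t, [])]
    # Seed one deliberately "wrong" card, drawn from the participant's
    # least-preferred trip type, so we can watch how they react when
    # the mock recommender misses.
    disliked = survey.get("bottom_types", [])
    wrong = CARDS[disliked[0]][0] if disliked else None
    return liked[:3] + ([wrong] if wrong else [])

participant = {"top_types": ["large cities"], "bottom_types": ["nature"]}
result = cards_for(participant)  # → ['Tokyo', 'New York', 'Berlin', 'Banff']
```

The flow and components stay identical across participants; only the card content changes, and the planted miss gives you a graceful-failure moment to probe in the session.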
Say that you’ve learned that, unlike with your voice interface, choosing a date is important in a visual interface. So you create a custom prototype with this interface for a potential participant who told you they are excited about visiting large cities.
With ML- and AI-powered systems, you might ask a few specific questions that get at the interactions that will eventually be AI- or ML-powered, like:
Is this what you expected?
How does the model know?
What happens when it is wrong?
How would you change it?
How would it change over time?
I choose to do this as a moderated session because sometimes the prototypes do not work as we expect. At that point, it’s helpful to get the customer’s reaction and expectations in real time—and to make sure we are on the right path to building the best ML/AI powered system for them.
Prototype evaluation (unmoderated)
Now that we've evaluated the concept and iterated on it, and created a working prototype powered by our minimum viable ML model, we have a system that we can share with potential customers.
I prefer to do this method unmoderated. This tends to be closer to launch and it is important to get feedback at scale on the edge cases of your system.
I approach this final evaluation stage with three steps:
Control: Contextual inquiry. How do you currently plan trips?
Experimental: Usability testing of the prototype and model. Try using it to plan trips. Does it work?
Reflection: What did you like? What would you want to improve?
For the control step, I start by requiring three entries around a contextual-inquiry prompt like, “Show us how you currently book travel.”
By having the participants screen capture how they currently solve their problems—using something like a remote qualitative platform—you get a strong understanding (and hopefully validation) of the information you’ve gathered from the generative and early concept stage. This should give you more context about the user goals, understanding, mental models, and pain points in their current system.
It’s important to do this per individual user: when you give them access to the prototype, you need to understand how they solve the problem both with and without AI. You can ask a few survey questions informed by your generative research and early concept feedback, paired with a screen recording, screenshot, or other useful media prompt.
Examples for this case:
Who are you traveling with?
Are you booking travel today or just looking?
Do you already know where you want to go?
What are the most important aspects of booking a trip for you?
How satisfied are you with this experience?
Now it’s time for the experimental step. Give your customers access to the prototype and ask them to try to book the same travel experience.
Is it working how they’d expect? Are there any things that stand out to them or surprise them? How do they think that the system is getting the information?
Ask the same survey questions as the control condition and possibly a few more (e.g. “Did the system make any recommendations you were not expecting?”).
For both the control and our prototype, I normally ask for each user to submit at least three entries with the same prompts. This helps me to really get a robust understanding of how customers are using this system and what their expectations are for it.
Finally, you are able to ask the participants to reflect on their experience. They are now an expert in booking travel. After all, they’ve had to do it at least six times (3x in the control, 3x in the experimental stage).
They should be able to tell you about their expectations and experiences, and how they might want the experience improved in the future. They should also have some insight into how the AI system is performing and what it is doing.
I personally use dscout for this phase of my research. For me, it works like Snapchat meets Qualtrics. I can cast a wide net of participants and invite the specific individuals I want to participate. Participants can answer survey questions about their entries, as well as capture either screen capture videos or selfie/camera videos of what they are actually doing.
It’s a fantastic way to make sure your model is giving the expected outputs at scale, and to see how customers would actually try to use it in real life.
dscout also helps me understand and group answers by major demographic factors, which helps in understanding generational mental models as well.
Additional ML/UX resources:
One of my favorite resources is the People + AI Guidebook—a toolkit featuring contributions from more than 100 Googlers. It includes synthesized findings and distilled best practices to help you make human-centered AI product decisions.
A few of my favorite chapters are:
Mental Models, which helps you understand your user’s point of view of the AI system, their ecosystem, and more.
Feedback and Control, which covers how your system requests and responds to user feedback, and how you can gather feedback to improve future models.
Errors and Failing Gracefully, which acknowledges that because AI is probabilistic, there will always be edge cases where it fails, and shows how to design so the system can improve while still helping the user reach what they intended.
Every UX practitioner working on AI- or ML-powered systems should have the People + AI Guidebook in their back pocket. It has a ton of fantastic information, and it is a living document that's updated as the field progresses!