
AI Needs a Human Touch: Why Researchers Should Lead AI Evals

January 7, 2026

Overview

See how researchers can step into a void to lead human, LLM-as-judge, and code-based evaluations to steer AI products in the right direction.

Contributors

Nathan Reiff

Senior UX Researcher and Product Manager at Dscout

Thumy Phan

Illustrator


In today’s AI-centric technology environment, researchers from a wide variety of industries and companies have one thing in common: we are being asked by leadership how we can use AI to make our work more efficient. How can this technology allow us to do more with less? Scale democratization? Automate research?

While these are worthwhile questions to consider, I believe that we should not only be asking how AI can improve UX research, but also how UX research can improve AI development.

As product teams investigate who should be responsible for prompting and evaluations, I want to make the case for researchers to be deeply involved in this work, if not leading it.

What are AI evals, and why are they important?

Before I get to why and how researchers should be involved in evaluations, let’s get on the same page about what I mean when I say “evaluations”. At the highest level, evaluations are how you measure the quality and effectiveness of your AI system. AI features add a level of unpredictability to your product experience that can’t be QA’d with traditional testing strategies. 

Thus, having robust evaluations in place is critical for maintaining confidence in your product. You won’t be able to eliminate that variability entirely, but you can be confident that the experience will meet user expectations most of the time.

Why should researchers step in?

For some background context, I have been working at Dscout for the past six years: first as a customer-facing research specialist, then as a UXR, and now as a hybrid researcher/product manager on our AI products. Two years ago, when Dscout started developing our AI functionality, the responsibility of evaluations (and some prompt engineering) fell into my hands. 

In the time since, as our prompting and evaluation practices have matured, it has become clear that there is an undeniable benefit to integrating the most humanistic part of the product team (UX research) with, arguably, the least humanistic part (AI engineering).

By doing so, we can create high-quality AI features that add value to user experiences, instead of noise.

Within the modern product team, researchers are perfectly positioned to lead AI development because we…

  • Are experts in user needs
  • Can pair evaluation outcomes with UX research to drive product decisions
  • Can see around corners and create products that solve real problems

Researchers as subject matter experts

In their recent Lenny’s Newsletter article, “Building eval systems that improve your AI product,” Hamel Husain and Shreya Shankar wrote about the importance of a human expert as the starting point for AI evals.

This person, sometimes referred to as the “benevolent dictator”, can use their subject matter expertise to:

  1. Provide guidance for early prompts
  2. Review early AI outputs to identify problem areas to focus on in later evaluations

This subject-matter expert serves as a stand-in for the user before you put your AI features in front of real users, and that perspective then informs the direction of product development. 

The most effective way to leverage a subject-matter expert in AI evaluations is to choose the person most connected to the end-user: the researcher. At Dscout, having a researcher serve as the benevolent dictator was a natural fit because our end-users are also researchers. However, I believe that this principle can extend to any team.

Researchers are viable stand-ins for subject matter experts across the board due to their intimate connection with the user perspective. Injecting this perspective early in the evaluation process ensures that prompt engineering and evaluations are grounded in real user processes and goals from the very start.

Combining evaluations with “traditional” UX research

When it comes to building valuable AI experiences, product decisions should not be made based on UX research or evaluations alone. 

In addition to being natural subject-matter experts on user needs, researchers are well-positioned to bridge the gap between evaluations and traditional research. 

At Dscout, we learned this while working on our first AI features: AI themes and summaries. As we embarked on that journey, we learned a lot very quickly with iterative rounds of human evaluation. 

As the researcher and subject-matter expert, I…

  • Wrote guidance for prompting
  • Reviewed outputs
  • Wrote feedback
  • Iterated on prompts with our engineers to improve outputs

This process allowed us to get close to what we deemed to be a valuable experience for our users. But it wasn’t until we got the features in front of users and paired our evaluations with real user feedback that we were able to determine how these experiences should fit into the larger product environment. 

With this research, we learned about…

  • Our users’ approaches to data analysis
  • How our current AI augmentations could fit into that process
  • Where we could venture in the future to better support their journeys

Researchers help us make more responsible product decisions

It’s no secret that AI carries immense value alongside immense risk: risk to the environment, to our jobs, to our craft, and to our day-to-day fulfillment. With such disruptive technology, product teams should lean on research to help anticipate and reduce these risks. 

And one major risk facing many product teams is the changing nature of the product itself. We’re seeing a huge push to slap AI onto every digital product that exists. All it takes is one glance at San Francisco billboards to see this—every company is now an “AI platform.” 

Research is essential to maintaining a vision of what our products are actually meant to do, what value they're meant to deliver, and exploring how AI can fulfill those promises, instead of the promise of simply “AI”.

How does research step in?

Hopefully, by now, I’ve started to convince you that deep research involvement in AI development is key to creating better AI experiences.

But now you may be asking yourself more tactically—how do we make this happen? I’m lucky to work on a team where research integration into AI development came naturally, but that is likely not the case for all (most?) researchers. 

To start, let’s define three types of evaluations that your product team is likely working on:

  1. Human evaluations: Humans provide feedback on the AI feature/product, which is used to align the application with human preferences via prompt optimization or fine-tuning the model (AKA the qualitative research of AI dev).
  2. LLM-as-a-judge evaluations: Similar to human-based evaluation, but using another LLM to act as a "judge" instead of a human.
  3. Code-based evaluations: Automated, objective, and quantitative assessment of an AI model's performance using pre-defined metrics and a structured dataset.
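
To make the third type concrete, here is a minimal sketch of a code-based evaluation: a few deterministic checks run over a small, structured dataset of source/output pairs. The file format, field names, checks, and thresholds are illustrative assumptions, not a description of any particular team’s pipeline.

    # Minimal sketch of a code-based evaluation: deterministic checks run
    # over a structured dataset of source/output pairs. All field names,
    # checks, and thresholds are illustrative assumptions.
    import json

    def within_length_budget(output: str, max_words: int = 150) -> bool:
        # Assumed requirement: summaries should stay under a word budget.
        return len(output.split()) <= max_words

    def no_refusal_boilerplate(output: str) -> bool:
        # Boilerplate refusals usually signal a prompting problem.
        markers = ("i'm sorry", "as an ai", "i cannot help")
        return not any(m in output.lower() for m in markers)

    def loosely_grounded(output: str, source: str) -> bool:
        # Crude grounding proxy: the output should reuse some source vocabulary.
        output_terms = set(output.lower().split())
        source_terms = set(source.lower().split())
        return len(output_terms & source_terms) / max(len(output_terms), 1) > 0.3

    def run_eval(path: str) -> float:
        # Expects a JSON Lines file with one {"source": ..., "output": ...} object per line.
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        passed = sum(
            within_length_budget(r["output"])
            and no_refusal_boilerplate(r["output"])
            and loosely_grounded(r["output"], r["source"])
            for r in rows
        )
        return passed / len(rows)

    print(f"Pass rate: {run_eval('eval_dataset.jsonl'):.0%}")

The point is less the specific checks than the shape of the workflow: a fixed dataset, objective pass/fail rules, and a single pass rate you can track across prompt changes.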

Researchers can start to get involved in AI development via human evaluations. Not only are researchers great stand-ins for end users, but conducting human evaluations is also quite similar to other forms of qualitative data analysis.

Once you have a dataset and a first draft of a prompt, researchers can simply read AI outputs row-by-row and comment on what is and isn’t going well. 

From there, look at the dataset as a whole to identify patterns of what isn’t working well, and use those problem areas to inform iterations to the AI system. That may include prompt changes or changes to other parts of the system, such as the context or post-processing rules.
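
Even this row-by-row review benefits from a little structure. As a sketch (the column names and tag vocabulary below are hypothetical, not a prescribed format), each reviewed output can get a pass/fail verdict plus one or more problem tags, which you can then tally so the next prompt iteration targets the most common failure first.

    # Hypothetical sketch: tally row-by-row reviewer annotations into
    # problem-area counts to prioritize the next prompt iteration.
    # The CSV columns and tag names are illustrative assumptions.
    import csv
    from collections import Counter

    def failure_counts(path: str) -> Counter:
        counts = Counter()
        with open(path, newline="") as f:
            # Assumed columns: output_id, verdict, tags, notes
            for row in csv.DictReader(f):
                if row["verdict"] == "fail":
                    # A reviewer may apply several tags, e.g. "too_generic;hallucination"
                    counts.update(t.strip() for t in row["tags"].split(";") if t.strip())
        return counts

    for tag, n in failure_counts("human_eval_round_3.csv").most_common():
        print(f"{tag}: {n} failing outputs")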

After conducting many rounds of human evaluations, researchers will be subject-matter experts not only in the user, but also in the AI system as a whole. 

This equips us well to inform scaling to LLM-as-a-judge and code-based evaluations. Though evaluations are not always the most thrilling tasks, researchers are not strangers to tedious, detail-oriented work. 
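
One common way to make that jump is to turn the rubric you developed during human evaluation into the prompt for a judge model. The sketch below assumes an OpenAI-style Python client; the model name, rubric wording, and JSON verdict format are placeholders rather than a recommendation.

    # Hypothetical LLM-as-a-judge sketch: a human-derived rubric becomes the
    # judge prompt, and the judge returns a structured verdict per output.
    # Model name, rubric text, and client setup are illustrative assumptions.
    import json
    from openai import OpenAI

    RUBRIC = (
        "You are reviewing an AI-generated research summary. Mark it PASS only if it "
        "(1) stays faithful to the source data, (2) avoids vague filler, and "
        "(3) would save a researcher time. "
        'Respond as JSON: {"verdict": "pass" or "fail", "reason": "..."}'
    )

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge(source: str, output: str) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Source data:\n{source}\n\nAI summary:\n{output}"},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

Before trusting a judge like this at scale, it’s worth spot-checking its verdicts against the human evaluations you already have, since the judge inherits whatever the rubric fails to say.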

For a deeper dive on the differences between these types of evaluations and when to use each, I’d recommend reading the article I referenced earlier: “Building eval systems that improve your AI product.”

Wrapping it up

At the end of the day, evaluations are really just a new method of research. 

As with all research, this work can be time-consuming and tiresome. But doing it well and with a research-minded focus is vital. It’s how we bring the essential human perspective to these non-human buddies we’re creating, which is the foundation for AI product success, tangible business outcomes, and enthusiastic end-user adoption.

And what’s important to remember above all is that the human perspective is still essential. If we can hold onto it, we can continue to create products that keep us in the driver’s seat—not just for the sake of control, but for the sake of a future where the tools we use bring us as much joy as they do efficiency.
