We asked Supreet Kaur, Data and AI Solutions Architect at Microsoft: How do you measure the performance of an LLM?

Before we get into the metrics, hear directly from the source.

In a conversation with TechLeader, Supreet breaks down why evaluating LLMs demands a fundamentally different approach than traditional ML.

Why measuring AI success is more complex than you think

The rise of large language models (LLMs) has unlocked powerful capabilities in customer support, product guidance, and internal tooling. But it's also raised a difficult question: how do we define success?

“We all came from the ML era where metrics were well defined. But in the world of language models, it’s still evolving,” says Supreet.

Unlike traditional ML, where you can lean on accuracy or loss, LLMs are generative, context-sensitive, and unpredictable. This makes evaluation a strategic leadership challenge as well as a technical exercise.

“You could have a perfectly coherent answer that’s factually wrong. Or a factually accurate one that alienates your user. So, what are you really optimizing for?”

For tech leaders, that means shifting from traditional accuracy scores to a multi-dimensional, strategy-aligned framework. You don’t just need better AI — you need AI that delivers measurable value, ethically and efficiently, at scale.

The four-bucket framework for measuring LLMs

Supreet proposes a four-part, cross-functional framework for defining what “good” looks like in enterprise LLMs.

“If you're not tracking these four in parallel, you're not measuring the real impact,” Supreet emphasizes.

Applying the metrics: From prompt to ROI

For most business leaders, the bottleneck isn't just the model. It’s also the lack of visibility into what’s working.

Ask:

  • Which prompt generates the highest-quality output?
  • How does that output map to conversion, resolution time, or user satisfaction?
  • Where does it degrade and why?

These questions are essential for product owners, revenue leads, and strategy heads.
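
To make those questions measurable, many teams start with nothing more sophisticated than structured interaction logs. The sketch below is our illustration, not Supreet's tooling: it assumes each interaction is logged with a (hypothetical) prompt variant, a quality rating, a resolution flag, and a handle time, then rolls the results up per variant.

  # Illustrative sketch: roll up logged interactions by prompt variant.
  # All field names (variant, quality, resolved, handle_time_s) are hypothetical.
  from collections import defaultdict
  from statistics import mean

  interactions = [
      {"variant": "prompt_v1", "quality": 4, "resolved": True,  "handle_time_s": 180},
      {"variant": "prompt_v1", "quality": 3, "resolved": False, "handle_time_s": 240},
      {"variant": "prompt_v2", "quality": 5, "resolved": True,  "handle_time_s": 120},
      {"variant": "prompt_v2", "quality": 4, "resolved": True,  "handle_time_s": 150},
  ]

  by_variant = defaultdict(list)
  for row in interactions:
      by_variant[row["variant"]].append(row)

  for variant, rows in by_variant.items():
      print(
          variant,
          f"avg quality={mean(r['quality'] for r in rows):.1f}",
          f"resolution rate={sum(r['resolved'] for r in rows) / len(rows):.0%}",
          f"avg handle time={mean(r['handle_time_s'] for r in rows):.0f}s",
      )

Even a toy rollup like this answers the first two questions for the sample data: it shows which prompt variant earns higher ratings and shorter handle times, which is exactly the visibility most teams are missing.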

Pro tips from Supreet: Building a real-world LLM strategy

Here’s what she recommends to every enterprise deploying LLMs at scale.

  1. Start with a North Star metric. Tie every LLM deployment to one defining business goal, be it productivity, asset growth, or reduced handling time. “Pre-ROI vs post-ROI metrics help you track real uplift,” Supreet notes.
  2. Use benchmarking. Compare outputs before and after AI adoption. Focus on measurable lift, not just theoretical capability.
  3. Include ethics in every review. Bias and hallucination risk aren’t just legal concerns; they’re trust risks. Bake ethical review into your model audits.
  4. Invest in human evaluation. Use subject matter experts to grade outputs periodically, especially for high-risk or regulated use cases.
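
As a rough illustration of tips 1 and 2, here is a minimal pre- vs. post-deployment comparison on a couple of candidate North Star metrics. The metric names and figures are hypothetical; the point is the shape of the comparison, not the values.

  # Illustrative sketch: measure uplift between a pre-deployment baseline
  # and post-deployment results. Metric names and numbers are hypothetical.
  pre_rollout = {"avg_handle_time_s": 310, "cases_per_agent_day": 22}
  post_rollout = {"avg_handle_time_s": 245, "cases_per_agent_day": 27}

  for metric in pre_rollout:
      before, after = pre_rollout[metric], post_rollout[metric]
      uplift = (after - before) / before
      print(f"{metric}: {before} -> {after} ({uplift:+.0%})")

The same structure works whether your North Star is productivity, asset growth, or handling time: capture the baseline before rollout, then report uplift against it rather than raw post-deployment numbers.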

Bottom line

You can’t manage what you don’t measure, and you can’t scale what you don’t understand.

Evaluating LLMs is about being more cross-functional, strategic, and aligned with what matters most to the business. So next time you ask your team, “How’s the model performing?” — don’t settle for latency stats or benchmark scores alone.

Ask:

  • What business outcomes are we tracking?
  • Which prompts are performing best?
  • How are we measuring relevance, groundedness, and ethical risk?
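
For the last question, most teams lean on human review, LLM-as-judge scoring, or dedicated evaluation tooling. As a deliberately crude illustration of what a groundedness signal can look like, the sketch below counts how many answer sentences are fully supported, word for word, by the source text; real evaluations are more nuanced, but the idea of scoring answers against their sources is the same.

  # Toy groundedness check: share of answer sentences whose content words
  # all appear in the source text. Example strings are invented.
  import re

  def groundedness(answer: str, source: str) -> float:
      source_words = set(re.findall(r"[a-z']+", source.lower()))
      sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
      supported = 0
      for sentence in sentences:
          words = set(re.findall(r"[a-z']+", sentence.lower()))
          if words and words <= source_words:
              supported += 1
      return supported / len(sentences) if sentences else 0.0

  source = "Premium plans include priority support and a 99.9% uptime commitment."
  answer = "Premium plans include priority support. Refunds are issued within 30 days."
  print(f"groundedness: {groundedness(answer, source):.0%}")  # 50%: the refund claim is unsupported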

Your next AI performance conversation will be sharper and more valuable.