If You Can't Measure It, Don't Ship It

Opeyemi Fabiyi
Jun 8, 2026

A practical framework for building trust in AI analytics before it reaches your stakeholders

There's a moment in every AI analytics project where the team gets excited. The agent is answering questions. The SQL looks right. A few stakeholders see a demo and say, "This is amazing." Someone starts talking about a broader rollout.

This is the most dangerous moment in the entire project.

Not because the agent is bad. It might be genuinely good. The danger is that nobody has measured whether it's actually producing correct answers. The demo looked impressive. The stakeholders are enthusiastic. But between "this looks right" and "this is right" is a gap that most teams never close before they ship.

The Measurement Problem Nobody Talks About

In one implementation, we built an AI analytics agent and tested it against a set of verified business questions with known correct answers. After multiple rounds of context enrichment, the agent appeared to be sitting in the mid-70s on accuracy. Decent, but not production-ready.

Then we looked more carefully at the 23% that were "failing," manually reviewing what the agent generated against the actual data, and discovered that our evaluation methodology was too narrow.

Several of those answers were actually correct. The agent returned the right numbers, the right groupings, the right filters. But the evaluation was marking them as wrong because of superficial differences: the agent named a column TOTAL_APPOINTMENTS while our expected answer called it APPOINTMENT_COUNT. Same values. Same meaning. Different label. Marked as a failure.

Another set of "failures" was rounding differences. Our expected answer showed 33.3%. The agent returned 33.30%. Same number, different decimal precision. Marked as a failure.

When we fixed the evaluation methodology to handle these cases properly, accuracy moved into the low 90s. We hadn't changed the agent at all. We'd changed how we measured it.

That experience shaped how we think about AI analytics evaluation. The quality of your measurement directly determines the quality of your iteration. If the ruler is wrong, every experiment you run produces misleading signals, and you end up spending weeks "fixing" answers that were already correct.

Why Eyeballing Doesn't Scale

The first instinct most teams have is manual review. Run a question, look at the result, decide if it seems right. This works for the first five questions. It breaks completely at scale for three reasons.

  1. First, you can't hold the correct answer in your head for every question. When you're testing 30 or 40 questions across different time periods, metrics, and entities, you need a verified expected result for each one. "That looks about right" isn't a standard. You need to know that the correct utilisation rate for a location last week was 67.4%, and the agent returned 67.4%. Or it returned 68.1%, and you need to investigate why.
  2. Second, manual review doesn't catch regressions. You improve the agent to handle a new question pattern, and in the process, you quietly break three questions that used to work. Without automated testing against the full question set after every change, you discover this weeks later when a stakeholder says "this number used to be right and now it's wrong." That's the worst kind of failure because it destroys the trust you already built.
  3. Third, manual review doesn't compound. Every hour you spend eyeballing results is an hour that produces no reusable artifact. An automated evaluation pipeline runs the same tests every time, tracks results over time, and shows you a trend line of whether the agent is improving or degrading. Manual review gives you a feeling. Automated evaluation gives you evidence.
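The regression point in the list above can be made concrete with a small check that runs after every change. This is a minimal sketch, assuming each evaluation run is stored as a mapping from a question id to a pass/fail flag; the function name and the sample data are hypothetical.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return ids of questions that passed in the previous run but not in the current one.

    A question missing from the current run is treated as a regression,
    on the assumption that a silently dropped test should be noticed.
    """
    return sorted(
        qid for qid, passed in previous.items()
        if passed and not current.get(qid, False)
    )


# Hypothetical results from two evaluation runs (True = pass).
run_before = {"q1": True, "q2": True, "q3": False}
run_after = {"q1": True, "q2": False, "q3": True}

regressions = find_regressions(run_before, run_after)
print(regressions)  # ['q2']
```

Note that overall accuracy is identical in both runs (two of three), which is why a per-question comparison is more informative than the headline number alone.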

What To Actually Measure

Through several implementations, we've developed a framework for what to evaluate. The specific tools and metrics will vary by platform, but the categories are universal.

Category 1: Does the agent produce the right numbers?

This is the most important category, and it's easy to get wrong. The naive approach is to compare the SQL query generated by the agent against an expected query. This fails immediately because there is no single correct query for a given question. A CTE-based query and a subquery-based query can produce identical results. A query that filters in the WHERE clause and one that filters with a CASE WHEN can return the same numbers. Comparing SQL text tells you nothing about correctness.

The right approach is to compare results, not queries. Run the expected SQL. Run the agent's SQL. Compare the actual data that comes back. If both return the same numbers for the same question, the agent is correct regardless of how different the SQL looks.
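Here is a small illustration of comparing results rather than queries, using an in-memory SQLite database from Python's standard library. The table, its columns, and both queries are invented for the example; a real setup would run against the warehouse the agent queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE appointments (location TEXT, status TEXT);
    INSERT INTO appointments VALUES
        ('North', 'completed'), ('North', 'cancelled'),
        ('South', 'completed'), ('South', 'completed');
""")

# Hypothetical expected query, written by an analyst.
expected_sql = """
    SELECT location, COUNT(*) AS appointment_count
    FROM appointments
    WHERE status = 'completed'
    GROUP BY location
"""

# Hypothetical agent query: different structure, different column label.
agent_sql = """
    WITH completed AS (
        SELECT location FROM appointments WHERE status = 'completed'
    )
    SELECT location, COUNT(*) AS total_appointments
    FROM completed
    GROUP BY location
"""


def run_sorted(connection, sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(connection.execute(sql).fetchall())


same_text = expected_sql.strip() == agent_sql.strip()
same_results = run_sorted(conn, expected_sql) == run_sorted(conn, agent_sql)
print(same_text, same_results)  # False True
```

The text comparison fails while the result comparison passes, which is the outcome you want for two queries that answer the same question.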

This sounds simple but the implementation has nuance. Column names might differ. Column ordering might differ. Decimal precision might differ. The agent might include extra diagnostic columns that weren't in the expected output. A serious evaluation methodology handles all of these cases through layered comparison logic, starting with strict matching and progressively relaxing the criteria when the strict comparison fails, so that genuinely correct answers aren't penalized for cosmetic differences.
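One way to sketch the layered comparison is shown below. It is illustrative only: it assumes each result is a `(columns, rows)` pair, covers three layers (strict, column names ignored, floats rounded), and leaves out column reordering and extra diagnostic columns. The layer names and the one-decimal default are assumptions, not a prescribed standard.

```python
def normalize(rows, decimals=None):
    """Return rows in a stable order, optionally rounding numeric values."""
    def clean(value):
        # Treat ints and floats alike so 12 and 12.0 compare and sort consistently.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            value = float(value)
            if decimals is not None:
                value = round(value, decimals)
        return value
    return sorted((tuple(clean(v) for v in row) for row in rows), key=repr)


def compare_results(expected, actual, decimals=1):
    """Compare two (columns, rows) results, relaxing strictness layer by layer.

    Returns the name of the first layer that matches, or "fail".
    """
    exp_cols, exp_rows = expected
    act_cols, act_rows = actual

    values_match = normalize(exp_rows) == normalize(act_rows)
    # Layer 1: identical column names and identical values.
    if exp_cols == act_cols and values_match:
        return "strict"
    # Layer 2: identical values, column names ignored.
    if values_match:
        return "values_only"
    # Layer 3: values equal once numbers are rounded to `decimals` places.
    if normalize(exp_rows, decimals) == normalize(act_rows, decimals):
        return "rounded"
    return "fail"


expected = (["LOCATION", "APPOINTMENT_COUNT"], [("North", 12), ("South", 7)])
renamed = (["LOCATION", "TOTAL_APPOINTMENTS"], [("South", 7), ("North", 12)])
print(compare_results(expected, renamed))  # values_only

expected_pct = (["PCT"], [(33.3,)])
agent_pct = (["PCT"], [(33.3333,)])
print(compare_results(expected_pct, agent_pct))  # rounded
```

Recording which layer produced the match, rather than a bare pass, keeps the relaxations visible when you later audit whether the evaluation has become too lenient.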

What you ultimately want is a single metric that captures functional correctness: the percentage of questions where the agent produces an answer that's equivalent in business meaning, regardless of how the SQL or the result formatting got there. That's the number to report to stakeholders and the primary threshold for deployment readiness.

Category 2: Does the agent follow the rules?

Every dataset has business rules. Exclude internal test records from customer-facing metrics. Always filter to specific statuses for capacity calculations. Use this date function, not that one. Round percentages to one decimal place.

These rules live in the custom instructions you give the agent. Instruction compliance measures whether the generated SQL actually follows them. An agent that produces the right answer but ignores the business rules is fragile. It got lucky this time. Next time, the missing filter or wrong status code will produce a subtly wrong number that nobody catches until it's in a board deck.

We evaluate compliance separately from correctness because they measure different things. Correctness asks "did you get the right answer?" Compliance asks "did you get it the right way?" Both matter for production readiness.
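A rough sketch of a compliance check follows. It assumes the business rules can be approximated as patterns that must appear in the generated SQL. The rule names, the `is_test` column, and the regular expressions are all hypothetical, and pattern matching on SQL text is a heuristic that will miss equivalent phrasings.

```python
import re

# Hypothetical rules: rule name -> pattern the generated SQL must contain.
REQUIRED_PATTERNS = {
    "exclude_test_records": r"is_test\s*=\s*(false|0)",
    "round_pct_one_decimal": r"round\s*\(.*?,\s*1\s*\)",
}


def check_compliance(sql: str, rules: dict[str, str] = REQUIRED_PATTERNS) -> dict[str, bool]:
    """Report, per rule, whether the SQL text contains the required pattern."""
    flags = re.IGNORECASE | re.DOTALL
    return {
        name: re.search(pattern, sql, flags=flags) is not None
        for name, pattern in rules.items()
    }


compliant_sql = """
    SELECT location,
           ROUND(100.0 * SUM(booked) / SUM(capacity), 1) AS utilisation_pct
    FROM slots
    WHERE is_test = FALSE
    GROUP BY location
"""
careless_sql = "SELECT location, SUM(booked) FROM slots GROUP BY location"

print(check_compliance(compliant_sql))
print(check_compliance(careless_sql))
```

Reporting each rule separately, instead of one overall flag, shows which instruction the agent tends to ignore.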

Category 3: Does the agent know what it doesn't know?

An agent that confidently generates SQL for a question it should refuse is more dangerous than one that says "I can't answer that."

Test the agent with out-of-scope questions. Ask it about salary data when it only has operational metrics. Ask it about a field that doesn't exist in the data model. Ask it about a clinic that was never in the system. A well-configured agent should decline these gracefully, explain what it can help with, and suggest relevant alternatives.

If the agent hallucinates an answer to an impossible question, that's a trust-destroying moment when it reaches a stakeholder. You want to discover this in evaluation, not in production.
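An out-of-scope test can be as simple as the sketch below. It assumes the agent can be wrapped as a function that returns generated SQL, or `None` when it declines. That interface and the stub agent are stand-ins for illustration, not any particular platform's API.

```python
from typing import Callable, Optional

# An agent is modelled as: question in, generated SQL out (None means it declined).
Agent = Callable[[str], Optional[str]]


def refusal_failures(agent: Agent, out_of_scope_questions: list[str]) -> list[str]:
    """Return the out-of-scope questions the agent answered instead of declining."""
    return [q for q in out_of_scope_questions if agent(q) is not None]


def stub_agent(question: str) -> Optional[str]:
    """Stand-in agent: declines anything mentioning salary, answers everything else."""
    if "salary" in question.lower():
        return None
    return "SELECT 1"


questions = [
    "What is the average salary by clinic?",
    "Show utilisation for a clinic that was never in the system",
]
failures = refusal_failures(stub_agent, questions)
print(failures)  # ['Show utilisation for a clinic that was never in the system']
```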

Category 4: Is the measurement itself trustworthy?

This is the meta-category that our mid-70s to low-90s experience exposed. Before you trust your evaluation results, you need to validate the evaluation methodology itself. Are there questions being marked as failures that are actually correct? Are there questions being marked as correct that are actually wrong?

Manually run through every failure during the first few evaluation cycles. For each one, ask: is this a genuine agent failure, or is my evaluation too strict? For each pass, ask: is this genuinely correct, or is my evaluation too lenient? Calibrate the measurement tool before you use it to calibrate the agent.
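The manual calibration pass can be summarized by cross-tabulating the evaluator's verdicts against your own review. The sketch below assumes both are recorded as pass/fail flags per question id; the category names and the sample verdicts are made up for illustration.

```python
def calibration_report(auto: dict[str, bool], human: dict[str, bool]) -> dict[str, int]:
    """Cross-tabulate automated verdicts against manual review (True = correct)."""
    report = {"agree_pass": 0, "agree_fail": 0, "false_failure": 0, "false_pass": 0}
    for qid, human_ok in human.items():
        auto_ok = auto[qid]
        if auto_ok and human_ok:
            report["agree_pass"] += 1
        elif not auto_ok and not human_ok:
            report["agree_fail"] += 1
        elif human_ok:
            report["false_failure"] += 1  # evaluator too strict
        else:
            report["false_pass"] += 1  # evaluator too lenient
    return report


# Hypothetical verdicts for four golden questions.
auto_verdicts = {"q1": True, "q2": False, "q3": False, "q4": True}
human_verdicts = {"q1": True, "q2": True, "q3": False, "q4": False}

report = calibration_report(auto_verdicts, human_verdicts)
print(report)
# {'agree_pass': 1, 'agree_fail': 1, 'false_failure': 1, 'false_pass': 1}
```

A nonzero `false_failure` count points to comparison rules that are too strict. A nonzero `false_pass` count points to rules that are too lenient.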

The Two-Layer Approach

We've found that evaluation works best with two layers, each serving a different purpose.

The first layer is deterministic. It compares actual query results using structured rules. Same data, same shape, same numbers: pass. Different results: fail. This layer is fast, free, reproducible, and handles about 70% of questions conclusively. It's the backbone of the evaluation pipeline.

The second layer uses an AI model as an evaluator. For questions where the deterministic comparison can't reach a verdict (the results look different but might be semantically equivalent), a second AI model reviews the question, the expected results, and the agent's results, and scores whether they convey the same business insight.
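A skeletal version of the two layers might look like the sketch below. Two things in it are assumptions made for illustration. First, the rule for what counts as conclusive: identical rows pass, a different row count fails, and anything else is treated as inconclusive. Second, the judge: a real second layer would call a model, but here it is a plain function so the example runs on its own.

```python
from typing import Callable

Rows = list[tuple]
Judge = Callable[[str, Rows, Rows], bool]


def evaluate(question: str, expected: Rows, actual: Rows, judge: Judge) -> tuple[str, str]:
    """Return (verdict, layer): deterministic rules first, the judge only when unsure."""
    # Layer 1: deterministic comparison of the returned rows.
    if sorted(expected) == sorted(actual):
        return "pass", "deterministic"
    if len(expected) != len(actual):
        return "fail", "deterministic"  # assumed rule: different shape is conclusive
    # Layer 2: same shape, different values; ask the judge about equivalence.
    return ("pass" if judge(question, expected, actual) else "fail"), "judge"


def stub_judge(question: str, expected: Rows, actual: Rows) -> bool:
    """Stand-in for a model call: accepts a fraction where a percentage was expected."""
    return all(abs(e[0] - a[0] * 100) < 0.1 for e, a in zip(expected, actual))


print(evaluate("Cancellation rate?", [(33.3,)], [(33.3,)], stub_judge))
# ('pass', 'deterministic')
print(evaluate("Cancellation rate?", [(33.3,)], [(0.333,)], stub_judge))
# ('pass', 'judge')
```

Returning the layer alongside the verdict lets you track what share of questions the deterministic layer settles on its own.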

Building Golden Questions That Actually Test The Agent

Your evaluation is only as good as the questions you test against. A vague question produces ambiguous results. A wrong expected answer produces false failures. A question that tests three things at once tells you nothing when it fails.

Every golden question should pass a simple test: if you gave this question to two different analysts on your team, would they write the same SQL? If not, the question needs to be more specific.

The expected SQL must be validated by actually running it and confirming the results with a domain expert. Do not write expected answers from memory or assumption. If you're not sure the expected result is correct, the question isn't ready.
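One lightweight way to enforce that rule is to make validation part of the golden question record itself. The structure below is a hypothetical sketch; the field names and the sample question are not a required schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoldenQuestion:
    question: str
    expected_sql: str
    expected_rows: Optional[list] = None  # filled in by actually running expected_sql
    reviewed_by: Optional[str] = None     # domain expert who confirmed the rows

    def is_ready(self) -> bool:
        """A question joins the golden set only once it has been run and reviewed."""
        return self.expected_rows is not None and self.reviewed_by is not None


draft = GoldenQuestion(
    question="How many completed appointments did each location have last week?",
    expected_sql="SELECT location, COUNT(*) FROM appointments GROUP BY location",
)
print(draft.is_ready())  # False

draft.expected_rows = [("North", 12), ("South", 7)]  # hypothetical query output
draft.reviewed_by = "operations lead"                # hypothetical reviewer
print(draft.is_ready())  # True
```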

And the golden set should grow over time. Start with 10-15 questions covering the core use cases. Add questions when enrichment reveals new patterns. Add questions when pilot users ask things you didn't anticipate. The production monitoring pipeline (watching real user questions and flagging anomalies) becomes the primary source of new golden questions after deployment.

From Evaluation To Trust

The evaluation pipeline isn't just a quality gate. It's the mechanism that builds stakeholder trust over time.

When we present results to clients, we show the accuracy trend across enrichment iterations, not just a final number. They see where the agent started, what each round of context enrichment changed, and where it landed. A typical progression looks like this: an early baseline in the low double digits before enrichment, climbing through the 70s as we added column descriptions, metric definitions, and business rules, then into the low 90s once the evaluation methodology itself was corrected and the remaining genuine failures were fixed. Each number is backed by a reproducible evaluation run. The stakeholder can see exactly which questions pass and which fail, and what changed between runs. They watch the agent improve with evidence, not promises.

This is fundamentally different from the typical AI demo where someone shows three impressive examples and the audience extrapolates that everything works this well. Evaluation-driven development inverts that. You show the full picture: what works, what doesn't, and what you're doing about the gaps.

The teams that ship AI analytics successfully aren't the ones with the best models or the flashiest demos. They're the ones who do the boring work of creating rich context that gives the agent meaning and, most importantly, build an evaluation mechanism that measures whether the agent is actually right before anyone else has to find out it wasn't.

That investment in evaluation is what separates a demo from a deployment.

*This is the second in a series on building production-ready AI analytics agents. The first article, "The Boring Work That Makes AI Analytics Actually Work," covers the context enrichment methodology that drives agent accuracy. Next, we'll cover "From Pilot to Production: An Honest Account of AI Analytics Deployment": what worked, what broke, and what we'd do differently.*

If your team is exploring AI analytics and wants to understand whether your data foundation and evaluation approach are ready, we offer an AI Readiness Assessment that evaluates your current setup across data quality, semantic context, and organizational readiness.
