Ask, don't judge: interpretable LLM evaluation comes to Trusys

Written by

A single quality score tells you something is wrong, but never what. BinEval breaks evaluation into atomic yes/no questions — so every score comes with its receipts. It's now live in Trusys, alongside G-Eval.

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

How BinEval works?

BinEval turns evaluation from a black-box verdict into a structured diagnostic signal. There are two phases: generating the questions, and answering them. Both are driven by a single, task-agnostic meta-prompt — the same machinery works for summarization, dialogue, or instruction-following, with only your task definition changing.

Phase 1 — Decompose the criteria into questions

Given a task prompt T, a meta-prompt instructs an LLM to perform a two-step decomposition. First, it summarizes the task into an explicit set of requirements — each one a distinct evaluation criterion, such as whether the output preserves a key fact or obeys a formatting constraint. Then it decomposes each requirement into one or more binary questions, where “yes” means the criterion is satisfied and “no” signals a violation. Requirements that bundle several sub-tasks get split into separate, simpler checks.

The questions are organized by dimension. For summarization, for instance, they partition naturally into coherence, consistency, fluency, and relevance.

Phase 2 — Answer each question, then aggregate

An evaluator LLM E takes your input x (a source document, transcript, or instruction) and the output y (a summary, response, or completion), and answers each binary question independently. Every verdict is a 0 or a 1 — and, critically, it comes with a short natural-language explanation. That explanation is what makes the score debuggable: you can see not just the verdict, but the reasoning behind it.

Scoring is then just aggregation. The per-dimension score is the fraction of that dimension's questions answered “yes”; the overall score is the fraction across all questions. Both scores live in [0, 1], where 1 means every criterion is satisfied. Because they're just fractions, they map cleanly onto whatever scale you already report — a 1-to-5 Likert range, a 0-to-100 score — via simple affine scaling. That keeps BinEval comparable with your existing assertions instead of forcing a new convention.

Why decomposition catches what holistic judges miss

The clearest way to see the difference is a real failure case from the paper's factual-consistency study. A summary can be fluent, name the right entities, and describe the right event — and still be wrong in ways a holistic judge sails right past, because surface plausibility gets rewarded as global correctness.

Consistency: a surface-plausible summary with three real errors

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

BinEval decomposes consistency into seven targeted questions. Here's how they land:

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

The numbers behind it

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

Method

UniEval (T5)

G-Eval (GPT-4)

BinEval

SummEval (avg ρ)

0.474

0.514

0.563

Topical-Chat (avg ρ)

0.552

0.575

0.632

QAGS (avg ρ)

0.575

0.611

0.620

Building Secure Tool-Calling Architectures

+0.195 ρ  —  BinEval's gain over UniEval on SummEval consistency — its single largest improvement.

3 benchmarks  —  SummEval, Topical-Chat, and QAGS — summarization, dialogue, and hallucination.

0 training  —  Task-agnostic and training-free. No fine-tuned evaluator to maintain.

Beyond raw correlation, BinEval better matches the shape of human score distributions and avoids the ceiling effects common in prior LLM judges — where everything clusters near the top of the scale and borderline outputs become indistinguishable from clean ones. On QAGS in particular, human consistency ratings are strongly bimodal, and BinEval preserves that spread instead of collapsing toward a single high band. In practice, that means sharper discrimination between a mildly flawed output and a clearly broken one — exactly the distinction that matters when you're ranking candidates.

From diagnosis to self-improvement

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

When to reach for BinEval

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

STRONG FIT

• Factual consistency & hallucination detection

• Format, structure, and schema adherence

• Instruction-following with concrete constraints

• Anywhere you need to know why an output failed

USE WITH CARE

• Inherently holistic, subjective criteria (e.g. overall relevance)

• Tasks where humans tolerate missing minor detail

• Cost-sensitive paths — more questions mean more model calls

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

Stop guessing.

Start measuring.

Join teams building reliable AI with TruEval. Start with a free trial, no credit card required. Get your first evaluation running in under 10 minutes.

Questions about Trusys?

Our team is here to help. Schedule a personalized demo to see how Trusys fits your specific use case.

Book a Demo

Ready to dive in?

Check out our documentation and tutorials. Get started with example datasets and evaluation templates.

Start Free Trial

Free Trial

No credit card required

10 Min

To first evaluation

24/7

Enterprise support

Open mobile menu

Benefits

Specifications

How-to

Contact Us

Learn More

Phone

Ask, don't judge: interpretable LLM evaluation comes to Trusys

Written by

A single quality score tells you something is wrong, but never what. BinEval breaks evaluation into atomic yes/no questions — so every score comes with its receipts. It's now live in Trusys, alongside G-Eval.

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

How BinEval works?

BinEval turns evaluation from a black-box verdict into a structured diagnostic signal. There are two phases: generating the questions, and answering them. Both are driven by a single, task-agnostic meta-prompt — the same machinery works for summarization, dialogue, or instruction-following, with only your task definition changing.

Phase 1 — Decompose the criteria into questions

Given a task prompt T, a meta-prompt instructs an LLM to perform a two-step decomposition. First, it summarizes the task into an explicit set of requirements — each one a distinct evaluation criterion, such as whether the output preserves a key fact or obeys a formatting constraint. Then it decomposes each requirement into one or more binary questions, where “yes” means the criterion is satisfied and “no” signals a violation. Requirements that bundle several sub-tasks get split into separate, simpler checks.

The questions are organized by dimension. For summarization, for instance, they partition naturally into coherence, consistency, fluency, and relevance.

Phase 2 — Answer each question, then aggregate

An evaluator LLM E takes your input x (a source document, transcript, or instruction) and the output y (a summary, response, or completion), and answers each binary question independently. Every verdict is a 0 or a 1 — and, critically, it comes with a short natural-language explanation. That explanation is what makes the score debuggable: you can see not just the verdict, but the reasoning behind it.

Scoring is then just aggregation. The per-dimension score is the fraction of that dimension's questions answered “yes”; the overall score is the fraction across all questions. Both scores live in [0, 1], where 1 means every criterion is satisfied. Because they're just fractions, they map cleanly onto whatever scale you already report — a 1-to-5 Likert range, a 0-to-100 score — via simple affine scaling. That keeps BinEval comparable with your existing assertions instead of forcing a new convention.

Why decomposition catches what holistic judges miss

The clearest way to see the difference is a real failure case from the paper's factual-consistency study. A summary can be fluent, name the right entities, and describe the right event — and still be wrong in ways a holistic judge sails right past, because surface plausibility gets rewarded as global correctness.

Consistency: a surface-plausible summary with three real errors

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

BinEval decomposes consistency into seven targeted questions. Here's how they land:

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

The numbers behind it

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

Method

UniEval (T5)

G-Eval (GPT-4)

BinEval

SummEval (avg ρ)

0.474

0.514

0.563

Topical-Chat (avg ρ)

0.552

0.575

0.632

QAGS (avg ρ)

0.575

0.611

0.620

Building Secure Tool-Calling Architectures

+0.195 ρ  —  BinEval's gain over UniEval on SummEval consistency — its single largest improvement.

3 benchmarks  —  SummEval, Topical-Chat, and QAGS — summarization, dialogue, and hallucination.

0 training  —  Task-agnostic and training-free. No fine-tuned evaluator to maintain.

Beyond raw correlation, BinEval better matches the shape of human score distributions and avoids the ceiling effects common in prior LLM judges — where everything clusters near the top of the scale and borderline outputs become indistinguishable from clean ones. On QAGS in particular, human consistency ratings are strongly bimodal, and BinEval preserves that spread instead of collapsing toward a single high band. In practice, that means sharper discrimination between a mildly flawed output and a clearly broken one — exactly the distinction that matters when you're ranking candidates.

From diagnosis to self-improvement

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

When to reach for BinEval

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

STRONG FIT

• Factual consistency & hallucination detection

• Format, structure, and schema adherence

• Instruction-following with concrete constraints

• Anywhere you need to know why an output failed

USE WITH CARE

• Inherently holistic, subjective criteria (e.g. overall relevance)

• Tasks where humans tolerate missing minor detail

• Cost-sensitive paths — more questions mean more model calls

Why Rate Limit Failures Are So Dangerous

Many organizations still treat rate limit errors as minor API inconveniences.

That assumption is becoming expensive.

In reality, rate limit failures create cascading operational disruption across the enterprise.

Stop guessing.

Start measuring.

Join teams building reliable AI with TruEval. Start with a free trial, no credit card required. Get your first evaluation running in under 10 minutes.

Questions about Trusys?

Our team is here to help. Schedule a personalized demo to see how Trusys fits your specific use case.

Book a Demo

Ready to dive in?

Check out our documentation and tutorials. Get started with example datasets and evaluation templates.

Start Free Trial

Free Trial

No credit card required

10 Min

To first evaluation

24/7

Enterprise support

Ask, don't judge: interpretable LLM evaluation comes to Trusys

Written by

Manish Tewari

Published on

June 30, 2026

A single quality score tells you something is wrong, but never what. BinEval breaks evaluation into atomic yes/no questions — so every score comes with its receipts. It's now live in Trusys, alongside G-Eval.

Modern language models have made generation easy and evaluation hard. They produce fluent, contextually appropriate text across summarization, dialogue, reasoning, and instruction following — but judging the quality of that output is now the bottleneck. Human review is slow and expensive. Lexical metrics like ROUGE and BLEU miss semantic correctness and factuality. And the increasingly popular alternative — asking a strong LLM to act as a judge — tends to return a single opaque score that's difficult to debug.

That opacity is especially costly during iterative development. If a summary receives a mediocre rating, you still don't know why: is the problem factual inconsistency, weak relevance, missing content, or poor fluency? Comparing prompts, models, or decoding strategies needs feedback that is not only accurate but actionable. A scalar score isn't.

This is the gap BinEval closes. It's a new assertion type in the Trusys evaluation platform, built on the method introduced in the research paper "Ask, Don’t Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement" (Cho et al., Capital One AI Foundations; accepted to the 2nd Workshop on Compositional Learning at ICML 2026). We've implemented it so you can drop it into any evaluation the same way you'd reach for G-Eval — and get back not a number, but a verdict you can read.

How BinEval works?

BinEval turns evaluation from a black-box verdict into a structured diagnostic signal. There are two phases: generating the questions, and answering them. Both are driven by a single, task-agnostic meta-prompt — the same machinery works for summarization, dialogue, or instruction-following, with only your task definition changing.

Phase 1 — Decompose the criteria into questions

Given a task prompt T, a meta-prompt instructs an LLM to perform a two-step decomposition. First, it summarizes the task into an explicit set of requirements — each one a distinct evaluation criterion, such as whether the output preserves a key fact or obeys a formatting constraint. Then it decomposes each requirement into one or more binary questions, where “yes” means the criterion is satisfied and “no” signals a violation. Requirements that bundle several sub-tasks get split into separate, simpler checks.

The questions are organized by dimension. For summarization, for instance, they partition naturally into coherence, consistency, fluency, and relevance.

Phase 2 — Answer each question, then aggregate

An evaluator LLM E takes your input x (a source document, transcript, or instruction) and the output y (a summary, response, or completion), and answers each binary question independently. Every verdict is a 0 or a 1 — and, critically, it comes with a short natural-language explanation. That explanation is what makes the score debuggable: you can see not just the verdict, but the reasoning behind it.

Scoring is then just aggregation. The per-dimension score is the fraction of that dimension's questions answered “yes”; the overall score is the fraction across all questions. Both scores live in [0, 1], where 1 means every criterion is satisfied. Because they're just fractions, they map cleanly onto whatever scale you already report — a 1-to-5 Likert range, a 0-to-100 score — via simple affine scaling. That keeps BinEval comparable with your existing assertions instead of forcing a new convention.

Why decomposition catches what holistic judges miss

The clearest way to see the difference is a real failure case from the paper's factual-consistency study. A summary can be fluent, name the right entities, and describe the right event — and still be wrong in ways a holistic judge sails right past, because surface plausibility gets rewarded as global correctness.

Consistency: a surface-plausible summary with three real errors

Source (excerpt): “The Pentagon called the intercept unsafe and unprofessional… The Russian Defense Ministry said the jet was scrambled to identify the aircraft…”

Summary under test: “The U.S. RC-135U was flying over the Baltic Sea when it was intercepted by a Russian SU-27 Flanker. The Pentagon said the Russian jet flew around the plane to identify it.”

BinEval decomposes consistency into seven targeted questions. Here's how they land:

Q1 · N  The summary misattributes Russia's stated purpose (“identify the aircraft”) to the Pentagon.

Q2 · N  The source URL is fabricated — it doesn't appear anywhere in the article.

Q3 · N  The entity role is wrong: the statement belongs to a different party than the one named.

Q4 · Y  Aircraft types (RC-135U, SU-27) match the source.

Q5 · N  The Pentagon's and Russia's accounts of the intercept are conflated.

Q6–7 · Y  The core event is described, with no broader hallucination of scope.

Scores: G-Eval & UniEval (holistic) = 5.0   BinEval (3 of 7 passed) = 1.57   Human reference = 2.0

Holistic judges score this a perfect 5.0 — off from the human rating by three full points — because the summary “looks consistent.” BinEval probes each claim independently, catches four of the errors, and lands at 1.57 on a 1-to-5 scale: within half a point of the human judgment. And every one of those failing questions is a specific, named defect you can act on, whether you're fixing the summarizer or the evaluator prompt itself.

Why this works, technically: a single holistic judgment is a hard, multi-faceted classification problem. Each binary question replaces it with one simple, verifiable property. Aggregating several weakly-correlated binary verdicts also reduces variance — roughly proportionally to 1/N — and forces explicit enumeration of failure modes, which improves recall over a single impression. On consistency, the paper finds complexity reduction alone is enough to drive the largest gain over prior evaluators.

The numbers behind it

Across three established benchmarks with human annotations, BinEval matches or outperforms strong baselines including UniEval and G-Eval — training-free, with no task-specific fine-tuning. The averages below are Spearman rank correlations with human judgments (higher is better):

Method

UniEval (T5)

G-Eval (GPT-4)

BinEval

SummEval (avg ρ)

0.474

0.514

0.563

Topical-Chat (avg ρ)

0.552

0.575

0.632

QAGS (avg ρ)

0.575

0.611

0.620

Building Secure Tool-Calling Architectures

+0.195 ρ  —  BinEval's gain over UniEval on SummEval consistency — its single largest improvement.

3 benchmarks  —  SummEval, Topical-Chat, and QAGS — summarization, dialogue, and hallucination.

0 training  —  Task-agnostic and training-free. No fine-tuned evaluator to maintain.

Beyond raw correlation, BinEval better matches the shape of human score distributions and avoids the ceiling effects common in prior LLM judges — where everything clusters near the top of the scale and borderline outputs become indistinguishable from clean ones. On QAGS in particular, human consistency ratings are strongly bimodal, and BinEval preserves that spread instead of collapsing toward a single high band. In practice, that means sharper discrimination between a mildly flawed output and a clearly broken one — exactly the distinction that matters when you're ranking candidates.

From diagnosis to self-improvement

Because every score is grounded in specific questions, the same signal can improve your prompts automatically. Failing or disagreeing questions are precisely the targeted feedback an optimizer needs — far more useful than a holistic score difference, which tells you results diverged but not on which criteria.

The paper describes two update modes that we mirror in the implementation. In self-update, a model refines its own evaluator prompt using divergence from human judgments. In cross-model update, a stronger reference model guides a weaker target — useful when you're migrating an evaluator between model families and want to preserve behavior. A note-taker model extracts generalized lessons from the disagreements, deduplicates them, and rewrites the relevant prompt sections.

One honest finding worth carrying over: the technique helps most when a model already has the capability and just needs clearer guidance. On format and sentence-level constraints, prompt updates produced double-digit accuracy gains. On tasks that require genuine computation during generation — counting tokens, enforcing ratios — the optimizer correctly diagnoses the problem but can't prompt its way to an ability the model lacks, and over-long prompts eventually hurt. BinEval still gives you an accurate diagnosis; it just can't manufacture a capability that isn't there.

When to reach for BinEval

BinEval is a tool with a clear sweet spot, and we'd rather be precise about it than oversell. It excels wherever quality reduces cleanly to checkable properties — and it's less suited to inherently holistic taste, where forcing exhaustive sub-checks can make the evaluator harsher than a human.

STRONG FIT

• Factual consistency & hallucination detection

• Format, structure, and schema adherence

• Instruction-following with concrete constraints

• Anywhere you need to know why an output failed

USE WITH CARE

• Inherently holistic, subjective criteria (e.g. overall relevance)

• Tasks where humans tolerate missing minor detail

• Cost-sensitive paths — more questions mean more model calls

That last point is a genuine trade-off: BinEval exchanges some efficiency for diagnostic value, since it must generate and answer multiple questions rather than render one judgment. When you need an interpretable, auditable verdict — the kind that holds up in a regulated review — that's a trade worth making. When you only need a fast rough ranking, a holistic assertion may be the better default. Trusys lets you choose per evaluation.

Stop guessing.

Start measuring.

Join teams building reliable AI with Trusys. Start with a free trial, no credit card required. Get your first evaluation running in under 10 minutes.

Questions about Trusys?

Our team is here to help. Schedule a personalized demo to see how Trusys fits your specific use case.

Book a Demo

Ready to dive in?

Check out our documentation and tutorials. Get started with example datasets and evaluation templates.

Start Free Trial

Free Trial

No credit card required

10 Min

to get started

24/7

Enterprise support