Beyond Evals: Why AI Observability Needs Policy Enforcement

Beyond Evals: Why AI Observability Needs Policy Enforcement — and How TruPulse Now Delivers It

2026-05-19

A new capability in TruPulse lets teams define agent policies in plain English and continuously evaluate every production trace against them.

The question your AI observability stack cannot answer

Most AI observability tools are good at telling you what happened. Spans, traces, token counts, latency, tool calls, retries, failure modes — modern observability gives you a high-resolution recording of every step your agent took.

What they cannot tell you is whether what happened was allowed.

That is the question that keeps Chief AI Officers, CISOs, and Heads of Risk awake. Not “did the model produce coherent output?” — that is an eval problem, and the industry has spent two years building evals. The question is: did this AI system behave in a way that is consistent with my organization’s policies, my regulator’s expectations, and my customers’ contractual rights?

These are not the same problem. An LLM can pass every benchmark on your eval suite and still leak PII to a third-party tool. A multi-agent system can score 95% on a faithfulness metric and still take an action it was never authorized to take. A perfectly tuned retrieval pipeline can produce a beautifully grounded answer while pulling from a document the user was never supposed to see.

Evals tell you the model is good. Observability tells you what the system did. Neither tells you whether the system stayed inside the lines your organization has drawn.

That gap is what TruPulse now closes.

Why the gap exists, and why agentic systems make it worse

For traditional software, policy enforcement has a familiar shape. You write a rule — “no PII in logs,” “no admin actions without MFA,” “no cross-region data transfer” — and you encode it as a deterministic check. Static analysis catches it before deployment, runtime guards catch it in production, and your audit trail proves it.

This pattern breaks for AI systems for three reasons.

First, policies are semantic, not syntactic. “Do not give personalized financial advice without a disclaimer” cannot be expressed as a regex. It requires understanding what the model said, in context, in plain language. The same words mean different things depending on the conversation, the user, and the product.

Second, agentic systems are multi-step. A policy violation rarely happens in a single turn. An agent might handle three messages cleanly, retrieve a document correctly, and then on the fourth turn — when it routes a tool call through an external service — quietly include the customer’s account number in the request payload. The violation is not in any single span. It is in the trajectory of the trace.

Third, the people who write policies are not the people who write code. Compliance officers, DPOs, risk leads, and product counsel are the ones who own “what the AI is allowed to do.” They write policies in English: in memos, in handbooks, in regulatory submissions. Asking them to translate those policies into Python assertions, YAML rules, or eval test cases is how policy enforcement quietly fails to happen.

The result is a structural blind spot. Traditional observability captures the trace but cannot reason about whether the trace complies with policy. Pre-deployment evals capture model quality but cannot foresee every production policy violation. Hardcoded guardrails catch the egregious cases — jailbreaks, profanity, obvious PII patterns — but miss the nuanced, contextual, organization-specific rules that actually define enterprise AI governance.

What we built: agentic policy evaluation in TruPulse

TruPulse has always been the real-time behavioral monitoring layer of the Trusys platform — the system that watches what your agents are doing in production and flags anomalies, drift, and quality regressions. With this release, TruPulse adds a new class of monitoring signal: policy compliance.

The mechanic is straightforward, and intentionally so.

You define a policy in plain English. No DSL, no test harness, no eval scaffolding. A compliance officer can type:

“The agent must not include any personally identifiable information — names, phone numbers, account numbers, government IDs, or email addresses — in any tool call to an external service.”

TruPulse evaluates every production trace against that policy. Not against a single input-output pair, but across the entire trace — every span, every tool call, every retrieval, every agent handoff. The system reasons about the trajectory as a whole and produces a pass/fail signal with an explanation grounded in the specific spans that triggered the decision.

The signal flows into the same observability surface as the rest of your telemetry. You see policy compliance alongside latency, cost, and quality metrics. You can alert on it, dashboard it, slice it by tenant or agent or model, and feed it into your audit trail.

The metric we use for this is what we call agentic G-Eval.

Why agentic G-Eval is the right primitive

G-Eval is a well-established evaluation technique in the LLM literature — using a strong language model as a judge, guided by an evaluation rubric and chain-of-thought reasoning, to score the quality of generated text against criteria expressed in natural language. It has become a standard tool for evaluating dimensions that resist deterministic measurement: coherence, helpfulness, faithfulness, factuality.

What we have built extends that primitive in two important directions.

The unit of evaluation is the trace, not the turn. Conventional G-Eval scores a single input-output pair. Agentic G-Eval reasons over the full trace — system prompts, user messages, intermediate agent thoughts, tool calls, tool responses, retrieved context, final responses, and handoffs between agents in multi-agent systems. This is what makes it possible to evaluate trajectory-level policies: “the agent never invoked the submit_payment tool before completing KYC verification,” “no document tagged as restricted was retrieved at any point in this conversation,” “the agent did not commit the company to a specific dollar figure without escalating to a human.”

The input is a policy, not a quality rubric. The judge model is prompted not with “score this from 1 to 5 on helpfulness” but with the organization’s actual policy text, the trace, and an instruction to determine whether the trace as a whole is consistent with the policy. The output is a binary pass/fail, a confidence indicator, and a structured explanation pointing to the specific spans that informed the verdict.

This last property — the explanation — is what makes the system usable in regulated environments. A pass/fail without justification is not auditable. With agentic G-Eval, every policy decision in production comes with a traceable rationale that an auditor, a regulator, or an internal review can follow back to the underlying evidence.

What policies look like in practice

The class of rules this unlocks is broader than most teams initially realize. A few examples drawn from real conversations with TruPulse customers:

Data protection policies. “No personally identifiable information may appear in any tool call to an external API.” “Customer data classified as restricted must not be retrieved or referenced in the agent’s reasoning.” “Health information must never be transmitted to any model hosted outside of our jurisdiction.”

Scope and authorization policies. “The agent must not commit to specific monetary refunds above ₹10,000 without escalating to a human reviewer.” “The agent may only invoke the transfer_funds tool after the user has been authenticated in this session.” “The agent must not provide medical diagnoses; symptom triage with referral to a clinician is permitted.”

Disclosure and disclaimer policies. “Any response involving investment products must include the SEBI-mandated risk disclaimer.” “When the agent provides legal information, it must clearly state that the response is not legal advice.”

Sequence and process policies. “The agent must not call the submit_application tool before all required KYC checks have been completed in earlier turns.” “If the user requests data export under the DPDP Act, the agent must verify identity before invoking the export tool.”

Each of these is a policy that compliance and risk teams already write — in memos, in playbooks, in regulatory submissions. What was missing was an enforcement layer that could read those policies in their native form and apply them, continuously, to every production trace.

Where this matters for compliance

The regulatory direction of travel is clear, and it is global.

The EU AI Act requires providers and deployers of high-risk AI systems to implement post-deployment monitoring proportional to the risks the system poses, with ongoing logging and traceability. “We have an eval suite” is not a monitoring program. Continuous, policy-anchored evaluation of production behavior is.

India’s Digital Personal Data Protection Act imposes purpose-limitation, data-minimization, and consent obligations that, in practice, mean an AI system cannot freely move personal data across components and vendors. Demonstrating compliance requires the ability to show, after the fact, that personal data was handled in line with the consent on record.

The RBI’s evolving guidance on AI in financial services, SEBI’s focus on AI governance in capital markets, IRDAI’s expectations around algorithmic transparency in insurance — each pushes regulated entities toward continuous, auditable evidence that their AI systems behave within defined boundaries. The same logic applies to HIPAA-equivalent regimes in healthcare and to the OWASP LLM Top 10, which specifically calls out sensitive information disclosure (LLM06) and excessive agency (LLM08) as governance-critical failure modes.

Policy-based evaluation in TruPulse is designed to produce exactly the evidence these frameworks demand: a continuous, time-stamped, explanation-grounded record of whether every production interaction was consistent with the organization’s stated rules.

How TruPulse fits with the rest of the Trusys platform

Policy enforcement is most valuable when it is part of a layered governance system, not an isolated capability.

TruGuard sits in the request path and blocks well-defined, high-confidence violations before they reach users or external systems — prompt injection, prohibited content, hard-blocked PII patterns. TruGuard is the synchronous, defensive layer.

TruPulse — with the new policy capability — runs continuously over production traces and surfaces the broader, more contextual policy compliance signal. TruPulse is where you discover the violations that no rule-based filter could have caught, the patterns that emerge across thousands of traces, and the slow drift of agent behavior away from policy over time.

Argus, the Trusys governance intelligence layer, aggregates these signals into an executive view: which policies are firing, where, against which agents and models and tenants, with what severity. Argus is what a Chief AI Officer takes to the board and what a CISO takes to the audit committee.

TruEval continues to handle pre-deployment evaluation — the model quality, the regression tests, the red-team scenarios. Policies and evals are complementary, not substitutes. Evals catch what is wrong with the model. Policies catch what is wrong with the system.

Together, these form a closed loop: define policy, evaluate continuously, block what should be blocked, learn from what is observed, and feed every signal back into governance review.

Getting started

If you are running agentic AI systems in a regulated environment — fintech, healthcare, insurance, public sector, large enterprise — and your current observability stack tells you what your agents did but cannot tell you whether they behaved within policy, this capability is built for you.

You can start with three policies. Write them in plain English, the way you would write them in a memo. Point TruPulse at your existing production traces — through our SDK, our OpenTelemetry integrations, or our MCP server — and within hours you will have a continuous, auditable signal on whether your AI system is operating inside the lines you have drawn.

The most important thing AI observability can tell you is not what happened. It is what should not have.

TruPulse now tells you both.

To see policy-based evaluation in action on your own traces, request a TruPulse walkthrough at trusys.ai or write to us at hello@trusys.ai.

Stop guessing.

Start measuring.

Join teams building reliable AI with TruEval. Start with a free trial, no credit card required. Get your first evaluation running in under 10 minutes.

Questions about Trusys?

Our team is here to help. Schedule a personalized demo to see how Trusys fits your specific use case.

Book a Demo

Ready to dive in?

Check out our documentation and tutorials. Get started with example datasets and evaluation templates.

Start Free Trial

Free Trial

No credit card required

10 Min

To first evaluation

24/7

Enterprise support

Benefits

Specifications

How-to

Learn More

Beyond Evals: Why AI Observability Needs Policy Enforcement — and How TruPulse Now Delivers It

2026-05-19

A new capability in TruPulse lets teams define agent policies in plain English and continuously evaluate every production trace against them.

The question your AI observability stack cannot answer

What they cannot tell you is whether what happened was allowed.

Evals tell you the model is good. Observability tells you what the system did. Neither tells you whether the system stayed inside the lines your organization has drawn.

That gap is what TruPulse now closes.

Why the gap exists, and why agentic systems make it worse

This pattern breaks for AI systems for three reasons.

What we built: agentic policy evaluation in TruPulse

The mechanic is straightforward, and intentionally so.

You define a policy in plain English. No DSL, no test harness, no eval scaffolding. A compliance officer can type:

“The agent must not include any personally identifiable information — names, phone numbers, account numbers, government IDs, or email addresses — in any tool call to an external service.”

The metric we use for this is what we call agentic G-Eval.

Why agentic G-Eval is the right primitive

What we have built extends that primitive in two important directions.

What policies look like in practice

The class of rules this unlocks is broader than most teams initially realize. A few examples drawn from real conversations with TruPulse customers:

Where this matters for compliance

The regulatory direction of travel is clear, and it is global.

How TruPulse fits with the rest of the Trusys platform

Policy enforcement is most valuable when it is part of a layered governance system, not an isolated capability.

Together, these form a closed loop: define policy, evaluate continuously, block what should be blocked, learn from what is observed, and feed every signal back into governance review.

Getting started

The most important thing AI observability can tell you is not what happened. It is what should not have.

TruPulse now tells you both.

To see policy-based evaluation in action on your own traces, request a TruPulse walkthrough at trusys.ai or write to us at hello@trusys.ai.

Stop guessing.

Start measuring.

Join teams building reliable AI with TruEval. Start with a free trial, no credit card required. Get your first evaluation running in under 10 minutes.

Questions about Trusys?

Our team is here to help. Schedule a personalized demo to see how Trusys fits your specific use case.

Book a Demo

Ready to dive in?

Check out our documentation and tutorials. Get started with example datasets and evaluation templates.

Start Free Trial

Free Trial

No credit card required

10 Min

To first evaluation

24/7

Enterprise support

Beyond Evals: Why AI Observability Needs Policy Enforcement — and How TruPulse Now Delivers It

2026-05-19

A new capability in TruPulse lets teams define agent policies in plain English and continuously evaluate every production trace against them.

The question your AI observability stack cannot answer

What they cannot tell you is whether what happened was allowed.

Evals tell you the model is good. Observability tells you what the system did. Neither tells you whether the system stayed inside the lines your organization has drawn.

That gap is what TruPulse now closes.

Why the gap exists, and why agentic systems make it worse

This pattern breaks for AI systems for three reasons.

What we built: agentic policy evaluation in TruPulse

The mechanic is straightforward, and intentionally so.

You define a policy in plain English. No DSL, no test harness, no eval scaffolding. A compliance officer can type:

“The agent must not include any personally identifiable information — names, phone numbers, account numbers, government IDs, or email addresses — in any tool call to an external service.”

The metric we use for this is what we call agentic G-Eval.

Why agentic G-Eval is the right primitive

What we have built extends that primitive in two important directions.

What policies look like in practice

The class of rules this unlocks is broader than most teams initially realize. A few examples drawn from real conversations with TruPulse customers:

Where this matters for compliance

The regulatory direction of travel is clear, and it is global.

How TruPulse fits with the rest of the Trusys platform

Policy enforcement is most valuable when it is part of a layered governance system, not an isolated capability.

Together, these form a closed loop: define policy, evaluate continuously, block what should be blocked, learn from what is observed, and feed every signal back into governance review.

Getting started

The most important thing AI observability can tell you is not what happened. It is what should not have.

TruPulse now tells you both.

To see policy-based evaluation in action on your own traces, request a TruPulse walkthrough at trusys.ai or write to us at hello@trusys.ai.

Stop guessing.

Start measuring.

Join teams building reliable AI with Trusys. Start with a free trial, no credit card required. Get your first evaluation running in under 10 minutes.

Questions about Trusys?

Our team is here to help. Schedule a personalized demo to see how Trusys fits your specific use case.

Book a Demo

Ready to dive in?

Check out our documentation and tutorials. Get started with example datasets and evaluation templates.

Start Free Trial

Free Trial

No credit card required

10 Min

to get started

24/7

Enterprise support