Agent Evaluation

Agent Evaluation generates test cases, runs them against an agent, scores the results across a set of metrics, and identifies failures for remediation.

Core concepts

Environments are named test configurations, each tied to one agent. Scenarios are situations the agent should handle. Personas are simulated user types that make test inputs realistic. Each test case combines a scenario and a persona into an input and an expected outcome.
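One way to picture how these concepts relate is a small data model. The sketch below is illustrative only: the class names, field names, and the build_test_cases helper are assumptions made for this example, not the product's actual schema or API. It pairs every scenario with every persona, which is one plausible way to combine them.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """A situation the agent should handle (hypothetical shape)."""
    name: str
    expected_outcome: str


@dataclass(frozen=True)
class Persona:
    """A simulated user type (hypothetical shape)."""
    name: str
    tone: str


@dataclass(frozen=True)
class AgentTestCase:
    """One scenario/persona pairing with an input and expected outcome."""
    scenario: Scenario
    persona: Persona
    input_text: str
    expected_outcome: str


@dataclass
class Environment:
    """A named test configuration for a single agent (hypothetical shape)."""
    name: str
    agent_id: str
    scenarios: list[Scenario] = field(default_factory=list)
    personas: list[Persona] = field(default_factory=list)

    def build_test_cases(self) -> list[AgentTestCase]:
        # Assumption: every scenario is paired with every persona.
        return [
            AgentTestCase(
                scenario=s,
                persona=p,
                input_text=f"[{p.tone}] {s.name}",
                expected_outcome=s.expected_outcome,
            )
            for s, p in product(self.scenarios, self.personas)
        ]


env = Environment(
    name="support-regression",
    agent_id="agent-001",  # placeholder identifier
    scenarios=[
        Scenario("refund request", "refund policy explained"),
        Scenario("password reset", "reset link sent"),
    ],
    personas=[Persona("new user", "polite"), Persona("power user", "terse")],
)
cases = env.build_test_cases()
```

With two scenarios and two personas, this yields four test cases, each carrying the expected outcome of its scenario.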

Metrics

Task completion
Hallucination
Bias
Toxicity
Faithfulness
Reflection
LLM-as-a-Judge
Tool call accuracy
KB retrieval precision
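
The page does not define how each metric is computed. As a rough illustration, the sketch below shows common textbook-style definitions for two of the more mechanical metrics. Both function names and the tool call scoring rule are assumptions for this example; the product may score these differently.

```python
def retrieval_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved KB documents that are relevant.

    Standard precision: |retrieved AND relevant| / |retrieved|.
    Returns 0.0 when nothing was retrieved.
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def tool_call_accuracy(expected: list[str], actual: list[str]) -> float:
    """Position-wise match rate between expected and actual tool calls.

    Assumption: order matters, and extra or missing calls count as errors.
    """
    if not expected and not actual:
        return 1.0
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / max(len(expected), len(actual))


precision = retrieval_precision(["doc1", "doc2", "doc3", "doc4"], {"doc1", "doc3"})
accuracy = tool_call_accuracy(["search_kb", "create_ticket"], ["search_kb", "close_ticket"])
```

Metrics such as hallucination, bias, or faithfulness usually require a model-based judge rather than a closed-form rule, so they are not sketched here.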

Workflow

Create environment
  -> generate scenarios and personas
  -> generate test cases
  -> select metrics
  -> run tests
  -> review scores
  -> improve the agent
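
The run, score, and review steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions: the agent is any callable from input text to output text, each metric is a function of (expected, actual) returning a score between 0 and 1, and the 0.5 pass threshold is arbitrary. None of these names come from the product.

```python
from typing import Callable

# Assumed shapes: an agent maps input text to output text;
# a metric maps (expected, actual) to a score in [0, 1].
Agent = Callable[[str], str]
Metric = Callable[[str, str], float]


def exact_match(expected: str, actual: str) -> float:
    """Toy metric: 1.0 if the texts match ignoring case and outer whitespace."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_evaluation(
    agent: Agent,
    cases: list[tuple[str, str]],
    metrics: dict[str, Metric],
    threshold: float = 0.5,
) -> list[dict]:
    """Run every (input, expected) case and score it on each selected metric."""
    results = []
    for input_text, expected in cases:
        actual = agent(input_text)
        scores = {name: metric(expected, actual) for name, metric in metrics.items()}
        results.append(
            {
                "input": input_text,
                "expected": expected,
                "actual": actual,
                "scores": scores,
                # A case passes only if every metric meets the threshold.
                "passed": all(score >= threshold for score in scores.values()),
            }
        )
    return results


def toy_agent(input_text: str) -> str:
    # Stand-in for a real agent call.
    return "Reset link sent" if "password" in input_text else "I don't know"


results = run_evaluation(
    toy_agent,
    cases=[("reset my password", "reset link sent"), ("refund my order", "refund issued")],
    metrics={"task_completion": exact_match},
)
failed = [r for r in results if not r["passed"]]
```

The failed list is what the final step works from: each entry keeps the input, the expected and actual outputs, and the per-metric scores needed to decide what to change in the agent.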

Agent Hardening

Agent Hardening analyzes failed test cases and recommends changes to instructions, model selection, tools, memory, or guardrails.
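
To make the idea concrete, the sketch below groups failed test cases by the metric they failed on and attaches a suggested area to look at. The mapping from metric to area is entirely hypothetical and exists only to show the shape of such an analysis; the page does not say how Agent Hardening derives its recommendations.

```python
from collections import Counter

# Hypothetical mapping from a failing metric to an area to review.
# This is an illustrative assumption, not the product's logic.
HYPOTHETICAL_AREAS = {
    "hallucination": "guardrails",
    "tool_call_accuracy": "tools",
    "task_completion": "instructions",
}


def summarize_failures(results: list[dict], threshold: float = 0.5) -> list[tuple[str, int, str]]:
    """Count failures per metric and pair each metric with a suggested area.

    Each result is assumed to carry a "scores" dict of metric name to score.
    Returns (metric, failure_count, suggested_area), most frequent first.
    """
    counts: Counter[str] = Counter()
    for result in results:
        for metric, score in result["scores"].items():
            if score < threshold:
                counts[metric] += 1
    return [
        (metric, n, HYPOTHETICAL_AREAS.get(metric, "review manually"))
        for metric, n in counts.most_common()
    ]


sample_results = [
    {"scores": {"hallucination": 0.2, "tool_call_accuracy": 0.9}},
    {"scores": {"hallucination": 0.1, "tool_call_accuracy": 0.3}},
    {"scores": {"hallucination": 0.8, "tool_call_accuracy": 0.7}},
]
summary = summarize_failures(sample_results)
```

Sorting by failure count puts the most frequent problem first, which is a reasonable default order for deciding which change to try.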
