Agent Evaluation

Agent Eval generates test cases for an agent, runs them, scores the results against the selected metrics, and flags failures for remediation.

Core concepts

  • Environment: a named test configuration for one agent.
  • Scenario: a situation the agent should handle.
  • Persona: a simulated user type that makes test inputs realistic.
  • Test case: a scenario and a persona combined into an input and an expected outcome.
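
One way to picture how these concepts relate is as plain data types. The sketch below is illustrative only: the class names, fields (`user_goal`, `tone`, `agent_id`), and the `build_test_cases` helper are assumptions made for this example, not the product's actual schema or API. It assumes a test case is produced for every scenario and persona pair.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Scenario:
    name: str
    user_goal: str         # hypothetical: what the simulated user wants
    expected_outcome: str  # hypothetical: what the agent should achieve


@dataclass(frozen=True)
class Persona:
    name: str
    tone: str              # hypothetical trait used to flavor the input


@dataclass(frozen=True)
class EvalTestCase:
    scenario: Scenario
    persona: Persona
    input_text: str
    expected_outcome: str


@dataclass
class Environment:
    name: str
    agent_id: str          # one environment targets one agent
    scenarios: list[Scenario] = field(default_factory=list)
    personas: list[Persona] = field(default_factory=list)

    def build_test_cases(self) -> list[EvalTestCase]:
        # Assumption: pair every scenario with every persona.
        return [
            EvalTestCase(s, p, f"({p.tone}) {s.user_goal}", s.expected_outcome)
            for s, p in product(self.scenarios, self.personas)
        ]


env = Environment(
    name="support-regression",
    agent_id="support-agent",
    scenarios=[
        Scenario("refund", "I want a refund for my order", "refund issued"),
        Scenario("tracking", "Where is my package?", "tracking link shared"),
    ],
    personas=[Persona("rushed", "impatient"), Persona("novice", "confused")],
)
cases = env.build_test_cases()
```

Here two scenarios and two personas yield four test cases, each carrying the expected outcome of its scenario.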

Metrics

  • Task completion
  • Hallucination
  • Bias
  • Toxicity
  • Faithfulness
  • Reflection
  • LLM-as-a-Judge
  • Tool call accuracy
  • KB retrieval precision
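
The page does not say how each metric is computed. As a rough illustration, the last two metrics can be read as simple ratios. The functions below are a sketch under that assumption: `retrieval_precision` uses the standard definition of precision (relevant retrieved items divided by all retrieved items), and `tool_call_accuracy` is a hypothetical position-by-position comparison of actual and expected calls. Neither is confirmed to match the product's scoring.

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved knowledge-base items that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)


def tool_call_accuracy(actual_calls, expected_calls):
    """Share of call positions where the actual call equals the expected one.

    Assumption: calls are compared in order, and extra or missing calls
    count against the score.
    """
    if not expected_calls:
        return 1.0 if not actual_calls else 0.0
    matches = sum(1 for a, e in zip(actual_calls, expected_calls) if a == e)
    return matches / max(len(actual_calls), len(expected_calls))


precision = retrieval_precision(["a", "b", "c", "d"], {"a", "c"})
accuracy = tool_call_accuracy(
    [("search", "order 42"), ("email", "customer")],
    [("search", "order 42"), ("refund", "order 42")],
)
```

Metrics such as hallucination, bias, or LLM-as-a-Judge typically need a model-based grader and are not reducible to a ratio like this.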

Workflow

Create environment
  -> generate scenarios and personas
  -> generate test cases
  -> select metrics
  -> run tests
  -> review scores
  -> improve the agent
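
The run, score, and review steps above can be sketched as a small loop. This is a minimal illustration, not the product's API: `run_tests`, the `(input, expected)` tuple shape, the metric signature, and the 0.5 pass threshold are all assumptions made for the example, and `echo_agent` is a stand-in for a real agent.

```python
from typing import Callable

Agent = Callable[[str], str]
Metric = Callable[[str, str], float]  # (output, expected) -> score in [0, 1]


def run_tests(agent: Agent, test_cases, metrics: dict, threshold: float = 0.5):
    """Run each test case, score it with every selected metric,
    and record which metrics fell below the threshold."""
    results = []
    for input_text, expected in test_cases:
        output = agent(input_text)
        scores = {name: metric(output, expected) for name, metric in metrics.items()}
        failed = [name for name, score in scores.items() if score < threshold]
        results.append(
            {"input": input_text, "output": output,
             "scores": scores, "failed_metrics": failed}
        )
    return results


def exact_match(output: str, expected: str) -> float:
    # Toy stand-in for a task-completion metric.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def echo_agent(text: str) -> str:
    # Toy agent that just upper-cases its input.
    return text.upper()


results = run_tests(
    echo_agent,
    [("hello", "HELLO"), ("hi", "bye")],
    {"task_completion": exact_match},
)
failures = [r for r in results if r["failed_metrics"]]
```

The `failures` list is what the final step, improving the agent, would work from.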

Agent Hardening

Agent Hardening analyzes failed test cases and recommends changes to instructions, model selection, tools, memory, or guardrails.
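
As a rough mental model, the analysis can be thought of as grouping failures by metric and pointing each group at one of the change areas listed above. The sketch below is purely illustrative: the metric-to-area mapping is invented for this example, and the source does not describe how Agent Hardening actually derives its recommendations.

```python
from collections import Counter

# Hypothetical mapping from a failed metric to the area most likely to need
# a change. This is an assumption for illustration, not documented behavior.
HYPOTHETICAL_AREA_BY_METRIC = {
    "tool_call_accuracy": "tools",
    "hallucination": "guardrails",
    "kb_retrieval_precision": "memory",
    "task_completion": "instructions",
}


def suggest_focus_areas(failed_results):
    """Count failed metrics per change area, most frequent first."""
    counts = Counter()
    for result in failed_results:
        for metric in result["failed_metrics"]:
            area = HYPOTHETICAL_AREA_BY_METRIC.get(metric, "instructions")
            counts[area] += 1
    return counts.most_common()


focus = suggest_focus_areas([
    {"failed_metrics": ["tool_call_accuracy", "hallucination"]},
    {"failed_metrics": ["tool_call_accuracy"]},
])
```

In this toy run, tool-related failures dominate, so the tool configuration would be the first place to look.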