> ## Documentation Index
> Fetch the complete documentation index at: https://daily-docs-pr-4892.mintlify.site/llms.txt
> Use this file to discover all available pages before exploring further.

# Pipecat Evals

> Behavioral testing for your agents: scripted conversations, semantic assertions, and an LLM judge.

Pipecat Evals is the framework's built-in system for testing agent behavior. You describe a conversation and the behavior you expect, and Pipecat runs it against your real agent (the same pipeline, the same services, the same code) and tells you whether the expectation still holds.

```yaml capital_question.yaml theme={null}
name: capital_question

turns:
  - user: "What is the capital of Germany?"
    expect:
      - event: response
        eval: "the response says the capital of Germany is Berlin"
```

```bash theme={null}
pipecat eval run capital_question.yaml
```

## Why evals matter

Voice agents are probabilistic systems. The same agent can answer differently run to run, and a prompt tweak, a model upgrade, or a service swap can quietly break behavior that used to work: a function that no longer gets called, context that stops carrying across turns, an interruption that derails the conversation. Manual testing catches some of this, but it's slow, unrepeatable, and impractical to run on every change.

Evals make agent behavior testable the way unit tests make code testable:

* **Regression safety**: run your scenarios after every prompt, model, or pipeline change and catch breakage before users do.
* **Fast iteration**: text-mode evals skip STT and TTS entirely, so a full conversation test runs in seconds with no audio service cost.
* **Semantic assertions**: an LLM judge checks meaning ("the response says the capital is Berlin"), not exact strings, so tests don't break when wording changes.
* **A feedback signal for AI coding assistants**: evals give a coding assistant a command it can run and a pass/fail result it can read, closing the loop between writing agent code and verifying it. See [The Eval Loop](/pipecat/evals/the-eval-loop).

Pipecat itself relies on this framework: before every release, an eval suite drives 100+ example agents end to end.

## How it works

Pipecat Evals has two halves:

1. **The eval transport.** Your agent runs unchanged with the eval transport. If your agent uses `create_transport()` and the development runner, this is already built in: start it with `-t eval` and it hosts a local WebSocket server speaking RTVI, instead of connecting to Daily, WebRTC, or telephony.

2. **The eval harness.** The harness connects to that transport as an RTVI client, plays the scenario's user turns (as text, or as synthesized speech in audio mode), collects the events your agent emits, and asserts on them in order: transcriptions, LLM responses, spoken output, function calls, and timing.

When a scenario asserts on meaning rather than exact text, a **judge LLM** evaluates the agent's response against a natural-language criterion. The judge runs locally with [Ollama](https://ollama.com) by default, or against OpenAI or any OpenAI-compatible endpoint.

### Text and audio modes

Every scenario runs in one of two modes:

| Mode               | User input                                               | Agent output                                                    | Best for                                                        |
| ------------------ | -------------------------------------------------------- | --------------------------------------------------------------- | --------------------------------------------------------------- |
| **Text** (default) | Sent as text, bypassing the STT                          | LLM text; TTS is skipped automatically                          | Fast, cheap iteration on prompts, logic, and function calling   |
| **Audio**          | Synthesized by a TTS the harness runs (local by default) | Real synthesized speech, transcribed by an STT the harness runs | True end-to-end coverage of the full STT, LLM, and TTS pipeline |

Text mode exercises your agent's actual pipeline and context handling while skipping the audio services, so it costs nothing in TTS or STT usage and runs fast. Audio mode synthesizes the user's voice, streams it through your agent's real STT, and transcribes the agent's actual spoken audio for judging, catching issues that only surface with real speech (turn detection, homophones, barge-in).

## What you can test

* **Response content**: substring checks (`text_contains`) or semantic judging (`eval`) of the agent's replies.
* **Multi-turn context**: verify the agent remembers earlier turns.
* **Function calling**: assert that specific tools were called, with specific arguments.
* **Interruptions**: barge in mid-response and verify the agent recovers (`send_after`).
* **Latency**: per-event budgets with `within_ms`.
* **Vision**: serve an image when the agent requests one and judge its description.

## YAML or Python

Scenarios are YAML files, so they're easy to write, review, and share. Everything is also available as a library: load and run scenarios programmatically, build them in code, inject a custom judge, or orchestrate whole suites from your own tooling. See [Using the Library](/pipecat/evals/library).

## Requirements

* **Pipecat CLI**: the `pipecat eval` commands ship with the CLI extra: `uv tool install "pipecat-ai[cli]"`. If you've added `pipecat-ai[cli]` to your project instead, run them with `uv run pipecat eval` (just like `uv run bot.py`). The same commands are also available as `python -m pipecat.evals`.
* **A judge LLM** (for `eval:` assertions): Ollama by default (`ollama pull gemma2:9b`), or point the scenario's `judge:` block at OpenAI or any OpenAI-compatible endpoint.
* **Audio services** (audio mode only): the harness needs a TTS to synthesize the user's voice and an STT to transcribe the agent's speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `uv add "pipecat-ai[kokoro,moonshine]"` or `uv add "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren't supported here, which keeps the harness simple.
* **Your agent's own credentials**: the agent under test is your real agent, so it needs the same service API keys it normally would.

## Production evaluation

Pipecat Evals is built for development: fast, local, repeatable, and run on every change. Once your agent is deployed, third-party evaluation platforms complement it with testing and monitoring at production scale:

* **Simulations**: scripted or AI-driven test calls over API, WebSocket, or telephony, exercising multi-turn flows, edge cases, and real phone-network conditions before they reach users.
* **Observability**: continuous evaluation of live traffic, with automated quality scoring of calls and transcripts, and metrics tracked over time to catch quality drift.

<CardGroup cols={2}>
  <Card title="Coval" icon="flask-vial" iconType="duotone" href="/pipecat/evals/platforms/coval">
    AI-native simulation and evaluation platform for voice agents, trusted by
    QA, Engineering, Operations, AI, and Executive teams.
  </Card>

  <Card title="Bluejay" icon="bird" iconType="duotone" href="/pipecat/evals/platforms/bluejay">
    Simulation, observability, and evaluation platform with native Pipecat Cloud
    integration. Supports no-code API, WebSocket, and telephony testing.
  </Card>

  <Card title="Cekura" icon="shield-check" iconType="duotone" href="/pipecat/evals/platforms/cekura">
    Automated testing and monitoring platform with native Pipecat Integration
    for WebRTC/Text based testing and support for Mock Tools, Custom Dynamic
    Variables and more!
  </Card>

  <Card title="Arize" icon="chart-line" iconType="duotone" href="/pipecat/evals/platforms/arize">
    Observability and online evaluation for voice agents. Auto-instrument Pipecat
    with OpenInference (OpenTelemetry) to trace every turn, then run LLM-as-judge
    evals on live traffic in Arize AX or open-source Phoenix.
  </Card>
</CardGroup>

<Note>
  Building an evaluation integration for Pipecat? We welcome contributions to
  this page. Open a PR on the [docs
  repository](https://github.com/pipecat-ai/docs).
</Note>

Pipecat's other building blocks feed into any evaluation workflow: [Metrics](/pipecat/fundamentals/metrics) for TTFB, processing time, and usage; [Saving Transcripts](/pipecat/fundamentals/saving-transcripts) for offline analysis; [OpenTelemetry](/api-reference/server/utilities/opentelemetry) for latency traces; and [Observers](/api-reference/server/utilities/observers/observer-pattern) for custom instrumentation.

## Next steps

<CardGroup cols={2}>
  <Card title="Writing Scenarios" icon="file-pen" iconType="duotone" href="/pipecat/evals/scenarios">
    The full scenario format: turns, expectations, modalities, and the judge.
  </Card>

  <Card title="Eval Suites" icon="list-check" iconType="duotone" href="/pipecat/evals/suites">
    Spawn multiple agents and run many scenarios concurrently from a manifest.
  </Card>

  <Card title="The Eval Loop" icon="arrows-rotate" iconType="duotone" href="/pipecat/evals/the-eval-loop">
    Let a coding assistant write agent code, run evals, and iterate
    automatically until the agent is better.
  </Card>
</CardGroup>
