3. Performance
Performance evaluation measures the non-functional characteristics of your LLM application — how fast, how cheap, how scalable, and how resilient it is under real-world conditions.
Latency
Response time is one of the most visible quality signals to end users. Slow responses degrade UX regardless of accuracy.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Time to First Token (TTFT) | Time from request submission to the first token appearing in the response stream | Perceived responsiveness — users judge “speed” by when they see the first output, not when the full response completes |
| End-to-End Latency | Total time from request to complete response delivery | Overall throughput and SLA compliance; critical for synchronous workflows |
| Throughput | Requests processed per second (RPS) under steady-state load | Capacity planning; determines how many concurrent users the system can serve |
| Streaming Stability | Consistency of token delivery rate during streamed responses | Choppy streaming (bursts then pauses) feels broken even if total latency is acceptable |
| Responsiveness | Time from user action (click, enter) to visible system acknowledgment | Includes network, queue, and pre-processing time — not just LLM inference |
How to Measure
- TTFT: Instrument the streaming callback — timestamp the first
on_tokenevent minus the request timestamp. - E2E Latency: Timestamp at request send and response complete. For multi-step agents, sum per-step latencies and add orchestration overhead.
- Throughput: Run a controlled load test (see Load Testing below) and measure sustained RPS at acceptable latency percentiles (p50, p95, p99).
- Streaming Stability: Measure inter-token intervals during streaming. Flag if standard deviation exceeds a threshold (e.g., >500ms gaps).
Tools
For end-to-end load and latency testing, industry-standard tools like JMeter and k6 are well-suited — they can simulate concurrent users, measure p50/p95/p99 latencies, and stress-test your full pipeline including retrieval, orchestration, and LLM inference.
For chunk-by-chunk streaming latency (TTFT, inter-token intervals, streaming stability), use streamapiperformance — an npm package purpose-built for measuring token-level timing in streamed LLM responses.
Code: examples/performance/latency_evaluator.py — a simple callable-based timing wrapper for measuring E2E latency of any LLM call.
Cost
At production scale, cost becomes one of the most critical performance dimensions. As user volume grows, every unnecessary token, redundant LLM call, and over-fetched context chunk compounds into significant spend. A single query in a multi-agent RAG pipeline can trigger 3–10+ LLM invocations — if you’re not tracking cost per journey, you’re flying blind.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Per Model Call | Token cost (input + output) for a single LLM invocation | Baseline unit cost; varies dramatically between models (GPT-4o vs. GPT-4o-mini vs. open-source) |
| End-to-End Journey Cost | Total cost of resolving a user query, including all LLM calls, retrieval, and tool invocations | Multi-agent and RAG systems often make 3–10+ LLM calls per query; per-call cost alone is misleading |
| Token Usage Efficiency | Ratio of useful output tokens to total tokens consumed (including system prompts, retries, and context) | Bloated system prompts, unnecessary retries, and over-fetched context silently inflate costs |
How to Measure
- Per Model Call: Log
prompt_tokensandcompletion_tokensfrom the API response. Multiply by the model’s per-token pricing. - Journey Cost: Sum all per-call costs across the full agent/RAG pipeline for a single user query. Track as a distribution (p50, p95).
- Token Efficiency:
(output_tokens) / (total_input_tokens + output_tokens). Low efficiency suggests system prompt bloat or over-retrieval.
Context & Memory Efficiency
For RAG and memory-augmented systems, more context is not always better. There’s a quality curve.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Context Size vs Quality Curve | How response quality changes as you increase the number of retrieved chunks (K) | Diminishing returns — going from K=3 to K=10 may add noise without improving accuracy |
| Memory Size vs Relevance Curve | How memory recall quality degrades as the memory store grows | Older or less relevant memories may pollute the context window |
How to Evaluate
- Sweep K: Run the same test set with K=1, 3, 5, 10, 20. Plot accuracy (or faithfulness score) vs K.
- Find the elbow: Identify the point where adding more chunks stops improving quality.
Load Testing
Evaluate system behavior under realistic and peak traffic conditions.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Concurrency Limit | Maximum number of simultaneous requests before latency degrades beyond acceptable thresholds | Capacity planning; determines infrastructure scaling requirements |
| Peak Load Behavior | System behavior at and beyond capacity — does it degrade gracefully or fail catastrophically? | Determines whether the system queues, throttles, or crashes under burst traffic |
How to Test
- Use load testing tools (Locust, k6, Artillery) to simulate concurrent users.
- Ramp from 1 → N concurrent requests. Record latency percentiles (p50, p95, p99) and error rates at each level.
- Identify the concurrency at which p95 latency exceeds your SLA — that’s your effective concurrency limit.
- Push 20% beyond that limit and observe: does the system queue gracefully, return 429s, or crash?
Tool recommendation: Locust is Python-native and easy to script custom LLM request patterns. k6 is better for high-volume HTTP benchmarks with built-in dashboards.
Reliability
How does the system behave when things go wrong?
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Retry Mechanism | Whether failed LLM calls, tool invocations, or retrieval steps are automatically retried with appropriate backoff | Transient failures (rate limits, timeouts) are common; retries prevent unnecessary user-facing errors |
| Graceful Degradation | System behavior when a component fails — does it fall back to a simpler path or fail entirely? | Users prefer a partial answer over a cryptic error; fallback chains maintain UX under failure conditions |
What to Validate
| Scenario | Expected Behavior |
|---|---|
| LLM API returns 429 (rate limited) | Retry with exponential backoff; succeed within 2–3 retries |
| Primary retrieval service is down | Fall back to cached results or a secondary index |
| One agent in a multi-agent chain times out | Orchestrator detects the timeout, skips or retries the step, and returns a partial result with a disclaimer |
| LLM returns unparseable output (malformed JSON) | Retry with a stricter prompt; if still fails, return a structured error to the user |
Capability-Specific Performance
Performance metrics vary significantly by capability. Evaluating RAG latency is a different exercise from evaluating agent loop cost. This section covers performance considerations unique to each architectural capability.
RAG Performance
RAG applications add retrieval and embedding steps that often dominate latency and cost.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Retrieval Latency | Time for the vector database query to return results | Affected by index size, query complexity, and metadata filtering. Typically 50–200ms but can spike with large indexes |
| Indexing Throughput | How fast new documents can be ingested, chunked, embedded, and indexed | Determines how fresh your knowledge base can be. Batch vs streaming ingestion have very different profiles |
| Embedding Cost | Cost of embedding queries and documents | Embedding models are cheap per call but expensive at scale—millions of documents or high query volume can dominate your bill |
| Chunk Size vs Accuracy Tradeoff | How chunk size affects retrieval accuracy and token usage | Smaller chunks improve precision but increase retrieval calls and context assembly complexity. Larger chunks reduce calls but may include noise. Profile this tradeoff empirically for your data |
Agent Performance
Agents multiply costs and latency by the number of turns. A 10-turn agent is roughly 10x the cost and latency of a single call.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Per-Turn Latency | Time for each individual agent turn (model inference + tool execution) | Total task time equals the sum of turn latencies. Track both individual turns and cumulative task duration |
| Cumulative Turn Cost | Total token cost across all turns in a task | Grows linearly (or worse) with turn count. Multi-step agents can consume 10x+ the tokens of a single call |
| Multi-Agent Coordination Overhead | Latency and token cost added by handoffs and shared-state synchronization | When multiple agents collaborate, measure the overhead of coordination itself, separate from the work done |
| Tool Call Parallelism Gains | Latency reduction from running independent tool calls concurrently | Agents that issue multiple independent tool calls in parallel can dramatically reduce latency. Measure the speedup from parallel vs sequential execution |
Tool Execution Performance
Tool calls are often the latency bottleneck in agent systems. A slow external API can dominate total response time.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tool Call Latency | Time from issuing a tool call to receiving the result | Profile each tool individually—some take milliseconds, others seconds. Slow tools are candidates for caching or timeouts |
| Sequential vs Parallel Execution | Latency difference between running tools one at a time vs concurrently | Measure the gap between sequential and parallel execution for your common tool combinations |
| External API Rate Limits | How close you run to provider rate limits | External services impose rate limits that can bottleneck your agent. Monitor utilization and implement backoff/queuing before you hit them |
Multimodal Performance
Non-text modalities have distinct performance profiles: images consume many tokens, audio requires real-time processing, and generated media is slow.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Audio/Image Processing Latency | Processing time for STT, TTS, and vision model inference | Audio processing needs to be near real-time for voice agents; image processing adds seconds per image |
| Token Cost for Visual Inputs | Token consumption for image inputs | Images consume significantly more tokens than equivalent text descriptions. A single high-res image can use 1000+ tokens. Budget visual token costs carefully |
| Streaming Latency for Voice | Mouth-to-ear latency (time from user finishing speaking to hearing first word of response) | For voice agents, target under 500ms for natural conversation. Higher latency breaks conversational flow |
← Previous: 2. Accuracy · Next: 4. Safety →