3. Performance

Performance evaluation measures the non-functional characteristics of your LLM application — how fast, how cheap, how scalable, and how resilient it is under real-world conditions.

Latency

Response time is one of the most visible quality signals to end users. Slow responses degrade UX regardless of accuracy.

Metric	What It Measures	Why It Matters
Time to First Token (TTFT)	Time from request submission to the first token appearing in the response stream	Perceived responsiveness — users judge “speed” by when they see the first output, not when the full response completes
End-to-End Latency	Total time from request to complete response delivery	Total wait time and SLA compliance; critical for synchronous workflows
Throughput	Requests processed per second (RPS) under steady-state load	Capacity planning; determines how many concurrent users the system can serve
Streaming Stability	Consistency of token delivery rate during streamed responses	Choppy streaming (bursts then pauses) feels broken even if total latency is acceptable
Responsiveness	Time from user action (click, enter) to visible system acknowledgment	Includes network, queue, and pre-processing time — not just LLM inference

How to Measure

TTFT: Instrument the streaming callback. TTFT is the timestamp of the first on_token event minus the request timestamp.
E2E Latency: Record timestamps at request send and at response completion; the difference is the E2E latency. For multi-step agents, sum per-step latencies and add orchestration overhead.
Throughput: Run a controlled load test (see Load Testing below) and measure sustained RPS at acceptable latency percentiles (p50, p95, p99).
Streaming Stability: Measure inter-token intervals during streaming. Flag a response if the standard deviation of those intervals exceeds a threshold, or if any single gap is too long (e.g., >500ms).
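The TTFT and streaming-stability measurements above can be sketched as follows. This is a minimal sketch, assuming the streamed response is exposed as a Python iterable of tokens rather than an on_token callback; `measure_stream`, `fake_stream`, and the 0.5s default threshold are illustrative names and values, not part of any specific SDK.

```python
import statistics
import time
from typing import Iterable, Iterator, List


def measure_stream(tokens: Iterable[str], request_ts: float,
                   gap_threshold_s: float = 0.5) -> dict:
    """Consume a token stream and report TTFT plus inter-token gap statistics."""
    arrival_times: List[float] = []
    for _ in tokens:
        # Timestamp each token as it arrives from the stream.
        arrival_times.append(time.perf_counter())
    if not arrival_times:
        raise ValueError("stream produced no tokens")
    # Inter-token intervals: difference between consecutive arrival times.
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    return {
        "ttft_s": arrival_times[0] - request_ts,
        "e2e_s": arrival_times[-1] - request_ts,
        "gap_stdev_s": statistics.pstdev(gaps) if gaps else 0.0,
        "max_gap_s": max(gaps, default=0.0),
        "stalls": sum(gap > gap_threshold_s for gap in gaps),
    }


def fake_stream(delays_s: List[float]) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming response."""
    for index, delay in enumerate(delays_s):
        time.sleep(delay)
        yield f"tok{index}"
```

To use it, capture `time.perf_counter()` immediately before sending the request and pass that value as `request_ts` together with the stream. Reporting both the standard deviation and the count of long gaps helps separate generally jittery streams from streams with a few long stalls.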

Tools

For end-to-end load and latency testing, industry-standard tools like JMeter and k6 are well-suited — they can simulate concurrent users, measure p50/p95/p99 latencies, and stress-test your full pipeline including retrieval, orchestration, and LLM inference.

For chunk-by-chunk streaming latency (TTFT, inter-token intervals, streaming stability), use streamapiperformance — an npm package purpose-built for measuring token-level timing in streamed LLM responses.

Code: examples/performance/latency_evaluator.py — a simple callable-based timing wrapper for measuring E2E latency of any LLM call.
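The referenced file is not reproduced here. As an independent illustration of the idea, a callable-based timing wrapper might look like the sketch below; `time_call` and `latency_percentiles` are hypothetical names, and the percentile method (inclusive interpolation) is one reasonable choice among several.

```python
import statistics
import time
from typing import Any, Callable, Dict, List, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def latency_percentiles(samples_s: List[float]) -> Dict[str, float]:
    """Summarize latency samples as p50/p95/p99 (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A typical pattern is to wrap each LLM call with `time_call`, collect the elapsed values over a test set, and report the percentiles rather than the mean, since latency distributions tend to have long tails.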

Cost

At production scale, cost becomes one of the most critical performance dimensions. As user volume grows, every unnecessary token, redundant LLM call, and over-fetched context chunk compounds into significant spend. A single query in a multi-agent RAG pipeline can trigger 3–10+ LLM invocations — if you’re not tracking cost per journey, you’re flying blind.

Metric	What It Measures	Why It Matters
Per Model Call	Token cost (input + output) for a single LLM invocation	Baseline unit cost; varies dramatically between models (GPT-4o vs. GPT-4o-mini vs. open-source)
End-to-End Journey Cost	Total cost of resolving a user query, including all LLM calls, retrieval, and tool invocations	Multi-agent and RAG systems often make 3–10+ LLM calls per query; per-call cost alone is misleading
Token Usage Efficiency	Ratio of useful output tokens to total tokens consumed (including system prompts, retries, and context)	Bloated system prompts, unnecessary retries, and over-fetched context silently inflate costs

How to Measure

Per Model Call: Log prompt_tokens and completion_tokens from the API response. Multiply by the model’s per-token pricing.
Journey Cost: Sum all per-call costs across the full agent/RAG pipeline for a single user query. Track as a distribution (p50, p95).
Token Efficiency: output_tokens / (total_input_tokens + output_tokens), where total_input_tokens includes system prompts, retrieved context, and any retried calls. Low efficiency suggests system prompt bloat or over-retrieval.
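The three calculations above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `CallUsage` is a hypothetical record holding the prompt_tokens and completion_tokens values logged from each API response, and the per-1K-token prices are placeholders passed in by the caller, not real pricing for any model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallUsage:
    """Token counts logged from one LLM API response (hypothetical record)."""
    prompt_tokens: int
    completion_tokens: int


def call_cost(usage: CallUsage, input_price_per_1k: float,
              output_price_per_1k: float) -> float:
    """Cost of one model call: tokens times the per-token price."""
    return (usage.prompt_tokens / 1000 * input_price_per_1k
            + usage.completion_tokens / 1000 * output_price_per_1k)


def journey_cost(calls: List[CallUsage], input_price_per_1k: float,
                 output_price_per_1k: float) -> float:
    """Sum of per-call costs for every LLM call made to answer one query."""
    return sum(call_cost(c, input_price_per_1k, output_price_per_1k)
               for c in calls)


def token_efficiency(calls: List[CallUsage]) -> float:
    """output_tokens / (total_input_tokens + output_tokens) across a journey."""
    total_in = sum(c.prompt_tokens for c in calls)
    total_out = sum(c.completion_tokens for c in calls)
    total = total_in + total_out
    return total_out / total if total else 0.0
```

For a journey with three calls using (1200, 150), (3000, 400), and (800, 250) prompt and completion tokens, the efficiency is 800 / 5800, or roughly 0.14. Collecting `journey_cost` per query then lets you report the distribution (p50, p95) rather than a single average.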

Context & Memory Efficiency

For RAG and memory-augmented systems, more context is not always better. There’s a quality curve.

Metric	What It Measures	Why It Matters
Context Size vs Quality Curve	How response quality changes as you increase the number of retrieved chunks (K)	Diminishing returns — going from K=3 to K=10 may add noise without improving accuracy
Memory Size vs Relevance Curve	How memory recall quality degrades as the memory store grows	Older or less relevant memories may pollute the context window

How to Evaluate

Sweep K: Run the same test set with K=1, 3, 5, 10, 20. Plot accuracy (or faithfulness score) vs K.
Find the elbow: Identify the point where adding more chunks stops improving quality.
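One simple way to automate the sweep and the elbow search is sketched below. It assumes you already have an `evaluate(k)` function that runs your test set with K retrieved chunks and returns a quality score; that function, the `min_gain` threshold, and the scores in the usage note are all illustrative. The elbow rule used here (stop at the first K where the next step gains less than `min_gain`) is a simplification, not the only valid definition.

```python
from typing import Callable, Dict, Iterable


def sweep_k(evaluate: Callable[[int], float], ks: Iterable[int]) -> Dict[int, float]:
    """Run the evaluation once per K and collect the quality scores."""
    return {k: evaluate(k) for k in ks}


def find_elbow(scores_by_k: Dict[int, float], min_gain: float = 0.01) -> int:
    """Return the first K after which the next tested K adds less than min_gain."""
    ks = sorted(scores_by_k)
    if not ks:
        raise ValueError("no scores provided")
    for current, following in zip(ks, ks[1:]):
        if scores_by_k[following] - scores_by_k[current] < min_gain:
            return current
    # Quality was still improving at the largest K tested.
    return ks[-1]
```

With made-up scores of 0.62, 0.78, 0.83, 0.835, and 0.81 for K = 1, 3, 5, 10, and 20, `find_elbow` returns 5: going from 5 to 10 chunks gains only 0.005, and K = 20 is worse. It is still worth plotting the curve, since noisy scores can trigger an early stop.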

Load Testing

Evaluate system behavior under realistic and peak traffic conditions.

Metric	What It Measures	Why It Matters
Concurrency Limit	Maximum number of simultaneous requests before latency degrades beyond acceptable thresholds	Capacity planning; determines infrastructure scaling requirements
Peak Load Behavior	System behavior at and beyond capacity — does it degrade gracefully or fail catastrophically?	Determines whether the system queues, throttles, or crashes under burst traffic

How to Test

Use load testing tools (Locust, k6, Artillery) to simulate concurrent users.
Ramp from 1 → N concurrent requests. Record latency percentiles (p50, p95, p99) and error rates at each level.
Identify the concurrency at which p95 latency exceeds your SLA — that’s your effective concurrency limit.
Push 20% beyond that limit and observe: does the system queue gracefully, return 429s, or crash?

Tool recommendation: Locust is Python-native, which makes it easy to script custom LLM request patterns. k6 is better suited to high-volume HTTP benchmarks and includes built-in dashboards.
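For a quick check before setting up a dedicated tool, the ramp procedure above can be approximated with the standard library. This is a minimal sketch, not a replacement for Locust or k6: `send` is a hypothetical async callable that issues one request to your application, and the function names are illustrative.

```python
import asyncio
import statistics
import time


async def run_level(send, concurrency, requests_per_worker=5):
    """Run `concurrency` workers in parallel; summarize p95 latency and errors."""
    latencies = []
    errors = 0

    async def worker():
        nonlocal errors
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            try:
                await send()
            except Exception:
                errors += 1
            else:
                latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    if len(latencies) >= 2:
        p95 = statistics.quantiles(latencies, n=100, method="inclusive")[94]
    else:
        p95 = latencies[0] if latencies else float("inf")
    return {"concurrency": concurrency, "p95_s": p95,
            "error_rate": errors / (concurrency * requests_per_worker)}


def first_sla_breach(levels, sla_s):
    """Lowest tested concurrency whose p95 exceeds the SLA, or None."""
    for level in sorted(levels, key=lambda r: r["concurrency"]):
        if level["p95_s"] > sla_s:
            return level["concurrency"]
    return None


def overload_target(limit, overshoot_pct=20):
    """Concurrency to test beyond the limit (rounded up)."""
    return (limit * (100 + overshoot_pct) + 99) // 100
```

Call `run_level` once per step of the ramp (for example 1, 5, 10, 20, 40), pass the collected results to `first_sla_breach`, then run one more level at `overload_target(limit)` and inspect the error rate and status codes.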

Reliability

How does the system behave when things go wrong?

Metric	What It Measures	Why It Matters
Retry Mechanism	Whether failed LLM calls, tool invocations, or retrieval steps are automatically retried with appropriate backoff	Transient failures (rate limits, timeouts) are common; retries prevent unnecessary user-facing errors
Graceful Degradation	System behavior when a component fails — does it fall back to a simpler path or fail entirely?	Users prefer a partial answer over a cryptic error; fallback chains maintain UX under failure conditions

What to Validate

Scenario	Expected Behavior
LLM API returns 429 (rate limited)	Retry with exponential backoff; succeed within 2–3 retries
Primary retrieval service is down	Fall back to cached results or a secondary index
One agent in a multi-agent chain times out	Orchestrator detects the timeout, skips or retries the step, and returns a partial result with a disclaimer
LLM returns unparseable output (malformed JSON)	Retry with a stricter prompt; if it still fails, return a structured error to the user
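The retry and fallback behaviors in the table can be sketched as two small helpers. This is a minimal sketch under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on a 429, the delay values are illustrative, and the backoff uses full jitter (a random delay up to an exponentially growing cap), which is one common variant.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_retry(fn, max_retries=3, base_delay_s=0.5, max_delay_s=8.0,
                    retry_on=(RateLimitError, TimeoutError), sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the error to the caller.
            # The cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))


def with_fallback(primary, fallback):
    """Return (result, used_fallback); use the fallback if the primary fails."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True
```

The `sleep` parameter is injectable so that a reliability test can assert on the number and size of backoff delays without actually waiting. The `used_fallback` flag lets the caller attach a disclaimer when it serves cached or partial results.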

Capability-Specific Performance

Performance metrics vary significantly by capability. Evaluating RAG latency is a different exercise from evaluating agent loop cost. This section covers performance considerations unique to each architectural capability.

RAG Performance

RAG applications add retrieval and embedding steps that often dominate latency and cost.

Metric	What It Measures	Why It Matters
Retrieval Latency	Time for the vector database query to return results	Affected by index size, query complexity, and metadata filtering. Typically 50–200ms but can spike with large indexes
Indexing Throughput	How fast new documents can be ingested, chunked, embedded, and indexed	Determines how fresh your knowledge base can be. Batch vs streaming ingestion have very different profiles
Embedding Cost	Cost of embedding queries and documents	Embedding models are cheap per call but expensive at scale—millions of documents or high query volume can dominate your bill
Chunk Size vs Accuracy Tradeoff	How chunk size affects retrieval accuracy and token usage	Smaller chunks improve precision but increase retrieval calls and context assembly complexity. Larger chunks reduce calls but may include noise. Profile this tradeoff empirically for your data
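To see which stage actually dominates latency in a RAG request, it helps to time each stage separately. The sketch below shows one way to do that; `StageTimer` and the stage names are hypothetical, and you would wrap your own embedding, retrieval, and generation calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.durations_s = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed = time.perf_counter() - start
            self.durations_s[name] = self.durations_s.get(name, 0.0) + elapsed

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.durations_s.values())
        return {name: (d / total if total else 0.0)
                for name, d in self.durations_s.items()}
```

In use, each step goes inside its own block, for example `with timer.stage("retrieve"): ...`, and `timer.shares()` then shows how the request time splits across embedding, retrieval, and generation.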

Agent Performance

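Agents multiply cost and latency by the number of turns. A 10-turn agent is at least roughly 10x the cost and latency of a single call, and often more, because each turn typically resends the growing conversation history as input.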

Metric	What It Measures	Why It Matters
Per-Turn Latency	Time for each individual agent turn (model inference + tool execution)	Total task time equals the sum of turn latencies. Track both individual turns and cumulative task duration
Cumulative Turn Cost	Total token cost across all turns in a task	Grows linearly (or worse) with turn count. Multi-step agents can consume 10x+ the tokens of a single call
Multi-Agent Coordination Overhead	Latency and token cost added by handoffs and shared-state synchronization	When multiple agents collaborate, measure the overhead of coordination itself, separate from the work done
Tool Call Parallelism Gains	Latency reduction from running independent tool calls concurrently	Agents that issue multiple independent tool calls in parallel can dramatically reduce latency. Measure the speedup from parallel vs sequential execution
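The "linearly (or worse)" growth in cumulative turn cost is easy to see with a small token model. The sketch below assumes the full history (prior outputs and tool results) is resent as input on every turn, with no caching, truncation, or summarization; the function name and the token counts in the usage note are illustrative.

```python
def agent_token_totals(turns, system_tokens, output_tokens_per_turn,
                       tool_result_tokens_per_turn):
    """Total tokens for an agent task when each turn resends the full history."""
    history = 0
    total_input = 0
    total_output = 0
    for _ in range(turns):
        # Input for this turn: system prompt plus everything accumulated so far.
        total_input += system_tokens + history
        total_output += output_tokens_per_turn
        # The model output and the tool result both join the history.
        history += output_tokens_per_turn + tool_result_tokens_per_turn
    return {"input": total_input, "output": total_output,
            "total": total_input + total_output}
```

With a 500-token system prompt, 100 output tokens, and 200 tool-result tokens per turn, a single call uses 600 tokens while a 10-turn task uses 19,500, about 32x rather than 10x. Under these assumptions input tokens grow quadratically with turn count, which is why tracking cumulative cost per task matters more than per-turn cost.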

Tool Execution Performance

Tool calls are often the latency bottleneck in agent systems. A slow external API can dominate total response time.

Metric	What It Measures	Why It Matters
Tool Call Latency	Time from issuing a tool call to receiving the result	Profile each tool individually—some take milliseconds, others seconds. Slow tools are candidates for caching or timeouts
Sequential vs Parallel Execution	Latency difference between running tools one at a time vs concurrently	Independent tools run concurrently should finish in roughly the time of the slowest call rather than the sum of all calls. Measure the gap for your common tool combinations
External API Rate Limits	How close you run to provider rate limits	External services impose rate limits that can bottleneck your agent. Monitor utilization and implement backoff/queuing before you hit them
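Measuring the sequential-versus-parallel gap can be sketched with asyncio. This is a minimal sketch in which `fake_tool` simulates an I/O-bound tool call with a sleep; the tool names and delays are made up, and real gains depend on the tools being independent and on provider rate limits.

```python
import asyncio
import time


async def fake_tool(name, delay_s):
    """Hypothetical tool call; the sleep stands in for network or API time."""
    await asyncio.sleep(delay_s)
    return name


async def run_sequential(calls):
    # Each call starts only after the previous one has finished.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    # All calls start together; total time approaches the slowest call.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


async def measure(runner, calls):
    start = time.perf_counter()
    results = await runner(calls)
    return list(results), time.perf_counter() - start


async def compare(calls):
    """Time both strategies on the same calls and report the speedup."""
    _, sequential_s = await measure(run_sequential, calls)
    _, parallel_s = await measure(run_parallel, calls)
    return {"sequential_s": sequential_s, "parallel_s": parallel_s,
            "speedup": sequential_s / parallel_s}
```

Running `asyncio.run(compare([("search", 0.05), ("weather", 0.1), ("calendar", 0.05)]))` should show the sequential run taking about the sum of the delays and the parallel run about the longest one. Substituting your real tool coroutines for `fake_tool` gives the speedup for a given tool combination.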

Multimodal Performance

Non-text modalities have distinct performance profiles: images consume many tokens, audio requires real-time processing, and generated media is slow.

Metric	What It Measures	Why It Matters
Audio/Image Processing Latency	Processing time for STT, TTS, and vision model inference	Audio processing needs to be near real-time for voice agents; image processing adds seconds per image
Token Cost for Visual Inputs	Token consumption for image inputs	Images consume significantly more tokens than equivalent text descriptions. A single high-res image can use 1000+ tokens. Budget visual token costs carefully
Streaming Latency for Voice	Mouth-to-ear latency (time from user finishing speaking to hearing first word of response)	For voice agents, target under 500ms for natural conversation. Higher latency breaks conversational flow
