diff --git a/AGENTS.md b/AGENTS.md index 63d33e1023..4b14a4b736 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -40,5 +40,9 @@ These are considered priority 0 issues for this repo, in addition to the normal ## Recent Learnings +- **`uv run` can inherit the wrong virtualenv in this repo** -> Clear `VIRTUAL_ENV` (for example `env -u VIRTUAL_ENV uv run ...`) when an unrelated environment is active -> Avoids misleading mismatch warnings and makes it clear the repo's `.venv` is the interpreter actually running the harnesses. - **Realtime eval shared imports can resolve the wrong module under pytest** -> Add `shared/__init__.py` and ensure tests prepend `examples/evals/realtime_evals` to `sys.path` before importing `shared.*` -> Prevents collection failures caused by unrelated installed packages named `shared`. - **Run-level grades can be overweighted by long simulations** -> Store turn-level grades on the matching turn and trace-level grades on one row per simulation instead of copying them onto every row -> Keeps `results.csv` row semantics intact and prevents summary means from favoring longer conversations. +- **Synthetic-audio scaffold requests can pick the wrong harness** -> Default unspecified synthetic-audio evals to `crawl` text-to-TTS and reserve `walk` for replay-specific audio traits like noise, telephony artifacts, or speaker characteristics -> Keeps new realtime evals on the simplest harness unless audio realism is itself under test. +- **Task-specific single-turn grading can outgrow the shared crawl schema** -> Keep the shared crawl harness for realtime execution, then add eval-local wrapper scripts that post-grade domain-specific quality and overwrite `results.csv` while preserving `results_base.csv` -> Avoids forking the harness when a use case needs richer grading than tool-call correctness. 
+- **Synthetic learner audio can sound like eval scaffolding** -> Write `user_text` as a realistic in-app learner request and keep evaluation rules in metadata plus the system prompt -> Produces audio inputs that match the product surface instead of teaching the model the grading rubric through the spoken prompt. diff --git a/authors.yaml b/authors.yaml index ce28041ba3..008ba9f254 100644 --- a/authors.yaml +++ b/authors.yaml @@ -567,3 +567,8 @@ kathylau-oai: name: "Kathy Lau" website: "https://github.com/kathylau-oai" avatar: "https://avatars.githubusercontent.com/u/247463782" + +nsingaraju-oai: + name: "Nishanth Singaraju" + website: "https://github.com/nsingaraju-oai" + avatar: "https://avatars.githubusercontent.com/u/232978332" diff --git a/examples/Prompt_Caching_201.ipynb b/examples/Prompt_Caching_201.ipynb index b54513f272..7f1f49de68 100644 --- a/examples/Prompt_Caching_201.ipynb +++ b/examples/Prompt_Caching_201.ipynb @@ -2,8 +2,55 @@ "cells": [ { "cell_type": "markdown", + "id": "f325a442", "metadata": {}, - "source": "# Prompt Caching 201\nWhat it is, why it matters, how to measure it, and how to improve your cache hit rate.\n## 1. Prompt Caching Basics\nModel prompts often include repeated content — such as system instructions and shared context. When a request contains a prefix the system has recently processed, OpenAI can route it to a server that already computed that prefix, allowing the model to reuse prior work instead of recomputing it from scratch.\nPrompt Caching can reduce time-to-first-token latency by up to 80% and input token costs by up to 90%. It works automatically on all API requests and has no additional fees. \nThe goal of this cookbook is to go deeper on optimizing for cache hits. Review our [API docs](https://developers.openai.com/api/docs/guides/prompt-caching) and [Prompt Caching 101](https://developers.openai.com/cookbook/examples/prompt_caching101/) for a great overview - let’s dive in! 
\n### 1.1 Basics\n- Cache hits require an exact, repeated prefix match and works automatically for prompts that are 1024 tokens or longer. You can get a match until the first mismatched token in 128 token blocks.\n- The entire request prefix is cacheable: messages, images, audio, tool definitions, and structured output schemas.\n- Cache hits come when requests route to the same server. Take advantage of the `prompt_cache_key` for traffic that shares common prefixes to improve that routing.\n- Carefully consider the impact of caching from context engineering including compaction and summarization.\n- Monitor caching, cost and latency via request logs or the Usage dashboard while iterating.\n\n## 2. Why caching matters\n### 2.1 Core Technical Reason: Skipping Prefill Compute\nThe forward pass through transformer layers for input tokens is a dominant part of inference work. \nIf you can reuse per-layer key/value tensors (the KV cache), you avoid recomputing those layers for cached tokens and only pay the lookup plus new-token compute.\n### 2.2 Cost Impact\nCache discounts can be significant. Discount magnitude varies by model family - our newest models have been able to offer steeper cache discounts (90%) as our inference stack has become more efficient. Here are some examples, but see our [pricing page](https://openai.com/api/pricing/) for all models. Prompt Caching is enabled for all recent models, gpt-4o and newer. \n| Model | Input
(per 1M tokens) | Cached input
(per 1M tokens) | Caching Discount |\n| --- | --- | --- | --- |\n| GPT-4o | $2.50 | $1.25 | 50.00% |\n| gpt-4.1 | $2.00 | $0.50 | 75.00% |\n| gpt-5-nano | $0.05 | $0.005 | 90.00% |\n| gpt-5.2 | $1.750 | $0.175 | 90.00% |\n| gpt-realtime (audio) | $32.00 | $0.40 | 98.75% |\n\n\n### 2.3 Latency Impact\nReducing time-to-first-token (TTFT) is a major motivation for improving cache rates. Cached tokens can reduce latency by up to ~80%.\nCaching keeps latency roughly proportional to generated output length rather than total conversation length because the full historical context is not re-prefilled.\nI ran a series of prompts 2300 times and plotted the cached vs uncached requests. For the shortest prompts (1024 tokens), cached requests are 7% faster, but at the longer end (150k+ tokens) we’re seeing 67% faster TTFT. The longer the input, the bigger the impact of caching on that first-token latency. \n\n![Figure 2](../images/prompt-caching-201/figure-2.svg)\n\n## 3. Measure caching first (so you can iterate)\n### 3.1 Per-request: `cached_tokens`\nResponses include usage fields indicating how many prompt tokens were served from cache. All requests will display a `cached_tokens` field of the usage.prompt_tokens_details [Response](https://developers.openai.com/api/docs/api-reference/responses/object) or [Chat](https://developers.openai.com/api/docs/api-reference/chat/object) object indicating how many of the prompt tokens were a cache hit.\n" + "source": [ + "# Prompt Caching 201\n", + "A practical guide to prompt caching: fundamentals, performance impact, measurement, and optimization strategies.\n", + "## 1. Prompt Caching Basics\n", + "Model prompts often include repeated content - such as system instructions, tools, and messages. 
When a request contains a prefix the system has recently processed, OpenAI can route it to a server that already computed that prefix, allowing the model to reuse prior work instead of recomputing it from scratch.\n", + "Prompt Caching can reduce time-to-first-token latency by up to 80% and input token costs by up to 90%. It works automatically on all API requests and has no additional fees. \n", + "The goal of this cookbook is to go deeper on optimizing for cache hits. Review our [API docs](https://developers.openai.com/api/docs/guides/prompt-caching) and [Prompt Caching 101](https://developers.openai.com/cookbook/examples/prompt_caching101/) for a great overview - let’s dive in! \n", + "### 1.1 Basics\n", + "- Cache hits require an exact, repeated prefix match; caching works automatically for prompts containing 1024 tokens or more, with cache hits occurring in increments of 128 tokens. \n", + "- The entire request prefix is cacheable: messages, images, audio, tool definitions, and structured output schemas.\n", + "- In-memory prompt caching works automatically on all your API requests. Extended prompt caching increases that to 24 hours. \n", + "- Caching only works if two requests share the same prefix and land on the same machine. Take advantage of the optional parameter `prompt_cache_key` for traffic that shares common prefixes to improve that routing.\n", + "- Carefully consider how context engineering techniques like compaction affect caching.\n", + "- Monitor caching, cost, and latency via request logs or the Usage dashboard while iterating.\n", + "\n", + "## 2. Why caching matters\n", + "### 2.1 Core Technical Reason: Skipping Prefill Compute\n", + "The forward pass through transformer layers over the input tokens is the main driver of inference cost and latency. In the transformer stack, prompt caching applies specifically to the key and value projections inside the attention layers. 
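Before diving into the mechanics, the exact-prefix reuse can be illustrated with a toy pure-Python sketch. This is not the production inference stack: `project` is a stand-in for the per-layer key/value matmuls, and all names here are illustrative.

```python
def project(embedding, weight):
    # Stand-in for the K/V weight matmuls inside each attention layer.
    return [weight * x for x in embedding]

def prefill(tokens, kv_cache):
    """Compute K/V only for tokens not already covered by an exact prefix match."""
    computed = 0
    for i, tok in enumerate(tokens):
        if i < len(kv_cache) and kv_cache[i][0] == tok:
            continue                      # cache hit: reuse stored K/V for this position
        del kv_cache[i:]                  # first mismatch invalidates everything after it
        emb = [float(ord(c)) for c in tok]
        kv_cache.append((tok, project(emb, 0.5), project(emb, 2.0)))
        computed += 1
    return computed

cache = []
n_cold = prefill(["she", "sat", "by", "the", "river", "bank"], cache)
n_warm = prefill(["she", "sat", "by", "the", "river", "bank",
                  "on", "a", "cloudy", "day"], cache)
# Cold prefill computes all 6 positions; the warm request only pays for the 4 new tokens.
```

The real KV cache holds per-layer, per-head tensors and lives on the inference server, but the behavior sketched here is the key idea: reuse up to the first mismatched token, recompute everything after it.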
Attention is the mechanism that allows models to weigh the importance of different parts of an input sequence when processing a specific token, enabling them to capture context and long-range dependencies.\n", + "\n", + "When you make a request to OpenAI, we'll create token embeddings that are then transformed into three vectors: a query (Q) that represents what that token is looking for in its surrounding context, a key (K) that encodes what each token represents, and a value (V) that contains the information the token contributes when it turns out to be relevant. The model compares the current token’s query to all tokens’ keys (via dot products) to produce attention scores. After a softmax, these scores become weights that determine how much each token’s value contributes to the updated representation. For example, in the phrase “she sat by the river bank,” the word “river” provides important context to “bank,” pushing its meaning toward the aquatic rather than the financial. The KV cache stores the key and value tensors for that prefix across all layers and heads.\n", + "\n", + "If you then make another request — “she sat by the river bank on a cloudy day” — the first part of the prefix is identical. When processing the new, unseen suffix, the model reuses the cached tensors and only computes attention for the new tokens.\n", + "\n", + "Crucially, the strong semantic relationship between “river” and “bank” has already been encoded into those cached key and value representations during the earlier forward pass. The model does not need to recompute how “bank” aligns with “river.” Instead, new tokens like “cloudy” and “day” generate their own queries and attend over the cached keys from the prefix. In other words, the semantic groundwork has already been laid; the model simply builds on top of that existing attention state rather than recalculating it from scratch.\n", + "\n", + "### 2.2 Cost Impact\n", + "Cache discounts can be significant. 
Discount magnitude varies by model family - as our inference stack has become more efficient, our newest models have been able to offer steeper cache discounts. Here are some examples, but see our [pricing page](https://openai.com/api/pricing/) for all models. Prompt Caching is enabled for all recent models, gpt-4o and newer. \n", + "| Model | Input
(per 1M tokens) | Cached input
(per 1M tokens) | Caching Discount |\n", + "| --- | --- | --- | --- |\n", + "| gpt-4o | $2.50 | $1.25 | 50.00% |\n", + "| gpt-4.1 | $2.00 | $0.50 | 75.00% |\n", + "| gpt-5-nano | $0.05 | $0.005 | 90.00% |\n", + "| gpt-5.2 | $1.75 | $0.175 | 90.00% |\n", + "| gpt-realtime (audio) | $32.00 | $0.40 | 98.75% |\n", + "\n", + "\n", + "### 2.3 Latency Impact\n", + "Reducing time-to-first-token (TTFT) is a primary motivation for improving cache rates. Cached tokens can reduce latency by up to ~80%.\n", + "Caching keeps latency roughly proportional to generated output length rather than total conversation length because the full historical context is not re-prefilled. When we get cache hits, sampling the model is linear rather than quadratic.\n", + "I ran a series of prompts 2300 times and plotted the cached vs uncached requests. For the shortest prompts (1024 tokens), cached requests are 7% faster, but at the longer end (150k+ tokens) we’re seeing 67% faster TTFT. The longer the input, the bigger the impact of caching on that first-token latency. \n", + "\n", + "![Figure 2](../images/prompt-caching-201/figure-2.svg)\n", + "\n", + "## 3. Measure caching first (so you can iterate)\n", + "### 3.1 Per-request: `cached_tokens`\n", + "Responses include usage fields indicating how many prompt tokens were served from the cache: every request returns a `cached_tokens` field in `usage.prompt_tokens_details` on the [Response](https://developers.openai.com/api/docs/api-reference/responses/object) or [Chat](https://developers.openai.com/api/docs/api-reference/chat/object) object.\n" + ] }, { "cell_type": "markdown", @@ -29,7 +76,46 @@ { "cell_type": "markdown", "metadata": {}, - "source": "\nYou can also get a high level overview by filtering selected cached / uncached tokens for the measures in the usage dashboard\n\n![Figure 3](../images/prompt-caching-201/figure-3.png)\n\n## 4. 
Improve cache hit rate (tactical playbook)\n\nWhat’s a cache hit? That’s when a request starts with the same prefix as a previous request, allowing the system to reuse previously computed key/value tensors instead of recomputing them. This reduces both time-to-first-token latency and input token cost and maximizing this is our goal!\n\n\n> **Tangible Example: Multi-Turn Chat**\n> In a multi-turn chat, every request resends the full conversation. If your prefix (instructions + tools + prior turns) stays stable, all of that gets served from cache and it's only the newest user message at the end that's unseen tokens. In usage, you should see `cached_tokens` ≈ total prompt tokens minus the latest turn.\n\n### 4.1 Send a Prompt over 1024 tokens\nIt may seem counterintuitive but you will save money by lengthening your prompt beyond ~1024 tokens to trigger caching. \nSay you have a 900 token prompt - you’ll never get a cache hit. If you lengthen your prompt to 1100 tokens and get a 50% cache rate you’ll save 33% on the token costs. If you get a 70% cache rate, you’d save 55%. \n\n### 4.2 Stabilize the Prefix\nThis is the lowest hanging fruit. Make the early portion of the request as stable as possible:\n- Instructions\n- Tool definitions\n- Schemas\n- Examples\n- Long reference context\nMove volatile content (e.g. user text, new content) to the end and ensure you aren’t accidentally breaking the cache by including something dynamic. \n\n> **Tip: Use metadata**\n> We’ve seen customers accidentally invalidate their cache by including a timestamp early in their request for later lookup/debugging. Move that to metadata where it will not impact the cache!\n\n### 4.3 Keep Tools and Schemas Identical\nTools, schemas, and their ordering contribute to the cached prefix - they get injected before developer instructions which means that changing them would invalidate the cache. 
This includes: \n- Schema key changes\n- Tool ordering changes\n- Changes to instructions\n\n\n> **Tip: Adjust tools without breaking prompt caching**\n> Leverage [`allowed_tools`](https://developers.openai.com/api/docs/guides/function-calling/#tool-choice) `tool_choice` option that lets users restrict the tools the model can call for a request without changing the tools array and busting the cache. List your full toolkit in tools, and then use an `allowed_tools` block to specify which tool can be used on a single turn.\n" + "source": [ + "\n", + "You can also get a high-level overview by filtering on cached / uncached tokens in the Usage dashboard.\n", + "\n", + "![Figure 3](../images/prompt-caching-201/figure-3.png)\n", + "\n", + "## 4. Improve cache hit rate (tactical playbook)\n", + "\n", + "What’s a cache hit? That’s when a request starts with the same prefix as a previous request, allowing the system to reuse previously computed key/value tensors instead of recomputing them. Increasing the cache hit rate - how often prefixes can be reused - directly improves both performance and efficiency, so that is our goal!\n", + "\n", + "\n", + "### 4.1 Send a Prompt over 1024 tokens\n", + "It can feel counterintuitive, but in some cases, making your prompt slightly longer can reduce overall cost.\n", + "Say you have a 900 token prompt - you’ll never get a cache hit. If you lengthen your prompt to 1100 tokens and get a 50% cache rate, you’ll save 33% on the token costs. If you get a 70% cache rate, you’d save 55%. Once you cross the caching threshold and achieve meaningful reuse, the marginal cost of those repeated tokens drops substantially. 
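The arithmetic behind those savings figures can be checked directly. This is a quick sketch, assuming a flat per-token input price and the 90% cached-input discount from the pricing table above:

```python
def blended_cost(tokens, cache_rate, price_per_token=1.0, cached_discount=0.90):
    # Cached tokens pay (1 - discount) of the input price; the rest pay full price.
    cached = tokens * cache_rate
    uncached = tokens - cached
    return uncached * price_per_token + cached * price_per_token * (1 - cached_discount)

baseline = blended_cost(900, 0.0)              # 900-token prompt: never cached
savings_50 = 1 - blended_cost(1100, 0.5) / baseline
savings_70 = 1 - blended_cost(1100, 0.7) / baseline
print(f"{savings_50:.0%}, {savings_70:.0%}")   # -> 33%, 55%
```

Even though the 1100-token prompt sends 200 extra tokens per request, the discounted cached portion more than pays for them once the cache rate is meaningful.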
That means a slightly longer but stable prefix can be cheaper than a shorter prompt that never caches.\n", + "\n", + "### 4.2 Stabilize the Prefix\n", + "This is the lowest-effort, highest-impact optimization: keep the early portion of the prompt stable.\n", + "Place durable content at the beginning:\n", + "- Instructions\n", + "- Tool definitions\n", + "- Schemas\n", + "\n", + "Move volatile content (e.g., user input, dynamic values, session-specific data) to the end. Even small changes in early tokens will invalidate exact prefix matching and prevent cache hits.\n", + "\n", + "**Learnings from Codex**\n", + "When our engineering team outlined how they [architected the Codex agent loop](https://openai.com/index/unrolling-the-codex-agent-loop/), they emphasized prompt structure as a first-class performance surface and caching as a top priority. In the Codex CLI, system instructions, tool definitions, sandbox configuration, and environment context are kept identical and consistently ordered between requests to preserve long, stable prompt prefixes. The agent loop appends new messages (rather than modifying earlier ones) when runtime configurations change mid-conversation (e.g. new working directory or approval mode). By avoiding changes to the original prefix, the system preserves exact-prefix matches, which are required for prompt cache hits. \n", + "\n", + "**Tip: Use metadata**\n", + "We’ve seen customers accidentally invalidate their cache by including a timestamp early in their request for later lookup/debugging. Move that to `metadata` where it will not impact the cache!\n", + "\n", + "### 4.3 Keep Tools and Schemas Identical\n", + "Tools, schemas, and their ordering contribute to the cached prefix - they get injected before developer instructions which means that changing them would invalidate the cache. 
This includes: \n", + "- Schema key changes\n", + "- Tool ordering changes\n", + "- Changes to instructions\n", + "\n", + "\n", + "**Tip: Adjust tools without breaking prompt caching**\n", + "Leverage the [`allowed_tools`](https://developers.openai.com/api/docs/guides/function-calling/#tool-choice) `tool_choice` option. It allows you to restrict the tools the model can call on a request without changing the tools array and busting the cache. List your full toolkit in `tools`, and then use an `allowed_tools` block to specify which tool can be used on a single turn.\n" + ] }, { "cell_type": "markdown", @@ -49,7 +135,53 @@ { "cell_type": "markdown", "metadata": {}, - "source": "\n\n### 4.4 Use `prompt_cache_key` to Improve Routing Stickiness\n`prompt_cache_key` is a parameter that we combine with the initial 256 token hash to help requests route to the same engine, increasing the chance that identical prefixes land on the same machine. It’s effective - one of our coding customers saw an improved hit rate from 60% to 87% when they started using `prompt_cache_key`. `prompt_cache_key` is especially useful when you are sending different requests that have the same initial set of context but then vary later, this parameter helps the system distinguish between requests that look the same at the start and you’ll get more effective cache hits. \nNodes can only handle about 15 requests per minute so requests over that rate for the same prefix and `prompt_cache_key` may spill over to additional machines (reducing cache effectiveness).\n\nReal Use Cases\nChoose granularity to keep each prefix + key combination below ~15 requests per minute.\nYou may wonder whether to associate a key with a user or a conversation. Your traffic patterns should inform your choice. 
For coding use cases we’ve seen:\n- Per-user keys improve reuse across related codebase conversations\n- Per-conversation keys scale better when users run many unrelated threads in parallel.\nGrouping several users to share a key can also be a good approach. A simple algorithm for this would be to hash and mod a user id by number of \"buckets” to aim close to 15RPM to aim for to maximize caching performance.\n\n\n> **Tip: Test Flex Processing instead of the Batch API**\n> If you have latency insensitive workflow you may be using the Batch API. For more flexibility around caching, consider using Flex Processing - it’s the same 50% discount on tokens but on a per-request basis so you can control the RPM, use extended prompt caching and include a `prompt_cache_key` with the request, leading to higher cache hit rates. Flex is best when prototyping or for non-inference intensive prod workloads.\n\nThrough testing the same repeat prompt across Flex and Batch, I saw an 8.5% increase in cache rates when I used Flex with extended prompt caching and a `prompt_cache_key` across 10,000 requests. That cache improvement means a 23% decrease in input token cost. \n\n![Figure 4](../images/prompt-caching-201/figure-4.svg)\n\nThere isn’t model parity for caching - the Batch API does not support caching for pre-GPT-5 models, so if you’re using o3 or o4-mini, you should consider switching to Flex to take advantage of caching. Check the most up to date info on this in our [pricing docs](https://developers.openai.com/api/docs/pricing?latest-pricing=flex). \n\n\n> **Insight: `prompt_cache_key` as shard key**\n> It might be helpful to think of the `prompt_cache_key` as a database shard key when thinking about how we route the request on the backend - the considerations are very similar when optimizing for the parameter. Like shard keys, granularity is a balancing act. 
Each machine can only handle about ~15 RPM for a given prefix, so if you use the same key on too many requests, requests will overflow to multiple machines. For each new machine, you’ll start the cache anew. On the other hand, if your key is too narrow, traffic spreads out across machines and you lose the benefit of cache reuse. Routing is still load-balanced - `prompt_cache_key` increases the chance similar prompts hit the same server but does not guarantee stickiness - caching is always best-effort!\n\n### 4.5 Use the Responses API instead of Chat Completions\nAs we outlined in [Why we built the Responses API](https://developers.openai.com/blog/responses-api/), our internal benchmarks show a 40-80% better cache utilization on requests when compared to Chat Completions. \nThis is because the raw chain of thought tokens get persisted in the Responses API between turns via `previous_response_id` (or [encrypted reasoning items](https://platform.openai.com/docs/guides/reasoning?api-mode=responses#encrypted-reasoning-items) if you’re stateless). Chat Completions does not offer a way to persist these tokens. [Better performance from reasoning models using the Responses API](https://developers.openai.com/cookbook/examples/responses_api/reasoning_items#caching) is an excellent guide to understanding this in more depth. \n\n### 4.6 Be thoughtful about Context Engineering\nAt its core, context engineering is about deciding what goes into the model’s input on each request. Every model has a fixed context window, but curating what you pass on each request isn’t just about staying under the limit. As the input grows, you’re not only approaching truncation - you’re also increasing replay cost and asking the model to distribute its attention across more tokens. Effective context engineering is the practice of managing that constraint intentionally. \nHowever, when you drop, summarize or compact earlier turns in a conversation, you’ll break the cache. 
With the rise of long-running agents and native [compaction](https://developers.openai.com/api/docs/guides/compaction), it’s important to keep caching in mind when architecting for context engineering to ensure the right balance of cost versus intelligence savings. \n" + "source": [ + "\n", + "\n", + "### 4.4 Use `prompt_cache_key` to Improve Routing Stickiness\n", + "Caching only works if two requests share the same prefix and land on the same machine. Requests are routed to inference engines based on a hash of the first ~256 tokens of the prompt. When you provide a `prompt_cache_key`, it is combined with that hash to increase routing stickiness - meaning requests with the same prefix are more likely to land on the same engine and reuse cached KV state. It’s effective - one of our coding customers saw an improved hit rate from 60% to 87% when they started using `prompt_cache_key`. \n", + "\n", + "`prompt_cache_key` is especially useful when you are sending different requests that have the same initial set of context (i.e. first 256 tokens) but then vary later as it lets you intentionally group related requests. \n", + "\n", + "Inference engines can handle roughly ~15 requests per minute per prefix + `prompt_cache_key` combination. If traffic exceeds that rate — for example, if you send thousands of requests sharing the same prefix and key — the system will distribute the excess requests across additional machines for load balancing. Each new machine is a one-time cache miss. \n", + "\n", + "That means you should choose a key granularity that can keep each prefix + key combination below ~15 RPM.\n", + "\n", + "For coding use cases we’ve seen:\n", + "- Per-user keys improve reuse across related conversations (e.g., working in the same codebase).\n", + "- Per-conversation keys scale better when users run many unrelated threads in parallel.\n", + "Grouping several users to share a key can also be a good approach. 
A simple algorithm for this would be to hash a user id and mod it by the number of \"buckets\", sizing the bucket count so each key lands close to 15 RPM and maximizes caching performance.\n", + "\n", + "\n", + "**Tip: Test Flex Processing instead of the Batch API**\n", + "If you have a latency-insensitive workflow you may be using the Batch API. For more flexibility around caching, consider using Flex Processing. Flex offers the same 50% token discount as Batch but runs through the Responses API with `service_tier=\"flex\"` specified per request. This gives you more flexibility:\n", + "\n", + "- Control over request rate (RPM)\n", + "- Access to extended prompt caching\n", + "- Ability to include a `prompt_cache_key`\n", + "\n", + "Because you can tune routing and cache locality more precisely, Flex can achieve higher cache hit rates in some workloads. It is particularly well suited for prototyping or production workloads that are not inference-intensive but still benefit from cost optimization.\n", + "\n", + "I ran a head-to-head test of 10,000 identical requests. Flex (using extended prompt caching and `prompt_cache_key`) produced an 8.5% increase in cache hit rate compared to the Batch job. That improvement translates into a 23% reduction in input token cost. \n", + "\n", + "![Figure 4](../images/prompt-caching-201/figure-4.svg)\n", + "\n", + "There isn’t model parity for caching on the Batch API - pre-GPT-5 models are not supported, so if you’re using o3 or o4-mini, you should consider switching to Flex to take advantage of caching. Check the most up-to-date info on this in our [pricing docs](https://developers.openai.com/api/docs/pricing?latest-pricing=flex). \n", + "\n", + "\n", + "**Insight: `prompt_cache_key` as shard key**\n", + "It might be helpful to think of the `prompt_cache_key` as a database shard key when thinking about how we route the request on the backend - the considerations are very similar when optimizing for the parameter. Like shard keys, granularity is a balancing act. 
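The hash-and-mod bucketing described above can be sketched as follows. The key prefix and bucket count here are hypothetical; you would size `buckets` from your own traffic so each prefix + key combination stays near ~15 RPM:

```python
import hashlib

def bucketed_cache_key(user_id: str, buckets: int, prefix: str = "my-agent-v1") -> str:
    # Hash the user id, then mod by the bucket count so groups of users share a key.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return f"{prefix}:{int(digest, 16) % buckets}"

key = bucketed_cache_key("user-1234", buckets=8)
# Pass the result as the prompt_cache_key parameter on each request.
```

Because the hash is deterministic, the same user always maps to the same bucket, so their requests keep routing toward the same cached prefix.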
Each machine can only handle about ~15 RPM for a given prefix, so if you use the same key on too many requests, requests will overflow to multiple machines. For each new machine, you’ll start the cache anew. On the other hand, if your key is too narrow, traffic spreads out across machines and you lose the benefit of cache reuse. Routing is still load-balanced - `prompt_cache_key` increases the chance similar prompts hit the same server but does not guarantee stickiness - caching is always best-effort!\n", + "\n", + "### 4.5 Use the Responses API instead of Chat Completions\n", + "As we outlined in [Why we built the Responses API](https://developers.openai.com/blog/responses-api/), our internal benchmarks show 40-80% better cache utilization compared to Chat Completions. \n", + "This is because the Responses API persists the raw chain-of-thought tokens between turns via `previous_response_id` (or [encrypted reasoning items](https://platform.openai.com/docs/guides/reasoning?api-mode=responses#encrypted-reasoning-items) if you’re stateless), while Chat Completions does not offer a way to persist these tokens. If you aren't leveraging reasoning models, there's no caching improvement. [Better performance from reasoning models using the Responses API](https://developers.openai.com/cookbook/examples/responses_api/reasoning_items#caching) is an excellent guide to understanding this in more depth. \n", + "\n", + "### 4.6 Be thoughtful about Context Engineering\n", + "At its core, context engineering is about deciding what goes into the model’s input on each request. Every model has a fixed context window, but curating what you pass on each request isn’t just about staying under the limit. As the input grows, you’re not only approaching truncation - you’re also increasing replay cost and asking the model to distribute its attention across more tokens. 
Effective context engineering is the practice of managing that constraint intentionally. \n", + "However, when you drop, summarize, or compact earlier turns in a conversation, you’ll break the cache. In that regard, context engineering and prompt caching are inherently at odds - one prioritizes dynamism, the other stability.\n", + "\n", + "With the rise of longer-running agents and native compaction tools, it’s important to keep caching in mind when architecting your context strategy, striking the right balance between cost savings and intelligence. \n" + ] }, { "cell_type": "markdown", @@ -68,7 +200,29 @@ { "cell_type": "markdown", "metadata": {}, - "source": "\nA common failure mode is an overgrown tool set that introduces ambiguity about which tool should be invoked. Keeping the tool surface minimal improves decision quality and long-term context. When curating tools on a per-request basis, use tip 4.3 and leverage the [`allowed_tools`](https://developers.openai.com/api/docs/guides/function-calling/#tool-choice) `tool_choice` option for pruning.\n\n\n> **Practical rule**\n> Leverage your evals to choose the compaction method and frequency that balances cost (both from reducing total input tokens via truncation/summarization as well as caching) and intelligence gained from careful context engineering\n\n## 5. Troubleshooting: why you might see lower caching:\nCommon causes:\n- Tool or response format schema changes\n- Context Management/Summarization/Truncation\n- Changes to instructions or system prompts\n- Changes to reasoning effort - that parameter is passed in the system prompt\n- Cache Expiration: too much time passes and the saved prefix is dropped.\n- Using Chat Completions with reasoning. Chat Completions drops reasoning tokens = lower cache hits.\n\n## 6. 
Extended Prompt Caching & Zero Data Retention\nCaches generally last 5-10 minutes of inactivity up to one hour but enabling [extended retention](https://developers.openai.com/api/docs/guides/prompt-caching#extended-prompt-cache-retention) allows for caches to last for up to 24 hours.\nExtended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching. Because the cache is stored for longer, this can increase your cache hit rate. Internally we have seen improvements of 10% for some use cases. \nIf you don’t specify a retention policy, the default is `in_memory`. \n" + "source": [ + "\n", + "A common failure mode is an overgrown tool set that introduces ambiguity about which tool should be invoked. Keeping the tool surface minimal improves decision quality and long-term context. When curating tools on a per-request basis, use tip 4.3 and leverage the [`allowed_tools`](https://developers.openai.com/api/docs/guides/function-calling/#tool-choice) `tool_choice` option for pruning.\n", + "\n", + "\n", + "**Practical rule**\n", + "Leverage your evals to choose the compaction method and frequency that balances cost (both from reducing total input tokens via truncation/summarization as well as caching) and intelligence gained from careful context engineering.\n", + "\n", + "## 5. Troubleshooting: why you might see lower caching\n", + "Common causes:\n", + "- Tool or response format schema changes\n", + "- Naive truncation from hitting the model's context window\n", + "- Changes to instructions or system prompts\n", + "- Changes to reasoning effort\n", + "- Cache Expiration: too much time passes and the saved prefix is dropped.\n", + "- Adding a space, timestamp, or other dynamic content\n", + "- Using Chat Completions with reasoning models, since the hidden chain-of-thought tokens are dropped\n", + "\n", + "## 6. 
Extended Prompt Caching & Zero Data Retention\n", +    "In-memory caches generally last 5-10 minutes of inactivity (up to one hour), but enabling [extended retention](https://developers.openai.com/api/docs/guides/prompt-caching#extended-prompt-cache-retention) allows caches to last for up to 24 hours.\n", +    "Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching. Because the cache is stored for longer, this can increase your cache hit rate. Internally, we have seen improvements of 10% for some use cases. \n", +    "If you don’t specify a retention policy, the default is `in_memory`. \n" +   ] }, { "cell_type": "markdown", @@ -86,7 +240,25 @@ { "cell_type": "markdown", "metadata": {}, -   "source": "\nKV tensors are the intermediate representation from the model’s attention layers produced during prefill. Only the key/value tensors may be persisted in local storage; the original customer content, such as prompt text, is only retained in memory.\nPrompt caching can comply with [Zero Data Retention](https://developers.openai.com/api/docs/guides/your-data) because cached prompts are not logged, not saved to disk, and exist in memory only as needed. However, extended prompt caching requires storing key/value tensors to GPU-local storage as application state. This storage requirement means that requests leveraging extended prompt caching are not ZDR eligible - however if a ZDR customer explicitly asks in their request for Extended Prompt Caching the system will honor that. ZDR customers should be cautious about this and consider policies to monitor their requests for this parameter. \n\n\n> **Insight: What Is Actually Cached?**\n> The KV cache just holds the model’s key/value tensors (linear projections of the hidden‑states) so we can reuse them on the next inference step. The KV cache is an intermediate representation. 
It’s essentially just a bunch of numbers internal to the model - that means no raw text/multi-modal inputs are ever stored, regardless of retention policy.\n\n\n## 7. Realtime API\n\nCaching in the Realtime API works similarly to the Responses API: as long as the prefix stays stable, cache hits are preserved. But when the context window is exceeded and the start of the conversation shifts, cache hits drop. Given that the context window of the Realtime API is much shorter than our frontier models (currently 32k tokens), this is even more relevant. \n\n### 7.1 Retention Ratio\nBy default (truncation: \"auto\"), the server removes just enough old messages to fit within the context window. This “just-in-time” pruning shifts the start of the conversation slightly on every turn once you’re over the limit. However, this naive strategy causes frequent cache misses.\nThe [`retention_ratio`](https://developers.openai.com/api/docs/guides/realtime-costs/#truncation) setting changes this by letting you control how much of the earlier context to keep vs. drop. It’s a configurable truncation strategy that lets developers control how much of the token context window is retained, balancing context preservation with cache-hit optimization.\nIn this example, `retention_ratio`: 0.7 means that when truncation happens, the system keeps roughly 70% of the existing conversation window and drops the oldest ~30%. The drop happens in one larger truncation event (instead of small incremental removals).\n" + "source": [ + "\n", + "KV tensors are the intermediate representation from the model’s attention layers produced during prefill. Only the key/value tensors may be persisted in local storage; the original input content, such as prompt text, is only retained in memory.\n", + "Prompt caching can comply with [Zero Data Retention](https://developers.openai.com/api/docs/guides/your-data) because cached prompts are not logged, not saved to disk, and exist in memory only as needed. 
However, extended prompt caching requires storing key/value tensors to GPU-local storage as application state. This storage requirement means that requests leveraging extended prompt caching are not ZDR eligible; that said, if a ZDR customer explicitly requests Extended Prompt Caching, the system will honor that request. ZDR customers should be cautious about this and consider policies to monitor their requests for this parameter. \n", +    "\n", +    "\n", +    "**Insight: What Is Actually Cached?**\n", +    "The KV cache just holds the model’s key/value tensors (linear projections of the hidden states) so we can reuse them on the next inference step. The KV cache is an intermediate representation. It’s essentially just a bunch of numbers internal to the model - that means no raw text/multi-modal inputs are ever stored, regardless of retention policy.\n", +    "\n", +    "\n", +    "## 7. Realtime API\n", +    "\n", +    "Caching in the Realtime API works the same as with the Responses API - all the audio, text, or images passed in will be cached, and any change to the prefix will break the cache. However, given the shorter context window of the Realtime API (currently 32k), managing truncation is especially relevant. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs. \n", +    "\n", +    "### 7.1 Retention Ratio\n", +    "By default (`truncation: \"auto\"`), the server removes just enough old messages to fit within the context window. This “just-in-time” pruning shifts the start of the conversation slightly on every turn once you’re over the limit. However, this naive strategy causes frequent cache misses.\n", +    "The [`retention_ratio`](https://developers.openai.com/api/docs/guides/realtime-costs/#truncation) setting changes this by letting you control how much of the earlier context to keep vs. drop. 
This balances context preservation with cache-hit optimization.\n", +    "For example, `retention_ratio: 0.7` means that when truncation happens, the system keeps roughly 70% of the existing conversation window and drops the oldest ~30%. The drop happens in one larger truncation event (instead of small incremental removals).\n" +   ] }, { "cell_type": "markdown", @@ -98,10 +270,7 @@ " \"session\": {\n", " \"truncation\": {\n", " \"type\": \"retention_ratio\",\n", - " \"retention_ratio\": 0.7,\n", - " \"token_limits\": {\n", - " \"post_instructions\": 8000\n", - " }\n", + " \"retention_ratio\": 0.7\n", " }\n", " }\n", "}\n", @@ -111,7 +280,20 @@ { "cell_type": "markdown", "metadata": {}, -   "source": "\nThis creates a more stable prefix that survives across multiple turns, reducing repeated cache busting.\nTrade-off: you lose larger portions of conversation history at once. That means the model may suddenly “forget” earlier parts of the dialogue sooner than with gradual truncation. You’re effectively trading off memory depth for cache stability (more consistent prefix).\nHere’s what this impact looks like in the naive per-turn truncation approach vs using `retention_ratio`. \n\n![Figure 5](../images/prompt-caching-201/figure-5.svg)\n\n\n## Conclusion\nPrompt caching is one of the highest-leverage optimizations available on the OpenAI platform. When your prefixes are stable and your routing is well-shaped, you can materially reduce both cost and latency without changing model behavior or quality.\nThe key ideas are simple but powerful: stabilize the prefix, monitor `cached_tokens`, stay under the ~15 RPM per prefix+key limit, and use `prompt_cache_key` thoughtfully. 
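As a rough illustration of why the bulk drop helps, the following sketch uses a simplified token model (hypothetical window and per-turn sizes, not the real truncation algorithm) to count truncation events; each event shifts the start of the conversation and busts the cached prefix.

```python
# Simplified token model: compare naive just-enough truncation against a
# retention_ratio-style bulk drop. Naive truncation fires on nearly every
# turn once the window is full; the bulk drop fires only occasionally,
# keeping the prefix stable in between.
LIMIT = 1000  # hypothetical context window, in tokens
TURN = 50     # hypothetical tokens added per turn


def naive_truncations(turns: int) -> int:
    used, events = 0, 0
    for _ in range(turns):
        used += TURN
        if used > LIMIT:
            used = LIMIT  # trim just enough -> prefix shifts every turn
            events += 1
    return events


def retention_ratio_truncations(turns: int, ratio: float = 0.7) -> int:
    used, events = 0, 0
    for _ in range(turns):
        used += TURN
        if used > LIMIT:
            used = int(LIMIT * ratio)  # one bulk drop of the oldest ~30%
            events += 1
    return events


print(naive_truncations(60))            # 40 cache-busting events
print(retention_ratio_truncations(60))  # 6 cache-busting events
```

Under these toy numbers, the bulk-drop strategy cuts prefix-shifting events by roughly 85%, which is the intuition behind the Figure 5 comparison.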
For higher-volume or latency-insensitive workloads, consider Flex Processing and extended retention to further improve cache locality.\nCaching is best-effort and routing-aware, but when engineered intentionally, it can deliver dramatic improvements in efficiency. Treat it like any other performance system: measure first, iterate deliberately, and design your prompts with reuse in mind.\n" + "source": [ + "\n", + "This creates a more stable prefix that survives across multiple turns, reducing repeated cache busting.\n", + "The trade-off is that truncation happens in bigger chunks, so you can lose more conversation history at once. That means the model may suddenly “forget” earlier parts of the dialogue sooner than with gradual truncation. You’re effectively trading off memory depth for cache stability (more consistent prefix).\n", + "Here’s what this impact looks like in the naive per-turn truncation approach vs using `retention_ratio`. \n", + "\n", + "![Figure 5](../images/prompt-caching-201/figure-5.svg)\n", + "\n", + "\n", + "## Conclusion\n", + "Prompt caching is one of the highest-leverage optimizations available on the OpenAI platform. When your prefixes are stable and your routing is well-shaped, you can materially reduce both cost and latency without changing model behavior or quality.\n", + "The key ideas are simple but powerful: stabilize the prefix, monitor `cached_tokens`, be mindful of the ~15 RPM limit, and use `prompt_cache_key` thoughtfully. For higher-volume or latency-insensitive workloads, consider Flex Processing and extended retention to further improve cache locality.\n", + "Caching is best-effort and routing-aware, but when engineered intentionally, it can deliver dramatic improvements in efficiency. 
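To monitor caching in practice, you can compute a hit rate from the `usage` object each response returns; the field layout below follows the documented Responses API `input_tokens_details.cached_tokens` structure, and the sample numbers are invented for illustration.

```python
# Sketch: compute a per-request cache hit rate from a Responses API
# usage payload. The numbers here are made up for illustration.
usage = {
    "input_tokens": 12000,
    "input_tokens_details": {"cached_tokens": 10240},
    "output_tokens": 450,
}

cached = usage["input_tokens_details"]["cached_tokens"]
hit_rate = cached / usage["input_tokens"]
print(f"cache hit rate: {hit_rate:.1%}")  # cache hit rate: 85.3%
```

Tracking this ratio over time (per `prompt_cache_key`, if you use one) is the simplest way to see whether a prompt or tooling change has broken your prefix.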
Treat it like any other performance system: measure first, iterate deliberately, and design your prompts with reuse in mind.\n" + ] } ], "metadata": { diff --git a/examples/codex/long_horizon_tasks.md b/examples/codex/long_horizon_tasks.md deleted file mode 100644 index 5795fe19cd..0000000000 --- a/examples/codex/long_horizon_tasks.md +++ /dev/null @@ -1,214 +0,0 @@ -# Long horizon tasks with Codex - -In September 2025, OpenAI introduced GPT-5-Codex as the first version of GPT-5 optimized for agentic coding. In December 2025, we launched 5.2 which was the moment that people began to believe that using autonomous coding agents could be reliable. -In particular, we saw a huge jump in how long the model could reliably follow instructions. - -I wanted to stress-test that threshold. So I gave Codex a blank repo, full access, and one job: build a design tool from scratch. Then I let it run with GPT-5.3-Codex at "Extra High" reasoning. Codex ran for about 25 hours uninterrupted, used about 13M tokens, and generated about 30k lines of code. - -This was an experiment, not a production rollout. But it performed well on the parts that matter for long-horizon work: following the spec, staying on task, running verification, and repairing failures as it went. - -![Codex Design Desk UI](../../images/long_horizon_design_desk_ui.jpeg) - -## What a long-run Codex session looks like - -I asked Codex to generate a summary page for the session data: - -![Codex session summary dashboard](../../images/long_horizon_session_summary_dashboard.jpg) - -And here is a view of the CLI session stats and token usage: - -![Codex CLI session stats and token usage](../../images/long_horizon_cli_session_stats.jpg) - -These screenshots are useful because they make the core shift visible: agentic coding is increasingly about time horizon, not just one-shot intelligence. - -## The real shift is time horizon - -This is not only "models got smarter." 
The practical change is that agents can stay coherent for longer, complete larger chunks of work end-to-end, and recover from errors without losing the thread. - -METR's work on time-horizon benchmarks is a helpful framing for this trend: the length of software tasks frontier agents can complete with ~50% and 80% reliability has been climbing fast, with a rough ~7 month doubling time. Refer to [Measuring AI Ability to Complete Long Tasks (METR)](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/). - -![METR chart measuring AI ability to complete long tasks](../../images/long_horizon_metr_time_horizon_chart.jpg) - -Our recent GPT-5.3-Codex [launch announcement](https://openai.com/index/introducing-gpt-5-3-codex/) pushes this further for agent work in two practical ways: -1. It’s better at multi-step execution (plan → implement → validate → repair). -2. It’s easier to steer mid-flight without resetting the whole run (course corrections don’t wipe progress). - -I was also inspired by Cursor's writing on long-running autonomous coding systems, including their browser-building experiment: -[How Cursor built a web browser (Scaling agents)](https://cursor.com/blog/scaling-agents). - -The Cursor team wrote that OpenAI models are "much better at extended autonomous work: following instructions, keeping focus, avoiding drift, and implementing things precisely and completely." - -## Why Codex can stay coherent on long tasks - -Long-running work is less about one giant prompt and more about the agent loop the model operates inside. - -In Codex, the loop is roughly: - -1. Plan -2. Edit code -3. Run tools (tests/build/lint) -4. Observe results -5. Repair failures -6. Update docs/status -7. 
Repeat - -That loop matters because it gives the agent: - -- Real feedback (errors, diffs, logs) -- Externalized state (repo, files, docs, worktrees, outputs) -- Steerability over time (you can course-correct based on outcomes) - -This is also why Codex models feel better on Codex surfaces than a generic chat window: the harness supplies structured context (repo metadata, file tree, diffs, command outputs) and enforces a disciplined “done when” routine. - -We recently published an [article](https://openai.com/index/unrolling-the-codex-agent-loop/) about the Codex agent loop that has more details. - -To top this off, we also launched the Codex app that makes that loop usable day-to-day: -- [Parallel threads across projects](https://developers.openai.com/codex/app/features) (long work doesn’t block your day job) -- [Skills](https://developers.openai.com/codex/skills) (standardize plan/implement/test/report) -- [Automations](https://developers.openai.com/codex/app/automations/) (routine work in the background) -- [Git worktrees](https://developers.openai.com/codex/app/worktrees/) (isolate runs, keep diffs reviewable, reduce thrash) - -![Codex app workspace with project thread](../../images/long_horizon_codex_app_main_workspace.jpg) - -## My setup for the test - -I picked a design tool for this “experiment” because it’s an unforgiving test: UI + data model + editing operations + lots of edge cases. You can’t bluff it. If the architecture is wrong, it breaks quickly. - -I gave GPT-5.3-Codex a meaty spec, ran it at “Extra High” reasoning, and it ended up running uninterrupted for ~25 hours and was able to stay coherent and ship quality code. The model also ran verification steps (tests, lint, typecheck) for every milestone it completed. - -## The key idea: durable project memory - -The most important technique was durable project memory. I wrote the spec, plan, constraints, and status in markdown files that Codex could revisit repeatedly. 
That prevented drift and kept a stable definition of "done." - -The repository is linked below and the file stack was as follows: - -#### [Prompt.md](https://github.com/derrickchoi-openai/design-desk/blob/main/docs/prompt.md) (spec + deliverables) - -Purpose: Freeze the target so the agent doesn’t “build something impressive but wrong.” - -Key sections in the file: -- Goals + non-goals -- Hard constraints (perf, determinism, UX, platform) -- Deliverables (what must exist when finished) -- “Done when” (checks + demo flow) - -The initial prompt told Codex to treat the prompt/spec file as the full project specification and generate a milestone-based plan: - -![Prompt used to kickstart the Codex run](../../images/long_horizon_kickoff_prompt_to_codex.jpg) - -#### [Plan.md](https://github.com/derrickchoi-openai/design-desk/blob/main/docs/plans.md) (milestones + validations) - -Purpose: Turn open-ended work into a sequence of checkpoints the agent can finish and verify. - -Key sections in the file: -- Milestones small enough to complete in one loop -- Acceptance criteria + validation commands per milestone -- Stop-and-fix rule: if validation fails, repair before moving on -- Decision notes to avoid oscillation -- Intended architecture of the codebase - -![Codex referring to the plans markdown file while working](../../images/long_horizon_codex_reads_plans_file.jpg) - -*Note that we recently added a native plan mode to the Codex app, CLI, and IDE extension. This helps break a larger task into a clear, reviewable sequence of steps before making changes, so you can align on approach upfront. If additional clarification is needed, Codex will ask follow up questions. To toggle it on, use the /plan slash command. - -#### [Implement.md](https://github.com/derrickchoi-openai/design-desk/blob/main/docs/implement.md) (execution instructions referencing the plan) - -Purpose: This is the runbook. 
It tells Codex exactly how to operate: follow the plan, keep diffs scoped, run validations, update docs. - -Key sections in the file: -- Plans markdown file is source of truth (milestone-by-milestone) -- Run validation after each milestone (fix failures immediately) -- Keep diffs scoped (don’t expand scope) -- Update documentation markdown file continuously - -![Prompt instructing Codex to read implement.md as execution instructions](../../images/long_horizon_implement_md_runbook_prompt.jpg) - -#### [Documentation.md](https://github.com/derrickchoi-openai/design-desk/blob/main/docs/documentation.md) (status + decisions as it shipped) - -Purpose: This is the shared memory and audit log. It’s how I can step away for hours and still understand what happened. - -Key sections in the file: -- Current milestone status (what’s done, what’s next) -- Decisions made (and why) -- How to run + demo (commands + quick smoke tests) -- Known issues / follow-ups - -![Documentation file showing milestone status updates](../../images/long_horizon_documentation_status_updates.jpg) - -This is what milestone verification looked like in practice during the run: - -![Commands Codex ran to verify quality during milestones](../../images/long_horizon_milestone_verification_commands.jpg) - -### Verification at every milestone - -Codex did not just write code and hope it worked. After milestones, it ran verification commands and repaired failures before continuing. - -Here are examples of the quality commands it was instructed to use: - -![Quality commands for lint, typecheck, tests, build, and export](../../images/long_horizon_quality_commands.jpg) - -And an example of Codex fixing issues after a lint failure: - -![Codex fixing issues after npm run lint](../../images/long_horizon_lint_fix_iteration.jpg) - -## What the agent built - -The result was not perfect or production-ready, but it was real and testable. 
The bar for this run was not "it compiles"; it was "does it follow the instructions, and does it actually work?" - -High-level capabilities implemented: - -1. Canvas editing (frames, groups, shapes, text, images/icons, buttons, charts) -2. Live collaboration (presence, cursors, selections, edits sync across tabs) -3. Inspector controls (geometry, styling, text) -4. Layers management (search, rename, lock/hide, reorder) -5. Guides/alignment/snapping -6. History snapshots + restore -7. Replay timeline + branch from a prior point -8. Prototype mode (hotspots + flow navigation) -9. Comments (pinned threads with resolve/reopen) -10. Export (save/import/export + CLI export to JSON and React + Tailwind) - -### Product screenshots from the run - -Live collaboration: - -![Live collaboration with a teammate in the design tool](../../images/long_horizon_live_collaboration_demo.jpg) - -Snapshots and restore: - -![History panel showing snapshots and restore actions](../../images/long_horizon_snapshots_history_panel.jpg) - -Replay / time travel: - -![Replay timeline panel for time-traveling the edit history](../../images/long_horizon_replay_timeline_panel.jpg) - -Comments and pinned threads: - -![Comments panel with pinned thread on the canvas](../../images/long_horizon_comments_pinned_threads.jpg) - -## Takeaways for long-horizon Codex tasks - -What made this run work was not a single clever prompt. It was the combination of: - -- A clear target and constraints (spec file) -- Checkpointed milestones with acceptance criteria (`plans.md`) -- A runbook for how the agent should operate (`implement.md`) -- Continuous verification (tests/lint/typecheck/build) -- A live status/audit log (`documentation.md`) so the run stayed inspectable - -This is the direction long-horizon coding work is moving toward: less babysitting, more delegation with guardrails. - -## Try Codex on your own long-running task - -This 25-hour Codex run is a preview of where building with code is going. 
We’re moving beyond single-shot prompts and tight pair-programming loops toward long-running teammates that can take a real slice of work end to end, with you steering at milestones instead of micromanaging every line. - -Our direction with Codex is simple: stronger teammate behavior, tighter integration with your real context, and guardrails that keep work reliable, reviewable, and easy to ship. We’re already seeing developers move faster when the agent absorbs routine implementation and verification, freeing humans up for the parts that matter most: design, architecture, product decisions, and the novel problems that don’t have a template. - -And this won’t stop with developers. As Codex gets even better at capturing intent and providing safe scaffolding (plans, validations, previews, rollbacks), more non-developers will be able to build and iterate without living in an IDE. There’s more coming across Codex surfaces and models, but the north star stays the same: make the agent feel less like a tool you babysit and more like a teammate you can trust on long-horizon work. 
- -If you want to try this yourself, start with: - -- [Codex overview](https://developers.openai.com/codex/) -- [Codex quickstart](https://developers.openai.com/codex/quickstart/) -- [Codex models](https://developers.openai.com/codex/models/) -- [Codex app features](https://developers.openai.com/codex/app/features/) \ No newline at end of file diff --git a/examples/evals/realtime_evals/Makefile b/examples/evals/realtime_evals/Makefile index 1f39ecfb19..de5136f585 100644 --- a/examples/evals/realtime_evals/Makefile +++ b/examples/evals/realtime_evals/Makefile @@ -7,13 +7,21 @@ ifeq ($(UV_BIN),) RUFF := $(VENV_DIR)/bin/ruff MYPY := $(VENV_DIR)/bin/mypy PYTEST := $(VENV_DIR)/bin/pytest +STREAMLIT := $(VENV_DIR)/bin/streamlit +RUN_PYTHON := $(VENV_PYTHON) else RUFF := uv run --with ruff ruff MYPY := uv run --with mypy --with pandas-stubs --with types-seaborn --with types-tqdm mypy PYTEST := uv run --with pytest pytest +STREAMLIT := uv run --with streamlit streamlit +RUN_PYTHON := uv run python endif -.PHONY: install format lint lint-fix typecheck test +.PHONY: install streamlit format lint lint-fix typecheck test validate-input validate-output + +HARNESS ?= +DATA_PATH ?= +RUN_DIR ?= install: ifeq ($(UV_BIN),) @@ -38,3 +46,14 @@ typecheck: test: $(PYTEST) + +streamlit: + cd results_viewer && $(STREAMLIT) run app.py + +validate-input: + @test -n "$(HARNESS)" || (echo "HARNESS is required, e.g. make validate-input HARNESS=run" && exit 2) + $(RUN_PYTHON) shared/scripts/validate_eval_input.py --harness $(HARNESS) $(if $(DATA_PATH),--data-path "$(DATA_PATH)",) + +validate-output: + @test -n "$(HARNESS)" || (echo "HARNESS is required, e.g. 
make validate-output HARNESS=run" && exit 2) + $(RUN_PYTHON) shared/scripts/validate_eval_output.py --harness $(HARNESS) $(if $(RUN_DIR),--run-dir "$(RUN_DIR)",) diff --git a/examples/evals/realtime_evals/README.md b/examples/evals/realtime_evals/README.md index 83f927180f..bcbc1ee362 100644 --- a/examples/evals/realtime_evals/README.md +++ b/examples/evals/realtime_evals/README.md @@ -33,6 +33,7 @@ Run a first command per harness. If uv is not installed, replace `uv run` with ` Use the root `Makefile` for common checks. Run `make install` first to create `.venv`. These targets work with or without `uv`: when `uv` is installed they run through `uv run`, and otherwise they use the matching tool binaries from the local `.venv`. - `make install` +- `make streamlit` - `make format` - `make lint` - `make lint-fix` @@ -102,6 +103,28 @@ To render charts for an existing run after the fact: uv run python plot_eval_results.py --run-dir run_harness/results/ ``` +## Results Viewer + +Use the Streamlit results viewer to browse saved runs from `crawl_harness`, `walk_harness`, and `run_harness` without opening the raw artifacts by hand. + +- `Comparison View`: select a harness, choose one or more saved runs, and compare summary metrics, scores, latency, and token usage across runs. +- `Run Viewer`: inspect one saved run in detail. Crawl and walk runs show row-level audio artifacts and event logs; run-harness runs use a Simulation Viewer with transcripts, event logs, and turn audio. + +Run it from this directory with either: + +```bash +make streamlit +``` + +or: + +```bash +cd results_viewer +uv run streamlit run app.py +``` + +Then open the local Streamlit URL, usually `http://localhost:8501`. 
+ ## Common CLI flags All harnesses share a core set of flags so you can switch between them easily: diff --git a/examples/evals/realtime_evals/pyproject.toml b/examples/evals/realtime_evals/pyproject.toml index 6f7025a027..ba3247b669 100644 --- a/examples/evals/realtime_evals/pyproject.toml +++ b/examples/evals/realtime_evals/pyproject.toml @@ -17,9 +17,17 @@ dev = [ "mypy", "pandas-stubs", "pytest", + "streamlit", "types-seaborn", "types-tqdm", ] [tool.ruff.lint] ignore = ["E402"] + +[tool.mypy] +explicit_package_bases = true + +[[tool.mypy.overrides]] +module = ["streamlit", "streamlit.*"] +ignore_missing_imports = true diff --git a/examples/evals/realtime_evals/requirements-dev.txt b/examples/evals/realtime_evals/requirements-dev.txt index 2f4b15266a..7bcc9a0af0 100644 --- a/examples/evals/realtime_evals/requirements-dev.txt +++ b/examples/evals/realtime_evals/requirements-dev.txt @@ -1,5 +1,8 @@ # This file was autogenerated by uv via the following command: # uv export --format requirements-txt --group dev --output-file requirements-dev.txt +altair==6.0.0 \ + --hash=sha256:09ae95b53d5fe5b16987dccc785a7af8588f2dca50de1e7a156efa8a461515f8 \ + --hash=sha256:614bf5ecbe2337347b590afb111929aa9c16c9527c4887d96c9bc7f6640756b4 annotated-types==0.7.0 \ --hash=sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53 \ --hash=sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89 diff --git a/examples/evals/realtime_evals/results_viewer/.streamlit/config.toml b/examples/evals/realtime_evals/results_viewer/.streamlit/config.toml new file mode 100644 index 0000000000..af1a91a669 --- /dev/null +++ b/examples/evals/realtime_evals/results_viewer/.streamlit/config.toml @@ -0,0 +1,72 @@ +[theme] +base = "light" +primaryColor = "#6e716c" +backgroundColor = "#f3f3ef" +secondaryBackgroundColor = "#fafaf7" +textColor = "#111111" +linkColor = "#2f2f2c" +linkUnderline = false +baseRadius = "large" +buttonRadius = "large" +borderColor = "#d8d8d2" 
+dataframeBorderColor = "#d8d8d2" +dataframeHeaderBackgroundColor = "#efefe9" +showWidgetBorder = true +showSidebarBorder = true +codeTextColor = "#1d1d1b" +codeBackgroundColor = "#f7f7f3" +grayColor = "#6d6e68" +grayBackgroundColor = "#ecece6" +grayTextColor = "#2f2f2c" +blueColor = "#7c9dad" +blueBackgroundColor = "#e8eef1" +blueTextColor = "#2c3740" +greenColor = "#8fa48b" +greenBackgroundColor = "#ebf0e8" +greenTextColor = "#314034" +yellowColor = "#baa774" +yellowBackgroundColor = "#f3efe3" +yellowTextColor = "#4a3f23" +orangeColor = "#c3a082" +orangeBackgroundColor = "#f4ede7" +orangeTextColor = "#4a3529" +redColor = "#ba8d89" +redBackgroundColor = "#f3eaea" +redTextColor = "#4a2928" +violetColor = "#a59bb9" +violetBackgroundColor = "#efecf5" +violetTextColor = "#3a3348" + +[theme.sidebar] +backgroundColor = "#ecece6" +secondaryBackgroundColor = "#f4f4ee" +textColor = "#111111" +linkColor = "#2f2f2c" +linkUnderline = false +codeTextColor = "#1d1d1b" +codeBackgroundColor = "#f7f7f3" +borderColor = "#d8d8d2" +dataframeBorderColor = "#d8d8d2" +dataframeHeaderBackgroundColor = "#efefe9" +showWidgetBorder = true +grayColor = "#6d6e68" +grayBackgroundColor = "#e6e6df" +grayTextColor = "#2f2f2c" +blueColor = "#7c9dad" +blueBackgroundColor = "#e8eef1" +blueTextColor = "#2c3740" +greenColor = "#8fa48b" +greenBackgroundColor = "#ebf0e8" +greenTextColor = "#314034" +yellowColor = "#baa774" +yellowBackgroundColor = "#f3efe3" +yellowTextColor = "#4a3f23" +orangeColor = "#c3a082" +orangeBackgroundColor = "#f4ede7" +orangeTextColor = "#4a3529" +redColor = "#ba8d89" +redBackgroundColor = "#f3eaea" +redTextColor = "#4a2928" +violetColor = "#a59bb9" +violetBackgroundColor = "#efecf5" +violetTextColor = "#3a3348" diff --git a/examples/evals/realtime_evals/results_viewer/README.md b/examples/evals/realtime_evals/results_viewer/README.md new file mode 100644 index 0000000000..e79c158b4c --- /dev/null +++ b/examples/evals/realtime_evals/results_viewer/README.md @@ -0,0 +1,65 @@ +# 
Results Viewer + +This directory contains a [Streamlit](https://streamlit.io/) app for exploring +saved realtime eval runs from the crawl, walk, and run harnesses, plus +bootstrap-generated eval folders created under `examples/evals/realtime_evals/`. + +The app auto-discovers run directories under: + +- `crawl_harness/results/` +- `walk_harness/results/` +- `run_harness/results/` +- `*_realtime_eval/results/` when the folder includes `bootstrap_manifest.json` + +## What It Shows + +- **Comparison View**: compare summary metrics across one or more saved runs +- **Run Viewer**: inspect a single saved run, including: + - `results.csv` rows + - crawl/walk input and output audio artifacts + - crawl/walk per-example event logs + - run-harness simulation transcripts, event logs, and turn audio + +Note: `run_harness` runs are inspected via the app's **Simulation Viewer**, +which has a different UI and a different artifact set than the crawl/walk +viewer. + +## Run Locally + +From `examples/evals/realtime_evals/`: + +```bash +uv venv .venv +source .venv/bin/activate +uv sync --group dev +cd results_viewer +uv run streamlit run app.py +``` + +Then open the local URL that Streamlit prints, usually +`http://localhost:8501`. + +If you are using the pip-based install path instead of `uv`, install the dev +dependencies first so `streamlit` is available: + +```bash +pip install -r requirements.txt -r requirements-dev.txt +cd results_viewer +streamlit run app.py +``` + +## Expected Data Layout + +The viewer expects each saved run directory to contain: + +- `summary.json` for aggregate metrics +- `results.csv` for per-example results + +For crawl and walk runs, the app can also display: + +- `audio//input.wav` +- `audio//output.wav` +- `events/.jsonl` + +The app discovers runs recursively, so nested result directories are fine as +long as those files are present. 
diff --git a/examples/evals/realtime_evals/results_viewer/__init__.py b/examples/evals/realtime_evals/results_viewer/__init__.py new file mode 100644 index 0000000000..f2dabed8b9 --- /dev/null +++ b/examples/evals/realtime_evals/results_viewer/__init__.py @@ -0,0 +1 @@ +"""Streamlit results viewer package.""" diff --git a/examples/evals/realtime_evals/results_viewer/app.py b/examples/evals/realtime_evals/results_viewer/app.py new file mode 100644 index 0000000000..ef90966bf8 --- /dev/null +++ b/examples/evals/realtime_evals/results_viewer/app.py @@ -0,0 +1,1416 @@ +from __future__ import annotations + +import json +from pathlib import Path +from typing import TYPE_CHECKING + +import pandas as pd +import streamlit as st + +if TYPE_CHECKING: + from results_viewer.config import ( + DEFAULT_LATENCY_SERIES, + DEFAULT_SCORE_KEYS, + DEFAULT_TOKEN_SERIES, + LATENCY_CHART_KEYS, + PERCENTILE_LABELS, + SCORE_KEY_LABELS, + SUMMARY_TABLE_BASE_COLUMNS, + TOKEN_CHART_KEYS, + ) + from results_viewer.ui import load_css +else: + try: + from .config import ( + DEFAULT_LATENCY_SERIES, + DEFAULT_SCORE_KEYS, + DEFAULT_TOKEN_SERIES, + LATENCY_CHART_KEYS, + PERCENTILE_LABELS, + SCORE_KEY_LABELS, + SUMMARY_TABLE_BASE_COLUMNS, + TOKEN_CHART_KEYS, + ) + from .ui import load_css + except ImportError: + from config import ( + DEFAULT_LATENCY_SERIES, + DEFAULT_SCORE_KEYS, + DEFAULT_TOKEN_SERIES, + LATENCY_CHART_KEYS, + PERCENTILE_LABELS, + SCORE_KEY_LABELS, + SUMMARY_TABLE_BASE_COLUMNS, + TOKEN_CHART_KEYS, + ) + from ui import load_css + +ROOT_DIR = Path(__file__).resolve().parents[1] +HARNESS_RESULTS_DIRS = { + "crawl": ROOT_DIR / "crawl_harness" / "results", + "walk": ROOT_DIR / "walk_harness" / "results", + "run": ROOT_DIR / "run_harness" / "results", +} +CHART_RUN_PALETTE = [ + "#b9cfdb", + "#c7d3bf", + "#ddc2be", + "#cbc3da", + "#ddd0b7", + "#bfd4cc", +] +CHART_TEXT_COLOR = "#151515" +CHART_MUTED_TEXT_COLOR = "#5f5f5a" +CHART_DOMAIN_COLOR = "#cecec7" +CHART_GRID_COLOR = "#dfdfd8" 
+CHART_CONTENT_HEIGHT = 280 +CHART_LABEL_MAX_CHARS = 10 +CHART_LABEL_SUFFIX = "...." +RUN_VIEWER_SELECTED_COLUMN = "__selected__" +RUN_VIEWER_TABLE_VISIBLE_ROWS = 200 +RUN_VIEWER_TABLE_ROW_HEIGHT = 35 +RUN_VIEWER_TABLE_HEADER_HEIGHT = 38 +RUN_VIEWER_TABLE_FRAME_PADDING = 6 + + +def chart_theme_config() -> dict[str, object]: + return { + "view": {"stroke": None}, + "background": "transparent", + "axis": { + "labelColor": CHART_MUTED_TEXT_COLOR, + "titleColor": CHART_TEXT_COLOR, + "domainColor": CHART_DOMAIN_COLOR, + "gridColor": CHART_GRID_COLOR, + "tickColor": CHART_DOMAIN_COLOR, + }, + "legend": { + "labelColor": CHART_MUTED_TEXT_COLOR, + "titleColor": CHART_TEXT_COLOR, + }, + } + + +def result_directory_label(path: Path) -> str: + try: + return str(path.relative_to(ROOT_DIR)) + except ValueError: + return str(path) + + +def display_run_label(value: str | Path) -> str: + raw_value = str(value) + display_label = Path(raw_value).name + return display_label or raw_value + + +def path_state_key_fragment(path: Path) -> str: + return result_directory_label(path).replace("/", "__") + + +def harness_results_roots(harness: str) -> list[Path]: + roots = [HARNESS_RESULTS_DIRS[harness]] + for manifest_path in sorted(ROOT_DIR.glob("*/bootstrap_manifest.json")): + try: + manifest_payload = json.loads(manifest_path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + continue + if manifest_payload.get("harness") != harness: + continue + results_root = manifest_path.parent / "results" + if results_root not in roots: + roots.append(results_root) + return roots + + +def discover_result_directories(results_roots: Path | list[Path]) -> list[Path]: + if isinstance(results_roots, Path): + candidate_roots = [results_roots] + else: + candidate_roots = results_roots + + run_directories = { + path.parent + for results_root in candidate_roots + if results_root.exists() + for path in results_root.rglob("*") + if path.is_file() and path.name in {"results.csv", 
"summary.json"} + } + return sorted(run_directories) + + +def relative_directory_labels(paths: list[Path], root: Path | None = None) -> list[str]: + label_root = ROOT_DIR if root is None else root + labels: list[str] = [] + for path in paths: + try: + labels.append(str(path.relative_to(label_root))) + except ValueError: + labels.append(str(path)) + return labels + + +def summary_path_for_run(run_directory: Path) -> Path: + return run_directory / "summary.json" + + +def results_csv_path_for_run(run_directory: Path) -> Path: + return run_directory / "results.csv" + + +def coerce_float(value: object) -> float | None: + if value is None: + return None + if isinstance(value, bool): + return float(int(value)) + if isinstance(value, (int, float)): + return float(value) + if isinstance(value, str): + stripped = value.strip() + if not stripped: + return None + try: + return float(stripped) + except (TypeError, ValueError): + return None + return None + + +def _set_results_selection( + selection_key: str, select_all_key: str, available_labels: list[str] +) -> None: + if st.session_state.get(select_all_key): + st.session_state[selection_key] = available_labels + return + + current_selection = st.session_state.get(selection_key, []) + st.session_state[selection_key] = [ + label for label in current_selection if label in available_labels + ] + + +def _sync_select_all_toggle( + selection_key: str, select_all_key: str, available_labels: list[str] +) -> None: + selected_labels = st.session_state.get(selection_key, []) + st.session_state[select_all_key] = bool(available_labels) and len( + selected_labels + ) == len(available_labels) + + +def selected_runs_status_markup(selected_run_count: int) -> str: + run_label = "run" if selected_run_count == 1 else "runs" + return ( + '
<div class="selected-runs-status">
' + '<span class="selected-runs-count">' + f"{selected_run_count}" + " " + f"{run_label} selected" + "
</span></div>
" + ) + + +def selected_result_directories(harness: str) -> list[Path]: + results_roots = harness_results_roots(harness) + available_paths = discover_result_directories(results_roots) + available_labels = relative_directory_labels(available_paths) + + selection_key = f"selected_results_{harness}" + select_all_key = f"selected_results_all_{harness}" + + current_selection = st.session_state.get(selection_key, []) + valid_defaults = [label for label in current_selection if label in available_labels] + all_selected = bool(available_labels) and len(valid_defaults) == len( + available_labels + ) + + if selection_key not in st.session_state: + st.session_state[selection_key] = valid_defaults + if select_all_key not in st.session_state: + st.session_state[select_all_key] = all_selected + + if ( + st.session_state.get(select_all_key) + and st.session_state.get(selection_key) != available_labels + ): + st.session_state[selection_key] = available_labels + elif ( + not st.session_state.get(select_all_key) + and st.session_state.get(selection_key) != valid_defaults + ): + st.session_state[selection_key] = valid_defaults + + with st.popover("Results browser"): + st.caption(f"Browsing saved `{harness}_harness` runs") + if len(results_roots) > 1: + st.caption( + "Includes scaffolded eval folders created by the bootstrap skill." 
+ ) + if not available_labels: + st.info("No result directories found for this harness yet.") + return [] + + st.checkbox( + "Select all", + key=select_all_key, + on_change=_set_results_selection, + args=(selection_key, select_all_key, available_labels), + ) + st.multiselect( + "Select one or more result directories", + options=available_labels, + key=selection_key, + placeholder="Choose saved runs", + format_func=display_run_label, + on_change=_sync_select_all_toggle, + args=(selection_key, select_all_key, available_labels), + ) + + selected_set = set(st.session_state.get(selection_key, [])) + return [ + path + for path, label in zip(available_paths, available_labels, strict=False) + if label in selected_set + ] + + +def selected_run_directory(harness: str) -> Path | None: + results_roots = harness_results_roots(harness) + available_paths = discover_result_directories(results_roots) + available_labels = relative_directory_labels(available_paths) + + if not available_paths: + st.info("No result directories found for this harness yet.") + return None + + current_selection = st.session_state.get(f"selected_run_{harness}") + default_index = 0 + if current_selection in available_labels: + default_index = available_labels.index(current_selection) + + selected_label = st.selectbox( + "Run", + options=available_labels, + index=default_index, + key=f"selected_run_{harness}", + format_func=display_run_label, + ) + if not selected_label: + return None + + return available_paths[available_labels.index(selected_label)] + + +def load_selected_summaries( + selected_paths: list[Path], +) -> tuple[list[dict[str, object]], list[str]]: + loaded_summaries: list[dict[str, object]] = [] + load_errors: list[str] = [] + + for run_directory in selected_paths: + summary_path = summary_path_for_run(run_directory) + run_label = result_directory_label(run_directory) + run_display_label = display_run_label(run_label) + if not summary_path.exists(): + load_errors.append(f"`{run_display_label}` 
is missing `summary.json`.") + continue + + try: + summary_payload = json.loads(summary_path.read_text(encoding="utf-8")) + except json.JSONDecodeError: + load_errors.append(f"`{run_display_label}` has an invalid `summary.json`.") + continue + + if not isinstance(summary_payload, dict): + load_errors.append( + f"`{run_display_label}` has a non-object `summary.json`." + ) + continue + + loaded_summaries.append( + { + "run_label": run_label, + "run_display_label": run_display_label, + "run_directory": str(run_directory), + "summary_path": str(summary_path), + **summary_payload, + } + ) + + return loaded_summaries, load_errors + + +def load_results_frame(run_directory: Path) -> tuple[pd.DataFrame | None, str | None]: + results_csv_path = results_csv_path_for_run(run_directory) + if not results_csv_path.exists(): + return None, "This run is missing `results.csv`." + + try: + results = pd.read_csv(results_csv_path, keep_default_na=False) + except Exception as exc: + return None, f"Could not load `results.csv`: {exc}" + + return results, None + + +def chart_run_label(summary: dict[str, object]) -> str: + run_display_label = summary.get("run_display_label") + if isinstance(run_display_label, str) and run_display_label: + return run_display_label + return display_run_label(str(summary["run_label"])) + + +def build_score_chart_frame( + summaries: list[dict[str, object]], metric_keys: list[str] +) -> pd.DataFrame: + rows: list[dict[str, object]] = [] + for summary in summaries: + run_label = chart_run_label(summary) + for metric_key in metric_keys: + metric_value = coerce_float(summary.get(metric_key)) + if metric_value is None: + continue + rows.append( + { + "run_label": run_label, + "metric_key": metric_key, + "metric_label": score_key_label(metric_key), + "value": metric_value, + } + ) + return pd.DataFrame(rows) + + +def score_key_label(metric_key: str) -> str: + if metric_key in SCORE_KEY_LABELS: + return SCORE_KEY_LABELS[metric_key] + if 
metric_key.endswith("_grade_mean"): + metric_key = metric_key[: -len("_grade_mean")] + return metric_key.replace("_", " ").strip().title() + + +def available_score_keys(summaries: list[dict[str, object]]) -> list[str]: + default_keys = [ + metric_key + for metric_key in DEFAULT_SCORE_KEYS + if any(coerce_float(summary.get(metric_key)) is not None for summary in summaries) + ] + dynamic_grade_keys = sorted( + { + metric_key + for summary in summaries + for metric_key, metric_value in summary.items() + if metric_key.endswith("_grade_mean") + and metric_key not in DEFAULT_SCORE_KEYS + and coerce_float(metric_value) is not None + } + ) + return default_keys + dynamic_grade_keys + + +def available_series_labels( + summaries: list[dict[str, object]], metric_config: dict[str, list[str]] +) -> list[str]: + available_labels: list[str] = [] + for series_label, metric_keys in metric_config.items(): + if any( + summary.get(metric_key) is not None + for summary in summaries + for metric_key in metric_keys + ): + available_labels.append(series_label) + return available_labels + + +def build_percentile_chart_frame( + summaries: list[dict[str, object]], metric_config: dict[str, list[str]] +) -> pd.DataFrame: + rows: list[dict[str, object]] = [] + for summary in summaries: + run_label = chart_run_label(summary) + for series_label, metric_keys in metric_config.items(): + for percentile_order, metric_key in enumerate(metric_keys): + metric_value = coerce_float(summary.get(metric_key)) + if metric_value is None: + continue + percentile_suffix = metric_key.rsplit("_", 1)[-1] + rows.append( + { + "run_label": run_label, + "series_label": series_label, + "metric_key": metric_key, + "percentile_label": PERCENTILE_LABELS.get( + percentile_suffix, percentile_suffix + ), + "percentile_order": percentile_order, + "value": metric_value, + } + ) + return pd.DataFrame(rows) + + +def selected_metric_controls( + harness: str, + summaries: list[dict[str, object]], +) -> tuple[list[str], dict[str, 
list[str]], dict[str, list[str]]]: + score_options = available_score_keys(summaries) + latency_options = available_series_labels(summaries, LATENCY_CHART_KEYS) + token_options = available_series_labels(summaries, TOKEN_CHART_KEYS) + + score_defaults = [key for key in DEFAULT_SCORE_KEYS if key in score_options] + latency_defaults = [ + label for label in DEFAULT_LATENCY_SERIES if label in latency_options + ] + token_defaults = [label for label in DEFAULT_TOKEN_SERIES if label in token_options] + + with st.popover("Chart metrics"): + st.caption("Toggle which metrics and metric families are visible.") + selected_score_keys = st.multiselect( + "Overall score metrics", + options=score_options, + default=score_defaults, + format_func=score_key_label, + key=f"selected_score_keys_{harness}", + ) + selected_latency_labels = st.multiselect( + "Latency families", + options=latency_options, + default=latency_defaults, + key=f"selected_latency_series_{harness}", + ) + selected_token_labels = st.multiselect( + "Token families", + options=token_options, + default=token_defaults, + key=f"selected_token_series_{harness}", + ) + + selected_latency_config = { + label: LATENCY_CHART_KEYS[label] for label in selected_latency_labels + } + selected_token_config = { + label: TOKEN_CHART_KEYS[label] for label in selected_token_labels + } + return selected_score_keys, selected_latency_config, selected_token_config + + +def build_summary_table( + summaries: list[dict[str, object]], + selected_score_keys: list[str], + latency_metric_config: dict[str, list[str]], + token_metric_config: dict[str, list[str]], +) -> pd.DataFrame: + table_columns = list(SUMMARY_TABLE_BASE_COLUMNS) + table_columns.extend(selected_score_keys) + + for metric_keys in latency_metric_config.values(): + table_columns.extend(metric_keys) + for metric_keys in token_metric_config.values(): + table_columns.extend(metric_keys) + + ordered_columns: list[str] = [] + seen_columns: set[str] = set() + for column in table_columns: 
+ if column in seen_columns: + continue + if any(summary.get(column) is not None for summary in summaries): + ordered_columns.append(column) + seen_columns.add(column) + + rows: list[dict[str, object]] = [] + for summary in summaries: + row: dict[str, object] = {} + for column in ordered_columns: + if column == "run_label": + row[column] = chart_run_label(summary) + else: + row[column] = summary.get(column) + rows.append(row) + + if not rows: + return pd.DataFrame() + + return pd.DataFrame(rows) + + +def truncated_chart_label_expr() -> str: + return ( + f"length(datum.label) > {CHART_LABEL_MAX_CHARS} ? " + f"slice(datum.label, 0, {CHART_LABEL_MAX_CHARS}) + '{CHART_LABEL_SUFFIX}' : " + "datum.label" + ) + + +def angled_chart_axis() -> dict[str, object]: + return { + "labelAngle": 90, + "labelExpr": truncated_chart_label_expr(), + } + + +def render_grouped_bar_chart(data: pd.DataFrame) -> None: + if data.empty: + st.info("No score metrics were found in the selected summaries.") + return + + st.vega_lite_chart( + data, + { + "config": chart_theme_config(), + "height": CHART_CONTENT_HEIGHT, + "layer": [ + { + "mark": { + "type": "bar", + "cornerRadiusTopLeft": 4, + "cornerRadiusTopRight": 4, + }, + "encoding": { + "x": { + "field": "metric_label", + "type": "nominal", + "title": "Metric", + "axis": angled_chart_axis(), + }, + "xOffset": {"field": "run_label"}, + "y": { + "field": "value", + "type": "quantitative", + "title": "Score", + }, + "color": { + "field": "run_label", + "type": "nominal", + "title": "Run", + "scale": {"range": CHART_RUN_PALETTE}, + }, + "tooltip": [ + { + "field": "metric_label", + "type": "nominal", + "title": "Metric", + }, + {"field": "run_label", "type": "nominal", "title": "Run"}, + { + "field": "value", + "type": "quantitative", + "format": ".3f", + }, + ], + }, + }, + { + "mark": { + "type": "text", + "dy": -8, + "fontSize": 11, + }, + "encoding": { + "x": { + "field": "metric_label", + "type": "nominal", + "title": "Metric", + "axis": 
angled_chart_axis(), + }, + "xOffset": {"field": "run_label"}, + "y": { + "field": "value", + "type": "quantitative", + "title": "Score", + }, + "text": { + "field": "value", + "type": "quantitative", + "format": ".3f", + }, + "color": { + "value": CHART_TEXT_COLOR, + }, + }, + }, + ], + }, + use_container_width=True, + ) + + +def render_percentile_ladder_chart(data: pd.DataFrame, y_title: str) -> None: + if data.empty: + st.info("No percentile metrics were found in the selected summaries.") + return + + for series_label in data["series_label"].drop_duplicates().tolist(): + series_data = data[data["series_label"] == series_label] + st.caption(series_label) + st.vega_lite_chart( + series_data, + { + "config": chart_theme_config(), + "height": CHART_CONTENT_HEIGHT, + "layer": [ + { + "mark": { + "type": "bar", + "cornerRadiusTopLeft": 4, + "cornerRadiusTopRight": 4, + }, + "encoding": { + "x": { + "field": "percentile_label", + "type": "ordinal", + "title": "Percentile", + "sort": ["avg", "p50", "p95", "p99"], + "axis": angled_chart_axis(), + }, + "xOffset": {"field": "run_label"}, + "y": { + "field": "value", + "type": "quantitative", + "title": y_title, + }, + "color": { + "field": "run_label", + "type": "nominal", + "title": "Run", + "scale": {"range": CHART_RUN_PALETTE}, + }, + "tooltip": [ + { + "field": "run_label", + "type": "nominal", + "title": "Run", + }, + { + "field": "series_label", + "type": "nominal", + "title": "Series", + }, + { + "field": "percentile_label", + "type": "ordinal", + "title": "Percentile", + }, + { + "field": "value", + "type": "quantitative", + "format": ".2f", + }, + ], + }, + }, + { + "mark": { + "type": "text", + "dy": -8, + "fontSize": 11, + }, + "encoding": { + "x": { + "field": "percentile_label", + "type": "ordinal", + "title": "Percentile", + "sort": ["avg", "p50", "p95", "p99"], + "axis": angled_chart_axis(), + }, + "xOffset": {"field": "run_label"}, + "y": { + "field": "value", + "type": "quantitative", + "title": y_title, + 
}, + "text": { + "field": "value", + "type": "quantitative", + "format": ".2f", + }, + "color": {"value": CHART_TEXT_COLOR}, + }, + }, + ], + }, + use_container_width=True, + ) + + +def render_comparison_config_bar() -> tuple[str, list[Path]]: + with st.container(): + harness_column, browser_column, status_column = st.columns( + [2.9, 1.2, 1], + gap="small", + vertical_alignment="bottom", + ) + + with harness_column: + harness = st.selectbox( + "Harness", + options=["crawl", "walk", "run"], + index=0, + format_func=lambda value: f"{value}_harness", + key="comparison_harness", + ) + + with browser_column: + selected_paths = selected_result_directories(harness) + + with status_column: + st.markdown( + selected_runs_status_markup(len(selected_paths)), + unsafe_allow_html=True, + ) + + return harness, selected_paths + + +def render_run_viewer_config_bar() -> tuple[str, Path | None]: + with st.container(): + left, right = st.columns([1.1, 2.2]) + + with left: + harness = st.selectbox( + "Harness", + options=["crawl", "walk", "run"], + index=0, + format_func=lambda value: f"{value}_harness", + key="run_viewer_harness", + ) + + with right: + selected_path = selected_run_directory(harness) + + return harness, selected_path + + +def read_text_if_present(path: Path | None) -> str | None: + if path is None or not path.exists(): + return None + return path.read_text(encoding="utf-8") + + +def audio_bytes_if_present(path: Path | None) -> bytes | None: + if path is None or not path.exists(): + return None + return path.read_bytes() + + +def resolve_run_relative_path(run_directory: Path, candidate: Path) -> Path | None: + try: + resolved_run_directory = run_directory.resolve() + resolved_root = ROOT_DIR.resolve() + resolved_candidate = ( + candidate.resolve(strict=False) + if candidate.is_absolute() + else (resolved_run_directory / candidate).resolve(strict=False) + ) + except OSError: + return None + if not ( + resolved_candidate.is_relative_to(resolved_run_directory) + or 
resolved_candidate.is_relative_to(resolved_root) + ): + return None + return resolved_candidate + + +def clean_path_value(run_directory: Path, value: object) -> Path | None: + if not isinstance(value, str): + return None + cleaned_value = value.strip() + if not cleaned_value: + return None + return resolve_run_relative_path(run_directory, Path(cleaned_value)) + + +def simulation_id_for_row(row: pd.Series) -> str: + return str(row.get("simulation_id", "")).strip() + + +def selected_run_viewer_row_indices(table_data: pd.DataFrame) -> list[int]: + if RUN_VIEWER_SELECTED_COLUMN not in table_data.columns: + return [] + + selected_rows: list[int] = [] + for row_index, value in enumerate(table_data[RUN_VIEWER_SELECTED_COLUMN].tolist()): + if pd.isna(value): + continue + if bool(value): + selected_rows.append(row_index) + return selected_rows + + +def edited_run_viewer_row_indices(editing_state: object) -> list[int]: + if not isinstance(editing_state, dict): + return [] + + edited_rows = editing_state.get("edited_rows") + if not isinstance(edited_rows, dict): + return [] + + selected_rows: list[int] = [] + for row_index, row_changes in edited_rows.items(): + if not isinstance(row_index, int) or not isinstance(row_changes, dict): + continue + if row_changes.get(RUN_VIEWER_SELECTED_COLUMN) is True: + selected_rows.append(row_index) + return selected_rows + + +def row_text_value(row: pd.Series, candidate_columns: tuple[str, ...]) -> str | None: + for column_name in candidate_columns: + if column_name not in row.index: + continue + value = str(row.get(column_name, "")).strip() + if value: + return value + return None + + +def normalize_row_index(candidate: object, total_rows: int) -> int | None: + if isinstance(candidate, int) and 0 <= candidate < total_rows: + return candidate + return None + + +def resolve_active_row_index( + table_selected_index: int | None, + previous_table_selected_index: int | None, + stored_active_index: object, + total_rows: int, +) -> int | None: + if 
total_rows <= 0: + return None + if ( + table_selected_index is not None + and table_selected_index != previous_table_selected_index + ): + return table_selected_index + + active_row_index = normalize_row_index(stored_active_index, total_rows) + if active_row_index is not None: + return active_row_index + + if table_selected_index is not None: + return table_selected_index + + return 0 + + +def build_run_viewer_table( + results: pd.DataFrame, active_row_index: int | None +) -> pd.DataFrame: + display_results = results.copy() + selected_rows = [False] * len(display_results) + if active_row_index is not None and 0 <= active_row_index < len(display_results): + selected_rows[active_row_index] = True + display_results.insert(0, RUN_VIEWER_SELECTED_COLUMN, selected_rows) + return display_results + + +def resolve_run_viewer_table_selected_index( + table_data: pd.DataFrame, + editing_state: object, + current_active_index: int | None, +) -> int | None: + total_rows = len(table_data) + for candidate in reversed(edited_run_viewer_row_indices(editing_state)): + normalized_candidate = normalize_row_index(candidate, total_rows) + if normalized_candidate is not None: + return normalized_candidate + + selected_rows = selected_run_viewer_row_indices(table_data) + if current_active_index is not None and current_active_index in selected_rows: + return current_active_index + if selected_rows: + return selected_rows[0] + return current_active_index + + +def run_viewer_table_matches_active_row( + table_data: pd.DataFrame, + active_row_index: int | None, +) -> bool: + normalized_active_index = normalize_row_index(active_row_index, len(table_data)) + expected_rows = [] if normalized_active_index is None else [normalized_active_index] + return selected_run_viewer_row_indices(table_data) == expected_rows + + +def run_viewer_table_height(total_rows: int) -> int: + visible_rows = max(1, min(total_rows, RUN_VIEWER_TABLE_VISIBLE_ROWS)) + return ( + RUN_VIEWER_TABLE_HEADER_HEIGHT + + 
(visible_rows * RUN_VIEWER_TABLE_ROW_HEIGHT) + + RUN_VIEWER_TABLE_FRAME_PADDING + ) + + +def input_audio_path_for_row(run_directory: Path, row: pd.Series) -> Path | None: + for column_name in ("input_audio_path", "audio_path"): + candidate = clean_path_value(run_directory, row.get(column_name, "")) + if candidate is not None: + return candidate + + example_id = str(row.get("example_id", "")).strip() + if not example_id: + return None + candidate = resolve_run_relative_path( + run_directory, Path("audio") / example_id / "input.wav" + ) + if candidate is not None and candidate.exists(): + return candidate + return None + + +def output_audio_path_for_row(run_directory: Path, row: pd.Series) -> Path | None: + candidate = clean_path_value(run_directory, row.get("output_audio_path", "")) + if candidate is not None: + return candidate + + example_id = str(row.get("example_id", "")).strip() + if not example_id: + return None + candidate = resolve_run_relative_path( + run_directory, Path("audio") / example_id / "output.wav" + ) + if candidate is not None and candidate.exists(): + return candidate + return None + + +def event_log_path_for_row(run_directory: Path, row: pd.Series) -> Path | None: + candidate = clean_path_value(run_directory, row.get("event_log_path", "")) + if candidate is not None: + return candidate + + example_id = str(row.get("example_id", "")).strip() + if not example_id: + return None + candidate = resolve_run_relative_path( + run_directory, Path("events") / f"{example_id}.jsonl" + ) + if candidate is not None and candidate.exists(): + return candidate + return None + + +def simulation_transcript_path_for_row( + run_directory: Path, row: pd.Series +) -> Path | None: + simulation_id = simulation_id_for_row(row) + if not simulation_id: + return None + + candidate = resolve_run_relative_path( + run_directory, Path("conversations") / f"{simulation_id}.txt" + ) + if candidate is not None and candidate.exists(): + return candidate + return None + + +def 
simulation_event_log_path_for_row( + run_directory: Path, row: pd.Series +) -> Path | None: + candidate = clean_path_value(run_directory, row.get("event_log_path", "")) + if candidate is not None: + return candidate + + simulation_id = simulation_id_for_row(row) + if not simulation_id: + return None + + candidate = resolve_run_relative_path( + run_directory, Path("events") / f"{simulation_id}.jsonl" + ) + if candidate is not None and candidate.exists(): + return candidate + return None + + +def _audio_sort_key(path: Path) -> tuple[int, int, str]: + stem_parts = path.stem.split("_") + turn_index = -1 + role_order = 2 + + if len(stem_parts) >= 3 and stem_parts[0] == "turn": + try: + turn_index = int(stem_parts[1]) + except ValueError: + turn_index = -1 + role_order = {"user": 0, "assistant": 1}.get(stem_parts[2], 2) + + return turn_index, role_order, path.name + + +def simulation_audio_paths_for_row(run_directory: Path, row: pd.Series) -> list[Path]: + simulation_id = simulation_id_for_row(row) + if not simulation_id: + return [] + + audio_dir = resolve_run_relative_path(run_directory, Path("audio") / simulation_id) + if audio_dir is None or not audio_dir.exists(): + return [] + + return sorted(audio_dir.glob("*.wav"), key=_audio_sort_key) + + +def run_viewer_table_key(harness: str, selected_path: Path) -> str: + return f"run_viewer_table_{harness}_{path_state_key_fragment(selected_path)}" + + +def normalize_table_revision(candidate: object) -> int: + if isinstance(candidate, int) and candidate >= 0: + return candidate + return 0 + + +def run_viewer_active_row_key(harness: str, selected_path: Path) -> str: + return f"run_viewer_active_row_{harness}_{path_state_key_fragment(selected_path)}" + + +def run_viewer_table_selection_key(harness: str, selected_path: Path) -> str: + return ( + f"run_viewer_table_selection_{harness}_{path_state_key_fragment(selected_path)}" + ) + + +def run_viewer_table_revision_key(harness: str, selected_path: Path) -> str: + return ( + 
f"run_viewer_table_revision_{harness}_{path_state_key_fragment(selected_path)}" + ) + + +def run_viewer_table_widget_key( + harness: str, selected_path: Path, table_revision: int +) -> str: + return f"{run_viewer_table_key(harness, selected_path)}_{table_revision}" + + +def refresh_run_viewer_table_state( + harness: str, + selected_path: Path, + selected_index: int | None, +) -> None: + selection_key = run_viewer_table_selection_key(harness, selected_path) + revision_key = run_viewer_table_revision_key(harness, selected_path) + st.session_state[selection_key] = selected_index + # Streamlit data editor edits are sticky, so bump the widget key to rehydrate + # the controlled checkbox state from the active row. + st.session_state[revision_key] = ( + normalize_table_revision(st.session_state.get(revision_key)) + 1 + ) + + +def render_row_text_review(row: pd.Series) -> None: + input_text = row_text_value(row, ("user_text", "input_text")) + output_text = row_text_value( + row, + ("assistant_text", "output_text", "response_text"), + ) + + st.caption("Input text") + if input_text is None: + st.info("No input text recorded for this row.") + else: + st.code(input_text, language="text") + + st.caption("Output text") + if output_text is None: + st.info("No output text recorded for this row.") + else: + st.code(output_text, language="text") + + +def render_example_viewer(run_directory: Path, row: pd.Series) -> None: + example_id = str(row.get("example_id", "")).strip() or "Unknown example" + + st.subheader(example_id) + render_row_text_review(row) + + input_audio_path = input_audio_path_for_row(run_directory, row) + output_audio_path = output_audio_path_for_row(run_directory, row) + event_log_path = event_log_path_for_row(run_directory, row) + + st.caption("Input audio") + input_audio_bytes = audio_bytes_if_present(input_audio_path) + if input_audio_bytes is None: + st.info("No input audio found for this example.") + else: + st.audio(input_audio_bytes, format="audio/wav") + 
+        st.caption(str(input_audio_path))
+
+    st.caption("Output audio")
+    output_audio_bytes = audio_bytes_if_present(output_audio_path)
+    if output_audio_bytes is None:
+        st.info("No output audio found for this example.")
+    else:
+        st.audio(output_audio_bytes, format="audio/wav")
+        st.caption(str(output_audio_path))
+
+    st.caption("Events")
+    event_log_text = read_text_if_present(event_log_path)
+    if event_log_text is None:
+        st.info("No event log found for this example.")
+    else:
+        st.code(event_log_text, language="json")
+        st.caption(str(event_log_path))
+
+
+def render_simulation_viewer(run_directory: Path, row: pd.Series) -> None:
+    simulation_id = simulation_id_for_row(row) or "Unknown simulation"
+    turn_index = str(row.get("turn_index", "")).strip()
+
+    st.subheader(simulation_id)
+    if turn_index:
+        st.caption(f"Selected row turn: {turn_index}")
+    render_row_text_review(row)
+
+    transcript_path = simulation_transcript_path_for_row(run_directory, row)
+    event_log_path = simulation_event_log_path_for_row(run_directory, row)
+    audio_paths = simulation_audio_paths_for_row(run_directory, row)
+
+    transcript_tab, events_tab, audio_tab = st.tabs(
+        ["Transcript", "Events", "Audio by turn"]
+    )
+
+    with transcript_tab:
+        transcript_text = read_text_if_present(transcript_path)
+        if transcript_text is None:
+            st.info("No transcript found for this simulation.")
+        else:
+            st.code(transcript_text, language="text")
+            st.caption(str(transcript_path))
+
+    with events_tab:
+        event_log_text = read_text_if_present(event_log_path)
+        if event_log_text is None:
+            st.info("No event log found for this simulation.")
+        else:
+            st.code(event_log_text, language="json")
+            st.caption(str(event_log_path))
+
+    with audio_tab:
+        if not audio_paths:
+            st.info("No turn audio files found for this simulation.")
+        else:
+            for audio_path in audio_paths:
+                st.caption(audio_path.name)
+                audio_bytes = audio_bytes_if_present(audio_path)
+                if audio_bytes is None:
+                    st.info("Could not load this audio file.")
+                    continue
+                st.audio(audio_bytes, format="audio/wav")
+                st.caption(str(audio_path))
+
+
+def render_comparison_view() -> None:
+    st.title("Comparison View")
+    st.caption("Compare metrics across one or more saved realtime eval runs.")
+
+    harness, selected_paths = render_comparison_config_bar()
+
+    st.divider()
+
+    if not selected_paths:
+        st.info(
+            "Choose one or more result directories from the Results browser to continue."
+        )
+        return
+
+    summaries, load_errors = load_selected_summaries(selected_paths)
+    for error_message in load_errors:
+        st.warning(error_message)
+
+    if not summaries:
+        st.error(
+            "No valid `summary.json` files could be loaded from the selected runs."
+        )
+        return
+
+    st.subheader("Selected Result Directories")
+    selected_runs_text = "\n".join(
+        f"{index}. {display_run_label(path)}"
+        for index, path in enumerate(selected_paths, start=1)
+    )
+    st.code(selected_runs_text, language="text")
+
+    score_keys, latency_config, token_config = selected_metric_controls(
+        harness, summaries
+    )
+
+    st.subheader("Summary Table")
+    summary_table = build_summary_table(
+        summaries,
+        selected_score_keys=score_keys,
+        latency_metric_config=latency_config,
+        token_metric_config=token_config,
+    )
+    if summary_table.empty:
+        st.info("No table columns are currently selected.")
+    else:
+        st.dataframe(summary_table, use_container_width=True, hide_index=True)
+
+    st.subheader("Overall Scores (higher is better)")
+    render_grouped_bar_chart(build_score_chart_frame(summaries, score_keys))
+
+    st.subheader("Latency (lower is better)")
+    render_percentile_ladder_chart(
+        build_percentile_chart_frame(summaries, latency_config),
+        y_title="Latency (ms)",
+    )
+
+    st.subheader("Token Consumption (lower is better)")
+    render_percentile_ladder_chart(
+        build_percentile_chart_frame(summaries, token_config),
+        y_title="Tokens",
+    )
+
+
+def render_run_viewer() -> None:
+    st.title("Run Viewer")
+    st.caption(
+        "Inspect one saved run and drill into example- or simulation-level artifacts."
+    )
+
+    harness, selected_path = render_run_viewer_config_bar()
+
+    st.divider()
+
+    if selected_path is None:
+        st.info("Choose a result directory for the selected harness to continue.")
+        return
+
+    results, load_error = load_results_frame(selected_path)
+    if load_error is not None:
+        st.error(load_error)
+        return
+    if results is None or results.empty:
+        st.info("This run does not contain any rows in `results.csv`.")
+        return
+
+    st.caption(f"Viewing `{display_run_label(selected_path)}`")
+
+    active_row_key = run_viewer_active_row_key(harness, selected_path)
+    table_selection_key = run_viewer_table_selection_key(harness, selected_path)
+    table_revision_key = run_viewer_table_revision_key(harness, selected_path)
+    table_revision = normalize_table_revision(
+        st.session_state.get(table_revision_key)
+    )
+    previous_table_selected_index = normalize_row_index(
+        st.session_state.get(table_selection_key),
+        len(results),
+    )
+    render_active_row_index = resolve_active_row_index(
+        previous_table_selected_index,
+        previous_table_selected_index,
+        st.session_state.get(active_row_key),
+        len(results),
+    )
+
+    table_column, detail_column = st.columns([1.45, 1], gap="large")
+
+    with table_column:
+        st.subheader("Examples" if harness != "run" else "Simulation rows")
+        table_widget_key = run_viewer_table_widget_key(
+            harness,
+            selected_path,
+            table_revision,
+        )
+        selection = st.data_editor(
+            build_run_viewer_table(results, render_active_row_index),
+            height=run_viewer_table_height(len(results)),
+            use_container_width=True,
+            hide_index=True,
+            column_config={
+                RUN_VIEWER_SELECTED_COLUMN: st.column_config.CheckboxColumn(
+                    " ",
+                    width="small",
+                )
+            },
+            disabled=list(results.columns),
+            row_height=RUN_VIEWER_TABLE_ROW_HEIGHT,
+            key=table_widget_key,
+        )
+
+    with detail_column:
+        st.subheader("Example Viewer" if harness != "run" else "Simulation Viewer")
+        table_selected_index = resolve_run_viewer_table_selected_index(
+            selection,
+            st.session_state.get(table_widget_key),
+            render_active_row_index,
+        )
+        active_row_index = resolve_active_row_index(
+            table_selected_index,
+            previous_table_selected_index,
+            st.session_state.get(active_row_key),
+            len(results),
+        )
+        st.session_state[table_selection_key] = table_selected_index
+        st.session_state[active_row_key] = active_row_index
+        if active_row_index is None:
+            st.info("No datapoints are available to inspect.")
+            return
+        if not run_viewer_table_matches_active_row(selection, active_row_index):
+            refresh_run_viewer_table_state(harness, selected_path, active_row_index)
+            st.rerun()
+
+        nav_previous_column, nav_status_column, nav_next_column = st.columns(
+            [1, 1.2, 1]
+        )
+        with nav_previous_column:
+            if st.button(
+                "Previous",
+                key=f"{active_row_key}_previous",
+                disabled=active_row_index <= 0,
+                use_container_width=True,
+            ):
+                previous_row_index = active_row_index - 1
+                st.session_state[active_row_key] = previous_row_index
+                refresh_run_viewer_table_state(
+                    harness,
+                    selected_path,
+                    previous_row_index,
+                )
+                st.rerun()
+        with nav_status_column:
+            st.caption(f"Datapoint {active_row_index + 1} of {len(results)}")
+        with nav_next_column:
+            if st.button(
+                "Next",
+                key=f"{active_row_key}_next",
+                disabled=active_row_index >= len(results) - 1,
+                use_container_width=True,
+            ):
+                next_row_index = active_row_index + 1
+                st.session_state[active_row_key] = next_row_index
+                refresh_run_viewer_table_state(
+                    harness,
+                    selected_path,
+                    next_row_index,
+                )
+                st.rerun()
+
+        selected_row = results.iloc[active_row_index]
+        if harness == "run":
+            render_simulation_viewer(selected_path, selected_row)
+        else:
+            render_example_viewer(selected_path, selected_row)
+
+
+def main() -> None:
+    st.set_page_config(
+        page_title="Realtime Results Viewer",
+        page_icon=":bar_chart:",
+        layout="wide",
+    )
+    load_css()
+
+    navigation = st.navigation(
+        [
+            st.Page(render_comparison_view, title="Comparison View", default=True),
+            st.Page(render_run_viewer, title="Run Viewer"),
+        ]
+    )
+    navigation.run()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/evals/realtime_evals/results_viewer/config.py b/examples/evals/realtime_evals/results_viewer/config.py
new file mode 100644
index 0000000000..b556ad62e8
--- /dev/null
+++ b/examples/evals/realtime_evals/results_viewer/config.py
@@ -0,0 +1,82 @@
+from __future__ import annotations
+
+DEFAULT_SCORE_KEYS = [
+    "grade_mean",
+    "tool_call_correctness_mean",
+    "tool_call_arg_correctness_mean",
+]
+
+DEFAULT_LATENCY_SERIES = [
+    "First audio latency (ms)",
+    "First text latency (ms)",
+    "Response done latency (ms)",
+]
+
+DEFAULT_TOKEN_SERIES = [
+    "Output tokens",
+    "Output audio tokens",
+    "Output text tokens",
+]
+
+LATENCY_CHART_KEYS = {
+    "First audio latency (ms)": [
+        "latency_first_audio_ms_avg",
+        "latency_first_audio_ms_p50",
+        "latency_first_audio_ms_p95",
+        "latency_first_audio_ms_p99",
+    ],
+    "First text latency (ms)": [
+        "latency_first_text_ms_avg",
+        "latency_first_text_ms_p50",
+        "latency_first_text_ms_p95",
+        "latency_first_text_ms_p99",
+    ],
+    "Response done latency (ms)": [
+        "latency_response_done_ms_avg",
+        "latency_response_done_ms_p50",
+        "latency_response_done_ms_p95",
+        "latency_response_done_ms_p99",
+    ],
+}
+
+TOKEN_CHART_KEYS = {
+    "Output tokens": [
+        "output_tokens_avg",
+        "output_tokens_p50",
+        "output_tokens_p95",
+        "output_tokens_p99",
+    ],
+    "Output audio tokens": [
+        "output_audio_tokens_avg",
+        "output_audio_tokens_p50",
+        "output_audio_tokens_p95",
+        "output_audio_tokens_p99",
+    ],
+    "Output text tokens": [
+        "output_text_tokens_avg",
+        "output_text_tokens_p50",
+        "output_text_tokens_p95",
+        "output_text_tokens_p99",
+    ],
+}
+
+SCORE_KEY_LABELS = {
+    "grade_mean": "Grade",
+    "tool_call_correctness_mean": "Tool call correctness",
+    "tool_call_arg_correctness_mean": "Tool arg correctness",
+}
+
+PERCENTILE_LABELS = {
+    "avg": "avg",
+    "p50": "p50",
+    "p95": "p95",
+    "p99": "p99",
+}
+
+SUMMARY_TABLE_BASE_COLUMNS = [
+    "run_label",
+    "run_name",
+    "model",
+    "assistant_model_default",
+    "simulator_model_default",
+]
diff --git a/examples/evals/realtime_evals/results_viewer/styles.css b/examples/evals/realtime_evals/results_viewer/styles.css
new file mode 100644
index 0000000000..24c9b5a84b
--- /dev/null
+++ b/examples/evals/realtime_evals/results_viewer/styles.css
@@ -0,0 +1,236 @@
+:root {
+  color-scheme: light;
+  --bg: #f3f3ef;
+  --sidebar: #ecece6;
+  --surface: #fafaf7;
+  --surface-muted: #efefe9;
+  --text: #111111;
+  --text-soft: #2f2f2c;
+  --muted: #666660;
+  --border: rgba(17, 17, 17, 0.1);
+  --border-strong: rgba(17, 17, 17, 0.18);
+  --focus: 0 0 0 3px rgba(124, 141, 149, 0.18);
+  --radius: 14px;
+  --max-width: 1480px;
+}
+
+html,
+body,
+[data-testid="stAppViewContainer"],
+[data-testid="stApp"] {
+  background: var(--bg);
+  color: var(--text);
+}
+
+[data-testid="stHeader"] {
+  background: transparent;
+  border-bottom: 0;
+}
+
+[data-testid="stMainBlockContainer"] {
+  max-width: var(--max-width);
+  padding: 2rem 1.5rem 4rem;
+}
+
+[data-testid="stSidebar"] {
+  background: var(--sidebar);
+  border-right: 1px solid var(--border);
+}
+
+[data-testid="stSidebar"] * {
+  color: var(--text);
+}
+
+[data-testid="stSidebarNav"] a {
+  border: 1px solid transparent;
+  border-radius: var(--radius);
+  transition:
+    background-color 150ms ease,
+    border-color 150ms ease;
+}
+
+[data-testid="stSidebarNav"] a:hover,
+[data-testid="stSidebarNav"] a[aria-current="page"] {
+  background: rgba(17, 17, 17, 0.04);
+  border-color: var(--border);
+}
+
+:where(h1, h2, h3, h4, h5, h6, label, strong, b) {
+  color: var(--text);
+  letter-spacing: -0.03em;
+}
+
+:where(p, li, [data-testid="stText"], [data-testid="stException"] p) {
+  color: var(--text-soft);
+  line-height: 1.58;
+}
+
+:where(.stCaption, [data-testid="stCaptionContainer"], [data-testid="stMetricLabel"]) {
+  color: var(--muted);
+}
+
+a {
+  color: var(--text-soft);
+  text-decoration-color: rgba(17, 17, 17, 0.18);
+}
+
+a:hover {
+  color: var(--text);
+  text-decoration-color: rgba(17, 17, 17, 0.3);
+}
+
+code,
+pre {
+  font-family: "SFMono-Regular", ui-monospace, "Cascadia Code", "Source Code Pro", Menlo, Monaco, Consolas, monospace;
+}
+
+::selection {
+  background: rgba(185, 207, 219, 0.44);
+  color: var(--text);
+}
+
+:focus-visible {
+  border-radius: 10px;
+  box-shadow: var(--focus);
+  outline: none;
+}
+
+input[type="checkbox"],
+input[type="radio"] {
+  accent-color: #5a5b55;
+}
+
+:is(
+  [data-testid="stMetric"],
+  [data-testid="stDataFrame"],
+  [data-testid="stAlert"],
+  [data-testid="stVegaLiteChart"],
+  [data-testid="stCodeBlock"],
+  [data-testid="stAudio"],
+  [data-testid="stExpander"],
+  table
+) {
+  background: var(--surface);
+  border: 1px solid var(--border);
+  border-radius: var(--radius);
+  box-shadow: none;
+}
+
+[data-testid="stVegaLiteChart"] {
+  display: flex;
+  min-height: 300px;
+  padding: 0.65rem;
+}
+
+[data-testid="stVegaLiteChart"] > div {
+  flex: 1 1 auto;
+  min-height: calc(300px - 1.3rem);
+  overflow: hidden;
+  border-radius: calc(var(--radius) - 2px);
+}
+
+:is(
+  [data-testid="stButton"] button,
+  button[kind],
+  button[data-baseweb="tab"],
+  [data-baseweb="input"] > div,
+  [data-baseweb="base-input"] > div,
+  [data-baseweb="select"] > div,
+  [data-baseweb="popover"] [role="listbox"]
+) {
+  background: var(--surface);
+  border: 1px solid var(--border);
+  border-radius: var(--radius);
+  box-shadow: none;
+  color: var(--text);
+}
+
+:is([data-testid="stButton"] button, button[kind], button[data-baseweb="tab"]):hover {
+  background: var(--surface-muted);
+  border-color: var(--border-strong);
+}
+
+button[data-baseweb="tab"][aria-selected="true"] {
+  background: var(--surface-muted);
+  border-color: var(--border);
+  color: var(--text);
+}
+
+:is(
+  [data-baseweb="input"] > div,
+  [data-baseweb="base-input"] > div,
+  [data-baseweb="select"] > div
+):focus-within {
+  border-color: var(--border-strong);
+  box-shadow: var(--focus);
+}
+
+:where(
+  textarea,
+  input,
+  [data-baseweb="select"] input,
+  [data-baseweb="select"] *,
+  [data-baseweb="popover"] [role="option"] *
+) {
+  color: var(--text) !important;
+}
+
+:where(
+  [data-baseweb="select"] input::placeholder,
+  [data-baseweb="input"] input::placeholder,
+  [data-baseweb="base-input"] input::placeholder,
+  textarea::placeholder
+) {
+  color: #8b8b86 !important;
+  -webkit-text-fill-color: #8b8b86;
+  opacity: 1;
+}
+
+[data-baseweb="popover"] [role="option"][aria-selected="true"],
+[data-baseweb="popover"] [role="option"]:hover {
+  background: rgba(17, 17, 17, 0.05) !important;
+}
+
+[data-testid="stDataFrame"] > div,
+[data-testid="stDataFrameResizable"],
+[data-testid="stDataFrameGlideDataEditor"],
+[data-testid="stDataFrameGlideDataEditor"] .dvn-stack,
+[data-testid="stDataFrameGlideDataEditor"] canvas {
+  background-color: var(--surface);
+}
+
+table {
+  width: 100%;
+  border-collapse: collapse;
+}
+
+th {
+  background: var(--surface-muted);
+  color: var(--muted);
+}
+
+th,
+td {
+  border-bottom: 1px solid rgba(17, 17, 17, 0.06);
+}
+
+.comparison-selection-status {
+  margin: 0 0 0.55rem;
+  color: var(--muted);
+  font-size: 0.9rem;
+  font-weight: 560;
+  line-height: 1.2;
+  text-align: right;
+  white-space: nowrap;
+}
+
+.comparison-selection-status__count {
+  color: var(--text);
+  font-weight: 700;
+}
+
+@media (max-width: 900px) {
+  [data-testid="stMainBlockContainer"] {
+    padding: 1.2rem 1rem 2.4rem;
+  }
+}
diff --git a/examples/evals/realtime_evals/results_viewer/ui.py b/examples/evals/realtime_evals/results_viewer/ui.py
new file mode 100644
index 0000000000..fb600dcb14
--- /dev/null
+++ b/examples/evals/realtime_evals/results_viewer/ui.py
@@ -0,0 +1,67 @@
+from __future__ import annotations
+
+from collections.abc import Callable
+from pathlib import Path
+
+import streamlit as st
+
+
+def _missing_script_run_ctx(*, suppress_warning: bool = False) -> None:
+    del suppress_warning
+    return None
+
+
+_get_script_run_ctx: Callable[..., object | None]
+try:
+    from streamlit.runtime.scriptrunner import get_script_run_ctx as _get_script_run_ctx
+except ImportError:
+    _get_script_run_ctx = _missing_script_run_ctx
+
+
+UI_DIR = Path(__file__).resolve().parent
+
+
+def _resolve_css_path(path: str | Path) -> Path:
+    css_path = Path(path)
+    if css_path.is_absolute():
+        return css_path
+
+    module_relative_path = (UI_DIR / css_path).resolve()
+    if module_relative_path.exists():
+        return module_relative_path
+
+    return (Path.cwd() / css_path).resolve()
+
+
+def _current_run_token() -> str | None:
+    ctx = _get_script_run_ctx(suppress_warning=True)
+    if ctx is None:
+        return None
+
+    session_id = getattr(ctx, "session_id", None)
+    page_script_hash = getattr(ctx, "page_script_hash", None)
+    cursors = getattr(ctx, "cursors", None)
+    if session_id is None or page_script_hash is None or cursors is None:
+        return None
+
+    # `ctx.cursors` is recreated for each rerun, which lets us preserve the
+    # "no duplicate injection in one run" behavior without breaking styling on
+    # refresh or widget-triggered reruns.
+    return f"{session_id}:{page_script_hash}:{id(cursors)}"
+
+
+def load_css(path: str | Path = "styles.css") -> None:
+    css_path = _resolve_css_path(path)
+    if not css_path.exists():
+        raise FileNotFoundError(f"CSS file not found: {css_path}")
+
+    state_key = f"_results_viewer_css_loaded::{css_path}"
+    run_token = _current_run_token()
+    if run_token is not None and st.session_state.get(state_key) == run_token:
+        return
+
+    # `st.html()` is Streamlit's supported path for injecting a local CSS file.
+    # When given a CSS path, Streamlit wraps the file in