Prompt caching (reusing previously processed prompt tokens instead of reprocessing them) reduces cost and latency on repeated requests. By default, Auriko handles cache optimization automatically. For fine-grained control, you can specify cache control manually.Documentation Index
Fetch the complete documentation index at: https://docs.auriko.ai/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
- An Auriko API key
- Python 3.10+ with the OpenAI SDK (
pip install openai) or the auriko SDK (pip install auriko)- OR Node.js 18+ with the OpenAI SDK (
npm install openai) or@auriko/sdk(npm install @auriko/sdk)
- OR Node.js 18+ with the OpenAI SDK (
How it works
Auriko optimizes caching for each provider when your request includes reusable prompt content. On subsequent requests sharing the same prompt prefix, the provider serves cached tokens at reduced cost and lower latency. Auriko accounts for each provider’s caching economics (token thresholds, discount depths, and read/write prices) when choosing where to route. Over time, the system learns your usage patterns to improve estimation accuracy. Create separate workspaces for different use cases to get better predictions. Auriko is a zero data retention proxy. Your prompts, responses, and content are never read, logged, or stored. Pattern calibration uses usage metadata only. Read the Privacy Policy for details.Send a cached request
Send a request with a reusable system prompt:Override caching per provider
Auriko handles caching automatically for supported providers. For explicit control, each provider accepts specific fields. When you supply one, Auriko skips automatic injection and uses your value.| Provider | Field | Effect |
|---|---|---|
| Anthropic | cache_control: {"type": "ephemeral"} on content blocks | Marks specific content for caching |
| OpenAI | prompt_cache_key (string) | Improves cache hit rate for repeated conversations |
| OpenAI | prompt_cache_retention: "24h" | Extends cache lifetime to 24 hours |
| Fireworks | user (string) | Improves cache reuse across conversation turns |
When you provide any of these fields, Auriko skips automatic cache injection for that provider.
Anthropic — cache_control
Add cache_control to content blocks to mark specific content for caching:
"ephemeral". This follows the provider’s default retention behavior. cache_control applies to Anthropic models only. For other providers, automatic optimization handles caching.
OpenAI — prompt_cache_key and prompt_cache_retention
prompt_cache_key improves cache hit rate for repeated conversations. prompt_cache_retention: "24h" extends the cache lifetime to 24 hours.
prompt_cache_retention is supported on gpt-4.1+ and gpt-5+ models only. It isn’t compatible with ZDR data policy. Omit it if your workspace uses ZDR.
Fireworks — user
On Fireworks, requests with the same user value benefit from improved cache reuse across conversation turns.
Check cache usage
For/v1/chat/completions responses, cache hit information appears in usage.prompt_tokens_details:
cached_tokens shows how many prompt tokens were served from cache. Auriko normalizes this field across all providers in the OpenAI-format response.
cache_creation_tokens shows how many tokens were written to prompt cache on this request. This field is populated for Anthropic models.
For /v1/messages responses, cache tokens appear as top-level usage fields:
input_tokens represents only the non-cached portion. Total input tokens = input_tokens + cache_read_input_tokens + cache_creation_input_tokens.
Check cache savings
Cache savings appear inrouting_metadata.cost when savings are greater than zero:
cache_savings_percent is an integer (0-100) showing the percentage saved compared to uncached cost. cache_savings_usd shows the dollar amount saved.
Check cache usage in streams
Cache metrics appear in the final streaming chunk alongsideusage and routing_metadata. See Streaming for details on consuming trailing chunks.
Improve cache hits
You can improve cache hit rates by structuring your requests for reuse.- Long, stable system prompts: Place reusable instructions in the system message. The prompt prefix is what providers cache.
- Few-shot examples: Static example blocks are reused across requests.
- Static before dynamic: Put content that doesn’t change before content that does.
- Multi-turn conversations: Shared prompt prefixes get better cache reuse across requests.
- Steady request cadence: Providers expire cached tokens after inactivity. Steady flow keeps entries warm.
Look up cache pricing
The model directory exposes cache pricing for every supported provider. Query it to seecache_read_price, cache_write_price, and supports_prompt_caching per model:
Troubleshoot
| Symptom | Fix |
|---|---|
cached_tokens always 0 (first request) | The first request creates the cache. Send a follow-up with the same prefix. |
cached_tokens always 0 (unsupported model) | Check supports_prompt_caching in the model directory. |
cached_tokens always 0 (unique prompts) | Caching requires a shared prefix. Add a reusable system prompt. |
cached_tokens always 0 (short prompt) | Your prompt may be below the provider’s minimum token threshold. Add more reusable content to the system message. |
| Lower-than-expected savings | Move static content before dynamic content in messages. |
| Lower-than-expected savings (gaps between requests) | Providers expire cached tokens after inactivity. Maintain steady request flow. |
cache_savings_percent not in response | The field appears only when savings are greater than zero. |
Resources
- Cost optimization — cache economics in routing
- Streaming — cache metrics in streaming responses
- Model directory — cache pricing and support per model
- Response metadata —
routing_metadata.costfields