Prerequisites
- An Auriko API key
- Python 3.10+ with the auriko SDK installed (pip install auriko)
- OR Node.js 18+ with @auriko/sdk installed (npm install @auriko/sdk)
How it works
Auriko intercepts outgoing requests and injects caching directives when a conversation exceeds the provider's token threshold. You send requests normally — caching happens transparently. When a subsequent request shares the same prompt prefix, the provider serves the cached portion at a reduced cost and lower latency.

See it in action
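A minimal chat request body might look like this (the payload shape and model name are illustrative, not prescribed by the SDK). Note that it carries no caching fields; Auriko adds the provider's directives in transit:

```json
{
  "model": "claude-sonnet-4-5",
  "messages": [
    {"role": "user", "content": "Summarize our previous discussion."}
  ]
}
```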
You send a normal request and caching is automatic; no extra parameters are needed.

Provider support
Auriko injects caching directives for four providers. Each uses a different mechanism:

| Provider | User action | Auriko behavior |
|---|---|---|
| Anthropic | None — automatic | Injects cache_control: {type: "ephemeral"} when estimated tokens exceed a per-model threshold (1024–4096 tokens depending on model). Skips if user already added cache_control blocks. |
| OpenAI | None — automatic | Injects prompt_cache_key for server affinity on all requests with a conversation ID. Adds prompt_cache_retention: "24h" for supported models (gpt-5, gpt-5.1, gpt-5.2, gpt-4.1 families). Skips retention for zdr data policy. |
| Fireworks | None — automatic | Sets user field to conversation ID for same-replica routing and KV-cache reuse. |
| xAI | None — automatic | Injects x-grok-conv-id header (UUID4 derived from conversation ID) for session affinity. |
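The OpenAI, Fireworks, and xAI rows above can be sketched as a payload transformation. This is an illustrative simulation of the injection logic described in the table, not Auriko's actual source, and the xAI UUID derivation scheme is an assumption (the table only says the UUID is derived from the conversation ID):

```python
import uuid

# Illustrative simulation of Auriko's per-provider injection (not actual source).
def inject_caching(provider: str, payload: dict, headers: dict, conv_id: str) -> None:
    if provider == "openai":
        # Server affinity key; retention handling omitted for brevity.
        payload["prompt_cache_key"] = conv_id
    elif provider == "fireworks":
        # Same-replica routing for KV-cache reuse.
        payload["user"] = conv_id
    elif provider == "xai":
        # Deterministic UUID from the conversation ID (derivation scheme is
        # an assumption; the docs only say it is derived from the conv ID).
        headers["x-grok-conv-id"] = str(uuid.uuid5(uuid.NAMESPACE_URL, conv_id))

payload, headers = {"model": "grok-4", "messages": []}, {}
inject_caching("xai", payload, headers, "conv-123")
print(headers["x-grok-conv-id"])  # stable across calls with the same conv ID
```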
Anthropic token thresholds
Caching is activated only when the estimated prompt token count exceeds the model-specific threshold:

| Model family | Threshold |
|---|---|
| claude-sonnet-4-5, claude-sonnet-4, claude-opus-4, claude-opus-4-1, claude-3-7-sonnet | 1024 tokens |
| claude-sonnet-4-6, claude-3-5-haiku, claude-3-haiku | 2048 tokens |
| claude-haiku-4-5, claude-opus-4-5, claude-opus-4-6 | 4096 tokens |
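As a sketch, the threshold check might look like the following. The threshold values come from the table above; the roughly-four-characters-per-token estimate and the 4096-token fallback for unknown models are assumptions, not Auriko's actual estimator:

```python
# Anthropic thresholds from the table above.
THRESHOLDS = {
    "claude-sonnet-4-5": 1024, "claude-sonnet-4": 1024, "claude-opus-4": 1024,
    "claude-opus-4-1": 1024, "claude-3-7-sonnet": 1024,
    "claude-sonnet-4-6": 2048, "claude-3-5-haiku": 2048, "claude-3-haiku": 2048,
    "claude-haiku-4-5": 4096, "claude-opus-4-5": 4096, "claude-opus-4-6": 4096,
}

def should_cache(model: str, prompt_text: str) -> bool:
    # ~4 chars/token is a rough heuristic; unknown models fall back to the
    # most conservative threshold (both choices are assumptions).
    est_tokens = len(prompt_text) // 4
    return est_tokens > THRESHOLDS.get(model, 4096)

print(should_cache("claude-sonnet-4-5", "x" * 8000))  # True  (~2000 tokens > 1024)
print(should_cache("claude-3-haiku", "x" * 8000))     # False (~2000 tokens < 2048)
```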
Manual cache control
If you need fine-grained control (for example, caching a specific system prompt block), add cache_control blocks manually. When Auriko detects any existing cache_control in the request, it skips auto-injection entirely.
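For example, a request that manually caches a long system block, using Anthropic's content-block format, together with a sketch of the skip rule described above (the detection helper is illustrative, not Auriko's implementation):

```python
# A request with a manually cached system block (Anthropic block format).
request = {
    "model": "claude-sonnet-4-5",
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. <long policy text here>",
            "cache_control": {"type": "ephemeral"},  # cache this block
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}

def has_manual_cache_control(req: dict) -> bool:
    # Illustrative sketch of the skip rule: scan system and message content
    # blocks for any existing cache_control marker.
    blocks = list(req.get("system", []))
    for msg in req.get("messages", []):
        if isinstance(msg.get("content"), list):
            blocks += msg["content"]
    return any(isinstance(b, dict) and "cache_control" in b for b in blocks)

print(has_manual_cache_control(request))  # True -> Auriko skips auto-injection
```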
Check cache usage
Cache hit information appears in the response's usage object:
The cached_tokens field shows how many prompt tokens were served from cache.
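Reading it back might look like this. The response shape follows the description above (a cached_tokens field inside usage); the field values are made up for illustration:

```python
# Example response dict; values are illustrative.
response = {
    "usage": {"prompt_tokens": 5000, "completion_tokens": 120, "cached_tokens": 4096}
}

usage = response["usage"]
hit_ratio = usage["cached_tokens"] / usage["prompt_tokens"]
print(f"{usage['cached_tokens']} of {usage['prompt_tokens']} prompt tokens "
      f"served from cache ({hit_ratio:.0%})")
```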
When to use
Caching works best with:
- Multi-turn conversations — the shared conversation prefix grows with each turn
- Long system prompts — reused across many requests
- Few-shot examples — static example blocks cached across calls

Caching provides little or no benefit for:
- Unique prompts — no shared prefix to cache
- Very short prompts — below provider token thresholds
- Single-turn requests — no subsequent requests to benefit from the cache