Reduce costs and latency by reusing cached prompt prefixes across requests. Auriko automatically injects cache control directives for all supported providers — no user action needed.

Prerequisites

  • An Auriko API key
  • Python 3.10+ with auriko SDK installed (pip install auriko)
    • OR Node.js 18+ with @auriko/sdk installed (npm install @auriko/sdk)

How it works

Auriko intercepts outgoing requests and injects provider-specific caching directives when conversations exceed provider-specific token thresholds. You send requests normally — caching happens transparently. When a subsequent request shares the same prompt prefix, the provider serves the cached portion at a reduced cost and lower latency.
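The prefix-matching idea can be illustrated with a small sketch (plain Python, no Auriko calls; serializing messages to JSON is a simplification of how providers actually key their caches):

```python
import json

def shared_prefix_chars(messages_a, messages_b):
    """Length of the common serialized prefix between two message lists.

    Providers key their caches on the exact request prefix, so the first
    point of divergence ends the cacheable region.
    """
    a = json.dumps(messages_a)
    b = json.dumps(messages_b)
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

system = {"role": "system", "content": "You are a helpful coding assistant."}
turn1 = [system, {"role": "user", "content": "Explain async/await in Python."}]
turn2 = turn1 + [
    {"role": "assistant", "content": "async/await lets coroutines yield control..."},
    {"role": "user", "content": "Show an example with asyncio.gather."},
]

# turn2 extends turn1, so nearly all of turn1's serialization is shared
# and can be served from cache on the second request.
print(shared_prefix_chars(turn1, turn2))
```

This is why multi-turn conversations benefit most: each turn appends to the shared prefix rather than replacing it.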

See it in action

Send a normal request — caching is automatic:
import os
from auriko import Client

client = Client(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant..."},
        {"role": "user", "content": "Explain async/await in Python."}
    ]
)

# Check cache usage in the response
usage = response.usage
if hasattr(usage, "prompt_tokens_details") and usage.prompt_tokens_details:
    cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0)
    print(f"Cached tokens: {cached}")
print(f"Total prompt tokens: {usage.prompt_tokens}")

Provider support

Auriko injects caching directives for four providers. Each uses a different mechanism:
| Provider | User action | Auriko behavior |
| --- | --- | --- |
| Anthropic | None — automatic | Injects cache_control: {type: "ephemeral"} when estimated tokens exceed a per-model threshold (1024–4096 tokens depending on model). Skips if the user already added cache_control blocks. |
| OpenAI | None — automatic | Injects prompt_cache_key for server affinity on all requests with a conversation ID. Adds prompt_cache_retention: "24h" for supported models (gpt-5, gpt-5.1, gpt-5.2, gpt-4.1 families). Skips retention under a ZDR (zero data retention) data policy. |
| Fireworks | None — automatic | Sets the user field to the conversation ID for same-replica routing and KV-cache reuse. |
| xAI | None — automatic | Injects an x-grok-conv-id header (UUID4 derived from the conversation ID) for session affinity. |
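The non-Anthropic mechanisms all reduce to attaching one field or header to the outgoing request. A rough sketch follows; the function is hypothetical (not the Auriko SDK's internals), the uuid5 derivation is one plausible way to get a deterministic UUID from a conversation ID, and Anthropic's threshold-gated cache_control injection is covered in the next section:

```python
import uuid

def inject_directives(provider, conversation_id):
    """Illustrative per-provider directive injection.

    Returns (body_updates, header_updates) to merge into the request.
    Field and header names follow the table above.
    """
    body, headers = {}, {}
    if provider == "openai":
        # Server-affinity key; retention handling omitted for brevity.
        body["prompt_cache_key"] = conversation_id
    elif provider == "fireworks":
        # Same-replica routing for KV-cache reuse.
        body["user"] = conversation_id
    elif provider == "xai":
        # Deterministic UUID from the conversation ID (assumption: uuid5,
        # so the same conversation always maps to the same session).
        headers["x-grok-conv-id"] = str(uuid.uuid5(uuid.NAMESPACE_OID, conversation_id))
    return body, headers

body, headers = inject_directives("openai", "conv-123")
print(body)  # {'prompt_cache_key': 'conv-123'}
```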

Anthropic token thresholds

Caching is activated only when the estimated prompt token count exceeds the model-specific threshold:

| Model family | Threshold |
| --- | --- |
| claude-sonnet-4-5, claude-sonnet-4, claude-opus-4, claude-opus-4-1, claude-3-7-sonnet | 1024 tokens |
| claude-sonnet-4-6, claude-3-5-haiku, claude-3-haiku | 2048 tokens |
| claude-haiku-4-5, claude-opus-4-5, claude-opus-4-6 | 4096 tokens |
Requests below the threshold skip auto-injection because Anthropic charges for cache writes on small requests.
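The lookup can be sketched as a longest-prefix match over the model name. The matching rule is an assumption about how dated model IDs like claude-sonnet-4-20250514 resolve to a family; the threshold values come from the table above:

```python
# Per-family token thresholds from the table above.
THRESHOLDS = {
    "claude-sonnet-4-5": 1024, "claude-sonnet-4": 1024,
    "claude-opus-4": 1024, "claude-opus-4-1": 1024, "claude-3-7-sonnet": 1024,
    "claude-sonnet-4-6": 2048, "claude-3-5-haiku": 2048, "claude-3-haiku": 2048,
    "claude-haiku-4-5": 4096, "claude-opus-4-5": 4096, "claude-opus-4-6": 4096,
}

def threshold_for(model, default=4096):
    """Longest matching family prefix wins, so claude-sonnet-4-20250514
    resolves to claude-sonnet-4 rather than claude-sonnet-4-5."""
    matches = [fam for fam in THRESHOLDS if model.startswith(fam)]
    if not matches:
        return default
    return THRESHOLDS[max(matches, key=len)]

def should_inject(model, estimated_tokens):
    # Treating the threshold as a minimum cacheable size; whether the
    # comparison is strict is an implementation detail.
    return estimated_tokens >= threshold_for(model)
```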

Manual cache control

If you need fine-grained control (for example, caching a specific system prompt block), add cache_control blocks manually. When Auriko detects any existing cache_control in the request, it skips auto-injection entirely.
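For example, an Anthropic-style system block with an explicit cache_control marker might look like the following. The detection helper is a hypothetical mirror of the skip behavior described above, not Auriko's actual code:

```python
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful coding assistant. (Long style guide here.)",
                # Explicit marker: cache the prefix up to and including this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "Explain async/await in Python."},
]

def has_manual_cache_control(messages):
    """True if any content block already carries cache_control,
    in which case auto-injection is skipped entirely."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            if any("cache_control" in block
                   for block in content if isinstance(block, dict)):
                return True
    return False

print(has_manual_cache_control(messages))  # True
```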

Check cache usage

Cache hit information appears in the response usage object:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    }
  }
}
The cached_tokens field shows how many prompt tokens were served from cache.
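A small helper can turn this into a cache hit rate. Field names follow the usage object above; responses without prompt_tokens_details simply report zero:

```python
def cache_hit_rate(usage):
    """Fraction of prompt tokens served from cache, from a usage dict."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {"cached_tokens": 1200},
}
print(f"{cache_hit_rate(usage):.0%}")  # 80%
```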

When to use

Caching works best with:
  • Multi-turn conversations — the shared conversation prefix grows with each turn
  • Long system prompts — reused across many requests
  • Few-shot examples — static example blocks cached across calls
Caching provides minimal benefit for:
  • Unique prompts — no shared prefix to cache
  • Very short prompts — below provider token thresholds
  • Single-turn requests — no subsequent requests to benefit from the cache