Reduce costs and latency by reusing cached prompt prefixes across requests. Auriko automatically injects cache control directives for all supported providers — no user action needed.

Prerequisites

  • An Auriko API key
  • Python 3.10+ with auriko SDK installed (pip install auriko)
    • OR Node.js 18+ with @auriko/sdk installed (npm install @auriko/sdk)

How it works

Auriko intercepts outgoing requests and injects provider-specific caching directives when conversations exceed provider-specific token thresholds. You send requests normally — caching happens transparently. When a subsequent request shares the same prompt prefix, the provider serves the cached portion at a reduced cost and lower latency.
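The prefix-matching idea can be illustrated with a small sketch (plain Python, no Auriko calls; serializing messages to JSON is a simplification of how providers actually key their caches):

```python
import json

def shared_prefix_chars(messages_a, messages_b):
    """Length of the common serialized prefix between two message lists.

    Providers key their caches on the exact request prefix, so the first
    point of divergence ends the cacheable region.
    """
    a = json.dumps(messages_a)
    b = json.dumps(messages_b)
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

system = {"role": "system", "content": "You are a helpful coding assistant."}
turn1 = [system, {"role": "user", "content": "Explain async/await in Python."}]
turn2 = turn1 + [
    {"role": "assistant", "content": "async/await lets coroutines yield control..."},
    {"role": "user", "content": "Show an example with asyncio.gather."},
]

# turn2 extends turn1, so nearly all of turn1's serialization is shared
# and can be served from cache on the second request.
print(shared_prefix_chars(turn1, turn2))
```

This is why multi-turn conversations benefit most: each turn appends to the shared prefix rather than replacing it.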

See it in action

Send a normal request — caching is automatic:
import os
from auriko import Client

client = Client(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant..."},
        {"role": "user", "content": "Explain async/await in Python."}
    ]
)

# Check cache usage in the response
usage = response.usage
if hasattr(usage, "prompt_tokens_details") and usage.prompt_tokens_details:
    cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0)
    print(f"Cached tokens: {cached}")
print(f"Total prompt tokens: {usage.prompt_tokens}")

Provider support

Auriko injects caching directives for four providers. Each uses a different mechanism:
| Provider | User action | Auriko behavior |
| --- | --- | --- |
| Anthropic | None — automatic | Injects cache_control: {type: "ephemeral"} when estimated tokens exceed a per-model threshold (1024–4096 tokens depending on model). Skips if the user already added cache_control blocks. |
| OpenAI | None — automatic | Injects prompt_cache_key for server affinity on all requests with a conversation ID. Adds prompt_cache_retention: "24h" for supported models (gpt-5, gpt-5.1, gpt-5.2, gpt-4.1 families). Skips retention under a ZDR (zero data retention) data policy. |
| Fireworks | None — automatic | Sets the user field to the conversation ID for same-replica routing and KV-cache reuse. |
| xAI | None — automatic | Injects an x-grok-conv-id header (UUID4 derived from the conversation ID) for session affinity. |
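The non-Anthropic mechanisms all reduce to attaching one field or header to the outgoing request. A rough sketch follows; the function is hypothetical (not the Auriko SDK's internals), the uuid5 derivation is one plausible way to get a deterministic UUID from a conversation ID, and Anthropic's threshold-gated cache_control injection is covered in the next section:

```python
import uuid

def inject_directives(provider, conversation_id):
    """Illustrative per-provider directive injection.

    Returns (body_updates, header_updates) to merge into the request.
    Field and header names follow the table above.
    """
    body, headers = {}, {}
    if provider == "openai":
        # Server-affinity key; retention handling omitted for brevity.
        body["prompt_cache_key"] = conversation_id
    elif provider == "fireworks":
        # Same-replica routing for KV-cache reuse.
        body["user"] = conversation_id
    elif provider == "xai":
        # Deterministic UUID from the conversation ID (assumption: uuid5,
        # so the same conversation always maps to the same session).
        headers["x-grok-conv-id"] = str(uuid.uuid5(uuid.NAMESPACE_OID, conversation_id))
    return body, headers

body, headers = inject_directives("openai", "conv-123")
print(body)  # {'prompt_cache_key': 'conv-123'}
```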

Anthropic token thresholds

Caching is activated only when the estimated prompt token count exceeds the model-specific threshold:

| Model family | Threshold |
| --- | --- |
| claude-sonnet-4-5, claude-sonnet-4, claude-opus-4, claude-opus-4-1, claude-3-7-sonnet | 1024 tokens |
| claude-sonnet-4-6, claude-3-5-haiku, claude-3-haiku | 2048 tokens |
| claude-haiku-4-5, claude-opus-4-5, claude-opus-4-6 | 4096 tokens |
Requests below the threshold skip auto-injection because Anthropic charges for cache writes on small requests.
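The lookup can be sketched as a longest-prefix match over the model name. The matching rule is an assumption about how dated model IDs like claude-sonnet-4-20250514 resolve to a family; the threshold values come from the table above:

```python
# Per-family token thresholds from the table above.
THRESHOLDS = {
    "claude-sonnet-4-5": 1024, "claude-sonnet-4": 1024,
    "claude-opus-4": 1024, "claude-opus-4-1": 1024, "claude-3-7-sonnet": 1024,
    "claude-sonnet-4-6": 2048, "claude-3-5-haiku": 2048, "claude-3-haiku": 2048,
    "claude-haiku-4-5": 4096, "claude-opus-4-5": 4096, "claude-opus-4-6": 4096,
}

def threshold_for(model, default=4096):
    """Longest matching family prefix wins, so claude-sonnet-4-20250514
    resolves to claude-sonnet-4 rather than claude-sonnet-4-5."""
    matches = [fam for fam in THRESHOLDS if model.startswith(fam)]
    if not matches:
        return default
    return THRESHOLDS[max(matches, key=len)]

def should_inject(model, estimated_tokens):
    # Treating the threshold as a minimum cacheable size; whether the
    # comparison is strict is an implementation detail.
    return estimated_tokens >= threshold_for(model)
```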

Manual cache control

If you need fine-grained control (for example, caching a specific system prompt block), add cache_control blocks manually. When Auriko detects any existing cache_control in the request, it skips auto-injection entirely.
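For example, an Anthropic-style system block with an explicit cache_control marker might look like the following. The detection helper is a hypothetical mirror of the skip behavior described above, not Auriko's actual code:

```python
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful coding assistant. (Long style guide here.)",
                # Explicit marker: cache the prefix up to and including this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "Explain async/await in Python."},
]

def has_manual_cache_control(messages):
    """True if any content block already carries cache_control,
    in which case auto-injection is skipped entirely."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            if any("cache_control" in block
                   for block in content if isinstance(block, dict)):
                return True
    return False

print(has_manual_cache_control(messages))  # True
```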

Check cache usage

Cache hit information appears in the response usage object:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    }
  }
}
The cached_tokens field shows how many prompt tokens were served from cache.
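A small helper can turn this into a cache hit rate. Field names follow the usage object above; responses without prompt_tokens_details simply report zero:

```python
def cache_hit_rate(usage):
    """Fraction of prompt tokens served from cache, from a usage dict."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {"cached_tokens": 1200},
}
print(f"{cache_hit_rate(usage):.0%}")  # 80%
```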

When to use

Caching works best with:
  • Multi-turn conversations — the shared conversation prefix grows with each turn
  • Long system prompts — reused across many requests
  • Few-shot examples — static example blocks cached across calls
Caching provides minimal benefit for:
  • Unique prompts — no shared prefix to cache
  • Very short prompts — below provider token thresholds
  • Single-turn requests — no subsequent requests to benefit from the cache