Prompt caching (reusing previously processed prompt tokens instead of reprocessing them) reduces cost and latency on repeated requests. By default, Auriko handles cache optimization automatically. For fine-grained control, you can specify cache control manually.

Prerequisites

  • An Auriko API key
  • Python 3.10+ with the OpenAI SDK (pip install openai) or the Auriko SDK (pip install auriko)
  • Or Node.js 18+ with the OpenAI SDK (npm install openai) or the Auriko SDK (npm install @auriko/sdk)

How it works

Auriko optimizes caching for each provider when your request includes reusable prompt content. On subsequent requests sharing the same prompt prefix, the provider serves cached tokens at reduced cost and lower latency. Auriko accounts for each provider’s caching economics (token thresholds, discount depths, and read/write prices) when choosing where to route.

Over time, the system learns your usage patterns to improve estimation accuracy. Create separate workspaces for different use cases to get better predictions.

Auriko is a zero data retention proxy. Your prompts, responses, and content are never read, logged, or stored. Pattern calibration uses usage metadata only. Read the Privacy Policy for details.

Send a cached request

Send a request with a reusable system prompt:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant..."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
)

usage = response.usage
if hasattr(usage, "prompt_tokens_details") and usage.prompt_tokens_details:
    cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0)
    print(f"Cached tokens: {cached}")
print(f"Total prompt tokens: {usage.prompt_tokens}")

Override caching per provider

Auriko handles caching automatically for supported providers. For explicit control, each provider accepts specific fields. When you supply one of these fields, Auriko skips automatic cache injection for that provider and uses your value.

| Provider | Field | Effect |
| --- | --- | --- |
| Anthropic | cache_control: {"type": "ephemeral"} on content blocks | Marks specific content for caching |
| OpenAI | prompt_cache_key (string) | Improves cache hit rate for repeated conversations |
| OpenAI | prompt_cache_retention: "24h" | Extends cache lifetime to 24 hours |
| Fireworks | user (string) | Improves cache reuse across conversation turns |

Anthropic — cache_control

Add cache_control to content blocks to mark specific content for caching:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": [
            {"type": "text", "text": "You are a helpful coding assistant with deep knowledge of Python, JavaScript, and Rust. You follow best practices and explain your reasoning step by step.",
             "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
)
The only supported type is "ephemeral", which follows the provider’s default retention behavior.

cache_control applies to Anthropic models only. For other providers, automatic optimization handles caching.

OpenAI — prompt_cache_key and prompt_cache_retention

prompt_cache_key improves cache hit rate for repeated conversations. prompt_cache_retention: "24h" extends the cache lifetime to 24 hours.

prompt_cache_retention is supported on gpt-4.1+ and gpt-5+ models only. It isn’t compatible with a zero data retention (ZDR) data policy, so omit it if your workspace uses ZDR.

The example below passes both fields through extra_body:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1",
)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
    extra_body={
        "prompt_cache_key": "my-conversation-123",
        "prompt_cache_retention": "24h",
    },
)

Fireworks — user

On Fireworks, requests with the same user value benefit from improved cache reuse across conversation turns.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1",
)

response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
    user="my-conversation-123",
)

Check cache usage

For /v1/chat/completions responses, cache hit information appears in usage.prompt_tokens_details:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200,
      "cache_creation_tokens": 300
    }
  }
}
cached_tokens shows how many prompt tokens were served from cache. Auriko normalizes this field across all providers in the OpenAI-format response.

cache_creation_tokens shows how many tokens were written to prompt cache on this request. This field is populated for Anthropic models.

For /v1/messages responses, cache tokens appear as top-level usage fields:
{
  "usage": {
    "input_tokens": 300,
    "output_tokens": 200,
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 0
  }
}
input_tokens represents only the non-cached portion. Total input tokens = input_tokens + cache_read_input_tokens + cache_creation_input_tokens.
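
The two usage shapes above can be folded into a single summary. The following is a minimal sketch, not part of any SDK: summarize_cache_usage is a hypothetical helper, it assumes the usage object is already a plain dict (for example, obtained by calling model_dump() on the SDK's usage object), and it assumes prompt_tokens in the chat completions shape already includes the cached tokens.

```python
from typing import Any


def summarize_cache_usage(usage: dict[str, Any]) -> dict[str, float]:
    """Summarize cache activity for either usage shape (hypothetical helper)."""
    if "prompt_tokens" in usage:
        # /v1/chat/completions shape: cache counts live in prompt_tokens_details.
        # Assumption: prompt_tokens already includes the cached tokens.
        details = usage.get("prompt_tokens_details") or {}
        total = usage["prompt_tokens"]
        cached = details.get("cached_tokens") or 0
        written = details.get("cache_creation_tokens") or 0
    else:
        # /v1/messages shape: input_tokens is only the non-cached portion,
        # so the total is the sum of all three fields.
        cached = usage.get("cache_read_input_tokens") or 0
        written = usage.get("cache_creation_input_tokens") or 0
        total = (usage.get("input_tokens") or 0) + cached + written
    hit_ratio = cached / total if total else 0.0
    return {
        "total_input_tokens": total,
        "cached_tokens": cached,
        "cache_write_tokens": written,
        "hit_ratio": hit_ratio,
    }


# The /v1/messages example above: 300 + 1200 + 0 input tokens in total.
messages_usage = {
    "input_tokens": 300,
    "output_tokens": 200,
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 0,
}
summary = summarize_cache_usage(messages_usage)
print(summary["total_input_tokens"], summary["hit_ratio"])
```

Tracking the hit ratio over time gives a quick signal of whether prompt prefixes are actually being reused.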

Check cache savings

Cache savings appear in routing_metadata.cost when savings are greater than zero:
{
  "routing_metadata": {
    "cost": {
      "usd": 0.0042,
      "cache_savings_percent": 47,
      "cache_savings_usd": 0.0037
    }
  }
}
cache_savings_percent is an integer (0-100) showing the percentage saved compared to uncached cost. cache_savings_usd shows the dollar amount saved.
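
Because the savings fields are omitted when there are no savings, client code should read them defensively. The sketch below is illustrative only: cache_savings and implied_uncached_cost are hypothetical helpers, and the second one assumes that usd is the amount charged after savings, which the page does not state explicitly (although the example figures above are consistent with that reading).

```python
from typing import Any, Optional


def cache_savings(routing_metadata: Optional[dict[str, Any]]) -> tuple[float, int]:
    """Return (savings in USD, savings percent), defaulting to zero when absent."""
    cost = (routing_metadata or {}).get("cost") or {}
    return cost.get("cache_savings_usd", 0.0), cost.get("cache_savings_percent", 0)


def implied_uncached_cost(routing_metadata: Optional[dict[str, Any]]) -> float:
    """Estimate the cost without caching.

    Assumption: cost.usd is the post-savings charge, so the uncached cost
    is usd + cache_savings_usd.
    """
    cost = (routing_metadata or {}).get("cost") or {}
    return cost.get("usd", 0.0) + cost.get("cache_savings_usd", 0.0)


metadata = {
    "cost": {"usd": 0.0042, "cache_savings_percent": 47, "cache_savings_usd": 0.0037}
}
saved_usd, saved_percent = cache_savings(metadata)
uncached = implied_uncached_cost(metadata)
# Cross-check the reported percentage against the dollar figures.
print(saved_percent, round(saved_usd / uncached * 100))
```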

Check cache usage in streams

Cache metrics appear in the final streaming chunk alongside usage and routing_metadata. See Streaming for details on consuming trailing chunks.
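
One way to pick up those trailing values is to keep the most recent non-empty usage and routing_metadata while iterating. This is a minimal sketch under stated assumptions: trailing_metadata is a hypothetical helper, and it expects each chunk as a plain dict (with the OpenAI Python SDK you would convert each chunk first, for example with model_dump()). Any request options needed to enable usage in streams are covered in Streaming, not here.

```python
from typing import Any, Iterable, Optional


def trailing_metadata(chunks: Iterable[dict[str, Any]]) -> dict[str, Optional[dict]]:
    """Collect usage and routing_metadata from a stream of chunk dicts."""
    usage: Optional[dict] = None
    routing: Optional[dict] = None
    for chunk in chunks:
        # Content chunks normally carry neither field, so only overwrite
        # when a chunk actually provides a value.
        if chunk.get("usage"):
            usage = chunk["usage"]
        if chunk.get("routing_metadata"):
            routing = chunk["routing_metadata"]
    return {"usage": usage, "routing_metadata": routing}


# Simulated stream: two content chunks followed by a final metadata chunk.
fake_stream = [
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " world"}}]},
    {
        "choices": [],
        "usage": {
            "prompt_tokens": 1500,
            "prompt_tokens_details": {"cached_tokens": 1200},
        },
        "routing_metadata": {"cost": {"usd": 0.0042, "cache_savings_percent": 47}},
    },
]
meta = trailing_metadata(fake_stream)
print(meta["usage"]["prompt_tokens_details"]["cached_tokens"])
```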

Improve cache hits

You can improve cache hit rates by structuring your requests for reuse.
  • Long, stable system prompts: Place reusable instructions in the system message. The prompt prefix is what providers cache.
  • Few-shot examples: Static example blocks are reused across requests.
  • Static before dynamic: Put content that doesn’t change before content that does.
  • Multi-turn conversations: Shared prompt prefixes get better cache reuse across requests.
  • Steady request cadence: Providers expire cached tokens after inactivity. Steady flow keeps entries warm.
See Cost optimization for more strategies.
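
The ordering guidance above can be sketched as a small message builder that always emits static content first. build_messages and shared_prefix_len are hypothetical helpers written for illustration, and the prompt text is placeholder content.

```python
from typing import Any

Message = dict[str, Any]


def build_messages(
    system_prompt: str,
    few_shot: list[Message],
    history: list[Message],
    user_input: str,
) -> list[Message]:
    """Order messages so the stable content forms the prompt prefix."""
    messages: list[Message] = [{"role": "system", "content": system_prompt}]
    messages.extend(few_shot)   # static examples, identical on every request
    messages.extend(history)    # earlier turns, which only grow at the end
    messages.append({"role": "user", "content": user_input})  # dynamic part last
    return messages


def shared_prefix_len(a: list[Message], b: list[Message]) -> int:
    """Count how many leading messages two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


SYSTEM = "You are a helpful coding assistant."
FEW_SHOT = [
    {"role": "user", "content": "What is a list comprehension?"},
    {"role": "assistant", "content": "A concise way to build a list from an iterable."},
]

first = build_messages(SYSTEM, FEW_SHOT, [], "Explain async/await in Python.")
second = build_messages(SYSTEM, FEW_SHOT, [], "Explain generators in Python.")
# Only the final user message differs between the two requests.
print(shared_prefix_len(first, second))
```

Keeping timestamps, request IDs, and other per-request values out of the system prompt preserves this shared prefix.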

Look up cache pricing

The model directory exposes cache pricing for every supported provider. Query it to see cache_read_price, cache_write_price, and supports_prompt_caching per model:
import os
import httpx

response = httpx.get(
    "https://api.auriko.ai/v1/directory/models",
    headers={"Authorization": f"Bearer {os.environ['AURIKO_API_KEY']}"},
)
response.raise_for_status()  # Fail early on authentication or server errors

# Print cache prices for every tier that lists a cache read price
for model_id, model in response.json()["models"].items():
    for provider in model.get("providers", []):
        for tier in provider.get("tiers", []):
            if tier.get("cache_read_price"):
                print(f"{model_id} ({provider['provider']}): "
                      f"read=${tier['cache_read_price']}/M, "
                      f"write=${tier.get('cache_write_price', 'N/A')}/M")
Providers offer discounted rates for cache reads compared to standard input pricing. Some charge a surcharge for cache writes. Check the directory for current prices.
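
To see how a read discount and a write surcharge combine, a rough per-request estimate can be sketched as follows. estimate_input_cost is a hypothetical helper, the prices in the usage lines are made-up illustrative numbers (not taken from the directory), and prices are assumed to be quoted in USD per million tokens.

```python
TOKENS_PER_PRICE_UNIT = 1_000_000  # assumption: prices are USD per million tokens


def estimate_input_cost(
    uncached_tokens: int,
    cached_tokens: int,
    written_tokens: int,
    input_price: float,
    cache_read_price: float,
    cache_write_price: float,
) -> float:
    """Estimate the input-side cost of one request in USD."""
    total = (
        uncached_tokens * input_price        # tokens processed normally
        + cached_tokens * cache_read_price   # tokens served from cache
        + written_tokens * cache_write_price # tokens written to cache
    )
    return total / TOKENS_PER_PRICE_UNIT


# Illustrative prices only; look up real values in the model directory.
INPUT_PRICE = 3.00
READ_PRICE = 0.30
WRITE_PRICE = 3.75

with_cache = estimate_input_cost(300, 1200, 0, INPUT_PRICE, READ_PRICE, WRITE_PRICE)
without_cache = estimate_input_cost(1500, 0, 0, INPUT_PRICE, READ_PRICE, WRITE_PRICE)
print(f"with cache: ${with_cache:.5f}, without cache: ${without_cache:.5f}")
```

The first request in a sequence typically pays the write price, so savings show up only once later requests read from the cache.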

Troubleshoot

| Symptom | Fix |
| --- | --- |
| cached_tokens always 0 (first request) | The first request creates the cache. Send a follow-up with the same prefix. |
| cached_tokens always 0 (unsupported model) | Check supports_prompt_caching in the model directory. |
| cached_tokens always 0 (unique prompts) | Caching requires a shared prefix. Add a reusable system prompt. |
| cached_tokens always 0 (short prompt) | Your prompt may be below the provider’s minimum token threshold. Add more reusable content to the system message. |
| Lower-than-expected savings | Move static content before dynamic content in messages. |
| Lower-than-expected savings (gaps between requests) | Providers expire cached tokens after inactivity. Maintain steady request flow. |
| cache_savings_percent not in response | The field appears only when savings are greater than zero. |
