Auriko’s proprietary cost model computes the expected cost (the predicted cost accounting for caching, pricing tiers, and your usage patterns) of each request at every available provider and routes to the cheapest one.

Prerequisites

  • An Auriko API key
  • One of the following:
    • Python 3.10+ with the OpenAI SDK (pip install openai) or the Auriko SDK (pip install auriko)
    • Node.js 18+ with the OpenAI SDK (npm install openai) or @auriko/sdk (npm install @auriko/sdk)

Enable cost optimization

To route by cost, set optimize to "cost":
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AURIKO_API_KEY"],
    base_url="https://api.auriko.ai/v1"
)

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"gateway": {"routing": {"optimize": "cost"}}}
)

Understand the cost model

Pricing page rates show what a cached token costs if it gets cached. They don’t tell you which tokens get cached or under what conditions. Two providers quoting identical rates can produce different bills on the same workload. Auriko maintains a proprietary data pipeline and cost model that tracks provider-side caching mechanics, estimates your usage patterns, and predicts the expected cost of each request at every available provider.

Provider tracking

Auriko’s data pipeline tracks each provider’s caching mechanics: discount depths, minimum token thresholds, block granularity, write costs, expiration windows, and pricing tiers that shift with context length. This data updates as providers change infrastructure.
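
As a rough illustration of how these mechanics interact, the sketch below models one provider's cache rules as a small data structure and computes how many prompt tokens would qualify for a cache discount. The class, its field names, and every number are hypothetical. They are not Auriko's internal schema or any real provider's values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheMechanics:
    """Hypothetical description of one provider's prompt-caching rules."""
    read_discount: float    # fraction taken off the input price for cache reads
    write_surcharge: float  # fraction added to the input price for cache writes
    min_tokens: int         # prefixes shorter than this are never cached
    block_size: int         # caching happens in whole blocks of this many tokens
    ttl_seconds: int        # idle time after which a cached prefix expires


def cacheable_tokens(prefix_tokens: int, rules: CacheMechanics) -> int:
    """Return how many tokens of a shared prefix can be served from cache."""
    if prefix_tokens < rules.min_tokens:
        return 0  # below the minimum threshold: no discount at all
    # Only complete blocks are cached; the remainder is billed as regular input.
    return (prefix_tokens // rules.block_size) * rules.block_size


# Made-up values, for illustration only.
example_rules = CacheMechanics(
    read_discount=0.5, write_surcharge=0.25,
    min_tokens=1024, block_size=128, ttl_seconds=300,
)
print(cacheable_tokens(900, example_rules))
print(cacheable_tokens(1500, example_rules))
```

Under these made-up rules, a 900-token prefix earns no discount, while a 1,500-token prefix is cached only up to the last complete 128-token block.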

Usage estimation

Auriko estimates request-level variables from your usage patterns: prefix length, reuse frequency, request timing, conversation depth, and output volume. These estimates predict how each provider’s caching will perform for your specific traffic. Auriko is a zero-data-retention proxy, so pattern estimation uses usage metadata only. Read the Privacy Policy for details.
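
To make the role of request timing concrete, here is a minimal sketch that estimates a cache hit rate from arrival timestamps alone. It assumes a sliding expiration window that resets on every request, which is a simplification. The function name and the sample timings are hypothetical, and this is not Auriko's estimator.

```python
def estimated_hit_rate(arrival_times: list[float], ttl_seconds: float) -> float:
    """Fraction of requests arriving within the TTL of the previous request."""
    if len(arrival_times) < 2:
        return 0.0
    times = sorted(arrival_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    hits = sum(1 for gap in gaps if gap <= ttl_seconds)
    # The first request can never hit, so divide by the total request count.
    return hits / len(times)


steady = [0, 60, 120, 180, 240]  # one request per minute
bursty = [0, 1, 2, 900, 901]     # two bursts separated by 15 minutes
print(estimated_hit_rate(steady, ttl_seconds=300))
print(estimated_hit_rate(bursty, ttl_seconds=300))
```

With a 300-second window, the steady stream scores 0.8 and the bursty one 0.6, because the 15-minute gap forces the second burst to start from a cold cache.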

Per-request cost prediction

For each request, the cost model combines provider data and usage estimates to compute the expected cost at every available provider, then routes to the cheapest one. This is a per-request decision, not a static ranking. A provider with higher list prices can be cheaper over a multi-turn conversation if its caching mechanics produce more cache hits for your workload. Cache reads cost less than regular input, but cache writes can cost more than regular input. The cost model accounts for both.
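
The arithmetic behind "higher list price, lower bill" can be sketched as follows. This simplified example covers input tokens only and ignores cache write costs and output tokens. The provider names, prices, discounts, and cache-hit assumptions are all hypothetical, and Auriko's actual cost model is proprietary and not shown here.

```python
def expected_input_cost(prompt_tokens, cached_tokens, price_per_1m, read_discount):
    """Expected input cost in USD for one request, given predicted cache reads."""
    uncached = prompt_tokens - cached_tokens
    cached_price = price_per_1m * (1 - read_discount)
    return (uncached * price_per_1m + cached_tokens * cached_price) / 1_000_000


# Hypothetical request: 10,000 prompt tokens, 9,000 of them a reused prefix.
prompt_tokens, reused_prefix = 10_000, 9_000

# Provider A: lower list price, but assume its cache never hits for this traffic.
cost_a = expected_input_cost(prompt_tokens, 0, price_per_1m=2.00, read_discount=0.5)
# Provider B: higher list price, but assume the whole prefix is read at 90% off.
cost_b = expected_input_cost(prompt_tokens, reused_prefix, price_per_1m=3.00, read_discount=0.9)

costs = {"provider-a": cost_a, "provider-b": cost_b}
cheapest = min(costs.items(), key=lambda item: item[1])
print(cheapest)
```

Under these assumptions provider B costs roughly $0.0057 against provider A's $0.02, even though B's list price is 50% higher.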

Set latency constraints

To optimize for cost while enforcing a latency ceiling, add max_ttft_ms:
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"gateway": {"routing": {
        "optimize": "cost",
        "max_ttft_ms": 1000,
    }}}
)
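
Conceptually, a latency ceiling acts as a filter applied before the cost comparison. The sketch below shows that idea with hypothetical provider names and numbers. How the gateway behaves when no provider meets the ceiling is not covered on this page, so the `None` return is an assumption of the sketch.

```python
candidates = [
    # (hypothetical provider name, expected cost in USD, expected TTFT in ms)
    ("provider-a", 0.0040, 1800),
    ("provider-b", 0.0055, 650),
    ("provider-c", 0.0070, 400),
]


def pick_cheapest_within_ttft(candidates, max_ttft_ms):
    """Drop providers over the latency ceiling, then take the cheapest survivor."""
    eligible = [c for c in candidates if c[2] <= max_ttft_ms]
    if not eligible:
        return None  # nothing satisfies the constraint (sketch-only behavior)
    return min(eligible, key=lambda c: c[1])


print(pick_cheapest_within_ttft(candidates, max_ttft_ms=1000))
```

Here the cheapest provider overall is excluded for being too slow, so the choice falls to the cheapest of the remaining two.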

Maximize savings with cost-focus

cost-focus aggressively minimizes cost with minimal weight on other factors:
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Summarize this document..."}],
    extra_body={"gateway": {"routing": {"optimize": "cost-focus"}}}
)
You can also use the suffix shortcut:
response = client.chat.completions.create(
    model="gpt-5.4:cost-focus",
    messages=[{"role": "user", "content": "Summarize this document..."}]
)
Strategy      Behavior
cost          Favors cheaper providers while considering performance and latency
cost-focus    Routes to the cheapest provider with minimal weight on other factors
Both strategies account for cache economics. cost-focus weights cost more aggressively. For the general base vs. focus explanation, see Base vs. focus.
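
One way to picture the difference between the two strategies is as a weighted score in which cost-focus places nearly all of the weight on cost. The weights, factor scores, and provider names below are invented for illustration. The actual weighting used by each strategy is not published on this page.

```python
# Hypothetical weights; the real strategies' weights are not documented here.
WEIGHTS = {
    "cost": {"cost": 0.6, "latency": 0.2, "performance": 0.2},
    "cost-focus": {"cost": 0.9, "latency": 0.05, "performance": 0.05},
}

# Normalized penalty scores in [0, 1]; lower is better for every factor.
providers = {
    "provider-a": {"cost": 0.20, "latency": 0.90, "performance": 0.80},
    "provider-b": {"cost": 0.35, "latency": 0.30, "performance": 0.30},
}


def route(strategy: str) -> str:
    """Return the provider with the lowest weighted penalty for a strategy."""
    weights = WEIGHTS[strategy]

    def score(name: str) -> float:
        return sum(weights[factor] * providers[name][factor] for factor in weights)

    return min(providers, key=score)


print(route("cost"))
print(route("cost-focus"))
```

With these numbers, cost picks provider-b because its latency and performance offset a slightly higher price, while cost-focus picks provider-a, the cheaper option.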

Set cost ceilings

To exclude providers above a price threshold, set max_cost_per_1m:
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"gateway": {"routing": {
        "optimize": "cost",
        "max_cost_per_1m": 10.00,
    }}}
)
Auriko calculates cost as the average of input and output price per 1M tokens. Providers exceeding this ceiling are excluded from routing. For fine-grained quality and cost constraints, see Advanced routing.
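
As a worked example of the ceiling check, the sketch below averages input and output prices and keeps only providers at or under the ceiling. The provider names and prices are hypothetical.

```python
def blended_price_per_1m(input_price: float, output_price: float) -> float:
    """Average of input and output price per 1M tokens, as described above."""
    return (input_price + output_price) / 2


# Hypothetical list prices in USD per 1M tokens: (input, output).
prices = {
    "provider-a": (2.50, 10.00),
    "provider-b": (5.00, 20.00),
}
max_cost_per_1m = 10.00

eligible = [
    name for name, (inp, out) in prices.items()
    if blended_price_per_1m(inp, out) <= max_cost_per_1m
]
print(eligible)
```

Provider A averages $6.25 per 1M tokens and stays eligible. Provider B averages $12.50 and is excluded by the $10.00 ceiling.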

Restrict key source

If you have negotiated provider rates through your own API keys, set only_byok to true so that requests use only your own (BYOK) keys:
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"gateway": {"routing": {
        "optimize": "cost",
        "only_byok": True,
    }}}
)
See Advanced routing for the full constraint API and Bring Your Own Key for BYOK setup.
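
If you set these options in several places, a small helper can keep the payload consistent. The sketch below builds the extra_body dictionary from the routing fields shown on this page. The helper name is hypothetical, and combining all three constraints in a single request is an assumption, since the examples above show each one separately.

```python
def routing_options(optimize="cost", max_ttft_ms=None, max_cost_per_1m=None, only_byok=False):
    """Build the extra_body payload from the routing fields shown on this page."""
    routing = {"optimize": optimize}
    if max_ttft_ms is not None:
        routing["max_ttft_ms"] = max_ttft_ms
    if max_cost_per_1m is not None:
        routing["max_cost_per_1m"] = max_cost_per_1m
    if only_byok:
        routing["only_byok"] = True
    return {"gateway": {"routing": routing}}


extra_body = routing_options(max_ttft_ms=1000, max_cost_per_1m=10.00, only_byok=True)
print(extra_body)
```

The result can be passed as extra_body=extra_body to client.chat.completions.create.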

Track cost and savings

Every response includes the billable cost in cost.usd. The usage breakdown shows prompt_tokens, cached_tokens, and completion_tokens.
cost = response.routing_metadata.cost
print(f"Total cost: ${cost.usd:.6f}")

# Check cache usage
usage = response.usage
if hasattr(usage, "prompt_tokens_details") and usage.prompt_tokens_details:
    cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0)
    print(f"Cached tokens: {cached}")
print(f"Total prompt tokens: {usage.prompt_tokens}")
Auriko normalizes cache reporting across all providers. Regardless of which provider served your request, you read cached_tokens from usage.prompt_tokens_details. When cost-optimized routing triggers a failover, Auriko falls back in cost order to the next cheapest eligible provider. The cost and savings data in each response reflect the output of Auriko’s cost model, not list-price arithmetic.
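
Because cached_tokens is reported in the same place for every provider, you can derive a cache-hit share with one helper. The sketch below uses a stand-in object with made-up numbers in place of a real response.usage. The helper name is hypothetical.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens that were served from cache (0.0 if unknown)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens


# Stand-in for response.usage, with made-up numbers.
usage = SimpleNamespace(
    prompt_tokens=2000,
    completion_tokens=150,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1500),
)
print(f"Cache hit share: {cached_fraction(usage):.0%}")
```

A share that stays low across a conversation with a long, stable prefix suggests the prompt structure or request timing is working against the cache.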

Optimize your workload

Structure your workload to maximize cost savings.
  • Long, stable system prompts: Maximize cache reuse across requests.
  • Consistent conversation IDs: These help providers maintain cache affinity.
  • Steady request cadence: Bursty traffic with long idle gaps lets cached prefixes expire before they are reused.
  • Prompt length: Prompts below provider minimum token thresholds get zero cache discount.
  • Strategy choice: cost-focus aggressively minimizes cost. cost adds weight to latency and performance.
  • Monitor: Track cost and savings in the dashboard.
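
The first two points above come down to keeping the start of every request identical. The sketch below assumes prefix-based caching, where only an unchanged leading portion of the prompt can be reused. The system prompt text and helper names are hypothetical.

```python
# Long and unchanging: avoid timestamps or per-user data in this text.
SYSTEM_PROMPT = "You are a support assistant for Example Corp. Follow the policy guide."


def build_messages(history: list[dict], user_input: str) -> list[dict]:
    """Put stable content first so consecutive requests share a long prefix."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # identical on every request
        *history,                                      # append-only conversation turns
        {"role": "user", "content": user_input},       # only the tail changes
    ]


def shared_prefix_length(a: list[dict], b: list[dict]) -> int:
    """Count leading messages that are identical between two requests."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


turn_1 = build_messages([], "Where is my order?")
history = [turn_1[-1], {"role": "assistant", "content": "Let me check."}]
turn_2 = build_messages(history, "It was order 1234.")
print(shared_prefix_length(turn_1, turn_2))
```

The second request reuses both messages from the first as its prefix. Editing or reordering earlier turns would shrink that shared prefix and reduce what a provider can serve from cache.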

Apply to use cases

Background processing

Batch processing with cost-focus routing:
# Assumes each item in `documents` has `id` and `text` attributes, and that
# `save_summary` is your own persistence helper.
for doc in documents:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {doc.text}"}],
        extra_body={"gateway": {"routing": {"optimize": "cost-focus"}}}
    )
    save_summary(doc.id, response.choices[0].message.content)
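
For batch jobs like the loop above, it can help to total the per-response cost as you go. The sketch below sums routing_metadata.cost.usd across responses, using stand-in objects with made-up costs in place of real API responses. The helper name is hypothetical.

```python
from types import SimpleNamespace


def total_cost_usd(responses) -> float:
    """Sum the billable cost reported on each response."""
    return sum(r.routing_metadata.cost.usd for r in responses)


# Stand-ins for real responses, with made-up costs.
fake_responses = [
    SimpleNamespace(routing_metadata=SimpleNamespace(cost=SimpleNamespace(usd=usd)))
    for usd in (0.0012, 0.0008, 0.0015)
]
print(f"Batch cost: ${total_cost_usd(fake_responses):.6f}")
```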

With latency budget

Cost routing with a latency constraint:
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=conversation,  # your existing list of chat messages
    extra_body={"gateway": {"routing": {
        "optimize": "cost",
        "max_ttft_ms": 1000,
    }}}
)

Monitor costs

Track your cost savings in the Auriko dashboard:
  • Total spend by day/week/month
  • Cost per model
  • Cost per provider
  • Savings vs. single-provider baseline
