Cost Gating in Production
llm-ports ships USD-denominated cost gating as a first-class feature. The thesis: real users care about dollars per day, not requests per hour. Request-count gating is a weak proxy — a 16k-token completion costs ~30× a 500-token one, but request count treats them the same.
This guide covers: how cost is computed per call, how to set caps, how the in-memory backend works, when to swap in a Redis backend, and how to override pricing when providers raise prices between releases.
How cost is computed
Each adapter ships a pricing table keyed by model id. After every call, the adapter:
- Reads the token usage from the provider response (
usage.input_tokens,usage.output_tokens,usage.cache_read_input_tokens, etc.) - Looks up the model's pricing entry:
inputPer1M,outputPer1M, optionalcacheReadPer1M,cacheWritePer1M - Computes USD cost via
computeChatCost(usage, pricing) - Returns it on the result object as
result.cost.totalUSD - Records it against the provider's running totals via the registered
CostBackend
If the model id has no pricing entry, the call fails fast with an explicit error — no silent zero-costing. That includes models you might add to env config but forget to price; the fix is one of:
- Add the entry to the adapter's
pricing.tsand submit a PR - Override locally via
pricingOverrides(next section)
Set caps in env
# Cost cap: $50/day on this provider
LLM_PROVIDER_PREMIUM=anthropic|claude-sonnet-4-6-20250514|cost:50/day
# Multiple windows: per-hour AND per-day. First to trip blocks.
LLM_PROVIDER_GPT_TIGHT=openai|gpt-5|cost:1/hour,cost:20/day
# Combine with request gating
LLM_PROVIDER_BALANCED=openai|gpt-5-mini|req:500/hour,cost:5/day
# Local Ollama: no cost, no limit
LLM_PROVIDER_LOCAL=ollama|llama3.3|unlimitedAvailable windows: cost:N/hour, cost:N/day, cost:N/month. Multiple may apply at once.
Override pricing for a model
Providers raise prices. Releases lag. Use pricingOverrides for the daily-use escape hatch:
import { createRegistryFromEnv } from "@llm-ports/core";
import { createAnthropicAdapter } from "@llm-ports/adapter-anthropic";
export const registry = createRegistryFromEnv({
adapters: {
anthropic: createAnthropicAdapter({ apiKey: process.env.ANTHROPIC_API_KEY! }),
},
pricingOverrides: {
// Override the bundled rate; effective immediately, no release needed
"claude-sonnet-4-6-20250514": { inputPer1M: 3.5, outputPer1M: 17.5 },
// Add a model not in the bundled table at all
"claude-experimental-x": { inputPer1M: 5, outputPer1M: 25 },
},
});When a release ships updated bundled pricing, you can drop the override.
Inspect cost per call
Every call returns full cost breakdown:
const result = await llm.generateText({ taskType: "triage", prompt });
result.cost.inputUSD; // input tokens cost (incl. cache reads/writes)
result.cost.outputUSD; // output tokens cost
result.cost.totalUSD; // sum
result.cost.cacheDiscountUSD; // savings vs paying full input rate (when cache used)
result.usage.inputTokens; // raw token counts
result.usage.outputTokens;
result.usage.cacheReadTokens; // present if model used prompt cacheWire this to your analytics / observability pipeline via the observability hooks on capabilities or directly in your activity wrappers.
Swap in a Redis backend for multi-process
The default InMemoryBudget and InMemoryCost backends work for single-process apps. For multi-process deployments (Temporal workers, multiple API server replicas, etc.), the counters need to be shared. Implement the CostBackend interface against Redis:
import type { CostBackend, CostCheckResult, CostLimit } from "@llm-ports/core";
import { createRegistryFromEnv } from "@llm-ports/core";
import { createClient } from "redis";
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
class RedisCost implements CostBackend {
async recordCost(alias: string, usd: number): Promise<void> {
const now = Date.now();
const key = `cost:${alias}`;
// Use a sorted set with timestamp scores; trim entries older than 30d
await redis.zAdd(key, { score: now, value: `${now}:${usd}` });
await redis.zRemRangeByScore(key, 0, now - 30 * 24 * 60 * 60 * 1000);
}
async check(alias: string, limit: CostLimit): Promise<CostCheckResult> {
if (limit.kind === "unlimited") {
return { allowed: true, current: 0, limit: Infinity };
}
// Sum recent entries by window; trip if any cap exceeded
// (full implementation in @llm-ports/backend-redis when it ships in v0.2)
// ...
}
}
export const registry = createRegistryFromEnv({
adapters: { ... },
cost: new RedisCost(),
});A reference Redis backend will ship as @llm-ports/backend-redis in v0.2.
Sunset / monthly budgets
Set a monthly cap to enforce a hard ceiling on spend:
LLM_PROVIDER_BUDGET=anthropic|claude-haiku-4-5|cost:0.5/hour,cost:5/day,cost:100/monthThe monthly window prevents drift across days. Combined with hourly and daily windows, you get layered defense: hourly catches a runaway script, daily catches a buggy automation, monthly catches everything else.
What this DOESN'T do
- Doesn't enforce per-tenant or per-user budgets. That's an application concern. The registry alias is application-level. For per-tenant gating, you'd build it as a wrapper that selects a per-tenant alias.
- Doesn't refund failed calls. If a call fails mid-stream after some tokens were charged, the cost gate doesn't know. In practice provider failures usually charge $0 anyway.
- Doesn't predict cost ahead of time. Budget check happens before the call; cost recording happens after. A single call can trip the cap mid-day; subsequent calls are blocked but the tripping call itself succeeds.
Reading next
- Multi-provider routing → — fallback chain mechanics
- Cost vs request gating → — when to use which