Semantic caching

Features

Semantic caching

Repeat and near-duplicate prompts are common in production. Neural Router can serve them from a per-workspace cache — cutting cost and latency — and attribute the savings.

How it works

When caching is enabled, Neural Router fingerprints each request and checks for an exact or semantically similar prior response within the TTL. A hit returns instantly without calling a provider; a miss routes normally and stores the result. You see hit rate, cached entries, and dollars saved on the Cache page.

Enabling it

Turn caching on per workspace under Cache, and choose a TTL (15 minutes to 24 hours). Caching applies to eligible requests automatically.

Per-request control

Override the workspace default per request with the cache field — for example to bypass the cache for a request that must be fresh:

request
{
  "model": "auto",
  "messages": [ ... ],
  "cache": { "mode": "bypass" }   // or "read_write" (default) | "read_only"
}

Savings attribution

Every cache hit is credited against what the equivalent provider call would have cost, so the savings shown reflect real avoided spend. Pair caching with the cost objective for the largest reduction.