Caching Strategies

MemorySync uses a shared cache layer to accelerate embedding lookups, dedupe repeated queries, and shave latency off the recall path. The cache is fully managed by the platform — you don’t provision, scale, or invalidate it. This page explains what the cache buys you, when it kicks in, and how the platform guarantees correctness even when the cache is bypassed.

Cache Philosophy

The cache layer is designed around three principles that shape every cached lookup:

Accelerator, not a dependency. The cache is purely an optimization. If it is unavailable for any reason, the recall and ingestion pipelines fall back to the primary stores transparently. Your requests never fail because of a cache problem.
Tenant-isolated. Cache namespaces are scoped per organization. One tenant’s cache contents are never visible to another, and a noisy tenant cannot push another tenant’s working set out of cache.
Correctness over hit-rate. When the underlying data changes (a memory is updated or deleted, or its retrieval configuration changes), the affected cache entries are invalidated synchronously. The cache never returns stale content after a write has been acknowledged.
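
The tenant-isolation principle can be pictured with a small sketch. It assumes one LRU namespace per organization with its own capacity, so that evictions caused by one tenant only ever remove that tenant's entries. The class name, the `org_id` parameter, and the per-tenant capacity are illustrative assumptions; the platform's actual eviction policy is not documented here.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional


class TenantScopedCache:
    """Toy cache with a separate LRU namespace per organization (illustrative)."""

    def __init__(self, per_tenant_capacity: int) -> None:
        self._capacity = per_tenant_capacity
        self._namespaces: Dict[str, "OrderedDict[str, Any]"] = {}

    def get(self, org_id: str, key: str) -> Optional[Any]:
        namespace = self._namespaces.get(org_id)
        if namespace is None or key not in namespace:
            return None
        namespace.move_to_end(key)  # mark as most recently used
        return namespace[key]

    def put(self, org_id: str, key: str, value: Any) -> None:
        namespace = self._namespaces.setdefault(org_id, OrderedDict())
        namespace[key] = value
        namespace.move_to_end(key)
        if len(namespace) > self._capacity:
            # Evict only this tenant's least recently used entry.
            namespace.popitem(last=False)
```

In this sketch a tenant that writes many keys can only evict its own older entries; lookups under a different `org_id` never see them.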

What Gets Cached

The platform decides what to cache automatically based on cost, reuse rate, and staleness tolerance. The high-impact categories are:

Category	Why it’s cached
Embedding vectors	Embeddings for memory content and query text are deterministic for the same input. Caching them avoids paying for repeated embedding computation on identical text.
Hot recall results	When the exact same query is repeated within the staleness window, the assembled result list can be served from cache instead of replaying the full pipeline.
Tenant configuration	Retrieval weights, gating thresholds, and feature flags rarely change. They are cached to avoid a configuration lookup on every request.
Rate-limit counters	Per-key sliding-window counters live in the cache so that every replica sees the same current usage instantly.
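
To illustrate why deterministic embeddings cache well, here is a minimal sketch that keys the cache on a hash of the exact input text. The key format, the `EmbeddingCache` class, and the `fake_embed` function are assumptions made for the example and do not describe the platform's internal key scheme.

```python
import hashlib
from typing import Callable, Dict, List


def embedding_cache_key(model: str, text: str) -> str:
    """Derive a cache key from the model name and the exact input text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"


class EmbeddingCache:
    """In-memory stand-in for the shared cache (illustrative only)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model: str) -> None:
        self._embed_fn = embed_fn
        self._model = model
        self._store: Dict[str, List[float]] = {}
        self.computed = 0  # how many times the embedding function actually ran

    def get(self, text: str) -> List[float]:
        key = embedding_cache_key(self._model, text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.computed += 1
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Hypothetical deterministic "embedding" so the example is runnable.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]
```

Because the key depends on the exact bytes of the text, two inputs that differ only in case or whitespace produce different keys and are computed separately.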

Cost impact

For workloads with repetitive query patterns, the embedding cache typically eliminates 60–80% of embedding compute on the recall path. This shows up as lower recall latency and is automatically reflected in your platform usage.

Freshness & Staleness Windows

Different cached objects have different staleness tolerances. The platform chooses the right window for each category:

Long-lived (1 hour). Deterministic outputs such as embedding vectors and hash lookups. The same input always produces the same output, so a longer window is safe.
Medium-lived (5 minutes). Recall result lists and tenant configuration. Long enough to absorb burst traffic, short enough that a configuration change becomes effective within minutes without manual invalidation.
Short-lived (seconds). Rate-limit counters and ephemeral session state. These windows match the rate-limit interval so counters are always accurate.
Write-invalidated. Anything tied to a specific memory (its embedding, its retrieval slot in recall caches) is invalidated immediately when that memory is updated or deleted — the staleness window does not apply to writes.
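
The staleness windows above can be sketched as a per-category TTL table combined with an explicit invalidation hook for writes. The category names, the `TtlCache` class, and the injected clock are assumptions for illustration; only the one-hour and five-minute windows come from the text. The short-lived window is left out because its exact length is not stated.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Windows taken from the list above; the category names are illustrative.
TTL_SECONDS = {
    "embedding": 3600,      # long-lived: 1 hour
    "recall_result": 300,   # medium-lived: 5 minutes
    "tenant_config": 300,   # medium-lived: 5 minutes
}


class TtlCache:
    """Toy cache whose entries expire after a per-category window."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock  # injected so expiry can be exercised without sleeping
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def put(self, category: str, key: str, value: Any) -> None:
        expires_at = self._clock() + TTL_SECONDS[category]
        self._entries[(category, key)] = (expires_at, value)

    def get(self, category: str, key: str) -> Optional[Any]:
        entry = self._entries.get((category, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[(category, key)]  # expired: treat as a miss
            return None
        return value

    def invalidate(self, category: str, key: str) -> None:
        # Write-invalidated entries are dropped immediately, regardless of TTL.
        self._entries.pop((category, key), None)
```

In production code the clock would typically be `time.monotonic`; a fake clock keeps the sketch easy to verify.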

Invalidation Guarantees

The platform uses explicit invalidation, not eventual consistency, for anything tied to a write. The contract you can rely on:

Update or delete. When you call POST /memory/update or POST /memory/delete, every cache entry that references that memory is invalidated before the API returns. A follow-up query never sees the old version.
Tenant configuration changes. When you change retrieval weights, gating thresholds, or any tenant-level setting through the dashboard, cached results that depended on the old configuration are invalidated immediately.
Tier transitions and compaction. Internal lifecycle events that change which memories are returned for a query (tier move, compaction merge, soft-delete) invalidate the affected recall caches as part of the same transaction that performs the change.
Time-based expiration. For data without a clear write to attach invalidation to (e.g. aggregate usage figures), the staleness window from the previous section guarantees eventual consistency within the documented bound.
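
The update-or-delete contract amounts to "write, invalidate, then acknowledge". The sketch below shows that ordering with a reverse index from memory ID to the cached queries that reference it. `RecallCache`, `update_memory`, and the dictionary-backed store are hypothetical stand-ins, not the platform's implementation.

```python
from typing import Dict, List, Optional, Set


class RecallCache:
    """Toy recall cache with a reverse index from memory ID to cache keys."""

    def __init__(self) -> None:
        self._results: Dict[str, List[str]] = {}
        self._keys_by_memory: Dict[str, Set[str]] = {}

    def put(self, query_key: str, memory_ids: List[str]) -> None:
        self._results[query_key] = list(memory_ids)
        for memory_id in memory_ids:
            self._keys_by_memory.setdefault(memory_id, set()).add(query_key)

    def get(self, query_key: str) -> Optional[List[str]]:
        return self._results.get(query_key)

    def invalidate_memory(self, memory_id: str) -> int:
        """Drop every cached result that references the memory; return the count."""
        keys = self._keys_by_memory.pop(memory_id, set())
        for key in keys:
            self._results.pop(key, None)
        return len(keys)


def update_memory(store: Dict[str, str], cache: RecallCache,
                  memory_id: str, content: str) -> Dict[str, str]:
    store[memory_id] = content            # 1. write to the primary store
    cache.invalidate_memory(memory_id)    # 2. invalidate before acknowledging
    return {"status": "ok", "memory_id": memory_id}  # 3. only now acknowledge
```

Because invalidation happens before the function returns, a caller that reads immediately after the acknowledgement cannot be served a cached result built from the old content.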

Miss & Outage Behavior

Because the cache is an accelerator, a miss or a cache-tier outage is handled silently:

Cache miss. The pipeline computes the value from the primary source (embedding service, vector store, configuration store), uses it, and writes it back to the cache so the next request lands on a hit.
Cache-tier outage. If the cache tier itself is impaired, the platform skips it entirely and serves every request directly from primary stores. Requests are still correct; only recall latency rises, by the embedding and lookup cost that the cache would otherwise have absorbed. You may see the platform-wide degraded health flag during such events.
No bypass on writes. Even when the cache is down, write operations (add, update, delete) still complete fully and durably. The lack of a cache only affects how fast reads return, never whether they are correct.
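
The miss and outage paths follow the familiar cache-aside pattern. The sketch below is a minimal version of it, assuming a hypothetical cache client that raises `CacheUnavailable` when the cache tier is down; `FlakyCache` and `read_through` are names invented for the example.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache tier is down."""


class FlakyCache:
    """In-memory cache whose availability can be toggled for the demo."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.available = True

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise CacheUnavailable()
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise CacheUnavailable()
        self.data[key] = value


def read_through(cache: FlakyCache, key: str,
                 load_from_primary: Callable[[str], str]) -> str:
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached  # hit
    except CacheUnavailable:
        # Outage: skip the cache entirely and serve from the primary store.
        return load_from_primary(key)

    value = load_from_primary(key)  # miss: compute from the primary source
    try:
        cache.set(key, value)  # write back so the next request is a hit
    except CacheUnavailable:
        pass  # a failed write-back costs latency later, never correctness
    return value
```

In every branch the caller receives the value from the primary source or an entry derived from it, so a cache failure can only change how long the read takes.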

Design principle

The cache layer is intentionally orthogonal to correctness. If you ever see a cache-related symptom, it is bounded to latency — your data is never affected.

What You Control

There is no cache configuration surface for customers. Sizing, eviction, replication, and warming are all platform-managed. The two things that do affect your hit rate are entirely under your control:

Query stability. Repeated queries with identical text and identical filters hit the cache. If your application normalizes queries (lower-casing, trimming whitespace) before sending them, you get a meaningfully higher hit rate.
Memory churn rate. The more often you update or delete memories, the more cache invalidation occurs. If your application bulk-updates frequently, expect a lower cache hit rate on reads during the busy window.
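
Client-side query normalization might look like the sketch below. It lower-cases and trims the text as described above and, as an extra assumption, also collapses internal runs of whitespace. The `request_fingerprint` helper is purely a local aid for checking whether two requests would be identical; it is not the platform's cache key.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def normalize_query(text: str) -> str:
    """Lower-case, trim, and collapse internal runs of whitespace."""
    return " ".join(text.lower().split())


def request_fingerprint(text: str,
                        filters: Optional[Mapping[str, Any]] = None) -> str:
    """Stable local fingerprint of a query plus its filters (illustrative)."""
    payload = json.dumps(
        {"q": normalize_query(text), "filters": dict(filters or {})},
        sort_keys=True,            # top-level filter order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sending `normalize_query(user_input)` instead of the raw input means that variants such as "  Reset My Password " and "reset my password" reach the platform as the same text. Skip lower-casing if case is significant for your queries.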

If you have an unusual workload where the default caching behavior is not a clear win (e.g. a project where every query is unique by design), contact support — the platform can adjust the policy for your tenant.
