MemorySync
Retrieval

Performance Optimization

This page covers the knobs that affect recall latency, the caches that sit on the hot path, and the SLA tiers a query can run in. The numbers are taken from the recall and cache service code.

The two caches on the hot path

| Cache | Key | TTL | Hit benefit |
|---|---|---|---|
| Query embedding | emb:{sha256_digest[:24]} | 1 hour | Cuts the query path by 200–400 ms when the same string repeats. |
| Recall result | query:{hash}:{user_id}:{project_id} | 5 minutes | Whole-response cache. Used only when the caller did not supply custom weights. |
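
The key shapes above can be sketched as follows. This is a minimal illustration, not the service code: it assumes the digest is the hex SHA-256 of the raw UTF-8 query string, and that {hash} in the recall key is derived from the query text in the same way. The function names are hypothetical.

```python
import hashlib


def embedding_cache_key(query: str) -> str:
    # Assumption: hex SHA-256 of the UTF-8 query, no normalisation first.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"emb:{digest[:24]}"


def recall_cache_key(query: str, user_id: str, project_id: str) -> str:
    # Assumption: {hash} is a truncated SHA-256 of the query text; the
    # actual hash function is not stated on this page.
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()[:24]
    return f"query:{query_hash}:{user_id}:{project_id}"


def result_cache_applies(custom_weights) -> bool:
    # The whole-response cache is used only when no custom weights are given.
    return custom_weights is None
```

Because the embedding key depends only on the query string, two callers sending the identical string share one cached embedding, while the recall key is scoped to a user and project.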

Top-k and its cost

  • Default k = 5 is the empirically chosen sweet spot.
  • Past k = 20, latency rises roughly linearly with k because reranking runs once per candidate.
  • If you only need to display three results, ask for three; do not request 50 and slice client-side.

SLA tiers and when each runs

| Tier | Target | Searches | When auto-routed |
|---|---|---|---|
| low | < 100 ms | hot only | Short, simple queries. |
| medium | < 500 ms | hot + warm | Default for most traffic. |
| high | < 2 s | all tiers | Long, analytical, multi-hop traversal queries. |

Callers can force a tier with computation_tier. If the SLA is missed, sla_met in the response is false and degraded may be set, but the request still returns whatever it managed to gather.
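
On the client side, a missed SLA is a signal to inspect rather than an error to catch. The sketch below shows one way to classify a response. It assumes the response is a JSON object with boolean sla_met and degraded fields, as named above; the helper name and the three labels are hypothetical.

```python
from typing import Any

# Latency targets in milliseconds, copied from the tier table.
TIER_TARGET_MS = {"low": 100, "medium": 500, "high": 2000}


def classify_recall(response: dict[str, Any]) -> str:
    """Return 'degraded', 'slow' or 'ok' for a recall response."""
    # degraded is optional in the response, so default it to False.
    if response.get("degraded", False):
        return "degraded"
    # A missed SLA still carries whatever results were gathered.
    if not response.get("sla_met", True):
        return "slow"
    return "ok"
```

A caller might log "slow" responses against the target in TIER_TARGET_MS for the tier it requested, and still render the partial results.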

What runs in parallel and what does not

  • Within a single query: cluster centroid load and row-batch fetch overlap.
  • Within a single query: depth-N graph neighbours fan out in parallel within a depth level.
  • Across queries: every request runs on its own coroutine — there is no global lock.
  • Index calls share a connection pool of 100; embedding calls share a separate pool with its own concurrency cap.
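
The per-level fan-out and the shared pool can be pictured with a small asyncio sketch. It is an illustration under assumptions, not the service's traversal code: fetch_neighbours is a hypothetical coroutine standing in for an index call, and a semaphore of 100 stands in for the connection pool.

```python
import asyncio


async def traverse(start, fetch_neighbours, max_depth, pool_size=100):
    """Breadth-first traversal that fetches each depth level concurrently."""
    # The semaphore caps in-flight index calls, like a connection pool.
    limit = asyncio.Semaphore(pool_size)

    async def bounded_fetch(node):
        async with limit:
            return await fetch_neighbours(node)

    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        if not frontier:
            break
        # All nodes at this depth are fetched in parallel; the next depth
        # starts only after every fetch at this level has returned.
        results = await asyncio.gather(*(bounded_fetch(n) for n in frontier))
        next_frontier = []
        for neighbours in results:
            for node in neighbours:
                if node not in seen:
                    seen.add(node)
                    next_frontier.append(node)
        frontier = next_frontier
    return seen
```

Levels remain sequential because depth N + 1 depends on the neighbours found at depth N, which is why each extra level of traversal depth adds at least one round trip.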

Embedding fallbacks and graceful degradation

  • Primary embedding service unavailable → exponential-backoff retries.
  • Retries exhausted → offline transformer if cached, otherwise hash-based deterministic vector.
  • Recall still works: quality drops, but the request never errors.
  • If you see degraded=true together with low semantic_score values, the embedding fallback fired.
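
The fallback chain above can be sketched as follows. This is a hedged illustration: the retry count, the base delay, the vector dimension and the way bytes are mapped to floats are all assumptions, and primary and offline are hypothetical callables standing in for the embedding service and the cached offline transformer.

```python
import hashlib
import time


def hash_vector(text, dim=8):
    """Deterministic stand-in vector: same text, same output, no semantics."""
    values = []
    counter = 0
    while len(values) < dim:
        block = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Map each byte from 0..255 to the range -1.0..1.0.
        values.extend(b / 255.0 * 2.0 - 1.0 for b in block)
        counter += 1
    return values[:dim]


def embed_with_fallback(text, primary, offline=None, retries=3,
                        base_delay=0.05, sleep=time.sleep):
    """Return (vector, degraded). Falls back instead of raising."""
    for attempt in range(retries):
        try:
            return primary(text), False
        except Exception:  # broad on purpose: any primary failure retries
            if attempt < retries - 1:
                # Exponential backoff: base_delay, then double each time.
                sleep(base_delay * 2 ** attempt)
    if offline is not None:
        return offline(text), True
    return hash_vector(text), True
```

A hash-based vector keeps the pipeline running but carries no meaning, which is consistent with the low semantic_score values mentioned above.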

Timeouts and circuit breakers

  • Query embedding timeout is 5 seconds → HTTP 504 if exceeded.
  • Index calls retry on failure (50 → 100 → 200 ms) and surface HTTP 503 after three consecutive failures.
  • When the index breaker is open, requests fail fast for 30 seconds before the next probe.
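
The retry and breaker behaviour can be sketched as a small class. The mapping of the three delays onto three failures is not spelled out above, so this sketch makes an assumption: it waits after each failed attempt using the next delay, and opens the breaker as soon as the consecutive-failure count reaches three. IndexBreaker and IndexUnavailable are hypothetical names; the exception stands in for the HTTP 503.

```python
import time


class IndexUnavailable(Exception):
    """Stands in for the HTTP 503 surfaced to the caller."""


class IndexBreaker:
    def __init__(self, backoff_ms=(50, 100, 200), threshold=3,
                 open_seconds=30.0, clock=time.monotonic, sleep=time.sleep):
        self.backoff_ms = backoff_ms
        self.threshold = threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.sleep = sleep
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        # While open, fail fast without touching the index.
        if (self.opened_at is not None
                and self.clock() - self.opened_at < self.open_seconds):
            raise IndexUnavailable("breaker open")
        for delay_ms in self.backoff_ms:
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    # Open (or re-open after a failed probe) and give up.
                    self.opened_at = self.clock()
                    raise IndexUnavailable("index failed repeatedly")
                self.sleep(delay_ms / 1000)
            else:
                # Success closes the breaker and resets the count.
                self.failures = 0
                self.opened_at = None
                return result
        raise IndexUnavailable("retries exhausted")
```

After the 30-second window, the next call acts as the probe: one success closes the breaker, and one failure re-opens it for another window.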

What not to do if you want fast recall

  • Do not request k = 50 for a UI that shows three.
  • Do not pass custom weights on every call — that disables the result cache.
  • Do not ask for traversal_depth = 3 by default; depth 1 is plenty for most product surfaces.
  • Do not put queries longer than a paragraph through POST /memory/query — embed and pre-summarise on your side first.
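
A request builder with latency-friendly defaults is one way to keep these rules from being forgotten. The sketch below only assembles a payload for POST /memory/query and does not send it. The field names query, k and weights are assumptions, since this page does not show the request schema; traversal_depth and computation_tier are the names used above.

```python
from typing import Any, Optional


def build_recall_request(query: str,
                         visible_results: int = 5,
                         traversal_depth: int = 1,
                         weights: Optional[dict] = None,
                         computation_tier: Optional[str] = None) -> dict[str, Any]:
    """Assemble a recall payload that asks only for what will be shown."""
    if visible_results < 1:
        raise ValueError("visible_results must be at least 1")
    payload: dict[str, Any] = {
        "query": query,
        "k": visible_results,                # request what the UI displays
        "traversal_depth": traversal_depth,  # depth 1 unless the caller opts in
    }
    # Omit weights unless supplied, so the result cache stays usable.
    if weights is not None:
        payload["weights"] = weights
    # Omit the tier unless forced, so auto-routing can pick one.
    if computation_tier is not None:
        payload["computation_tier"] = computation_tier
    return payload
```

For a UI that shows three results, build_recall_request("deploy notes", visible_results=3) produces a payload with k set to 3 and no weights key.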