Retrieval
Performance Optimization
Knobs that affect recall latency, the caches that sit on the hot path, and the SLA tiers a query can run in. Numbers are taken from the recall and cache services in code.
The two caches on the hot path
| Cache | Key | TTL | Hit benefit |
|---|---|---|---|
| Query embedding | emb:{sha256_digest[:24]} | 1 hour | Cuts the query path by 200–400 ms when the same string repeats. |
| Recall result | query:{hash}:{user_id}:{project_id} | 5 minutes | Whole-response cache. Used only when the caller did not supply custom weights. |
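The key formats in the table can be sketched as follows. This is a minimal illustration, not the service's code: it assumes the digest is the hex SHA-256 of the UTF-8 query string, that the `{hash}` in the result key is built the same way, and that no normalisation is applied first. The function names and the `weights` parameter shape are hypothetical.

```python
import hashlib
from typing import Optional

EMBEDDING_TTL_S = 60 * 60  # 1 hour, from the table above
RESULT_TTL_S = 5 * 60      # 5 minutes, from the table above


def embedding_cache_key(query: str) -> str:
    # Assumption: hex SHA-256 of the raw UTF-8 query, truncated to 24 characters.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"emb:{digest[:24]}"


def result_cache_key(
    query: str,
    user_id: str,
    project_id: str,
    weights: Optional[dict] = None,
) -> Optional[str]:
    # Custom weights bypass the whole-response cache, so no key is produced.
    if weights is not None:
        return None
    # Assumption: the {hash} component is also a truncated SHA-256 of the query.
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()[:24]
    return f"query:{query_hash}:{user_id}:{project_id}"
```

Because the result key includes user_id and project_id, two users issuing the same string would share an embedding cache entry but never a result cache entry.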
Top-k and its cost
- Default k = 5 is the empirically chosen sweet spot.
- Past k = 20, latency rises roughly linearly with k because reranking is per-candidate.
- If you only need to display three results, ask for three; do not request 50 and slice client-side.
SLA tiers and when each runs
| Tier | Target | Searches | When auto-routed |
|---|---|---|---|
| low | < 100 ms | hot only | Short, simple queries. |
| medium | < 500 ms | hot + warm | Default for most traffic. |
| high | < 2 s | all tiers | Long, analytical, multi-hop traversal queries. |
Callers can force a tier with computation_tier. If the SLA is missed, sla_met in the response is false and degraded may be set, but the request still returns whatever it managed to gather.
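The relationship between a tier's target and the sla_met flag can be sketched like this. The targets come from the table; the `RecallOutcome` type and `summarize` helper are hypothetical names used only for illustration, and the real service may measure elapsed time differently.

```python
from dataclasses import dataclass
from typing import List

# Latency targets per tier in milliseconds, from the table above.
TIER_TARGET_MS = {"low": 100, "medium": 500, "high": 2000}


@dataclass
class RecallOutcome:
    results: List[str]
    sla_met: bool
    degraded: bool


def summarize(
    tier: str,
    elapsed_ms: float,
    results: List[str],
    degraded: bool = False,
) -> RecallOutcome:
    # A missed target flips sla_met but never discards what was gathered.
    target = TIER_TARGET_MS[tier]
    return RecallOutcome(
        results=results,
        sla_met=elapsed_ms < target,
        degraded=degraded,
    )
```

A caller would typically treat sla_met = false as a signal to log or alert, not as a failure, since the partial results are still returned.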
What runs in parallel and what does not
- Within a single query: cluster centroid load and row-batch fetch overlap.
- Within a single query: depth-N graph neighbours fan out in parallel within a depth level.
- Across queries: every request runs on its own coroutine — there is no global lock.
- Index calls share a connection pool of 100; embedding calls share a separate pool with its own concurrency cap.
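A rough sketch of the per-depth fan-out under a shared concurrency cap, using asyncio. The in-memory `graph` dict stands in for the index, the semaphore stands in for the pool of 100, and all function names are hypothetical; the actual service will differ.

```python
import asyncio
from typing import Dict, List, Set

INDEX_POOL_SIZE = 100  # pool size quoted in the list above


async def fetch_neighbours(
    node: str,
    graph: Dict[str, List[str]],
    index_slots: asyncio.Semaphore,
) -> List[str]:
    # Each lookup holds one slot, so at most INDEX_POOL_SIZE run at once.
    async with index_slots:
        await asyncio.sleep(0)  # stand-in for the index round trip
        return graph.get(node, [])


async def traverse(
    start: str,
    graph: Dict[str, List[str]],
    depth: int,
    index_slots: asyncio.Semaphore,
) -> Set[str]:
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        # All nodes at one depth level are fetched concurrently;
        # levels themselves run one after another.
        batches = await asyncio.gather(
            *(fetch_neighbours(n, graph, index_slots) for n in frontier)
        )
        next_frontier = []
        for batch in batches:
            for neighbour in batch:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


def run_traversal(start: str, graph: Dict[str, List[str]], depth: int) -> Set[str]:
    async def _main() -> Set[str]:
        slots = asyncio.Semaphore(INDEX_POOL_SIZE)
        return await traverse(start, graph, depth, slots)

    return asyncio.run(_main())
```

Each additional depth level adds one sequential round of lookups, which is why deeper traversals cost more even though the lookups within a level overlap.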
Embedding fallbacks and graceful degradation
- Primary embedding service unavailable → exponential-backoff retries.
- Retries exhausted → offline transformer if cached, otherwise hash-based deterministic vector.
- Recall still works — quality drops, the request never errors.
- If you see degraded=true together with low semantic_score values, the embedding fallback fired.
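The fallback chain above could look roughly like the sketch below. The retry count, delays, vector dimension and the way the hash vector is derived are all assumptions made for illustration; `primary` and `offline` are hypothetical callables standing in for the embedding service and the cached offline transformer.

```python
import hashlib
import struct
import time
from typing import Callable, List, Optional, Tuple

Embedder = Callable[[str], List[float]]


def hash_vector(text: str, dim: int = 8) -> List[float]:
    """Deterministic stand-in embedding: the same text always yields the same vector."""
    values: List[float] = []
    counter = 0
    while len(values) < dim:
        block = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # 32 digest bytes become eight unsigned 32-bit ints scaled into [-1, 1].
        for (n,) in struct.iter_unpack(">I", block):
            values.append(n / 0xFFFFFFFF * 2.0 - 1.0)
        counter += 1
    return values[:dim]


def embed_with_fallback(
    text: str,
    primary: Embedder,
    offline: Optional[Embedder] = None,
    retries: int = 3,            # assumption: not stated in the text
    base_delay_s: float = 0.05,  # assumption: not stated in the text
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[float], bool]:
    """Return (vector, degraded). degraded is True when a fallback was used."""
    for attempt in range(retries):
        try:
            return primary(text), False
        except Exception:
            # Exponential backoff between attempts, none after the last one.
            if attempt < retries - 1:
                sleep(base_delay_s * 2 ** attempt)
    if offline is not None:
        return offline(text), True
    return hash_vector(text), True
```

A hash-derived vector carries no semantic signal, which is consistent with the low semantic_score values that accompany degraded=true.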
Timeouts and circuit breakers
- Query embedding timeout is 5 seconds → HTTP 504 if exceeded.
- Index calls retry on failure (50 → 100 → 200 ms) and surface HTTP 503 after three consecutive failures.
- When the index breaker is open, requests fail fast for 30 seconds before the next probe.
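A minimal circuit-breaker sketch matching the numbers above. It assumes the breaker opens on the third consecutive failure, which the text implies but does not state outright; the class and method names are hypothetical.

```python
import time
from typing import Callable, Optional


class IndexBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,   # three consecutive failures
        open_seconds: float = 30.0,   # fail-fast window before the next probe
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        # Closed breaker: every request goes through.
        if self.opened_at is None:
            return True
        # Open breaker: fail fast until the window has elapsed, then probe.
        return self.clock() - self.opened_at >= self.open_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # A failed probe lands here too and restarts the window.
            self.opened_at = self.clock()
```

In this sketch a caller checks allow() before each index call and returns HTTP 503 immediately when it is false, so a struggling index is not hit with further retries during the 30-second window.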
What not to do if you want fast recall
- Do not request k = 50 for a UI that shows three.
- Do not pass custom weights on every call; that disables the result cache.
- Do not ask for traversal_depth = 3 by default; depth 1 is plenty for most product surfaces.
- Do not put queries longer than a paragraph through POST /memory/query; embed and pre-summarise on your side first.
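Taken together, the advice above suggests a request body along these lines. The parameter names k, weights and traversal_depth appear in this document, but their use as literal JSON fields, the "query" field and the `build_recall_request` helper are assumptions for illustration, not the documented request schema.

```python
from typing import Any, Dict, Optional


def build_recall_request(
    query: str,
    visible_results: int,
    weights: Optional[Dict[str, float]] = None,
    traversal_depth: int = 1,
) -> Dict[str, Any]:
    # Ask for exactly as many results as the UI will show.
    payload: Dict[str, Any] = {
        "query": query,            # assumed field name
        "k": visible_results,
        "traversal_depth": traversal_depth,
    }
    # Omit weights entirely unless the caller really needs them,
    # so the request stays eligible for the result cache.
    if weights is not None:
        payload["weights"] = weights
    return payload
```

For a three-result widget, build_recall_request("billing errors last week", 3) produces a body with k = 3, depth 1 and no weights key.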