Performance Optimization

This page covers the knobs that affect recall latency, the caches that sit on the hot path, and the SLA tiers a query can run in. The numbers are taken from the recall and cache service code.

The two caches on the hot path

Cache	Key	TTL	Hit benefit
Query embedding	`emb:{sha256_digest[:24]}`	1 hour	Cuts the query path by 200–400 ms when the same string repeats.
Recall result	`query:{hash}:{user_id}:{project_id}`	5 minutes	Whole-response cache. Used only when the caller did not supply custom `weights`.
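
As a rough illustration of the two key formats in the table, the sketch below builds both keys. It assumes the digest is the hex SHA-256 of the UTF-8 query text and that `{hash}` in the result key is derived the same way; neither detail is stated above, and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def embedding_cache_key(query: str) -> str:
    # Assumption: the digest is the hex SHA-256 of the UTF-8 query text.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"emb:{digest[:24]}"


def result_cache_key(query: str, user_id: str, project_id: str,
                     weights: Optional[dict] = None) -> Optional[str]:
    # Custom weights bypass the whole-response cache, so no key is produced.
    if weights is not None:
        return None
    # Assumption: {hash} is computed from the query text like the embedding key.
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()[:24]
    return f"query:{query_hash}:{user_id}:{project_id}"
```

Because the result key includes `user_id` and `project_id`, two users sending the same string can share a cached embedding but not a cached response.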

Top-k and its cost

Default k = 5 is the empirically chosen sweet spot.
Past k = 20, latency rises roughly linearly with k because each candidate is reranked individually.
If you only need to display three results, ask for three; do not request 50 and slice client-side.

SLA tiers and when each runs

Tier	Target	Searches	When auto-routed
`low`	< 100 ms	hot only	Short, simple queries.
`medium`	< 500 ms	hot + warm	Default for most traffic.
`high`	< 2 s	all tiers	Long, analytical, multi-hop traversal queries.

Callers can force a tier with `computation_tier`. If the SLA is missed, `sla_met` in the response is `false` and `degraded` may be set, but the request still returns whatever it managed to gather.
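
The "return what was gathered" behaviour can be pictured with a deadline-bounded gather. This is a minimal sketch, not the service's code: `recall_with_sla` and the shape of `searches` are hypothetical, and setting `degraded` whenever a search is cut off is an assumption (the text above only says it may be set).

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List

# Targets from the tier table, in seconds.
SLA_TARGET_SECONDS: Dict[str, float] = {"low": 0.1, "medium": 0.5, "high": 2.0}


async def recall_with_sla(
    searches: List[Callable[[], Awaitable[List[str]]]], tier: str
) -> dict:
    """Run all searches concurrently and keep whatever finishes in time."""
    target = SLA_TARGET_SECONDS[tier]
    start = time.monotonic()
    tasks = [asyncio.create_task(search()) for search in searches]
    done, pending = await asyncio.wait(tasks, timeout=target)

    # Cancel searches that overran the target and wait for them to unwind.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    results: List[str] = []
    for task in tasks:  # iterate in submission order for stable output
        if task in done and task.exception() is None:
            results.extend(task.result())

    return {
        "results": results,
        "sla_met": not pending,
        "degraded": bool(pending),  # assumption: cut-off searches mark the response degraded
        "elapsed_ms": (time.monotonic() - start) * 1000,
    }
```

A caller would check `sla_met` before trusting completeness, rather than treating a slow response as a failure.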

What runs in parallel and what does not

Within a single query: cluster centroid load and row-batch fetch overlap.
Within a single query: depth-N graph neighbours fan out in parallel within a depth level.
Across queries: every request runs on its own coroutine — there is no global lock.
Index calls share a connection pool of 100; embedding calls share a separate pool with its own concurrency cap.
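
The per-level fan-out can be sketched with `asyncio.gather` and a semaphore standing in for the connection pool. The names `traverse` and `fetch_neighbours` are hypothetical, and using a semaphore of 100 to model the pool is an assumption rather than a description of the actual implementation.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Set


async def traverse(
    seeds: Iterable[str],
    fetch_neighbours: Callable[[str], Awaitable[List[str]]],
    depth: int,
    max_concurrency: int = 100,  # stands in for the index connection pool
) -> Set[str]:
    limiter = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(node: str) -> List[str]:
        async with limiter:
            return await fetch_neighbours(node)

    seen: Set[str] = set(seeds)
    frontier: List[str] = list(seen)
    for _ in range(depth):
        if not frontier:
            break
        # All nodes at this depth are fetched concurrently; the next depth
        # starts only after the whole level has returned.
        batches = await asyncio.gather(*(bounded_fetch(n) for n in frontier))
        next_frontier: List[str] = []
        for batch in batches:
            for neighbour in batch:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen
```

Levels are sequential while nodes within a level are parallel, which is why each extra unit of traversal depth adds at least one more round trip.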

Embedding fallbacks and graceful degradation

Primary embedding service unavailable → exponential-backoff retries.
Retries exhausted → offline transformer if cached, otherwise hash-based deterministic vector.
Recall still works: quality drops, but the request never errors.
If a response has `degraded=true` together with low `semantic_score` values, the embedding fallback fired.
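
The fallback chain above can be sketched as follows. The retry count, the base delay, the vector dimension, and the way the hash-based vector is built are all assumptions chosen for illustration; `embed_with_fallback` and `hash_vector` are hypothetical names.

```python
import hashlib
import time
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def hash_vector(text: str, dim: int = 8) -> Vector:
    # Deterministic stand-in embedding: the same text always maps to the
    # same vector, but the vector carries no semantic meaning.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def embed_with_fallback(
    text: str,
    primary: Callable[[str], Vector],
    offline: Optional[Callable[[str], Vector]] = None,
    retries: int = 3,            # assumption: not specified in the text
    base_delay: float = 0.05,    # assumption: not specified in the text
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Vector, bool]:
    """Return (vector, degraded)."""
    for attempt in range(retries):
        try:
            return primary(text), False
        except Exception:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Retries exhausted: prefer the offline model, else the hash vector.
    if offline is not None:
        return offline(text), True
    return hash_vector(text), True
```

A hash-derived vector has no semantic relationship to the text, which is consistent with the low `semantic_score` values seen when the fallback fires.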

Timeouts and circuit breakers

Query embedding timeout is 5 seconds → HTTP 504 if exceeded.
Index calls retry on failure (50 → 100 → 200 ms) and surface HTTP 503 after three consecutive failures.
When the index breaker is open, requests fail fast for 30 seconds before the next probe.
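
A minimal breaker sketch for the index path is shown below. How the retry schedule and the three-failure threshold interact is not spelled out above, so this version assumes one initial attempt followed by three retries spaced 50, 100, and 200 ms, with the breaker opening once all of them fail. `IndexBreaker` and `IndexUnavailable` are hypothetical names.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class IndexUnavailable(Exception):
    """Would be surfaced to the caller as HTTP 503 in this sketch."""


class IndexBreaker:
    def __init__(self, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.open_seconds = open_seconds
        self.clock = clock
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T],
             delays: Sequence[float] = (0.05, 0.1, 0.2),
             sleep: Callable[[float], None] = time.sleep) -> T:
        # While the breaker is open, fail fast without touching the index.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                raise IndexUnavailable("breaker open, failing fast")
            # The open window has elapsed: this call acts as the probe.

        # One initial attempt, then one retry after each delay.
        for delay in (*delays, None):
            try:
                result = fn()
            except Exception:
                if delay is None:
                    break  # last retry failed
                sleep(delay)
            else:
                self.opened_at = None  # success closes the breaker
                return result

        self.opened_at = self.clock()
        raise IndexUnavailable("index retries exhausted")
```

Failing fast while the breaker is open keeps a struggling index from being hit with a full retry schedule on every request.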

What not to do if you want fast recall

Do not request k = 50 for a UI that shows three.
Do not pass custom `weights` on every call; that disables the result cache.
Do not ask for `traversal_depth = 3` by default; depth 1 is plenty for most product surfaces.
Do not put queries longer than a paragraph through `POST /memory/query`; embed and pre-summarise on your side first.
