MemorySync
Debugging

Slow Queries

Memory queries pass through a multi-stage retrieval pipeline that turns a raw text query into a ranked, deduplicated, token-budgeted result set. Understanding what each stage does is the key to diagnosing why a query is slow and where the time is being spent.

The Retrieval Pipeline

Every query passes through these stages in order. Most queries are dominated by the first two (query embedding + vector search) — the rest are fast in-memory operations:

  1. Embed the query — convert your text query into a vector representation.
  2. Vector search — pull a candidate set from the vector index, scoped to the active user, environment, and project.
  3. Filter — drop deleted memories and those below the importance threshold.
  4. Conflict-resolve — for memories with the same fact identity, keep only the most recently updated.
  5. Rank — score remaining candidates by relevance, recency, and importance.
  6. Semantic dedup — collapse near-duplicate results.
  7. Trim to top_k — the platform enforces a maximum results-per-query cap.
  8. Token budget — trim the final list to fit within the token budget without splitting any individual memory.
  9. Record recall signals — mark which memories actually appeared in results so future queries can prefer the ones you actively use.
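
The stage order above can be pictured as a list of functions applied one after another. The following is only a sketch with toy stand-ins for three of the in-memory stages; `run_stages`, the field names, and the threshold values are assumptions made for illustration, not the platform's implementation. Timing each stage separately is one way to see where a pipeline of this shape spends its time.

```python
import time


def run_stages(stages, candidates):
    """Run named stages in order and record wall-clock seconds per stage."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        candidates = stage(candidates)
        timings[name] = time.perf_counter() - start
    return candidates, timings


# Hypothetical values; the real thresholds are not documented here.
IMPORTANCE_FLOOR = 0.3
TOP_K = 2

stages = [
    # Stage 3: drop deleted and low-importance memories.
    ("filter", lambda cs: [c for c in cs
                           if not c["deleted"] and c["importance"] >= IMPORTANCE_FLOOR]),
    # Stage 5: order by a precomputed score, best first.
    ("rank", lambda cs: sorted(cs, key=lambda c: c["score"], reverse=True)),
    # Stage 7: keep at most TOP_K results.
    ("trim_top_k", lambda cs: cs[:TOP_K]),
]

candidates = [
    {"id": "a", "deleted": True, "importance": 0.9, "score": 0.95},
    {"id": "b", "deleted": False, "importance": 0.8, "score": 0.90},
    {"id": "c", "deleted": False, "importance": 0.5, "score": 0.70},
    {"id": "d", "deleted": False, "importance": 0.1, "score": 0.85},
    {"id": "e", "deleted": False, "importance": 0.6, "score": 0.40},
]

results, timings = run_stages(stages, candidates)
```

In this toy run, "a" is dropped as deleted and "d" falls below the importance floor, so the two returned results are "b" and "c".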

Why a Wider Candidate Pool Helps

The platform always pulls a wider candidate pool from the vector index than the number of results you ultimately receive. This headroom lets the filtering, conflict-resolution, and dedup stages remove poor matches and stale duplicates without leaving you with too few results.

Performance implication: The candidate-pool size is the same regardless of how many results you request, so requesting fewer results does not make the query faster — the savings are only in post-processing, which is negligible compared to the embedding and vector-search steps.
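
A small sketch of that behaviour, assuming a fixed pool size: `CANDIDATE_POOL_SIZE`, `retrieve`, and the `keep` predicate are hypothetical names, and the actual pool size is not stated in this section. The point is only that the search step touches the same number of candidates whatever `top_k` is.

```python
CANDIDATE_POOL_SIZE = 50  # hypothetical; the real value is not documented here


def retrieve(index_hits, top_k, keep):
    """Pull a fixed-size pool, filter it, then trim to top_k."""
    pool = index_hits[:CANDIDATE_POOL_SIZE]      # same size for any top_k
    survivors = [hit for hit in pool if keep(hit)]
    return survivors[:top_k], len(pool)


hits = list(range(200))             # stand-in for vector-index hits, best first


def is_even(hit):
    return hit % 2 == 0             # stand-in for the filter stages


few, pool_for_few = retrieve(hits, top_k=5, keep=is_even)
many, pool_for_many = retrieve(hits, top_k=20, keep=is_even)
```

Both calls examine a pool of 50 candidates; only the final slice differs, which is why a smaller `top_k` saves almost nothing in a design like this.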

How Candidates Are Ranked

After filtering, remaining candidates are scored on three factors:

  • Semantic similarity — how close a memory’s vector is to your query’s vector.
  • Recency — more recently updated memories are preferred.
  • Importance — memories the platform considers high-value rank above low-value ones.

For most queries, semantic similarity dominates. For queries that look like preference questions (e.g., “what does the user prefer”), the importance signal is given more weight; for queries that ask about recent events, recency is given more weight. This adaptation is transparent and does not add measurable latency.
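
One common way to combine three signals like these is a weighted sum whose weights depend on the query type. The sketch below assumes that form; the weight values, the keyword checks in `pick_weights`, and the assumption that every signal is already normalized to the range 0 to 1 are all illustrative, since the platform's actual formula is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    similarity: float
    recency: float
    importance: float


# Hypothetical weight profiles, chosen only to illustrate the idea.
DEFAULT = Weights(similarity=0.7, recency=0.15, importance=0.15)
PREFERENCE = Weights(similarity=0.5, recency=0.1, importance=0.4)
RECENT_EVENTS = Weights(similarity=0.5, recency=0.4, importance=0.1)


def pick_weights(query):
    """Very rough query classification by keyword (an assumption)."""
    q = query.lower()
    if "prefer" in q:
        return PREFERENCE
    if "recent" in q or "latest" in q:
        return RECENT_EVENTS
    return DEFAULT


def score(similarity, recency, importance, weights):
    """Weighted sum of three signals, each assumed to lie in [0, 1]."""
    return (weights.similarity * similarity
            + weights.recency * recency
            + weights.importance * importance)


# Memory A is the closer match; memory B is the more important one.
a = dict(similarity=0.9, recency=0.5, importance=0.2)
b = dict(similarity=0.7, recency=0.5, importance=0.9)
```

Under the default profile A outranks B because similarity dominates; under the preference profile the order flips because importance carries more weight.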

Filters That Reduce Results

Several filters apply between the start of the vector search and the final result set: the scope filters apply during the search itself, and the rest run afterward. Each one can eliminate candidates, potentially leaving you with fewer results than the top_k you requested:

| Filter | What It Removes |
| --- | --- |
| Importance floor | Low-importance memories that were stored but not deemed valuable enough for retrieval |
| Deleted memories | Memories that have been explicitly deleted, superseded, or expired by retention |
| Environment scope | Memories from a different environment (for example, development memories never appear in production queries) |
| Project scope | Memories from a different project |
| Conflict resolution | Older versions of the same fact; only the most recently updated is kept |
| Semantic dedup | Near-duplicate memories that would otherwise waste result slots |

💡 Key insight: If your query returns fewer results than top_k, it’s not a performance problem — the filters correctly eliminated low-quality or duplicate candidates. An empty result set usually means nothing matched the active environment and project scope, not that the system is slow.
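
To make the attrition concrete, here is a toy model that applies a few of these filters in sequence and counts how many candidates each one removes. The field names (`env`, `project`, `deleted`, `importance`), the floor value, and `apply_filters` itself are assumptions for illustration; the platform does not necessarily expose such counts.

```python
def apply_filters(candidates, filters):
    """Apply (name, keep_predicate) pairs in order; count removals per filter."""
    removed = {}
    survivors = list(candidates)
    for name, keep in filters:
        kept = [c for c in survivors if keep(c)]
        removed[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, removed


IMPORTANCE_FLOOR = 0.3  # hypothetical threshold

filters = [
    ("environment_scope", lambda c: c["env"] == "production"),
    ("project_scope", lambda c: c["project"] == "p1"),
    ("deleted", lambda c: not c["deleted"]),
    ("importance_floor", lambda c: c["importance"] >= IMPORTANCE_FLOOR),
]

candidates = [
    {"id": 1, "env": "production", "project": "p1", "deleted": False, "importance": 0.9},
    {"id": 2, "env": "development", "project": "p1", "deleted": False, "importance": 0.9},
    {"id": 3, "env": "production", "project": "p2", "deleted": False, "importance": 0.9},
    {"id": 4, "env": "production", "project": "p1", "deleted": True, "importance": 0.9},
    {"id": 5, "env": "production", "project": "p1", "deleted": False, "importance": 0.1},
]

survivors, removed = apply_filters(candidates, filters)
```

Five candidates go in and one comes out, with each filter removing exactly one. A short result set of this kind reflects the data, not the query speed.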

Token Budget

After ranking and dedup, the final result set is trimmed to fit within a token budget so it stays small enough to be useful as LLM context.

How it works: The platform walks through the ranked results in order and adds them to the response until adding the next memory would exceed the budget. A memory is never split mid-content — it is either included in full or excluded entirely.
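
The walk described above is a greedy prefix selection. A minimal sketch follows, assuming a whitespace word count as a stand-in tokenizer (`count_tokens` is hypothetical; the platform's tokenizer is not specified) and assuming the walk stops at the first memory that does not fit, which is one reading of the description.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_budget(ranked_memories, budget):
    """Take whole memories in rank order until the next one would exceed the budget."""
    selected = []
    used = 0
    for memory in ranked_memories:
        cost = count_tokens(memory)
        if used + cost > budget:
            break                 # never split a memory; stop at the first misfit
        selected.append(memory)
        used += cost
    return selected


ranked = [
    "User prefers dark theme",            # 4 tokens
    "User lives in Berlin since 2023",    # 6 tokens
    "Timezone is CET",                    # 3 tokens
]
```

With a budget of 8, only the first memory is returned: the second would bring the total to 10, and the walk stops there without considering the third.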

How this affects query speed: The budget itself does not add latency. It does govern response payload size, which can affect network transfer time for clients on slow connections.

Tuning tip: If your use case needs more context, request a larger token budget. If your use case is latency-sensitive and you only need one or two key facts, request a smaller top_k with a smaller budget — the response stays compact and parses faster on your side.

Two Layers of Deduplication

Two separate deduplication passes run on the candidate set so the platform never wastes a result slot on near-duplicates:

Conflict resolution. When two memories represent the same underlying fact, only the most recently updated one is kept. Example: if a user’s preferred city was “London” and is later updated to “Berlin”, only “Berlin” survives.
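
As a sketch, conflict resolution of this kind amounts to grouping by a fact key and keeping the newest entry per group. The field names `fact_id` and `updated_at` are assumptions; how the platform actually identifies "the same fact" is not described here.

```python
from datetime import datetime


def resolve_conflicts(memories):
    """Keep only the most recently updated memory for each fact identity."""
    latest = {}
    for memory in memories:
        key = memory["fact_id"]
        current = latest.get(key)
        if current is None or memory["updated_at"] > current["updated_at"]:
            latest[key] = memory
    return list(latest.values())


memories = [
    {"fact_id": "user.preferred_city", "value": "London",
     "updated_at": datetime(2024, 1, 1)},
    {"fact_id": "user.preferred_city", "value": "Berlin",
     "updated_at": datetime(2024, 6, 1)},
    {"fact_id": "user.ui_theme", "value": "dark",
     "updated_at": datetime(2024, 3, 1)},
]

resolved = resolve_conflicts(memories)
```

Here "London" is discarded in favour of the later "Berlin", while the unrelated theme memory is untouched.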

Semantic dedup. Pairs of remaining candidates that mean the same thing in slightly different words are collapsed to whichever scored higher. Example: “User likes dark theme” and “Preferred UI theme: dark” will not both appear in your results.
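
One plausible way to implement such a pass is a cosine-similarity threshold over the candidates' vectors, walking the list best-first so the higher-scored member of each pair survives. The threshold value, the two-dimensional toy vectors, and the function names below are assumptions; the platform's actual similarity measure and cutoff are not documented in this section.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_dedup(ranked, threshold=0.95):
    """Keep a candidate only if it is not too similar to any already-kept one.

    `ranked` must be sorted best-first, so the higher-scored duplicate wins.
    """
    kept = []
    for candidate in ranked:
        if all(cosine(candidate["vector"], k["vector"]) < threshold for k in kept):
            kept.append(candidate)
    return kept


ranked = [
    {"text": "User likes dark theme", "score": 0.9, "vector": [1.0, 0.0]},
    {"text": "Preferred UI theme: dark", "score": 0.8, "vector": [0.99, 0.05]},
    {"text": "User lives in Berlin", "score": 0.7, "vector": [0.0, 1.0]},
]

deduped = semantic_dedup(ranked)
```

The two theme memories point in almost the same direction, so only the higher-scored one is kept alongside the unrelated Berlin memory.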

What Makes Queries Slow

In practice, query latency is dominated by two external operations, query embedding and vector search. Everything else is fast in-memory work:

  • Query embedding — turns your text query into a vector. Typically the largest single contributor to wall-clock latency. Very long queries are a little slower than short ones.
  • Vector search — retrieves the candidate pool. Latency grows with the size of your tenant’s memory collection.
  • Filtering, ranking, and dedup — in-memory work over a small candidate set. Negligible.

Cold start: The very first query after a long idle period may be slower than subsequent queries while internal connections warm up. Sustained traffic eliminates this overhead.

✅ Quick check: Measure total round-trip time end-to-end (for example, with curl -w '%{time_total}\n'). If it is consistently slow, check GET /health. If every component reports healthy, the bottleneck is most likely the network or your tenant-specific data shape, such as a very large memory collection.
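
If you prefer to measure from application code, a small harness like the one below separates the first call (which may include cold-start overhead) from the rest. This is a sketch: `time_calls` and `summarize` are hypothetical helpers, and `call` stands for whatever function sends your query, since the query endpoint is not given in this section.

```python
import statistics
import time


def time_calls(call, repeats=5):
    """Invoke `call` repeatedly and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report the first call separately from the median of the remaining calls."""
    rest = durations[1:] or durations
    return {"first": durations[0], "median_rest": statistics.median(rest)}
```

Pass in a zero-argument function that performs one query, for example a wrapper around `urllib.request.urlopen` pointed at your own query URL. A "first" value far above "median_rest" suggests warm-up cost; a uniformly high median points at the network or the size of the collection.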