Memory Query Flow

This page traces a single POST /memory/query from the inbound request to the ranked response. The query path never changes memory content; its only database writes are the recall signals recorded in the final stage. Along the way it touches the cache, the embedding service, the vector index, and the database (for hydration), in that order.

The six stages of one query

1. Validate & resolve scope. Auth, project, and tenant are resolved exactly as on the write path.
2. Embed the query string. The same embedding service used at write time generates the query vector. The cache is checked first; on a miss, the service is called.
3. Vector recall. The vector index searches the per-user namespace and returns the top-K candidates ranked by cosine similarity, filtered by retention_status="active", environment, project_id, and any caller-supplied filters.
4. Hydrate. One database query, SELECT … WHERE vector_id IN (…), pulls the full memories rows for the candidates. Decryption happens here.
5. Rank. A weighted formula combines semantic similarity, recency, importance, usage, quality, consistency, and decay into a single score.
6. Bookkeep. Recall signals (usage_count++, last_accessed_at=now()) are written for the returned memories so future ranking and tier transitions can use them.
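
The cache-then-embed behaviour of stage 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's implementation: the function name `embed_with_cache`, the dict-backed cache, and the use of the raw query string as the cache key are all hypothetical.

```python
from typing import Callable, Dict, List


def embed_with_cache(
    query: str,
    cache: Dict[str, List[float]],
    embed_fn: Callable[[str], List[float]],
) -> List[float]:
    """Return the query vector, calling the embedding service only on a miss."""
    # Assumption: the raw query string is the cache key.
    cached = cache.get(query)
    if cached is not None:
        return cached
    vector = embed_fn(query)  # stands in for the real embedding service call
    cache[query] = vector
    return vector
```

Passing the embedding call in as `embed_fn` keeps the sketch independent of any particular embedding client.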

What filters are honoured at recall, not after

Filter	Where it applies	Allowed values
`tier`	Vector index metadata filter; never returns wrong-tier rows.	`"hot"` | `"warm"` | `"cold"` (or omitted = all)
`tags`	Vector index metadata filter; exact match on any listed tag.	Array of strings.
`session_id`	Vector index metadata filter.	String.
`metadata_filter`	Post-hydration filter on the database row.	Free-form JSON match.
`environment`	Always implicit; taken from the credential, cannot be set in the body.	`development` | `staging` | `production` (set by the credential)
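
The split between index-level and post-hydration filters in the table above could look roughly like this. It is a sketch only: `split_filters` and its parameters are hypothetical, and rejecting a body-supplied `environment` with an error (rather than silently ignoring it) is an assumption, since the page only says the field cannot be set in the body.

```python
from typing import Any, Dict, Mapping, Optional, Tuple

ALLOWED_TIERS = {"hot", "warm", "cold"}


def split_filters(
    body: Mapping[str, Any],
    credential_environment: str,
    project_id: str,
) -> Tuple[Dict[str, Any], Optional[Any]]:
    """Return (vector-index metadata filter, post-hydration metadata_filter)."""
    if "environment" in body:
        # Assumption: a body-supplied environment is an error, not a no-op.
        raise ValueError("environment comes from the credential, not the body")

    # Filters that are always applied at recall time.
    index_filter: Dict[str, Any] = {
        "retention_status": "active",
        "environment": credential_environment,
        "project_id": project_id,
    }

    tier = body.get("tier")
    if tier is not None:  # omitted tier means all tiers
        if tier not in ALLOWED_TIERS:
            raise ValueError(f"unknown tier: {tier!r}")
        index_filter["tier"] = tier
    if body.get("tags"):
        index_filter["tags"] = list(body["tags"])
    if "session_id" in body:
        index_filter["session_id"] = body["session_id"]

    # metadata_filter is not pushed to the index; it runs on hydrated rows.
    return index_filter, body.get("metadata_filter")
```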

The ranking formula, in plain English

The final score is a weighted sum of seven signals, each normalised to [0, 1]:

Signal	Default weight	What it measures
Semantic similarity	~0.50	Cosine similarity between query vector and stored vector.
Recency	~0.20	Decays smoothly with age. Newer wins ties.
Importance	~0.15	`importance` from creation, updated by the intelligence pipeline.
Usage	~0.05	`usage_count` normalised by tenant percentile.
Quality	~0.05	`quality_score` from the semantic engine.
Consistency	~0.025	`consistency_score` — penalised when contradicted.
Decay	~0.025	`decay_score` — drops on long no-recall stretches.
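
As a worked illustration of the weighted sum, the sketch below uses the approximate default weights from the table, which add up to 1.0. The names `DEFAULT_WEIGHTS` and `final_score` are hypothetical, and treating a missing signal as 0 is an assumption made for the example.

```python
from typing import Mapping

# Approximate default weights taken from the table above.
DEFAULT_WEIGHTS = {
    "semantic": 0.50,
    "recency": 0.20,
    "importance": 0.15,
    "usage": 0.05,
    "quality": 0.05,
    "consistency": 0.025,
    "decay": 0.025,
}


def final_score(
    signals: Mapping[str, float],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted sum of signals, each expected to be normalised to [0, 1]."""
    total = 0.0
    for name, weight in weights.items():
        value = signals.get(name, 0.0)  # assumption: missing signal counts as 0
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal {name!r} must be in [0, 1], got {value}")
        total += weight * value
    return total
```

Because the weights sum to 1.0 and every signal lies in [0, 1], the resulting score also stays in [0, 1].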

Why the vector namespace is per-user

Vector recall scopes to the namespace memorysync-user-{user_id}. This is not an optimisation — it is a safety boundary. Even if the metadata filter is misconfigured, a query physically cannot return another user's vectors because they live in a different namespace. Tenant isolation works the same way at the database layer; the vector layer reinforces it.
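
The isolation argument can be made concrete with a toy model. The namespace format comes from the paragraph above; everything else (`user_namespace`, `ToyIndex`, and its methods) is a hypothetical in-memory stand-in, not the real vector index client.

```python
from collections import defaultdict
from typing import Dict, List, Union

UserId = Union[int, str]


def user_namespace(user_id: UserId) -> str:
    """Build the per-user namespace name described above."""
    if user_id is None or str(user_id) == "":
        raise ValueError("user_id is required")
    return f"memorysync-user-{user_id}"


class ToyIndex:
    """In-memory stand-in for a namespaced vector index (illustration only)."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, List[str]] = defaultdict(list)

    def upsert(self, user_id: UserId, vector_id: str) -> None:
        self._namespaces[user_namespace(user_id)].append(vector_id)

    def query(self, user_id: UserId) -> List[str]:
        # Only the caller's namespace is read; metadata filters would run
        # inside it, so a bad filter cannot reach another user's vectors.
        return list(self._namespaces.get(user_namespace(user_id), []))
```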

What the response includes

JSON

{
  "memories": [
    {
      "id": 18421,
      "content": "User Alice prefers concise replies and uses dark mode.",
      "tier": "hot",
      "score": 0.83,
      "score_breakdown": {
        "semantic": 0.91,
        "recency":  0.62,
        "importance": 0.70,
        "usage": 0.40
      },
      "embedding_version": "v3",
      "tags": ["preference", "ui"],
      "created_at": "2026-05-04T10:14:32Z"
    }
  ],
  "explanation": "Top match is a recent high-importance preference fact.",
  "request_id": "req_3f9c1ab2"
}

What makes a query return fewer rows than you expect

Tier filter. By default, queries scan all tiers. If you set tier="hot", cold and warm rows are excluded at the index — not just deprioritised.
Embedding-version mismatch. If you upserted vectors under one model and queried under another, only same-version vectors are eligible.
Importance gate. By default, memories with importance below ~0.5 are dropped unless the caller explicitly asks for them.
Project scope. Queries are project-scoped. A memory added under one project is invisible to queries under another, even with the same user.
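
When debugging a short result set, it can help to express the four causes above as a single check. The sketch below is illustrative only: `exclusion_reason`, the `project_id` and `importance` field names on the memory, and the use of `importance_floor=None` to stand for "explicitly requested" are assumptions, not part of the documented API.

```python
from typing import Any, Mapping, Optional


def exclusion_reason(
    memory: Mapping[str, Any],
    *,
    project_id: str,
    query_embedding_version: str,
    tier: Optional[str] = None,
    importance_floor: Optional[float] = 0.5,
) -> Optional[str]:
    """Return why a memory would be skipped, or None if it is eligible."""
    if memory["project_id"] != project_id:
        return "project scope"
    if tier is not None and memory["tier"] != tier:
        return "tier filter"
    if memory["embedding_version"] != query_embedding_version:
        return "embedding-version mismatch"
    # Assumption: importance_floor=None models an explicit opt-out of the gate.
    if importance_floor is not None and memory["importance"] < importance_floor:
        return "importance gate"
    return None
```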

What bookkeeping runs after the response

usage_count++ on each returned memory.
last_accessed_at = now() on each returned memory.
One memory_events row per recall, used by the re-evaluation flow to spot hot memories.
Cache update — top-K result cached for identical follow-up queries (5-minute TTL).
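
The 5-minute result cache in the last item can be sketched as a small TTL map. This is a minimal illustration, not the service's cache: the `TTLCache` class, its method names, and the injectable clock are hypothetical, and how the real cache builds its key for "identical" queries is not specified here.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny time-to-live cache; entries expire ttl_seconds after being stored."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # 5 minutes, per the list above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry on read
            return None
        return value
```

Injecting the clock makes expiry testable without sleeping; by default the cache uses `time.monotonic`, which is unaffected by wall-clock adjustments.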
