Embedding Pipeline

The embedding pipeline turns text into the fixed-length vector that the recall engine searches against. This page describes how the input is prepared, when the embedding service is called, how cost is tracked, and what the version stamp on every memory means.

What text actually gets embedded

Not the raw input. Before the embedding call, the input passes through a preprocessing step that strips leading and trailing whitespace, normalises internal whitespace, and truncates the result to the maximum input length the service accepts. The truncation point is logged on the memory so audits can confirm exactly what was embedded.
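The three preprocessing steps can be sketched as follows. This is a minimal illustration rather than the actual implementation: the function name `preprocess`, the `Preprocessed` result type, the `MAX_INPUT_CHARS` value, and the choice to measure length in characters (a real service may count tokens) are all assumptions.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical limit; the real maximum depends on the embedding service.
MAX_INPUT_CHARS = 8000


class Preprocessed(NamedTuple):
    text: str                    # the string that would be sent for embedding
    truncated_at: Optional[int]  # offset of the cut, or None if nothing was cut


def preprocess(raw: str, max_chars: int = MAX_INPUT_CHARS) -> Preprocessed:
    # Step 1: strip leading and trailing whitespace.
    text = raw.strip()
    # Step 2: collapse every internal run of whitespace to a single space.
    text = re.sub(r"\s+", " ", text)
    # Step 3: truncate, and report where the cut happened so it can be logged.
    if len(text) > max_chars:
        return Preprocessed(text[:max_chars], max_chars)
    return Preprocessed(text, None)
```

Returning the truncation offset alongside the text keeps the value that gets logged on the memory tied to the exact string that was embedded.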

The call path

1. Caller calls embedding_manager.embed_text(text).
2. Manager preprocesses the text and computes a deterministic cache key (sha256 of the preprocessed string + the embedding version).
3. Cache lookup. Hit: vector returned in microseconds. Miss: continue.
4. Embedding service called with the preprocessed text. Token count and cost in cents are recorded.
5. Vector returned. Cache populated for next time.
6. Caller writes the vector and the embedding version onto the memory row.
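The call path above can be sketched roughly like this. Only `embed_text` and the sha256-of-text-plus-version key come from this page; the class layout, the in-memory dict cache, the NUL separator inside the key, the `"v3"` version label, and the stand-in `embed_fn` are assumptions made for illustration. Token and cost recording (step 4) is left out to keep the sketch short.

```python
import hashlib
from typing import Callable, Dict, List


def cache_key(preprocessed: str, version: str) -> str:
    # sha256 over the preprocessed text plus the embedding version.
    # The NUL separator is an assumption; it keeps the two parts unambiguous.
    payload = preprocessed.encode("utf-8") + b"\x00" + version.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class EmbeddingManager:
    def __init__(self, embed_fn: Callable[[str], List[float]], version: str = "v3"):
        self._embed_fn = embed_fn  # stand-in for the embedding service client
        self.version = version     # the caller writes this onto the memory row
        self._cache: Dict[str, List[float]] = {}
        self.service_calls = 0     # counts cache misses, for demonstration only

    def embed_text(self, text: str) -> List[float]:
        # Simplified preprocessing: strip and collapse whitespace.
        preprocessed = " ".join(text.split())
        key = cache_key(preprocessed, self.version)
        if key in self._cache:  # cache hit: no service call
            return self._cache[key]
        self.service_calls += 1  # cache miss: call the service
        vector = self._embed_fn(preprocessed)
        self._cache[key] = vector  # populate the cache for next time
        return vector


# Two inputs that preprocess to the same string trigger one service call.
manager = EmbeddingManager(embed_fn=lambda s: [float(len(s))])
first = manager.embed_text("hello  world")
second = manager.embed_text(" hello world ")
```

Including the version in the key means a configuration change cannot serve a vector produced under the previous version.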

Cost and token tracking, exactly as the code records it

Every call records a token count.
Every call records a cost in cents (integer cents, never floats).
Both numbers roll up to the tenant's daily and monthly counters; the daily counter resets at midnight UTC and the monthly counter resets at the month boundary.
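One way to model these counters is to bucket usage by UTC date and UTC month, so that crossing midnight or a month boundary starts a fresh bucket. The sketch below assumes that approach; the `UsageCounters` class, its field names, and the bucket key format are hypothetical and not taken from the codebase.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import DefaultDict, Optional, Tuple


class UsageCounters:
    def __init__(self) -> None:
        # Keys are (tenant_id, bucket), where bucket is "YYYY-MM-DD" or "YYYY-MM".
        self.tokens: DefaultDict[Tuple[str, str], int] = defaultdict(int)
        self.cents: DefaultDict[Tuple[str, str], int] = defaultdict(int)

    def record(self, tenant_id: str, tokens: int, cost_cents: int,
               now: Optional[datetime] = None) -> None:
        # Cost is integer cents only; reject floats (and bools) outright.
        if not isinstance(cost_cents, int) or isinstance(cost_cents, bool):
            raise TypeError("cost must be an integer number of cents")
        now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
        # A new UTC day or month yields a new bucket, which acts as the reset.
        for bucket in (now.strftime("%Y-%m-%d"), now.strftime("%Y-%m")):
            self.tokens[(tenant_id, bucket)] += tokens
            self.cents[(tenant_id, bucket)] += cost_cents
```

Keeping cost as an integer avoids the rounding drift that accumulates when many small floating-point amounts are summed.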

The embedding_version stamp on every memory

Every memory row carries an embedding_version field populated when the vector was generated. This serves three purposes:

Recall is restricted to vectors with the same version as the query — mixing versions silently degrades similarity scores, so the system refuses to mix them.
If you change the embedding configuration, existing rows keep their old version and remain searchable by old-version queries; new rows get the new version. There is no flag-day rebuild.
The version surfaces on every memory in the API response so external observers can confirm what produced a vector.
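The version restriction on recall can be illustrated with a small in-memory search. This is a sketch under assumptions: the `Memory` dataclass, the `recall` function, and the use of cosine similarity are illustrative choices, since this page does not specify the similarity metric or the storage layer.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Memory:
    id: str
    vector: List[float]
    embedding_version: str


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_vector: List[float], query_version: str,
           memories: List[Memory], top_k: int = 5) -> List[Tuple[float, Memory]]:
    # Only memories embedded under the query's version are candidates;
    # rows from other versions are never scored.
    candidates = [m for m in memories if m.embedding_version == query_version]
    scored = [(cosine(query_vector, m.vector), m) for m in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Filtering before scoring, rather than after, means an old-version row can never outrank a same-version row on a score that is not comparable.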

What happens when a call fails

Failure	Behaviour	Caller sees
Transient error (`5xx`, timeout)	Up to 2 retries with exponential backoff inside the request.	Eventual success → `200`; eventual failure → write-path soft degradation.
Rate limit (`429`)	Falls back to soft degradation immediately — the row commits without a vector and a background task retries.	`200` with `status="pending_embedding"`.
Permanent error (`4xx`)	No retry. Surfaces as a hard failure on the request.	`5xx` on the API call.
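The table above can be read as a small decision procedure. The sketch below assumes hypothetical exception classes for the three failure kinds, a hypothetical `embed_for_write` helper, and an illustrative `"embedded"` status for the success case; only `"pending_embedding"` and the limit of 2 retries come from this page. It also assumes that exhausted transient retries degrade to the same pending state as a rate limit.

```python
import time
from typing import Callable, List

MAX_RETRIES = 2  # from the table: up to 2 retries for transient errors


class TransientError(Exception):
    """Stand-in for a 5xx response or timeout from the embedding service."""


class RateLimitError(Exception):
    """Stand-in for a 429 response."""


class PermanentError(Exception):
    """Stand-in for a non-retryable 4xx response."""


def embed_for_write(call_service: Callable[[str], List[float]], text: str,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    attempt = 0
    while True:
        try:
            return {"status": "embedded", "vector": call_service(text)}
        except RateLimitError:
            # Soft degradation immediately: commit without a vector.
            return {"status": "pending_embedding", "vector": None}
        except TransientError:
            if attempt == MAX_RETRIES:
                # Retries exhausted: soft degradation (assumed same status).
                return {"status": "pending_embedding", "vector": None}
            sleep(base_delay * (2 ** attempt))  # exponential backoff
            attempt += 1
        # PermanentError is deliberately not caught: it propagates as a hard failure.
```

Passing `sleep` as a parameter is only a testing convenience, so the backoff delays can be observed without waiting.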

Why the embedding cache is not optional

Two callers writing the exact same string within minutes of each other is the common case (a customer with two identical chat messages, two integration syncs of the same source object, an SDK retry). Without the cache, both calls spend tokens. With the cache, the second call is a microsecond lookup. The cache is keyed by the hash of the preprocessed text plus the embedding version, so inputs that differ only in whitespace that preprocessing removes share one entry, while any difference that survives preprocessing correctly produces a separate vector.

How to debug a recall that feels wrong

Check version match. If the query vector is v3 and the target memory is v2, the system will not return it. Re-run the query under the original version, or backfill the memory by re-embedding it under the current version.
Check the preprocessed text. The preprocessing log line shows what was actually embedded — if your input had non-printing characters or unusual whitespace, the embedded form may differ from what you expect.
Check the cost log. If the cost log shows zero tokens spent for an input you believed was new, the call was served from the cache, usually because the preprocessed input was identical to an earlier write.
