Failure Modes & Retries

Every external call in MemorySync has a documented failure mode and a documented retry policy. This page is the contract: when X breaks, the system does Y. Read it before you rely on a behaviour during an incident.

Failures, classified by what the system does about them

Class	Examples	Response
Hard dependency down	Primary database unavailable.	Return `503`. No degradation — the database is the only hard dependency. Health checks fail; load balancer drains the instance.
Soft dependency down	Cache unreachable, vector index slow, embedding service rate-limited.	Degrade gracefully on the read path; queue and replay on the write path. Return `200` where possible with a degraded-mode flag.
Caller error	Invalid header, missing scope, malformed body.	Return `4xx` with an explicit error code. No retries — the caller has to fix it.
Transient task failure	Webhook target returns `5xx`, SIEM endpoint times out.	Retry with exponential backoff up to a per-task ceiling, then dead-letter.

The circuit breaker, used everywhere external

Outbound calls (SIEM forwarding, webhook delivery, integration sync, embedding service) are wrapped in a circuit breaker with three states.

State	Behaviour	Transitions out
CLOSED	Calls go through normally. Failures are counted.	Failures exceed the threshold (default 5 in a row) → OPEN.
OPEN	Calls short-circuit immediately with a no-op error — the target is not contacted.	Cooldown window elapses (default 60 s) → HALF_OPEN.
HALF_OPEN	One probe call is allowed through.	Probe succeeds → CLOSED. Probe fails → OPEN, cooldown restarts.
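The state table above can be sketched as a small class. This is a minimal, single-threaded illustration and not MemorySync's actual implementation: the names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical, and it assumes the breaker opens on the 5th consecutive failure (the table could also be read as opening on the 6th).

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is short-circuited; the target is not contacted."""


class CircuitBreaker:
    # Defaults mirror the table: 5 consecutive failures, 60 s cooldown.
    def __init__(self, failure_threshold=5, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open; target not contacted")
            # Cooldown elapsed: let this one call through as the probe.
            self.state = State.HALF_OPEN
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) closes the circuit.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()  # cooldown (re)starts here
```

A concurrent implementation would additionally need a lock so that only one probe is in flight while the breaker is HALF_OPEN.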

Retry schedules by task type

Task	Schedule	Max retries
SIEM forwarding	1 s → 2 s → 5 s → 10 s → 30 s + jitter (factor 0.25).	5
Webhook delivery	Exponential backoff with jitter, configured per endpoint.	Per endpoint config; default ceiling.
Integration sync	60 s → 300 s → 900 s.	3
Intelligence re-evaluation	Exponential backoff, capped at 300 s.	2
Embedding (write path, blocking)	Backoff inside the request — at most 2 retries before the request errors.	2
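One way to read the table is as a lookup from task type and retry number to a base delay. The sketch below covers the two rows with fixed schedules and adds a capped exponential helper. The dictionary keys, function names, the 1 s starting delay, and the doubling factor are assumptions for illustration; the table only states the 300 s cap.

```python
from typing import Optional

# Base delays in seconds, copied from the table above.
FIXED_SCHEDULES = {
    "siem_forwarding": [1, 2, 5, 10, 30],
    "integration_sync": [60, 300, 900],
}


def base_delay(task: str, retry_number: int) -> Optional[float]:
    """Base delay before retry `retry_number` (1-based), or None once retries are exhausted."""
    if retry_number < 1:
        raise ValueError("retry_number is 1-based")
    schedule = FIXED_SCHEDULES[task]
    if retry_number > len(schedule):
        return None  # ceiling reached: the caller should dead-letter the task
    return float(schedule[retry_number - 1])


def capped_exponential(retry_number: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Exponential backoff with a cap; base_s and the factor of 2 are assumed values."""
    return min(cap_s, base_s * 2 ** (retry_number - 1))
```

For example, `base_delay("siem_forwarding", 3)` gives the 5 s slot, and `base_delay("integration_sync", 4)` returns `None` because that task has only three retries.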

What “jitter” means in those schedules

Two clients that fail at the same instant would otherwise retry at exactly the same instants, piling onto the recovering target and causing it to fail again. Jitter shifts each retry by a small random offset so the herd spreads out. `+ jitter (factor 0.25)` means each delay is randomised by ±25% of itself: a 5 s slot lands somewhere in [3.75 s, 6.25 s]. The base schedule is the average delay; jitter is the controlled randomness around it.
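The ±25% rule is a one-line calculation. The sketch below assumes the offset is drawn uniformly, which the text does not specify; the function name `jittered` and the `rng` parameter are hypothetical.

```python
import random


def jittered(delay_s: float, factor: float = 0.25, rng=random) -> float:
    """Randomise a delay within ±factor of itself (uniform draw is an assumption)."""
    return delay_s * (1.0 + rng.uniform(-factor, factor))


# A 5 s slot always lands inside [3.75, 6.25] seconds.
sample = jittered(5.0)
assert 3.75 <= sample <= 6.25
```

Because the draw is symmetric around zero, the long-run mean of the jittered delays equals the base delay.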

Per-endpoint backpressure

Outbound integrations carry per-target concurrency caps so one slow endpoint cannot starve the worker pool. SIEM forwarding allows at most 3 concurrent in-flight requests per configured endpoint; webhook delivery allows 10 in-flight requests globally, with a per-endpoint semaphore on top. When a target is slow, the rest of the queue keeps draining for other tenants.
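A per-endpoint cap of this kind is commonly built from one semaphore per target. The following is a minimal `asyncio` sketch under that assumption; `EndpointLimiter` and its method names are hypothetical, and the default of 3 mirrors the SIEM forwarding limit stated above.

```python
import asyncio


class EndpointLimiter:
    """Caps concurrent in-flight requests per endpoint with one semaphore each."""

    def __init__(self, per_endpoint_limit: int = 3):
        self._limit = per_endpoint_limit
        self._semaphores = {}

    def _semaphore_for(self, endpoint: str) -> asyncio.Semaphore:
        # Created lazily so endpoints can be added at runtime.
        if endpoint not in self._semaphores:
            self._semaphores[endpoint] = asyncio.Semaphore(self._limit)
        return self._semaphores[endpoint]

    async def run(self, endpoint: str, send):
        # Waits here only if this endpoint is already at its cap;
        # other endpoints have their own semaphores and are unaffected.
        async with self._semaphore_for(endpoint):
            return await send()
```

A slow endpoint then blocks only the callers queued on its own semaphore, which is what keeps the rest of the queue draining.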

When a task is moved to the dead-letter queue

Retry ceiling reached — the task has burned every retry slot and the last attempt still failed.
Permanent error received — a 4xx from the target that retrying will not fix (e.g. 410 Gone, 403 on a deactivated webhook).
Consecutive-failure threshold tripped — for SIEM forwarders, after 5 consecutive failures the entire forwarder is paused and remaining batches go to dead-letter until an operator re-enables it.
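The three conditions above can be expressed as a single check run after each failed attempt. This is a sketch, not the actual routing logic: the function name, the reason strings, and the contents of `PERMANENT_STATUSES` (only the two examples from the text) are assumptions.

```python
from typing import Optional

# Only the two examples named in the text; the real set of permanent errors is an assumption.
PERMANENT_STATUSES = {403, 410}


def dead_letter_reason(
    retries_used: int,
    max_retries: int,
    last_status: Optional[int] = None,
    forwarder_paused: bool = False,
) -> Optional[str]:
    """Return why a failed task should be dead-lettered, or None if it should be retried."""
    if forwarder_paused:
        # SIEM forwarder tripped its consecutive-failure threshold.
        return "forwarder_paused"
    if last_status in PERMANENT_STATUSES:
        # Retrying will not fix this response.
        return "permanent_error"
    if retries_used >= max_retries:
        # Every retry slot has been used and the last attempt still failed.
        return "retry_ceiling_reached"
    return None
```

For example, a webhook that answers `410` on its first attempt is dead-lettered immediately, while a `502` with retries remaining returns `None` and goes back into the retry schedule.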

Soft degradation on the write path

Some failures are tolerable for the caller. `POST /memory/add` uses a soft-degradation path: when the embedding service is rate-limited and the embedding cache misses, the row still commits to the database with `vector_id=null`, and a background task retries the embedding. The API returns `200` with `status="pending_embedding"` instead of failing the request.
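The control flow of that path can be sketched with in-memory stand-ins. Everything here is illustrative: `EmbeddingRateLimited`, `add_memory`, the list-backed `rows` and `retry_queue`, and the `"ok"` status for the normal path are assumptions, not MemorySync's real handler or schema.

```python
class EmbeddingRateLimited(Exception):
    """Hypothetical error raised by an embedding client when it is rate-limited."""


def add_memory(text, embed, cache, rows, retry_queue):
    """Commit the row even if no embedding is available; queue a background retry."""
    vector_id = cache.get(text)
    if vector_id is None:
        try:
            vector_id = embed(text)
            cache[text] = vector_id
        except EmbeddingRateLimited:
            vector_id = None  # cache miss and service rate-limited: degrade softly

    # The row is written either way; vector_id stays None until the retry succeeds.
    rows.append({"text": text, "vector_id": vector_id})
    row_id = len(rows)

    if vector_id is None:
        retry_queue.append(row_id)
        return {"id": row_id, "status": "pending_embedding"}
    return {"id": row_id, "status": "ok"}  # "ok" is an assumed value for the normal path
```

The key design point is ordering: the database commit does not depend on the embedding call, so only the hard dependency can fail the request.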

What the caller sees when things break

// hard dependency down
HTTP/1.1 503 Service Unavailable
Retry-After: 30
{ "error": { "code": "DATABASE_UNAVAILABLE", "request_id": "..." } }

// soft dependency down (write path)
HTTP/1.1 200 OK
{ "id": 18421, "status": "pending_embedding", "embedding_version": null }

// caller error
HTTP/1.1 400 Bad Request
{ "error": { "code": "PROJECT_REQUIRED", "request_id": "..." } }
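On the caller's side, the three responses above map to three different reactions. A minimal sketch of that dispatch follows; the function name and the action labels are hypothetical, and it assumes `Retry-After` carries a number of seconds as in the example rather than an HTTP date.

```python
def next_step(status: int, headers: dict, body: dict):
    """Decide what a client should do with a response (action labels are illustrative)."""
    if status == 503:
        # Hard dependency down: wait for the advertised interval, then retry.
        return ("retry_after_seconds", int(headers.get("Retry-After", "0")))
    if 400 <= status < 500:
        # Caller error: retrying the same request will not help.
        return ("fix_request", body["error"]["code"])
    if status == 200 and body.get("status") == "pending_embedding":
        # Write accepted in degraded mode; the embedding arrives later.
        return ("accepted_pending", body["id"])
    return ("ok", body.get("id"))
```

For the hard-dependency example above, `next_step(503, {"Retry-After": "30"}, {...})` yields `("retry_after_seconds", 30)`.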
