MemorySync
How It Works

Failure Modes & Retries

Every external call in MemorySync has a documented failure mode and a documented retry policy. This page is the contract: when X breaks, the system does Y. Read it before you rely on a behaviour during an incident.

Failures, classified by what the system does about them

| Class | Examples | Response |
| --- | --- | --- |
| Hard dependency down | Primary database unavailable. | Return 503. No degradation: the database is the only hard dependency. Health checks fail; load balancer drains the instance. |
| Soft dependency down | Cache unreachable, vector index slow, embedding service rate-limited. | Degrade gracefully on the read path; queue and replay on the write path. Return 200 where possible with a degraded-mode flag. |
| Caller error | Invalid header, missing scope, malformed body. | Return 4xx with an explicit error code. No retries: the caller has to fix it. |
| Transient task failure | Webhook target returns 5xx, SIEM endpoint times out. | Retry with exponential backoff up to a per-task ceiling, then dead-letter. |

The circuit breaker, used everywhere external

Outbound calls (SIEM forwarding, webhook delivery, integration sync, embedding service) are wrapped in a circuit breaker with three states.

| State | Behaviour | Transitions out |
| --- | --- | --- |
| CLOSED | Calls go through normally. Failures are counted. | Failures exceed the threshold (default 5 in a row) → OPEN. |
| OPEN | Calls short-circuit immediately with a no-op error; the target is not contacted. | Cooldown window elapses (default 60 s) → HALF_OPEN. |
| HALF_OPEN | One probe call is allowed through. | Probe succeeds → CLOSED. Probe fails → OPEN, cooldown restarts. |
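
The state table above can be sketched as a small class. This is a minimal, single-threaded illustration, not MemorySync's actual implementation: the names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical, and it assumes the breaker opens on the fifth consecutive failure.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is short-circuited without contacting the target."""


class CircuitBreaker:
    # Defaults mirror the table: 5 failures in a row, 60 s cooldown.
    def __init__(self, failure_threshold=5, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open; target not contacted")
            # Cooldown elapsed: let this call through as the probe.
            self.state = State.HALF_OPEN
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) closes the circuit.
        self.state = State.CLOSED
        self.consecutive_failures = 0

    def _on_failure(self):
        self.consecutive_failures += 1
        probe_failed = self.state is State.HALF_OPEN
        if probe_failed or self.consecutive_failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()  # cooldown (re)starts now
```

A caller would wrap each outbound request, for example `breaker.call(send_batch, batch)`, and treat `CircuitOpenError` as a signal to queue the work rather than retry immediately. A concurrent version would also need a lock so that only one probe passes in HALF_OPEN.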

Retry schedules by task type

| Task | Schedule | Max retries |
| --- | --- | --- |
| SIEM forwarding | 1 s → 2 s → 5 s → 10 s → 30 s + jitter (factor 0.25). | 5 |
| Webhook delivery | Exponential backoff with jitter, configured per endpoint. | Per endpoint config; default ceiling. |
| Integration sync | 60 s → 300 s → 900 s. | 3 |
| Intelligence re-evaluation | Exponential backoff, capped at 300 s. | 2 |
| Embedding (write path, blocking) | Backoff inside the request: at most 2 retries before the request errors. | 2 |

What “jitter” means in those schedules

Two clients failing at the same instant would otherwise retry at the exact same instants, piling on the recovering target and causing it to fail again. Jitter scrambles each retry by a small random offset so the herd spreads out. “+ jitter (factor 0.25)” means each delay is randomised by ±25% of itself: a 5 s slot lands somewhere in [3.75 s, 6.25 s]. The base schedule is the average; jitter is the controlled randomness around it.
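
As a worked illustration of the ±25% rule, the sketch below applies jitter to the SIEM forwarding schedule. It assumes the offset is drawn uniformly, which the text above does not specify; `jittered_delay` and `siem_retry_delays` are hypothetical names.

```python
import random

# Base SIEM forwarding schedule from the table, in seconds.
SIEM_BASE_DELAYS_S = [1, 2, 5, 10, 30]
JITTER_FACTOR = 0.25


def jittered_delay(base_s, factor=JITTER_FACTOR, rng=random):
    """Return base_s scaled by a random multiplier in [1 - factor, 1 + factor].

    Assumption: the multiplier is uniform, so the base delay is the mean.
    """
    return base_s * (1 + rng.uniform(-factor, factor))


def siem_retry_delays(rng=random):
    """One jittered delay per retry slot in the SIEM schedule."""
    return [jittered_delay(base, rng=rng) for base in SIEM_BASE_DELAYS_S]


if __name__ == "__main__":
    for base, delay in zip(SIEM_BASE_DELAYS_S, siem_retry_delays()):
        print(f"base {base:>2} s -> sleep {delay:.2f} s")
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests while production code keeps the module-level generator.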

Per-endpoint backpressure

Outbound integrations carry per-target concurrency caps so one slow endpoint cannot starve the worker pool. SIEM forwarding allows at most 3 concurrent in-flight requests per configured endpoint; webhook delivery allows 10 globally with a per-endpoint semaphore. When a target is slow, the rest of the queue keeps draining for other tenants.
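
One way to picture the per-endpoint cap is a semaphore keyed by endpoint. The sketch below uses `asyncio` and models only the SIEM limit of 3 in-flight requests per endpoint; `EndpointLimiter`, `fake_send`, and the endpoint names are hypothetical, and the real worker pool may be structured quite differently.

```python
import asyncio
from collections import defaultdict

SIEM_MAX_IN_FLIGHT_PER_ENDPOINT = 3  # cap stated in the text above


class EndpointLimiter:
    """Holds one semaphore per endpoint so a slow target only blocks itself."""

    def __init__(self, per_endpoint_limit):
        self._limit = per_endpoint_limit
        self._semaphores = {}

    def _semaphore_for(self, endpoint):
        if endpoint not in self._semaphores:
            self._semaphores[endpoint] = asyncio.Semaphore(self._limit)
        return self._semaphores[endpoint]

    async def run(self, endpoint, send):
        # Waits here only if this endpoint already has `limit` calls in flight.
        async with self._semaphore_for(endpoint):
            return await send()


async def demo():
    """Send 10 requests to one endpoint and 2 to another; record peak concurrency."""
    limiter = EndpointLimiter(SIEM_MAX_IN_FLIGHT_PER_ENDPOINT)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_send(endpoint):
        in_flight[endpoint] += 1
        peak[endpoint] = max(peak[endpoint], in_flight[endpoint])
        await asyncio.sleep(0.01)  # stands in for network latency
        in_flight[endpoint] -= 1

    endpoints = ["siem-a"] * 10 + ["siem-b"] * 2
    jobs = [limiter.run(ep, lambda ep=ep: fake_send(ep)) for ep in endpoints]
    await asyncio.gather(*jobs)
    return dict(peak)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A global cap, such as the webhook limit of 10, could be layered on by acquiring a second shared semaphore around the per-endpoint one.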

When a task is moved to the dead-letter queue

  • Retry ceiling reached — the task has burned every retry slot and the last attempt still failed.
  • Permanent error received — a 4xx from the target that retrying will not fix (e.g. 410 Gone, 403 on a deactivated webhook).
  • Consecutive-failure threshold tripped — for SIEM forwarders, after 5 consecutive failures the entire forwarder is paused and remaining batches go to dead-letter until an operator re-enables it.
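
The three conditions above can be read as a single decision function evaluated after a failed attempt. This is a sketch under assumptions: the `Attempt` record and reason strings are hypothetical, and the set of permanent statuses contains only the two examples given in the list (a real deployment would likely distinguish a 403 on a deactivated webhook from other 403s).

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only the statuses named as examples in the text.
PERMANENT_STATUSES = {403, 410}


@dataclass
class Attempt:
    """State of a task immediately after a failed delivery attempt."""
    retries_used: int
    max_retries: int
    status: Optional[int]  # HTTP status from the target; None for a timeout
    consecutive_forwarder_failures: int = 0  # only meaningful for SIEM forwarders


def dead_letter_reason(attempt, forwarder_threshold=5):
    """Return why the task should be dead-lettered, or None to keep retrying."""
    if attempt.status in PERMANENT_STATUSES:
        return "permanent_error"
    if attempt.consecutive_forwarder_failures >= forwarder_threshold:
        return "forwarder_paused"
    if attempt.retries_used >= attempt.max_retries:
        return "retry_ceiling_reached"
    return None
```

For example, `dead_letter_reason(Attempt(retries_used=1, max_retries=5, status=410))` yields `"permanent_error"` even though retry slots remain, because retrying a 410 cannot succeed.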

Soft degradation on the write path

Some failures are tolerable for the caller. POST /memory/add uses a soft-degradation path: when the embedding service is rate-limited and the embedding cache misses, the row still commits to the database with vector_id=null, and a background task retries the embedding. The API returns 200 with status="pending_embedding" instead of failing the request.
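
The flow described above might look roughly like the following. Everything here is an in-memory stand-in: `add_memory`, `EmbeddingRateLimited`, the list-backed `rows` table, and the `"stored"` status for the non-degraded case are assumptions made for illustration; only `vector_id=null`, the 200 status, and `status="pending_embedding"` come from the text.

```python
class EmbeddingRateLimited(Exception):
    """Hypothetical error raised when the embedding service returns a rate limit."""


def add_memory(text, rows, embedding_cache, embed, retry_queue):
    """Commit the row even when no embedding is available yet.

    rows            -- list standing in for the database table
    embedding_cache -- dict mapping text to a vector id
    embed           -- callable that returns a vector id or raises EmbeddingRateLimited
    retry_queue     -- list standing in for the background retry task queue
    """
    vector_id = embedding_cache.get(text)
    if vector_id is None:  # cache miss: fall through to the embedding service
        try:
            vector_id = embed(text)
        except EmbeddingRateLimited:
            vector_id = None  # degrade instead of failing the request

    rows.append({"text": text, "vector_id": vector_id})
    row_id = len(rows)

    if vector_id is None:
        retry_queue.append(row_id)  # background task will embed later
        return 200, {"id": row_id, "status": "pending_embedding", "embedding_version": None}
    # "stored" is a placeholder; the normal success body is not shown in this page.
    return 200, {"id": row_id, "status": "stored"}
```

The key design point is ordering: the database commit happens regardless of the embedding outcome, so the only thing deferred is the vector, never the data.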

What the caller sees when things break

```http
// hard dependency down
HTTP/1.1 503 Service Unavailable
Retry-After: 30

{ "error": { "code": "DATABASE_UNAVAILABLE", "request_id": "..." } }

// soft dependency down (write path)
HTTP/1.1 200 OK

{ "id": 18421, "status": "pending_embedding", "embedding_version": null }

// caller error
HTTP/1.1 400 Bad Request

{ "error": { "code": "PROJECT_REQUIRED", "request_id": "..." } }
```
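
On the caller side, the three responses above suggest three different reactions. The sketch below is one possible client-side interpretation, not an official SDK: `next_action` and its action labels are hypothetical, and it assumes `Retry-After` carries a whole number of seconds as in the example.

```python
def next_action(status, headers, body):
    """Map a MemorySync response to what the caller should do next.

    Returns a (action, detail) tuple. The action labels are illustrative.
    """
    if status == 503:
        # Hard dependency down: wait for the advertised interval, then retry.
        return ("retry_after", int(headers.get("Retry-After", "0")))
    if 400 <= status < 500:
        # Caller error: retrying unchanged will fail again; surface the code.
        return ("fix_request", body["error"]["code"])
    if status == 200 and body.get("status") == "pending_embedding":
        # Soft degradation: the row is committed, the embedding will follow.
        return ("accepted_pending", body["id"])
    return ("ok", None)


if __name__ == "__main__":
    print(next_action(503, {"Retry-After": "30"},
                      {"error": {"code": "DATABASE_UNAVAILABLE", "request_id": "..."}}))
```

Treating `pending_embedding` as success matters: the write has been accepted, so re-submitting it would create a duplicate row.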