Failure Modes & Retries
Every external call in MemorySync has a documented failure mode and a documented retry policy. This page is the contract: when X breaks, the system does Y. Read it before you rely on a behaviour during an incident.
Failures, classified by what the system does about them
| Class | Examples | Response |
|---|---|---|
| Hard dependency down | Primary database unavailable. | Return 503. No degradation — the database is the only hard dependency. Health checks fail; load balancer drains the instance. |
| Soft dependency down | Cache unreachable, vector index slow, embedding service rate-limited. | Degrade gracefully on the read path; queue and replay on the write path. Return 200 where possible with a degraded-mode flag. |
| Caller error | Invalid header, missing scope, malformed body. | Return 4xx with an explicit error code. No retries — the caller has to fix it. |
| Transient task failure | Webhook target returns 5xx, SIEM endpoint times out. | Retry with exponential backoff up to a per-task ceiling, then dead-letter. |
The circuit breaker, used everywhere external
Outbound calls (SIEM forwarding, webhook delivery, integration sync, embedding service) are wrapped in a circuit breaker with three states.
| State | Behaviour | Transitions out |
|---|---|---|
| CLOSED | Calls go through normally. Failures are counted. | Failures exceed the threshold (default 5 in a row) → OPEN. |
| OPEN | Calls short-circuit immediately with a no-op error — the target is not contacted. | Cooldown window elapses (default 60 s) → HALF_OPEN. |
| HALF_OPEN | One probe call is allowed through. | Probe succeeds → CLOSED. Probe fails → OPEN, cooldown restarts. |
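The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration and not MemorySync's actual implementation: the names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` are hypothetical, and it assumes the breaker trips on the fifth consecutive failure. Enforcing "exactly one probe" under concurrency would need a lock, which is omitted here.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is short-circuited without contacting the target."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # documented default: 5 in a row
        self.cooldown_s = cooldown_s                # documented default: 60 s
        self.clock = clock                          # injectable so tests can fake time
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open; target not contacted")
            # Cooldown elapsed: let this call through as the probe.
            self.state = State.HALF_OPEN
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) closes the circuit.
        self.consecutive_failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.consecutive_failures += 1
        probe_failed = self.state is State.HALF_OPEN
        if probe_failed or self.consecutive_failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()  # cooldown (re)starts now
```

A caller would wrap each outbound request as `breaker.call(send_batch, batch)` and treat `CircuitOpenError` as a failure that never reached the target.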
Retry schedules by task type
| Task | Schedule | Max retries |
|---|---|---|
| SIEM forwarding | 1 s → 2 s → 5 s → 10 s → 30 s + jitter (factor 0.25). | 5 |
| Webhook delivery | Exponential backoff with jitter, configured per endpoint. | Per endpoint config; default ceiling. |
| Integration sync | 60 s → 300 s → 900 s. | 3 |
| Intelligence re-evaluation | Exponential backoff, capped at 300 s. | 2 |
| Embedding (write path, blocking) | Backoff inside the request — at most 2 retries before the request errors. | 2 |
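As an illustration of the SIEM forwarding row, the sketch below applies a jitter factor of 0.25 to the base schedule. It assumes the jitter is drawn uniformly from ±25% of each base delay; the distribution is not stated in the table, and the names `SIEM_BASE_DELAYS_S`, `jittered_delay`, and `siem_schedule` are hypothetical.

```python
import random

# Base delays for SIEM forwarding, in seconds, taken from the table above.
SIEM_BASE_DELAYS_S = [1, 2, 5, 10, 30]


def jittered_delay(base_s, factor=0.25, rng=random):
    """Return base_s perturbed by up to +/- factor of itself.

    Assumption: uniform jitter. A 5 s base with factor 0.25 yields a
    value between 3.75 s and 6.25 s.
    """
    return base_s * (1 + rng.uniform(-factor, factor))


def siem_schedule(rng=random):
    """One concrete delay per retry slot, five slots in total."""
    return [jittered_delay(base, rng=rng) for base in SIEM_BASE_DELAYS_S]
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.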
+ jitter (factor 0.25) means each delay is randomised by ±25% of itself: a 5 s slot lands somewhere in [3.75 s, 6.25 s]. The base schedule is the average; jitter is the controlled randomness around it.
Per-endpoint backpressure
Outbound integrations carry per-target concurrency caps so one slow endpoint cannot starve the worker pool. SIEM forwarding allows at most 3 concurrent in-flight requests per configured endpoint; webhook delivery allows 10 globally with a per-endpoint semaphore. When a target is slow, the rest of the queue keeps draining for other tenants.
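One way to picture the per-endpoint cap is a semaphore per target, sketched here with `asyncio`. The class `PerEndpointLimiter` and the `send` callable are hypothetical stand-ins; only the SIEM limit of 3 comes from the text above, and the webhook global cap is left out for brevity.

```python
import asyncio

# Documented cap: at most 3 in-flight SIEM requests per configured endpoint.
SIEM_MAX_IN_FLIGHT_PER_ENDPOINT = 3


class PerEndpointLimiter:
    """Caps concurrent sends per endpoint so one slow target cannot hog workers."""

    def __init__(self, limit):
        self._limit = limit
        self._semaphores = {}

    def _semaphore_for(self, endpoint):
        # Create each endpoint's semaphore lazily on first use.
        if endpoint not in self._semaphores:
            self._semaphores[endpoint] = asyncio.Semaphore(self._limit)
        return self._semaphores[endpoint]

    async def run(self, endpoint, send):
        # Waits here only if this endpoint already has `limit` calls in flight;
        # calls to other endpoints are unaffected.
        async with self._semaphore_for(endpoint):
            return await send()
```

Because each endpoint has its own semaphore, a stalled target only blocks the tasks addressed to it.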
When a task is moved to the dead-letter queue
- Retry ceiling reached: the task has burned every retry slot and the last attempt still failed.
- Permanent error received: a 4xx from the target that retrying will not fix (e.g. 410 Gone, 403 on a deactivated webhook).
- Consecutive-failure threshold tripped: for SIEM forwarders, after 5 consecutive failures the entire forwarder is paused and remaining batches go to dead-letter until an operator re-enables it.
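The three dead-letter conditions can be expressed as a single decision function. This is a sketch under assumptions: the function `decide` and its parameters are hypothetical, and the set of 4xx codes treated as retryable (`408`, `429`) is a guess, since the text only gives `410` and `403` as examples of permanent errors.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


# Assumption: these 4xx codes are worth retrying; every other 4xx is permanent.
RETRYABLE_4XX = {408, 429}


def is_permanent(status):
    """True for a 4xx response that retrying will not fix (e.g. 410, 403)."""
    return status is not None and 400 <= status < 500 and status not in RETRYABLE_4XX


def decide(retries_used, max_retries, last_status=None, forwarder_paused=False):
    """Decide what happens to a task whose latest attempt failed."""
    if forwarder_paused:
        # SIEM forwarder tripped its consecutive-failure threshold.
        return Decision.DEAD_LETTER
    if is_permanent(last_status):
        return Decision.DEAD_LETTER
    if retries_used >= max_retries:
        # Every retry slot is burned and the last attempt still failed.
        return Decision.DEAD_LETTER
    return Decision.RETRY
```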
Soft degradation on the write path
Some failures are tolerable for the caller. POST /memory/add uses a soft-degradation path: when the embedding service is rate-limited and the embedding cache misses, the row still commits to the database with vector_id=null, and a background task retries the embedding. The API returns 200 with status="pending_embedding" instead of failing the request.
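The flow described above could look roughly like the following. Everything here is an in-memory stand-in for illustration: `add_memory`, `EmbeddingRateLimited`, `index_vector`, the `rows` list, the `retry_queue` list, and the `"ok"` status for the non-degraded case are all hypothetical. Only `vector_id=null` and `status="pending_embedding"` come from the text.

```python
class EmbeddingRateLimited(Exception):
    """Hypothetical error raised when the embedding service rate-limits us."""


def add_memory(text, *, embed, index_vector, embedding_cache, rows, retry_queue):
    """Commit a memory row even when no embedding is available yet."""
    vector = embedding_cache.get(text)
    if vector is None:
        try:
            vector = embed(text)
        except EmbeddingRateLimited:
            vector = None  # tolerated: fall through and commit without a vector

    vector_id = index_vector(vector) if vector is not None else None

    # The row commits either way; vector_id stays None in degraded mode.
    row_id = len(rows) + 1
    rows.append({"id": row_id, "text": text, "vector_id": vector_id})

    if vector_id is None:
        retry_queue.append(row_id)  # a background task retries the embedding
        return {"id": row_id, "status": "pending_embedding"}
    return {"id": row_id, "status": "ok"}  # "ok" is an assumed status value
```

The key design point is that the database commit does not depend on the embedding call succeeding, which is what keeps the embedding service a soft dependency.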
What the caller sees when things break
// hard dependency down
HTTP/1.1 503 Service Unavailable
Retry-After: 30
{ "error": { "code": "DATABASE_UNAVAILABLE", "request_id": "..." } }
// soft dependency down (write path)
HTTP/1.1 200 OK
{ "id": 18421, "status": "pending_embedding", "embedding_version": null }
// caller error
HTTP/1.1 400 Bad Request
{ "error": { "code": "PROJECT_REQUIRED", "request_id": "..." } }
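On the caller side, the three response shapes above suggest three different reactions. The helper below is a hypothetical sketch, not part of any MemorySync client: it assumes `Retry-After` carries a number of seconds (as in the example), that header names arrive with the exact casing shown, and that 30 s is a reasonable fallback when the header is missing.

```python
# Assumed fallback when a 503 arrives without a Retry-After header.
DEFAULT_RETRY_AFTER_S = 30.0


def next_action(status_code, headers, body):
    """Map a response to a caller-side action and its detail."""
    if status_code == 503:
        # Hard dependency down: wait, then retry the same request.
        delay_s = float(headers.get("Retry-After", DEFAULT_RETRY_AFTER_S))
        return ("retry_after", delay_s)
    if 400 <= status_code < 500:
        # Caller error: retrying unchanged will not help; surface the code.
        return ("fix_request", body.get("error", {}).get("code"))
    if status_code == 200 and body.get("status") == "pending_embedding":
        # Write accepted in degraded mode; the embedding arrives later.
        return ("accepted_degraded", body.get("id"))
    return ("ok", body.get("id"))
```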