
Horizontal Scaling

MemorySync scales horizontally on your behalf. As your traffic grows, the platform adds capacity automatically across the API tier, recall pipeline, and background workers. You don’t provision instances, configure autoscaling, or worry about session affinity — every plan ships with elastic capacity by default.

Scaling Model

The platform separates compute into two independent scaling groups that respond to different signals:

  • Request-serving capacity. Handles synchronous traffic to POST /memory/add, POST /memory/query, and the management endpoints. Capacity tracks live request rate and pipeline-stage latency.
  • Background work capacity. Handles asynchronous jobs such as tier transitions, retention sweeps, compaction, re-evaluation, summarisation, webhook delivery, and audit forwarding. Capacity tracks queue depth, not request rate.

Because the two groups scale separately, a read-heavy workload grows request-serving capacity without changing background capacity, and a heavy ingestion job grows background capacity without inflating the API tier. Customers do not configure either group; sizing is automatic.

Elastic Capacity

Capacity is added and removed continuously in response to load signals. The platform balances two competing goals: latency must stay inside the published budget, and capacity removal must not thrash, so that a dip in traffic does not cause cold starts on the next spike.

Behavior | How it works
Fast scale-out | Spikes in traffic add capacity within seconds, without queueing requests at the edge. There is no manual warm-up step before a launch.
Smooth scale-in | When demand drops, capacity is removed gradually so that a follow-up spike doesn’t require a cold start. Customer traffic never lands on instances that are mid-shutdown.
Always-on baseline | Even at zero traffic, the platform keeps a redundant baseline of warm capacity so first-byte latency stays low for the first request after a quiet window.
Pre-warm hooks | Customers expecting a known launch window (marketing event, batch import) can request a pre-warm. Support adjusts the floor capacity ahead of time so the spike is invisible to end users.
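
The behaviors above can be pictured as a small control rule. The sketch below is illustrative only and is not the platform's actual autoscaler: the function name, parameters, and step size are all assumptions made for the example. It shows scale-out happening in one step, scale-in limited to a small step per tick, and a warm baseline acting as a floor (a pre-warm would correspond to temporarily raising that floor).

```python
def next_capacity(current: int, desired: int, baseline: int,
                  max_step_down: int = 1) -> int:
    """Return the instance count for the next control tick (hypothetical rule)."""
    # Always-on baseline: never target fewer instances than the warm floor.
    target = max(desired, baseline)
    if target >= current:
        # Fast scale-out: jump straight to the target.
        return target
    # Smooth scale-in: remove at most max_step_down instances per tick.
    return max(target, current - max_step_down)


if __name__ == "__main__":
    capacity = 4
    for demand in [10, 3, 3, 0, 0]:
        capacity = next_capacity(capacity, demand, baseline=2)
        print(demand, "->", capacity)
```

Limiting only the downward step is what keeps a brief dip from discarding capacity that the next spike would need.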

Stateless Request Handling

Every instance serving your traffic is interchangeable. The platform does not require sticky sessions, and your application should never assume two consecutive requests land on the same instance:

  • Authentication is per-request. Each call carries its own API key. There is no server-side session that needs to be replicated.
  • Caching is shared. Cache reads and writes go through the platform’s shared cache layer. An entry cached by one instance is visible to every other instance immediately.
  • Uploads are durable. Files uploaded to extraction or DSR export endpoints are written to shared durable storage before the response is returned. The next instance reading them sees a consistent view.
  • No client-side affinity headers. Your callers do not need to send affinity cookies or pin to a specific edge. Standard HTTP load-balancing applies and is invisible to the integrator.
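
From the caller's side, stateless handling means every request is built the same way and carries everything needed to authenticate it. The following is a minimal sketch using only the Python standard library. The base URL, the `Authorization: Bearer` scheme, and the `{"query": ...}` payload shape are assumptions for illustration; only the `POST /memory/query` path and the `X-Project-ID` header come from this page.

```python
import json
import urllib.request

# Hypothetical placeholder host; substitute the real API base URL.
BASE_URL = "https://memorysync.example"


def build_request(path, payload, api_key, project_id):
    """Build a self-contained POST request: no cookies, no session, no affinity."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "X-Project-ID": project_id,            # tenant header named on this page
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=10.0):
    """Send the request; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read())


if __name__ == "__main__":
    req = build_request("/memory/query", {"query": "favourite colour"},
                        api_key="YOUR_API_KEY", project_id="YOUR_PROJECT_ID")
    print(req.get_method(), req.full_url)
```

Because nothing in the request depends on a previous response, two consecutive calls can land on different instances without any change to the client.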

Tenant Isolation Under Load

Multi-tenant isolation is enforced at every layer so that a spike on one organization’s account never affects another’s latency:

  • Tenant resolution. Every request is routed to the correct organization via the API key and X-Project-ID header. The tenant identity is established before any business logic runs, and the request can never cross into another tenant’s data.
  • Per-key rate limiting. Rate limits apply at the API-key level. One tenant exceeding its rate limit produces 429 responses on that tenant’s traffic only — every other tenant continues unaffected.
  • Per-tenant failure isolation. Upstream provider failures (model errors, embedding outages) trip a circuit breaker for the affected tenant only. Other organizations on the same platform keep operating normally.
  • Fair-share scheduling. Background queues are fair-scheduled across tenants. A heavy backfill on one organization cannot starve another organization’s sweeps.
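
One common way to get fair-share behavior is round-robin dispatch across per-tenant queues. The sketch below illustrates that general technique and is not a description of the platform's scheduler; the class and method names are hypothetical.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin over per-tenant FIFO queues (illustrative sketch)."""

    def __init__(self):
        # Maps tenant -> deque of pending jobs; the order is the turn order.
        self._queues = OrderedDict()

    def submit(self, tenant, job):
        self._queues.setdefault(tenant, deque()).append(job)

    def next_job(self):
        """Return (tenant, job) for the tenant whose turn it is, or None if empty."""
        if not self._queues:
            return None
        tenant, queue = self._queues.popitem(last=False)  # take the front tenant
        job = queue.popleft()
        if queue:
            # Tenant still has work: move it to the back of the turn order.
            self._queues[tenant] = queue
        return tenant, job


def drain(fair_queue):
    """Collect jobs in dispatch order until the queue is empty."""
    order = []
    while (item := fair_queue.next_job()) is not None:
        order.append(item)
    return order
```

With a four-job backfill from one organization queued ahead of a single sweep from another, the sweep is dispatched second rather than fifth, because each tenant gets one job per turn.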

Concurrent Worker Safety

As background capacity scales, multiple workers may process the same job type at the same time. The platform guarantees they never step on each other:

  • Lock-and-skip claims. Sweep jobs (retention, tier transitions, re-evaluation, compaction) claim candidates with row-level locks that skip already-claimed work. Two workers never process the same record.
  • Atomic state transitions. Each state change (tier promote/demote, memory compaction merge, soft-delete) is wrapped in a single transaction. Either it commits in full or rolls back — partial transitions are impossible.
  • No starvation. Records skipped because another worker held the lock are picked up on the very next sweep cycle. The sweep cadence guarantees every eligible record is processed within a bounded time window.
  • Bounded batch size. Each sweep processes a fixed batch so that a single worker never monopolises shared resources, even during a backfill.
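
The lock-and-skip pattern can be simulated in memory to show why two workers never process the same record. In the sketch below, a per-record `threading.Lock` stands in for a row-level lock and a non-blocking acquire stands in for skipping locked rows. In SQL databases such as PostgreSQL, the same idea is commonly written as `SELECT ... FOR UPDATE SKIP LOCKED`, though this page does not say which mechanism the platform uses. All names here are hypothetical.

```python
import threading


class Record:
    """A sweep candidate guarded by its own lock (stand-in for a row lock)."""

    def __init__(self, record_id):
        self.id = record_id
        self.lock = threading.Lock()
        self.processed_count = 0


def sweep(records, batch_size):
    """Claim and process up to batch_size records, skipping locked ones."""
    claimed = []
    for rec in records:
        if len(claimed) >= batch_size:
            break  # bounded batch: stop once the batch is full
        if not rec.lock.acquire(blocking=False):
            continue  # another worker holds it: skip instead of waiting
        try:
            if rec.processed_count == 0:  # still eligible
                rec.processed_count += 1  # state change happens under the lock
                claimed.append(rec.id)
        finally:
            rec.lock.release()
    return claimed


def run_cycles(records, workers=2, batch_size=3):
    """Run concurrent sweep cycles until every record is processed."""
    cycles = 0
    while any(r.processed_count == 0 for r in records):
        threads = [threading.Thread(target=sweep, args=(records, batch_size))
                   for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        cycles += 1
    return cycles
```

A record skipped in one cycle is simply picked up in a later one, which is the no-starvation property described above: with 20 records, 2 workers, and a batch size of 3, at most 6 records are handled per cycle, so the run takes several cycles but every record ends up processed exactly once.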

Scaling Limits & Quotas

Your plan governs how much elastic capacity you have access to. The platform never blocks individual requests for capacity reasons — it scales to meet legitimate traffic — but it enforces the plan quotas you signed up for:

Limit | Behavior
Per-key rate limit | Per-second and per-minute caps to absorb hot-loops and runaway clients. Exceeding the cap returns 429 Too Many Requests; the limiter resets on the next interval.
Monthly plan quota | Plan-level caps on monthly adds and queries. Once the cap is reached, the platform silently degrades: requests return 200 OK with empty results and no memory is stored. See the Silent Degradation page for the full contract.
Burst tolerance | The rate limiter uses a sliding window so that short-lived bursts inside your plan rate are absorbed without 429s.
Background concurrency | Per-tenant concurrency for heavy background work (large ingestion, DSR exports) is shaped automatically so that a single backfill cannot starve interactive recall.
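
To make the per-key, sliding-window behavior concrete, here is a minimal sliding-window-log limiter. It illustrates the general technique only: the class name, the limit of 3 per second, and the explicit `now` argument are assumptions for the example, not the platform's implementation or your plan's actual limits.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._hits = {}  # api_key -> deque of accepted request timestamps

    def allow(self, api_key, now):
        """Return True to accept the request, False to answer 429."""
        hits = self._hits.setdefault(api_key, deque())
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the cap for this key only
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=3, window_seconds=1.0)
    for t in [0.0, 0.1, 0.2, 0.3, 1.05]:
        print(t, limiter.allow("key-a", now=t))
```

Each key has its own log of timestamps, so one key hitting its cap has no effect on whether another key's requests are accepted.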

If you expect to exceed your plan’s monthly quota or per-key rate limit, contact billing in advance. The platform will not throw errors that surprise your application: exceeding the per-key rate limit returns the documented 429, and the only failure mode for plan-quota exhaustion is the documented silent-degradation path.
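
On the client side, a 429 is usually handled by waiting briefly and retrying. The helper below is a hedged sketch of exponential backoff; the function name, the `send` callable returning a `(status, body)` pair, and the delay values are assumptions for illustration, not part of any MemorySync SDK.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is any zero-argument callable returning (status_code, body).
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return status, body  # still rate limited after the final attempt
```

Note that backoff only helps with rate limiting. Plan-quota exhaustion returns 200 OK with empty results, so it never triggers this retry path.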