Sluice

Mission & Concepts

Why Sluice exists, design philosophy, and key concepts behind SDK-first cooperative rate limiting.

The problem

Multiple Ontopix products share the same vendor API quotas. When audit-service and stats-service both call OpenAI, they draw from the same rate limit pool. Without coordination, one product can exhaust the quota for all others, causing unpredictable 429 errors across the platform.

Why not a proxy, queue, or gateway?

  • Proxy -- Adds a network hop to every vendor call. Increases latency, becomes a single point of failure, and requires its own scaling and monitoring.
  • Queue -- Vendor calls are latency-sensitive. Queuing adds delay and makes request-response patterns awkward. Products already have their own SQS queues for retries.
  • Gateway -- Correct long-term answer (see ADR-005), but too heavy for v0.1.0. A gateway needs routing, auth, and observability. Sluice solves the immediate coordination problem without that overhead.

Sluice is SDK-first. Products import a library, call acquire() or slot(), and make vendor calls directly. The only shared infrastructure is a DynamoDB table.

Design philosophy

Cooperative rate limiting. Sluice does not block or reject calls. It tells the caller how many tokens are available and, if none remain, exactly how long to wait. The caller decides whether to sleep inline or requeue.

Deterministic retry delays. When quota is exhausted, acquire() returns a retry_in value computed from the refill rate. No guessing, no exponential backoff against the vendor. The caller knows precisely when capacity will be available.
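A caller loop built on this contract might look like the sketch below. The result shape (`granted`, `retry_in`) and the `AcquireResult` class are assumptions for illustration, not the SDK's actual return type:

```python
import asyncio

# Hypothetical result shape -- the real SDK's acquire() return type may differ.
class AcquireResult:
    def __init__(self, granted: bool, retry_in: float = 0.0):
        self.granted = granted    # True if tokens were consumed
        self.retry_in = retry_in  # seconds until enough tokens refill

async def call_with_retry(acquire, vendor_call, max_attempts=3):
    """Sleep exactly retry_in when quota is exhausted -- no exponential backoff
    against the vendor, since the refill rate makes the wait deterministic."""
    for _ in range(max_attempts):
        result = await acquire("openai#rpm")
        if result.granted:
            return await vendor_call()
        await asyncio.sleep(result.retry_in)  # capacity is available after this
    raise TimeoutError("quota still exhausted after max_attempts")
```

Because `retry_in` is computed from the refill rate, sleeping that exact duration is sufficient; there is no need to pad the wait or randomize it.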

SDK-first. No sidecar, no proxy, no agent. Products add a dependency and configure environment variables. The SDK handles DynamoDB transactions, optimistic locking, and lease management internally.

Key concepts

Dimension

A vendor + metric pair that identifies a specific rate limit. Format: {vendor}#{metric}.

Examples:

  • openai#rpm -- OpenAI requests per minute
  • openai#tpm -- OpenAI tokens per minute
  • elevenlabs#characters -- ElevenLabs character quota

Each dimension is a separate row in DynamoDB with its own capacity and refill rate.
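Splitting a dimension key back into its parts is a one-liner; this illustrative helper (not part of the Sluice SDK) shows the format's structure:

```python
def parse_dimension(dimension: str) -> tuple:
    """Split a '{vendor}#{metric}' dimension key into (vendor, metric).
    Illustrative helper only -- not a Sluice SDK function."""
    vendor, sep, metric = dimension.partition("#")
    if not sep or not vendor or not metric:
        raise ValueError(f"malformed dimension: {dimension!r}")
    return vendor, metric
```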

Bucket

The DynamoDB row that tracks a dimension's quota state. Contains capacity, current token count, refill rate, and a version counter for optimistic locking. Buckets are seeded by Terraform -- adding a new vendor dimension is a terraform apply, not a code change.
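A bucket row might carry attributes like the following. The attribute names here are assumptions for illustration; the real schema is defined by the Terraform seed and may differ:

```python
# Hypothetical bucket item -- attribute names are assumptions, not the real schema.
bucket_item = {
    "pk": "openai#rpm",              # dimension key (partition key)
    "capacity": 500,                 # maximum tokens the bucket can hold
    "tokens": 500,                   # currently available tokens
    "refill_per_second": 500 / 60,   # refill rate derived from the per-minute quota
    "cost_per_call": 1,              # tokens consumed per acquire()
    "version": 0,                    # counter incremented on every write (optimistic locking)
    "last_refill_at": 1700000000.0,  # epoch seconds of the last lazy refill
}
```

The `version` counter is what makes concurrent writers safe: each update is conditional on the version it read, so a lost race simply retries.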

Tokens

Available capacity units in a bucket. Tokens are consumed on acquire() and lazily refilled over time using elapsed-time arithmetic (no background refill job). The number of tokens consumed per call is configured as cost_per_call on the bucket.
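The elapsed-time arithmetic can be sketched as below. This is a minimal illustration of the lazy-refill idea, not the SDK's internal code:

```python
import time

def lazy_refill(tokens, capacity, refill_per_second, last_refill_at, now=None):
    """Credit tokens for the time elapsed since the last refill, capped at
    capacity. Sketch of the elapsed-time arithmetic -- no background job needed;
    the refill happens at read time."""
    now = time.time() if now is None else now
    elapsed = max(0.0, now - last_refill_at)
    refilled = min(capacity, tokens + elapsed * refill_per_second)
    return refilled, now  # new token count and new last_refill_at
```

Because the refill is computed on demand, an idle bucket costs nothing: the next `acquire()` simply credits all the time that passed since the previous one.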

Lease

A temporary DynamoDB record written when tokens are acquired. Leases have a TTL and serve two purposes:

  1. Crash recovery. If a caller acquires tokens but crashes before releasing, the reconciler finds the expired lease and restores the tokens.
  2. Concurrent limit tracking. For concurrent-type dimensions, active leases represent held slots.

Lease records are deleted on release. Expired leases are cleaned up by the reconciler Lambda every 5 minutes.
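The reconciler pass can be sketched as follows. The lease field names and the callback shapes are assumptions; the real Lambda scans DynamoDB and writes back with conditional updates:

```python
def reconcile(expired_leases, restore_tokens, delete_lease):
    """Sketch of one reconciler pass: for each lease past its TTL, restore the
    tokens it held and delete the lease record. Field names ('dimension',
    'cost', 'lease_id') are hypothetical."""
    restored = 0
    for lease in expired_leases:
        restore_tokens(lease["dimension"], lease["cost"])  # give tokens back to the bucket
        delete_lease(lease["lease_id"])                    # remove the stale lease record
        restored += lease["cost"]
    return restored
```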

Slot

The high-level API that most product code should use. A slot wraps acquire + hold + release into a safe pattern:

Python:

async with slot("openai#rpm", timeout=30) as s:
    response = await openai.chat.completions.create(...)
# release() called automatically

TypeScript:

await slot("openai#rpm", 30, async () => {
    const response = await openai.chat.completions.create(...);
});
// release() called automatically

Go:

err := sluice.WithSlot(ctx, "openai#rpm", 30*time.Second, func(ctx context.Context) error {
    // vendor call -- release is guaranteed on return
    return nil
})

Using slot() / WithSlot() guarantees that tokens are released even if the vendor call fails. The lower-level acquire() API is available when you need explicit control, but requires manual release() in a finally block.
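The manual pattern looks like this sketch. The `acquire()`/`release()` names come from the text above, but the exact signatures and the grant object are assumptions:

```python
# Hedged sketch of the lower-level API -- signatures are assumed, not the SDK's.
async def call_with_manual_release(client, vendor_call):
    grant = await client.acquire("openai#rpm")
    try:
        return await vendor_call()
    finally:
        # Always runs, even if the vendor call raises, so tokens are never leaked.
        await client.release(grant)
```

If the `finally` is forgotten and the process stays alive, the tokens remain held until the lease TTL expires and the reconciler restores them, which is exactly why `slot()` is the recommended default.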