Reconciler Operations
How the reconciler Lambda recovers leaked tokens from expired lease records.
What it is
The reconciler is a Lambda function (sluice-reconciler-{env}) that scans for expired lease records in the Sluice DynamoDB table and restores tokens to their parent buckets. It is the crash recovery mechanism for Sluice.
Why it exists
When a caller acquires a slot, Sluice writes a lease record with a TTL. If the caller fails to call release() (Lambda timeout, crash, unhandled exception), the lease record remains and the tokens are lost.
For concurrent limit types, this is a hard leak -- the slot is permanently consumed until something restores it. Time-based limits (requests, tokens) self-heal via the lazy refill mechanism, but the lease record still needs cleanup.
How it works
The reconciler runs this sequence on every invocation:
1. Scan for expired leases
FilterExpression: begins_with(vendor_dimension, "lease#") AND ttl < :now
Lease keys follow the format lease#{vendor}#{dimension}#{uuid}. The scan finds all lease records whose TTL has passed.
2. Process each expired lease
For each expired lease, the reconciler parses the vendor dimension from the lease key and reads the parent bucket.
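Splitting the key is straightforward because the uuid is the final segment; a minimal sketch (the helper name is hypothetical):

```python
def parse_lease_key(key: str) -> tuple[str, str, str]:
    """Split lease#{vendor}#{dimension}#{uuid} into its parts.

    maxsplit=3 keeps any '#' inside the uuid segment intact.
    The parent bucket key is then "{vendor}#{dimension}".
    """
    prefix, vendor, dimension, lease_id = key.split("#", 3)
    if prefix != "lease":
        raise ValueError(f"not a lease key: {key!r}")
    return vendor, dimension, lease_id
```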
If limit_type is concurrent:
A transactional write restores the tokens with a capacity guard:
TransactWriteItems:
- Update bucket: SET tokens = tokens + :cost, condition tokens <= :capacity_minus_cost
- Delete lease record
The condition tokens <= capacity - cost is the capacity guard. It prevents double-restore: if both release() and the reconciler run for the same lease, the second one to execute finds tokens already at capacity and the condition fails.
When the capacity guard rejects the update (TransactionCanceledException), the reconciler simply deletes the lease record. This is expected behavior, not an error.
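As a sketch, the transaction request could be assembled like this (table name and key shapes are hypothetical; DynamoDB condition expressions cannot compute capacity - cost, so the guard value is precomputed client-side):

```python
def restore_transaction(table: str, bucket_key: dict, lease_key: dict,
                        cost: int, capacity: int) -> dict:
    """Build a TransactWriteItems request that restores `cost` tokens
    under the capacity guard and deletes the lease atomically."""
    return {
        "TransactItems": [
            {
                "Update": {
                    "TableName": table,
                    "Key": bucket_key,
                    "UpdateExpression": "SET tokens = tokens + :cost",
                    # Guard: only restore if the result stays <= capacity.
                    "ConditionExpression": "tokens <= :guard",
                    "ExpressionAttributeValues": {
                        ":cost": {"N": str(cost)},
                        ":guard": {"N": str(capacity - cost)},
                    },
                }
            },
            {"Delete": {"TableName": table, "Key": lease_key}},
        ]
    }
```

If DynamoDB cancels the transaction because the condition fails, the caller falls back to deleting the lease record on its own.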
If limit_type is requests or tokens:
Time-based buckets refill naturally via the lazy refill algorithm. The reconciler just deletes the lease record -- no token restoration needed.
If the parent bucket no longer exists:
The reconciler deletes the orphaned lease record silently.
Schedule
Triggered by EventBridge every 5 minutes via rate(5 minutes).
The 5-minute interval means concurrent slots leaked by crashed callers may remain unavailable for up to 5 minutes + the lease TTL (SLUICE_LEASE_TTL, default 60 seconds). In practice, the maximum recovery time is approximately 6 minutes.
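The worst-case bound works out as follows (illustrative arithmetic only):

```python
LEASE_TTL_S = 60        # SLUICE_LEASE_TTL default
SCAN_INTERVAL_S = 300   # rate(5 minutes)

# Worst case: the caller crashes immediately after acquiring, and the
# lease expires just after a reconciler run starts, so the slot waits
# the full TTL plus one full scan interval before the next pass frees it.
max_recovery_s = LEASE_TTL_S + SCAN_INTERVAL_S  # 360 seconds, ~6 minutes
```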
Infrastructure
| Resource | Name |
|---|---|
| Lambda function | sluice-reconciler-{env} |
| Execution role | sluice-reconciler-{env} |
| EventBridge rule | sluice-reconciler-{env} |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | Default (128 MB) |
| Source | python/src/sluice/_reconciler.py |
| Terraform | .infra/reconciler.tf |
IAM permissions
The reconciler role has these DynamoDB actions on the vendor buckets table:
- dynamodb:Scan -- to find expired leases
- dynamodb:GetItem -- to read parent bucket state
- dynamodb:TransactWriteItems -- to atomically restore tokens and delete leases
- dynamodb:DeleteItem -- to delete lease records individually
Plus AWSLambdaBasicExecutionRole for CloudWatch Logs.
Failure modes
| Failure | Impact | Recovery |
|---|---|---|
| Lambda timeout (> 60s) | Incomplete scan; some leases not processed | Next invocation picks them up |
| Capacity guard rejection | Expected -- tokens already at capacity | Lease deleted, no action needed |
| DynamoDB throttle | Some leases skipped | Next invocation retries |
| Bucket deleted between scan and restore | Lease record orphaned | Reconciler deletes it on next pass |
Observability
The reconciler emits a structured log event on completion:
{
"event": "reconciler.complete",
"restored": 3,
"dimensions": ["elevenlabs#concurrent", "openai#rpm", "openai#tpm"],
"already_capped": 1
}
| Field | Meaning |
|---|---|
| restored | Number of leases successfully processed (tokens restored or lease cleaned up) |
| dimensions | Which vendor dimensions had leases restored |
| already_capped | Number of concurrent leases where the capacity guard rejected the restore |
What to watch for
- already_capped consistently high: callers are both calling release() and letting leases expire. Usually harmless, but indicates a release() timing issue.
- restored consistently high for concurrent dimensions: callers are failing to call release(). Investigate whether slot() / WithSlot() is being used instead of raw acquire().
- Lambda errors in CloudWatch: check for DynamoDB connectivity issues or IAM permission problems.