Operations

Reconciler Operations

How the reconciler Lambda recovers leaked tokens from expired lease records.

What it is

The reconciler is a Lambda function (sluice-reconciler-{env}) that scans for expired lease records in the Sluice DynamoDB table and restores tokens to their parent buckets. It is the crash recovery mechanism for Sluice.

Why it exists

When a caller acquires a slot, Sluice writes a lease record with a TTL. If the caller fails to call release() (Lambda timeout, crash, unhandled exception), the lease record remains and the tokens are lost.

For concurrent limit types, this is a hard leak -- the slot is permanently consumed until something restores it. Time-based limits (requests, tokens) self-heal via the lazy refill mechanism, but the lease record still needs cleanup.

How it works

The reconciler runs this sequence on every invocation:

1. Scan for expired leases

FilterExpression: begins_with(vendor_dimension, "lease#") AND ttl < :now

Lease keys follow the format lease#{vendor}#{dimension}#{uuid}. The scan finds all lease records whose TTL has passed.
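A minimal sketch of this scan step, assuming a standard boto3 low-level client. The attribute names (`vendor_dimension`, `ttl`) and the `lease#` prefix come from the filter expression above; the helper names and table name are illustrative, not the actual `_reconciler.py` code.

```python
from __future__ import annotations

import time


def expired_lease_scan_kwargs(table_name: str, now: int | None = None) -> dict:
    """Build the Scan parameters for finding expired lease records.

    Note: `ttl` is a DynamoDB reserved word, so it must be referenced through
    an expression attribute name alias. Also note a filtered Scan still reads
    (and bills RCUs for) every item in the table; the filter only trims the
    response.
    """
    return {
        "TableName": table_name,
        "FilterExpression": "begins_with(vendor_dimension, :prefix) AND #ttl < :now",
        "ExpressionAttributeNames": {"#ttl": "ttl"},
        "ExpressionAttributeValues": {
            ":prefix": {"S": "lease#"},
            ":now": {"N": str(now if now is not None else int(time.time()))},
        },
    }


def scan_expired_leases(dynamodb, table_name: str):
    """Yield expired lease items, following pagination (each Scan page is <= 1 MB)."""
    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(**expired_lease_scan_kwargs(table_name)):
        yield from page.get("Items", [])
```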

2. Process each expired lease

For each expired lease, the reconciler parses the vendor dimension from the lease key and reads the parent bucket.

If limit_type is concurrent:

A transactional write restores the tokens with a capacity guard:

TransactWriteItems:
  - Update bucket: SET tokens = tokens + :cost
    Condition: tokens <= :max_tokens   (:max_tokens = capacity - cost)
  - Delete lease record

The condition tokens <= capacity - cost is the capacity guard. It prevents double-restore: if both release() and the reconciler run for the same lease, the second one to execute finds tokens already at capacity and the condition fails.

When the capacity guard rejects the update (TransactionCanceledException), the reconciler simply deletes the lease record. This is expected behavior, not an error.

If limit_type is requests or tokens:

Time-based buckets refill naturally via the lazy refill algorithm. The reconciler just deletes the lease record -- no token restoration needed.

If the parent bucket no longer exists:

The reconciler deletes the orphaned lease record silently.
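The three branches above can be sketched as follows. This is an illustration, not the actual `_reconciler.py`: only the lease-key format and the `tokens <= capacity - cost` guard come from this doc, while the key schema, function names, and attribute aliasing are assumptions.

```python
from __future__ import annotations


def parse_lease_key(lease_key: str) -> tuple[str, str, str]:
    """Split 'lease#{vendor}#{dimension}#{uuid}' into (vendor, dimension, uuid)."""
    prefix, vendor, dimension, lease_id = lease_key.split("#", 3)
    if prefix != "lease":
        raise ValueError(f"not a lease key: {lease_key!r}")
    return vendor, dimension, lease_id


def plan_for_lease(limit_type: str | None) -> str:
    """Decide what to do with an expired lease, given its parent bucket's limit_type.

    - concurrent buckets need a transactional token restore + lease delete
    - time-based buckets (requests/tokens) self-heal via lazy refill, so the
      lease record is simply deleted
    - a missing parent bucket (limit_type is None) means the lease is orphaned
    """
    if limit_type is None:
        return "delete_orphaned_lease"
    if limit_type == "concurrent":
        return "restore_and_delete"
    return "delete_lease"


def restore_transaction(table: str, bucket_key: dict, lease_key: dict,
                        cost: int, capacity: int) -> list[dict]:
    """Build a TransactWriteItems payload with the capacity guard.

    The condition `tokens <= capacity - cost` prevents a double restore: if
    release() already ran, the update is rejected with
    TransactionCanceledException and the caller falls back to a plain
    DeleteItem on the lease record. The `tokens` attribute is aliased
    defensively in case of reserved-word collisions.
    """
    return [
        {"Update": {
            "TableName": table,
            "Key": bucket_key,
            "UpdateExpression": "SET #tok = #tok + :cost",
            "ConditionExpression": "#tok <= :max_restorable",
            "ExpressionAttributeNames": {"#tok": "tokens"},
            "ExpressionAttributeValues": {
                ":cost": {"N": str(cost)},
                ":max_restorable": {"N": str(capacity - cost)},
            },
        }},
        {"Delete": {"TableName": table, "Key": lease_key}},
    ]
```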

Schedule

Triggered by an EventBridge rule on a rate(5 minutes) schedule.

The 5-minute interval means concurrent slots leaked by crashed callers may remain unavailable for up to 5 minutes + the lease TTL (SLUICE_LEASE_TTL, default 60 seconds). In practice, the maximum recovery time is approximately 6 minutes.
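The worst-case window above is just the lease TTL plus one full scan interval; a quick check with the defaults:

```python
LEASE_TTL_S = 60        # SLUICE_LEASE_TTL default
SCAN_INTERVAL_S = 300   # rate(5 minutes)

# A slot leaked by a crash stays held until its lease TTL passes, then waits
# up to one full interval for the next reconciler scan to find it.
worst_case_s = LEASE_TTL_S + SCAN_INTERVAL_S  # 360 s, i.e. ~6 minutes
```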

Infrastructure

Resource          Name
Lambda function   sluice-reconciler-{env}
Execution role    sluice-reconciler-{env}
EventBridge rule  sluice-reconciler-{env}
Runtime           Python 3.11
Timeout           60 seconds
Memory            Default (128 MB)
Source            python/src/sluice/_reconciler.py
Terraform         .infra/reconciler.tf

IAM permissions

The reconciler role has these DynamoDB actions on the vendor buckets table:

  • dynamodb:Scan -- to find expired leases
  • dynamodb:GetItem -- to read parent bucket state
  • dynamodb:TransactWriteItems -- to atomically restore tokens and delete leases
  • dynamodb:DeleteItem -- to delete lease records individually

Plus AWSLambdaBasicExecutionRole for CloudWatch Logs.

Failure modes

Failure                                   Impact                                       Recovery
Lambda timeout (> 60s)                    Incomplete scan; some leases not processed   Next invocation picks them up
Capacity guard rejection                  Expected -- tokens already at capacity       Lease deleted, no action needed
DynamoDB throttle                         Some leases skipped                          Next invocation retries
Bucket deleted between scan and restore   Lease record orphaned                        Reconciler deletes it on next pass

Observability

The reconciler emits a structured log event on completion:

{
  "event": "reconciler.complete",
  "restored": 3,
  "dimensions": ["elevenlabs#concurrent", "openai#rpm", "openai#tpm"],
  "already_capped": 1
}
Field           Meaning
restored        Number of leases successfully processed (tokens restored or lease cleaned up)
dimensions      Which vendor dimensions had leases restored
already_capped  Number of concurrent leases where the capacity guard rejected the restore

What to watch for

  • already_capped consistently high: callers are releasing slots only after the lease TTL has expired, so release() and the reconciler race on the same lease. Usually harmless, but it indicates release() is running late or SLUICE_LEASE_TTL is too short.
  • restored consistently high for concurrent dimensions: callers are failing to call release(). Investigate whether callers are using raw acquire() instead of the slot() / WithSlot() wrappers.
  • Lambda errors in CloudWatch: check for DynamoDB connectivity issues or IAM permission problems.