Operations

Reconciler Operations

How the reconciler Lambda recovers leaked tokens from expired lease records.

What it is

The reconciler is a Lambda function (sluice-reconciler-{env}) that scans for expired lease records in the Sluice DynamoDB table and restores tokens to their parent buckets. It is the crash recovery mechanism for Sluice.

Why it exists

When a caller acquires a slot, Sluice writes a lease record with a TTL. If the caller fails to call release() (Lambda timeout, crash, unhandled exception), the lease record remains and the tokens are lost.

For concurrent limit types, this is a hard leak -- the slot is permanently consumed until something restores it. Time-based limits (requests, tokens) self-heal via the lazy refill mechanism, but the lease record still needs cleanup.

How it works

The reconciler runs this sequence on every invocation:

1. Scan for expired leases

FilterExpression: begins_with(vendor_dimension, "lease#") AND ttl < :now

Lease keys follow the format lease#{vendor}#{dimension}#{uuid}. The scan finds all lease records whose TTL has passed.
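A minimal sketch of this scan step, assuming a standard boto3 low-level client. The attribute names (`vendor_dimension`, `ttl`) and the `lease#` prefix come from the filter expression above; the helper names and table name are illustrative, not the actual `_reconciler.py` code.

```python
from __future__ import annotations

import time


def expired_lease_scan_kwargs(table_name: str, now: int | None = None) -> dict:
    """Build the Scan parameters for finding expired lease records.

    Note: `ttl` is a DynamoDB reserved word, so it must be referenced through
    an expression attribute name alias. Also note a filtered Scan still reads
    (and bills RCUs for) every item in the table; the filter only trims the
    response.
    """
    return {
        "TableName": table_name,
        "FilterExpression": "begins_with(vendor_dimension, :prefix) AND #ttl < :now",
        "ExpressionAttributeNames": {"#ttl": "ttl"},
        "ExpressionAttributeValues": {
            ":prefix": {"S": "lease#"},
            ":now": {"N": str(now if now is not None else int(time.time()))},
        },
    }


def scan_expired_leases(dynamodb, table_name: str):
    """Yield expired lease items, following pagination (each Scan page is <= 1 MB)."""
    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(**expired_lease_scan_kwargs(table_name)):
        yield from page.get("Items", [])
```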

2. Process each expired lease

For each expired lease, the reconciler parses the vendor dimension from the lease key and reads the parent bucket.

If limit_type is concurrent:

A transactional write restores the tokens with a capacity guard:

TransactWriteItems:
  - Update bucket: SET tokens = tokens + :cost
    Condition: tokens <= :max_tokens   (:max_tokens = capacity - cost)
  - Delete lease record

The condition tokens <= capacity - cost is the capacity guard. It prevents double-restore: if both release() and the reconciler run for the same lease, the second one to execute finds tokens already at capacity and the condition fails.

When the capacity guard rejects the update (TransactionCanceledException), the reconciler simply deletes the lease record. This is expected behavior, not an error.

If limit_type is requests or tokens:

Time-based buckets refill naturally via the lazy refill algorithm. The reconciler just deletes the lease record -- no token restoration needed.

If the parent bucket no longer exists:

The reconciler deletes the orphaned lease record silently.
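The three branches above can be sketched as follows. This is an illustration, not the actual `_reconciler.py`: only the lease-key format and the `tokens <= capacity - cost` guard come from this doc, while the key schema, function names, and attribute aliasing are assumptions.

```python
from __future__ import annotations


def parse_lease_key(lease_key: str) -> tuple[str, str, str]:
    """Split 'lease#{vendor}#{dimension}#{uuid}' into (vendor, dimension, uuid)."""
    prefix, vendor, dimension, lease_id = lease_key.split("#", 3)
    if prefix != "lease":
        raise ValueError(f"not a lease key: {lease_key!r}")
    return vendor, dimension, lease_id


def plan_for_lease(limit_type: str | None) -> str:
    """Decide what to do with an expired lease, given its parent bucket's limit_type.

    - concurrent buckets need a transactional token restore + lease delete
    - time-based buckets (requests/tokens) self-heal via lazy refill, so the
      lease record is simply deleted
    - a missing parent bucket (limit_type is None) means the lease is orphaned
    """
    if limit_type is None:
        return "delete_orphaned_lease"
    if limit_type == "concurrent":
        return "restore_and_delete"
    return "delete_lease"


def restore_transaction(table: str, bucket_key: dict, lease_key: dict,
                        cost: int, capacity: int) -> list[dict]:
    """Build a TransactWriteItems payload with the capacity guard.

    The condition `tokens <= capacity - cost` prevents a double restore: if
    release() already ran, the update is rejected with
    TransactionCanceledException and the caller falls back to a plain
    DeleteItem on the lease record. The `tokens` attribute is aliased
    defensively in case of reserved-word collisions.
    """
    return [
        {"Update": {
            "TableName": table,
            "Key": bucket_key,
            "UpdateExpression": "SET #tok = #tok + :cost",
            "ConditionExpression": "#tok <= :max_restorable",
            "ExpressionAttributeNames": {"#tok": "tokens"},
            "ExpressionAttributeValues": {
                ":cost": {"N": str(cost)},
                ":max_restorable": {"N": str(capacity - cost)},
            },
        }},
        {"Delete": {"TableName": table, "Key": lease_key}},
    ]
```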

Schedule

Triggered by an EventBridge rule on a rate(5 minutes) schedule.

The 5-minute interval means concurrent slots leaked by crashed callers may remain unavailable for up to 5 minutes + the lease TTL (SLUICE_LEASE_TTL, default 60 seconds). In practice, the maximum recovery time is approximately 6 minutes.
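The worst-case window above is just the lease TTL plus one full scan interval; a quick check with the defaults:

```python
LEASE_TTL_S = 60        # SLUICE_LEASE_TTL default
SCAN_INTERVAL_S = 300   # rate(5 minutes)

# A slot leaked by a crash stays held until its lease TTL passes, then waits
# up to one full interval for the next reconciler scan to find it.
worst_case_s = LEASE_TTL_S + SCAN_INTERVAL_S  # 360 s, i.e. ~6 minutes
```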

Infrastructure

Resource          Name
Lambda function   sluice-reconciler-{env}
Execution role    sluice-reconciler-{env}
EventBridge rule  sluice-reconciler-{env}
Runtime           Python 3.11
Timeout           60 seconds
Memory            Default (128 MB)
Source            python/src/sluice/_reconciler.py
Terraform         .infra/reconciler.tf

IAM permissions

The reconciler role has these DynamoDB actions on the vendor buckets table:

  • dynamodb:Scan -- to find expired leases
  • dynamodb:GetItem -- to read parent bucket state
  • dynamodb:TransactWriteItems -- to atomically restore tokens and delete leases
  • dynamodb:DeleteItem -- to delete lease records individually

Plus AWSLambdaBasicExecutionRole for CloudWatch Logs.

Failure modes

Failure                                   Impact                                       Recovery
Lambda timeout (> 60s)                    Incomplete scan; some leases not processed   Next invocation picks them up
Capacity guard rejection                  Expected -- tokens already at capacity       Lease deleted, no action needed
DynamoDB throttle                         Some leases skipped                          Next invocation retries
Bucket deleted between scan and restore   Lease record orphaned                        Reconciler deletes it on next pass

Observability

The reconciler emits a structured log event on completion:

{
  "event": "reconciler.complete",
  "restored": 3,
  "dimensions": ["elevenlabs#concurrent", "openai#rpm", "openai#tpm"],
  "already_capped": 1
}
Field           Meaning
restored        Number of leases successfully processed (tokens restored or lease cleaned up)
dimensions      Which vendor dimensions had leases restored
already_capped  Number of concurrent leases where the capacity guard rejected the restore

What to watch for

  • already_capped consistently high: callers are releasing slots only after the lease TTL has expired, so release() and the reconciler race on the same lease. Usually harmless, but it indicates release() is running late or SLUICE_LEASE_TTL is too short.
  • restored consistently high for concurrent dimensions: callers are failing to call release(). Investigate whether callers are using raw acquire() instead of the slot() / WithSlot() wrappers.
  • Lambda errors in CloudWatch: check for DynamoDB connectivity issues or IAM permission problems.