Operations

Stress Testing

Methodology, invariants, and results for Sluice cross-language stress tests.

Sluice includes a cross-language stress test suite that exercises the acquire/release cycle under high contention on real DynamoDB. The suite runs identical workloads from all three SDKs (Python, TypeScript, Go) simultaneously against a shared bucket.

Running stress tests

Local sandbox (LocalStack)

task stress:seed         # Reset sandbox and seed stress buckets
task stress:run          # Run all three SDKs in parallel (default: 10 workers, 10s)
task stress:verify       # Check correctness invariants
task stress:report       # Generate cross-SDK comparison report

Override defaults with environment variables:

STRESS_DIMENSION=stress#concurrent STRESS_WORKERS=5 STRESS_DURATION_SECS=10 task stress:run

Pre environment (real DynamoDB)

task stress:seed:pre     # Seed stress buckets into pre environment
task stress:run:pre      # Run against real DynamoDB (eu-central-1)
task stress:verify:pre   # Verify invariants on pre

Requires AWS credentials with dynamodb:GetItem, dynamodb:PutItem, dynamodb:UpdateItem, dynamodb:DeleteItem, dynamodb:Scan on ontopix-vendor-buckets-pre.

Stress buckets

The seed tasks create these buckets:

| Dimension | Type | Capacity | Refill Rate | Purpose |
|---|---|---|---|---|
| stress#rpm | requests | 50 | 5.0/s | Time-refill rate limiting |
| stress#concurrent | concurrent | 10 | 0 | Concurrent slot limiting |
| stress#openai-rpm | requests | 500 | 8.333/s | Realistic OpenAI RPM |
| stress#openai-tpm | tokens | 200,000 | 3,333.33/s | Realistic OpenAI TPM |
| stress#elevenlabs-concurrent | concurrent | 5 | 0 | Realistic ElevenLabs concurrency |

Worker behavior

Each stress runner spawns STRESS_WORKERS workers (default 10) per SDK. Each worker loops for STRESS_DURATION_SECS:

  1. Acquire a slot on the target dimension
  2. If GRANTED: simulate a vendor call (5-15ms random sleep), then release
  3. If RETRY_IN: sleep for the suggested wait, then retry
  4. Emit NDJSON event to stdout for every outcome

Workers stop acquiring 2 seconds before the deadline (drain buffer) to allow in-flight releases to complete.
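The loop above can be sketched in Python. The `acquire`/`release` callables and their return shape are stand-ins for the real SDK client, not the actual Sluice API:

```python
import random
import time

def run_worker(acquire, release, worker_id, deadline, drain_secs=2.0):
    """Stress-worker loop: acquire a slot, simulate a vendor call, release.

    `acquire` returns ("granted", None) or ("retry_in", wait_seconds);
    `release` frees the slot. Both are hypothetical stand-ins for the SDK.
    """
    events = []
    attempt = 0
    # Stop acquiring `drain_secs` before the deadline (the drain buffer)
    # so in-flight releases can finish before the run ends.
    while time.monotonic() < deadline - drain_secs:
        attempt += 1
        outcome, wait = acquire()
        if outcome == "granted":
            time.sleep(random.uniform(0.005, 0.015))  # simulated vendor call
            release()
            events.append({"worker": worker_id, "outcome": "granted",
                           "attempt": attempt})
        else:
            events.append({"worker": worker_id, "outcome": "retry_in",
                           "wait_seconds": wait, "attempt": attempt})
            time.sleep(wait)
    return events
```

In the real runners each event is additionally serialized as an NDJSON line to stdout (format below).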

NDJSON output format

Each line is a JSON object with:

{"sdk": "python", "worker": 3, "outcome": "granted", "elapsed_ms": 45.2, "attempt": 7, "ts": 1773571234.567}
{"sdk": "go", "worker": 1, "outcome": "retry_in", "wait_seconds": 0.6, "elapsed_ms": 12.3, "attempt": 2, "ts": 1773571234.890}
{"sdk": "typescript", "worker": 5, "outcome": "release_error", "error": "TransactionConflictException: ...", "attempt": 15, "ts": 1773571235.123}
{"sdk": "python", "type": "summary", "total_grants": 42, "total_retries": 180, "total_errors": 0, "duration_secs": 10.0}

Outcomes: granted, retry_in, error (acquire failure), release_error (release failure).
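A few lines of Python suffice to aggregate these events across SDKs; this sketch assumes only the fields shown in the examples above:

```python
import json
from collections import Counter

def tally_outcomes(ndjson_lines):
    """Count (sdk, outcome) pairs, skipping per-SDK summary lines."""
    counts = Counter()
    for line in ndjson_lines:
        event = json.loads(line)
        if event.get("type") == "summary":  # summaries carry totals, not outcomes
            continue
        counts[(event["sdk"], event["outcome"])] += 1
    return counts
```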

Correctness invariants

The verifier (task stress:verify) checks these invariants after each run:

1. No orphaned leases

All leases created during the test must be released. The verifier scans for lease items with ttl > now (excludes DynamoDB-TTL-expired items awaiting garbage collection).

Applies to: All limit types.

2. Token conservation (concurrent only)

For concurrent limits (refill_rate=0), tokens are a finite resource. At quiescence:

tokens + active_leases = capacity

With 0 active leases, tokens must equal capacity.

Note: Under extreme contention (60+ workers on a 10-slot bucket), occasional TransactionConflictException on release can cause token loss. This is by design — the reconciler restores tokens from expired leases within 5 minutes. The stress test does not run the reconciler, so token conservation may fail at very high contention levels.
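The invariant itself is a one-line check (function name is illustrative, not the verifier's actual code):

```python
def tokens_conserved(tokens, active_leases, capacity):
    """Conservation check for concurrent limits (refill_rate == 0):
    every slot is either a free token or held by an active lease."""
    return tokens + active_leases == capacity
```

At quiescence `active_leases` is 0, so any deficit in `tokens` is exactly the number of lost slots awaiting the reconciler.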

3. Version consistency

Each acquire and release increments the bucket's version counter:

  • Concurrent types: version = 2 * total_grants (acquire + release each increment once)
  • Request/token types: version = total_grants (only acquire increments)
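The two cases above reduce to a small helper (a sketch, not the verifier's actual code):

```python
def expected_version(limit_type, total_grants):
    """Bucket version expected after a clean run: concurrent limits bump
    the counter on both acquire and release; request/token limits bump
    it only on acquire."""
    return 2 * total_grants if limit_type == "concurrent" else total_grants
```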

4. No over-grant (requests/tokens only)

Total grants must not exceed the theoretical maximum:

max_allowed = capacity + duration * refill_rate

A 5% tolerance accounts for floating-point timing differences across SDKs.
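As a sketch, with the tolerance folded in: stress#rpm over a 30 s run allows at most 50 + 30 * 5.0 = 200 grants, or 210 with the 5% pad, so the 191 grants observed on pre pass comfortably.

```python
def max_allowed_grants(capacity, duration_secs, refill_rate, tolerance=0.05):
    """Grant ceiling for request/token limits, padded by a 5% tolerance
    for cross-SDK floating-point timing differences."""
    return (capacity + duration_secs * refill_rate) * (1 + tolerance)
```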

Expected contention rates

High contention is expected and correct. With 30 workers competing for optimistic locks on a single DynamoDB item:

| Dimension | Workers | Expected Contention |
|---|---|---|
| stress#rpm | 10/SDK | 85-95% |
| stress#concurrent | 5/SDK | 35-50% |
| stress#concurrent | 10/SDK | 75-85% |

Contention means the acquire transaction was cancelled and retried with exponential backoff. This is the intended behavior of optimistic locking.

Results

Sandbox (LocalStack)

| Dimension | Workers | Duration | Invariants | Errors |
|---|---|---|---|---|
| stress#rpm | 10/SDK | 10s | ALL PASS | 0 |
| stress#concurrent | 5/SDK | 10s | ALL PASS | 0 |

Pre (real DynamoDB, eu-central-1)

| Dimension | Workers | Duration | Grants | Invariants | Errors |
|---|---|---|---|---|---|
| stress#rpm | 10/SDK | 30s | 191 | ALL PASS | 0 |
| stress#concurrent | 10/SDK | 60s | 610 | ALL PASS | 0 |
| stress#concurrent | 20/SDK | 90s | 891 | token_conservation: -5 | 5 release errors |

At 20 workers/SDK (60 total), 5 out of 891 releases (0.6%) failed with TransactionConflictException after 5 retry attempts. Token loss is temporary — the reconciler restores tokens from expired leases within its 5-minute scan interval.

Release error handling

The non-transactional release pattern (see Architecture) handles each error class differently:

| Error | Action | Recovery |
|---|---|---|
| ConditionalCheckFailedException | Skip token restore (capacity guard) | None needed; tokens already at capacity |
| TransactionConflictException | Retry up to 5 times (25ms * attempt backoff) | If all retries fail: tokens lost, reconciler restores within 5 min |
| Other errors (network, throttling) | Propagate to caller | Lease persists until TTL, reconciler restores tokens |
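The TransactionConflictException row can be sketched as a retry helper. The names here are illustrative; the real SDKs implement this internally:

```python
import time

class ConflictError(Exception):
    """Stand-in for DynamoDB's TransactionConflictException."""

def release_with_retry(release_once, max_attempts=5, base_delay_secs=0.025):
    """Retry a conflicting release with linear 25 ms * attempt backoff.

    Returns True on success; False when every attempt conflicts, in which
    case the tokens stay lost until the reconciler restores them.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            release_once()
            return True
        except ConflictError:
            if attempt == max_attempts:
                return False
            time.sleep(base_delay_secs * attempt)
```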

Troubleshooting

Orphaned leases on pre after multiple runs

DynamoDB TTL deletion is eventually consistent and can lag by up to 48 hours. The verifier only counts leases with ttl > now, so expired-but-undeleted items are ignored; if you still see unexpected leases, wait for TTL garbage collection or delete them manually.

High latency on pre

Expect higher p50/p95 latencies on real DynamoDB (100-600ms) compared to LocalStack (10-50ms). This is network latency to eu-central-1, not a performance issue with Sluice.

Token loss at extreme contention

At 60+ workers competing for 10 slots, ~0.5-1% of releases may fail with TransactionConflictException. This is acceptable — the reconciler handles recovery. Production workloads with realistic worker counts (2-5 per service) see 0% release failures.