Stress Testing
Methodology, invariants, and results for Sluice cross-language stress tests.
Sluice includes a cross-language stress test suite that exercises the acquire/release cycle under high contention on real DynamoDB. The suite runs identical workloads from all three SDKs (Python, TypeScript, Go) simultaneously against a shared bucket.
Running stress tests
Local sandbox (LocalStack)
```bash
task stress:seed    # Reset sandbox and seed stress buckets
task stress:run     # Run all three SDKs in parallel (default: 10 workers, 10s)
task stress:verify  # Check correctness invariants
task stress:report  # Generate cross-SDK comparison report
```
Override defaults with environment variables:
```bash
STRESS_DIMENSION=stress#concurrent STRESS_WORKERS=5 STRESS_DURATION_SECS=10 task stress:run
```
Pre environment (real DynamoDB)
```bash
task stress:seed:pre    # Seed stress buckets into pre environment
task stress:run:pre     # Run against real DynamoDB (eu-central-1)
task stress:verify:pre  # Verify invariants on pre
```
Requires AWS credentials with dynamodb:GetItem, dynamodb:PutItem, dynamodb:UpdateItem, dynamodb:DeleteItem, dynamodb:Scan on ontopix-vendor-buckets-pre.
Stress buckets
The seed tasks create these buckets:
| Dimension | Type | Capacity | Refill Rate | Purpose |
|---|---|---|---|---|
| stress#rpm | requests | 50 | 5.0/s | Time-refill rate limiting |
| stress#concurrent | concurrent | 10 | 0 | Concurrent slot limiting |
| stress#openai-rpm | requests | 500 | 8.333/s | Realistic OpenAI RPM |
| stress#openai-tpm | tokens | 200,000 | 3,333.33/s | Realistic OpenAI TPM |
| stress#elevenlabs-concurrent | concurrent | 5 | 0 | Realistic ElevenLabs concurrency |
Worker behavior
Each stress runner spawns STRESS_WORKERS workers (default 10) per SDK. Each worker loops for STRESS_DURATION_SECS:
- Acquire a slot on the target dimension
- If GRANTED: simulate a vendor call (5-15ms random sleep), then release
- If RETRY_IN: sleep for the suggested wait, then retry
- Emit an NDJSON event to stdout for every outcome
Workers stop acquiring 2 seconds before the deadline (drain buffer) to allow in-flight releases to complete.
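The loop above can be sketched in Python. The `client.acquire`/`client.release` calls and the result dictionary shape are hypothetical stand-ins for the real SDK APIs, not their actual signatures:

```python
import random
import time

def run_worker(client, dimension, deadline, drain_secs=2.0):
    """One stress worker: acquire/release until the drain cutoff.
    `client` is a hypothetical SDK handle; `deadline` is a monotonic time."""
    grants = retries = 0
    # Stop acquiring `drain_secs` before the deadline so in-flight
    # releases can complete before the runner exits.
    while time.monotonic() < deadline - drain_secs:
        result = client.acquire(dimension)
        if result["outcome"] == "granted":
            time.sleep(random.uniform(0.005, 0.015))  # simulate vendor call
            client.release(dimension, result["lease_id"])
            grants += 1
        else:  # retry_in: honor the server-suggested wait, then retry
            time.sleep(result["wait_seconds"])
            retries += 1
    return grants, retries
```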
NDJSON output format
Each line is a JSON object with:
```json
{"sdk": "python", "worker": 3, "outcome": "granted", "elapsed_ms": 45.2, "attempt": 7, "ts": 1773571234.567}
{"sdk": "go", "worker": 1, "outcome": "retry_in", "wait_seconds": 0.6, "elapsed_ms": 12.3, "attempt": 2, "ts": 1773571234.890}
{"sdk": "typescript", "worker": 5, "outcome": "release_error", "error": "TransactionConflictException: ...", "attempt": 15, "ts": 1773571235.123}
{"sdk": "python", "type": "summary", "total_grants": 42, "total_retries": 180, "total_errors": 0, "duration_secs": 10.0}
```
Outcomes: granted, retry_in, error (acquire failure), release_error (release failure).
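A report step can tally these events directly from the field names shown above. A minimal sketch (the function name is mine), skipping per-SDK summary records:

```python
import json
from collections import Counter

def tally_outcomes(ndjson_lines):
    """Count (sdk, outcome) pairs from stress-runner NDJSON output.
    Summary records carry `type: summary` and no `outcome`, so skip them."""
    counts = Counter()
    for line in ndjson_lines:
        event = json.loads(line)
        if event.get("type") == "summary":
            continue
        counts[(event["sdk"], event["outcome"])] += 1
    return counts
```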
Correctness invariants
The verifier (task stress:verify) checks these invariants after each run:
1. No orphaned leases
All leases created during the test must be released. The verifier scans for lease items with ttl > now (excludes DynamoDB-TTL-expired items awaiting garbage collection).
Applies to: All limit types.
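In sketch form, the orphan check reduces to filtering scanned lease items on their TTL. The item shape here is an assumption based on the description above:

```python
import time

def live_leases(lease_items, now=None):
    """Return leases that are still live (ttl > now). TTL-expired items
    are excluded: DynamoDB deletes them lazily, so they are expected
    debris awaiting garbage collection, not orphans."""
    now = time.time() if now is None else now
    return [item for item in lease_items if item.get("ttl", 0) > now]
```

A run passes the invariant when this list is empty at quiescence.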
2. Token conservation (concurrent only)
For concurrent limits (refill_rate=0), tokens are a finite resource. At quiescence:
tokens + active_leases = capacity
With 0 active leases, tokens must equal capacity.
Note: Under extreme contention (60+ workers on a 10-slot bucket), occasional TransactionConflictException on release can cause token loss. This is by design — the reconciler restores tokens from expired leases within 5 minutes. The stress test does not run the reconciler, so token conservation may fail at very high contention levels.
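The conservation check itself is simple arithmetic; the function name below is illustrative, not the verifier's actual API:

```python
def token_conservation_deficit(tokens, active_leases, capacity):
    """For concurrent limits (refill_rate = 0), every grant moves exactly
    one token into a lease, so tokens + active_leases should equal
    capacity at all times. A negative return value means tokens were
    lost (for example, to a release that failed all its retries)."""
    return (tokens + active_leases) - capacity
```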
3. Version consistency
Each acquire and release increments the bucket's version counter:
- Concurrent types: version = 2 * total_grants (acquire and release each increment once)
- Request/token types: version = total_grants (only acquire increments)
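As a sketch, the expected counter is a pure function of the limit type and grant count (names are illustrative):

```python
def expected_version(limit_type, total_grants):
    """Expected bucket version counter after a clean run. Concurrent
    limits bump the counter on both acquire and release; request and
    token limits bump it only on acquire."""
    if limit_type == "concurrent":
        return 2 * total_grants
    return total_grants
```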
4. No over-grant (requests/tokens only)
Total grants must not exceed the theoretical maximum:
max_allowed = capacity + duration * refill_rate
A 5% tolerance accounts for floating-point timing differences across SDKs.
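The bound with its tolerance can be expressed as a small helper (illustrative, not the verifier's actual code):

```python
def max_allowed_grants(capacity, duration_secs, refill_rate, tolerance=0.05):
    """Theoretical ceiling on grants for request/token limits: initial
    capacity plus everything refilled during the run, padded by a 5%
    tolerance for cross-SDK floating-point timing differences."""
    return (capacity + duration_secs * refill_rate) * (1.0 + tolerance)
```

For the pre-run stress#rpm numbers (capacity 50, 30s at 5.0/s), the ceiling is 210, comfortably above the 191 observed grants.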
Expected contention rates
High contention is expected and correct. With 30 workers competing for optimistic locks on a single DynamoDB item:
| Dimension | Workers | Expected Contention |
|---|---|---|
| stress#rpm | 10/SDK | 85-95% |
| stress#concurrent | 5/SDK | 35-50% |
| stress#concurrent | 10/SDK | 75-85% |
Contention means the acquire transaction was cancelled and retried with exponential backoff. This is the intended behavior of optimistic locking.
Results
Sandbox (LocalStack)
| Dimension | Workers | Duration | Invariants | Errors |
|---|---|---|---|---|
| stress#rpm | 10/SDK | 10s | ALL PASS | 0 |
| stress#concurrent | 5/SDK | 10s | ALL PASS | 0 |
Pre (real DynamoDB, eu-central-1)
| Dimension | Workers | Duration | Grants | Invariants | Errors |
|---|---|---|---|---|---|
| stress#rpm | 10/SDK | 30s | 191 | ALL PASS | 0 |
| stress#concurrent | 10/SDK | 60s | 610 | ALL PASS | 0 |
| stress#concurrent | 20/SDK | 90s | 891 | token_conservation: -5 | 5 release errors |
At 20 workers/SDK (60 total), 5 out of 891 releases (0.6%) failed with TransactionConflictException after 5 retry attempts. Token loss is temporary — the reconciler restores tokens from expired leases within its 5-minute scan interval.
Release error handling
The non-transactional release pattern (see Architecture) handles each error class differently:
| Error | Action | Recovery |
|---|---|---|
| ConditionalCheckFailedException | Skip token restore (capacity guard) | None needed — tokens already at capacity |
| TransactionConflictException | Retry up to 5 times (25ms * attempt backoff) | If all retries fail: tokens lost, reconciler restores within 5 min |
| Other errors (network, throttling) | Propagate to caller | Lease persists until TTL, reconciler restores tokens |
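The TransactionConflictException row can be sketched as a linear-backoff retry loop. The exception class and the `restore_tokens` callable are stand-ins for the SDK's internals, not real library names:

```python
import time

class TransactionConflictException(Exception):
    """Stand-in for the DynamoDB conflict error surfaced by the SDKs."""

def release_with_retry(restore_tokens, max_attempts=5, base_delay=0.025,
                       sleep=time.sleep):
    """Retry the token-restore update on transaction conflicts with a
    linear 25ms * attempt backoff. If every attempt conflicts, give up:
    the lease expires via TTL and the reconciler restores the token
    within its 5-minute scan interval."""
    for attempt in range(1, max_attempts + 1):
        try:
            return restore_tokens()
        except TransactionConflictException:
            if attempt == max_attempts:
                return None  # tokens temporarily lost; reconciler recovers
            sleep(base_delay * attempt)
```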
Troubleshooting
Orphaned leases on pre after multiple runs
DynamoDB TTL deletion is eventually consistent (it can take up to 48 hours). The verifier only counts live leases (ttl > now), so expired items are ignored; if you still see unexpected leases, wait for TTL garbage collection or delete them manually.
High latency on pre
Expect higher p50/p95 latencies on real DynamoDB (100-600ms) compared to LocalStack (10-50ms). This is network latency to eu-central-1, not a performance issue with Sluice.
Token loss at extreme contention
At 60+ workers competing for 10 slots, ~0.5-1% of releases may fail with TransactionConflictException. This is acceptable — the reconciler handles recovery. Production workloads with realistic worker counts (2-5 per service) see 0% release failures.