Operations

Stress Testing

Methodology, invariants, and results for Sluice cross-language stress tests.

Sluice includes a cross-language stress test suite that exercises the acquire/release cycle under high contention on real DynamoDB. The suite runs identical workloads from all three SDKs (Python, TypeScript, Go) simultaneously against a shared bucket.

Running stress tests

Local sandbox (LocalStack)

task stress:seed         # Reset sandbox and seed stress buckets
task stress:run          # Run all three SDKs in parallel (default: 10 workers, 10s)
task stress:verify       # Check correctness invariants
task stress:report       # Generate cross-SDK comparison report

Override defaults with environment variables:

STRESS_DIMENSION=stress#concurrent STRESS_WORKERS=5 STRESS_DURATION_SECS=10 task stress:run

Pre environment (real DynamoDB)

task stress:seed:pre     # Seed stress buckets into pre environment
task stress:run:pre      # Run against real DynamoDB (eu-central-1)
task stress:verify:pre   # Verify invariants on pre

Requires AWS credentials with dynamodb:GetItem, dynamodb:PutItem, dynamodb:UpdateItem, dynamodb:DeleteItem, dynamodb:Scan on ontopix-vendor-buckets-pre.

Stress buckets

The seed tasks create these buckets:

| Dimension | Type | Capacity | Refill Rate | Purpose |
|---|---|---|---|---|
| stress#rpm | requests | 50 | 5.0/s | Time-refill rate limiting |
| stress#concurrent | concurrent | 10 | 0 | Concurrent slot limiting |
| stress#openai-rpm | requests | 500 | 8.333/s | Realistic OpenAI RPM |
| stress#openai-tpm | tokens | 200,000 | 3,333.33/s | Realistic OpenAI TPM |
| stress#elevenlabs-concurrent | concurrent | 5 | 0 | Realistic ElevenLabs concurrency |

Worker behavior

Each stress runner spawns STRESS_WORKERS workers (default 10) per SDK. Each worker loops for STRESS_DURATION_SECS:

  1. Acquire a slot on the target dimension
  2. If GRANTED: simulate a vendor call (5-15ms random sleep), then release
  3. If RETRY_IN: sleep for the suggested wait, then retry
  4. Emit NDJSON event to stdout for every outcome

Workers stop acquiring 2 seconds before the deadline (drain buffer) to allow in-flight releases to complete.
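The loop above can be sketched in Python. The `acquire`/`release` callables and their return shape are stand-ins for the real SDK client, not the actual Sluice API:

```python
import random
import time

def run_worker(acquire, release, worker_id, deadline, drain_secs=2.0):
    """Stress-worker loop: acquire a slot, simulate a vendor call, release.

    `acquire` returns ("granted", None) or ("retry_in", wait_seconds);
    `release` frees the slot. Both are hypothetical stand-ins for the SDK.
    """
    events = []
    attempt = 0
    # Stop acquiring `drain_secs` before the deadline (the drain buffer)
    # so in-flight releases can finish before the run ends.
    while time.monotonic() < deadline - drain_secs:
        attempt += 1
        outcome, wait = acquire()
        if outcome == "granted":
            time.sleep(random.uniform(0.005, 0.015))  # simulated vendor call
            release()
            events.append({"worker": worker_id, "outcome": "granted",
                           "attempt": attempt})
        else:
            events.append({"worker": worker_id, "outcome": "retry_in",
                           "wait_seconds": wait, "attempt": attempt})
            time.sleep(wait)
    return events
```

In the real runners each event is additionally serialized as an NDJSON line to stdout (format below).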

NDJSON output format

Each line is a JSON object with:

{"sdk": "python", "worker": 3, "outcome": "granted", "elapsed_ms": 45.2, "attempt": 7, "ts": 1773571234.567}
{"sdk": "go", "worker": 1, "outcome": "retry_in", "wait_seconds": 0.6, "elapsed_ms": 12.3, "attempt": 2, "ts": 1773571234.890}
{"sdk": "typescript", "worker": 5, "outcome": "release_error", "error": "TransactionConflictException: ...", "attempt": 15, "ts": 1773571235.123}
{"sdk": "python", "type": "summary", "total_grants": 42, "total_retries": 180, "total_errors": 0, "duration_secs": 10.0}

Outcomes: granted, retry_in, error (acquire failure), release_error (release failure).
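A few lines of Python suffice to aggregate these events across SDKs; this sketch assumes only the fields shown in the examples above:

```python
import json
from collections import Counter

def tally_outcomes(ndjson_lines):
    """Count (sdk, outcome) pairs, skipping per-SDK summary lines."""
    counts = Counter()
    for line in ndjson_lines:
        event = json.loads(line)
        if event.get("type") == "summary":  # summaries carry totals, not outcomes
            continue
        counts[(event["sdk"], event["outcome"])] += 1
    return counts
```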

Correctness invariants

The verifier (task stress:verify) checks these invariants after each run:

1. No orphaned leases

All leases created during the test must be released. The verifier scans for lease items with ttl > now (excludes DynamoDB-TTL-expired items awaiting garbage collection).

Applies to: All limit types.

2. Token conservation (concurrent only)

For concurrent limits (refill_rate=0), tokens are a finite resource. At quiescence:

tokens + active_leases = capacity

With 0 active leases, tokens must equal capacity.

Note: Under extreme contention (60+ workers on a 10-slot bucket), occasional TransactionConflictException on release can cause token loss. This is by design — the reconciler restores tokens from expired leases within 5 minutes. The stress test does not run the reconciler, so token conservation may fail at very high contention levels.
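The invariant itself is a one-line check (function name is illustrative, not the verifier's actual code):

```python
def tokens_conserved(tokens, active_leases, capacity):
    """Conservation check for concurrent limits (refill_rate == 0):
    every slot is either a free token or held by an active lease."""
    return tokens + active_leases == capacity
```

At quiescence `active_leases` is 0, so any deficit in `tokens` is exactly the number of lost slots awaiting the reconciler.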

3. Version consistency

Each acquire and release increments the bucket's version counter:

  • Concurrent types: version = 2 * total_grants (acquire + release each increment once)
  • Request/token types: version = total_grants (only acquire increments)
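The two cases above reduce to a small helper (a sketch, not the verifier's actual code):

```python
def expected_version(limit_type, total_grants):
    """Bucket version expected after a clean run: concurrent limits bump
    the counter on both acquire and release; request/token limits bump
    it only on acquire."""
    return 2 * total_grants if limit_type == "concurrent" else total_grants
```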

4. No over-grant (requests/tokens only)

Total grants must not exceed the theoretical maximum:

max_allowed = capacity + duration * refill_rate

A 5% tolerance accounts for floating-point timing differences across SDKs.
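As a sketch, with the tolerance folded in: stress#rpm over a 30 s run allows at most 50 + 30 * 5.0 = 200 grants, or 210 with the 5% pad, so the 191 grants observed on pre pass comfortably.

```python
def max_allowed_grants(capacity, duration_secs, refill_rate, tolerance=0.05):
    """Grant ceiling for request/token limits, padded by a 5% tolerance
    for cross-SDK floating-point timing differences."""
    return (capacity + duration_secs * refill_rate) * (1 + tolerance)
```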

Expected contention rates

High contention is expected and correct. With 30 workers competing for optimistic locks on a single DynamoDB item:

| Dimension | Workers | Expected Contention |
|---|---|---|
| stress#rpm | 10/SDK | 85-95% |
| stress#concurrent | 5/SDK | 35-50% |
| stress#concurrent | 10/SDK | 75-85% |

Contention means the acquire transaction was cancelled and retried with exponential backoff. This is the intended behavior of optimistic locking.

Results

Sandbox (LocalStack)

| Dimension | Workers | Duration | Invariants | Errors |
|---|---|---|---|---|
| stress#rpm | 10/SDK | 10s | ALL PASS | 0 |
| stress#concurrent | 5/SDK | 10s | ALL PASS | 0 |

Pre (real DynamoDB, eu-central-1)

| Dimension | Workers | Duration | Grants | Invariants | Errors |
|---|---|---|---|---|---|
| stress#rpm | 10/SDK | 30s | 191 | ALL PASS | 0 |
| stress#concurrent | 10/SDK | 60s | 610 | ALL PASS | 0 |
| stress#concurrent | 20/SDK | 90s | 891 | token_conservation: -5 | 5 release errors |

At 20 workers/SDK (60 total), 5 out of 891 releases (0.6%) failed with TransactionConflictException after 5 retry attempts. Token loss is temporary — the reconciler restores tokens from expired leases within its 5-minute scan interval.

Release error handling

The non-transactional release pattern (see Architecture) handles each error class differently:

| Error | Action | Recovery |
|---|---|---|
| ConditionalCheckFailedException | Skip token restore (capacity guard) | None needed; tokens already at capacity |
| TransactionConflictException | Retry up to 5 times (25ms * attempt backoff) | If all retries fail: tokens lost, reconciler restores within 5 min |
| Other errors (network, throttling) | Propagate to caller | Lease persists until TTL, reconciler restores tokens |
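The TransactionConflictException row can be sketched as a retry helper. The names here are illustrative; the real SDKs implement this internally:

```python
import time

class ConflictError(Exception):
    """Stand-in for DynamoDB's TransactionConflictException."""

def release_with_retry(release_once, max_attempts=5, base_delay_secs=0.025):
    """Retry a conflicting release with linear 25 ms * attempt backoff.

    Returns True on success; False when every attempt conflicts, in which
    case the tokens stay lost until the reconciler restores them.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            release_once()
            return True
        except ConflictError:
            if attempt == max_attempts:
                return False
            time.sleep(base_delay_secs * attempt)
```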

Troubleshooting

Orphaned leases on pre after multiple runs

DynamoDB TTL deletion is eventually consistent and can lag by up to 48 hours. The verifier only counts leases with ttl > now, so expired-but-undeleted items are ignored; if you still see unexpected leases, wait for TTL garbage collection or delete them manually.

High latency on pre

Expect higher p50/p95 latencies on real DynamoDB (100-600ms) compared to LocalStack (10-50ms). This is network latency to eu-central-1, not a performance issue with Sluice.

Token loss at extreme contention

At 60+ workers competing for 10 slots, ~0.5-1% of releases may fail with TransactionConflictException. This is acceptable — the reconciler handles recovery. Production workloads with realistic worker counts (2-5 per service) see 0% release failures.