Monitoring Guide
Structured log events, CloudWatch alerts, and dashboard setup for Sluice observability.
Structured log events
All Sluice SDKs and the reconciler emit structured log events. These are the canonical events to monitor and alert on.
| Event | Component | Fields | Meaning |
|---|---|---|---|
| acquire.granted | SDK | dimensions, lease_id | Slot acquired successfully |
| acquire.retry_in | SDK | dimensions, wait_seconds | Insufficient tokens; caller should wait |
| acquire.contention_retry | SDK | attempt, backoff_ms, dimensions | Version conflict on TransactWriteItems; retrying with backoff |
| slot.timeout | SDK | dimensions, timeout | Slot deadline exceeded before acquisition |
| release.ok | SDK | dimensions, lease_id | Slot released (lease record deleted) |
| penalize.applied | SDK | dimension, factor | Tokens reduced after a vendor 429 |
| reconciler.scan_complete | Reconciler | total_leases, restored, deleted | Reconciler scan finished |
Logger names by SDK
| SDK | Logger |
|---|---|
| Python | logging.getLogger("sluice") |
| TypeScript | debug("sluice:acquire"), debug("sluice:slot") |
| Go | slog.With("component", "sluice") |
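The Insights queries below filter on an `event` field, which works best when the SDK's log lines reach CloudWatch as JSON. A minimal sketch for the Python SDK, assuming a JSON-lines shape with an `event` key plus the fields from the table above (the `dimensions` value here is a placeholder):

```python
import json
import logging
import sys

def format_event(event: str, **fields) -> str:
    """Render a Sluice log event as a single JSON line for CloudWatch Logs."""
    return json.dumps({"event": event, **fields})

# Attach a stream handler to the SDK's logger; in Lambda, stdout is
# shipped to CloudWatch Logs automatically.
logger = logging.getLogger("sluice")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

logger.info(format_event("acquire.granted",
                         dimensions="vendor:resource:requests",
                         lease_id="lease-123"))
```

With JSON lines in place, `filter event = "..."` in the queries below resolves against the parsed field rather than raw-text matching.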
CloudWatch Insights queries
These queries assume logs from Lambda functions using the Sluice SDK are shipped to CloudWatch Logs.
High contention rate
Find dimensions experiencing frequent version conflicts:
```
fields @timestamp, dimensions, attempt, backoff_ms
| filter event = "acquire.contention_retry"
| stats count() as retries by dimensions, bin(5m) as period
| sort retries desc
| limit 20
```
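These queries can also run programmatically via the CloudWatch Logs Insights API, which is handy for ad-hoc scripts or scheduled checks. A sketch using boto3, where the log group name passed in is a placeholder for your deployment's log group:

```python
import time

# The contention query from above, as a reusable constant.
CONTENTION_QUERY = """fields @timestamp, dimensions, attempt, backoff_ms
| filter event = "acquire.contention_retry"
| stats count() as retries by dimensions, bin(5m) as period
| sort retries desc
| limit 20"""

def run_insights_query(log_group: str, query: str, lookback_s: int = 3600):
    """Start an Insights query and poll until it reaches a terminal state."""
    import boto3  # imported here so the module loads without the AWS SDK
    logs = boto3.client("logs")
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - lookback_s,
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp["results"]
        time.sleep(1)
```

The same runner works for every query in this section; only the query string changes.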
If a dimension consistently shows high contention, consider whether the capacity is too low for the number of concurrent consumers.
Capacity exhaustion
Find dimensions where callers are frequently told to wait:
```
fields @timestamp, dimensions, wait_seconds
| filter event = "acquire.retry_in"
| stats count() as throttled, avg(wait_seconds) as avg_wait by dimensions, bin(5m) as period
| sort throttled desc
| limit 20
```
Sustained high throttled counts mean demand exceeds the vendor's rate limit. Options:
- Negotiate higher limits with the vendor
- Implement request prioritization in the calling service
- Spread load across multiple vendor accounts
Reconciler failures
Check for reconciler Lambda errors:
```
fields @timestamp, @message
| filter @logGroup like /sluice-reconciler/
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 50
```
Penalize frequency
Track how often vendor 429s are occurring:
```
fields @timestamp, dimension, factor
| filter event = "penalize.applied"
| stats count() as penalties by dimension, bin(15m) as period
| sort penalties desc
```
Frequent penalties on the same dimension indicate the configured capacity exceeds the vendor's actual limit. Consider reducing the bucket capacity in .infra/vendors.tf.
Slot timeouts
Find callers that are timing out waiting for slots:
```
fields @timestamp, dimensions, timeout
| filter event = "slot.timeout"
| stats count() as timeouts by dimensions, bin(5m) as period
| sort timeouts desc
```
Timeouts mean demand is so high that callers cannot acquire a slot within their deadline. This is more severe than retry_in: work is being dropped, not merely delayed.
What to alert on
Critical
| Condition | Query basis | Threshold | Action |
|---|---|---|---|
| Reconciler Lambda errors | Reconciler failures query | Any error in 15 minutes | Check Lambda logs, IAM permissions, DynamoDB connectivity |
| Sustained slot timeouts | Slot timeouts query | > 10 timeouts in 5 minutes per dimension | Capacity planning needed; work is being dropped |
Warning
| Condition | Query basis | Threshold | Action |
|---|---|---|---|
| High retry_in rate | Capacity exhaustion query | > 50% of acquisitions return retry_in over 15 minutes | Vendor limit may need increase |
| Frequent penalize calls | Penalize frequency query | > 5 penalties per dimension in 15 minutes | Bucket capacity likely exceeds vendor's real limit |
| High contention retries | High contention query | > 20 contention retries per dimension in 5 minutes | Too many concurrent consumers for this dimension |
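The critical slot-timeout threshold above can be wired up with a CloudWatch Logs metric filter plus an alarm. A hedged sketch using boto3: the filter name, namespace, metric name, and SNS topic are placeholders, the filter pattern assumes the SDK emits JSON log lines, and this simple filter counts timeouts across all dimensions rather than per dimension as the table specifies:

```python
def slot_timeout_alarm_params(sns_topic_arn: str) -> dict:
    """Alarm settings matching the '> 10 timeouts in 5 minutes' threshold."""
    return dict(
        AlarmName="sluice-slot-timeouts",
        Namespace="Sluice",
        MetricName="SlotTimeouts",
        Statistic="Sum",
        Period=300,               # 5-minute evaluation window
        EvaluationPeriods=1,
        Threshold=10,             # fire above 10 timeouts per window
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )

def install_alarm(log_group: str, sns_topic_arn: str) -> None:
    import boto3  # imported here so the module loads without the AWS SDK
    # Count slot.timeout events as a custom metric...
    boto3.client("logs").put_metric_filter(
        logGroupName=log_group,
        filterName="sluice-slot-timeout",
        filterPattern='{ $.event = "slot.timeout" }',
        metricTransformations=[{
            "metricName": "SlotTimeouts",
            "metricNamespace": "Sluice",
            "metricValue": "1",
            "defaultValue": 0.0,
        }],
    )
    # ...then alarm on its 5-minute sum.
    boto3.client("cloudwatch").put_metric_alarm(
        **slot_timeout_alarm_params(sns_topic_arn))
```

Per-dimension alerting would need one filter per dimension, or the SDK-emitted metrics described below.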
Dashboard suggestions (v0.2.0)
These are not yet implemented; they are planned for when Sluice adds a CloudWatch metrics layer.
Recommended widgets
- Acquisition success rate -- `acquire.granted / (acquire.granted + acquire.retry_in + slot.timeout)` per dimension, as a time series
- Average wait time -- `avg(wait_seconds)` from `acquire.retry_in` events, per dimension
- Contention heatmap -- `acquire.contention_retry` count by dimension and time bucket
- Reconciler health -- `restored` and `already_capped` counts per invocation
- Penalize events -- timeline of `penalize.applied` events overlaid with `acquire.retry_in` rate
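The success-rate widget's formula is a straight ratio over the three acquisition outcomes. As a sketch (the function name is illustrative, not part of any SDK):

```python
def acquisition_success_rate(granted: int, retry_in: int, timeouts: int) -> float:
    """Fraction of acquisition attempts that were granted a slot.

    Counts come from acquire.granted, acquire.retry_in, and slot.timeout
    events over the same time bucket and dimension.
    """
    attempts = granted + retry_in + timeouts
    # No attempts in the window means nothing failed; report 100%.
    return granted / attempts if attempts else 1.0
```

For example, 90 grants, 8 retries, and 2 timeouts in a bucket yield a rate of 0.9.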
Metric candidates (requires SDK instrumentation)
| Metric | Unit | Dimensions |
|---|---|---|
| sluice.acquire.granted | Count | vendor_dimension |
| sluice.acquire.retry_in | Count | vendor_dimension |
| sluice.acquire.wait_seconds | Seconds | vendor_dimension |
| sluice.acquire.contention_retries | Count | vendor_dimension |
| sluice.slot.timeout | Count | vendor_dimension |
| sluice.penalize.applied | Count | vendor_dimension |
| sluice.reconciler.restored | Count | -- |
| sluice.reconciler.already_capped | Count | -- |
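If the SDK grows this instrumentation, the table maps naturally onto CloudWatch custom metrics. A hedged sketch of what publishing could look like with boto3, where the `Sluice` namespace and the helper names are assumptions:

```python
# Units per the metric-candidates table; anything unlisted defaults to Count.
UNITS = {"sluice.acquire.wait_seconds": "Seconds"}

def metric_datum(metric: str, value: float, vendor_dimension: str = "") -> dict:
    """Build one put_metric_data datum from a metric name in the table.

    Reconciler metrics carry no dimension, so vendor_dimension is optional.
    """
    datum = {
        "MetricName": metric,
        "Value": value,
        "Unit": UNITS.get(metric, "Count"),
    }
    if vendor_dimension:
        datum["Dimensions"] = [
            {"Name": "vendor_dimension", "Value": vendor_dimension}
        ]
    return datum

def publish(data: list) -> None:
    import boto3  # imported here so the module loads without the AWS SDK
    boto3.client("cloudwatch").put_metric_data(Namespace="Sluice", MetricData=data)
```

For example, `publish([metric_datum("sluice.acquire.granted", 1, "vendor:resource")])` would record one granted acquisition against that dimension.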