Operations

Monitoring Guide

Structured log events, CloudWatch alerts, and dashboard setup for Sluice observability.

Structured log events

All Sluice SDKs and the reconciler emit structured log events. These are the canonical events to monitor and alert on.

| Event | Component | Fields | Meaning |
| --- | --- | --- | --- |
| `acquire.granted` | SDK | `dimensions`, `lease_id` | Slot acquired successfully |
| `acquire.retry_in` | SDK | `dimensions`, `wait_seconds` | Insufficient tokens; caller should wait |
| `acquire.contention_retry` | SDK | `attempt`, `backoff_ms`, `dimensions` | Version conflict on `TransactWriteItems`; retrying with backoff |
| `slot.timeout` | SDK | `dimensions`, `timeout` | Slot deadline exceeded before acquisition |
| `release.ok` | SDK | `dimensions`, `lease_id` | Slot released (lease record deleted) |
| `penalize.applied` | SDK | `dimension`, `factor` | Tokens reduced after vendor 429 |
| `reconciler.scan_complete` | Reconciler | `total_leases`, `restored`, `deleted` | Reconciler scan finished |
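As a concrete illustration, an `acquire.retry_in` event might arrive in your log pipeline as a JSON line like the one below. The field names come from the table above; the exact serialization and the `openai:gpt-4` dimension value are assumptions for the sketch, not guaranteed SDK output.

```python
import json

# Hypothetical serialized form of an acquire.retry_in event. The real
# SDKs control their own layout; only the field names are taken from
# the event table above.
line = '{"event": "acquire.retry_in", "dimensions": "openai:gpt-4", "wait_seconds": 2.5}'

event = json.loads(line)
print(event["event"])         # → acquire.retry_in
print(event["wait_seconds"])  # → 2.5 (how long the caller should wait)
```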

Logger names by SDK

| SDK | Logger |
| --- | --- |
| Python | `logging.getLogger("sluice")` |
| TypeScript | `debug("sluice:acquire")`, `debug("sluice:slot")` |
| Go | `slog.With("component", "sluice")` |
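The Python SDK logs under `logging.getLogger("sluice")`; how those records get serialized is up to the host application. A minimal sketch that renders them as one JSON object per line, so the Insights queries below can filter on the `event` field (the `JsonFormatter` and its field handling are an assumption of this sketch, not the SDK's own formatter):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render sluice log records as one JSON object per line."""
    # Structured fields are assumed to arrive via logging's `extra` kwarg.
    FIELDS = ("dimensions", "dimension", "lease_id", "wait_seconds",
              "attempt", "backoff_ms", "timeout", "factor")

    def format(self, record):
        payload = {"event": record.getMessage()}
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("sluice")  # same name the Python SDK uses
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"event": "acquire.granted", "dimensions": "vendor:x", "lease_id": "abc"}
logger.info("acquire.granted", extra={"dimensions": "vendor:x", "lease_id": "abc"})
```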

CloudWatch Insights queries

These queries assume logs from Lambda functions using the Sluice SDK are shipped to CloudWatch Logs.

High contention rate

Find dimensions experiencing frequent version conflicts:

fields @timestamp, dimensions, attempt, backoff_ms
| filter event = "acquire.contention_retry"
| stats count() as retries by dimensions, bin(5m) as period
| sort retries desc
| limit 20
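The same aggregation can be reproduced locally over parsed log events, e.g. when triaging an exported log file rather than querying CloudWatch directly. A sketch, with event shapes assumed from the table above:

```python
from collections import Counter

def contention_by_dimension(events, bin_seconds=300):
    """Count acquire.contention_retry events per (dimensions, 5-minute bin),
    mirroring the Insights query above. `events` are dicts parsed from
    JSON log lines with an epoch-seconds `timestamp` field (assumed shape)."""
    counts = Counter()
    for e in events:
        if e.get("event") != "acquire.contention_retry":
            continue
        period = int(e["timestamp"]) // bin_seconds * bin_seconds
        counts[(e["dimensions"], period)] += 1
    return counts.most_common()  # sorted like `sort retries desc`

events = [
    {"event": "acquire.contention_retry", "dimensions": "vendor:a", "timestamp": 10},
    {"event": "acquire.contention_retry", "dimensions": "vendor:a", "timestamp": 20},
    {"event": "acquire.granted", "dimensions": "vendor:a", "timestamp": 30},
]
print(contention_by_dimension(events))  # → [(('vendor:a', 0), 2)]
```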

If a dimension consistently shows high contention, consider whether the capacity is too low for the number of concurrent consumers.

Capacity exhaustion

Find dimensions where callers are frequently told to wait:

fields @timestamp, dimensions, wait_seconds
| filter event = "acquire.retry_in"
| stats count() as throttled, avg(wait_seconds) as avg_wait by dimensions, bin(5m) as period
| sort throttled desc
| limit 20

Sustained high throttled counts mean demand exceeds the vendor's rate limit. Options:

  • Negotiate higher limits with the vendor
  • Implement request prioritization in the calling service
  • Spread load across multiple vendor accounts
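A useful companion number when deciding between these options is the throttled fraction per dimension, which is also what the "> 50% of acquisitions return retry_in" warning threshold below measures. A sketch over parsed log events (event shapes assumed):

```python
def throttle_ratio(events, dimensions):
    """Fraction of acquisition outcomes for one dimension that were
    acquire.retry_in. A sustained ratio above 0.5 matches the warning
    threshold; near 1.0 means demand far exceeds configured capacity."""
    outcomes = [e for e in events
                if e.get("dimensions") == dimensions
                and e.get("event") in ("acquire.granted", "acquire.retry_in")]
    if not outcomes:
        return 0.0
    throttled = sum(1 for e in outcomes if e["event"] == "acquire.retry_in")
    return throttled / len(outcomes)

events = [
    {"event": "acquire.granted", "dimensions": "vendor:a"},
    {"event": "acquire.retry_in", "dimensions": "vendor:a", "wait_seconds": 1.0},
    {"event": "acquire.retry_in", "dimensions": "vendor:a", "wait_seconds": 2.0},
    {"event": "acquire.granted", "dimensions": "vendor:b"},
]
print(throttle_ratio(events, "vendor:a"))  # 2 of 3 outcomes throttled ≈ 0.67
```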

Reconciler failures

Check for reconciler Lambda errors:

fields @timestamp, @message
| filter @logGroup like /sluice-reconciler/
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 50

Penalize frequency

Track how often vendor 429s are occurring:

fields @timestamp, dimension, factor
| filter event = "penalize.applied"
| stats count() as penalties by dimension, bin(15m) as period
| sort penalties desc

Frequent penalties on the same dimension indicate the configured capacity exceeds the vendor's actual limit. Consider reducing the bucket capacity in .infra/vendors.tf.
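To see why repeated penalties signal misconfiguration rather than a transient blip: assuming the penalty multiplies available tokens by `factor` (an interpretation of the event table above, not a confirmed SDK detail), repeated penalties shrink effective capacity geometrically, so a bucket that keeps getting penalized was never going to settle at its configured size.

```python
def effective_capacity(capacity, factors):
    """Capacity after a sequence of penalize.applied events, assuming
    each event multiplies remaining tokens by its factor (assumption)."""
    for factor in factors:
        capacity *= factor
    return capacity

# Three penalties at factor 0.5 leave an eighth of the configured capacity.
print(effective_capacity(100, [0.5, 0.5, 0.5]))  # → 12.5
```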

Slot timeouts

Find callers that are timing out waiting for slots:

fields @timestamp, dimensions, timeout
| filter event = "slot.timeout"
| stats count() as timeouts by dimensions, bin(5m) as period
| sort timeouts desc

Timeouts mean demand is so high that callers cannot acquire a slot within their deadline. This is more severe than `acquire.retry_in` -- work is being dropped, not delayed.

What to alert on

Critical

| Condition | Query basis | Threshold | Action |
| --- | --- | --- | --- |
| Reconciler Lambda errors | Reconciler failures query | Any error in 15 minutes | Check Lambda logs, IAM permissions, DynamoDB connectivity |
| Sustained slot timeouts | Slot timeouts query | > 10 timeouts in 5 minutes per dimension | Capacity planning needed; work is being dropped |

Warning

| Condition | Query basis | Threshold | Action |
| --- | --- | --- | --- |
| High `retry_in` rate | Capacity exhaustion query | > 50% of acquisitions return `retry_in` over 15 minutes | Vendor limit may need increase |
| Frequent penalize calls | Penalize frequency query | > 5 penalties per dimension in 15 minutes | Bucket capacity likely exceeds vendor's real limit |
| High contention retries | High contention query | > 20 contention retries per dimension in 5 minutes | Too many concurrent consumers for this dimension |
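These warning thresholds are simple enough to encode in whatever runs the queries on a schedule. A sketch of the evaluation step (the threshold values come from the table; the `dimension_stats` shape and everything else is an assumption):

```python
def warning_alerts(dimension_stats):
    """dimension_stats: {dimension: counts gathered from the queries above}.
    Returns the warning conditions from the table that fired."""
    alerts = []
    for dim, s in dimension_stats.items():
        granted = s.get("granted_15m", 0)
        retry_in = s.get("retry_in_15m", 0)
        if granted + retry_in and retry_in / (granted + retry_in) > 0.5:
            alerts.append((dim, "high retry_in rate"))
        if s.get("penalties_15m", 0) > 5:
            alerts.append((dim, "frequent penalize calls"))
        if s.get("contention_retries_5m", 0) > 20:
            alerts.append((dim, "high contention retries"))
    return alerts

stats = {"vendor:a": {"granted_15m": 10, "retry_in_15m": 30,
                      "penalties_15m": 0, "contention_retries_5m": 25}}
print(warning_alerts(stats))
# → [('vendor:a', 'high retry_in rate'), ('vendor:a', 'high contention retries')]
```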

Dashboard suggestions (v0.2.0)

These are not yet implemented; they are planned for when Sluice adds a CloudWatch metrics layer.

  1. Acquisition success rate -- acquire.granted / (acquire.granted + acquire.retry_in + slot.timeout) per dimension, as a time series
  2. Average wait time -- avg(wait_seconds) from acquire.retry_in events, per dimension
  3. Contention heatmap -- acquire.contention_retry count by dimension and time bucket
  4. Reconciler health -- restored and already_capped counts per invocation
  5. Penalize events -- timeline of penalize.applied events overlaid with acquire.retry_in rate
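The success-rate formula from dashboard 1, written out as a small helper (counts are per dimension per time bucket):

```python
def acquisition_success_rate(granted, retry_in, timeouts):
    """acquire.granted / (acquire.granted + acquire.retry_in + slot.timeout),
    as defined for dashboard 1. Returns None when there were no attempts."""
    attempts = granted + retry_in + timeouts
    return granted / attempts if attempts else None

print(acquisition_success_rate(granted=90, retry_in=8, timeouts=2))  # → 0.9
```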

Metric candidates (requires SDK instrumentation)

| Metric | Unit | Dimensions |
| --- | --- | --- |
| `sluice.acquire.granted` | Count | `vendor_dimension` |
| `sluice.acquire.retry_in` | Count | `vendor_dimension` |
| `sluice.acquire.wait_seconds` | Seconds | `vendor_dimension` |
| `sluice.acquire.contention_retries` | Count | `vendor_dimension` |
| `sluice.slot.timeout` | Count | `vendor_dimension` |
| `sluice.penalize.applied` | Count | `vendor_dimension` |
| `sluice.reconciler.restored` | Count | -- |
| `sluice.reconciler.already_capped` | Count | -- |
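Until the SDKs emit these natively, one low-friction path is CloudWatch Embedded Metric Format, where a specially shaped JSON log line printed from a Lambda becomes a metric automatically. A hedged sketch for the candidates above (the `Sluice` namespace and the wiring are assumptions, not part of the current SDKs):

```python
import json
import time

def emf_record(metric, value, vendor_dimension, unit="Count"):
    """Build a CloudWatch Embedded Metric Format log line. Printed to
    stdout from a Lambda, CloudWatch extracts it as a metric in the
    (assumed) Sluice namespace, dimensioned by vendor_dimension."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Sluice",  # assumed namespace
                "Dimensions": [["vendor_dimension"]],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        "vendor_dimension": vendor_dimension,
        metric: value,
    })

print(emf_record("sluice.acquire.granted", 1, "vendor:a"))
```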