Operations

Monitoring Guide

Structured log events, CloudWatch alerts, and dashboard setup for Sluice observability.

Structured log events

All Sluice SDKs and the reconciler emit structured log events. These are the canonical events to monitor and alert on.

| Event | Component | Fields | Meaning |
| --- | --- | --- | --- |
| `acquire.granted` | SDK | `dimensions`, `lease_id` | Slot acquired successfully |
| `acquire.retry_in` | SDK | `dimensions`, `wait_seconds` | Insufficient tokens; caller should wait |
| `acquire.contention_retry` | SDK | `attempt`, `backoff_ms`, `dimensions` | Version conflict on `TransactWriteItems`; retrying with backoff |
| `slot.timeout` | SDK | `dimensions`, `timeout` | Slot deadline exceeded before acquisition |
| `release.ok` | SDK | `dimensions`, `lease_id` | Slot released (lease record deleted) |
| `penalize.applied` | SDK | `dimension`, `factor` | Tokens reduced after vendor 429 |
| `reconciler.scan_complete` | Reconciler | `total_leases`, `restored`, `deleted` | Reconciler scan finished |
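As a concrete illustration, an `acquire.retry_in` event might arrive in your log pipeline as a JSON line like the one below. The field names come from the table above; the exact serialization and the `openai:gpt-4` dimension value are assumptions for the sketch, not guaranteed SDK output.

```python
import json

# Hypothetical serialized form of an acquire.retry_in event. The real
# SDKs control their own layout; only the field names are taken from
# the event table above.
line = '{"event": "acquire.retry_in", "dimensions": "openai:gpt-4", "wait_seconds": 2.5}'

event = json.loads(line)
print(event["event"])         # → acquire.retry_in
print(event["wait_seconds"])  # → 2.5 (how long the caller should wait)
```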

Logger names by SDK

| SDK | Logger |
| --- | --- |
| Python | `logging.getLogger("sluice")` |
| TypeScript | `debug("sluice:acquire")`, `debug("sluice:slot")` |
| Go | `slog.With("component", "sluice")` |
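The Python SDK logs under `logging.getLogger("sluice")`; how those records get serialized is up to the host application. A minimal sketch that renders them as one JSON object per line, so the Insights queries below can filter on the `event` field (the `JsonFormatter` and its field handling are an assumption of this sketch, not the SDK's own formatter):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render sluice log records as one JSON object per line."""
    # Structured fields are assumed to arrive via logging's `extra` kwarg.
    FIELDS = ("dimensions", "dimension", "lease_id", "wait_seconds",
              "attempt", "backoff_ms", "timeout", "factor")

    def format(self, record):
        payload = {"event": record.getMessage()}
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("sluice")  # same name the Python SDK uses
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"event": "acquire.granted", "dimensions": "vendor:x", "lease_id": "abc"}
logger.info("acquire.granted", extra={"dimensions": "vendor:x", "lease_id": "abc"})
```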

CloudWatch Insights queries

These queries assume logs from Lambda functions using the Sluice SDK are shipped to CloudWatch Logs.

High contention rate

Find dimensions experiencing frequent version conflicts:

fields @timestamp, dimensions, attempt, backoff_ms
| filter event = "acquire.contention_retry"
| stats count() as retries by dimensions, bin(5m) as period
| sort retries desc
| limit 20
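The same aggregation can be reproduced locally over parsed log events, e.g. when triaging an exported log file rather than querying CloudWatch directly. A sketch, with event shapes assumed from the table above:

```python
from collections import Counter

def contention_by_dimension(events, bin_seconds=300):
    """Count acquire.contention_retry events per (dimensions, 5-minute bin),
    mirroring the Insights query above. `events` are dicts parsed from
    JSON log lines with an epoch-seconds `timestamp` field (assumed shape)."""
    counts = Counter()
    for e in events:
        if e.get("event") != "acquire.contention_retry":
            continue
        period = int(e["timestamp"]) // bin_seconds * bin_seconds
        counts[(e["dimensions"], period)] += 1
    return counts.most_common()  # sorted like `sort retries desc`

events = [
    {"event": "acquire.contention_retry", "dimensions": "vendor:a", "timestamp": 10},
    {"event": "acquire.contention_retry", "dimensions": "vendor:a", "timestamp": 20},
    {"event": "acquire.granted", "dimensions": "vendor:a", "timestamp": 30},
]
print(contention_by_dimension(events))  # → [(('vendor:a', 0), 2)]
```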

If a dimension consistently shows high contention, consider whether the capacity is too low for the number of concurrent consumers.

Capacity exhaustion

Find dimensions where callers are frequently told to wait:

fields @timestamp, dimensions, wait_seconds
| filter event = "acquire.retry_in"
| stats count() as throttled, avg(wait_seconds) as avg_wait by dimensions, bin(5m) as period
| sort throttled desc
| limit 20

Sustained high throttled counts mean demand exceeds the vendor's rate limit. Options:

  • Negotiate higher limits with the vendor
  • Implement request prioritization in the calling service
  • Spread load across multiple vendor accounts
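A useful companion number when deciding between these options is the throttled fraction per dimension, which is also what the "> 50% of acquisitions return retry_in" warning threshold below measures. A sketch over parsed log events (event shapes assumed):

```python
def throttle_ratio(events, dimensions):
    """Fraction of acquisition outcomes for one dimension that were
    acquire.retry_in. A sustained ratio above 0.5 matches the warning
    threshold; near 1.0 means demand far exceeds configured capacity."""
    outcomes = [e for e in events
                if e.get("dimensions") == dimensions
                and e.get("event") in ("acquire.granted", "acquire.retry_in")]
    if not outcomes:
        return 0.0
    throttled = sum(1 for e in outcomes if e["event"] == "acquire.retry_in")
    return throttled / len(outcomes)

events = [
    {"event": "acquire.granted", "dimensions": "vendor:a"},
    {"event": "acquire.retry_in", "dimensions": "vendor:a", "wait_seconds": 1.0},
    {"event": "acquire.retry_in", "dimensions": "vendor:a", "wait_seconds": 2.0},
    {"event": "acquire.granted", "dimensions": "vendor:b"},
]
print(throttle_ratio(events, "vendor:a"))  # 2 of 3 outcomes throttled ≈ 0.67
```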

Reconciler failures

Check for reconciler Lambda errors:

fields @timestamp, @message
| filter @logGroup like /sluice-reconciler/
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp desc
| limit 50

Penalize frequency

Track how often vendor 429s are occurring:

fields @timestamp, dimension, factor
| filter event = "penalize.applied"
| stats count() as penalties by dimension, bin(15m) as period
| sort penalties desc

Frequent penalties on the same dimension indicate the configured capacity exceeds the vendor's actual limit. Consider reducing the bucket capacity in .infra/vendors.tf.
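To see why repeated penalties signal misconfiguration rather than a transient blip: assuming the penalty multiplies available tokens by `factor` (an interpretation of the event table above, not a confirmed SDK detail), repeated penalties shrink effective capacity geometrically, so a bucket that keeps getting penalized was never going to settle at its configured size.

```python
def effective_capacity(capacity, factors):
    """Capacity after a sequence of penalize.applied events, assuming
    each event multiplies remaining tokens by its factor (assumption)."""
    for factor in factors:
        capacity *= factor
    return capacity

# Three penalties at factor 0.5 leave an eighth of the configured capacity.
print(effective_capacity(100, [0.5, 0.5, 0.5]))  # → 12.5
```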

Slot timeouts

Find callers that are timing out waiting for slots:

fields @timestamp, dimensions, timeout
| filter event = "slot.timeout"
| stats count() as timeouts by dimensions, bin(5m) as period
| sort timeouts desc

Timeouts mean demand is so high that callers cannot acquire a slot within their deadline. This is more severe than `acquire.retry_in` -- work is being dropped, not delayed.

What to alert on

Critical

| Condition | Query basis | Threshold | Action |
| --- | --- | --- | --- |
| Reconciler Lambda errors | Reconciler failures query | Any error in 15 minutes | Check Lambda logs, IAM permissions, DynamoDB connectivity |
| Sustained slot timeouts | Slot timeouts query | > 10 timeouts in 5 minutes per dimension | Capacity planning needed; work is being dropped |

Warning

| Condition | Query basis | Threshold | Action |
| --- | --- | --- | --- |
| High `retry_in` rate | Capacity exhaustion query | > 50% of acquisitions return `retry_in` over 15 minutes | Vendor limit may need increase |
| Frequent penalize calls | Penalize frequency query | > 5 penalties per dimension in 15 minutes | Bucket capacity likely exceeds vendor's real limit |
| High contention retries | High contention query | > 20 contention retries per dimension in 5 minutes | Too many concurrent consumers for this dimension |
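These warning thresholds are simple enough to encode in whatever runs the queries on a schedule. A sketch of the evaluation step (the threshold values come from the table; the `dimension_stats` shape and everything else is an assumption):

```python
def warning_alerts(dimension_stats):
    """dimension_stats: {dimension: counts gathered from the queries above}.
    Returns the warning conditions from the table that fired."""
    alerts = []
    for dim, s in dimension_stats.items():
        granted = s.get("granted_15m", 0)
        retry_in = s.get("retry_in_15m", 0)
        if granted + retry_in and retry_in / (granted + retry_in) > 0.5:
            alerts.append((dim, "high retry_in rate"))
        if s.get("penalties_15m", 0) > 5:
            alerts.append((dim, "frequent penalize calls"))
        if s.get("contention_retries_5m", 0) > 20:
            alerts.append((dim, "high contention retries"))
    return alerts

stats = {"vendor:a": {"granted_15m": 10, "retry_in_15m": 30,
                      "penalties_15m": 0, "contention_retries_5m": 25}}
print(warning_alerts(stats))
# → [('vendor:a', 'high retry_in rate'), ('vendor:a', 'high contention retries')]
```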

Dashboard suggestions (v0.2.0)

These are not yet implemented; they are planned for when Sluice adds a CloudWatch metrics layer.

  1. Acquisition success rate -- acquire.granted / (acquire.granted + acquire.retry_in + slot.timeout) per dimension, as a time series
  2. Average wait time -- avg(wait_seconds) from acquire.retry_in events, per dimension
  3. Contention heatmap -- acquire.contention_retry count by dimension and time bucket
  4. Reconciler health -- restored and already_capped counts per invocation
  5. Penalize events -- timeline of penalize.applied events overlaid with acquire.retry_in rate
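The success-rate formula from dashboard 1, written out as a small helper (counts are per dimension per time bucket):

```python
def acquisition_success_rate(granted, retry_in, timeouts):
    """acquire.granted / (acquire.granted + acquire.retry_in + slot.timeout),
    as defined for dashboard 1. Returns None when there were no attempts."""
    attempts = granted + retry_in + timeouts
    return granted / attempts if attempts else None

print(acquisition_success_rate(granted=90, retry_in=8, timeouts=2))  # → 0.9
```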

Metric candidates (requires SDK instrumentation)

| Metric | Unit | Dimensions |
| --- | --- | --- |
| `sluice.acquire.granted` | Count | `vendor_dimension` |
| `sluice.acquire.retry_in` | Count | `vendor_dimension` |
| `sluice.acquire.wait_seconds` | Seconds | `vendor_dimension` |
| `sluice.acquire.contention_retries` | Count | `vendor_dimension` |
| `sluice.slot.timeout` | Count | `vendor_dimension` |
| `sluice.penalize.applied` | Count | `vendor_dimension` |
| `sluice.reconciler.restored` | Count | -- |
| `sluice.reconciler.already_capped` | Count | -- |
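Until the SDKs emit these natively, one low-friction path is CloudWatch Embedded Metric Format, where a specially shaped JSON log line printed from a Lambda becomes a metric automatically. A hedged sketch for the candidates above (the `Sluice` namespace and the wiring are assumptions, not part of the current SDKs):

```python
import json
import time

def emf_record(metric, value, vendor_dimension, unit="Count"):
    """Build a CloudWatch Embedded Metric Format log line. Printed to
    stdout from a Lambda, CloudWatch extracts it as a metric in the
    (assumed) Sluice namespace, dimensioned by vendor_dimension."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Sluice",  # assumed namespace
                "Dimensions": [["vendor_dimension"]],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        "vendor_dimension": vendor_dimension,
        metric: value,
    })

print(emf_record("sluice.acquire.granted", 1, "vendor:a"))
```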