ADR-0011: Lambda Deploy Strategy

Adopt direct zip upload as the default Lambda code deployment method from CI/CD.

Status: Accepted

Date: 2026-03-14

Deciders: Engineering Leadership


Context

Ontopix runs multiple services on AWS Lambda (audit-service with 7 Lambdas, maxcolchon with 3). As we add CI/CD pipelines, we need a standard method for deploying Lambda code from GitHub Actions.

Two approaches exist:

  1. Direct zip upload — GitHub Actions builds the zip, then calls aws lambda update-function-code --zip-file fileb://dist/lambda.zip to push it directly to the Lambda function.
  2. S3 artifact upload — GitHub Actions builds the zip, uploads it to an S3 bucket with a versioned key (lambdas/{component}/{version}-{sha}.zip), then calls aws lambda update-function-code --s3-bucket ... --s3-key ....

The maxcolchon project implemented the S3 approach, introducing a dedicated {project}-{env}-lambda-artifacts bucket with versioning, lifecycle policies (90 days, 30 noncurrent versions), and a dual-mode Terraform configuration (S3 for CI/CD, local file for bootstrap). This works but adds infrastructure complexity.

Key observations:

  • Current Lambda bundles are small: 1-10 MB (TypeScript/esbuild) and 10-30 MB (Python/uv). All well under the 50 MB direct upload limit.
  • Rollback via S3 (pointing to a previous key) saves ~2 minutes vs rollback via git revert + CI rebuild. In practice, both require human intervention.
  • Git commit SHAs already provide full traceability of what code is deployed.
  • The GitHubActions-Lambda-DeployRole (ADR-003) already has lambda:UpdateFunctionCode permission, which supports both methods.

We need a standard default that balances simplicity with operational needs.

Decision

We adopt direct zip upload as the default Lambda code deployment method from CI/CD.

Core Rules

  1. Default method: aws lambda update-function-code --zip-file fileb://<path> from the GitHub Actions runner
  2. Terraform separation: Lambda functions MUST use lifecycle { ignore_changes = [filename, source_code_hash] } to decouple code deployments from infrastructure management
  3. Workflow structure: Two workflows per service — ci.yaml (PRs: validate) and deploy.yaml (master: build + deploy)
  4. Rollback: Revert the commit on master; CI/CD rebuilds and redeploys automatically
  5. S3 escalation: Services MAY adopt S3 artifact upload when escalation criteria are met (see below)
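Rules 1 and 3 can be sketched as the deploy step invoked from deploy.yaml. This is a minimal sketch assuming a GNU/Linux runner; the function name, zip path, and helper name are illustrative, not a prescribed interface:

```shell
#!/usr/bin/env bash
# Sketch of the deploy step run from deploy.yaml on pushes to master.
set -euo pipefail

deploy_lambda() {
  local function_name="$1" zip_path="$2"

  # Guard the 50 MB direct-upload limit early, at the 40 MB warning
  # threshold from this ADR. stat -c%s is GNU stat (ubuntu runners).
  local size_mb=$(( $(stat -c%s "$zip_path") / 1024 / 1024 ))
  if [ "$size_mb" -ge 40 ]; then
    echo "ERROR: ${zip_path} is ${size_mb} MB; escalate to S3 artifacts." >&2
    return 1
  fi

  # Push the zip straight to the function; no intermediate artifact store.
  aws lambda update-function-code \
    --function-name "$function_name" \
    --zip-file "fileb://${zip_path}"
}
```

The size guard means a service drifts into the escalation criteria loudly, at build time, rather than by a rejected upload during an incident.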

Escalation to S3 Artifacts

A service SHOULD switch to S3 artifact upload when any of these criteria apply:

  • Bundle size exceeds 40 MB zipped (approaching the 50 MB direct upload limit)
  • Regulatory or compliance requirements mandate persistent artifact retention with audit trail
  • Operational requirements demand instant rollback without rebuild (sub-30-second recovery SLA)

When escalating, follow the {project}-{env}-lambda-artifacts bucket naming convention established by the Lambda-DeployRole.
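As a sketch, the escalated deploy swaps the direct upload for a two-step upload-then-point sequence. The bucket and key follow the conventions above; all concrete values are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the S3 escalation path. Bucket and key follow the
# {project}-{env}-lambda-artifacts and lambdas/{component}/{version}-{sha}.zip
# conventions from this ADR; the helper and its arguments are illustrative.
set -euo pipefail

deploy_lambda_via_s3() {
  local function_name="$1" zip_path="$2"
  local project="$3" env="$4" component="$5" version="$6" sha="$7"

  local bucket="${project}-${env}-lambda-artifacts"
  local key="lambdas/${component}/${version}-${sha}.zip"

  # 1. Persist the artifact under a versioned key. Rollback later is just
  #    update-function-code pointing at a previous key, with no rebuild.
  aws s3 cp "$zip_path" "s3://${bucket}/${key}"

  # 2. Point the function at the uploaded artifact.
  aws lambda update-function-code \
    --function-name "$function_name" \
    --s3-bucket "$bucket" \
    --s3-key "$key"
}
```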

Terraform Configuration

Lambda functions use local file for initial bootstrap and ignore_changes to prevent Terraform from reverting CI/CD deployments:

resource "aws_lambda_function" "my_function" {
  function_name    = "${var.project}-${var.environment}-my-function"
  role             = aws_iam_role.my_function.arn
  handler          = "index.handler"
  runtime          = "nodejs20.x"

  # Bootstrap only: the local zip seeds the function on the first apply.
  filename         = "${path.module}/../dist/my-function.zip"
  source_code_hash = filebase64sha256("${path.module}/../dist/my-function.zip")

  # After bootstrap, CI/CD owns the code; Terraform must not revert it.
  lifecycle {
    ignore_changes = [filename, source_code_hash]
  }
}

Rationale

Why Direct Upload by Default?

Simplicity:

  • No additional infrastructure (no S3 bucket, lifecycle policies, versioning configuration)
  • One command per Lambda: aws lambda update-function-code --zip-file fileb://...
  • No intermediate state to manage or clean up

Sufficient for current scale:

  • All Ontopix Lambda bundles are 1-30 MB, well under the 50 MB limit
  • 10 total Lambdas across all services — not at a scale where artifact management adds value
  • Deploy cycles are fast (~2 minutes end-to-end)

Git as the source of truth:

  • Commit SHA identifies exactly what code is deployed
  • git log provides the full deployment history
  • Rollback = git revert + automatic CI/CD redeploy
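The rollback step above is ordinary git workflow. A minimal sketch, where rollback_commit is a hypothetical helper rather than an existing script:

```shell
#!/usr/bin/env bash
# Sketch of the rollback flow: revert the offending commit on master;
# pushing the revert commit triggers deploy.yaml, which rebuilds and
# redeploys. rollback_commit is a hypothetical helper name.

rollback_commit() {
  local bad_sha="$1"
  git revert --no-edit "$bad_sha"   # new commit that undoes the change
  git push origin master            # kicks off the normal deploy pipeline
}
```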

Alignment with existing infrastructure:

  • Lambda-DeployRole (ADR-003) already supports lambda:UpdateFunctionCode
  • No IAM changes needed
  • No new S3 bucket permissions required

Why Keep S3 as an Escalation Path?

Direct upload has real limits:

  • 50 MB zip limit — large Python dependencies or bundled assets can exceed this
  • No persistent artifacts — if compliance requires knowing exactly which binary was running at a given time, git alone may not suffice
  • Rebuild-dependent rollback — a broken build system prevents rollback

The S3 approach is a valid escalation, not a wrong choice. The {project}-{env}-lambda-artifacts convention in Lambda-DeployRole ensures any service can adopt it without IAM changes.

Why Decouple Terraform from Code Deploys?

Without lifecycle { ignore_changes }, running terraform apply would revert the Lambda code to whatever (possibly stale) local zip is present on the machine running the apply. This creates two problems:

  • A developer running terraform apply to change an env var accidentally downgrades the Lambda code
  • terraform plan always shows code drift (noisy, masking real infrastructure changes)

The separation matches the Code vs Infrastructure boundary from ADR-003: GitHub Actions deploys code, developers manage infrastructure via Terraform.

Limitations

Direct zip upload is deliberately simple. That simplicity comes with hard constraints:

| Limitation | Impact | Threshold |
| --- | --- | --- |
| 50 MB zip size limit | AWS API rejects the upload | Bundle exceeds 50 MB compressed |
| No persistent artifacts | Cannot inspect deployed binary without rebuilding from source | Always (inherent to the model) |
| Rebuild-dependent rollback | Rollback requires CI to build from a previous commit; broken CI = no rollback | CI pipeline failure during incident |
| No atomic multi-Lambda deploy | Each Lambda updates independently; brief window where Lambdas run different versions | Services where Lambdas share a contract that changes simultaneously |
| No deployment history in AWS | CloudWatch logs show function updates but not which artifact was deployed | Post-incident forensics requiring binary-level traceability |
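The two traceability limitations above (no persistent artifacts, no deployment history) can be partially mitigated without S3: Lambda exposes a CodeSha256 attribute that is the base64-encoded SHA-256 of the deployed zip, so a zip rebuilt from a given commit can be matched against what is actually running. This only works when the build is byte-reproducible (zip timestamps often break that), so treat it as a best-effort check. A sketch:

```shell
# Compute the same digest Lambda exposes as CodeSha256: the
# base64-encoded SHA-256 of the zip. Compare the result against
#   aws lambda get-function-configuration --function-name my-function \
#     --query CodeSha256 --output text
# to tie a rebuilt zip to the running code. Only reliable when the
# build is byte-reproducible; the function name above is illustrative.
zip_code_sha256() {
  openssl dgst -sha256 -binary "$1" | base64
}
```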

When to Escalate to S3 Artifacts

A service MUST evaluate switching to S3 artifact upload when any of these triggers fire:

| Trigger | Why it matters | Action |
| --- | --- | --- |
| Bundle exceeds 40 MB zipped | Approaching the 50 MB hard limit with no margin for growth | Switch to S3 upload (--s3-bucket/--s3-key) |
| Compliance or audit requires persistent artifact retention | Regulators or customers require proof of exactly which binary ran at a given time | Add S3 bucket with versioning; retain artifacts per retention policy |
| Instant rollback SLA (sub-30 seconds) | Rebuild takes ~2 minutes; S3 rollback is ~5 seconds (update-function-code pointing to previous key) | Switch to S3 with versioned keys for instant repoint |
| CI pipeline unreliable for rollback | If CI has frequent failures, rebuild-based rollback becomes risky during incidents | S3 artifacts decouple rollback from CI health |
| Multi-Lambda atomic deploy needed | Services where Lambdas share a versioned contract and must update together | S3 + CodeDeploy or custom orchestration |

Proactive Monitoring

To prevent hitting limitations reactively:

  • CI bundle size check: Add a step that fails the build (or warns) when any zip exceeds 40 MB
  • Quarterly review: Check if any service's bundles are growing toward the threshold
  • Incident retro: After any rollback, evaluate if rebuild-based rollback was fast enough
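The CI bundle size check can be a small script step. A sketch, assuming zips land in a dist/ directory and a GNU/Linux runner:

```shell
#!/usr/bin/env bash
# Sketch of the CI bundle size check: fail the build when any zip in the
# given directory reaches the 40 MB warning threshold from this ADR.
# stat -c%s is GNU stat (fine on ubuntu runners).

check_bundle_sizes() {
  local dist_dir="$1" limit_mb="${2:-40}" status=0 zip size_mb
  for zip in "$dist_dir"/*.zip; do
    [ -e "$zip" ] || continue   # glob matched nothing: no zips to check
    size_mb=$(( $(stat -c%s "$zip") / 1024 / 1024 ))
    if [ "$size_mb" -ge "$limit_mb" ]; then
      echo "FAIL: $zip is ${size_mb} MB (>= ${limit_mb} MB threshold)" >&2
      status=1
    else
      echo "OK: $zip is ${size_mb} MB"
    fi
  done
  return "$status"
}
```

Running the same function with a lower threshold (for example check_bundle_sizes dist 30) doubles as the quarterly growth review.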

Consequences

  • Zero additional infrastructure for Lambda CI/CD — no S3 buckets, lifecycle policies, or extra IAM
  • Fast adoption — any new service can add CI/CD without infrastructure changes
  • Clear escalation path — S3 artifacts are documented and IAM-ready when needed
  • See Limitations above for constraints and escalation triggers

Alternatives Considered

Alternative 1: S3 Artifact Upload as Default

Rejected as default because:

  • Adds infrastructure per service (S3 bucket, lifecycle, versioning)
  • Adds workflow complexity (upload to S3, then update Lambda)
  • Marginal benefit when bundles are small and deploys are fast
  • The maxcolchon implementation works but is more complex than needed for current scale

Not rejected entirely — remains the documented escalation path.

Alternative 2: AWS SAM / Serverless Framework

Rejected because:

  • Introduces a new abstraction layer over Terraform
  • Inconsistent with the .infra/ Terraform convention (ADR-0004)
  • Vendor-specific tooling that doesn't align with our IaC approach

Alternative 3: Container Image Lambdas

Rejected as default because:

  • Adds ECR dependency and Docker build complexity
  • Overkill for small Node.js/Python functions
  • May be appropriate for specific use cases (large ML models, custom runtimes)

Success Criteria

This decision is successful if:

  • New Lambda services adopt direct upload CI/CD without needing infrastructure changes
  • Existing services (maxcolchon) can migrate to direct upload, simplifying their setup
  • Bundle sizes remain under 40 MB (the proactive warning threshold)
  • Rollback time via git revert + CI rebuild stays under 5 minutes
  • Services that genuinely need S3 artifacts can escalate using the documented criteria