Innovation Brief #9: The Infrastructure Bill of Serverless



A team adopts serverless. Year one: costs are low, velocity is high, infrastructure overhead is near zero. No servers to patch. No capacity planning. The bill is a pleasant surprise every month.

Year two: the bill is $9,400 per month. A contractor spends a week migrating 40 Lambda functions to containers. The new bill is $2,500.

This is not unusual. It is, increasingly, the canonical serverless story.


Cold Starts: The Problem You Know Is Three Problems

Cold starts are the most discussed serverless limitation and the most misunderstood. The surface symptom — a function takes 500ms instead of 5ms on the first invocation — understates the actual damage.

Init latency is the visible piece. A Lambda function that hasn’t run recently starts a new container, loads the runtime, initializes the module, and then handles the request. For Python or Node.js this can be 100–500ms. For Java or .NET it can be 1–5 seconds. AWS’s Lambda Managed Instances (announced at re:Invent 2025) address this for predictable workloads, but they’re another billing surface and don’t help with genuinely spiky traffic.
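The standard mitigation is to pay the init cost once per container rather than once per request: anything expensive moves to module scope, where it survives across warm invocations. A minimal sketch — `load_model` is hypothetical, standing in for any heavy initialization such as a large SDK import or a config fetch:

```python
import time

# Hypothetical heavy initialization: in a real function this might be
# importing a large SDK, loading configuration, or warming a cache.
def load_model():
    time.sleep(0.2)          # stands in for ~200ms of init work
    return {"ready": True}

# Module scope runs once per container (the cold start). Every warm
# invocation reuses MODEL instead of paying the init cost again.
MODEL = load_model()

def handler(event, context):
    # Only per-request work lives in the handler.
    return {"status": 200, "model_ready": MODEL["ready"]}
```

The pattern doesn't eliminate cold starts; it just stops warm invocations from paying for them too.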

Connection pool thrashing is the invisible piece. The function initializes fresh on each cold start, which means database connections that should be pooled are being opened and closed per invocation. A database that handles 100 concurrent connections can see those exhausted by a burst of Lambda invocations that each open their own connection. This is why the 2024 DEV Community piece on serverless challenges leads with Lambda + database as the primary pain point, not Lambda in isolation. The solution — connection proxies like RDS Proxy — adds latency and cost that were never in the original budget.
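The per-container reuse pattern that limits the thrashing looks like the following sketch. The `Connection` class is a stand-in for a real driver such as psycopg2, ideally pointed at RDS Proxy; the point is that warm invocations share one connection instead of each opening their own:

```python
# Stand-in for a real database driver; it counts how many actual
# connections get opened so the reuse is observable.
class Connection:
    opened = 0

    def __init__(self):
        Connection.opened += 1

_conn = None

def get_connection():
    # Lazily create one connection per container and reuse it on warm
    # invocations, instead of opening a fresh one per request.
    global _conn
    if _conn is None:
        _conn = Connection()
    return _conn

def handler(event, context):
    conn = get_connection()
    return {"connections_opened": Connection.opened}
```

Three warm invocations still open exactly one connection. The thrashing returns when a burst of cold starts multiplies the containers — which is exactly why the proxy layer ends up in the budget anyway.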

Concurrency ceiling surprise is the silent piece. Lambda has a default concurrency limit of 1,000 per region (adjustable, but a process). A traffic spike that would have been smoothly absorbed by a container service with autoscaling instead hits the Lambda concurrency ceiling and begins throttling. The functions that were supposed to scale infinitely stop scaling. This is almost never in the architecture diagram.
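Whether a spike hits the ceiling follows from Little's law: in-flight executions equal arrival rate times average duration. A back-of-envelope check, with illustrative numbers:

```python
def required_concurrency(requests_per_second: float, avg_duration_s: float) -> float:
    # Little's law: concurrent in-flight executions =
    # arrival rate x time each request spends executing.
    return requests_per_second * avg_duration_s

# A spike to 2,500 req/s with 600ms functions needs 1,500 concurrent
# executions -- well past the default 1,000 per-region limit.
spike = required_concurrency(2500, 0.6)
print(spike)  # 1500.0
```

Running this arithmetic against your worst observed traffic spike, before adopting Lambda, is how the throttling stops being a surprise.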


Observability: The Distributed Tracing Gap

A monolith has one log stream. A well-organized container service has correlated logs you can trace with a request ID. A serverless application with 30 Lambda functions has 30 separate log streams in CloudWatch, each representing an independent execution context with no inherent causal link to the others.

Reconstructing what happened during a single user request — which functions ran, in what order, how long each took, what errors were swallowed — requires stitching together log entries across multiple streams using a correlation ID you had the foresight to propagate. If you didn’t propagate it, the failure is invisible.
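Propagation has to be explicit: the ID rides in every payload a function passes downstream, and every structured log line carries it. A minimal sketch — field names like `correlation_id` are conventions here, not an AWS API:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

def handler(event, context):
    # Reuse the caller's correlation ID if present; mint one at the edge.
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())

    # Every log line is structured JSON carrying the ID, so CloudWatch
    # Logs Insights can join streams on it later.
    log.info(json.dumps({"correlation_id": correlation_id, "msg": "start"}))

    # Propagate the ID in every downstream payload (SQS message body,
    # direct invoke, EventBridge detail) so the next function correlates.
    return {"correlation_id": correlation_id, "payload": event.get("payload")}
```

The discipline only works if every function in the chain does this; one function that drops the field breaks the trace from that point on.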

AWS X-Ray provides distributed tracing for Lambda. The problem is that using it meaningfully requires:

  • Manual instrumentation in every function
  • The X-Ray SDK’s initialization overhead added to every function’s cold start time
  • Additional CloudWatch charges for trace storage
  • Expertise in reading the service map, which becomes incomprehensible once you have more than 15–20 functions
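What “manual instrumentation in every function” means in practice: every handler gets wrapped so each execution emits a timed, trace-ID-tagged segment. The sketch below is a homegrown stand-in for what the X-Ray SDK’s capture decorator does, written to make the per-function burden concrete (`TRACES` stands in for the trace daemon):

```python
import functools
import time
import uuid

TRACES = []  # stand-in for the X-Ray daemon / trace backend

def traced(name):
    # Approximates the per-function instrumentation the X-Ray SDK
    # requires: wrap the handler, record a timed segment, tag it with
    # the propagated trace ID.
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(event, *args, **kwargs):
            trace_id = event.get("trace_id") or str(uuid.uuid4())
            start = time.perf_counter()
            try:
                return fn(event, *args, **kwargs)
            finally:
                TRACES.append({
                    "name": name,
                    "trace_id": trace_id,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return deco

@traced("checkout")
def handler(event, context=None):
    return {"ok": True}
```

Multiply the decorator by 30 functions, keep the trace ID flowing through every event payload, and you have the engineering cost the bullet list describes.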

The Datadog serverless monitoring product, the Lumigo service, the Axiom-Serverless Framework integration — these exist specifically because CloudWatch doesn’t give you distributed tracing without substantial additional engineering. The observability tooling that a container shop gets from Prometheus + Grafana for roughly $0 in licensing requires a commercial tool subscription in a mature serverless shop. This is a hidden infrastructure bill item that rarely appears in the year-one cost model.

The practical result: most serverless teams have fragmented logs rather than traces. They know their functions are running. They don’t know what their application is doing.


Function Sprawl: Nobody Owns the Zombie Functions

The natural evolution of a serverless codebase: more features mean more functions. A small API might start with 5 functions. A year later it has 50. Two years later it has 200. Nobody has counted them recently.

Unkey, a developer API key management service, moved away from serverless in 2025 after running into performance problems. The performance issue was the precipitating event, but the organizational issue was the accumulating weight of function ownership. When a function causes a production incident, the question “who owns this?” should have a fast answer. In a well-maintained Lambda estate, it does. In a sprawling one, it’s a 30-minute archaeology exercise.

The specific failure mode: functions that were built for one purpose get repurposed, or fall out of use but aren’t deleted, or were written by someone who left the team. A zombie function is a Lambda function that is deployed, potentially running, billed monthly, but whose purpose is unclear and whose last owner is gone. Zombie functions have IAM roles with permissions nobody has recently audited. They may be triggered by events from services that no longer exist in the way originally intended.

The discovery phase — “what functions do we have and what do they do?” — tends to happen during an incident or a cost audit. Neither is a good time for archaeology.
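The audit itself can be mechanical. In practice the inventory would come from boto3 (`list_functions` for the estate, CloudWatch invocation metrics for last activity); the triage logic is a few lines. A sketch with a stand-in inventory so the logic is visible:

```python
from datetime import datetime, timedelta, timezone

def find_zombies(functions, now, idle_days=90):
    # A function is a zombie candidate if it has never been invoked or
    # hasn't run inside the idle window.
    cutoff = now - timedelta(days=idle_days)
    return [
        f["name"] for f in functions
        if f["last_invoked"] is None or f["last_invoked"] < cutoff
    ]

now = datetime(2026, 1, 15, tzinfo=timezone.utc)
estate = [
    {"name": "checkout-v2",   "last_invoked": now - timedelta(days=3)},
    {"name": "legacy-export", "last_invoked": now - timedelta(days=400)},
    {"name": "tmp-backfill",  "last_invoked": None},  # never invoked
]
print(find_zombies(estate, now))  # ['legacy-export', 'tmp-backfill']
```

Running this quarterly, instead of during an incident, is the difference between maintenance and archaeology.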


IAM Policy Explosion: The Cost of Least Privilege at Scale

The security-correct approach to serverless IAM is per-function least-privilege policies. Each function gets exactly the permissions it needs and nothing more. This is correct in principle and unmanageable at scale.

50 functions, each with a bespoke IAM policy, means 50 policies to create, maintain, audit, and update when underlying resources change. When you rename a DynamoDB table, you update 50 policies. When a security audit identifies an over-permissive policy, you have 50 candidates to review. When you onboard a new team member, you have 50 policies to explain the rationale for.
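One way to keep 50 per-function policies maintainable is to generate them from a template, so a table rename is one code change rather than fifty hand edits. A sketch with a hypothetical table ARN:

```python
import json

def table_policy(table_arn, actions=("dynamodb:GetItem", "dynamodb:PutItem")):
    # One least-privilege policy per function, generated from shared
    # parameters: when the table ARN changes, regenerate instead of
    # editing fifty documents by hand.
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": list(actions),
            "Resource": table_arn,
        }],
    }

# Hypothetical ARN for illustration.
policy = table_policy("arn:aws:dynamodb:eu-west-1:123456789012:table/orders")
print(json.dumps(policy, indent=2))
```

Templating doesn’t remove the audit burden, but it centralizes the change surface — which is the specific pain the 50-edits scenario describes.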

The practical response is gradual policy consolidation: functions that share a resource category share a policy. This is rational but defeats the original purpose. You’ve rebuilt the AWS managed policies that every team starts with, but with extra steps and inconsistent documentation.

The alternative — a single permissive policy shared across all functions — is what teams often end up with after the consolidation pressure wins. This is the worst outcome: you have the architecture of least-privilege without the security properties. The IAM complexity was incurred; the security benefit was surrendered.


The Cost Cliff: Where the Math Changes

Serverless pricing is optimized for spiky, unpredictable traffic. The model is: pay nothing when you’re idle, pay proportionally when you’re busy. For a startup with bursty traffic, this is a genuine advantage. For a product with sustained, predictable load, it is not.

The break-even calculation depends on your workload, but the pattern is consistent:

At low and bursty usage: Lambda wins clearly. Containers running 24/7 cost more than functions running 5% of the time.

At sustained moderate usage: Lambda costs are predictable but no longer cheap. A container sized correctly for the load costs roughly the same.

At sustained high usage: Lambda becomes measurably more expensive. The Prime Video team’s 2023 case study is the canonical data point: moving a monitoring application from serverless to a container service cut costs by 90%. The January 2026 case study of migrating 40 Lambda functions cut the bill from $9,400 to $2,500 per month — 73% reduction.

The math changes somewhere between 20% and 40% sustained utilization. The problem is that teams don’t model this threshold when they adopt serverless. They model the low-traffic case because that’s when they’re adopting it. The transition through the cost cliff happens 12–18 months later, after the architecture is established and the migration cost is non-trivial.
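The threshold is straightforward to model before adoption. A back-of-envelope comparison — the per-GB-second and per-request rates below are illustrative and should be checked against current pricing, and the container hourly rate is an assumption:

```python
def lambda_monthly_cost(req_per_s, avg_duration_s, mem_gb,
                        price_per_gb_s=0.0000166667,
                        price_per_million_req=0.20):
    # Illustrative rates only -- check current pricing for your region.
    seconds = 30 * 24 * 3600
    requests = req_per_s * seconds
    gb_seconds = requests * avg_duration_s * mem_gb
    return gb_seconds * price_per_gb_s + (requests / 1e6) * price_per_million_req

def container_monthly_cost(instances, hourly_rate=0.10):
    # Assumed ~$0.10/hour per small always-on task or instance.
    return instances * hourly_rate * 30 * 24

# Sustained 100 req/s at 200ms / 512MB vs. two always-on containers:
# at this load, Lambda comes out at roughly 3x the container cost.
print(round(lambda_monthly_cost(100, 0.2, 0.5), 2))
print(round(container_monthly_cost(2), 2))
```

Flip the traffic to bursty (the same functions running 5% of the time) and the inequality reverses — which is the whole cost-cliff argument in two function calls.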


Vendor Lock-In: It’s Not the Functions

The common argument against serverless vendor lock-in is that the functions themselves are portable — a 200-line Node.js function could run on Lambda, Cloud Functions, or a bare container with minimal changes.

This is technically true and practically wrong.

The lock-in is in everything around the functions:

  • The IAM role bindings that took months to tune
  • The CloudWatch dashboards that the on-call team relies on
  • The EventBridge rules that trigger functions from 15 different event sources
  • The DynamoDB single-table patterns optimized for Lambda’s access patterns
  • The SQS queue configurations tuned for Lambda’s concurrency behavior
  • The VPC configuration that gives the function access to the database

When a team evaluates migrating away from Lambda, they’re not migrating 200 lines of Node.js. They’re migrating the IAM estate, the event bus integrations, the monitoring configuration, and the database access patterns. This is a months-long project, not a weekend. Unkey’s migration was not fast despite the functions being technically portable.

The portability argument is a function-level argument being applied to a system-level problem.


What the Ecosystem Actually Needs

The tooling gap is real and partially addressed. What’s missing:

Function lifecycle management. Tools that track function creation date, last invoked time, current owner, business purpose, and last policy audit. AWS provides CloudTrail for activity, but there’s no first-class “function register” that accumulates organizational knowledge about your Lambda estate. Zombie functions persist because there’s no standard way to declare a function’s intended lifecycle.
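Nothing stops a team from building a minimal register today: a schema that refuses to record a function without lifecycle metadata. In practice the entries could live as Lambda resource tags or as a registry file in the repo; the schema, not the storage, is the point. A sketch with hypothetical field names:

```python
# The metadata AWS doesn't track for you: who owns a function, why it
# exists, and when its permissions were last audited.
REQUIRED_FIELDS = {"owner", "purpose", "created", "last_policy_audit"}

def register_entry(name, **meta):
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        # Refuse to register a function without lifecycle metadata --
        # this is what keeps zombies from accumulating silently.
        raise ValueError(f"{name}: missing metadata {sorted(missing)}")
    return {"name": name, **meta}

entry = register_entry(
    "orders-export",
    owner="payments-team",
    purpose="nightly CSV export to finance bucket",
    created="2025-03-01",
    last_policy_audit="2025-11-12",
)
```

Enforced at deploy time, a check like this turns the zombie-function discovery phase from archaeology into a lookup.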

Pre-adoption cost modeling. The cost cliff exists partly because teams don’t model it before they hit it. Tools that project Lambda cost at 20%, 50%, and 80% sustained utilization — and show the container cost equivalent at each level — would let teams make informed architecture decisions. The decision to go serverless shouldn’t be made without knowing where the cost parity line is.

Transparent distributed tracing without commercial tooling. The gap between “CloudWatch logs” and “distributed trace” is too wide. OpenTelemetry support in Lambda has improved, but the configuration overhead is still high enough that many teams skip it. The default should be correlated traces, not correlated-if-you-set-it-up traces.

IAM policy templates with fitness-for-purpose metadata. Per-function IAM policies should have a standard way to document why they exist and what they’re allowed to do. This is a documentation problem as much as a tooling problem, but tooling that requires documentation at policy creation time — rather than during the post-incident audit — would help.


The Honest Summary

Serverless is the correct choice for: bursty unpredictable traffic, small teams with zero ops capacity, workloads that run infrequently, glue code between managed services.

Serverless is the wrong choice for: sustained high-throughput workloads, latency-sensitive applications, architectures that require complex distributed tracing, teams without the discipline to maintain IAM hygiene at scale.

The problem is that most teams adopt serverless for the first case and grow into the second. The architecture that was correct for a 2,000 requests-per-day startup becomes expensive and operationally complex for a 2,000,000 requests-per-day product. The migration back is non-trivial and the cost cliff surprises teams who never modeled it.

The infrastructure bill of serverless is not hidden. It is deferred. The payment terms are: year one cheap, year two reckoning.

Part of the Innovation Brief series
  1. Innovation Brief #7: The Integration Test Paradox
  2. Innovation Brief #8: The Deployment Confidence Gap
  3. Innovation Brief #9: The Infrastructure Bill of Serverless
