DevOps & Cloud · May 24, 2026 · 6 min read

The Day Our GCP Cloud Run Cold Starts Took Down Checkout

A Cloud Run service that ran fine for eighteen months started timing out checkout on a Friday afternoon. The fix wasn't more CPU — it was a misread of how min-instances, concurrency, and startup CPU boost actually interact.

A Cloud Run service that had quietly served traffic for eighteen months started timing out the checkout flow on a Friday at 16:40 local. No deploy, no traffic spike, no obvious smoking gun. By the time we restored full health it was 19:20, and the real lesson wasn't about adding capacity — it was about how three Cloud Run settings interact in ways the docs technically describe but don't really warn you about.

This is the postmortem we wrote internally, lightly sanitised. If you run anything serious on Cloud Run, the failure mode is worth knowing before you hit it.

The setup

The service in question was a Node.js (Fastify) checkout API sitting in front of Stripe and an internal order service. It had been running on Cloud Run gen2 with these settings, more or less unchanged for a year:

resource "google_cloud_run_v2_service" "checkout" {
  name     = "checkout-api"
  location = "europe-west1"

  template {
    scaling {
      min_instance_count = 0
      max_instance_count = 40
    }

    containers {
      image = "..."
      resources {
        limits = {
          cpu    = "1"
          memory = "512Mi"
        }
        startup_cpu_boost = false
      }
    }

    max_instance_request_concurrency = 80
    timeout                          = "30s"
  }
}

Traffic was modest: roughly 4–12 requests per second on weekdays, spiking to ~30 rps during marketing pushes. p50 latency sat around 180ms, p99 around 900ms. Fine.

What we saw

The alert that fired was from our synthetic checkout monitor: 30-second timeouts on /checkout/session. Sentry started lighting up with UND_ERR_HEADERS_TIMEOUT from clients downstream of an internal gateway calling into checkout.

The Cloud Run metrics told a confusing story:

Request count: normal, ~8 rps.
Container instance count: oscillating wildly between 1 and 14.
CPU utilisation: spiking to 100% on new instances, then dropping.
Request latency p99: 18–28 seconds. Not 900ms. Eighteen seconds.

The first instinct — "we're overloaded, scale up" — was wrong. Throughput hadn't changed. Something was making each request hideously expensive, but only on some instances.

The red herring

We spent the first 40 minutes chasing a database connection issue. The checkout service holds a small Postgres pool against Cloud SQL, and the slow request traces showed time spent in pg.connect. We bumped the pool, restarted, and felt momentarily better. Latency dropped to ~4s. Still terrible, but moving.

Then we noticed the pattern: slow requests clustered on freshly started instances. Once an instance had served 20–30 requests, it was fine. New instances were the problem. We were watching cold starts, just very slow ones.

Why the cold starts got worse

Here's the part that wasn't obvious. Three things lined up: one had drifted over the past few months, and two were settings we had never revisited:

The container image had grown. A teammate added a heavy SDK for a fraud-scoring vendor (their Node SDK pulls in a large WASM blob and does some work at require time). Image size went from ~180MB to ~340MB.
Concurrency was 80. That is the Cloud Run default, and we had never tuned it. So when an instance cold-started, Cloud Run's load balancer happily sent it up to 80 concurrent requests immediately.
startup_cpu_boost was false. This was the killer. Without boost, a new instance gets only its allocated 1 vCPU during startup. With boost, Cloud Run gives it additional CPU during startup and for roughly 10 seconds after the instance has started.

Individually, none of these is fatal. Together they form a nasty feedback loop:

Traffic nudges Cloud Run to scale from 1 → 2 instances.
New instance takes ~6s to start (heavy image, no boost).
The moment it's ready, the LB routes up to 80 in-flight requests to it.
The instance is now doing JIT warmup, lazy require() work, Postgres pool init, and processing up to 80 requests on 1 vCPU.
Requests on that instance take 15–25s. Some time out at 30s.
Timeouts cause retries from the upstream gateway, which inflates request count, which triggers more scaling, which creates more cold instances.

We weren't overloaded. We were amplifying our own load through cold-start fan-out.
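A back-of-envelope model makes the loop above more concrete. This is only a sketch: it assumes the vCPU is shared perfectly fairly between requests, and it uses a hypothetical 200 ms of CPU per request during warmup and a hypothetical gateway retry limit, neither of which we measured. The function names are ours, for illustration.

```javascript
// Time for a cold instance to finish a burst of equal-cost requests.
// Assumption: fair CPU sharing, so the whole burst completes at about
// (total CPU work) / (available vCPUs).
function burstDrainSeconds(concurrentRequests, cpuMsPerRequest, vcpus) {
  return (concurrentRequests * cpuMsPerRequest) / (vcpus * 1000);
}

// Attempts generated per original request when timed-out attempts are retried.
// Assumption: each attempt times out independently with probability
// `timeoutFraction`, and the caller retries at most `maxRetries` times.
// Result: 1 + p + p^2 + ... + p^maxRetries
function retryAmplification(timeoutFraction, maxRetries) {
  let factor = 0;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    factor += Math.pow(timeoutFraction, attempt);
  }
  return factor;
}

// Hypothetical inputs: 200 ms CPU per request, 1 vCPU.
console.log(burstDrainSeconds(80, 200, 1)); // 16 (seconds at concurrency 80)
console.log(burstDrainSeconds(20, 200, 1)); // 4  (seconds at concurrency 20)
console.log(retryAmplification(0.5, 2));    // 1.75x load if half of attempts time out
```

Under those assumptions, a full burst at concurrency 80 lands in the same range as the 15–25s latencies we saw, and even a modest retry policy multiplies the load that caused the timeouts in the first place.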

The trigger that Friday afternoon was a marketing email going out — a perfectly reasonable 2x traffic bump that under normal cold-start cost would have been invisible.

The fix, in stages

We didn't get this right on the first try. Here's roughly what we changed and what each change actually bought us, in our environment.

Stage 1: Enable startup CPU boost

The single biggest win. This is documented as a feature for "reducing cold start latency" but the magnitude surprised us. Container startup time dropped from ~6s to ~2.2s, and the first batch of requests on a new instance no longer ran on a starved CPU.

resources {
  limits = {
    cpu    = "1"
    memory = "512Mi"
  }
  startup_cpu_boost = true
}

In our case this alone would probably have stopped the bleeding. It's also cheap: the extra CPU is billed only while the boost is active, not for the lifetime of the instance.

Stage 2: Drop concurrency

We took max_instance_request_concurrency from 80 to 20. This is a tradeoff: lower concurrency means more instances at the same RPS, which means more cost and, potentially, more cold starts overall.

But it also means a cold instance gets at most 20 concurrent requests in its first second of life instead of 80. For a CPU-bound Node service, 20 is roughly where one vCPU stops thrashing.

If your workload is mostly I/O-bound and you're on a fat instance, the right number is higher. The point isn't 20 — it's that the default isn't the answer, and you should pick a number based on what your service actually does per request.
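One way to reason about the cost side of that tradeoff is Little's law: average in-flight requests equal arrival rate times time in the system. The sketch below plugs in the traffic figures from this post. Using p99 rather than mean latency is a deliberately pessimistic assumption, and the real autoscaler also reacts to CPU utilisation, so treat the output as a rough bound rather than a prediction. Function names are illustrative.

```javascript
// Little's law: in-flight requests = requests per second x seconds per request.
function inFlightRequests(requestsPerSecond, latencyMs) {
  return (requestsPerSecond * latencyMs) / 1000;
}

// Instances needed to hold that many in-flight requests at a given
// per-instance concurrency limit (never fewer than one).
function instancesNeeded(requestsPerSecond, latencyMs, concurrency) {
  const inFlight = inFlightRequests(requestsPerSecond, latencyMs);
  return Math.max(1, Math.ceil(inFlight / concurrency));
}

// Figures from this post: ~30 rps during a marketing push, p99 around 900 ms.
console.log(inFlightRequests(30, 900));    // 27
console.log(instancesNeeded(30, 900, 80)); // 1
console.log(instancesNeeded(30, 900, 20)); // 2
```

On these numbers, dropping concurrency from 80 to 20 costs roughly one extra warm instance at peak, which is a small price for capping the burst a cold instance can receive.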

Stage 3: A modest min-instances floor

We set min_instance_count = 2. Not enough to handle steady-state traffic on its own, but enough that the first cold start of the day isn't a customer's.

This costs real money — two always-on instances at our CPU/memory shape is roughly a coffee-a-day per region, depending on the SKU. For a checkout path that directly converts to revenue, that maths is easy. For a background webhook receiver, we wouldn't bother.

Stage 4: Trim the image and defer work

Longer-term, we moved the fraud SDK initialization out of module scope and behind a lazy singleton. The WASM blob still has to load eventually, but not before the HTTP server is listening. We also switched the base image to node:20-slim and audited what was in node_modules at runtime. Image size dropped to ~210MB.
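The lazy-singleton pattern looks roughly like this. It is a minimal sketch, not our production code: loadFraudSdk is a hypothetical stand-in for the vendor SDK's initialisation (the real SDK's API isn't shown here), and the scoring rule is made up so the example runs on its own.

```javascript
// Hypothetical stand-in for the vendor SDK's expensive initialisation.
let initCount = 0;
async function loadFraudSdk() {
  initCount += 1;
  // Simulate slow startup work such as compiling a WASM module.
  await new Promise((resolve) => setTimeout(resolve, 50));
  return { score: async (order) => (order.amount > 1000 ? "review" : "ok") };
}

// Lazy singleton: nothing runs at module load. The first caller starts
// initialisation; concurrent and later callers share the same promise.
let sdkPromise = null;
function getFraudSdk() {
  if (sdkPromise === null) {
    sdkPromise = loadFraudSdk().catch((err) => {
      sdkPromise = null; // let the next call retry if initialisation failed
      throw err;
    });
  }
  return sdkPromise;
}

// Request handlers await the shared instance instead of importing it eagerly.
async function scoreOrder(order) {
  const sdk = await getFraudSdk();
  return sdk.score(order);
}
```

Caching the promise rather than the resolved SDK matters: if several requests arrive before initialisation finishes, they all wait on one load instead of each starting their own. The first request that needs scoring still pays the init cost, so one option is to call getFraudSdk() in the background right after the server starts listening.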

This kind of work is unglamorous and rarely prioritised until an incident makes it urgent.

What the dashboards should have told us sooner

In hindsight, the signal was there. Cloud Run exports container/startup_latencies and request_latencies separately. We were alerting on request latency p99 but had no alert on startup latency, even though startup latency had been creeping up for weeks as the image grew.

If you take one operational thing from this: alert on startup latency, not just request latency. A p95 startup latency above ~3s on a customer-facing service is usually a warning that you're one traffic bump away from the failure mode above.
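In practice this rule lives in your monitoring system rather than in application code, but spelling it out makes the threshold concrete. The sketch below uses a nearest-rank percentile over a window of startup latency samples. The sample values are invented for illustration, and the 3s default mirrors the rule of thumb above rather than any Cloud Run setting.

```javascript
// Nearest-rank percentile over a list of samples (p between 0 and 100).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// True when the p95 startup latency in the window exceeds the threshold.
function startupLatencyBreached(startupSeconds, thresholdSeconds = 3) {
  return percentile(startupSeconds, 95) > thresholdSeconds;
}

// Invented sample windows, in seconds.
const healthy = [1.8, 2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 2.5, 2.6, 2.6];
const creeping = [1.8, 2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 2.5, 2.6, 6.1];

console.log(startupLatencyBreached(healthy));  // false
console.log(startupLatencyBreached(creeping)); // true
```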

We also attached a trace ID to the service's structured logs and propagated it from the gateway through to Cloud Run, so a slow request can be inspected end-to-end. OpenTelemetry's GCP exporter handles this cleanly enough; the work is mostly in being disciplined about propagating headers.
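For readers who want to see what that correlation amounts to without pulling in a tracing library, here is a dependency-free sketch. It assumes the gateway sends a W3C traceparent header; the function names and the projectId argument are ours, and a real tracer such as OpenTelemetry would also create a child span and rewrite the span ID rather than forwarding the header unchanged.

```javascript
// Parse a W3C `traceparent` header: version-traceId-spanId-flags.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    (header || "").trim().toLowerCase()
  );
  if (!match) return null;
  const [, , traceId, spanId, flags] = match;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build a structured log entry that Cloud Logging can link to a trace.
// `headers` is expected to use lowercase names, as Node's HTTP server provides.
function logEntry(message, headers, projectId) {
  const entry = { severity: "INFO", message };
  const ctx = parseTraceparent(headers["traceparent"]);
  if (ctx) {
    entry["logging.googleapis.com/trace"] = `projects/${projectId}/traces/${ctx.traceId}`;
    entry["logging.googleapis.com/spanId"] = ctx.spanId;
    entry["logging.googleapis.com/trace_sampled"] = ctx.sampled;
  }
  return entry;
}

// Simplified pass-through of trace context for outbound calls.
function outboundHeaders(incomingHeaders) {
  const out = {};
  for (const name of ["traceparent", "tracestate"]) {
    if (incomingHeaders[name]) out[name] = incomingHeaders[name];
  }
  return out;
}
```

Writing the entry as one JSON line to stdout is enough for Cloud Run to pick it up as a structured log.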

What we'd do differently from day one

If we were standing this service up again in 2026, knowing what we know:

Turn on startup_cpu_boost by default for any user-facing service. It's the cheapest reliability win on Cloud Run.
Set concurrency deliberately based on a load test, not a default. For Node, start around 20–40 and measure.
Use min_instance_count of at least 1 (often 2 across zones) for anything where a cold start is customer-visible. Skip it for async workers.
Alert on startup latency and instance churn, not only request latency.
Treat container image size as a reliability metric, not just a CI nuisance.

If you're auditing a Cloud Run service right now, those five checks take maybe an hour and will catch the bulk of what bit us. The rest — image hygiene, lazy initialization, trace propagation — is the kind of work that pays back the next time marketing forgets to warn you about a campaign. We help teams shake this kind of thing out as part of our DevOps and cloud work, and almost every Cloud Run audit we've done in the last year has found at least two of the five.

