DevOps & Cloud · June 23, 2026 · 6 min read

Terraform State Locking Failed Mid-Apply. Here's What We Learned.

A mid-apply credential failure left real infrastructure changed, our Terraform state stale, and the DynamoDB lock stuck. Here's the postmortem, the recovery steps, and the guardrails we added so it doesn't happen again.

A Terraform apply against our shared networking module died halfway through on a Tuesday afternoon. The state file no longer matched what was actually deployed, the DynamoDB lock refused to release, and three engineers spent the next ninety minutes trying not to make it worse. This is the postmortem, with the recovery steps we actually ran and the guardrails we wish we'd had on day one.

What broke, in one paragraph

We run Terraform with the standard AWS backend: state in S3, lock in DynamoDB. The apply was modifying a transit gateway attachment plus a handful of route tables, nothing exotic. Roughly six minutes in, the CI runner lost its IAM session because someone had changed the role's trust policy in a parallel PR (don't ask). Terraform tried to write the updated state back to S3, failed the credential check, retried, and then exited non-zero. The DynamoDB lock row stayed put. The S3 state object was still the pre-apply version, but several real AWS resources had already been mutated.

That gap — real infrastructure changed, state file says otherwise, lock won't release — is the classic IaC nightmare.

Why the lock didn't auto-release

The AWS backend's locking model is simpler than people assume. Terraform writes a row to DynamoDB at the start of a plan/apply, and deletes it on clean exit. There's no TTL. There's no heartbeat. If the process dies between "acquire lock" and "release lock" — SIGKILL, OOM, credential expiry, network partition, runner eviction — the row stays forever.

We knew this in theory. We had not internalised what it means when your CI runner is ephemeral and your engineers are on a Slack huddle trying to remember which workspace owns which lock ID.
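
The acquire-then-delete model described above can be illustrated with two AWS CLI calls. This is a minimal sketch of the idea, not Terraform's actual code: the function names, the table name you pass in, and the simplified `Info` value are all assumptions made for illustration.

```shell
# Sketch of the locking model: a conditional write to acquire, a delete to release.
# Function names and arguments are illustrative, not part of Terraform.
AWS="${AWS:-aws}"   # overridable so the functions can be exercised without AWS access

# Acquire: the write fails if a row with this LockID already exists.
acquire_lock() {
  # $1 = table name, $2 = LockID, $3 = free-text owner note (no double quotes)
  "$AWS" dynamodb put-item \
    --table-name "$1" \
    --item "{\"LockID\":{\"S\":\"$2\"},\"Info\":{\"S\":\"$3\"}}" \
    --condition-expression 'attribute_not_exists(LockID)'
}

# Release: a plain delete. If the process dies before this runs,
# nothing else removes the row: no TTL, no heartbeat.
release_lock() {
  # $1 = table name, $2 = LockID
  "$AWS" dynamodb delete-item \
    --table-name "$1" \
    --key "{\"LockID\":{\"S\":\"$2\"}}"
}
```

The point of the sketch is the gap between the two functions: every failure mode listed above lands between them, and the row simply outlives the process.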

The lock row, for reference

If you've never looked at one, it's just this:

{
  "LockID": "my-bucket/env/prod/network.tfstate",
  "Info": "{\"ID\":\"a1b2c3...\",\"Operation\":\"OperationTypeApply\",\"Who\":\"runner@ci-prod-7\",\"Created\":\"2026-...\"}"
}

The ID inside the Info JSON (not the LockID) is what terraform force-unlock wants. Write it down somewhere your panicked future self can find it.
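
If nobody wrote the ID down, it can be read back from the table. A minimal sketch, assuming a lock table named `terraform-locks` (a placeholder) and an `Info` string laid out like the example row; the `sed` extraction is deliberately crude, and a real JSON parser such as `jq` would be more robust if it's available.

```shell
# Sketch: read the lock row and print the ID that force-unlock expects.
# The table name and key in the usage line are placeholders.
AWS="${AWS:-aws}"   # overridable so the functions can be exercised without AWS access

# Print the raw Info JSON string stored on the lock row.
lock_info() {
  # $1 = DynamoDB table name, $2 = LockID value (bucket/key of the state object)
  "$AWS" dynamodb get-item \
    --table-name "$1" \
    --key "{\"LockID\":{\"S\":\"$2\"}}" \
    --consistent-read \
    --query 'Item.Info.S' \
    --output text
}

# Extract the "ID" value from an Info JSON string read on stdin.
lock_id_from_info() {
  sed -n 's/.*"ID":"\([^"]*\)".*/\1/p'
}

# Usage (placeholders):
#   lock_info terraform-locks my-bucket/env/prod/network.tfstate | lock_id_from_info
```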

The recovery, step by step

We did not run terraform force-unlock first. That was the right call: the state file and reality were out of sync, and releasing the lock would have invited the next engineer to apply on top of stale state.

Here's the order we actually went in:

1. Freeze the blast radius

First thing: we paused the CI pipeline for that workspace and posted in the channel that nobody should run anything against network/prod. Cheap, obvious, easy to forget when adrenaline kicks in.

2. Snapshot the state bucket

S3 versioning was on (it had better be). We listed every version of the state object from the last 24 hours and copied them to a quarantine prefix. If we did something stupid in the next hour, we wanted a known-good rollback.

aws s3api list-object-versions \
  --bucket tf-state-prod \
  --prefix env/prod/network.tfstate \
  --query 'Versions[?LastModified>=`2026-...`]'
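
The listing above only shows the versions; the copy to quarantine is a second step. A rough sketch of that step, assuming a `quarantine/` prefix in the same bucket (the prefix name and both function names are ours for illustration). Unlike the listing, it copies every stored version rather than filtering by date, and `--prefix` will also match any other key that starts with the same string.

```shell
# Sketch: copy every stored version of the state object to a quarantine prefix.
# The "quarantine/" prefix is an assumed name, not one from the incident.
AWS="${AWS:-aws}"   # overridable so the functions can be exercised without AWS access

# Build the destination key for one version: quarantine/<key>.<version-id>
quarantine_key() {
  # $1 = original key, $2 = version ID
  printf 'quarantine/%s.%s\n' "$1" "$2"
}

# Copy each version of s3://$1/$2 to its quarantine key in the same bucket.
snapshot_versions() {
  bucket=$1 key=$2
  "$AWS" s3api list-object-versions \
      --bucket "$bucket" --prefix "$key" \
      --query 'Versions[].VersionId' --output text |
    tr '\t' '\n' |
    while read -r version; do
      # Skip blanks and the "None" the CLI prints when nothing matches.
      [ -n "$version" ] && [ "$version" != "None" ] || continue
      "$AWS" s3api copy-object \
        --bucket "$bucket" \
        --copy-source "$bucket/$key?versionId=$version" \
        --key "$(quarantine_key "$key" "$version")" >/dev/null
    done
}

# Usage: snapshot_versions tf-state-prod env/prod/network.tfstate
```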

3. Reconcile state to reality

This is where most teams go wrong. We pulled the current state locally, then for every resource the failed apply had touched, we ran a targeted terraform import or terraform state rm to make the file match what AWS actually had. We used the CloudTrail event history for the runner's role to get an authoritative list of what changed — not the Terraform plan output, which was now lying to us.
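
Pulling that list out of CloudTrail might look something like the sketch below. It assumes the runner's role session name is what CloudTrail records as the user name for assumed-role calls, and that the window falls inside what event history retains; the function name and arguments are placeholders.

```shell
# Sketch: list CloudTrail management events recorded for one session name.
AWS="${AWS:-aws}"   # overridable so the function can be exercised without AWS access

runner_events() {
  # $1 = role session name used by the CI runner (placeholder)
  # $2 = start time, $3 = end time (ISO 8601 timestamps)
  "$AWS" cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=Username,AttributeValue=$1" \
    --start-time "$2" --end-time "$3" \
    --query 'Events[].[EventTime,EventSource,EventName]' \
    --output text
}
```

The output includes read-only calls as well as mutations, so expect to filter it down to the Create, Modify, Replace, and Delete events by eye or with grep.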

For a transit gateway attachment that had been created but never recorded:

terraform import \
  module.network.aws_ec2_transit_gateway_vpc_attachment.shared \
  tgw-attach-0abc123def456

For a route that had been deleted but state still claimed existed, terraform state rm was enough.
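
One practical wrinkle: while the lock row exists, state-changing commands such as import and state rm try to take the lock themselves and will refuse to run. The sketch below assumes the reconciliation is done with `-lock=false`, which both commands accept, and only after the pipeline is frozen so nobody else can write. The post doesn't record the exact flags used, and the resource address in the usage line is hypothetical.

```shell
# Sketch: state-only edits while the stuck lock is deliberately left in place.
TF="${TF:-terraform}"   # overridable so the functions can be exercised without Terraform

# Keep an offline copy of the current state before touching it.
backup_state() { "$TF" state pull > "$1"; }

# Show which state entries an address would remove, without changing anything.
preview_forget() { "$TF" state rm -dry-run -lock=false "$1"; }

# Drop a resource from state only; nothing is destroyed in AWS.
forget_resource() { "$TF" state rm -lock=false "$1"; }

# Usage (hypothetical address):
#   backup_state pre-reconcile.tfstate
#   preview_forget module.network.aws_route.stale
#   forget_resource module.network.aws_route.stale
```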

4. Now release the lock

With state and reality reconciled, force-unlock was safe:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef0123456789

5. Plan, eyeball, apply

The next plan should be near-empty. Ours wasn't — it surfaced two route table entries we'd missed in reconciliation. We fixed those by hand in the console (yes, really) and re-imported. Then a clean plan. Then apply. Total elapsed: about 90 minutes from break to green.

What we changed afterward

A postmortem that doesn't change anything is just a story. We made four concrete changes.

Short-lived credentials with generous TTL during apply

The original trigger was a credential rotation mid-apply. We moved our CI runners to OIDC-based role assumption with a 2-hour session, and added a pre-flight check that refuses to start an apply if the remaining session TTL is under 30 minutes. This is one of those guardrails that costs almost nothing and would have prevented the whole incident.
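
The pre-flight check itself is only a few lines. A minimal sketch, assuming the runner can obtain its session expiry as epoch seconds (how it does so depends on the CI system and isn't shown); the function and variable names are illustrative.

```shell
# Sketch: refuse to start an apply when the session expires too soon.
MIN_TTL_SECONDS="${MIN_TTL_SECONDS:-1800}"   # 30 minutes, per the guardrail above

# Succeeds (exit 0) when at least MIN_TTL_SECONDS remain on the session.
session_ttl_ok() {
  # $1 = session expiry as epoch seconds, $2 = current time as epoch seconds
  remaining=$(( $1 - $2 ))
  [ "$remaining" -ge "$MIN_TTL_SECONDS" ]
}

preflight() {
  # $1 = session expiry as epoch seconds (how the runner learns this is an assumption)
  if session_ttl_ok "$1" "$(date +%s)"; then
    echo "preflight: session TTL ok"
  else
    echo "preflight: under $((MIN_TTL_SECONDS / 60)) minutes of session left, refusing to apply" >&2
    return 1
  fi
}
```

Running `preflight` as the first pipeline step and failing the job on a non-zero exit keeps the check out of the Terraform code entirely.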

A lock-age alarm

We added a CloudWatch alarm that fires when any row in the lock table is older than 20 minutes. Most of our applies finish in under 5. A 20-minute-old lock is, in practice, always a stuck lock. The alarm pages whoever is on-call for platform, with a link to a runbook that says, in big letters, do not force-unlock without checking state drift first.
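
CloudWatch alarms act on metrics rather than on table contents, so an alarm like this needs something to publish lock age as a custom metric. A rough sketch of the two pieces a scheduled job would need, assuming the job has already converted each lock's Created timestamp to epoch seconds (for example with GNU `date -u -d`); the namespace and metric name are made up for illustration.

```shell
# Sketch: compute a lock's age and publish it for a CloudWatch alarm to watch.
AWS="${AWS:-aws}"   # overridable so the functions can be exercised without AWS access

# Age in whole minutes between a lock's creation time and now (both epoch seconds).
lock_age_minutes() {
  # $1 = lock creation time, $2 = current time
  echo $(( ($2 - $1) / 60 ))
}

# Publish the oldest lock's age; the alarm then uses a threshold of 20.
publish_lock_age() {
  # $1 = age in minutes (publish 0 when the table is empty so the alarm has data)
  "$AWS" cloudwatch put-metric-data \
    --namespace "Platform/Terraform" \
    --metric-name "OldestLockAgeMinutes" \
    --unit Count \
    --value "$1"
}
```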

Smaller workspaces

The failing module was doing too much. Splitting our network module into three smaller workspaces (transit gateway, VPCs, route tables) means each apply touches fewer resources, finishes faster, and has a smaller reconciliation surface when something goes wrong. The tradeoff: more cross-workspace data sources and a slightly more annoying dependency graph. Worth it.

A read-only "what does AWS actually have" script

We wrote a small tool that, given a workspace, queries AWS directly for every resource type the module manages and dumps a comparable structure. It's not a replacement for terraform plan — it's the thing you run when you don't trust terraform plan. During the incident, we built this from scratch under pressure. Now it lives in our platform repo.
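
The core of such a tool is a set difference between what AWS reports and what state records. The sketch below is not the script from our platform repo; it shows the idea for a single resource type, with illustrative function names, and assumes the IDs from state have already been written to a file, one per line.

```shell
# Sketch: find resources that exist in AWS but are missing from state.
AWS="${AWS:-aws}"   # overridable so the functions can be exercised without AWS access

# IDs of transit gateway VPC attachments currently in the "available" state.
aws_attachment_ids() {
  "$AWS" ec2 describe-transit-gateway-vpc-attachments \
    --filters Name=state,Values=available \
    --query 'TransitGatewayVpcAttachments[].TransitGatewayAttachmentId' \
    --output text | tr '\t' '\n' | sort -u
}

# Print IDs present in file $1 (from AWS) but absent from file $2 (from state).
only_in_aws() {
  sort -u "$1" > "$1.sorted"
  sort -u "$2" > "$2.sorted"
  comm -23 "$1.sorted" "$2.sorted"
  rm -f "$1.sorted" "$2.sorted"
}

# Usage:
#   aws_attachment_ids > aws-ids.txt
#   only_in_aws aws-ids.txt state-ids.txt
```

Swapping the two arguments gives the opposite view: entries state still claims that AWS no longer has.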

Things we considered and rejected

Switching backends

We looked at HCP Terraform (formerly Terraform Cloud) and its managed state. It handles locking more gracefully and has proper run queues. We didn't switch, mostly because the migration cost for our number of workspaces was real and the underlying problem (apply dies mid-flight, state drifts) exists on every backend. A better lock doesn't save you from reconciliation work.

DynamoDB TTL on the lock table

Tempting and wrong. A TTL that auto-deletes a lock after, say, an hour, would have unlocked us automatically — and then the next engineer would have applied on top of a state file that didn't match reality. The lock is doing its job by staying stuck. It's a signal, not a bug.

Banning force-unlock

We briefly discussed making the command unavailable in CI. We kept it available but gated it behind a script that requires a linked incident ticket and prints the reconciliation checklist before running. Friction in the right place.
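
A gate like that can be very small. A minimal sketch, assuming the ticket reference is passed as a plain argument (a real gate would presumably validate it against the ticketing system); the function name and checklist wording are illustrative.

```shell
# Sketch: a wrapper that adds friction before terraform force-unlock.
TF="${TF:-terraform}"   # overridable so the function can be exercised without Terraform

guarded_force_unlock() {
  # $1 = incident ticket reference (format is an assumption), $2 = lock ID
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: guarded_force_unlock <incident-ticket> <lock-id>" >&2
    return 2
  fi
  cat >&2 <<'EOF'
Before releasing the lock, confirm:
  1. The pipeline for this workspace is paused.
  2. State object versions are copied to quarantine.
  3. State has been reconciled against what AWS actually has.
EOF
  echo "incident: $1" >&2
  # No -force flag: Terraform still asks for interactive confirmation.
  "$TF" force-unlock "$2"
}
```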

The uncomfortable lesson

The failure wasn't really about Terraform. It was about an unspoken assumption that the apply step was atomic. It isn't. There's always a window between "AWS changed" and "state file updated", and your operational model has to account for that window being interrupted by anything from a flaky network to a colleague's well-intentioned IAM PR.

If you take one thing from this: the worst thing you can do when a Terraform lock is stuck is run force-unlock first. The lock is the only thing stopping the situation from getting worse.

Where we'd start

If you're running Terraform against AWS at any real scale and you haven't been bitten by this yet, three things to do this week:

Turn on S3 versioning for your state bucket if it isn't already. This is the single highest-value, lowest-effort change.
Add an alarm on lock-row age. Twenty minutes is a fine starting threshold.
Write the reconciliation runbook before you need it. Ours lives next to the on-call rota and gets reviewed quarterly.

If you'd rather have someone else build these guardrails into your platform, that's the kind of work our team does on our DevOps and cloud engagements. And if you've got your own war story, we'd genuinely like to read it.
