DevOps & Cloud | May 22, 2026 | 6 min read

The AWS NAT Gateway Bill That Ate Our Margin

A quiet $40/day NAT Gateway line item turned into the second-largest cost on our AWS account. Here's how we found it, what was actually driving it, and the VPC endpoint plumbing that fixed it.

We thought our AWS bill was boring. Compute was predictable, S3 was cheap, RDS was sized about right. Then a finance review flagged a line called NAT Gateway that was quietly running around $40 a day — more than our production database. This is the story of how we tracked it down, what we got wrong on the first pass, and the unglamorous networking changes that actually moved it.

The setup, because context matters

The workload was a mid-sized EKS cluster running about 30 services for a B2B SaaS product. Standard three-AZ VPC, private subnets for nodes, one NAT Gateway per AZ for egress. Nothing exotic. Traffic patterns were boring too: a lot of internal service-to-service calls, some calls to S3 and DynamoDB, a few third-party APIs, and the usual pull of container images from ECR.

We'd set this up two years earlier following the AWS reference architecture more or less verbatim. It worked. It was also bleeding money in a way nobody had bothered to look at.

What the bill actually showed

The AWS Cost Explorer breakdown for NAT Gateway has two components that matter:

Hourly charge for each gateway (around $0.045/hour in us-east-1, so roughly $33/month per gateway, or about $100/month for three AZs)
Data processing charge of $0.045 per GB of traffic through the gateway, counted in both directions

The hourly cost was fine: three gateways come to about $3.24/day. The data processing charge was the killer. It accounted for the remaining $37 or so of that $40/day, which at $0.045/GB implies on the order of 800 GB/day moving through NAT. The question was what traffic, because the gateway itself doesn't tell you.
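To sanity-check a NAT bill, it helps to split it into the two components above and back out the traffic volume a given daily total implies. The sketch below uses the us-east-1 prices quoted in this post; the function names are made up for illustration, and current pricing should be checked before reusing the constants.

```python
# Prices as quoted in this post for us-east-1 (assumption: still current).
NAT_HOURLY_USD = 0.045   # per gateway-hour
NAT_PER_GB_USD = 0.045   # per GB processed


def nat_daily_cost(gateways: int, gb_per_day: float) -> dict:
    """Split one day's NAT Gateway cost into its two billed components."""
    hourly = gateways * NAT_HOURLY_USD * 24
    data = gb_per_day * NAT_PER_GB_USD
    return {"hourly": hourly, "data": data, "total": hourly + data}


def implied_gb_per_day(gateways: int, daily_bill_usd: float) -> float:
    """Back out the daily traffic volume that a given daily bill implies."""
    hourly = gateways * NAT_HOURLY_USD * 24
    return max(daily_bill_usd - hourly, 0.0) / NAT_PER_GB_USD


# Three gateways on a $40/day bill: most of the cost has to be data processing.
print(round(implied_gb_per_day(gateways=3, daily_bill_usd=40.0)), "GB/day implied")
```

If the implied volume looks far larger than what the application should be sending to the outside world, that is the cue to go and find out where the bytes are going.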

Finding the source: VPC flow logs and a lot of squinting

NAT Gateway billing is opaque by design. CloudWatch metrics give you BytesOutToDestination and BytesInFromDestination, but not per-source or per-destination. To get that, you need VPC flow logs.

We turned on flow logs for the private subnets, sent them to S3, and queried with Athena. The query that finally gave us a useful answer:

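-- Bytes sent from private addresses to destinations outside the VPC.
-- Assumes the VPC CIDR falls inside 10.0.0.0/8. Response bytes arrive as
-- separate records with srcaddr and dstaddr swapped and are not counted here.
SELECT
  srcaddr,
  dstaddr,
  SUM(bytes) / 1024 / 1024 / 1024 AS gb_total
FROM vpc_flow_logs
WHERE
  date >= '2026/01/01'
  AND action = 'ACCEPT'
  AND srcaddr LIKE '10.%'
  AND dstaddr NOT LIKE '10.%'
GROUP BY srcaddr, dstaddr
ORDER BY gb_total DESC
LIMIT 50;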
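NAT data processing is billed in both directions, and S3 reads and image pulls are mostly response bytes, so it can be worth totalling traffic per external peer regardless of direction. The following is a minimal sketch of that aggregation over raw flow log lines, assuming the default record format and a VPC CIDR inside 10.0.0.0/8. The function name and the sample records are invented for illustration.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# Assumption: the VPC CIDR sits inside 10.0.0.0/8, as in the Athena query.
VPC_NET = ip_network("10.0.0.0/8")


def external_bytes_by_peer(lines):
    """Sum bytes per external peer, counting both directions of each flow.

    Expects the default flow log field order:
    version account-id interface-id srcaddr dstaddr srcport dstport
    protocol packets bytes start end action log-status
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 14 or fields[12] != "ACCEPT":
            continue  # skips header rows, NODATA/SKIPDATA rows and rejects
        src, dst = ip_address(fields[3]), ip_address(fields[4])
        src_inside, dst_inside = src in VPC_NET, dst in VPC_NET
        if src_inside == dst_inside:
            continue  # purely internal (or purely external) traffic
        peer = dst if src_inside else src
        totals[str(peer)] += int(fields[9])
    return dict(totals)


# Made-up records: a small request, a large response, one internal flow,
# and one NODATA row.
SAMPLE_LINES = [
    "2 123456789012 eni-0a 10.0.1.5 52.216.0.10 44321 443 6 10 1000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a 52.216.0.10 10.0.1.5 443 44321 6 90 9000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a 10.0.1.5 10.0.2.9 5000 8080 6 5 500 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a - - - - - - - 1700000000 1700000060 - NODATA",
]
print(external_bytes_by_peer(SAMPLE_LINES))
```

In the sample, the response is nine times the size of the request, which is the pattern an outbound-only total would understate.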

The top destinations were not what we expected. We assumed third-party API traffic was the problem. It wasn't. The top three buckets, in order, were:

S3 endpoints (s3.us-east-1.amazonaws.com ranges) — about 45% of NAT traffic
ECR image pulls during pod restarts and rolling deploys — about 20%
CloudWatch Logs ingestion from sidecars — about 12%

The rest was a long tail of legitimate egress: webhooks, OAuth callbacks, Stripe, a couple of analytics vendors.
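Flow logs only give IP addresses, so attributing traffic to a service needs a lookup step. AWS publishes its address ranges as JSON (ip-ranges.json), where each entry under "prefixes" carries an ip_prefix, a region and a service label. The sketch below labels an address from that structure. The function name is hypothetical and the sample data uses documentation address ranges rather than real AWS prefixes.

```python
from ipaddress import ip_address, ip_network


def aws_service_for(ip, ip_ranges, region="us-east-1"):
    """Return the most specific AWS service label for ip, or None."""
    addr = ip_address(ip)
    best = None  # (rank, service)
    for entry in ip_ranges.get("prefixes", []):
        if entry["region"] != region:
            continue
        net = ip_network(entry["ip_prefix"])
        if addr not in net:
            continue
        # "AMAZON" is the catch-all label, so rank it below any named
        # service; among equals, prefer the longer (more specific) prefix.
        rank = (entry["service"] != "AMAZON", net.prefixlen)
        if best is None or rank > best[0]:
            best = (rank, entry["service"])
    return best[1] if best else None


# Illustrative stand-in for the parsed ip-ranges.json (not real AWS prefixes).
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "198.51.100.0/22", "region": "us-east-1", "service": "AMAZON"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "S3"},
        {"ip_prefix": "203.0.113.0/24", "region": "eu-west-1", "service": "S3"},
    ]
}
print(aws_service_for("198.51.100.7", SAMPLE_RANGES))
```

The labels are coarse. S3 and DynamoDB have their own, but many services only show up under a generic label, so identifying something like ECR or CloudWatch Logs traffic usually takes extra evidence such as which pods were talking and when.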

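In other words, ~77% of our NAT data processing charges were for traffic to AWS services in the same region. That traffic never actually leaves the AWS network, but because it targets public service endpoints it still goes through the NAT Gateway and gets billed per GB. We were paying AWS to route traffic to AWS. That's the part that stings once you see it.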

Why this happens by default

When a pod in a private subnet calls s3.us-east-1.amazonaws.com, DNS resolves to a public IP. The route table for the private subnet sends 0.0.0.0/0 to the NAT Gateway. The NAT gateway forwards it out, AWS routes it back into S3, and you get charged $0.045/GB for the privilege. Nothing is broken. Nothing is misconfigured. It's just expensive.

The fix, in three layers

We fixed this in three passes over about two weeks. The order matters because each step makes the next one easier to measure.

Layer 1: Gateway endpoints for S3 and DynamoDB

S3 and DynamoDB are special. They support gateway VPC endpoints, which are free. No hourly charge, no data processing charge. You add an endpoint to your VPC, attach it to the relevant route tables, and traffic to those services stays on the AWS backbone.

In Terraform:

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

This is genuinely free money. Adding it killed the S3 traffic through NAT almost entirely. The bill dropped roughly 40% within 24 hours. There is exactly one footgun: if you have endpoint policies attached, make sure they actually allow the actions your workloads need. We accidentally locked out a backup job for about ten minutes.

Layer 2: Interface endpoints for ECR and CloudWatch Logs

ECR, CloudWatch Logs, STS, SSM, and most other AWS services use interface endpoints (PrivateLink). These are not free — about $0.01/hour per endpoint per AZ, plus $0.01/GB data processing — but they're cheaper than NAT for any non-trivial traffic volume.

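The math: an interface endpoint across three AZs costs about $22/month in fixed fees, and each GB through it saves $0.035 compared with NAT ($0.045 minus the endpoint's own $0.01). Break-even is therefore around 625 GB/month of traffic to that service. Anything above that is cheaper than going via NAT, before you even count cross-AZ NAT effects.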
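The break-even is easy to recompute for a different AZ count or price. This is a small sketch using the prices quoted above and 730 hours per month; the function names are illustrative. It counts the endpoint's own per-GB fee, which pushes the break-even somewhat above what the fixed fee alone would suggest.

```python
NAT_PER_GB_USD = 0.045
ENDPOINT_HOURLY_USD = 0.01   # per endpoint per AZ
ENDPOINT_PER_GB_USD = 0.01
HOURS_PER_MONTH = 730


def endpoint_fixed_monthly(azs: int) -> float:
    """Fixed monthly fee for one interface endpoint deployed in `azs` AZs."""
    return azs * ENDPOINT_HOURLY_USD * HOURS_PER_MONTH


def break_even_gb(azs: int) -> float:
    """Monthly GB at which the endpoint costs the same as the NAT data charge."""
    saving_per_gb = NAT_PER_GB_USD - ENDPOINT_PER_GB_USD
    return endpoint_fixed_monthly(azs) / saving_per_gb


def monthly_saving(azs: int, gb_per_month: float) -> float:
    """Net monthly saving from moving a service's traffic off NAT (may be negative)."""
    per_gb = NAT_PER_GB_USD - ENDPOINT_PER_GB_USD
    return gb_per_month * per_gb - endpoint_fixed_monthly(azs)


print(round(break_even_gb(azs=3)), "GB/month to break even")
```

A negative result from monthly_saving means the service is not busy enough to justify its own endpoint yet.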

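For ECR you actually need three endpoints to do it properly: the ecr.api and ecr.dkr interface endpoints, plus the S3 gateway endpoint from Layer 1, because image layers are stored in S3. The two ECR endpoints sit in the same list as the other interface endpoints we added: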

locals {
  interface_endpoints = [
    "ecr.api",
    "ecr.dkr",
    "logs",
    "sts",
    "ssm",
    "secretsmanager",
  ]
}

resource "aws_vpc_endpoint" "interface" {
  for_each = toset(local.interface_endpoints)

  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

private_dns_enabled = true is what makes this transparent — the default AWS DNS names resolve to the endpoint instead of the public IP, so your pods need zero code changes. Just make sure the security group on the endpoint allows 443 from the node CIDR ranges. That tripped us up for about an hour.

Layer 3: Single NAT Gateway for non-critical environments

The reference architecture's one NAT per AZ is about availability: if an AZ fails, the others keep working. For our staging and development VPCs, we don't actually need that. At $0.045/hour, three idle NATs cost roughly $100/month per VPC, and we had four non-prod VPCs.

We collapsed those to a single NAT each. Production stayed three-AZ because that AZ-failure scenario is real and we've actually used it once. But for non-prod, an hour of egress downtime during an AZ event is a perfectly acceptable tradeoff for roughly $65/month saved per environment. The one caveat is that nodes in the other two AZs now pay cross-AZ data transfer to reach the gateway, so this only pays off where egress volume is low.
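The fixed-fee side of that tradeoff can be checked with a few lines. This sketch assumes the $0.045/hour rate quoted earlier and 730 hours per month, and it deliberately ignores data processing and cross-AZ transfer, which depend on traffic. The function names are illustrative.

```python
NAT_HOURLY_USD = 0.045   # per gateway-hour, as quoted earlier in the post
HOURS_PER_MONTH = 730


def nat_fixed_monthly(gateways: int) -> float:
    """Hourly-charge component only, for a month of `gateways` NAT Gateways."""
    return gateways * NAT_HOURLY_USD * HOURS_PER_MONTH


def consolidation_saving(vpcs: int, before: int = 3, after: int = 1) -> float:
    """Monthly fixed-fee saving from reducing NAT Gateways per VPC."""
    return vpcs * (nat_fixed_monthly(before) - nat_fixed_monthly(after))


print(round(consolidation_saving(vpcs=4), 2), "USD/month across four VPCs")
```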

What we'd watch out for next time

A few things we'd do differently if we were setting this up from scratch:

Add the S3 and DynamoDB gateway endpoints on day one. They're free, they're trivial, there's no reason not to. Bake them into your base VPC module.
Turn on VPC flow logs from day one, even at a sampled rate. When the bill goes weird, you need data going back, not data starting now. The cost is small if you sample (we sample at 1:10 for non-prod).
Treat NAT Gateway data processing as a metric, not just a line item. Put BytesOutToDestination on a dashboard next to your app metrics. A traffic anomaly here often signals a misbehaving service before it shows up anywhere else — we once caught a retry loop hammering a third-party API because NAT traffic doubled overnight.
Don't assume the reference architecture is optimised. It's optimised for getting started, not for steady state. Revisit it once you have real traffic patterns.
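Treating NAT traffic as a metric can start very simply. As one possible sketch, the check below flags any day whose BytesOutToDestination total is at least double the median of the preceding week. The function name, the seven-day window and the 2x factor are assumptions chosen for illustration, not tuned values.

```python
from statistics import median


def nat_traffic_anomalies(daily_bytes, window=7, factor=2.0):
    """Return indexes of days at or above `factor` x the trailing median.

    `daily_bytes` is a list of daily totals, oldest first. The first
    `window` days are used only as baseline and are never flagged.
    """
    flagged = []
    for i in range(window, len(daily_bytes)):
        baseline = median(daily_bytes[i - window:i])
        if baseline > 0 and daily_bytes[i] >= factor * baseline:
            flagged.append(i)
    return flagged


# Made-up series: a steady week, a normal day, a day that doubles, then back.
series = [100] * 7 + [105, 210, 100]
print(nat_traffic_anomalies(series))
```

Using a median rather than a mean keeps a single earlier spike from raising the baseline and hiding the next one.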

Where we'd start

If you're reading this and haven't looked at your NAT bill recently, do this today: open Cost Explorer, filter by service EC2 - Other, group by usage type, and look for NatGateway-Bytes. If that number is more than your hourly NAT charges, you almost certainly have AWS-to-AWS traffic going out through NAT. Add the S3 and DynamoDB gateway endpoints first — it's a 10-minute Terraform change and you'll see the impact on the next day's bill. Everything else can wait until you have flow log data to justify it.
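The same comparison can be scripted against an exported cost report. The sketch below assumes rows of (usage type, cost) pairs, for example from a Cost Explorer CSV export, and matches usage types by suffix on the assumption that some regions prepend a region code to NatGateway-Bytes and NatGateway-Hours. The row format and function names are hypothetical.

```python
def nat_cost_split(rows):
    """Sum NAT Gateway cost by component from (usage_type, cost_usd) rows."""
    split = {"bytes": 0.0, "hours": 0.0}
    for usage_type, cost in rows:
        if usage_type.endswith("NatGateway-Bytes"):
            split["bytes"] += cost
        elif usage_type.endswith("NatGateway-Hours"):
            split["hours"] += cost
    return split


def data_processing_dominates(rows):
    """True when data processing costs more than the hourly charges."""
    split = nat_cost_split(rows)
    return split["bytes"] > split["hours"]


# Made-up monthly figures for illustration only.
rows = [
    ("NatGateway-Hours", 98.55),
    ("NatGateway-Bytes", 1100.00),
    ("USW2-NatGateway-Bytes", 25.00),
    ("EBS:VolumeUsage.gp3", 300.00),
]
print(nat_cost_split(rows), data_processing_dominates(rows))
```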

If you'd like a hand auditing your AWS footprint or rebuilding a VPC module that doesn't quietly leak money, our team does this kind of work — see our DevOps and cloud services or browse more posts in this category.

Tags: AWS, Cost Optimization, Networking, EKS, FinOps
