The AWS NAT Gateway Bill That Ate Our Margin
A quiet $40/day NAT Gateway line item turned into the second-largest cost on our AWS account. Here's how we found it, what was actually driving it, and the VPC endpoint plumbing that fixed it.

We thought our AWS bill was boring. Compute was predictable, S3 was cheap, RDS was sized about right. Then a finance review flagged a line called NAT Gateway that was quietly running around $40 a day — more than our production database. This is the story of how we tracked it down, what we got wrong on the first pass, and the unglamorous networking changes that actually moved it.
The setup, because context matters
The workload was a mid-sized EKS cluster running about 30 services for a B2B SaaS product. Standard three-AZ VPC, private subnets for nodes, one NAT Gateway per AZ for egress. Nothing exotic. Traffic patterns were boring too: a lot of internal service-to-service calls, some calls to S3 and DynamoDB, a few third-party APIs, and the usual pull of container images from ECR.
We'd set this up two years earlier following the AWS reference architecture more or less verbatim. It worked. It was also bleeding money in a way nobody had bothered to look at.
What the bill actually showed
The AWS Cost Explorer breakdown for NAT Gateway has two components that matter:
- Hourly charge for each gateway (around $0.045/hour per AZ in us-east-1, so roughly $33/month per gateway, or about $100/month for three AZs)
- Data processing charge of $0.045 per GB of traffic through the gateway
The hourly cost was fine. The data processing charge was the killer: we were pushing roughly 25 GB/day through NAT, which on three gateways with cross-AZ effects worked out to the bulk of that $40/day. The question was what traffic, because the gateway itself doesn't tell you.
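To make the split between the two components concrete, here is a minimal sketch of the arithmetic. It assumes the us-east-1 rates quoted above and a 730-hour month; the function name and the example traffic volume are hypothetical, so substitute your own numbers.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def nat_monthly_cost(gateways, gb_per_day, hourly_rate=0.045,
                     per_gb_rate=0.045, days=30):
    """Return (fixed, data) monthly NAT Gateway cost in USD.

    fixed: hourly charge summed over all gateways
    data:  per-GB data processing charge for traffic through NAT
    """
    fixed = gateways * hourly_rate * HOURS_PER_MONTH
    data = gb_per_day * days * per_gb_rate
    return round(fixed, 2), round(data, 2)


# Hypothetical example: three gateways, 500 GB/day through NAT.
fixed, data = nat_monthly_cost(gateways=3, gb_per_day=500)
print(f"fixed: ${fixed}/month, data processing: ${data}/month")
```

The fixed part only changes when you add or remove gateways. The data part scales with every byte, which is why it is the one worth chasing.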
Finding the source: VPC flow logs and a lot of squinting
NAT Gateway billing is opaque by design. CloudWatch metrics give you BytesOutToDestination and BytesInFromDestination, but not per-source or per-destination. To get that, you need VPC flow logs.
We turned on flow logs for the private subnets, sent them to S3, and queried with Athena. The query that finally gave us a useful answer:
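-- Top talkers leaving the VPC: private sources, non-private destinations
SELECT
    srcaddr,
    dstaddr,
    -- POWER() returns a double, so small totals are not truncated by integer division
    SUM(bytes) / POWER(1024, 3) AS gb_total
FROM vpc_flow_logs
WHERE
    date >= '2026/01/01'
    AND action = 'ACCEPT'
    -- assumes the VPC CIDR sits inside 10.0.0.0/8
    AND srcaddr LIKE '10.%'
    AND dstaddr NOT LIKE '10.%'
GROUP BY srcaddr, dstaddr
ORDER BY gb_total DESC
LIMIT 50;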
The top destinations were not what we expected. We assumed third-party API traffic was the problem. It wasn't. The top three buckets, in order, were:
- S3 endpoints (s3.us-east-1.amazonaws.com ranges): about 45% of NAT traffic
- ECR image pulls during pod restarts and rolling deploys: about 20%
- CloudWatch Logs ingestion from sidecars: about 12%
The rest was a long tail of legitimate egress: webhooks, OAuth callbacks, Stripe, a couple of analytics vendors.
In other words, ~77% of our NAT data processing charges were for traffic to AWS services in the same region. We were paying AWS a per-GB toll to reach AWS's own public endpoints through a NAT Gateway. That's the part that stings once you see it.
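Flow logs only give you destination IPs, so getting to those buckets means mapping addresses back to services. One way to do that is a lookup against AWS's published ip-ranges.json; the sketch below is an illustration of that approach, not the exact script we ran. The sample data mimics the file's shape (`prefixes` entries with `ip_prefix`, `region`, `service`), but the CIDRs are documentation ranges rather than real AWS prefixes, and `build_index` and `classify` are hypothetical helper names.

```python
import ipaddress

# Hypothetical sample in the shape of ip-ranges.json. The CIDRs are
# illustrative documentation ranges, NOT real AWS prefixes.
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "198.51.0.0/16", "region": "us-east-1", "service": "AMAZON"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "S3"},
        {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "DYNAMODB"},
    ]
}


def build_index(ranges, region):
    """Return (network, service) pairs for one region, most specific first."""
    nets = [
        (ipaddress.ip_network(p["ip_prefix"]), p["service"])
        for p in ranges["prefixes"]
        if p["region"] == region
    ]
    # Longer prefixes win; on ties prefer a named service over the
    # catch-all "AMAZON" entry that covers the same range.
    nets.sort(key=lambda n: (n[0].prefixlen, n[1] != "AMAZON"), reverse=True)
    return nets


def classify(ip, index):
    """Map a destination IP to a service name, or EXTERNAL if unmatched."""
    addr = ipaddress.ip_address(ip)
    for net, service in index:
        if addr in net:
            return service
    return "EXTERNAL"


index = build_index(SAMPLE_RANGES, "us-east-1")
print(classify("198.51.100.7", index))
```

Summing `gb_total` per classified service instead of per IP is what turns a list of 50 anonymous addresses into a handful of percentages.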
Why this happens by default
When a pod in a private subnet calls s3.us-east-1.amazonaws.com, DNS resolves to a public IP. The route table for the private subnet sends 0.0.0.0/0 to the NAT Gateway. The NAT gateway forwards it out, AWS routes it back into S3, and you get charged $0.045/GB for the privilege. Nothing is broken. Nothing is misconfigured. It's just expensive.
The fix, in three layers
We fixed this in three passes over about two weeks. The order matters because each step makes the next one easier to measure.
Layer 1: Gateway endpoints for S3 and DynamoDB
S3 and DynamoDB are special. They support gateway VPC endpoints, which are free. No hourly charge, no data processing charge. You add an endpoint to your VPC, attach it to the relevant route tables, and traffic to those services stays on the AWS backbone.
In Terraform:
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}
This is genuinely free money. Adding it killed the S3 traffic through NAT almost entirely. The bill dropped roughly 40% within 24 hours. There is exactly one footgun: if you have endpoint policies attached, make sure they actually allow the actions your workloads need. We accidentally locked out a backup job for about ten minutes.
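The reason a gateway endpoint pulls traffic off NAT without touching the default route is that route tables pick the most specific matching route. The endpoint adds routes for the service's managed prefix list, and those beat 0.0.0.0/0. Here is a toy model of that selection; the S3 prefix and the target names are made up for illustration, and `pick_route` is a hypothetical helper.

```python
import ipaddress


def pick_route(dest_ip, routes):
    """Return the target of the most specific route matching dest_ip.

    routes is a list of (cidr, target) pairs, like a simplified route table.
    """
    addr = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes:
        net = ipaddress.ip_network(cidr)
        if addr in net:
            matches.append((net.prefixlen, target))
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches)[1]


# Private subnet route table before the endpoint exists.
before = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]

# After: the endpoint contributes a more specific route.
# 198.51.100.0/24 stands in for an S3 prefix and is NOT a real AWS range.
after = before + [("198.51.100.0/24", "vpce-s3")]

print(pick_route("198.51.100.7", before))
print(pick_route("198.51.100.7", after))
```

Everything that is not S3 or DynamoDB still falls through to the default route, so third-party egress keeps working unchanged.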
Layer 2: Interface endpoints for ECR and CloudWatch Logs
ECR, CloudWatch Logs, STS, SSM, and most other AWS services use interface endpoints (PrivateLink). These are not free — about $0.01/hour per endpoint per AZ, plus $0.01/GB data processing — but they're cheaper than NAT for any non-trivial traffic volume.
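The math: an interface endpoint across three AZs costs about $22/month in fixed fees (3 × $0.01/hour × ~730 hours). Each GB moved off NAT saves $0.045 - $0.01 = $0.035, so the endpoint pays for itself above roughly 625 GB/month of traffic to that service, before you even count cross-AZ NAT effects.
For ECR you actually need three endpoints to do it properly: ecr.api, ecr.dkr, and the S3 gateway endpoint from Layer 1, because image layers are served from S3. We created the two ECR interface endpoints in the same loop as the others: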
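The break-even depends on whether you count the endpoint's own per-GB fee. This sketch does, using the us-east-1 rates quoted above and an assumed 730-hour month; the function name is hypothetical.

```python
def interface_endpoint_break_even_gb(azs=3, endpoint_hourly=0.01,
                                     endpoint_per_gb=0.01, nat_per_gb=0.045,
                                     hours_per_month=730):
    """Monthly GB above which an interface endpoint is cheaper than NAT.

    Solves: fixed_fee + endpoint_per_gb * x = nat_per_gb * x
    """
    fixed_fee = azs * endpoint_hourly * hours_per_month
    saving_per_gb = nat_per_gb - endpoint_per_gb
    return fixed_fee / saving_per_gb


print(round(interface_endpoint_break_even_gb()))
# Ignoring the endpoint's per-GB fee gives the simpler, lower estimate.
print(round(interface_endpoint_break_even_gb(endpoint_per_gb=0)))
```

With the defaults the threshold lands a little above 600 GB/month. Setting `endpoint_per_gb=0` reproduces the cruder comparison against the NAT rate alone, which comes out just under 500 GB/month. Either way, a chatty service clears it easily and a rarely used one does not.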
locals {
  interface_endpoints = [
    "ecr.api",
    "ecr.dkr",
    "logs",
    "sts",
    "ssm",
    "secretsmanager",
  ]
}

resource "aws_vpc_endpoint" "interface" {
  for_each = toset(local.interface_endpoints)

  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
private_dns_enabled = true is what makes this transparent — the default AWS DNS names resolve to the endpoint instead of the public IP, so your pods need zero code changes. Just make sure the security group on the endpoint allows 443 from the node CIDR ranges. That tripped us up for about an hour.
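A quick way to confirm private DNS is doing its job is to resolve the service hostnames from inside the VPC and check that the answers are private addresses. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the check assumes your endpoint ENIs live in private address space, which they will in a typical private subnet.

```python
import ipaddress
import socket


def resolve(hostname, port=443):
    """Return the sorted set of IP addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(addresses):
    """True if there is at least one address and all of them are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


def report(hostnames):
    """Print, per hostname, whether it resolves to the endpoint or a public IP."""
    for host in hostnames:
        addrs = resolve(host)
        status = "private (endpoint)" if resolves_privately(addrs) else "PUBLIC (still via NAT)"
        print(f"{host}: {addrs} -> {status}")
```

Run from a pod or node, something like `report(["api.ecr.us-east-1.amazonaws.com", "logs.us-east-1.amazonaws.com"])` should show private addresses once the endpoints are live. If DNS looks right but connections hang, the endpoint security group is the first thing to check.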
Layer 3: Single NAT Gateway for non-critical environments
The reference architecture's one NAT per AZ is about availability: if an AZ fails, the others keep working. For our staging and development VPCs, we don't actually need that. The cost of three NATs at idle hourly rates is roughly $100/month per VPC, and we had four non-prod VPCs.
We collapsed those to a single NAT each. Production stayed three-AZ because that AZ-failure scenario is real and we've actually used it once. But for non-prod, an hour of egress downtime during an AZ event is a perfectly acceptable tradeoff for roughly $65/month saved per environment.
What we'd watch out for next time
A few things we'd do differently if we were setting this up from scratch:
- Add the S3 and DynamoDB gateway endpoints on day one. They're free, they're trivial, there's no reason not to. Bake them into your base VPC module.
- Turn on VPC flow logs from day one, even at a sampled rate. When the bill goes weird, you need data going back, not data starting now. The cost is small if you sample (we sample at 1:10 for non-prod).
- Treat NAT Gateway data processing as a metric, not just a line item. Put BytesOutToDestination on a dashboard next to your app metrics. A traffic anomaly here often signals a misbehaving service before it shows up anywhere else; we once caught a retry loop hammering a third-party API because NAT traffic doubled overnight.
- Don't assume the reference architecture is optimised. It's optimised for getting started, not for steady state. Revisit it once you have real traffic patterns.
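For the NAT-traffic-as-a-metric point, even a crude check catches the "doubled overnight" case. This sketch assumes you already have daily BytesOutToDestination totals as a list of numbers (for example, exported from CloudWatch); the window and the 2x threshold are arbitrary starting points, and `flag_spikes` is a hypothetical name.

```python
from statistics import median


def flag_spikes(daily_bytes, window=7, factor=2.0):
    """Return indices of days exceeding factor x the trailing median.

    The baseline for day i is the median of the previous `window` days,
    so a single earlier spike does not inflate it much.
    """
    spikes = []
    for i in range(window, len(daily_bytes)):
        baseline = median(daily_bytes[i - window:i])
        if baseline > 0 and daily_bytes[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Hypothetical daily totals in GB; day 8 is the anomaly.
history = [100, 110, 95, 105, 100, 98, 102, 101, 230, 104]
print(flag_spikes(history))
```

A median baseline is used rather than a mean so that one bad day does not mask the next one. A proper CloudWatch alarm is the sturdier long-term option.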
Where we'd start
If you're reading this and haven't looked at your NAT bill recently, do this today: open Cost Explorer, filter by service EC2 - Other, group by usage type, and look for NatGateway-Bytes. If that number is more than your hourly NAT charges, you almost certainly have AWS-to-AWS traffic going out through NAT. Add the S3 and DynamoDB gateway endpoints first — it's a 10-minute Terraform change and you'll see the impact on the next day's bill. Everything else can wait until you have flow log data to justify it.