DevOps & Cloud · June 20, 2026 · 7 min read

Our AWS NAT Gateway Bill Hit $4k/Month. Here's How We Cut It by 80%.

A single NAT Gateway line item quietly ate our cloud budget. Here's the traffic audit, the VPC endpoint rollout, and the gotchas nobody mentions in the AWS docs.

The line item said NAT Gateway - Data Processed and the number next to it was the kind that makes you refresh the console. Roughly $4,000 a month on a workload that, by our own architecture diagrams, should have been mostly internal AWS-to-AWS traffic. This is the story of how we traced it, what we changed, and the parts of the AWS networking model that punish you for not reading the fine print.

The bill that didn't add up

We run a fairly standard setup for one of our client platforms: an ECS Fargate cluster in private subnets, an RDS Postgres instance, S3 for assets, SQS for job queues, and a handful of third-party APIs (Stripe, SendGrid, an analytics vendor). Two NAT Gateways for HA across AZs.

The monthly AWS bill had three big rocks: Fargate compute, RDS, and — surprisingly — NAT Gateway charges. Specifically, the data processing component. NAT Gateway pricing has two parts you care about:

An hourly charge per gateway (roughly $0.045/hr in most regions)
A per-GB data processing charge (roughly $0.045/GB)

The hourly part is fixed and boring. The per-GB part is where bills explode. At ~$0.045/GB, a service pushing 1 TB/day through NAT is burning ~$1,350/month before egress. We were doing closer to 3 TB/day across both AZs.
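To make that arithmetic concrete, here is a minimal sketch of the two pricing components. The rates are the approximate figures quoted above, and the 730-hour month, 30-day month, and decimal terabyte (1 TB = 1,000 GB) are assumptions chosen for illustration. The function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def nat_monthly_cost(gb_per_day, gateways=2, hourly_rate=0.045,
                     per_gb_rate=0.045, days=30):
    """Return (fixed_hourly_cost, data_processing_cost) in USD per month."""
    fixed = gateways * hourly_rate * HOURS_PER_MONTH
    data = gb_per_day * days * per_gb_rate
    return round(fixed, 2), round(data, 2)


# 1 TB/day (decimal TB) through two gateways
fixed_1tb, data_1tb = nat_monthly_cost(gb_per_day=1000)

# ~3 TB/day, roughly the volume described in this post
fixed_3tb, data_3tb = nat_monthly_cost(gb_per_day=3000)

print(f"1 TB/day: ${fixed_1tb} fixed + ${data_1tb} data processing")
print(f"3 TB/day: ${fixed_3tb} fixed + ${data_3tb} data processing")
```

The fixed part stays the same at any traffic level; only the data processing term scales with volume, which is why it dominates the bill.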

Why this is so easy to miss

In a typical VPC, anything in a private subnet that needs to reach any IP outside the VPC — including AWS service endpoints like s3.us-east-1.amazonaws.com — routes through the NAT Gateway by default. The AWS SDK calls feel "internal" because you're talking to an AWS service, but the packets are leaving your VPC, hitting the public AWS endpoint, and coming back. Every byte is billable NAT data.

This is the single biggest cost trap we see on mid-sized AWS accounts.

Step 1: figure out what's actually going through NAT

Before touching anything, we needed a traffic breakdown. VPC Flow Logs to the rescue, but raw flow logs are unreadable. We piped them to S3 and queried with Athena.

Here's the table definition we used (trimmed for clarity):

CREATE EXTERNAL TABLE vpc_flow_logs (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol int,
  packets bigint,
  bytes bigint,
  start_time bigint,
  end_time bigint,
  action string,
  log_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://our-flow-logs/AWSLogs/.../vpcflowlogs/';

Then the question we actually cared about: which destinations are eating our NAT bandwidth?

SELECT
  dstaddr,
  SUM(bytes) / POWER(1024, 3) AS gb,  -- POWER returns a double, so small totals are not truncated by integer division
  COUNT(*) AS flows
FROM vpc_flow_logs
WHERE dt = '2026-01-15'
  AND srcaddr LIKE '10.0.%'    -- our private CIDR
  AND dstaddr NOT LIKE '10.0.%' -- leaving the VPC
GROUP BY dstaddr
ORDER BY gb DESC
LIMIT 50;

The top of the list was unambiguous:

S3 endpoints — roughly 58% of NAT bytes. Image processing workers pulling and pushing originals.
CloudWatch Logs ingestion endpoints — roughly 14%. Verbose application logging from Fargate tasks.
ECR endpoints — roughly 9%. Image pulls on every Fargate task start, and our deploys were frequent.
Secrets Manager — small in bytes, large in request count.
Third-party APIs (Stripe, SendGrid, analytics vendor) — the remaining ~15%.

Four of the top five were AWS services we could keep entirely inside our VPC with endpoints. That's the 80%+ we were leaking.
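The Athena query returns IP addresses, not service names, so some mapping step is needed to get to a breakdown like the one above. One possible approach (a sketch, not necessarily how we did it) is to match each destination against the prefix list AWS publishes at https://ip-ranges.amazonaws.com/ip-ranges.json. The function names below are hypothetical, and the sample prefixes are documentation-only addresses (203.0.113.0/24), not real AWS ranges. Many services have no tag of their own in that file and only appear under the generic AMAZON or EC2 entries, so expect to fall back on other clues for those.

```python
import ipaddress
from collections import defaultdict


def build_index(ip_ranges):
    """Turn a parsed ip-ranges.json document into (network, service) pairs."""
    index = [
        (ipaddress.ip_network(entry["ip_prefix"]), entry["service"])
        for entry in ip_ranges["prefixes"]
    ]
    # Most specific prefix first; on ties, prefer a named service over
    # the catch-all "AMAZON" tag that covers the same range.
    index.sort(key=lambda item: (-item[0].prefixlen, item[1] == "AMAZON"))
    return index


def classify(addr, index):
    """Return the service tag for an IP, or "non-AWS" if nothing matches."""
    ip = ipaddress.ip_address(addr)
    for network, service in index:
        if ip in network:
            return service
    return "non-AWS"


def bytes_by_service(rows, index):
    """Sum bytes per service for (dstaddr, bytes) rows from the query."""
    totals = defaultdict(int)
    for dstaddr, nbytes in rows:
        totals[classify(dstaddr, index)] += nbytes
    return dict(totals)


# Made-up sample data in the same shape as ip-ranges.json
sample_ranges = {
    "prefixes": [
        {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "AMAZON"},
        {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "S3"},
    ]
}
sample_index = build_index(sample_ranges)
sample_rows = [("203.0.113.10", 100), ("198.51.100.7", 50), ("203.0.113.20", 25)]
print(bytes_by_service(sample_rows, sample_index))
```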

Step 2: Gateway endpoints first, because they're free

AWS offers two endpoint flavours:

Gateway endpoints: S3 and DynamoDB only. Free. No hourly charge, no data processing charge. You add an entry to your route tables and you're done.
Interface endpoints (PrivateLink): Most other AWS services. They cost ~$0.01/hr per endpoint per AZ, plus ~$0.01/GB processed. Cheaper than NAT per GB, but not free.

The S3 gateway endpoint was the obvious first move. The Terraform was trivial:

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id

  policy = data.aws_iam_policy_document.s3_endpoint.json
}

One caveat that bit us during testing: a gateway endpoint only covers S3 traffic to the VPC's own region, because its route targets the regional S3 prefix list. If your code is hardcoded to a bucket in a different region or to s3.amazonaws.com (the global endpoint), the route table entry may not match and the traffic still goes out through NAT. We had two services doing exactly this. Fixed in config, redeployed.

Result: NAT data processing dropped by roughly 55% the day the S3 endpoint went live. That alone paid for the engineering time several times over.

Step 3: Interface endpoints, with math

For CloudWatch Logs, ECR, Secrets Manager, SQS, and a few others, we needed interface endpoints. These aren't free, so you have to do the math.

For each candidate service, the break-even is roughly:

If NAT data for service X exceeds roughly 200 GB/month per AZ, an interface endpoint is cheaper.

That's a rough heuristic. The real formula compares the endpoint hourly cost (~$7.30/month per AZ) plus its per-GB charge against NAT's per-GB charge: each GB moved to the endpoint saves about $0.035 ($0.045 minus $0.01), so the fixed fee is recovered at around 7.30 / 0.035 ≈ 210 GB. For high-volume services it's not close: interface endpoints win easily. For low-volume services like Secrets Manager (which we hit a lot but in tiny payloads), the hourly fee can actually make it more expensive than just letting it go through NAT.
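The comparison can be sketched in a few lines. This assumes the approximate rates quoted in this post and a 730-hour month; the function names are hypothetical, and your region's actual prices should be substituted in. It also assumes the NAT Gateway itself stays in place for third-party traffic, so its hourly charge is not part of the saving.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def break_even_gb(endpoint_hourly=0.01, endpoint_per_gb=0.01, nat_per_gb=0.045):
    """GB/month per AZ above which an interface endpoint beats NAT."""
    fixed_fee = endpoint_hourly * HOURS_PER_MONTH
    saving_per_gb = nat_per_gb - endpoint_per_gb
    return fixed_fee / saving_per_gb


def monthly_saving(gb_per_month, endpoint_hourly=0.01, endpoint_per_gb=0.01,
                   nat_per_gb=0.045):
    """USD saved per AZ per month by using the endpoint (negative = costs more)."""
    fixed_fee = endpoint_hourly * HOURS_PER_MONTH
    return gb_per_month * (nat_per_gb - endpoint_per_gb) - fixed_fee


print(f"break-even: {break_even_gb():.0f} GB/month per AZ")
print(f"3 GB/month:    {monthly_saving(3):+.2f} USD")
print(f"5000 GB/month: {monthly_saving(5000):+.2f} USD")
```

A service moving a few GB a month ends up in the red by almost the full hourly fee, while anything in the terabyte range pays for its endpoint many times over.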

We ended up adding interface endpoints for:

logs (CloudWatch Logs) — big win
ecr.api and ecr.dkr (both required for pulls) — big win
sqs — moderate win
sts — small but worth it because of how often we hit it

We deliberately did not add endpoints for Secrets Manager and KMS in this VPC. Their combined NAT footprint was only a few GB/month; the interface endpoint hourly cost would have exceeded the NAT savings.

The ECR gotcha

ECR is a two-endpoint service: ecr.api for the ECR API calls and ecr.dkr for the Docker registry API that image pulls talk to. Add only one and pulls still depend on NAT, so you've paid for an endpoint that does little for your bill. The image layers themselves come from S3, so you also need the S3 gateway endpoint for ECR pulls to stay off NAT. We had all three in place, but it still pays to verify in stages during rollout: pull a test image, check flow logs, confirm no NAT traffic, then roll out broadly.

Step 4: the boring 15% that we left alone

Third-party API traffic to Stripe, SendGrid, and our analytics vendor still goes through NAT. There's no clean way around this without a forward proxy, and the volume was modest enough that it wasn't worth the operational complexity. PrivateLink-to-third-party-SaaS is a real option for some vendors, but the per-vendor setup cost rarely pencils out below a certain spend.

One thing we did clean up: a misbehaving worker that was pulling a 40 MB config file from an external CDN on every job. Cached it in S3 instead. Small fix, surprisingly visible on the graph.

What the bill looks like now

After about three weeks of incremental rollout, NAT data processing charges dropped from ~$3,800/month to ~$700/month. We added roughly $90/month in interface endpoint costs. Net savings: a bit over $3,000/month on a single account, with no application changes beyond fixing the two hardcoded S3 endpoints.
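As a quick sanity check on those figures, here is the arithmetic spelled out. The inputs are the rounded monthly numbers from the paragraph above, so the percentages are approximate.

```python
nat_before = 3800      # USD/month, NAT data processing before the rollout
nat_after = 700        # USD/month, NAT data processing after
endpoint_costs = 90    # USD/month, new interface endpoint charges

gross_saving = nat_before - nat_after
net_saving = gross_saving - endpoint_costs

nat_reduction_pct = round(100 * gross_saving / nat_before, 1)
net_reduction_pct = round(100 * net_saving / nat_before, 1)

print(f"net saving: ${net_saving}/month")
print(f"NAT data processing reduction: {nat_reduction_pct}%")
print(f"reduction net of endpoint fees: {net_reduction_pct}%")
```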

The other quiet win: ECR pulls got noticeably faster on Fargate cold starts, because we were no longer round-tripping through a NAT Gateway to reach an AWS service in the same region. We didn't benchmark it rigorously, but task start times felt tighter in our deploy logs.

What we'd do first on a new account

If you're standing up a fresh AWS environment in 2026, do this on day one:

Add the S3 and DynamoDB gateway endpoints to every VPC that has private subnets. They're free. There is no reason not to.
Turn on VPC Flow Logs from the start, even if you don't query them yet. You can't optimize what you didn't log.
Set a CloudWatch alarm on the BytesOutToDestination metric in the AWS/NATGateway namespace so a runaway service shows up on a Slack channel before it shows up on a bill.
Before adding any interface endpoint, check the traffic volume. Don't blanket-deploy them — the hourly fees add up if you over-provision across many AZs and many VPCs.
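For the alarm in that list, a minimal sketch of the parameters for CloudWatch's PutMetricAlarm call might look like the following. The helper names, the 50 GB/hour threshold, and the SNS topic (assumed to be wired to Slack by some other means) are all placeholders for illustration; pick a threshold from your own baseline.

```python
def nat_bytes_alarm(nat_gateway_id, sns_topic_arn, threshold_gb_per_hour=50):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"nat-bytes-out-{nat_gateway_id}",
        "Namespace": "AWS/NATGateway",
        "MetricName": "BytesOutToDestination",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 3600,                 # one-hour buckets
        "EvaluationPeriods": 1,
        "Threshold": threshold_gb_per_hour * 1024 ** 3,  # metric is in bytes
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, alarm_kwargs):
    """Create or update the alarm using a boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**alarm_kwargs)
```

With boto3 installed and credentials configured, usage would be along the lines of create_alarm(boto3.client("cloudwatch"), nat_bytes_alarm("nat-0123456789abcdef0", topic_arn)), where both IDs are placeholders. BytesOutToDestination only counts traffic leaving through the gateway, so a download-heavy workload may warrant a second alarm on BytesInFromDestination.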

If you're already running and your NAT line item looks suspicious, start with the Athena query above on a single day of flow logs. The answer is almost always sitting in the top three destinations, and it's almost always cheaper to fix than to ignore. If you want a hand auditing this on your own infrastructure, our DevOps and cloud team does this kind of work — but honestly, this one's straightforward enough to take on yourselves first.

#AWS #Cost Optimization #Networking #DevOps
