DevOps & Cloud · May 27, 2026 · 6 min read

How We Cut Our AWS Multi-AZ RDS Failover Time From 90s to 12s

A real story of trimming RDS failover from a customer-visible 90 seconds down to roughly 12. The fix wasn't a bigger instance. It was DNS caching, connection pools, and RDS Proxy.

A customer pinged us last spring: "Your checkout was down for 90 seconds." It was actually a healthy AWS RDS Multi-AZ failover during a minor engine patch. The database recovered fine. Our application did not. This is the story of why a 30-second AWS event turned into a 90-second user-visible outage, and how we got the next one down to about 12 seconds.

What Multi-AZ actually promises (and what it doesn't)

AWS markets RDS Multi-AZ as high availability with "typically 60–120 seconds" of failover time. That number describes how long until the standby is promoted and reachable on the DB instance endpoint. It says nothing about:

How long your application clients take to notice the old primary is gone
How long your connection pool holds onto dead TCP sockets
How long your DNS resolver caches the old A record
How your ORM behaves mid-transaction when the socket dies

In our case, the engine swapped in around 28 seconds. The remaining 60+ seconds was entirely our stack failing to let go of a corpse.

How we measured it

We instrumented three timestamps for the postmortem:

1. rds_failover_started: from the RDS event stream via EventBridge
2. first_successful_query_after_failover: application-side, logged per pod
3. checkout_p50_recovered: derived from our Sentry transaction data

The gap between (1) and (2) was where the pain lived. In our worst pod it was 78 seconds. In the best, 41. That variance was the first clue: this wasn't an AWS problem, it was a client-side lottery.
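The per-pod gap is simple to compute once both timestamps are logged. The following is a minimal sketch: the class name FailoverGap and the pod names are hypothetical, the start timestamp is a placeholder, and the sample offsets reuse the 78-second and 41-second gaps quoted above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical helper for illustration only.
final class FailoverGap {

    // Seconds between the failover event and one pod's first successful query.
    static long gapSeconds(Instant failoverStarted, Instant firstSuccessfulQuery) {
        return Duration.between(failoverStarted, firstSuccessfulQuery).getSeconds();
    }

    // Slowest pod: the largest gap across the fleet.
    static long worst(Instant failoverStarted, Map<String, Instant> firstQueryByPod) {
        return firstQueryByPod.values().stream()
                .mapToLong(t -> gapSeconds(failoverStarted, t))
                .max()
                .orElseThrow();
    }

    // Fastest pod: the smallest gap across the fleet.
    static long best(Instant failoverStarted, Map<String, Instant> firstQueryByPod) {
        return firstQueryByPod.values().stream()
                .mapToLong(t -> gapSeconds(failoverStarted, t))
                .min()
                .orElseThrow();
    }

    public static void main(String[] args) {
        Instant started = Instant.parse("2025-01-01T00:00:00Z"); // placeholder timestamp
        Map<String, Instant> pods = Map.of(
                "pod-a", started.plusSeconds(78),
                "pod-b", started.plusSeconds(41));
        System.out.println("worst=" + worst(started, pods) + "s best=" + best(started, pods) + "s");
    }
}
```

A wide spread between worst and best is the signal to look at client behavior rather than at the database.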

The four things eating our recovery time

1. JVM DNS caching

The JVM, by default, caches successful DNS lookups forever when a security manager is installed, and for 30 seconds otherwise. RDS failover works by repointing the DB instance endpoint's DNS record to the new primary. If your client never re-resolves, it keeps dialing a host that is now the demoted standby (or worse, nothing).

We were running a Spring Boot service on JDK 17. The fix was a one-liner in our entrypoint:

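# networkaddress.cache.ttl is a security property, so -D cannot set it; use the
# system-property fallback sun.net.inetaddr.ttl for the positive-lookup TTL.
# The negative TTL is a security property too: set it in java.security or via Security.setProperty.
export JAVA_TOOL_OPTIONS="-Dsun.net.inetaddr.ttl=5"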

Five seconds is a deliberate floor. Going lower hammered our VPC resolver during normal operation with no measurable benefit.
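The JVM reads its DNS cache policy from security properties, so the TTLs can also be set in code. This is a minimal sketch, assuming it runs at the very start of main, before the process resolves any hostname (the policy is read once, when the address cache is first initialized). The class name DnsCachePolicy is hypothetical.

```java
import java.security.Security;

// Hypothetical startup hook: call apply() before anything resolves a hostname.
final class DnsCachePolicy {

    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // Successful lookups: how long a resolved address may be reused.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(positiveTtlSeconds));
        // Failed lookups: how long a resolution failure is remembered.
        Security.setProperty("networkaddress.cache.negative.ttl", Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        apply(5, 2);
        System.out.println("positive ttl = " + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("negative ttl = " + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

Setting the values in code keeps them next to the service that depends on them, at the cost of having to guarantee the call happens early enough.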

2. HikariCP holding dead sockets

Hikari's default maxLifetime is 30 minutes and its keepaliveTime is disabled. After failover, the pool happily handed out sockets connected to an endpoint that no longer accepted writes. The first query would fail, the connection would be evicted, and the next request would get another dead connection from the same pool.

Our tuned config:

spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      connection-timeout: 3000
      validation-timeout: 2000
      # Critical: probe idle connections so a failover is noticed early
      keepalive-time: 30000
      max-lifetime: 600000
      # connection-test-query is deliberately unset: the PostgreSQL driver is
      # JDBC4, so Hikari validates with Connection.isValid() instead

The key change was keepalive-time, not connection-test-query. HikariCP recommends leaving the test query unset for JDBC4 drivers, which covers every modern driver. With keepalive enabled, Hikari probes idle connections every 30 seconds and discovers the failover quickly instead of waiting for a user request to find the problem.
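HikariCP documents two constraints on keepaliveTime: the minimum accepted value is 30 seconds, and it must be shorter than maxLifetime. A small guard along these lines can flag a datasource that breaks them or leaves keepalive off. It is a sketch with a hypothetical class name, and it works on plain millisecond values rather than on Hikari's own configuration object.

```java
import java.util.Optional;

// Hypothetical guard; takes plain millisecond values, not a HikariConfig.
final class PoolSettingsCheck {

    static final long MIN_KEEPALIVE_MS = 30_000; // Hikari's documented minimum

    // Returns a description of the problem, or empty when the settings look sane.
    static Optional<String> check(long keepaliveMs, long maxLifetimeMs) {
        if (keepaliveMs == 0) {
            return Optional.of("keepalive is disabled");
        }
        if (keepaliveMs < MIN_KEEPALIVE_MS) {
            return Optional.of("keepalive is below the 30s minimum");
        }
        // A maxLifetime of 0 means connections have no maximum lifetime.
        if (maxLifetimeMs != 0 && keepaliveMs >= maxLifetimeMs) {
            return Optional.of("keepalive must be shorter than max lifetime");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(check(30_000, 600_000).orElse("ok")); // the tuned values above
        System.out.println(check(0, 1_800_000).orElse("ok"));    // keepalive off, 30-minute lifetime
    }
}
```

Run against every datasource at startup or in CI, a check like this turns a silent default into a visible failure.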

3. TCP keepalive at the OS level

Even with Hikari probing, the OS itself was configured to wait two hours before it would even start probing an idle TCP connection. On Linux, the default tcp_keepalive_time is 7200 seconds. We dropped it on our application nodes:

sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=3

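This means a dead socket is detected within roughly 90 seconds at the kernel level even if nothing else catches it. The kernel only probes sockets that set SO_KEEPALIVE, which the PostgreSQL JDBC driver leaves off unless tcpKeepAlive=true is set on the connection. It's a safety net, not the primary mechanism.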
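The 90-second figure is the idle time before the first probe plus one interval per unanswered probe. The sketch below checks that arithmetic for the values above and for the Linux defaults (7200 seconds idle, 75-second interval, 9 probes). The helper name is hypothetical.

```java
// Hypothetical helper that reproduces the kernel keepalive arithmetic.
final class KeepaliveMath {

    // Worst case: idle time before the first probe, then every probe goes unanswered.
    static int worstCaseSeconds(int keepaliveTimeSec, int intervalSec, int probes) {
        return keepaliveTimeSec + intervalSec * probes;
    }

    public static void main(String[] args) {
        System.out.println("tuned:   " + worstCaseSeconds(60, 10, 3) + "s");
        System.out.println("default: " + worstCaseSeconds(7200, 75, 9) + "s");
    }
}
```

With the defaults the worst case is a little over two hours, which is why an untuned host effectively never notices a silently dead peer.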

4. In-flight transactions hanging

The nastiest failure mode: a transaction that had already issued BEGIN and a few writes when the primary disappeared. The socket would sit in ESTABLISHED state from the client's view, the query would block on recv(), and we'd hold the request thread until something timed out — sometimes minutes.

We added a statement timeout on every connection:

ALTER ROLE app_service SET statement_timeout = '8s';
ALTER ROLE app_service SET idle_in_transaction_session_timeout = '15s';

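Eight seconds is aggressive. It killed two slow background jobs we'd been ignoring, which was honestly a benefit. For anything that legitimately needs longer, we set the timeout per-session inside that job. Both settings are enforced by the server and take effect for new sessions of that role, so they bound slow statements and abandoned transactions on a live primary. A client blocked on recv() against a host that has vanished still needs a client-side limit, such as the JDBC driver's socketTimeout.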
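On the client side, the PostgreSQL JDBC driver exposes connectTimeout and socketTimeout (both in seconds) and tcpKeepAlive as connection parameters. Here is a minimal sketch of a URL builder that sets them. The class name, host, database name, and the specific values are placeholders chosen for illustration.

```java
// Hypothetical builder; host, database, and values are placeholders.
final class PgUrl {

    static String withClientTimeouts(String host, int port, String database,
                                     int connectTimeoutSec, int socketTimeoutSec) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?connectTimeout=" + connectTimeoutSec   // cap on establishing the TCP connection
                + "&socketTimeout=" + socketTimeoutSec     // cap on any single blocking read
                + "&tcpKeepAlive=true";                    // enable SO_KEEPALIVE on the socket
    }

    public static void main(String[] args) {
        System.out.println(withClientTimeouts("db.example.internal", 5432, "checkout", 3, 10));
    }
}
```

Keeping socketTimeout slightly above statement_timeout (10 seconds against 8 here) lets the server cancel a slow statement first on a healthy primary, so the socket limit only fires when the server has stopped answering altogether.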

Where RDS Proxy actually helped

We'd avoided RDS Proxy for years because of the per-vCPU cost and the latency overhead (in our experience, roughly 1–3 ms added per query, sometimes more under load). The failover scenario changed our math.

RDS Proxy holds connections to the database on your behalf and, crucially, survives a failover from the client's perspective. When the primary swaps, Proxy reconnects to the new primary and parks your client connection during the gap rather than tearing it down. For idle connections, the client never notices. For in-flight queries, you still get an error — Proxy can't replay a transaction — but the recovery window collapses.
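Because in-flight work still fails during the swap, callers need a bounded retry for operations that are safe to repeat. The following is a minimal sketch, assuming the operation is idempotent and that failures surface as unchecked exceptions. The class name is hypothetical, and real code would narrow the catch to transient connection errors instead of retrying everything.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries an idempotent operation a bounded number of times.
final class BoundedRetry {

    static <T> T call(Supplier<T> op, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoffMillis * attempt); // linear backoff between attempts
                }
            }
        }
        throw last; // every attempt failed: surface the final error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = call(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("connection lost");
            return "ok";
        }, 5, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying a transaction that is not idempotent can apply a write twice, so a wrapper like this belongs around whole units of work that are known to be repeatable.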

We rolled out Proxy only on the checkout and inventory services, not the analytics or admin paths. Terraform sketch:

resource "aws_db_proxy" "checkout" {
  name                   = "checkout-proxy"
  engine_family          = "POSTGRESQL"
  role_arn               = aws_iam_role.proxy.arn
  vpc_subnet_ids         = var.private_subnet_ids
  require_tls            = true
  idle_client_timeout    = 1800

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.db.arn
    iam_auth    = "DISABLED"
  }
}

resource "aws_db_proxy_default_target_group" "checkout" {
  db_proxy_name = aws_db_proxy.checkout.name

  connection_pool_config {
    max_connections_percent      = 80
    max_idle_connections_percent = 40
    connection_borrow_timeout    = 5
  }
}

The connection_borrow_timeout of 5 seconds is what makes failover feel fast: a client waiting for a pooled connection during the swap is held for at most 5 seconds, instead of the 120-second default, before it gets an error it can act on.

What it cost us

Proxy added roughly 12% to that database's monthly bill and around 2 ms p50 latency. For checkout, that was an obvious trade. For our reporting service, which runs long aggregations and doesn't care about a 60-second blip at 3 a.m., it wasn't worth it. Pick your battles.

Testing the fix without a real outage

The only way to trust this stuff is to break it on purpose. We run the AWS CLI's rds reboot-db-instance command with --force-failover in a weekly job against staging:

aws rds reboot-db-instance \
  --db-instance-identifier staging-checkout \
  --force-failover

We record the same three timestamps mentioned earlier and alert if recovery exceeds 20 seconds. Doing this weekly caught a regression six weeks in: a developer had added a new datasource with default Hikari settings. We'd never have noticed until the next real failover.
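The application-side half of such a drill is a probe that polls until a query succeeds and reports how long that took. This is a minimal sketch with hypothetical names. The probe is any check that returns true once a query succeeds, and the 20-second budget is the alert threshold mentioned above.

```java
import java.util.function.BooleanSupplier;

// Hypothetical drill helper: times how long the application takes to recover.
final class RecoveryProbe {

    // Polls until the probe succeeds. Returns elapsed milliseconds, or -1 on timeout or interrupt.
    static long millisUntilHealthy(BooleanSupplier probe, long pollMillis, long timeoutMillis) {
        long start = System.nanoTime();
        while (true) {
            boolean ok;
            try {
                ok = probe.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a failed query counts as "not recovered yet"
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (ok) {
                return elapsedMs;
            }
            if (elapsedMs >= timeoutMillis) {
                return -1;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
    }

    // True when recovery finished and stayed inside the budget.
    static boolean withinBudget(long recoveryMillis, long budgetMillis) {
        return recoveryMillis >= 0 && recoveryMillis <= budgetMillis;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        long recovery = millisUntilHealthy(() -> ++calls[0] >= 3, 50, 60_000);
        System.out.println("recovered in " + recovery + " ms, within budget: "
                + withinBudget(recovery, 20_000));
    }
}
```

In a real drill the probe would run a trivial query through the same datasource the service uses, so the measurement includes pool and DNS behavior.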

For anyone running this in production, do the same drill in a maintenance window first. The first one we ran in staging revealed that our read replicas were not being failover-tested at all, and one of them had drifted in configuration. Surprises are cheaper in staging.

The numbers that mattered

After the changes, our next planned failover during a minor version upgrade looked like this in our metrics:

RDS engine swap: 24 seconds
First successful query post-swap: +9 seconds (so 33s total from event start)
Checkout p50 latency returned to baseline: ~12 seconds after first successful query

Measured from the first successful query, that's a customer-visible recovery of roughly 12 seconds for new requests, with in-flight requests getting a clean error they could retry. Before the work, the same scenario produced a 90-second window where roughly a third of checkout attempts failed.

We also stopped getting Sentry pages for transient Connection refused errors during routine RDS maintenance, which had been a low-grade tax on the on-call rotation for over a year.

Where we'd start

If you've never tested an RDS failover against your own application, do that first — before tuning anything. Trigger a forced failover in staging, watch your logs, and measure the actual user-visible gap. You'll learn more in ten minutes than from any blog post.

Then, in order of effort-to-impact: fix JVM (or runtime) DNS TTL, turn on connection pool keepalives, set statement timeouts on the database side, and only then consider RDS Proxy. Proxy is powerful but it's not the first lever, and it isn't free. If you'd like a hand running a reliability drill on your stack, our DevOps and cloud team does this work with clients regularly — though honestly, most teams can get 80% of the way there with the four changes above and an afternoon.

#AWS #RDS #Reliability #PostgreSQL #DevOps
