How We Cut Our AWS Multi-AZ RDS Failover Time From 90s to 12s
A real story of trimming RDS failover from a customer-visible 90 seconds down to roughly 12. The fix wasn't a bigger instance — it was DNS, connection pools, and Proxy.

A customer pinged us last spring: "Your checkout was down for 90 seconds." It was actually a healthy AWS RDS Multi-AZ failover during a minor engine patch. The database recovered fine. Our application did not. This is the story of why a 30-second AWS event turned into a 90-second user-visible outage, and how we got the next one down to about 12 seconds.
What Multi-AZ actually promises (and what it doesn't)
AWS markets RDS Multi-AZ as high availability with "typically 60–120 seconds" of failover time. That number describes how long until the standby is promoted and reachable on the cluster endpoint. It says nothing about:
- How long your application clients take to notice the old primary is gone
- How long your connection pool holds onto dead TCP sockets
- How long your DNS resolver caches the old A record
- How your ORM behaves mid-transaction when the socket dies
In our case, the engine swapped in around 28 seconds. The remaining 60+ seconds was entirely our stack failing to let go of a corpse.
How we measured it
We instrumented three timestamps for the postmortem:
1. rds_failover_started: from the RDS event stream via EventBridge
2. first_successful_query_after_failover: application-side, logged per pod
3. checkout_p50_recovered: derived from our Sentry transaction data
The gap between (1) and (2) was where the pain lived. In our worst pod it was 78 seconds. In the best, 41. That variance was the first clue: this wasn't an AWS problem, it was a client-side lottery.
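A minimal sketch of how the gap between (1) and (2) can be computed per pod once the timestamps are collected. The function names, pod names, and event time are illustrative assumptions; the sample values are chosen only to reproduce the 78-second and 41-second gaps mentioned above.

```python
from datetime import datetime, timedelta
from typing import Dict


def recovery_gaps(failover_started: datetime,
                  first_success_by_pod: Dict[str, datetime]) -> Dict[str, float]:
    """Seconds from the RDS failover event to each pod's first successful query."""
    return {
        pod: (first_success - failover_started).total_seconds()
        for pod, first_success in first_success_by_pod.items()
    }


def summarize(gaps: Dict[str, float]) -> Dict[str, object]:
    """Pick out the slowest and fastest pod and the spread between them."""
    worst = max(gaps, key=gaps.get)
    best = min(gaps, key=gaps.get)
    return {
        "worst_pod": worst,
        "worst_s": gaps[worst],
        "best_pod": best,
        "best_s": gaps[best],
        "spread_s": gaps[worst] - gaps[best],
    }


# Illustrative data only: hypothetical pod names and a made-up event time.
started = datetime(2024, 1, 1, 3, 0, 0)
sample = {
    "checkout-a": started + timedelta(seconds=78),
    "checkout-b": started + timedelta(seconds=41),
    "checkout-c": started + timedelta(seconds=63),
}
report = summarize(recovery_gaps(started, sample))
print(report)
```

A large spread between pods is the signal worth alerting on: it points at client-side state (DNS cache age, pool contents) rather than at the database.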
The four things eating our recovery time
1. JVM DNS caching
The JVM, by default, caches successful DNS lookups forever when a security manager is installed, and for 30 seconds otherwise. RDS failover works by repointing the cluster endpoint's CNAME to the new primary. If your client never re-resolves, it keeps dialing a host that is now the demoted standby (or worse, nothing).
We were running a Spring Boot service on JDK 17. The fix was a one-liner in our entrypoint:
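# Cap the JVM's DNS cache so it re-resolves the RDS endpoint after failover.
# networkaddress.cache.ttl is a security property and is ignored when passed with -D;
# the system-property equivalents are sun.net.inetaddr.ttl and sun.net.inetaddr.negative.ttl.
export JAVA_TOOL_OPTIONS="-Dsun.net.inetaddr.ttl=5 -Dsun.net.inetaddr.negative.ttl=2"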
Five seconds is a deliberate floor. Going lower hammered our VPC resolver during normal operation with no measurable benefit.
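During a drill it helps to see when the endpoint's address actually flips, independent of the JVM. The following is a rough sketch, not part of our production tooling: it polls the OS resolver and reports how long the change took to become visible. The function names are hypothetical, and the OS or VPC resolver may add its own caching on top.

```python
import socket
import time
from typing import Callable, Optional, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the OS resolver currently gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def wait_for_address_change(
    hostname: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
    poll_seconds: float = 1.0,
    max_wait_seconds: float = 180.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the resolved address set differs from the first answer.

    Returns the seconds elapsed when the change was seen, or None on timeout.
    """
    start = clock()
    initial = resolver(hostname)
    while clock() - start < max_wait_seconds:
        sleep(poll_seconds)
        if resolver(hostname) != initial:
            return clock() - start
    return None
```

Start it just before triggering the failover and pass the database endpoint hostname. The resolver, clock, and sleep parameters are injectable only so the loop can be exercised without real DNS.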
2. HikariCP holding dead sockets
Hikari's default maxLifetime is 30 minutes and its keepaliveTime is disabled. After failover, the pool happily handed out sockets connected to an endpoint that no longer accepted writes. The first query would fail, the connection would be evicted, and the next request would get another dead connection from the same pool.
Our tuned config:
spring:
  datasource:
    hikari:
      # Critical: fail fast and recycle
      maximum-pool-size: 20
      connection-timeout: 3000
      validation-timeout: 2000
      keepalive-time: 30000
      max-lifetime: 600000
      # Not the key setting here; keepalive-time is
      connection-test-query: "SELECT 1"
The key change wasn't connection-test-query, which Hikari's own documentation advises against setting when the driver supports JDBC4 (it validates with Connection.isValid() instead). It was keepalive-time. With keepalive enabled, Hikari probes each idle connection every 30 seconds and discovers the failover quickly instead of waiting for a user request to find the problem.
3. TCP keepalive at the OS level
Even with Hikari probing, the kernel was configured to wait two hours before sending its first keepalive probe on an idle TCP connection. On Linux, the default tcp_keepalive_time is 7200 seconds. We dropped it on our application nodes:
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=3
This means a dead socket is detected within roughly 90 seconds at the kernel level even if nothing else catches it. It's a safety net, not the primary mechanism.
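The 90-second figure is just the idle time plus the probe budget. A small sketch of the arithmetic (the function name is ours, for illustration; the Linux defaults used for comparison are 7200, 75, and 9):

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Worst-case seconds for the kernel to give up on an idle, silently dead peer.

    The connection must sit idle for time_s before the first probe, then
    `probes` unanswered probes are sent intvl_s apart.
    """
    return time_s + intvl_s * probes


linux_default = keepalive_detection_seconds(7200, 75, 9)
tuned = keepalive_detection_seconds(60, 10, 3)
print(f"default: {linux_default}s, tuned: {tuned}s")
```

These sysctls only affect sockets that have SO_KEEPALIVE enabled, so it is worth confirming that the database driver turns it on for its connections.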
4. In-flight transactions hanging
The nastiest failure mode: a transaction that had already issued BEGIN and a few writes when the primary disappeared. The socket would sit in ESTABLISHED state from the client's view, the query would block on recv(), and we'd hold the request thread until something timed out — sometimes minutes.
We added a statement timeout on every connection:
ALTER ROLE app_service SET statement_timeout = '8s';
ALTER ROLE app_service SET idle_in_transaction_session_timeout = '15s';
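Eight seconds is aggressive. It killed two slow background jobs we'd been ignoring, which was honestly a benefit. For anything that legitimately needs longer, we set the timeout per-session inside that job. Both settings are enforced by the database server, so they bound statements on a primary that is still alive; a client blocked on a host that has vanished still depends on the keepalive settings above or a driver-level socket timeout.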
Where RDS Proxy actually helped
We'd avoided RDS Proxy for years because of the per-vCPU cost and the latency overhead (in our experience, roughly 1–3 ms added per query, sometimes more under load). The failover scenario changed our math.
RDS Proxy holds connections to the database on your behalf and, crucially, survives a failover from the client's perspective. When the primary swaps, Proxy reconnects to the new primary and parks your client connection during the gap rather than tearing it down. For idle connections, the client never notices. For in-flight queries, you still get an error — Proxy can't replay a transaction — but the recovery window collapses.
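Because an in-flight transaction still fails, the caller needs a retry path. Here is a generic sketch of a bounded retry with exponential backoff and jitter, assuming the whole transaction is safe to run again from the start. TransientDbError and run_with_retry are hypothetical names, not our service code; a real service would catch its driver's connection-lost exception instead.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientDbError(Exception):
    """Hypothetical stand-in for the driver's connection-lost exception."""


def run_with_retry(
    txn: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.25,
    max_delay_s: float = 4.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientDbError,),
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Run txn, retrying the whole unit of work on transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            backoff = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Sleep between 50% and 100% of the backoff to avoid synchronized retries.
            sleep(backoff * (0.5 + 0.5 * jitter()))
    raise RuntimeError("unreachable")
```

For something like checkout, retrying blindly is only safe if the operation is idempotent (for example, keyed on an order or request ID), since the first attempt may have committed before the error reached the client.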
We rolled out Proxy only on the checkout and inventory services, not the analytics or admin paths. Terraform sketch:
resource "aws_db_proxy" "checkout" {
  name                = "checkout-proxy"
  engine_family       = "POSTGRESQL"
  role_arn            = aws_iam_role.proxy.arn
  vpc_subnet_ids      = var.private_subnet_ids
  require_tls         = true
  idle_client_timeout = 1800

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.db.arn
    iam_auth    = "DISABLED"
  }
}

resource "aws_db_proxy_default_target_group" "checkout" {
  db_proxy_name = aws_db_proxy.checkout.name

  connection_pool_config {
    max_connections_percent      = 80
    max_idle_connections_percent = 40
    connection_borrow_timeout    = 5
  }
}
Setting connection_borrow_timeout to 5 seconds (the default is 120) is what makes failover feel fast: a client waiting on a database connection during the swap is held for at most 5 seconds and then gets an error it can retry, rather than hanging for up to two minutes.
What it cost us
Proxy added roughly 12% to that database's monthly bill and around 2 ms p50 latency. For checkout, that was an obvious trade. For our reporting service, which runs long aggregations and doesn't care about a 60-second blip at 3 a.m., it wasn't worth it. Pick your battles.
Testing the fix without a real outage
The only way to trust this stuff is to break it on purpose. We use the RDS reboot-db-instance API with --force-failover in a weekly job against staging:
aws rds reboot-db-instance \
  --db-instance-identifier staging-checkout \
  --force-failover
We record the same three timestamps mentioned earlier and alert if recovery exceeds 20 seconds. Doing this weekly caught a regression six weeks in: a developer had added a new datasource with default Hikari settings. We'd never have noticed until the next real failover.
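The application-side half of that check can be sketched as a probe loop: run a cheap query repeatedly until it succeeds and compare the elapsed time to the 20-second threshold. This is an illustrative outline rather than our actual job; probe stands in for whatever issues a trivial query through the same datasource the service uses, and the function names are hypothetical.

```python
import time
from typing import Callable, Optional


def seconds_until_first_success(
    probe: Callable[[], None],
    interval_s: float = 0.5,
    give_up_after_s: float = 120.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Call probe until it stops raising; return elapsed seconds, or None on give-up."""
    start = clock()
    while True:
        try:
            probe()
            return clock() - start
        except Exception:
            if clock() - start >= give_up_after_s:
                return None
            sleep(interval_s)


def drill_passed(recovery_s: Optional[float], threshold_s: float = 20.0) -> bool:
    """A drill passes only if recovery happened and stayed within the threshold."""
    return recovery_s is not None and recovery_s <= threshold_s
```

The timer starts when the function is called, so the loop should begin right as the forced failover is triggered. A probe that succeeds immediately reports roughly zero, which is why it matters that it starts after the old primary has begun to go away.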
For anyone running this in production, do the same drill in a maintenance window first. The first one we ran in staging revealed that our read replicas were not being failover-tested at all, and one of them had drifted in configuration. Surprises are cheaper in staging.
The numbers that mattered
After the changes, our next planned failover during a minor version upgrade looked like this in our metrics:
- RDS engine swap: 24 seconds
- First successful query post-swap: +9 seconds (so 33s total from event start)
- Checkout p50 latency returned to baseline: ~12 seconds after first successful query
That's a customer-visible recovery of roughly 12 seconds for new requests, with in-flight requests getting a clean error they could retry. Before the work, the same scenario produced a 90-second window where roughly a third of checkout attempts failed.
We also stopped getting Sentry pages for transient Connection refused errors during routine RDS maintenance, which had been a low-grade tax on the on-call rotation for over a year.
Where we'd start
If you've never tested an RDS failover against your own application, do that first — before tuning anything. Trigger a forced failover in staging, watch your logs, and measure the actual user-visible gap. You'll learn more in ten minutes than from any blog post.
Then, in order of effort-to-impact: fix JVM (or runtime) DNS TTL, turn on connection pool keepalives, set statement timeouts on the database side, and only then consider RDS Proxy. Proxy is powerful but it's not the first lever, and it isn't free. If you'd like a hand running a reliability drill on your stack, our DevOps and cloud team does this work with clients regularly — though honestly, most teams can get 80% of the way there with the four changes above and an afternoon.