The Sentry Bill That Tripled Overnight: A Quota Postmortem
A single deploy turned a calm Sentry account into a $4k surprise. Here's what happened, what we changed, and how to stop event floods before finance notices.

We upgraded a frontend SDK on a Thursday. By Monday, Sentry had ingested more events in four days than it usually does in a quarter, and someone in finance was asking polite but pointed questions. This is the postmortem we wrote internally, sanded down for public reading.
The shape of the incident
The app in question is a mid-sized B2B dashboard. Normal volume is somewhere between 40k and 90k error events per day across web and two mobile clients, which fits comfortably inside our paid plan. On the Thursday in question, we shipped a minor version bump to a popular framework SDK — nothing dramatic in the changelog, just "improved instrumentation."
By Saturday morning the dashboard was showing roughly 1.4 million events per day. We hit the on-demand spend cap before anyone looked at a graph, and the account silently started dropping events. The bill, when it landed, was about 3.2× our usual monthly Sentry spend. Not catastrophic, but enough to be a board-deck footnote.
The annoying part: nothing was broken. Users were fine. The product was healthy. We were paying to record a particular kind of noise at extreme resolution.
What actually changed
The SDK upgrade flipped two defaults we hadn't been tracking:
- Automatic instrumentation of fetch failures now captured aborted requests as errors. Our app aggressively cancels in-flight requests when users navigate, which is normal behaviour.
- console.error breadcrumbs were promoted to events under certain conditions, which swept in a noisy third-party widget that logs a warning on every page load in Safari.
Neither is a bug. Both are defensible defaults. But the combination, multiplied by our traffic, turned a quiet stream into a fire hose.
Why we didn't notice for three days
We had a Slack alert for "new issue type," not for "event volume anomaly." The new errors were grouped into two issue fingerprints, so Slack saw two new issues, shrugged, and moved on. Our weekly digest would have caught it. The bill caught it first.
The five-minute triage
Once we realised what was happening, the stop-the-bleeding phase was straightforward. In order:
- Set a hard spend cap in the Sentry org settings (we'd had a soft one).
- Add inbound filters for the two dominant fingerprints.
- Drop the tracesSampleRate on the affected project from 0.2 to 0.02 while we investigated.
- Add a beforeSend hook to discard AbortError and the third-party widget's warning class.
The beforeSend hook is the most reusable piece. Roughly:
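import * as Sentry from "@sentry/browser";

// Patterns tested against the error's name and message and the event's exception type/value.
const IGNORED_MESSAGES = [
  /AbortError/i,
  /ResizeObserver loop/i,
  /Non-Error promise rejection captured/i,
];

// Stack frames that originate in browser extensions, not in our code.
const IGNORED_SOURCES = [
  "chrome-extension://",
  "safari-extension://",
  "moz-extension://",
];

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.02,
  beforeSend(event, hint) {
    const error = hint?.originalException;
    const exception = event.exception?.values?.[0];
    // "AbortError" is the error's name, not its message, so check both,
    // plus the type and value Sentry derived for the event.
    const candidates = [
      typeof error === "string" ? error : "",
      error instanceof Error ? `${error.name}: ${error.message}` : "",
      `${exception?.type ?? ""}: ${exception?.value ?? ""}`,
    ];
    if (candidates.some((text) => IGNORED_MESSAGES.some((re) => re.test(text)))) {
      return null; // null drops the event
    }

    const frames = exception?.stacktrace?.frames ?? [];
    if (
      frames.some((f) =>
        IGNORED_SOURCES.some((src) => f.filename?.startsWith(src)),
      )
    ) {
      return null;
    }

    return event; // everything else goes through unchanged
  },
});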
A few notes on this snippet, because the details matter:
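- Filter in beforeSend, not just with ignoreErrors. The latter runs earlier, but it only pattern-matches the event's message text (strings or regular expressions), so it cannot inspect stack frames and misses some shapes.
- Always return null to drop an event, and make sure every other path returns the event. null is the documented way to discard. An implicit undefined (a branch that forgets to return) is not a valid return value, so don't rely on how the SDK handles it.
- Filter browser extension noise. It is almost never your bug, and on a popular site it can be 20–40% of raw events.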
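If the same hook has to live in several services, it may help to pull the drop decision into a plain function that can be unit tested without the SDK. The sketch below is one way to do that, not the SDK's API: DropCandidate, shouldDrop, and filterEvent are hypothetical names, and the input shape assumes the caller has already extracted the error name, message, and frame filenames from the event.

```typescript
// Hypothetical input shape: fields the caller extracts from the Sentry event and hint.
interface DropCandidate {
  errorName?: string;
  message?: string;
  frameFilenames?: string[];
}

const IGNORED_PATTERNS: RegExp[] = [
  /AbortError/i,
  /ResizeObserver loop/i,
  /Non-Error promise rejection captured/i,
];

const EXTENSION_PREFIXES: string[] = [
  "chrome-extension://",
  "safari-extension://",
  "moz-extension://",
];

// Returns true when the candidate looks like known noise.
function shouldDrop(candidate: DropCandidate): boolean {
  const text = `${candidate.errorName ?? ""}: ${candidate.message ?? ""}`;
  if (IGNORED_PATTERNS.some((re) => re.test(text))) {
    return true;
  }
  const frames = candidate.frameFilenames ?? [];
  return frames.some((filename) =>
    EXTENSION_PREFIXES.some((prefix) => filename.startsWith(prefix)),
  );
}

// Mirrors the beforeSend contract: null drops, the event itself passes through.
function filterEvent<T>(event: T, candidate: DropCandidate): T | null {
  return shouldDrop(candidate) ? null : event;
}
```

A beforeSend hook would then shrink to building the candidate and returning filterEvent(event, candidate), which keeps the "null or the event, never nothing" rule in one place.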
The real fix: a sampling policy, not a vibe
The deeper problem wasn't the SDK. It was that we'd never written down what we wanted Sentry to do for us. Sampling was set to whatever the quickstart suggested two years ago, and no one had revisited it.
We rewrote the policy as four rules:
1. Errors are not sampled. Transactions are.
Dropping error events to save money is a trap. You lose the long-tail bugs that only fire for one user in a thousand. Instead, be ruthless about filtering classes of non-errors: aborted requests, third-party noise, expected validation failures.
Transactions (performance traces) are where sampling actually belongs. We moved to dynamic sampling: 100% for slow requests, 100% for errors, ~1% for everything else.
2. Per-route sample rates
A checkout page deserves more observability than a marketing landing page. We use tracesSampler to set rates per route:
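// Goes inside the Sentry.init({ ... }) options, in place of a flat tracesSampleRate.
tracesSampler: (samplingContext) => {
  // Assumes the SDK version in use populates samplingContext.location.
  const url = samplingContext.location?.pathname ?? "";
  if (url.startsWith("/checkout")) return 1.0; // always trace checkout
  if (url.startsWith("/api/internal")) return 0.5;
  if (url.startsWith("/health")) return 0; // never trace health checks
  return 0.01; // default for everything else
},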
Health checks should never be sampled. Ever. We saw a non-trivial chunk of our previous spend going to traces of a Kubernetes liveness probe.
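As the list of routes grows, the if-chain can be kept as data instead. The following is a minimal sketch under that assumption: RouteRule, ROUTE_RULES, and rateForPath are hypothetical names, and the prefixes and rates simply mirror the tracesSampler snippet above.

```typescript
// Hypothetical rule table; prefixes and rates copied from the tracesSampler example.
type RouteRule = { prefix: string; rate: number };

const ROUTE_RULES: RouteRule[] = [
  { prefix: "/checkout", rate: 1.0 },
  { prefix: "/api/internal", rate: 0.5 },
  { prefix: "/health", rate: 0 },
];

const DEFAULT_RATE = 0.01;

// First matching prefix wins, so order the table from most to least specific.
function rateForPath(pathname: string): number {
  const rule = ROUTE_RULES.find((r) => pathname.startsWith(r.prefix));
  return rule ? rule.rate : DEFAULT_RATE;
}
```

The tracesSampler callback could then return rateForPath(pathname), and the table becomes something a reviewer can read and test in isolation. Note that a plain prefix match also catches paths such as /checkout-promo, which may or may not be what you want.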
3. Release-gated quotas
For every new release, we reserve a small fraction of the monthly quota and watch the first 24 hours of event volume against a baseline. If a release drives more than 2× the rolling 7-day median event rate, we get a page. Not a Slack message — a page. This would have caught the Thursday deploy by Friday morning.
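The paging rule itself is small enough to sketch. The version below assumes the baseline is supplied as the last seven daily event counts and the current value is the release's first-24-hour count; median and shouldPage are hypothetical helper names, and how the counts are fetched is left out.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// True when the current count is more than `factor` times the baseline median.
function shouldPage(
  baselineDailyCounts: number[],
  currentDailyCount: number,
  factor = 2,
): boolean {
  return currentDailyCount > factor * median(baselineDailyCounts);
}
```

Using the median rather than the mean keeps one earlier noisy day from inflating the baseline and hiding the next spike.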
4. Owner per project
Every Sentry project now has a named owner who gets the weekly volume digest. "Platform team" is not an owner. A person is an owner. This is dull and human and works.
What we'd do differently if starting fresh
If we were setting up error monitoring on a new product today, in roughly this order:
- Start with beforeSend populated. Even a pass-through function that just returns the event is a reminder it exists. Drop AbortError, extension noise, and ResizeObserver loop limit exceeded from day one.
- Set the spend cap before you set the DSN. It is much easier to argue for raising a cap than for refunding an overage.
- Tag events with release and environment aggressively. When something goes wrong, you want to be able to answer "is this new in v1.42?" in one click.
- Treat SDK upgrades like dependency upgrades on a critical service. Read the changelog. Diff the defaults. Deploy to a canary environment and watch the volume for 24 hours.
- Build a small dashboard outside Sentry. Pull the events-per-hour metric into whatever you already use (Grafana, Datadog, a Slackbot). Don't rely on the vendor's UI to surface anomalies in the vendor's billing.
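For the external dashboard, the core transformation is just bucketing event timestamps by hour. A minimal sketch, assuming you already have the timestamps as ISO 8601 strings from whatever export you use (that part is not shown); eventsPerHour is a hypothetical helper name:

```typescript
// Buckets ISO 8601 event timestamps into UTC hours and counts them.
function eventsPerHour(timestamps: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ts of timestamps) {
    const date = new Date(ts);
    if (Number.isNaN(date.getTime())) continue; // skip unparseable timestamps
    // Truncate to the hour, e.g. "2024-01-04T09:00Z".
    const hour = date.toISOString().slice(0, 13) + ":00Z";
    counts.set(hour, (counts.get(hour) ?? 0) + 1);
  }
  return counts;
}
```

The resulting hour-to-count map is easy to push into Grafana, Datadog, or a Slackbot, and it gives you a volume series that does not depend on the vendor's own UI.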
A note on the comparable services
We get asked whether switching to a competitor would have helped. Honestly: no. The same event flood would have shown up on Datadog Error Tracking, Honeycomb, Rollbar, or a self-hosted GlitchTip instance. The pricing models differ but the failure mode is the same — you pay for what you ingest, and a noisy SDK ingests a lot.
The self-hosted route trades a billing problem for a storage and ops problem. For a team under about fifteen engineers, that trade is usually worse, not better. Above that, it starts to make sense if you already run Postgres and object storage seriously.
What the incident actually cost
Beyond the bill, the real cost was three engineer-days: one to triage, one to write the sampling policy, one to retrofit beforeSend hooks across four services and update the deployment runbook. We also burned a small amount of trust with finance, which is worth more than the dollars.
The events we dropped during the cap-hit window are gone. If a real bug had landed in that window, we would have missed it. That's the part that keeps us honest about prevention.
Where we'd start
If you have a Sentry account that's been running for more than a year without anyone looking at the sampling config: open it this week. Check three things — your top 10 issues by event count, your tracesSampleRate, and whether you have a spend cap. If any of those look wrong, fix the cheap one first (the cap), then the medium one (beforeSend), then the structural one (per-route sampling). You don't need a project for it. You need an afternoon.