DevOps & Cloud · June 15, 2026 · 6 min read

Our Vercel Cron Jobs Silently Stopped Firing for 6 Hours. Here's the Postmortem.

A scheduled job that hadn't fired in six hours, no alert, no error in Sentry, and a digest email that never got sent. Here's exactly what broke, how we caught it, and the cron monitoring pattern we run now.

A customer pinged us on a Monday morning asking why their weekly digest email never showed up. We checked the logs: the cron endpoint hadn't been hit in six hours. No 500s, no Sentry events, no Vercel deployment failures. The job just... stopped. This is the postmortem and the monitoring we wish we'd had on day one.

The Setup That Looked Fine

We were running a Next.js 14 app on Vercel Pro with four cron jobs defined in vercel.json. Standard stuff: a billing reconciliation at 02:00 UTC, a digest email at 08:00 UTC, a stale-session cleanup every 15 minutes, and a webhook retry sweep every 5 minutes.

The config looked like this:

{
  "crons": [
    { "path": "/api/cron/billing", "schedule": "0 2 * * *" },
    { "path": "/api/cron/digest", "schedule": "0 8 * * 1" },
    { "path": "/api/cron/sessions", "schedule": "*/15 * * * *" },
    { "path": "/api/cron/webhooks", "schedule": "*/5 * * * *" }
  ]
}

Every handler authenticated against CRON_SECRET, wrote a structured log line on entry and exit, and pushed exceptions to Sentry. We had uptime checks on the public marketing pages and the API health endpoint. We thought we were covered.

We were not covered.

What Actually Happened

At around 04:00 UTC, a routine deployment went out. It passed CI, it deployed to production, it was promoted. The 15-minute session cleanup ran at 04:00. It ran at 04:15. Then it stopped. The 5-minute webhook sweep ran at 04:00, 04:05, 04:10. Then it stopped too. The 08:00 digest never fired.

The customer report landed at 10:14 UTC. By the time we logged into the Vercel dashboard, the cron jobs tab showed the schedules as configured but "Last Run" timestamps from six hours earlier. No error indicator. No banner.

What we eventually figured out, with help from Vercel support: a stray change to vercel.json during a refactor had introduced a duplicate path entry. The deployment succeeded, but the cron registration on Vercel's side entered a state where some jobs were paused pending re-registration on the next deployment. We have not seen this documented as a public failure mode, and we cannot reproduce it on demand. That's part of the lesson — "shouldn't happen" is not a monitoring strategy.
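
A duplicate path entry is cheap to detect before deploy. The following is a minimal sketch of such a check, not something we can claim would have prevented this incident, since we cannot reproduce it. The helper name findDuplicateCronPaths and the interfaces are hypothetical; only the crons, path, and schedule keys come from the config shown earlier.

```typescript
// Shape of the "crons" section of vercel.json, as used in the config above.
interface CronEntry {
  path: string;
  schedule: string;
}

interface VercelConfig {
  crons?: CronEntry[];
}

// Hypothetical helper (not part of any Vercel tooling): returns every cron
// path that appears more than once, in the order the repeats are found.
function findDuplicateCronPaths(config: VercelConfig): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const entry of config.crons ?? []) {
    if (seen.has(entry.path)) {
      duplicates.add(entry.path);
    }
    seen.add(entry.path);
  }
  return Array.from(duplicates);
}
```

In CI, a script could parse vercel.json, pass the result to this function, and fail the build when the returned list is non-empty.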

Why Sentry Didn't Catch It

Sentry only knows about events that get sent to it. Our handlers reported errors when they ran. They were not running, so there were no errors. We had Sentry Cron Monitoring available on the plan and had not wired it up because the jobs "felt simple". That was the single biggest mistake.

Why Our Uptime Checks Didn't Catch It

We had Better Stack pinging /api/health every 60 seconds. That endpoint returned 200 because the app was fine. Health checks tell you the kitchen is open. They do not tell you anyone is cooking.

The Pattern We Run Now: Dead Man's Switches Everywhere

The fix is older than Vercel: a dead man's switch. Every scheduled job has to actively announce "I ran" to an external observer, and if the announcement doesn't arrive within the expected window, that observer alerts.
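
Stripped of any vendor, the observer's decision is a single comparison. This sketch assumes a fixed interval between runs; hosted monitors evaluate the real cron schedule instead, so treat it as an illustration of the idea rather than how any particular service works. The Heartbeat type and isOverdue function are hypothetical names.

```typescript
// One job's most recent "I ran" announcement plus its expectations.
interface Heartbeat {
  job: string;
  lastCheckInMs: number; // epoch milliseconds of the latest check-in
  intervalMs: number;    // expected time between runs
  marginMs: number;      // grace period before silence counts as a failure
}

// True when the job has been silent for longer than interval + margin.
function isOverdue(hb: Heartbeat, nowMs: number): boolean {
  return nowMs - hb.lastCheckInMs > hb.intervalMs + hb.marginMs;
}

const MINUTE = 60_000;
const sessions: Heartbeat = {
  job: 'cron-sessions-cleanup',
  lastCheckInMs: 0,
  intervalMs: 15 * MINUTE,
  marginMs: 2 * MINUTE,
};

const stillFine = isOverdue(sessions, 16 * MINUTE); // false: inside the margin
const shouldPage = isOverdue(sessions, 18 * MINUTE); // true: past interval + margin
```

The important property is that the alert fires on the absence of a signal, so it still works when the job never starts.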

We use Sentry Cron Monitoring for this because we already pay for Sentry. Healthchecks.io and Better Stack Heartbeats work the same way if you'd rather decouple.

The handler now looks like this:

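import * as Sentry from '@sentry/nextjs';
// cleanupStaleSessions is the app's own helper; its import is omitted here.

const MONITOR_SLUG = 'cron-sessions-cleanup';

export async function GET(req: Request) {
  // Vercel sends CRON_SECRET as a bearer token. Reject when the secret is
  // unset so a missing env var can never match "Bearer undefined".
  const secret = process.env.CRON_SECRET;
  const auth = req.headers.get('authorization');
  if (!secret || auth !== `Bearer ${secret}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Open the check-in. The second argument creates or updates the monitor.
  const checkInId = Sentry.captureCheckIn(
    {
      monitorSlug: MONITOR_SLUG,
      status: 'in_progress',
    },
    {
      schedule: { type: 'crontab', value: '*/15 * * * *' },
      checkinMargin: 2, // minutes a check-in may be late before it counts as missed
      maxRuntime: 5, // minutes a run may stay in progress before it counts as failed
      timezone: 'Etc/UTC',
    },
  );

  try {
    const deleted = await cleanupStaleSessions();

    // Close the check-in as successful.
    Sentry.captureCheckIn({
      checkInId,
      monitorSlug: MONITOR_SLUG,
      status: 'ok',
    });

    return Response.json({ deleted });
  } catch (err) {
    // Close the check-in as failed and report the exception itself.
    Sentry.captureCheckIn({
      checkInId,
      monitorSlug: MONITOR_SLUG,
      status: 'error',
    });
    Sentry.captureException(err);
    return new Response('failed', { status: 500 });
  }
}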

The critical part is checkinMargin: 2. Sentry expects the job to check in within two minutes of its scheduled time. If it doesn't, we get paged. The platform does not need to tell us anything broke — we infer it from silence.

Belt-and-Braces: An External Cron That Watches the Internal Cron

For the two jobs that actually move money or send customer-facing email, we added a second layer. A GitHub Actions workflow with a schedule trigger runs every hour and queries our own database for "when did this job last record a successful run". If the gap exceeds the threshold, the workflow fails, and that failure pages on-call.

name: cron-watchdog
on:
  schedule:
    - cron: '15 * * * *'
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - name: Query last run timestamps
        env:
          DATABASE_URL: ${{ secrets.WATCHDOG_DB_URL }}
        run: node scripts/check-cron-freshness.js
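
The script itself is not reproduced here. The core of a freshness check can be sketched as a pure function like the one below, assuming the last successful run per job has already been read from the database. The names findStaleJobs and JobThreshold, and the threshold values, are illustrative assumptions rather than our production code.

```typescript
// Maximum tolerated age of the last successful run, per job.
interface JobThreshold {
  job: string;
  maxAgeMs: number;
}

// Returns the jobs whose last success is older than allowed.
// A job with no recorded success at all is treated as stale.
function findStaleJobs(
  thresholds: JobThreshold[],
  lastSuccessMs: Record<string, number | undefined>,
  nowMs: number,
): string[] {
  return thresholds
    .filter(({ job, maxAgeMs }) => {
      const last = lastSuccessMs[job];
      return last === undefined || nowMs - last > maxAgeMs;
    })
    .map(({ job }) => job);
}

const HOUR = 3_600_000;
// Illustrative thresholds: a daily job gets 25 hours, a weekly job 8 days.
const watched: JobThreshold[] = [
  { job: 'billing', maxAgeMs: 25 * HOUR },
  { job: 'digest', maxAgeMs: 8 * 24 * HOUR },
];

const now = 100 * 24 * HOUR;
const stale = findStaleJobs(
  watched,
  { billing: now - 30 * HOUR, digest: now - 24 * HOUR },
  now,
);
// stale is ['billing']: 30 hours exceeds the 25-hour threshold
```

A script built on this would exit with a non-zero status whenever the list is non-empty, which is what makes the workflow fail.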

Is this redundant with Sentry Cron Monitoring? Yes. Deliberately. The whole point of the incident was that one layer of observability had an invisible failure mode. We are not assuming the second layer is bulletproof either — we are assuming the joint probability of both layers silently failing at the same time is low enough to sleep on.

Tradeoffs We Accepted

Nothing here is free.

Cost. Sentry Cron Monitoring is billed per check-in on most plans. A */5 * * * * job runs roughly 8,640 times in a 30-day month, and a handler like the one above sends two check-ins per run (in_progress, then ok or error). Multiply across jobs and environments and the line item is real. We dropped check-ins from staging crons after the first month: staging silence is annoying, production silence is a customer call.
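
The arithmetic is easy to rerun for your own schedules. This sketch assumes a 30-day month; whether the in_progress check-in and the closing check-in are both counted depends on the plan, so checkInsPerRun is a parameter to verify rather than a fact. The function name is hypothetical.

```typescript
// Estimated check-in events per month for one job.
function monthlyCheckIns(
  runsPerHour: number,
  checkInsPerRun: number,
  days: number = 30,
): number {
  return runsPerHour * 24 * days * checkInsPerRun;
}

const webhookRuns = monthlyCheckIns(12, 1);     // */5 job, one event per run: 8640
const webhookEvents = monthlyCheckIns(12, 2);   // */5 job, open + close: 17280
const sessionEvents = monthlyCheckIns(4, 2);    // */15 job, open + close: 5760
```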

Noise during deploys. Vercel deployments can briefly delay a cron invocation. With a 2-minute margin on a 5-minute cron, we got a handful of false positives the first week. We bumped margins to 3 minutes for the most frequent jobs and the noise stopped. Do not set margins so wide that real failures hide for an hour.

Lock-in. Tying cron health to Sentry means a Sentry outage looks like a cron outage. That's why the watchdog workflow exists in GitHub Actions and reads from our own database, not from Sentry's API.

What We Tell New Engineers Now

Three rules went into the runbook after this incident:

1. Any scheduled task gets a dead man's switch before it ships. Not after. Before. The PR template has a checkbox.
2. "It ran" is a fact the job has to assert, not a fact you infer from the absence of errors. Write a cron_runs table or equivalent. Every successful run inserts a row with the job name and timestamp. This is the source of truth the watchdog reads.
3. Treat the platform's own dashboard as untrusted for alerting. Vercel, AWS EventBridge, GCP Cloud Scheduler: they all have a UI that shows you when jobs ran. None of them owe you a page when they didn't.

The cron_runs table is twelve lines of SQL and has paid for itself twice since we added it. Once for this incident, once when a Cloud Scheduler job in a sibling GCP project quietly stopped after an IAM change.
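
Our exact table definition is not shown in this post. As a rough sketch of the idea, the following assumes PostgreSQL, numbered placeholders in the style of node-postgres, and a hypothetical schema with job_name and finished_at columns. The SqlClient interface and recordCronRun are illustrative names that keep the example independent of any driver.

```typescript
// Minimal client shape so the sketch does not depend on a specific driver.
interface SqlClient {
  query(text: string, params: unknown[]): Promise<unknown>;
}

// Hypothetical schema (PostgreSQL syntax assumed), to be run once as a migration.
const CREATE_CRON_RUNS = `
  CREATE TABLE IF NOT EXISTS cron_runs (
    id          BIGSERIAL PRIMARY KEY,
    job_name    TEXT NOT NULL,
    finished_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );
  CREATE INDEX IF NOT EXISTS cron_runs_job_finished_idx
    ON cron_runs (job_name, finished_at DESC);
`;

const INSERT_CRON_RUN =
  'INSERT INTO cron_runs (job_name, finished_at) VALUES ($1, $2)';

// Call this only after the job's real work has succeeded, so every row
// means "this job completed" rather than "this job started".
async function recordCronRun(
  db: SqlClient,
  jobName: string,
  finishedAt: Date = new Date(),
): Promise<void> {
  await db.query(INSERT_CRON_RUN, [jobName, finishedAt.toISOString()]);
}
```

A watchdog can then read the latest finished_at per job_name and compare it against that job's threshold.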

Where We'd Start

If you're reading this with cron jobs in production and no dead man's switch, the order of operations is:

1. Add a cron_runs table today. Every handler writes to it on success. Total work: under an hour.
2. Wire one of Sentry Cron Monitoring, Healthchecks.io, or Better Stack Heartbeats into your most business-critical job. Pick the one you already pay for. Total work: an afternoon.
3. Add a watchdog that reads cron_runs and pages if anything is stale. Cron-on-cron is fine. GitHub Actions is fine.
4. Only then, roll the pattern out to the rest of your jobs.

If you'd like a hand designing this for a larger fleet — multi-region, multi-cloud, or with strict compliance constraints — that's the kind of work we do on our DevOps and reliability engagements. The fix is rarely exotic. The discipline of applying it before the incident is the hard part.
