DevOps & Cloud | May 14, 2026 | 6 min read

When OpenTelemetry Sampling Bit Us: Debugging a Phantom p99

A war story about how head-based sampling in OpenTelemetry quietly hid a real latency regression for three weeks, and the tail-sampling setup we landed on after the incident.

Our dashboards said the API was healthy. Our biggest customer said it wasn't. For three weeks both were technically correct, and the reason came down to a single line in an OpenTelemetry Collector config that nobody had touched since the service launched.

This is the short version of that incident, what the traces were actually telling us once we looked properly, and the sampling setup we run now.

The setup before the incident

The service in question is a fairly ordinary B2B API: Node.js on ECS Fargate, Postgres on RDS, a Redis cache, and a handful of downstream calls to internal services and a payments vendor. Tracing was instrumented with the OpenTelemetry Node SDK, exported via OTLP to a Collector running as a sidecar, then forwarded to our vendor backend.

Sampling was configured at the SDK using ParentBased(TraceIdRatioBased(0.1)). Ten percent of root spans, parent decision honoured downstream. That ratio was chosen 18 months earlier when traffic was a quarter of what it is now, and it survived every config review because, honestly, nobody had a reason to change it.
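To make the behaviour of that sampler concrete, here is a minimal sketch of a head-based ratio decision with parent-based inheritance. It is not the SDK's actual code: treating the first 8 hex characters of the trace ID as a uniform 32-bit number is an assumption made for illustration, and the function names are hypothetical.

```typescript
// Simplified head-based sampler: the keep/drop decision depends only on the
// trace ID, never on how long the request ends up taking.
// Assumption: the first 8 hex characters of the trace ID behave like a
// uniformly distributed 32-bit number. Real SDK implementations differ.
function shouldSampleRoot(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

// Parent-based behaviour: a span with a parent inherits the parent's
// decision; only root spans consult the ratio.
function shouldSample(
  traceId: string,
  parentSampled: boolean | undefined,
  ratio: number,
): boolean {
  if (parentSampled !== undefined) {
    return parentSampled;
  }
  return shouldSampleRoot(traceId, ratio);
}
```

Nothing in either function can see the request's latency or status, which is the property that matters for the rest of this story.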

Dashboards in the vendor backend showed p50, p95 and p99 latency calculated from the sampled traces. Metrics from the runtime (Prometheus, scraped by the Collector) showed request rates and error counts on the unsampled, full traffic.

In hindsight, you can already see the shape of the problem.

What the customer saw vs. what we saw

The customer, a logistics platform that fans out a lot of parallel requests during their morning batch, reported intermittent 4–6 second responses on an endpoint our dashboards showed at ~280 ms p99. Their evidence was a CSV of request IDs and wall-clock timings from their side.

We pulled the first request ID. No trace. Second one, no trace. Fifth one, finally a trace — and it was 312 ms, well within normal.

That's the moment the on-call engineer (correctly) stopped trusting the dashboard.

Why the slow traces were missing

With head-based ratio sampling at 10%, the decision to keep a trace is made when the root span starts, before the request has done any work. The sampler has no idea whether the request will take 200 ms or 6 seconds. Statistically, slow requests are sampled at the same rate as fast ones, but slow requests are rare, so in absolute terms you capture very few of them.

Our slow tail was roughly 1 in 4,000 requests on that endpoint. At 10% sampling, only about 1 in 40,000 requests would be both slow and traced. With the customer's traffic pattern, that worked out to a handful of slow traces per day, scattered across a backend that aggregates by service, not by customer. They never moved the p99 computed from sampled data, because the vendor was reporting the p99 of what we sent it, not the p99 of reality.
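The arithmetic is easy to check. The sketch below multiplies it out; the 200,000 requests per day figure is a hypothetical volume chosen for illustration, not a number from the incident.

```typescript
// Expected number of slow traces that survive head sampling in one day.
function expectedKeptSlowTraces(
  requestsPerDay: number,
  slowOneIn: number, // one slow request per this many requests
  sampleRatio: number, // head sampling ratio, e.g. 0.1
): number {
  const slowPerDay = requestsPerDay / slowOneIn;
  return slowPerDay * sampleRatio;
}

// Hypothetical volume: 200,000 requests/day on the endpoint.
const keptPerDay = expectedKeptSlowTraces(200000, 4000, 0.1);
console.log(keptPerDay.toFixed(1)); // "5.0": 50 slow requests a day, about 5 of them traced
```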

Meanwhile our runtime histogram metrics — exported unsampled via the Collector's Prometheus receiver — did show a small p99 bump, but it had been written off as noise because the trace-derived dashboards looked fine. Two sources of truth, one of them silently lying.

Finding the actual regression

Once we stopped trusting sampled p99, the path forward was unglamorous:

Temporarily crank sampling to 100% on that one endpoint via a Collector processor.
Wait for the next batch window.
Filter traces by http.route and duration > 2s.

Within 40 minutes we had ~120 slow traces. All of them had the same shape: a long gap between the Postgres span finishing and the next application span starting. Not a database problem. Not a downstream problem. Pure on-host time.
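Spotting that shape by eye works for a few traces, but it can also be scripted. The following is a rough sketch of finding the longest stretch inside a trace where no child span was running. The `SpanTiming` shape and the span names are hypothetical, and it assumes the spans are siblings under one root with timings already extracted from the backend.

```typescript
interface SpanTiming {
  name: string;
  startMs: number; // offset from the start of the trace
  endMs: number;
}

interface Gap {
  afterSpan: string;
  beforeSpan: string;
  durationMs: number;
}

// Finds the longest interval not covered by any span in the list.
function largestGap(spans: SpanTiming[]): Gap | undefined {
  const sorted = spans.slice().sort((a, b) => a.startMs - b.startMs);
  let best: Gap | undefined = undefined;
  let coveredUntil = -Infinity;
  let lastSpan = "";
  for (const span of sorted) {
    // A gap exists when this span starts after everything before it ended.
    if (lastSpan !== "" && span.startMs > coveredUntil) {
      const durationMs = span.startMs - coveredUntil;
      if (best === undefined || durationMs > best.durationMs) {
        best = { afterSpan: lastSpan, beforeSpan: span.name, durationMs };
      }
    }
    if (span.endMs > coveredUntil) {
      coveredUntil = span.endMs;
      lastSpan = span.name;
    }
  }
  return best;
}

// Hypothetical slow trace: the query is quick, then nothing is recorded for ~4 s.
const gap = largestGap([
  { name: "auth.check", startMs: 0, endMs: 20 },
  { name: "pg.query", startMs: 25, endMs: 60 },
  { name: "http.send", startMs: 4100, endMs: 4180 },
]);
console.log(gap); // gap of 4040 ms between pg.query and http.send
```

A large gap with no span covering it usually means uninstrumented work on the host, which is where to look next.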

The culprit was a JSON serialisation step on a response payload that had grown roughly 8x since a feature shipped three weeks earlier. Under normal load it was fine; under the customer's burst pattern the event loop stalled, which also explained why the slow requests clumped together — one stalled tick delayed several in-flight responses.

A streaming serialiser fixed the latency. But the more important fix was that we never want to be in this position again.

Moving to tail sampling

We replaced head-based sampling with tail sampling in the OpenTelemetry Collector. The decision is now made after all spans for a trace are collected, based on actual trace properties: duration, error status, specific routes, specific customers.

The core of the new Collector config looks like this:

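processors:
  tail_sampling:
    # How long to hold a trace's spans, counted from its first span, before deciding.
    decision_wait: 10s
    # Upper bound on the number of traces held in memory at once.
    num_traces: 100000
    # Sizing hint for the processor's internal data structures.
    expected_new_traces_per_sec: 2000
    # A trace is kept if any policy below matches.
    policies:
      # Keep every trace that contains a span with ERROR status.
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep every trace that took longer than 750 ms end to end.
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 750 }
      # Keep everything from enterprise customers, fast or slow.
      - name: priority-customers
        type: string_attribute
        string_attribute:
          key: customer.tier
          values: [enterprise]
      # Keep a 5% random baseline of everything else.
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }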

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/vendor]
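For readers who think better in code than YAML, the policies above can be modelled roughly as follows. This is a simplified sketch, not the Collector's implementation (which is written in Go): the `FinishedTrace` shape is hypothetical, and the random baseline is replaced by a deterministic hash of the trace ID so the example is repeatable.

```typescript
interface FinishedTrace {
  traceId: string;
  durationMs: number;
  hasError: boolean;
  attributes: Record<string, string>;
}

type Policy = (trace: FinishedTrace) => boolean;

// FNV-1a hash mapped to [0, 1). A deterministic stand-in for the
// probabilistic policy's random choice (an assumption of this sketch).
function hashToUnit(traceId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    h ^= traceId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

const policies: { name: string; matches: Policy }[] = [
  { name: "errors", matches: (t) => t.hasError },
  { name: "slow-traces", matches: (t) => t.durationMs > 750 },
  { name: "priority-customers", matches: (t) => t.attributes["customer.tier"] === "enterprise" },
  { name: "baseline", matches: (t) => hashToUnit(t.traceId) < 0.05 },
];

// Returns the names of the policies that voted to keep the trace.
// An empty list means the trace is dropped.
function matchingPolicies(trace: FinishedTrace): string[] {
  return policies.filter((p) => p.matches(trace)).map((p) => p.name);
}
```

The key difference from head sampling is the input: every policy sees a finished trace, so duration and error status are available when the decision is made.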

A few things worth calling out, because tail sampling has sharp edges.

Tail sampling is not free

The Collector now has to buffer every span for the full decision_wait window (10s in our config) before deciding whether to keep the trace. Memory usage went from ~180 MB per Collector instance to ~1.4 GB at peak. We moved the Collector from a sidecar to a dedicated gateway deployment behind a load balancer, with consistent hashing on trace_id so all spans for one trace land on the same Collector. Without that, tail sampling silently breaks because each Collector only sees a fragment of the trace.
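The routing requirement is simple to state in code: the same trace ID must always map to the same Collector, and adding or removing an instance should move as few traces as possible. The sketch below uses rendezvous (highest-random-weight) hashing as a compact stand-in for whatever consistent-hashing scheme your load balancer provides. The collector names and the FNV-1a hash are illustrative choices, not what we run in production.

```typescript
// FNV-1a 32-bit hash, returned as an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Rendezvous hashing: score every collector against the trace ID and
// route to the highest score. Every span of a trace carries the same
// trace ID, so every span gets the same answer.
function pickCollector(traceId: string, collectors: string[]): string {
  if (collectors.length === 0) {
    throw new Error("no collectors available");
  }
  let best = "";
  let bestScore = -1;
  for (const collector of collectors) {
    const score = fnv1a(`${collector}|${traceId}`);
    if (score > bestScore) {
      best = collector;
      bestScore = score;
    }
  }
  return best;
}

// Hypothetical gateway instances.
const gateways = ["otel-gw-0", "otel-gw-1", "otel-gw-2"];
console.log(pickCollector("4bf92f3577b34da6a3ce929d0e0e4736", gateways));
```

With this scheme, removing an instance only reassigns the traces that were routed to it; traces on the surviving instances stay put.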

If you're running Collectors as sidecars and considering tail sampling, plan for that topology change first. We sketched the migration in our SRE work and budgeted about a sprint for it; it took closer to three weeks because of the load balancer health-check tuning.

Percentiles from sampled traces are still wrong

This is the part that surprised the product team. Even with tail sampling biased toward slow traces, you cannot compute a meaningful p99 from those traces — by design, the slow ones are over-represented. So:

Latency percentiles: always from unsampled histogram metrics (we use OTel metrics + Prometheus, exported through the same Collector).
Trace exemplars: from the tail-sampled traces, linked from the metrics via exemplar support.
Error investigation: from tail-sampled traces, where we keep 100% of errors.
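A small synthetic example shows how badly a tail-sampled stream skews percentiles. All numbers here are invented for illustration: 10,000 requests, of which 1 in 200 takes 5,000 ms and the rest take 200 ms, with a sampler that keeps everything over 750 ms plus every 20th request as a stand-in for a 5% baseline.

```typescript
// Nearest-rank percentile over raw latency values.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("percentile of an empty list");
  }
  return value;
}

// Synthetic population: 9,950 fast requests and 50 slow ones.
const all: number[] = [];
for (let i = 0; i < 10000; i++) {
  all.push(i % 200 === 0 ? 5000 : 200);
}

// Tail-sampling stand-in: keep every slow request plus every 20th request.
const kept = all.filter((ms, i) => ms > 750 || i % 20 === 0);

const trueP99 = percentile(all, 99);
const sampledP99 = percentile(kept, 99);
console.log(trueP99, sampledP99); // 200 5000
```

The slow requests are 0.5% of real traffic but 10% of the kept traces, so the p99 of the sampled stream lands squarely in the slow tail.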

If you remember one thing from this article: do your percentiles in metrics, do your debugging in traces, and never let the two get computed from the same sampled stream.

What we'd check in your stack tomorrow

If you're running OpenTelemetry in anger and reading this nervously, here's the quick audit we now run on new services:

Open your latency dashboard. Ask where the numbers actually come from — sampled spans, or unsampled histograms? If your team can't answer in 30 seconds, that's the finding.
Pull one of your slowest known requests by ID. Is there a trace for it? If not, your sampler is hiding the bugs you most need to see.
Look at error sampling. If you're using head-based ratio sampling at 10%, you're throwing away 90% of your error traces for no good reason. Errors should be sampled at 100%.
Check whether spans for a single trace can land on different Collector instances. If yes, and you're using or planning tail sampling, fix the routing first.
Confirm your runtime metrics (event loop lag, GC pauses, DB pool wait time) are exported unsampled and alertable independently of traces.

None of this is exotic. It's the default configuration that quietly stops being correct as traffic and payload sizes grow. We've written more about the broader observability stack on the 72Technologies blog if you want the longer version.

The customer who flagged this got a fix and a credit. The engineer who didn't trust the dashboard got a beer. The dashboard got rewritten.

