OpenTelemetry Sampling at Scale: Why Tail-Based Bit Us First
We rolled out OpenTelemetry across a Node and Go fleet, picked tail-based sampling because everyone said to, and learned why head-based wins for most teams. Here's the tradeoff we wish someone had drawn for us.

We spent a quarter rolling OpenTelemetry across a mixed Node and Go fleet, switched on tail-based sampling because every conference talk in 2025 said to, and watched our collector memory chart look like a heart monitor. The lesson wasn't that tail-based is bad. It's that most teams reach for it before they've earned it.
This is the breakdown we wish we'd had on a whiteboard before we started.
The two sampling modes, minus the marketing
If you've only skimmed the OTel docs, here's the honest version.
Head-based sampling decides whether to keep a trace at the moment the root span is created. The decision propagates via the traceparent header, so every downstream service agrees. It's cheap, stateless, and deterministic. The downside: you decide before you know if the request was interesting. A 500 error you sampled out is gone forever.
Tail-based sampling buffers every span for a window (usually 5–30 seconds), waits for the whole trace to complete, then decides. You can keep 100% of errors, 100% of slow requests, and a small percentage of healthy ones. The downside: the collector has to hold every span in memory for the buffer window, and it has to see every span — which kills horizontal scaling unless you shard by trace ID.
That last sentence is the one that bit us.
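To make the head-based half concrete, here is a minimal sketch of how a service might arrive at its decision: follow the sampled flag in an incoming W3C traceparent header if there is one, otherwise roll the dice once for a new trace. The function names are ours, and the ratio check is a simplified stand-in, not the SDK's actual algorithm.

```typescript
// Reads the sampled bit from a W3C traceparent header:
//   version-traceid-spanid-flags, e.g. "00-<32 hex>-<16 hex>-01"
// Returns undefined when the header cannot be parsed.
function sampledFlagFromTraceparent(traceparent: string): boolean | undefined {
  const parts = traceparent.trim().split("-");
  if (parts.length < 4) return undefined;
  const flags = parseInt(parts[3] ?? "", 16);
  if (Number.isNaN(flags)) return undefined;
  return (flags & 0x01) === 0x01;
}

// Simplified ratio check (an assumption for illustration): map the last
// 8 hex characters of the trace ID onto [0, 1) and compare with the ratio.
function rootDecision(traceId: string, ratio: number): boolean {
  const low32 = parseInt(traceId.slice(-8), 16);
  return low32 / 0x100000000 < ratio;
}

// Parent-based behaviour: an upstream decision always wins; only a
// service that starts the trace applies the ratio.
function headDecision(
  traceparent: string | undefined,
  newTraceId: string,
  ratio: number
): boolean {
  if (traceparent !== undefined) {
    const upstream = sampledFlagFromTraceparent(traceparent);
    if (upstream !== undefined) return upstream;
  }
  return rootDecision(newTraceId, ratio);
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
console.log(headDecision(incoming, "ignored-for-child-spans", 0.1)); // true: upstream said "sampled"
```

Nothing here needs shared state or a buffer, which is why head-based sampling stays cheap. It also shows the limitation: the decision is fixed before a single span has finished.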
Why "just turn on tail sampling" is a trap
The OTel Collector's tail_sampling processor needs all spans for a given trace to land on the same collector instance. If you run a fleet of collectors behind a round-robin load balancer, span A of trace X goes to collector 1, span B goes to collector 2, and neither has enough context to decide. Both end up holding partial traces until the timeout, then making bad decisions.
The fix is a two-tier collector setup: a stateless front layer that does nothing but hash by trace ID and route to a stateful back layer. That back layer is now a sharded, memory-hungry, stateful service you have to operate. Welcome to your new pet.
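A toy sketch of why the routing layer matters, under simplifying assumptions: the hash below is plain FNV-1a with a modulo, which is only an illustration (real deployments typically use consistent hashing so that adding a node does not reshuffle every trace), and the collector names are hypothetical.

```typescript
// FNV-1a over the trace ID string. Any stable hash would do here.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Trace-ID routing: every span of a trace lands on the same back collector.
function routeByTraceId(traceId: string, backends: string[]): string {
  if (backends.length === 0) throw new Error("no backends configured");
  return backends[fnv1a(traceId) % backends.length];
}

// Round-robin routing: consecutive spans go to consecutive collectors,
// regardless of which trace they belong to.
function makeRoundRobin(backends: string[]): () => string {
  let next = 0;
  return () => {
    const chosen = backends[next % backends.length];
    next += 1;
    return chosen;
  };
}

const collectors = ["back-0", "back-1", "back-2"]; // hypothetical names
const trace = "4bf92f3577b34da6a3ce929d0e0e4736";
const roundRobin = makeRoundRobin(collectors);
for (const spanName of ["root", "db.query", "cache.get"]) {
  console.log(spanName, "| round-robin:", roundRobin(), "| by trace ID:", routeByTraceId(trace, collectors));
}
```

In the round-robin column the three spans of one trace end up on three different collectors. In the trace-ID column they share one, which is the property the tail sampler depends on.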
What our bill and latency actually looked like
Numbers from our environment — not benchmarks, just what we saw. Take them as shape, not gospel.
Before OTel, we were on a vendor SDK with adaptive sampling at roughly 10%. Trace volume sat around 40M spans/day. After lift-and-shift to OTel with head-based probabilistic sampling at 10%, span volume was basically unchanged and our backend bill was within 5% of before.
When we flipped to tail-based with a policy of "keep all errors, keep all traces over 1s, sample 5% of the rest," three things happened in our experience:
- Stored trace volume dropped about 30% (good: we had been paying for noise)
- Collector memory went from ~400MB steady to 3–6GB with spikes to 11GB during traffic bursts
- p99 export latency from app to backend went from ~2s to ~25s, because of the decision window
The last one mattered more than we expected. During an incident, engineers were refreshing the trace UI waiting for spans that wouldn't appear for 20+ seconds. That's an eternity when you're paging.
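For reference, a policy like the one above could be expressed roughly as follows with the tail_sampling processor from the Collector contrib distribution. Treat it as a sketch: the policy names and the decision_wait value are placeholders we picked for illustration, not our production settings.

```yaml
processors:
  tail_sampling:
    # How long to buffer a trace before deciding. This is the window
    # that showed up as extra export latency. Placeholder value.
    decision_wait: 20s
    policies:
      - name: keep-errors            # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow              # hypothetical policy name
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-the-rest        # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

As we understand the processor, a trace is kept when any one of these policies votes to sample it, so the probabilistic policy acts as the baseline for otherwise healthy traces.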
A head-based config that gets you 80% of the value
If you haven't started yet, do this first. It's boring, it works, and you can always evolve later.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Trim noisy spans before they hit your backend
  filter/drop_health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/metrics"'
exporters:
  otlp/backend:
    endpoint: ingest.your-vendor.tld:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop_health, batch]
      exporters: [otlp/backend]
The sampling decision happens in the SDK, not the collector. In Node:
import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1), // 10% of new traces
});

const sdk = new NodeSDK({
  sampler,
  // ...exporter, resource, instrumentations
});

sdk.start();
ParentBasedSampler is the important bit. It respects the upstream decision if there's a traceparent header, so a single trace stays consistent end-to-end. Without it you get half-sampled traces with holes that look like bugs.
When 10% isn't enough
For most backends, 10% gives you plenty for performance work and capacity planning. Where it falls down: rare errors. If your error rate is 0.1% and you're sampling at 10%, you're keeping 0.01% of all traffic as error traces. On 10M requests/day that's 1,000 error traces — fine. On 100k requests/day it's 10. Not fine.
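The arithmetic is worth running against your own traffic before you pick a ratio. A minimal sketch, with a function name of our own choosing:

```typescript
// Expected number of error traces retained per day under uniform
// head-based sampling: requests × error rate × sampling ratio.
function expectedErrorTracesPerDay(
  requestsPerDay: number,
  errorRate: number,
  sampleRatio: number
): number {
  return requestsPerDay * errorRate * sampleRatio;
}

// The two cases from the paragraph above: 0.1% errors, 10% sampling.
console.log(expectedErrorTracesPerDay(10000000, 0.001, 0.1)); // about 1,000 per day
console.log(expectedErrorTracesPerDay(100000, 0.001, 0.1)); // about 10 per day
```

This is an expectation, not a guarantee. At the low-volume end the daily count will swing widely around 10, and some days a rare failure mode will not be captured at all.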
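The cheap fix is error-biased head sampling: instrument your SDK so the root span is kept (RECORD_AND_SAMPLED) when it sees a non-2xx response or an exception. One caveat: a sampler runs when the span starts, before any status code exists, so the upgrade has to happen where the root span ends, typically in a custom span processor or exporter wrapper, not in the sampler itself. You're still head-based, but errors get a second chance. You can't catch errors that happen deep in the trace this way, because downstream services have already acted on the propagated decision, but you'll catch most of what matters.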
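The keep-or-drop rule itself is small. Here is a sketch of just the decision logic, assuming the root span is recorded in-process and the check runs when it ends. The FinishedRootSpan shape and the function name are hypothetical, not part of any OpenTelemetry API, and wiring this into a real SDK is left out.

```typescript
// Hypothetical view of a root span at the moment it ends.
interface FinishedRootSpan {
  headSampled: boolean;   // what the probabilistic head sampler decided
  httpStatus?: number;    // undefined for non-HTTP roots
  hadException: boolean;  // an exception was recorded on the span
}

// Error-biased rule: keep whatever head sampling kept, plus any root
// span that ended with a non-2xx status or an exception.
function keepRootSpan(span: FinishedRootSpan): boolean {
  if (span.headSampled) return true;
  const non2xx =
    span.httpStatus !== undefined &&
    (span.httpStatus < 200 || span.httpStatus >= 300);
  return non2xx || span.hadException;
}

console.log(keepRootSpan({ headSampled: false, httpStatus: 500, hadException: false })); // true
console.log(keepRootSpan({ headSampled: false, httpStatus: 200, hadException: false })); // false
```

Taken literally, "non-2xx" also keeps redirects and client errors. If your 404 volume is high, you may prefer to narrow the rule to 5xx and exceptions.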
When tail-based is actually worth the operational cost
We didn't rip out tail sampling everywhere. We kept it for two cases:
- Checkout and payment flows. Low volume, high value per trace, and the questions we ask are "why was this specific user slow" — exactly what tail sampling is good at. We run a small dedicated collector pair for these services with maybe 2GB of memory each. Totally fine.
- Async pipelines with long tails. Background workers where a tiny fraction of jobs take 100x the median. Head sampling keeps these only at the base rate, because it decides before the job's duration is known; tail sampling catches them by design.
For the rest — the chatty internal RPC mesh, the read-heavy product APIs, the static asset proxies — head-based at 5–10% plus aggressive filter processors is the right call.
The two-tier collector pattern, if you must
If you do go tail-based at scale, here's the shape:
apps → [load-balanced front collectors] → [trace-ID-sharded back collectors] → backend
       (stateless, autoscale freely)      (stateful, scale carefully)
The front layer uses the loadbalancing exporter with routing_key: traceID. The back layer runs the tail_sampling processor. You size the back tier for your worst burst, not your average, because OOMs there cause data loss for everyone, not just one node.
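A sketch of what the front-tier exporter can look like, assuming the back tier is reachable through a DNS name that resolves to every back-collector instance (for example a Kubernetes headless service). The hostname is hypothetical, and the insecure TLS setting is only reasonable inside a trusted network.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: plaintext inside the cluster
    resolver:
      dns:
        # Hypothetical name; it must resolve to all back-tier collectors
        hostname: otel-back-tier.observability.svc.cluster.local
        port: 4317
```

The front tier then lists loadbalancing as the exporter in its traces pipeline, and only the back tier carries the tail_sampling processor.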
Budget roughly: average span size × spans per second × decision window in seconds × safety factor of 3. Our back tier ended up at 4 nodes of 8GB each for ~5k spans/sec sustained. Your mileage will absolutely vary.
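The budget formula as a few lines of code, so you can plug in your own numbers. The 2,000-byte average span size and 30-second window below are assumptions for illustration, not measurements from our fleet.

```typescript
// Rough buffer budget for a tail-sampling tier:
// average span size × spans per second × decision window × safety factor.
function tailBufferBytes(
  avgSpanBytes: number,
  spansPerSecond: number,
  decisionWindowSeconds: number,
  safetyFactor: number = 3
): number {
  return avgSpanBytes * spansPerSecond * decisionWindowSeconds * safetyFactor;
}

// Assumed inputs: 2,000-byte spans, 5,000 spans/sec, 30-second window.
const bufferBytes = tailBufferBytes(2000, 5000, 30);
console.log((bufferBytes / 1e9).toFixed(1) + " GB"); // 0.9 GB
```

Read the result as a floor for span payload across the whole tier, not a node size. Our provisioned capacity was well above what this kind of estimate produces, because we sized for bursts and for everything else the collector keeps in memory.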
What goes wrong in production (so you can plan for it)
Things that have hurt us or clients we've helped:
- Schema drift between services. Service A sets user.id, service B sets userId. Your tail policies key on attributes, so inconsistent attributes mean inconsistent decisions. Enforce a span attribute schema in code review, or use the transform processor to normalise.
- Long-running spans blowing the decision window. If a span lasts 60s and your tail window is 30s, the decision fires before the span closes. The processor will log a warning and you'll wonder why your slow traces are missing. Tune decision_wait to your p99 trace duration plus headroom.
- Collector restarts losing in-flight traces. Rolling a tail-sampling collector drops whatever's in the buffer. Do it during low traffic, and don't be surprised by a temporary trace gap.
- SDK-side BatchSpanProcessor queue overflow under load. Defaults are conservative. Bump maxQueueSize and maxExportBatchSize if you see "BatchSpanProcessor dropping spans" warnings: that's data loss before the collector even sees it.
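For the schema-drift case, normalising in the collector can be sketched with the transform processor and two OTTL statements. The processor name is hypothetical, and it has to run before any policy that reads user.id.

```yaml
processors:
  transform/normalize_user_id:   # hypothetical processor name
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Copy the legacy key to the canonical one only if the canonical one is missing
          - set(attributes["user.id"], attributes["userId"]) where attributes["user.id"] == nil and attributes["userId"] != nil
          # Then drop the legacy key so policies see a single attribute
          - delete_key(attributes, "userId")
```

This papers over the drift rather than fixing it. The code-review rule is still what stops the next variant from appearing.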
Where we'd start
If you're standing up OTel in 2026 and your trace volume is anywhere under a few hundred million spans a day, do this in order:
- Head-based probabilistic sampling at 10%, ParentBasedSampler everywhere, no exceptions.
- Filter processors in the collector to drop health checks, metrics scrapes, and known-noisy paths. This is free volume reduction.
- Error-biased sampling in your SDKs so rare errors aren't lost.
- Only after all that — and only for the services where you can name the specific question tail sampling answers — stand up a sharded tail-sampling tier.
The goal isn't to keep every interesting trace. It's to spend collector memory and engineer attention where they actually change an outcome. If you want a second pair of eyes on a rollout, our team does this kind of work as part of our DevOps and platform engagements.