DevOps & Cloud · June 7, 2026 · 6 min read

OpenTelemetry Sampling at Scale: Why Tail-Based Bit Us First

We rolled out OpenTelemetry across a Node and Go fleet, picked tail-based sampling because everyone said to, and learned why head-based wins for most teams. Here's the tradeoff we wish someone had drawn for us.

We spent a quarter rolling OpenTelemetry across a mixed Node and Go fleet, switched on tail-based sampling because every conference talk in 2025 said to, and watched our collector memory chart look like a heart monitor. The lesson wasn't that tail-based is bad. It's that most teams reach for it before they've earned it.

This is the breakdown we wish we'd had on a whiteboard before we started.

The two sampling modes, minus the marketing

If you've only skimmed the OTel docs, here's the honest version.

Head-based sampling decides whether to keep a trace at the moment the root span is created. The decision propagates as the sampled flag in the traceparent header, so every downstream service agrees. It's cheap, stateless, and deterministic. The downside: you decide before you know if the request was interesting. A 500 error you sampled out is gone forever.
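To make "stateless and deterministic" concrete, here is a minimal sketch of a ratio decision derived from nothing but the trace ID. It is a simplified model, not the SDK's exact algorithm; the function name keepTrace and the example trace ID are ours, for illustration only.

```typescript
// Simplified model of a trace-ID ratio sampler. Real SDK samplers differ in
// detail; this only shows why the decision needs no state and is repeatable.
function keepTrace(traceIdHex: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceIdHex)) {
    throw new Error("trace ID must be 32 lowercase hex characters");
  }
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Treat the low 32 bits of the trace ID as a uniformly distributed number.
  const low32 = parseInt(traceIdHex.slice(-8), 16);
  // Keep the trace when that number falls below ratio * 2^32.
  return low32 < ratio * 0x100000000;
}

// Hypothetical trace ID. Any service that sees it reaches the same answer.
const kept = keepTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1);
const again = keepTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1);
```

Note what the function never sees: status codes, durations, anything about the outcome. That is the whole tradeoff in one signature.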

Tail-based sampling buffers every span for a window (usually 5–30 seconds), waits for the whole trace to complete, then decides. You can keep 100% of errors, 100% of slow requests, and a small percentage of healthy ones. The downside: the collector has to hold every span in memory for the buffer window, and it has to see every span — which kills horizontal scaling unless you shard by trace ID.

That last sentence is the one that bit us.

Why "just turn on tail sampling" is a trap

The OTel Collector's tail_sampling processor needs all spans for a given trace to land on the same collector instance. If you run a fleet of collectors behind a round-robin load balancer, span A of trace X goes to collector 1, span B goes to collector 2, and neither has enough context to decide. Both end up holding partial traces until the timeout, then making bad decisions.

The fix is a two-tier collector setup: a stateless front layer that does nothing but hash by trace ID and route to a stateful back layer. That back layer is now a sharded, memory-hungry, stateful service you have to operate. Welcome to your new pet.
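A toy simulation shows the difference between the two routing strategies. This is a sketch under assumptions: the hash (FNV-1a) and the modulo step are stand-ins we picked for brevity, the real exporter's hashing scheme differs, and the trace and span names are made up.

```typescript
// FNV-1a 32-bit hash. Any stable hash works; this one is just short.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

interface Span {
  traceId: string;
  name: string;
}

// Round-robin: the target depends on arrival order, not on the trace.
function routeRoundRobin(spans: Span[], collectors: number): number[] {
  return spans.map((_, i) => i % collectors);
}

// Trace-ID routing: the target depends only on the trace ID.
function routeByTraceId(spans: Span[], collectors: number): number[] {
  return spans.map((s) => fnv1a(s.traceId) % collectors);
}

// Three spans of one hypothetical trace.
const spans: Span[] = [
  { traceId: "trace-x", name: "gateway" },
  { traceId: "trace-x", name: "orders" },
  { traceId: "trace-x", name: "db" },
];
const roundRobin = routeRoundRobin(spans, 3); // [0, 1, 2]: one trace, three collectors
const sharded = routeByTraceId(spans, 3); // the same collector index three times
```

With round-robin, no single collector ever holds the full trace. With trace-ID routing, one collector holds all of it, which is exactly the state you then have to size and babysit.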

What our bill and latency actually looked like

Numbers from our environment — not benchmarks, just what we saw. Take them as shape, not gospel.

Before OTel, we were on a vendor SDK with adaptive sampling at roughly 10%. Span volume sat around 40M spans/day. After lift-and-shift to OTel with head-based probabilistic sampling at 10%, span volume was basically unchanged and our backend bill was within 5% of before.

When we flipped to tail-based with a policy of "keep all errors, keep all traces over 1s, sample 5% of the rest," three things happened in our experience:

Stored trace volume dropped about 30% (good: we were paying for noise)
Collector memory went from ~400MB steady to 3–6GB with spikes to 11GB during traffic bursts
p99 export latency from app to backend went from ~2s to ~25s, because of the decision window

The last one mattered more than we expected. During an incident, engineers were refreshing the trace UI waiting for spans that wouldn't appear for 20+ seconds. That's an eternity when you're paging.
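For reference, the policy we quoted (keep all errors, keep everything over 1s, sample 5% of the rest) boils down to a small decision function that runs once per buffered trace. This is a sketch of the logic, not the collector's implementation: the types are ours, and using the root span's duration as the trace latency is a simplifying assumption.

```typescript
interface FinishedSpan {
  traceId: string;
  durationMs: number;
  isError: boolean;
  isRoot: boolean;
}

type Decision = "keep:error" | "keep:slow" | "keep:baseline" | "drop";

// Decide on a whole trace once the buffer window has elapsed.
// `random` is injectable so the baseline rule can be tested deterministically.
function decideTrace(
  spans: FinishedSpan[],
  random: () => number = Math.random,
): Decision {
  // Rule 1: any error anywhere in the trace keeps it.
  if (spans.some((s) => s.isError)) return "keep:error";
  // Rule 2: slow traces are kept. Without a root span we cannot judge
  // latency here, so the trace falls through to the baseline rule.
  const root = spans.find((s) => s.isRoot);
  if (root !== undefined && root.durationMs > 1000) return "keep:slow";
  // Rule 3: 5% of everything else.
  return random() < 0.05 ? "keep:baseline" : "drop";
}
```

The function can only run after the spans have arrived, which is where both the memory and the 20+ second wait come from.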

A head-based config that gets you 80% of the value

If you haven't started yet, do this first. It's boring, it works, and you can always evolve later.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Trim noisy spans before they hit your backend
  filter/drop_health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/metrics"'

exporters:
  otlp/backend:
    endpoint: ingest.your-vendor.tld:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop_health, batch]
      exporters: [otlp/backend]

The sampling decision happens in the SDK, not the collector. In Node:

import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1), // 10% of new traces
});

const sdk = new NodeSDK({
  sampler,
  // ...exporter, resource, instrumentations
});

sdk.start();

ParentBasedSampler is the important bit. It respects the upstream decision if there's a traceparent header, so a single trace stays consistent end-to-end. Without it you get half-sampled traces with holes that look like bugs.
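The rule itself is small enough to model without the SDK. The sketch below parses a W3C traceparent header and follows the upstream flag when there is one. It is a simplified model with function names of our own; the real ParentBasedSampler handles more cases, such as distinguishing remote and local parents.

```typescript
// W3C traceparent: version-traceid-parentid-flags, for example
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
const TRACEPARENT = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-([0-9a-f]{2})$/;

// Returns the upstream sampled flag, or undefined if there is no usable header.
function upstreamSampled(traceparent: string | undefined): boolean | undefined {
  if (traceparent === undefined) return undefined;
  const match = TRACEPARENT.exec(traceparent);
  if (match === null) return undefined;
  // Bit 0 of the trace-flags byte is the "sampled" flag.
  return (parseInt(match[1], 16) & 0x01) === 0x01;
}

// Parent-based rule: follow the parent if there is one,
// otherwise ask the root sampler.
function parentBasedDecision(
  traceparent: string | undefined,
  rootSampler: () => boolean,
): boolean {
  const parent = upstreamSampled(traceparent);
  return parent !== undefined ? parent : rootSampler();
}
```

A service that skips this rule and rolls its own dice on every request is how you end up with traces that have a gateway span, a database span, and nothing in between.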

When 10% isn't enough

For most backends, 10% gives you plenty for performance work and capacity planning. Where it falls down: rare errors. If your error rate is 0.1% and you're sampling at 10%, you're keeping 0.01% of all traffic as error traces. On 10M requests/day that's 1,000 error traces — fine. On 100k requests/day it's 10. Not fine.
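The arithmetic is worth running against your own numbers before you pick a ratio. A minimal sketch, with a function name of our own and the figures from the paragraph above:

```typescript
// Expected error traces kept per day under uniform head sampling.
function expectedErrorTraces(
  requestsPerDay: number,
  errorRate: number,
  sampleRatio: number,
): number {
  return requestsPerDay * errorRate * sampleRatio;
}

// 0.1% error rate, 10% sampling.
const busy = expectedErrorTraces(10000000, 0.001, 0.1); // roughly 1,000 per day
const quiet = expectedErrorTraces(100000, 0.001, 0.1); // roughly 10 per day
```

These are expected values, not guarantees. At ten a day, a bad afternoon can easily leave you with zero traces for the error you care about.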

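The cheap fix is error-biased head sampling. A sampler runs when the span starts, before any status code exists, so it can't react to the response directly. Instead, have the SDK record root spans even when the head decision says not to sample, and export the ones that end with a non-2xx response or an exception, as if they had been RECORD_AND_SAMPLED. You're still head-based, but errors get a second chance. Downstream services have already seen the unsampled flag, so you only recover the local spans and you can't catch errors that happen deep in the trace this way, but you'll catch most of what matters.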

When tail-based is actually worth the operational cost

We didn't rip out tail sampling everywhere. We kept it for two cases:

Checkout and payment flows. Low volume, high value per trace, and the questions we ask are "why was this specific user slow" — exactly what tail sampling is good at. We run a small dedicated collector pair for these services with maybe 2GB of memory each. Totally fine.
Async pipelines with long tails. Background workers where a tiny fraction of jobs take 100x the median. Head sampling decides before it knows which jobs those will be, so at 10% it drops nine in ten of them; tail sampling catches them by design.

For the rest — the chatty internal RPC mesh, the read-heavy product APIs, the static asset proxies — head-based at 5–10% plus aggressive filter processors is the right call.

The two-tier collector pattern, if you must

If you do go tail-based at scale, here's the shape:

  apps  →  [load-balanced front collectors]  →  [trace-ID-sharded back collectors]  →  backend
           (stateless, autoscale freely)        (stateful, scale carefully)

The front layer uses the loadbalancing exporter with routing_key: traceID. The back layer runs the tail_sampling processor. You size the back tier for your worst burst, not your average, because OOMs there cause data loss for everyone, not just one node.

Budget roughly: average span size × spans per second × decision window in seconds × safety factor of 3. Our back tier ended up at 4 nodes of 8GB each for ~5k spans/sec sustained. Your mileage will absolutely vary.
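The budget formula as a function, so you can plug in your own measurements. The inputs in the example are placeholders, not our production figures: the 2 KiB span size and the 30 s window in particular are assumptions you should replace with measured values.

```typescript
interface TailTierLoad {
  avgSpanBytes: number; // in-memory size of one buffered span (measure this)
  spansPerSecond: number; // sustained ingest rate
  decisionWaitSeconds: number; // how long spans sit in the buffer
  safetyFactor: number; // the rule of thumb above uses 3
}

// average span size x spans per second x decision window x safety factor
function bufferBudgetBytes(load: TailTierLoad): number {
  return (
    load.avgSpanBytes *
    load.spansPerSecond *
    load.decisionWaitSeconds *
    load.safetyFactor
  );
}

// Hypothetical inputs, for illustration only.
const budget = bufferBudgetBytes({
  avgSpanBytes: 2048,
  spansPerSecond: 5000,
  decisionWaitSeconds: 30,
  safetyFactor: 3,
});
const budgetGiB = budget / (1024 * 1024 * 1024);
```

The number covers buffered spans only, so treat it as a starting point for sizing rather than a node spec. Doubling the decision window doubles the result, which is worth remembering before you raise it to chase long-running spans.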

What goes wrong in production (so you can plan for it)

Things that have hurt us or clients we've helped:

Schema drift between services. Service A sets user.id, service B sets userId. Your tail policies key on attributes — inconsistent attributes mean inconsistent decisions. Enforce a span attribute schema in code review, or use the transform processor to normalise.
Long-running spans blowing the decision window. If a span lasts 60s and your tail window is 30s, the decision fires before the span closes. The processor will log a warning and you'll wonder why your slow traces are missing. Tune decision_wait to your p99 trace duration plus headroom.
Collector restarts losing in-flight traces. Rolling a tail-sampling collector drops whatever's in the buffer. Do it during low traffic, and don't be surprised by a temporary trace gap.
SDK-side BatchSpanProcessor queue overflow under load. Defaults are conservative. Bump maxQueueSize and maxExportBatchSize if you see BatchSpanProcessor dropping spans warnings — that's data loss before the collector even sees it.
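For the schema drift item, the normalisation is simple enough to sketch, whether it ends up in a shared SDK helper or in a collector-side transform. The alias table below is hypothetical and so are the names; the point is the shape: one canonical key per concept, and an existing canonical value wins over an alias.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Hypothetical alias table: non-canonical key -> canonical key.
const ALIASES = new Map<string, string>([
  ["userId", "user.id"],
  ["user_id", "user.id"],
]);

// Returns a copy with aliased keys renamed to their canonical form.
function normaliseAttributes(attrs: Attributes): Attributes {
  const out: Attributes = {};
  // First pass: copy keys that are already canonical or unknown.
  for (const key of Object.keys(attrs)) {
    if (!ALIASES.has(key)) out[key] = attrs[key];
  }
  // Second pass: fold aliases in, unless the canonical key is already set.
  for (const key of Object.keys(attrs)) {
    const canonical = ALIASES.get(key);
    if (
      canonical !== undefined &&
      !Object.prototype.hasOwnProperty.call(out, canonical)
    ) {
      out[canonical] = attrs[key];
    }
  }
  return out;
}
```

Run something like this before any policy looks at attributes, so a rule keyed on user.id behaves the same for service A and service B.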

Where we'd start

If you're standing up OTel in 2026 and your trace volume is anywhere under a few hundred million spans a day, do this in order:

1. Head-based probabilistic sampling at 10%, ParentBasedSampler everywhere, no exceptions.
2. Filter processors in the collector to drop health checks, metrics scrapes, and known-noisy paths. This is free volume reduction.
3. Error-biased sampling in your SDKs so rare errors aren't lost.
4. Only after all that, and only for the services where you can name the specific question tail sampling answers, stand up a sharded tail-sampling tier.

The goal isn't to keep every interesting trace. It's to spend collector memory and engineer attention where they actually change an outcome. If you want a second pair of eyes on a rollout, our team does this kind of work as part of our DevOps and platform engagements.

