DevOps & Cloud · June 7, 2026 · 6 min read

OpenTelemetry Sampling at Scale: Why Tail-Based Bit Us First

We rolled out OpenTelemetry across a Node and Go fleet, picked tail-based sampling because everyone said to, and learned why head-based wins for most teams. Here's the tradeoff we wish someone had drawn for us.

We spent a quarter rolling out OpenTelemetry across a mixed Node and Go fleet, switched on tail-based sampling because every conference talk in 2025 said to, and watched our collector memory chart look like a heart monitor. The lesson wasn't that tail-based is bad. It's that most teams reach for it before they've earned it.

This is the breakdown we wish we'd had on a whiteboard before we started.

The two sampling modes, minus the marketing

If you've only skimmed the OTel docs, here's the honest version.

Head-based sampling decides whether to keep a trace at the moment the root span is created. The decision propagates via the traceparent header, so every downstream service agrees. It's cheap, stateless, and deterministic. The downside: you decide before you know if the request was interesting. A 500 error you sampled out is gone forever.
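To make "the decision propagates" concrete, here is a minimal sketch that reads the sampling decision out of a traceparent header. It assumes the W3C format version-traceid-parentid-flags, where the lowest bit of the flags byte means "sampled". The name parseTraceParent is ours for illustration; real SDKs do this inside their propagators.

```typescript
// Illustrative only: a tiny parser for the W3C traceparent header.
// Example header: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
interface TraceParent {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim()
  );
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  // Bit 0 of the flags byte carries the upstream head-sampling decision.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { traceId, parentId, sampled };
}
```

A downstream service that honours the header simply follows `sampled` when a valid traceparent is present and only runs its own ratio sampler when there is none.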

Tail-based sampling buffers every span for a decision window (usually 5–30 seconds) so the rest of the trace has time to arrive, then decides. You can keep 100% of errors, 100% of slow requests, and a small percentage of healthy ones. The downside: the collector has to hold every span in memory for the whole window, and a single instance has to see every span of a trace, which kills horizontal scaling unless you shard by trace ID.

That last sentence is the one that bit us.

Why "just turn on tail sampling" is a trap

The OTel Collector's tail_sampling processor needs all spans for a given trace to land on the same collector instance. If you run a fleet of collectors behind a round-robin load balancer, span A of trace X goes to collector 1, span B goes to collector 2, and neither has enough context to decide. Both end up holding partial traces until the timeout, then making bad decisions.

The fix is a two-tier collector setup: a stateless front layer that does nothing but hash by trace ID and route to a stateful back layer. That back layer is now a sharded, memory-hungry, stateful service you have to operate. Welcome to your new pet.
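The routing idea in the front layer can be sketched in a few lines. This is a deliberately simplified version using an FNV-1a hash and a plain modulo. The function names and backend addresses are hypothetical, and as far as we know the real loadbalancing exporter uses consistent hashing instead, so that resizing the back tier reshuffles fewer traces.

```typescript
// Illustrative only: pin every span of a trace to one back-tier collector.

// FNV-1a, 32-bit, over the characters of the trace ID.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Same trace ID in, same backend out, no shared state between front collectors.
function pickBackend(traceId: string, backendList: string[]): string {
  if (backendList.length === 0) {
    throw new Error("no back-tier collectors configured");
  }
  return backendList[fnv1a(traceId) % backendList.length];
}
```

Because the mapping depends only on the trace ID and the backend list, any number of stateless front collectors agree on where a trace lives. The weakness of plain modulo is that adding or removing a backend remaps most traces at once.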

What our bill and latency actually looked like

Numbers from our environment — not benchmarks, just what we saw. Take them as shape, not gospel.

Before OTel, we were on a vendor SDK with adaptive sampling at roughly 10%. Trace volume sat around 40M spans/day. After a lift-and-shift to OTel with head-based probabilistic sampling at 10%, span volume was basically unchanged and our backend bill was within 5% of what it had been.

When we flipped to tail-based with a policy of "keep all errors, keep all traces over 1s, sample 5% of the rest," three things happened in our experience:

  • Stored trace volume dropped about 30% (good: we had been paying for noise)
  • Collector memory went from ~400MB steady to 3–6GB with spikes to 11GB during traffic bursts
  • p99 export latency from app to backend went from ~2s to ~25s, because of the decision window

The last one mattered more than we expected. During an incident, engineers were refreshing the trace UI waiting for spans that wouldn't appear for 20+ seconds. That's an eternity when you've just been paged.
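For reference, the policy we used ("keep all errors, keep all traces over 1s, sample 5% of the rest") boils down to a small pure function once a trace's spans have been buffered. This is a sketch, not the collector's implementation: the SpanSummary shape and the decideTrace name are made up, and the 5% baseline is derived from the trace ID so the example is deterministic.

```typescript
// Illustrative only: a tail-sampling decision over one buffered trace.
interface SpanSummary {
  traceId: string; // 32 hex characters
  startMs: number;
  endMs: number;
  isError: boolean;
}

type Decision = "keep:error" | "keep:slow" | "keep:sampled" | "drop";

function decideTrace(
  spans: SpanSummary[],
  baselineRatio = 0.05,
  slowMs = 1000
): Decision {
  if (spans.length === 0) return "drop";

  let start = Infinity;
  let end = -Infinity;
  let sawError = false;
  for (const span of spans) {
    start = Math.min(start, span.startMs);
    end = Math.max(end, span.endMs);
    sawError = sawError || span.isError;
  }

  if (sawError) return "keep:error";
  if (end - start > slowMs) return "keep:slow";

  // Map the first 8 hex characters of the trace ID onto [0, 1).
  const bucket = parseInt(spans[0].traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < baselineRatio ? "keep:sampled" : "drop";
}
```

Note what the function needs as input: every span of the trace, already collected. That requirement is where both the memory growth and the extra export latency come from.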

A head-based config that gets you 80% of the value

If you haven't started yet, do this first. It's boring, it works, and you can always evolve later.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
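  # Trim noisy spans before they hit your backend.
  # A span matching any condition below is dropped.
  # Note: newer HTTP semantic conventions emit url.path instead of
  # http.target, so match on whichever attribute your SDKs actually set.
  filter/drop_health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/metrics"'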

exporters:
  otlp/backend:
    endpoint: ingest.your-vendor.tld:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop_health, batch]
      exporters: [otlp/backend]

The sampling decision happens in the SDK, not the collector. In Node:

import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1), // 10% of new traces
});

const sdk = new NodeSDK({
  sampler,
  // ...exporter, resource, instrumentations
});

sdk.start();

ParentBasedSampler is the important bit. It respects the upstream decision if there's a traceparent header, so a single trace stays consistent end-to-end. Without it you get half-sampled traces with holes that look like bugs.

When 10% isn't enough

For most backends, 10% gives you plenty for performance work and capacity planning. Where it falls down: rare errors. If your error rate is 0.1% and you're sampling at 10%, you're keeping 0.01% of all traffic as error traces. On 10M requests/day that's 1,000 error traces — fine. On 100k requests/day it's 10. Not fine.
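The arithmetic is worth keeping around as a helper so you can plug in your own traffic. A small sketch follows. The function names are ours, and the second helper inverts the calculation to ask what ratio you would need to retain a target number of error traces per day.

```typescript
// Illustrative only: how many error traces survive uniform head sampling?
function expectedErrorTraces(
  requestsPerDay: number,
  errorRate: number,
  sampleRatio: number
): number {
  return requestsPerDay * errorRate * sampleRatio;
}

// Inverse question: what ratio keeps `targetErrorTraces` per day on average?
// Capped at 1 because you cannot keep more than everything.
function ratioForTarget(
  requestsPerDay: number,
  errorRate: number,
  targetErrorTraces: number
): number {
  const errorsPerDay = requestsPerDay * errorRate;
  if (errorsPerDay <= 0) return 1;
  return Math.min(1, targetErrorTraces / errorsPerDay);
}

// The two cases from the text: 0.1% errors, 10% sampling.
const highTraffic = expectedErrorTraces(10_000_000, 0.001, 0.1); // about 1,000
const lowTraffic = expectedErrorTraces(100_000, 0.001, 0.1); // about 10
```

On the 100k requests/day service there are only about 100 errors a day in total, so keeping even 50 of them would mean sampling around half of all traffic. Uniform ratios are a poor tool for rare events on low-volume services.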

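The cheap fix is error-biased head sampling: errors get a second chance even when the ratio sampler said no. One wrinkle is that a head sampler runs when the span starts, before any status code or exception exists, so the sampler itself can't react to a failure. In practice this means returning a record-only decision (RECORD in the Node SDK, RecordOnly in Go) instead of dropping, then using a span processor to export the root span anyway when it ends with a non-2xx response or an exception. You're still head-based: downstream services saw an unsampled traceparent and recorded nothing, so you can't catch errors that happen deep in the trace this way, but you'll catch most of what matters.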

When tail-based is actually worth the operational cost

We didn't rip out tail sampling everywhere. We kept it for two cases:

  1. Checkout and payment flows. Low volume, high value per trace, and the questions we ask are "why was this specific user slow" — exactly what tail sampling is good at. We run a small dedicated collector pair for these services with maybe 2GB of memory each. Totally fine.
  2. Async pipelines with long tails. Background workers where a tiny fraction of jobs take 100x the median. Head sampling misses these by definition; tail sampling catches them by design.

For the rest — the chatty internal RPC mesh, the read-heavy product APIs, the static asset proxies — head-based at 5–10% plus aggressive filter processors is the right call.

The two-tier collector pattern, if you must

If you do go tail-based at scale, here's the shape:

  apps  →  [load-balanced front collectors]  →  [trace-ID-sharded back collectors]  →  backend
           (stateless, autoscale freely)        (stateful, scale carefully)

The front layer uses the loadbalancing exporter with routing_key: traceID. The back layer runs the tail_sampling processor. You size the back tier for your worst burst, not your average, because OOMs there cause data loss for everyone, not just one node.

Budget roughly: average span size × spans per second × decision window in seconds × safety factor of 3. Our back tier ended up at 4 nodes of 8GB each for ~5k spans/sec sustained. Your mileage will absolutely vary.
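As a worked version of that budget formula, here is a sketch with made-up inputs. The 4 KiB average span size and the 20k spans/sec burst rate are assumptions for illustration, not measurements from our fleet, and the result covers buffered span payload only, so treat it as a floor.

```typescript
// Illustrative only: rough memory budget for the whole tail-sampling back tier.
function backTierBufferBytes(
  avgSpanBytes: number,
  spansPerSecond: number, // use your burst rate, not the sustained average
  decisionWindowSeconds: number,
  safetyFactor = 3
): number {
  return avgSpanBytes * spansPerSecond * decisionWindowSeconds * safetyFactor;
}

// How many nodes if each can spare `usableBytesPerNode` for buffering?
function nodesNeeded(totalBytes: number, usableBytesPerNode: number): number {
  return Math.ceil(totalBytes / usableBytesPerNode);
}

const GIB = 1024 ** 3;
// Hypothetical inputs: 4 KiB spans, 20k spans/sec burst, 30s window, 3x safety.
const budget = backTierBufferBytes(4096, 20_000, 30);
console.log(`${(budget / GIB).toFixed(1)} GiB across the tier`);
console.log(`${nodesNeeded(budget, 4 * GIB)} nodes at 4 GiB usable each`);
```

The decision window multiplies everything, so halving decision_wait halves the buffer. That is usually a cheaper lever than adding nodes, as long as your traces actually complete within the shorter window.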

What goes wrong in production (so you can plan for it)

Things that have hurt us or clients we've helped:

  • Schema drift between services. Service A sets user.id, service B sets userId. Your tail policies key on attributes — inconsistent attributes mean inconsistent decisions. Enforce a span attribute schema in code review, or use the transform processor to normalise.
  • Long-running spans blowing the decision window. If a span lasts 60s and your tail window is 30s, the decision fires before the span closes. The processor will log a warning and you'll wonder why your slow traces are missing. Tune decision_wait to your p99 trace duration plus headroom.
  • Collector restarts losing in-flight traces. Rolling a tail-sampling collector drops whatever's in the buffer. Do it during low traffic, and don't be surprised by a temporary trace gap.
  • SDK-side BatchSpanProcessor queue overflow under load. Defaults are conservative. Bump maxQueueSize and maxExportBatchSize if you see BatchSpanProcessor dropping spans warnings — that's data loss before the collector even sees it.
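For the schema-drift item above, the normalisation step can be sketched as a plain function. This is an application-side stand-in for what the transform processor would do in the collector, not a replacement for it. The alias table and function name are hypothetical.

```typescript
// Illustrative only: fold drifted attribute keys onto one canonical name.
type Attributes = Record<string, string | number | boolean>;

// Hypothetical alias table: drifted key -> canonical key.
const ATTRIBUTE_ALIASES = new Map<string, string>([
  ["userId", "user.id"],
  ["user_id", "user.id"],
]);

function normaliseAttributes(attrs: Attributes): Attributes {
  const out: Attributes = {};
  for (const key of Object.keys(attrs)) {
    const canonical = ATTRIBUTE_ALIASES.get(key) ?? key;
    // If the canonical key is also present, its own value wins over the alias.
    const canonicalPresent = Object.prototype.hasOwnProperty.call(attrs, canonical);
    if (canonical !== key && canonicalPresent) continue;
    out[canonical] = attrs[key];
  }
  return out;
}
```

Whichever layer does the renaming, the point is that tail policies only ever see one spelling, so a rule keyed on user.id applies to every service's spans.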

Where we'd start

If you're standing up OTel in 2026 and your trace volume is anywhere under a few hundred million spans a day, do this in order:

  1. Head-based probabilistic sampling at 10%, ParentBasedSampler everywhere, no exceptions.
  2. Filter processors in the collector to drop health checks, metrics scrapes, and known-noisy paths. This is free volume reduction.
  3. Error-biased sampling in your SDKs so rare errors aren't lost.
  4. Only after all that — and only for the services where you can name the specific question tail sampling answers — stand up a sharded tail-sampling tier.

The goal isn't to keep every interesting trace. It's to spend collector memory and engineer attention where they actually change an outcome. If you want a second pair of eyes on a rollout, our team does this kind of work as part of our DevOps and platform engagements.

