Observability (critical)

Distributed Tracing (OpenTelemetry)

Distributed tracing tracks a request as it flows across multiple services by propagating a trace context (trace ID + span ID) through all calls. OpenTelemetry is the vendor-neutral standard for instrumentation.

Memory anchor

Distributed tracing = a GPS tracker on a package. You can see every warehouse, truck, and sorting facility it passed through — and exactly how long it sat at each stop.

Expected depth

A trace is a collection of spans. A span represents a single unit of work (an HTTP handler, a DB query, processing a Kafka message) with timing, attributes, and status. The trace context is propagated via HTTP headers (the W3C Trace Context standard's traceparent header) or gRPC metadata. OpenTelemetry SDKs auto-instrument popular frameworks (Spring, Express, Django) with zero code changes. Traces are exported to backends (Jaeger, Zipkin, Tempo, Datadog, Honeycomb) for visualization. Use traces to find which service is the bottleneck in a slow request, to understand the causal chain of a production error, and to measure the performance impact of a code change.
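To make the propagation mechanics concrete, here is a minimal sketch (stdlib only, not the OpenTelemetry SDK) that parses the W3C traceparent header into its four fields. The header layout — version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags — follows the W3C Trace Context spec; the helper name `parse_traceparent` is illustrative.

```python
import re

# W3C Trace Context "traceparent" header: version-traceid-spanid-flags
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, or raise ValueError."""
    m = TRACEPARENT_RE.match(header.strip())
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    fields = m.groupdict()
    # Flag bit 0 = "sampled": the upstream service chose to record this trace.
    fields["sampled"] = int(fields["flags"], 16) & 0x01 == 1
    return fields

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["span_id"], ctx["sampled"])
```

In practice the SDK's propagator does this for you on every inbound request, then starts a child span under the extracted trace ID so the downstream work joins the same trace.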

Deep — senior internals

Sampling is critical: tracing 100% of requests at high QPS is expensive in terms of storage and processing. Head-based sampling (decide at trace root whether to sample) is simple but loses traces for rare errors. Tail-based sampling (buffer trace data and decide after the trace is complete, favoring errored or slow traces) is more complex but captures the high-value traces. OpenTelemetry Collector supports tail sampling processors. Trace context must be propagated through async boundaries: when a service publishes a Kafka message, it must embed the trace context in the message headers; the consumer must extract it and create a child span. Without this, traces are fragmented and async hops are invisible.

🎤 Interview-ready answer

I instrument all services with the OpenTelemetry SDK using auto-instrumentation, which captures the vast majority of spans without code changes. I configure the OTel Collector for tail-based sampling: always sample error traces and traces slower than 2s, and sample 5% of normal traffic. For Kafka async boundaries, I use the OTel Kafka instrumentation, which automatically propagates trace context in message headers. Traces go to Tempo for storage with Grafana for visualization, integrated with the same dashboards as metrics and logs.
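The sampling policy in this answer maps onto the Collector's tail_sampling processor roughly as follows. This is a sketch, not a full Collector config (the receivers/exporters and exact policy values are assumptions to match the numbers above):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

Policies are ORed: a trace is kept if any policy matches, so errored and slow traces always survive while normal traffic is thinned to 5%.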

Common trap

Conflating distributed tracing with logging. Logs record discrete events; traces record the causal structure and timing of a request across services. Both are necessary — traces tell you where time is spent; logs tell you what happened at each step.

Related concepts