Monitoring & Observability (medium)

AWS X-Ray

X-Ray traces requests as they travel through a distributed system, capturing timing, metadata, and errors at each service hop. It builds a service map and per-trace timelines that help pinpoint performance bottlenecks and error sources.

Memory anchor

X-Ray is a package tracker for your requests — you see every warehouse (service) it passed through, how long it sat there, and where it got delayed or damaged. Annotations are 'fragile' stickers you add to find specific packages later.

Expected depth

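Instrumentation: X-Ray SDK (code-level), X-Ray daemon (a sidecar process on EC2/ECS that receives segments over UDP port 2000 and relays them to the X-Ray API), and Lambda active tracing (enabled via console or function config). Concepts: a Trace is the end-to-end request, a Segment is one service's contribution to it, and a Subsegment is a unit of work inside a segment (function call, SQL query, external HTTP call). Annotations are key-value pairs that are indexed and searchable in filter expressions; metadata is stored with the trace but not indexed. Sampling rules capture a fraction of requests to manage cost; the default rule records the first request each second plus 5% of the rest. Service map: a visual graph of service dependencies with error rates and latency distributions.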
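The concepts above map onto the JSON segment document that the SDK hands to the daemon. What follows is a minimal standard-library sketch of that shape, not the SDK itself: the helper names (`new_trace_id`, `segment`, `subsegment`, `daemon_payload`) and the `checkout-service` / `order_id` values are hypothetical, and only a few of the documented fields are shown.

```python
import json
import os
import time


def new_trace_id(now=None):
    """Trace ID: version "1", 8 hex digits of epoch seconds, 24 random hex digits."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_id():
    """Segment and subsegment IDs are 16 hex digits."""
    return os.urandom(8).hex()


def subsegment(name, start, end, namespace=None):
    """One unit of work inside a segment, e.g. a SQL query or outbound HTTP call."""
    doc = {"id": new_id(), "name": name, "start_time": start, "end_time": end}
    if namespace:
        doc["namespace"] = namespace  # "remote" for external calls, "aws" for AWS SDK calls
    return doc


def segment(name, trace_id, start, end, annotations=None, metadata=None, subsegments=None):
    """One service's contribution to a trace."""
    doc = {"id": new_id(), "name": name, "trace_id": trace_id,
           "start_time": start, "end_time": end}
    if annotations:
        doc["annotations"] = annotations  # indexed, usable in filter expressions
    if metadata:
        doc["metadata"] = metadata        # stored with the trace, not indexed
    if subsegments:
        doc["subsegments"] = subsegments
    return doc


def daemon_payload(segment_doc):
    """Bytes for one UDP datagram to the daemon: a header line, then the segment JSON."""
    header = json.dumps({"format": "json", "version": 1})
    return (header + "\n" + json.dumps(segment_doc)).encode("utf-8")


# Hypothetical example: a checkout request that runs one SQL query.
t0 = time.time()
example = segment(
    "checkout-service", new_trace_id(t0), t0, t0 + 0.120,
    annotations={"order_id": "A-1001"},
    metadata={"default": {"cart_size": 3}},
    subsegments=[subsegment("SELECT orders", t0 + 0.010, t0 + 0.090)],
)
print(json.dumps(example, indent=2))
```

In a real service the X-Ray SDK builds and sends these documents for you; the sketch only makes the nesting of trace, segment, and subsegment visible, and shows why `order_id` belongs in annotations (you will search by it) while bulky detail belongs in metadata.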

Deep — senior internals

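X-Ray groups are saved filter expressions (which can reference annotation values) that scope traces for targeted analysis. X-Ray Insights detects anomalies such as rising fault rates, records each one as an insight, and can send notifications through Amazon EventBridge, much like an alarm. Trace context propagates across service boundaries in the X-Amzn-Trace-Id HTTP header. For Lambda, X-Ray captures the invocation overhead (initialization, queuing) in addition to your code's segments. Integrations with API Gateway, SNS, and SQS record or propagate trace context without application code changes. ALB adds the X-Amzn-Trace-Id header to requests but does not itself send segments to X-Ray, and calls made by your own code still need SDK instrumentation to appear as subsegments. CloudWatch ServiceLens combines X-Ray service maps with CloudWatch metrics and logs in a single console view.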
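Header propagation is easier to reason about with the format in view: the X-Amzn-Trace-Id value is a semicolon-separated list of `Key=Value` fields, typically `Root` (the trace ID), `Parent` (the upstream segment ID), and `Sampled`. The following is a minimal parsing sketch; the function names are hypothetical and the SDK normally does this for you.

```python
def parse_trace_header(value):
    """Split an X-Amzn-Trace-Id value into its Root, Parent, and Sampled fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:  # ignore fragments that are not Key=Value
            fields[key] = val
    return {
        "root": fields.get("Root"),
        "parent": fields.get("Parent"),
        # "1" means sampled, "0" means not sampled, anything else means undecided
        "sampled": {"1": True, "0": False}.get(fields.get("Sampled")),
    }


def format_trace_header(root, parent=None, sampled=None):
    """Build the header value to attach to an outgoing request."""
    parts = [f"Root={root}"]
    if parent is not None:
        parts.append(f"Parent={parent}")
    if sampled is not None:
        parts.append(f"Sampled={1 if sampled else 0}")
    return ";".join(parts)


incoming = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
ctx = parse_trace_header(incoming)
print(ctx["root"], ctx["parent"], ctx["sampled"])
```

A downstream service keeps the same `Root`, sets `Parent` to its own segment ID on outbound calls, and honours the upstream `Sampled` decision so that a trace is either recorded end to end or not at all.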

🎤 Interview-ready answer

X-Ray is essential for debugging latency in distributed and microservice architectures, where a single user request touches many services. I enable active tracing on Lambda, run the X-Ray daemon as a sidecar on ECS, add custom subsegments around database calls and external HTTP requests, and use annotations for the values I need to search by (user ID, order ID). Within seconds, the service map shows me where in the call chain latency or errors originate.

Common trap

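Sampling every request in production at high traffic. X-Ray bills per trace recorded (and per trace retrieved or scanned), so 100% sampling at 10,000 RPS, roughly 864 million traces a day, becomes expensive. Set sampling rules with a low fixed rate such as 5% plus a small reservoir. Note that the sampling decision is made when the request starts, before its outcome is known, so a rule cannot select "100% of error requests"; rules match on request attributes such as service name, HTTP method, and URL path. To capture more failures, raise the rate on the routes where errors matter most.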
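To put rough numbers on this trap, the sketch below estimates daily trace volume under a reservoir-plus-fixed-rate rule (the shape of X-Ray's default rule: the first request each second plus 5% of the rest). The function names are hypothetical, and the price passed to `daily_cost` is a placeholder argument, not a quoted AWS price; check the current pricing page before relying on it.

```python
SECONDS_PER_DAY = 86_400


def sampled_per_second(rps, reservoir=1, fixed_rate=0.05):
    """Expected traces recorded per second: the reservoir first, then a fraction of the rest."""
    guaranteed = min(rps, reservoir)
    return guaranteed + (rps - guaranteed) * fixed_rate


def traces_per_day(rps, reservoir=1, fixed_rate=0.05):
    """Expected traces recorded per day at a steady request rate."""
    return sampled_per_second(rps, reservoir, fixed_rate) * SECONDS_PER_DAY


def daily_cost(traces, price_per_million):
    """Recording cost for a day; price_per_million is an assumption supplied by the caller."""
    return traces / 1_000_000 * price_per_million


rps = 10_000
full = traces_per_day(rps, fixed_rate=1.0)  # 100% sampling
default = traces_per_day(rps)               # reservoir of 1 plus 5%
print(f"100% sampling: {full:,.0f} traces/day")
print(f"1/s + 5%:      {default:,.0f} traces/day")
print(f"reduction:     {full / default:.1f}x")
```

Whatever the per-trace price turns out to be, the ratio between the two volumes is what drives the bill, and a 5% rule still leaves tens of millions of traces a day at this request rate, which is ample for latency analysis.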