Monitoring & Observability (critical)

CloudWatch

CloudWatch is AWS's observability service for collecting metrics and logs and for alarming on them. It aggregates data from AWS services and custom applications, enabling dashboards, alerting, and log analysis.

Memory anchor

CloudWatch is the hospital's patient monitoring system — ECG (metrics), nurse notes (logs), crash cart alarm (alarms), and scheduled check-ins (synthetics). You get paged when something abnormal is detected.

Expected depth

Metrics: each metric is identified by its namespace, metric name, and dimensions. Standard resolution (1-minute granularity) and high resolution (1-second). Custom metrics are published via the PutMetricData API or the CloudWatch Agent. Alarms: trigger actions (SNS notification, Auto Scaling, EC2 action) when metrics cross thresholds. Log Groups and Log Streams: storage for application logs, where a log group holds the log streams from individual sources. CloudWatch Logs Insights: a purpose-built query language with piped commands (fields, filter, stats, sort) for log analysis. Container Insights: ECS/EKS metrics and logs. Lambda Insights: enhanced Lambda metrics (duration, memory, cold starts).
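As a rough illustration of how namespace, metric name, dimensions, and resolution fit together, the sketch below builds one data point in the shape PutMetricData expects. The namespace "MyApp/Checkout", the metric "CheckoutLatency", and the dimension values are hypothetical; the actual send is left commented out because it needs boto3 and AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions, high_resolution=False):
    """Build one MetricData entry for the PutMetricData API."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the data points; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical metric: checkout latency for one service in one stage.
datum = build_metric_datum(
    name="CheckoutLatency",
    value=182.5,
    unit="Milliseconds",
    dimensions={"Service": "checkout", "Stage": "prod"},
    high_resolution=True,
)
# publish("MyApp/Checkout", [datum])  # uncomment in an environment with AWS access
```

Each distinct combination of dimension values is stored as a separate metric, so dimension sets are usually kept small and low-cardinality.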

Deep — senior internals

CloudWatch Metrics Insights enables SQL-style queries across all CloudWatch metrics for fleet-wide analysis. Anomaly Detection uses ML to establish a baseline and alerts on deviations, which works better than static thresholds for seasonal workloads. CloudWatch Evidently provides feature flagging and A/B testing with metrics. Embedded Metric Format (EMF) lets Lambda and containers write structured JSON logs that CloudWatch automatically converts to metrics. For high-frequency metrics this avoids per-call PutMetricData charges, although log ingestion and custom metric charges still apply. Metric Math enables calculated metrics (e.g., error rate = errors/requests × 100) without storing derived metrics separately. CloudWatch Synthetics runs canary scripts (Node.js with Puppeteer, or Python with Selenium) against endpoints to proactively detect outages before users do.
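To make the EMF idea concrete, here is a minimal sketch that builds one EMF log line with only the standard library. The namespace "MyApp/Orders", the "Service" dimension, and the "OrderLatency" metric are hypothetical names chosen for illustration. In Lambda, printing such a line to stdout is enough for it to reach CloudWatch Logs, where the metric is extracted.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, properties=None):
    """Return one EMF log line as a JSON string.

    dimensions: dict of dimension name -> value
    metrics:    dict of metric name -> (value, unit)
    properties: optional extra fields that stay searchable in the log
                but are not turned into metrics
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one dimension set
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        },
    }
    # Dimension and metric values live at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    record.update(properties or {})
    return json.dumps(record)


line = emf_record(
    namespace="MyApp/Orders",
    dimensions={"Service": "orders"},
    metrics={"OrderLatency": (42.0, "Milliseconds")},
    properties={"requestId": "example-request-id"},
)
print(line)
```

High-cardinality values such as request IDs are best passed as properties rather than dimensions, since every unique dimension value creates a new metric.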

🎤 Interview-ready answer

CloudWatch is my primary observability platform on AWS. I use the CloudWatch Agent to collect system metrics and structured logs, Logs Insights to query logs during incidents, and Alarms tied to SNS for alerting. For Lambda, I enable Lambda Insights for cold start and memory metrics. I use Embedded Metric Format for high-frequency custom metrics from Lambda to avoid per-call PutMetricData charges. Synthetics canaries run every minute to catch endpoint outages proactively.

Common trap

Creating alarms only on error counts, not error rates. During traffic spikes, error counts naturally increase even if the error rate is stable. A rate-based alarm (errors / requests) avoids false pages during healthy traffic growth.
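A rate-based alarm of this kind can be sketched with Metric Math. The function below only builds the arguments for the PutMetricAlarm API. It assumes an Application Load Balancer as the metric source (the AWS/ApplicationELB metrics HTTPCode_Target_5XX_Count and RequestCount), and the alarm name, load balancer identifier, threshold, and SNS topic ARN are placeholders.

```python
def error_rate_alarm(name, load_balancer, threshold_pct, sns_topic_arn):
    """Build PutMetricAlarm arguments for a 5xx error-rate alarm."""

    def stat(metric_id, metric_name):
        # Raw input metric: summed per minute, not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "LoadBalancer", "Value": load_balancer}
                    ],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": name,
        "Metrics": [
            stat("errors", "HTTPCode_Target_5XX_Count"),
            stat("requests", "RequestCount"),
            {
                # Metric Math expression; this is the series the alarm watches.
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5xx error rate (%)",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": float(threshold_pct),
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,  # 3 of the last 5 minutes must breach
        "TreatMissingData": "notBreaching",  # periods without data do not breach
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder identifiers; substitute real ones before use.
alarm = error_rate_alarm(
    name="checkout-5xx-error-rate",
    load_balancer="app/example-alb/0123456789abcdef",
    threshold_pct=1.0,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-oncall",
)
# With boto3 and credentials available:
# import boto3; boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

With this definition, 10 errors in 1,000 requests and 100 errors in 10,000 requests both evaluate to 1%, so a tenfold traffic increase at a stable error rate does not page anyone.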