Step Functions
Step Functions orchestrates multi-step workflows as state machines. Each step can invoke Lambda, call AWS APIs, run ECS tasks, or wait for external input. It provides built-in error handling, retries, and parallel execution.
Step Functions is like a factory assembly line: each station (state) does its job and passes the widget (data) to the next, and if a station breaks (error), the floor supervisor (retry/catch) handles it instead of everything grinding to a halt.
Workflow types: Standard (up to 1 year, exactly-once execution, auditable execution history) and Express (up to 5 minutes, at-least-once when started asynchronously or at-most-once when started synchronously, high throughput). States: Task, Choice, Parallel, Map, Wait, Pass, Succeed, Fail. Service integrations (optimized integrations and AWS SDK integrations): Step Functions calls AWS APIs directly without Lambda, which reduces cost and latency. Callback pattern (Standard workflows only): a task pauses until an external system calls SendTaskSuccess or SendTaskFailure with a task token. Useful for human approval workflows.
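As a rough illustration of a direct service integration followed by a callback task, the sketch below builds an Amazon States Language definition as a Python dict. The table name, queue URL, field names, and state names are hypothetical; the resource ARNs follow the `arn:aws:states:::service:action` form, and the `.waitForTaskToken` suffix marks the callback pattern.

```python
import json

# Hypothetical names, chosen only for illustration.
ORDERS_TABLE = "Orders"
APPROVAL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/approvals"

definition = {
    "Comment": "Sketch: direct service integration followed by a callback task",
    "StartAt": "SaveOrder",
    "States": {
        # Direct DynamoDB call, no Lambda wrapper in between.
        "SaveOrder": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": ORDERS_TABLE,
                "Item": {"orderId": {"S.$": "$.orderId"}},
            },
            "ResultPath": None,  # serialized as null: discard the API response, keep the input
            "Next": "WaitForApproval",
        },
        # Callback pattern: send the task token out, then pause until
        # SendTaskSuccess or SendTaskFailure is called with that token.
        "WaitForApproval": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
            "Parameters": {
                "QueueUrl": APPROVAL_QUEUE_URL,
                "MessageBody": {
                    "orderId.$": "$.orderId",
                    "taskToken.$": "$$.Task.Token",  # token read from the context object
                },
            },
            "TimeoutSeconds": 86400,  # assumed limit: give up after one day
            "Next": "Approved",
        },
        "Approved": {"Type": "Succeed"},
    },
}

asl_json = json.dumps(definition, indent=2)
```

Whatever consumes the queue message (an approval UI, for instance) would later return the token via SendTaskSuccess or SendTaskFailure to resume the execution.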
Standard workflows are billed per state transition, so complex workflows with many states can become expensive. Express Workflows are billed by number of executions, duration, and memory, which is usually cheaper for high-frequency, short-duration processes (order processing, IoT). Map state iterates over an array and processes its items concurrently (fan-out). Distributed Map processes millions of items from S3 (for example a JSON or CSV file, a listing of objects, or an S3 Inventory manifest), running child executions in parallel at massive scale. Step Functions integrates with EventBridge for async triggering and X-Ray for end-to-end tracing. Error handling: Retry and Catch on Task, Parallel, and Map states, with configurable exponential backoff (IntervalSeconds, BackoffRate, MaxAttempts). Step Functions is the answer to 'how do I coordinate multiple Lambda functions without chaining them': direct invocation chains are brittle and hard to monitor.
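To make the Retry/Catch configuration concrete, here is a minimal sketch of a Task state with one retrier and one catcher, plus a small helper that computes the wait before each retry. The function name `charge-card` and the state names are hypothetical, and the helper deliberately ignores the optional MaxDelaySeconds and JitterStrategy fields.

```python
def retry_delays(retrier):
    """Return the wait in seconds before each retry attempt for one retrier."""
    interval = retrier.get("IntervalSeconds", 1)  # Amazon States Language default
    rate = retrier.get("BackoffRate", 2.0)        # Amazon States Language default
    attempts = retrier.get("MaxAttempts", 3)      # Amazon States Language default
    # Wait before retry n (counting from 0) is IntervalSeconds * BackoffRate ** n.
    return [interval * rate ** n for n in range(attempts)]


# Hypothetical Task state: retry transient failures, then route to a handler.
charge_card = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "charge-card", "Payload.$": "$"},
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "BackoffRate": 2.0,
            "MaxAttempts": 3,
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",  # keep the original input, attach error details
            "Next": "NotifyFailure",
        }
    ],
    "Next": "ShipOrder",
}

print(retry_delays(charge_card["Retry"][0]))  # [2.0, 4.0, 8.0]
```

Only after the retries are exhausted does the catcher fire and send the execution to `NotifyFailure`.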
Step Functions orchestrates long-running multi-step workflows where I need visibility, retry logic, and parallel execution. I use it for order fulfillment pipelines, ML workflows, and human approval loops (callback pattern with task tokens). SDK integrations let me call AWS services directly without Lambda wrappers. I choose Express Workflows for high-volume short workflows to control costs. The visual execution history makes debugging distributed workflows vastly easier than tracing Lambda chains.
Avoid using Lambda chaining (Lambda calling Lambda) for multi-step workflows. If step 3 of 7 fails, there is no central record of which step failed or why, and error handling requires custom code in every Lambda. Step Functions provides both out of the box.
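One way to picture the alternative to a Lambda chain: a hedged sketch that turns an ordered list of function names into a linear state machine in which every step has its own retry and routes failures to a shared Fail state. The helper `build_chain`, the state name `WorkflowFailed`, and the function names are all hypothetical, and function names are assumed to be unique.

```python
import json


def build_chain(function_names):
    """Build a linear state machine: one Task per Lambda, one shared failure state."""
    if not function_names:
        raise ValueError("need at least one step")
    states = {}
    for i, name in enumerate(function_names):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": name, "Payload.$": "$"},
            "OutputPath": "$.Payload",  # pass only the function's return value onward
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "WorkflowFailed"}],
        }
        if i + 1 < len(function_names):
            state["Next"] = function_names[i + 1]  # hand off to the following step
        else:
            state["End"] = True  # last step ends the execution
        states[name] = state
    states["WorkflowFailed"] = {"Type": "Fail", "Error": "StepFailed"}
    return {"StartAt": function_names[0], "States": states}


definition = build_chain(["validate-order", "charge-card", "ship-order"])
print(json.dumps(definition, indent=2))
```

Because each step is a named state, the execution history shows which state failed and with what error, without any function needing to know about the next one.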