Long-running operations like cluster initialization and deployment rollouts are now powered by Hatchet, an open-source durable execution engine.
We needed a workflow engine that could handle multi-step infrastructure operations reliably. Hatchet stood out for a few reasons:
Cluster setup, deployment rollouts, and resource cleanup now run as Hatchet workflows. Each step has its own retry policy and timeout, so transient failures (like a cloud API being slow) are handled automatically without failing the entire operation.
Progress is tracked per-step, so the UI always reflects exactly where things stand.