# Protocol Overview
Stepflow uses a queue-based task protocol to coordinate work between the orchestrator and workers. The orchestrator dispatches tasks onto named queues; workers pull tasks from those queues, execute them, and report results back. This design decouples the orchestrator from workers, enabling independent scaling, multiple transport backends, and fault-tolerant execution.
## Architecture
- The orchestrator executes a workflow and determines which component to invoke for each step.
- It applies routing rules to map the component path to a plugin, which determines the queue the task is published on.
- A worker listening on that queue picks up the task, executes the component, and reports the result back to the orchestrator.
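The flow above can be sketched as a toy in-memory model. All names here are illustrative, not the actual Stepflow API:

```python
from collections import deque

# Toy in-memory model of the dispatch flow; names are illustrative,
# not the actual Stepflow API.
queues: dict[str, deque] = {}

def dispatch(component_path: str, task: dict, routes: dict[str, str]) -> str:
    """Orchestrator side: resolve the queue for a component and publish the task."""
    queue_name = routes.get(component_path, "default")
    queues.setdefault(queue_name, deque()).append(task)
    return queue_name

def pull(queue_name: str):
    """Worker side: take the next task from the queue it listens on."""
    q = queues.get(queue_name)
    return q.popleft() if q else None

routes = {"/python/hello": "python"}      # component path -> queue name
dispatch("/python/hello", {"step": "s1"}, routes)
task = pull("python")                     # a worker on "python" receives it
```

The orchestrator never talks to a worker directly; the queue is the only coupling point, which is what makes independent scaling possible.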
## Queue Backends
Stepflow supports multiple queue backends through its plugin system. All backends share the same task lifecycle — the only difference is how tasks are transported between the orchestrator and workers.
| Backend | Plugin Type | Transport | Best For |
|---|---|---|---|
| gRPC | type: grpc | PullTasks streaming RPC | Single-orchestrator deployments, local development |
| NATS | type: nats | JetStream consumers | Multi-orchestrator, horizontal scaling, durable queues |
Future backends can be added by implementing the queue plugin interface.
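What that interface must provide can be sketched as follows. This is a hypothetical Python sketch for illustration; the real plugin interface lives in the Rust crates and may differ:

```python
from abc import ABC, abstractmethod

class QueueBackend(ABC):
    """Hypothetical sketch of a queue backend's responsibilities;
    the real plugin interface lives in the stepflow-rs crates."""

    @abstractmethod
    def publish(self, queue_name: str, task: bytes) -> None:
        """Orchestrator side: enqueue a serialized task."""

    @abstractmethod
    def pull(self, queue_name: str):
        """Worker side: dequeue the next task, or None if the queue is empty."""

class InMemoryBackend(QueueBackend):
    """A trivial backend: same task lifecycle, in-process transport."""

    def __init__(self) -> None:
        self._queues: dict[str, list[bytes]] = {}

    def publish(self, queue_name: str, task: bytes) -> None:
        self._queues.setdefault(queue_name, []).append(task)

    def pull(self, queue_name: str):
        q = self._queues.get(queue_name)
        return q.pop(0) if q else None

backend = InMemoryBackend()
backend.publish("python", b"task-1")
```

Because only transport varies, swapping gRPC for NATS changes the `publish`/`pull` plumbing but none of the lifecycle semantics.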
### gRPC Queues
Workers connect to the orchestrator's TasksService and call PullTasks to open a long-lived stream. The orchestrator pushes task assignments through the stream as they become available.
```yaml
plugins:
  python:
    type: grpc
    queueName: python
    command: uv
    args: ["run", "stepflow_py", "--grpc"]
```
### NATS Queues
Tasks are published to NATS JetStream subjects. Workers run as durable consumers, enabling fan-out across multiple orchestrator instances and surviving restarts without task loss.
```yaml
plugins:
  python:
    type: nats
    natsUrl: "nats://localhost:4222"
    stream: STEPFLOW_TASKS
```
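The durability property can be illustrated with a toy ack model. This is a simplification of JetStream semantics, not the NATS client API: a task stays in flight until acknowledged, and unacked tasks are redelivered after a worker restart.

```python
class DurableQueue:
    """Toy model of ack-based durability (a simplification of JetStream
    semantics, not the NATS client API)."""

    def __init__(self) -> None:
        self.backlog: list[tuple[str, dict]] = []
        self.in_flight: dict[str, dict] = {}

    def publish(self, task_id: str, task: dict) -> None:
        self.backlog.append((task_id, task))

    def pull(self):
        if not self.backlog:
            return None
        task_id, task = self.backlog.pop(0)
        self.in_flight[task_id] = task  # held until acked
        return task_id, task

    def ack(self, task_id: str) -> None:
        self.in_flight.pop(task_id, None)

    def redeliver_unacked(self) -> None:
        """What the server does when a consumer dies without acking."""
        self.backlog.extend(self.in_flight.items())
        self.in_flight.clear()

q = DurableQueue()
q.publish("t1", {"step": "s1"})
q.pull()                 # a worker takes the task...
q.redeliver_unacked()    # ...but crashes before acking: no task loss
```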
## Named Queues and Routing
Each plugin defines a default queue name (via queueName). Routes can override this per-path, allowing a single plugin to serve multiple worker pools:
```yaml
plugins:
  workers:
    type: grpc
    queueName: default

routes:
  "/ml/{*component}":
    - plugin: workers
      params:
        queueName: gpu  # Override: route to GPU worker pool
  "/python/{*component}":
    - plugin: workers  # Uses default queue name
  "/{*component}":
    - plugin: builtin
```
Workers subscribe to a specific queue name and only receive tasks routed to that queue.
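The queue-name resolution rule (route param overrides plugin default) can be sketched as a pure function. This uses naive prefix matching for illustration; the real matching semantics live in the orchestrator and may differ:

```python
def resolve_queue(path: str, routes: dict, plugins: dict) -> str:
    """Illustrative route resolution: first matching route wins, and a
    route's queueName param overrides the plugin's default queue name."""
    for pattern, rules in routes.items():
        prefix = pattern.split("{", 1)[0]   # "/ml/{*component}" -> "/ml/"
        if path.startswith(prefix):
            rule = rules[0]
            default = plugins[rule["plugin"]].get("queueName", "default")
            return rule.get("params", {}).get("queueName", default)
    raise LookupError(f"no route matches {path}")

plugins = {"workers": {"queueName": "default"}, "builtin": {}}
routes = {
    "/ml/{*component}": [{"plugin": "workers", "params": {"queueName": "gpu"}}],
    "/python/{*component}": [{"plugin": "workers"}],
    "/{*component}": [{"plugin": "builtin"}],
}
```

With these rules, `/ml/train` resolves to the `gpu` queue while `/python/hello` falls back to the plugin's `default` queue.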
## Worker Configuration
Workers discover the orchestrator and queue configuration through environment variables. When the orchestrator launches a worker as a subprocess, it sets these automatically:
| Variable | Description |
|---|---|
| STEPFLOW_TRANSPORT | Queue backend: grpc or nats |
| STEPFLOW_QUEUE_NAME | Queue name to pull tasks from |
| STEPFLOW_TASKS_URL | Orchestrator gRPC address (for gRPC transport) |
| STEPFLOW_BLOB_URL | BlobService gRPC address |
For remote workers (deployed independently, e.g., in Kubernetes), these variables are configured in the worker's deployment manifest.
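A worker's startup can read these variables directly. A minimal sketch; the address value below is a placeholder, not a real Stepflow default:

```python
import os

# Values the orchestrator (or a Kubernetes manifest) would set; the
# address below is a placeholder, not a real Stepflow default.
os.environ.update({
    "STEPFLOW_TRANSPORT": "grpc",
    "STEPFLOW_QUEUE_NAME": "python",
    "STEPFLOW_TASKS_URL": "http://orchestrator:50051",
})

def load_worker_config() -> dict:
    """Assemble the worker's queue settings from the environment."""
    transport = os.environ["STEPFLOW_TRANSPORT"]
    config = {
        "transport": transport,
        "queue_name": os.environ.get("STEPFLOW_QUEUE_NAME", "default"),
    }
    if transport == "grpc":
        # The tasks URL only applies to the gRPC transport.
        config["tasks_url"] = os.environ["STEPFLOW_TASKS_URL"]
    return config

config = load_worker_config()
```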
## Task Lifecycle
Every task follows the same lifecycle regardless of queue backend. See Task Lifecycle for the full details on task types, heartbeating, completion, retries, and bidirectional callbacks.
## Proto Definitions
The protocol is defined in Protocol Buffer files in stepflow-rs/crates/stepflow-proto/proto/stepflow/v1/:
- tasks.proto — TasksService: PullTasks, GetOrchestratorForRun
- orchestrator.proto — OrchestratorService: CompleteTask, TaskHeartbeat, SubmitRun, GetRun
- components.proto — ComponentsService: ListRegisteredComponents (public API)
- blobs.proto — BlobService: PutBlob, GetBlob
- common.proto — Shared types: TaskErrorCode, ObservabilityContext
## Next Steps
- Task Lifecycle — How tasks are dispatched, executed, and completed
- Error Handling — Error codes, retry behavior, and failure handling
- Blob Storage — Content-addressable storage for data sharing