Protocol Overview

Stepflow uses a queue-based task protocol to coordinate work between the orchestrator and workers. The orchestrator dispatches tasks onto named queues; workers pull tasks from those queues, execute them, and report results back. This design decouples the orchestrator from workers, enabling independent scaling, multiple transport backends, and fault-tolerant execution.

Architecture

  1. The orchestrator executes a workflow and determines which component to invoke for each step.
  2. It applies routing rules to map the component path to a plugin, which in turn determines the queue the task is published on.
  3. A worker listening on that queue picks up the task, executes the component, and reports the result back to the orchestrator.
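The three steps above can be sketched end-to-end with plain Python data structures. This is an illustrative model only — the queue, the component registry, and the function names here are not Stepflow's actual API:

```python
# Minimal sketch of the dispatch flow: the orchestrator publishes a task,
# a worker pulls it, executes the component, and reports the result back.
# All names here are illustrative, not Stepflow's real types.
task_queue: list[dict] = []      # stands in for a named queue
results: dict[str, object] = {}  # stands in for results reported back

def orchestrator_dispatch(step_id: str, component: str, payload: dict) -> None:
    """Steps 1-2: choose a component for the step and publish a task for it."""
    task_queue.append({"id": step_id, "component": component, "input": payload})

def worker_poll(components: dict) -> None:
    """Step 3: pull one task, execute its component, report the result."""
    task = task_queue.pop(0)
    results[task["id"]] = components[task["component"]](task["input"])
```

Because the worker only ever sees tasks through the queue, the orchestrator and worker never call each other directly, which is what allows them to scale independently.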

Queue Backends

Stepflow supports multiple queue backends through its plugin system. All backends share the same task lifecycle — the only difference is how tasks are transported between the orchestrator and workers.

| Backend | Plugin Type | Transport | Best For |
|---|---|---|---|
| gRPC | `type: grpc` | `PullTasks` streaming RPC | Single-orchestrator deployments, local development |
| NATS | `type: nats` | JetStream consumers | Multi-orchestrator, horizontal scaling, durable queues |

Future backends can be added by implementing the queue plugin interface.

gRPC Queues

Workers connect to the orchestrator's TasksService and call PullTasks to open a long-lived stream. The orchestrator pushes task assignments through the stream as they become available.

```yaml
plugins:
  python:
    type: grpc
    queueName: python
    command: uv
    args: ["run", "stepflow_py", "--grpc"]
```

NATS Queues

Tasks are published to NATS JetStream subjects. Workers run as durable consumers, enabling fan-out across multiple orchestrator instances and surviving restarts without task loss.

```yaml
plugins:
  python:
    type: nats
    natsUrl: "nats://localhost:4222"
    stream: STEPFLOW_TASKS
```

Named Queues and Routing

Each plugin defines a default queue name (via queueName). Routes can override this per-path, allowing a single plugin to serve multiple worker pools:

```yaml
plugins:
  workers:
    type: grpc
    queueName: default

routes:
  "/ml/{*component}":
    - plugin: workers
      params:
        queueName: gpu # Override: route to GPU worker pool
  "/python/{*component}":
    - plugin: workers # Uses default queue name
  "/{*component}":
    - plugin: builtin
```

Workers subscribe to a specific queue name and only receive tasks routed to that queue.

Worker Configuration

Workers discover the orchestrator and queue configuration through environment variables. When the orchestrator launches a worker as a subprocess, it sets these automatically:

| Variable | Description |
|---|---|
| `STEPFLOW_TRANSPORT` | Queue backend: `grpc` or `nats` |
| `STEPFLOW_QUEUE_NAME` | Queue name to pull tasks from |
| `STEPFLOW_TASKS_URL` | Orchestrator gRPC address (for gRPC transport) |
| `STEPFLOW_BLOB_URL` | BlobService gRPC address |

For remote workers (deployed independently, e.g., in Kubernetes), these variables are configured in the worker's deployment manifest.

Task Lifecycle

Every task follows the same lifecycle regardless of queue backend: dispatch, execution, and completion work identically, and only the transport differs.

See Task Lifecycle for the full details on task types, heartbeating, completion, retries, and bidirectional callbacks.
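The lifecycle can be pictured as a small state machine. The states below (queued, running, completed, failed) are a plausible illustration only — the real state names and transitions are defined on the Task Lifecycle page:

```python
# Hypothetical task state machine; state names are illustrative.
ALLOWED = {
    "queued": {"running"},               # a worker picks the task up
    "running": {"completed", "failed"},  # result or error reported back
    "failed": {"queued"},                # a retry re-enqueues the task
    "completed": set(),                  # terminal
}

def transition(state: str, new_state: str) -> str:
    """Move a task to new_state, rejecting illegal transitions."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```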

Proto Definitions

The protocol is defined in Protocol Buffer files in stepflow-rs/crates/stepflow-proto/proto/stepflow/v1/:

  • tasks.proto — TasksService: PullTasks, GetOrchestratorForRun
  • orchestrator.proto — OrchestratorService: CompleteTask, TaskHeartbeat, SubmitRun, GetRun
  • components.proto — ComponentsService: ListRegisteredComponents (public API)
  • blobs.proto — BlobService: PutBlob, GetBlob
  • common.proto — Shared types: TaskErrorCode, ObservabilityContext

Next Steps

  • Task Lifecycle — How tasks are dispatched, executed, and completed
  • Error Handling — Error codes, retry behavior, and failure handling
  • Blob Storage — Content-addressable storage for data sharing