Production Deployment

Stepflow's architecture enables flexible production deployments that scale component execution independently from workflow orchestration. This section covers key concepts and components for deploying Stepflow in production environments.

Overview

In production, Stepflow separates concerns between:

  • Workflow Orchestrator: Manages workflow execution, data flow, and state persistence
  • Workers: Provide business logic and can be scaled independently
  • Task Queues: Route tasks to workers via named gRPC queues

This separation allows you to:

  • Scale different types of components independently based on resource requirements
  • Deploy components on specialized hardware (GPUs, high-memory nodes, etc.)
  • Maintain simple orchestration while distributing compute-intensive work
  • Handle high-throughput batch processing efficiently

Architecture Patterns

Resource-Based Component Segregation

Different workers can be deployed with different resource profiles:

# Configuration routing to different worker pools using per-route queueName
plugins:
  builtin:
    type: builtin

  workers:
    type: grpc
    queueName: default # Default queue; overridden per-route below

routes:
  "/ml/{*component}":
    - plugin: workers
      params:
        queueName: gpu # GPU worker pool
  "/data/{*component}":
    - plugin: workers
      params:
        queueName: memory # High-memory worker pool
  "/python/{*component}":
    - plugin: workers
      params:
        queueName: cpu # CPU worker pool
  "/{*component}":
    - plugin: builtin

Deployment Topology

Each worker pool:

  • Pulls tasks from a dedicated named queue on the orchestrator and returns results via gRPC
  • Scales independently based on workload
  • Runs on appropriate hardware (CPU, GPU, high-memory nodes)
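
Concretely, each pool can be a Kubernetes Deployment whose workers subscribe to a single named queue. The image name, container args, and orchestrator address below are illustrative assumptions, not Stepflow-defined values:

```yaml
# Hypothetical Deployment for the GPU worker pool.
# Image, args, and orchestrator endpoint are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-components
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-components
  template:
    metadata:
      labels:
        app: gpu-components
    spec:
      containers:
        - name: worker
          image: example.com/stepflow-gpu-worker:latest # placeholder image
          args:
            - --orchestrator=grpc://stepflow-orchestrator:7837 # assumed flag and port
            - --queue=gpu # pull tasks from the "gpu" named queue
```

Each pool (cpu, memory, gpu) would get its own Deployment of this shape, differing only in queue name, node placement, and resources.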

Key Components

1. Task Routing

Workers pull tasks from named queues and return results to the orchestrator via gRPC. The pull-based protocol provides:

  • Named queue-based task routing
  • Heartbeat-based crash detection
  • Automatic retry on transport failures
  • Horizontal scaling across multiple worker instances

Use cases:

  • Distributing tasks across multiple worker pods
  • Scaling each worker pool independently of the orchestrator and of other pools
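
To make the pull-based protocol concrete, a worker needs to know roughly the following; the key names here are hypothetical, chosen only to illustrate the moving parts, and are not Stepflow's actual schema:

```yaml
# Hypothetical worker-side configuration (key names are illustrative)
orchestrator: grpc://stepflow-orchestrator:7837 # endpoint to pull tasks from (assumed address)
queue: gpu               # named queue this worker instance serves
heartbeatInterval: 10s   # liveness signal; missed heartbeats let the orchestrator detect a crash and retry
maxConcurrentTasks: 4    # tasks pulled and executed in parallel per instance
```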

2. Worker Pools

Deploy different workers for different workloads:

CPU-Optimized:

  • General-purpose Python components
  • Data transformation and validation
  • API integrations
  • Deployment: Standard compute nodes, high replica count

GPU-Accelerated:

  • ML model inference
  • Image/video processing
  • Large language models
  • Deployment: GPU nodes, fewer replicas, higher cost

Memory-Intensive:

  • Large dataset processing
  • In-memory caching
  • Data aggregation
  • Deployment: High-memory nodes, moderate replica count
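
These three profiles map directly onto Kubernetes container resource requests. The figures below are placeholder starting points for illustration, not sizing recommendations:

```yaml
# Illustrative container resources per worker class (all values are placeholders)

# CPU-optimized: small footprint, high replica count
resources:
  requests: { cpu: 500m, memory: 512Mi }
  limits: { cpu: "1", memory: 1Gi }

# Memory-intensive: large memory, moderate replica count
resources:
  requests: { cpu: "1", memory: 8Gi }
  limits: { memory: 16Gi }

# GPU-accelerated: one GPU per replica, few replicas
resources:
  limits: { nvidia.com/gpu: 1 }
```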

3. Configuration Management

Use Configuration and Variables to manage environment-specific settings:

Configuration controls infrastructure:

  • Define plugin routes to different worker pools
  • Configure state storage backends
  • Set worker connection details

Variables parameterize workflows:

  • API endpoints and credentials that differ between environments
  • Feature flags and configuration options
  • Resource limits and timeouts
  • Environment identifiers (dev, staging, production)

This separation allows the same workflow definition to run across environments by changing only configuration and variables, not the workflow itself.
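
The same idea in file form: one workflow definition, per-environment variables. The file names and keys below are illustrative examples, not a Stepflow-defined schema:

```yaml
# variables.dev.yaml (hypothetical file; key names are examples)
apiEndpoint: https://api.dev.example.com
environment: dev
requestTimeout: 60s
featureFlags:
  enableCaching: false

# variables.production.yaml would carry the same keys with production
# values (e.g. apiEndpoint: https://api.example.com, a tighter timeout,
# enableCaching: true), while the workflow definition stays identical.
```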

Example: Kubernetes Deployment

The Kubernetes Batch Demo provides a complete working example of:

  • Stepflow orchestrator deployed in Kubernetes
  • Multiple worker replicas pulling from named queues
  • gRPC-based task dispatch and completion
  • Batch execution with distributed compute
  • Heartbeat-based health monitoring and automatic failover

Key features demonstrated:

  • Workers scale from 3 to 20+ replicas
  • Tasks distributed across worker pool via named queues
  • Bidirectional communication (sub-run submission) works correctly
  • Batch workflows process 1000+ items efficiently

Scaling Strategies

Horizontal Scaling

Scale workers based on load:

# Kubernetes HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-components
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-components
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Resource-Based Routing

Route components to appropriate hardware:

# GPU worker deployment
spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: gpu-components
          resources:
            limits:
              nvidia.com/gpu: 1

State Management

Development

  • In-memory state store
  • Single orchestrator instance
  • Fast, simple, ephemeral

Production

  • SQLite or PostgreSQL state store
  • Persistent workflow state
  • Multiple orchestrator instances (with PostgreSQL)
  • Durable execution with fault tolerance

See Persistence and Recovery for the full architecture, and Configuration - State Store for configuration options.
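
As a rough sketch of the two setups; consult Configuration - State Store for the actual key names, which may differ from the illustrative shape below:

```yaml
# Development: in-memory state (fast, ephemeral)
stateStore:
  type: inMemory

# Production: PostgreSQL-backed state (durable; supports multiple orchestrators)
# stateStore:
#   type: postgres
#   connectionString: postgresql://stepflow:${DB_PASSWORD}@postgres:5432/stepflow
```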

Best Practices

1. Separate Component Classes

Group components by resource requirements:

  • Light: API calls, simple transformations → CPU workers
  • Medium: Data processing, batch operations → Memory workers
  • Heavy: ML inference, GPU workloads → GPU workers

2. Use Named Queues

Configure separate named queues for each component class:

  • Enables horizontal scaling per worker pool
  • Provides heartbeat-based health monitoring
  • Adds automatic crash detection and retry
  • Simplifies configuration

3. Monitor and Scale

Track key metrics:

  • Worker CPU/memory usage
  • Request latency and throughput
  • Error rates and health status
  • Queue depths and backpressure

4. Plan for Failures

Design for resilience:

  • Health checks with automatic failover
  • Heartbeat-based crash detection and retry
  • Configurable per-step error handling in workflows
  • Persistent state storage for crash recovery

Next Steps

Learn More

  • Read the FAQ for comparisons with other orchestration and workflow technologies
  • Learn about the Stepflow Protocol that enables this architecture