Error Handling

Stepflow uses a TaskErrorCode enum to categorize errors. Each variant has built-in retryability semantics that the orchestrator uses for automatic retry decisions.

TaskErrorCode

The TaskErrorCode enum is defined in the gRPC protocol (common.proto) and used throughout the system — in FlowError, TaskError, ItemResult, and StepStatus.

| Code | Description | Retry Behavior |
| --- | --- | --- |
| TIMEOUT | Task exceeded its execution deadline or heartbeat timeout | Always retried |
| UNREACHABLE | Worker could not be reached (subprocess crash, network timeout) | Always retried |
| COMPONENT_FAILED | Component executed but returned a business-logic failure | With onError: retry |
| RESOURCE_UNAVAILABLE | A resource required by the component was not available | With onError: retry |
| INVALID_INPUT | Component rejected its input (schema validation, missing fields) | Never |
| CANCELLED | Task explicitly cancelled by the orchestrator | Never |
| COMPONENT_NOT_FOUND | Requested component does not exist on the worker | Never |
| EXPRESSION_FAILURE | Orchestrator failed to resolve a value expression ($step, $input) | Never |
| ORCHESTRATOR_ERROR | Catch-all for unexpected orchestrator errors | Never |
| WORKER_ERROR | Catch-all for unexpected worker/SDK errors | Never |
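
The retry semantics in the table reduce to a small classification. A minimal sketch in Python — the groupings and function below are illustrative, not part of the Stepflow SDK:

ALWAYS_RETRIED = {"TIMEOUT", "UNREACHABLE"}
RETRIED_WITH_ON_ERROR = {"COMPONENT_FAILED", "RESOURCE_UNAVAILABLE"}

def is_retryable(code: str, on_error_retry: bool) -> bool:
    """True if the orchestrator would consider retrying an error with this code."""
    if code in ALWAYS_RETRIED:
        return True
    if code in RETRIED_WITH_ON_ERROR:
        return on_error_retry  # only when the step sets onError: { action: retry }
    return False  # structural errors: never retried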

Error Format

Errors in flow results use the FlowError structure:

{
  "outcome": "failed",
  "error": {
    "code": "COMPONENT_FAILED",
    "message": "API call returned 503",
    "data": { "stack": [] }
  }
}

Fields

  • code (required): A TaskErrorCode string value
  • message (required): Human-readable error description
  • data (optional): Structured error data (stack traces, error context)
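
A client consuming a flow result might inspect the structure like this (a hedged sketch; the surrounding client code is assumed):

result = {
    "outcome": "failed",
    "error": {
        "code": "COMPONENT_FAILED",
        "message": "API call returned 503",
        "data": {"stack": []},
    },
}

if result["outcome"] == "failed":
    error = result["error"]
    # code and message are required; data is optional
    print(f"{error['code']}: {error['message']}")
    stack = error.get("data", {}).get("stack", [])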

How Errors Flow

Component errors

When a worker reports a task failure via gRPC CompleteTask, it sends a TaskError with the appropriate TaskErrorCode. The orchestrator uses this directly in the FlowResult.

Transport errors

When the worker is unreachable — subprocess crash, network timeout, connection refused — the orchestrator sets the error code to UNREACHABLE. This indicates the component never executed.

Orchestrator errors

When the orchestrator fails to resolve value expressions (e.g., $step reference to unknown step, path to nonexistent field), it creates an EXPRESSION_FAILURE error. Other unexpected orchestrator issues produce ORCHESTRATOR_ERROR.
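
For instance, a $step reference to an undefined step id fails before any component runs. A sketch — the component path, input shape, and exact expression syntax are illustrative:

steps:
  - id: summarize
    component: /text/summarize
    input:
      # $step points at a step id that does not exist in this workflow,
      # so the orchestrator fails the step with EXPRESSION_FAILURE
      text: { $step: no_such_step }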

Retry Behavior

The orchestrator uses TaskErrorCode variants to make retry decisions:

  • Always retried: UNREACHABLE and TIMEOUT errors are retried up to retry.transportMaxRetries (default: 3). The plugin's prepare_for_retry() is called before each retry.
  • Retried with onError: retry: COMPONENT_FAILED and RESOURCE_UNAVAILABLE errors are retried only if the step has onError: { action: retry }, up to maxRetries (default: 3).
  • Never retried: All other error codes indicate structural problems that won't resolve on retry.
  • Separate budgets: Transport retries and component retries have independent counters. Exhausting one does not affect the other.

Although the budgets are independent, transport retries and component retries advance a single monotonically increasing attempt counter that is visible to the component. See Retries for details.
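
The two-budget, one-counter behavior can be pictured like this (illustrative bookkeeping only, not the orchestrator's actual internals):

class RetryState:
    """Sketch: two independent retry budgets, one shared attempt counter."""

    def __init__(self, transport_max: int = 3, component_max: int = 3):
        self.transport_max = transport_max   # retry.transportMaxRetries
        self.component_max = component_max   # onError maxRetries
        self.transport_used = 0              # consumed by TIMEOUT / UNREACHABLE
        self.component_used = 0              # consumed by onError: retry
        self.attempt = 1                     # the single counter the component sees

    def next_transport_retry(self):
        if self.transport_used >= self.transport_max:
            return None                      # transport budget exhausted
        self.transport_used += 1
        self.attempt += 1                    # either kind of retry advances it
        return self.attempt

    def next_component_retry(self):
        if self.component_used >= self.component_max:
            return None                      # component budget exhausted; transport budget untouched
        self.component_used += 1
        self.attempt += 1
        return self.attempt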

Error handling actions

Steps can configure onError to control behavior on failure:

  • fail (default): Stop workflow execution with the error
  • useDefault: Use a defaultValue and mark the step as successful
  • retry: Retry the step up to maxRetries times (for component execution errors only)

steps:
  - id: flaky_step
    component: /external/api
    onError:
      action: retry
      maxRetries: 5
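
A useDefault configuration follows the same shape. In this sketch the defaultValue payload, and its placement under onError, are assumptions for illustration:

steps:
  - id: optional_enrichment
    component: /external/enrich
    onError:
      action: useDefault
      # the step is marked successful and downstream steps see this value
      defaultValue:
        enriched: false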

Python SDK Error Handling

Exception hierarchy

The Python SDK maps exceptions to TaskErrorCode variants for gRPC reporting:

| Exception | TaskErrorCode |
| --- | --- |
| StepflowExecutionError | COMPONENT_FAILED |
| StepflowRuntimeError | RESOURCE_UNAVAILABLE |
| StepflowValidationError | INVALID_INPUT |
| StepflowComponentError | COMPONENT_NOT_FOUND |
| StepflowError (base) | WORKER_ERROR |

Raising errors in components

from stepflow_py.worker.exceptions import StepflowExecutionError, StepflowRuntimeError

# Business logic failure (retriable with onError: retry)
raise StepflowExecutionError("API call failed")

# Resource unavailable (retriable with onError: retry)
raise StepflowRuntimeError("Database connection failed")
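
Validation failures are worth distinguishing, because INVALID_INPUT is never retried: raising StepflowValidationError signals that the workflow definition, not the component, needs fixing. The handler function and input_data name below are illustrative:

from stepflow_py.worker.exceptions import StepflowValidationError

def handle(input_data: dict):
    # Shape problem in the input — maps to INVALID_INPUT, which is never
    # retried, so retrying would just fail the same way again
    if "url" not in input_data:
        raise StepflowValidationError("missing required field 'url'")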

Orchestrator Service Errors

When a worker calls OrchestratorService RPCs (CompleteTask, TaskHeartbeat, SubmitRun, GetRun), the orchestrator returns gRPC error codes to signal ownership and availability issues:

| gRPC Code | Meaning | Worker Action |
| --- | --- | --- |
| NOT_FOUND | The run/task is not on this orchestrator (run migrated or completed) | Call GetOrchestratorForRun to discover the current orchestrator, then retry against the new URL |
| UNAVAILABLE | The orchestrator is unreachable or the run is being recovered | Retry with exponential backoff |

These are distinct from TaskErrorCode (which categorizes component execution failures). Orchestrator service errors indicate infrastructure-level routing problems, not component logic errors.

Orchestrator Discovery

When a worker receives NOT_FOUND or UNAVAILABLE from any OrchestratorService RPC, it can call TasksService.GetOrchestratorForRun on any orchestrator to discover which orchestrator currently owns the run. This enables automatic recovery when orchestrators restart or runs migrate between orchestrators.

The worker's OrchestratorTracker handles this automatically for all RPCs — heartbeat, task completion, subflow submission, and run queries all share the same tracker per task and benefit from discovery performed by any one of them.
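
Conceptually, the recovery loop the tracker implements looks like this. This is a sketch only: the stub classes, request shapes, and the url response field are assumptions; only the status codes and the GetOrchestratorForRun RPC come from this page.

import time
import grpc

def complete_task_with_discovery(channel_for, orchestrator_url, request, run_id):
    """Conceptual recovery loop; OrchestratorTracker does this automatically."""
    backoff = 0.5
    while True:
        try:
            # OrchestratorServiceStub / TasksServiceStub are hypothetical names
            # standing in for the generated gRPC stubs.
            stub = OrchestratorServiceStub(channel_for(orchestrator_url))
            return stub.CompleteTask(request)
        except grpc.RpcError as err:
            if err.code() == grpc.StatusCode.NOT_FOUND:
                # Run migrated or completed: ask any orchestrator who owns it now.
                tasks = TasksServiceStub(channel_for(orchestrator_url))
                orchestrator_url = tasks.GetOrchestratorForRun(run_id).url  # hypothetical response field
            elif err.code() == grpc.StatusCode.UNAVAILABLE:
                time.sleep(backoff)              # retry with exponential backoff
                backoff = min(backoff * 2, 30.0)
            else:
                raise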