Production Model Serving with Stepflow

🚀 Experience Production-Ready AI Workflows in Minutes

This demo shows how Stepflow transforms AI workflow serving from monolithic deployments to scalable, cost-effective microservices. In just one command, you'll see:

  • Independent model scaling - Text and vision models scale separately based on demand
  • Smart resource allocation - CPU instances for text, GPU for vision, lightweight orchestration
  • Zero-downtime deployments - Update models without affecting other services
  • Built-in fault tolerance - Services fail gracefully without cascading errors

What You'll Experience:

  1. Instant Setup - No complex configuration, everything works out of the box
  2. Real Architecture - See how production AI systems actually work
  3. Cost Benefits - Understand how to optimize GPU spending and resource usage
  4. Scaling Patterns - Learn how to handle varying AI workload demands

Two Ways to Run This Demo

🔧 Development Mode (5 seconds to start)

Perfect for learning, experimentation and development:

  • Local processes - All services on your machine for easy debugging
  • Mock AI models - No GPU or ML libraries required
  • Live reloading - Change code and see results immediately

🏭 Production Mode (Docker Compose)

See real production architecture:

  • Containerized services - Stepflow runtime + separate AI model servers
  • Service orchestration - Health checks, dependencies, monitoring
  • HTTP communication - Production-ready service-to-service calls
  • Independent deployment - Each service can be updated separately

⚡ Quick Start

Try it now - zero installation required!

# From a clone of the repository, you're running in 30 seconds
cd examples/production-model-serving
./scripts/run-dev-direct.sh

What happens when you run this:

  1. 🏥 Health checks verify all model servers are ready
  2. 🧠 AI pipeline analyzes your input and selects optimal processing strategy
  3. 📊 Sentiment analysis processes the text using mock DistilBERT
  4. ✨ Results show processing times, model selections, and recommendations

Want to try different scenarios?

./scripts/run-dev-direct.sh multimodal  # Test with image processing
./scripts/run-dev-direct.sh batch # Test batch processing efficiency

Why This Architecture Matters for Production AI

💰 Save Money on AI Infrastructure

Problem: Traditional AI deployments waste expensive GPU resources on simple tasks. Solution: Stepflow routes text processing to inexpensive CPU instances and reserves GPUs for vision work.

# Text models: $0.10/hour CPU instances
text_models_cluster:
  url: "http://text-models:8080"  # CPU-optimized

# Vision models: $2.50/hour GPU instances
vision_models_cluster:
  url: "http://vision-models:8081"  # GPU-enabled

Real Impact: Companies report 60-80% cost reduction by right-sizing AI compute resources.

📈 Scale Each AI Service Independently

Problem: Monolithic AI systems can't handle varying workload patterns. Solution: Scale text and vision processing based on actual demand.

# Black Friday: Scale text processing for customer reviews
kubectl scale deployment text-models --replicas=20

# Product launches: Scale vision for image analysis
kubectl scale deployment vision-models --replicas=5

Real Impact: Handle 10x traffic spikes without over-provisioning all services.

🚀 Deploy AI Models Without Downtime

Problem: Model updates require full system restarts, causing service interruptions. Solution: Update individual model servers while others keep running.

# Update text models while vision keeps running
kubectl rollout restart deployment text-models
# Zero impact on vision processing

Real Impact: Deploy new models multiple times per day without user-facing outages.

🛡️ Built-in Fault Tolerance

Problem: One AI model failure brings down the entire system. Solution: Services fail independently with graceful degradation.

  • Vision model crashes → Text processing continues unaffected
  • Text model overloaded → Vision processing stays responsive
  • Health checks automatically route around failed instances (see the sketch below)

Real Impact: 99.9% uptime even with individual service failures.
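
To make the health-check routing concrete, here is a minimal Python sketch of failover between model-server instances. It assumes a plain JSON-over-HTTP interface; the instance URLs and endpoint paths are illustrative, and in the demo this routing is handled by Stepflow itself:

# Hypothetical sketch: route requests around unhealthy model servers.
# Instance URLs and endpoint paths are assumptions, not the demo's API.
import requests

INSTANCES = [
    "http://text-models-0:8080",
    "http://text-models-1:8080",
]

def healthy_instances(timeout: float = 0.5) -> list[str]:
    """Return the subset of INSTANCES whose /health endpoint answers OK."""
    alive = []
    for url in INSTANCES:
        try:
            if requests.get(f"{url}/health", timeout=timeout).ok:
                alive.append(url)
        except requests.RequestException:
            continue  # unreachable instances are treated as failed
    return alive

def route_request(payload: dict) -> dict:
    """Send the request to the first healthy instance; degrade gracefully."""
    for url in healthy_instances():
        try:
            resp = requests.post(f"{url}/analyze", json=payload, timeout=5)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # this instance failed mid-request; try the next one
    return {"error": "all text-model instances unavailable", "degraded": True}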

What You'll Learn From This Demo

🎯 Production AI Architecture Patterns

  • Microservice decomposition: See how to break monolithic AI into scalable services
  • Resource optimization: Learn cost-effective GPU and CPU allocation strategies
  • Service communication: Understand HTTP vs process-based model server integration
  • Health monitoring: Implement production-ready AI system observability

🔧 Stepflow Development Patterns

  • Serve/Submit vs Direct Run: Compare development speed vs production realism
  • Configuration management: Environment-specific configs for dev/staging/prod
  • Component routing: Dynamic service discovery and load balancing
  • Error handling: Graceful degradation and fault isolation techniques

📊 Real Performance Insights

When you run this demo, you'll see actual metrics showing:

  • Processing time differences between CPU and GPU routing
  • Health check response times and service discovery
  • Resource utilization patterns across different model types
  • Batch processing efficiency gains

Try Different Execution Modes

🏃‍♂️ Development Mode: Direct Run (Fastest)

Perfect for workflow development and testing:

cd examples/production-model-serving
./scripts/run-dev-direct.sh

# Try different scenarios:
./scripts/run-dev-direct.sh multimodal # With image processing
./scripts/run-dev-direct.sh batch # Batch processing demo

Best for: Quick iterations, debugging workflows, learning Stepflow

🖥️ Development Mode: Serve/Submit (Production-like)

Test production patterns locally:

cd examples/production-model-serving
./scripts/run-dev.sh

What this shows:

  • Persistent model servers (faster repeat executions)
  • Service-to-service communication patterns
  • How monitoring and health checks work
  • Realistic production deployment simulation

Best for: Testing production behaviors, multiple workflow runs, server debugging

🐳 Production Mode: Full Container Deployment

Experience complete production architecture:

cd examples/production-model-serving
./scripts/run-prod.sh

What you get:

  • Stepflow Runtime: Central orchestration server
  • Text Models Service: CPU-optimized text processing
  • Vision Models Service: GPU-ready image processing
  • Monitoring Stack: Prometheus + Redis for production observability

Best for: Understanding production deployment, container orchestration, service scaling

Model Servers

Text Models Server (text_models_server.py)

Provides text processing capabilities optimized for CPU workloads:

  • Text Generation: GPT-2 based text completion
  • Sentiment Analysis: DistilBERT sentiment classification
  • Batch Processing: Efficient multi-text processing
  • Health Monitoring: Resource usage and model status

Components:

  • models/text/generate_text - Generate text from prompts
  • models/text/analyze_sentiment - Analyze text sentiment
  • models/text/batch_process_text - Process multiple texts efficiently
  • models/text/model_health_check - Server health and metrics
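
Once the server is up, a component such as models/text/analyze_sentiment can be exercised directly. The sketch below assumes a simple JSON POST interface for illustration; the actual wire format is Stepflow's component protocol, which may differ:

# Hypothetical sketch: call the sentiment component directly over HTTP.
# The endpoint path mirrors the component name; the request/response
# shape is an assumption for illustration.
import requests

def analyze_sentiment(text: str) -> dict:
    resp = requests.post(
        "http://text-models:8080/models/text/analyze_sentiment",
        json={"text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(analyze_sentiment("I'm excited about our new AI features!"))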

Vision Models Server (vision_models_server.py)

Provides computer vision capabilities optimized for GPU workloads:

  • Image Classification: ResNet, Vision Transformer models
  • Batch Image Processing: Efficient multi-image processing
  • Image Analysis: Metadata extraction and model recommendations
  • GPU Monitoring: Memory usage and performance metrics

Components:

  • models/vision/classify_image - Classify images with various models
  • models/vision/batch_classify_images - Process multiple images
  • models/vision/analyze_image_metrics - Image property analysis
  • models/vision/vision_health_check - GPU status and model health

Workflow Capabilities

The ai_pipeline_workflow.yaml demonstrates:

  1. Health Checks: Verify model server availability before processing
  2. Content Analysis: Determine optimal processing strategy based on input (see the sketch after this list)
  3. Model Selection: Choose appropriate models based on resource preferences
  4. Multi-modal Processing: Handle both text and image inputs
  5. Batch Processing: Demonstrate efficient multi-input processing
  6. Performance Monitoring: Track processing times and resource usage
  7. Production Insights: Generate recommendations for optimization
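
As an illustration of step 2, the kind of routing decision the content-analysis step makes can be sketched in a few lines of Python. Field names follow the sample inputs below; the logic is illustrative, not the workflow's actual implementation:

# Hypothetical sketch of the content-analysis routing decision.
def select_strategy(user_text: str, user_image: str | None,
                    processing_mode: str, prefer_gpu: bool) -> dict:
    """Pick a processing strategy based on what the input contains."""
    if user_image is not None:
        # Multi-modal input: vision work goes to the GPU-backed server.
        return {"route": "vision+text", "use_gpu": True}
    if processing_mode == "batch":
        # Batch mode groups texts for throughput on CPU instances.
        return {"route": "text-batch", "use_gpu": False}
    # Plain text: cheap CPU instances unless the caller asks for GPU.
    return {"route": "text", "use_gpu": prefer_gpu}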

Sample Inputs

Text-only Processing (sample_input_text.json)

{
  "user_text": "I'm excited about our new AI features!",
  "processing_mode": "accurate",
  "prefer_gpu": false
}

Multi-modal Processing (sample_input_multimodal.json)

{
  "user_text": "What do you think about this image?",
  "user_image": "data:image/jpeg;base64,...",
  "processing_mode": "accurate",
  "prefer_gpu": true
}

Batch Processing (sample_input_batch.json)

{
  "user_text": "Multiple\nlines\nof\ntext",
  "processing_mode": "batch",
  "batch_size": 8
}
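
The data: URI in user_image is just a base64-encoded image, so you can generate your own multimodal input with a small helper. The file path below is illustrative:

# Hypothetical helper: build a multimodal input from a local image file.
# Field names match sample_input_multimodal.json; the path is illustrative.
import base64
import json

def build_multimodal_input(text: str, image_path: str) -> dict:
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "user_text": text,
        "user_image": f"data:image/jpeg;base64,{encoded}",
        "processing_mode": "accurate",
        "prefer_gpu": True,
    }

if __name__ == "__main__":
    payload = build_multimodal_input("What do you think about this image?", "photo.jpg")
    print(json.dumps(payload)[:80], "...")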

Production Deployment Patterns

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-models
spec:
  replicas: 5
  selector:
    matchLabels:
      app: text-models
  template:
    metadata:
      labels:
        app: text-models
    spec:
      containers:
        - name: text-models
          image: company/text-models:v1.0
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"

AWS ECS/Fargate

{
  "family": "text-models",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [{
    "name": "text-models",
    "image": "company/text-models:v1.0",
    "essential": true,
    "portMappings": [{"containerPort": 8080}]
  }]
}

Google Cloud Run

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: text-models
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "10"
        run.googleapis.com/cpu-throttling: "false"
    spec:
      containers:
        - image: gcr.io/company/text-models:v1.0
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"

Monitoring and Observability

Health Check Endpoints

  • GET /health - Basic service health
  • GET /metrics - Prometheus metrics
  • GET /models - Available models status
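
A minimal sketch of the first and third endpoints, assuming FastAPI (the demo servers may use a different framework); /metrics is typically provided by a Prometheus client library, as shown in the next section:

# Minimal sketch of the health endpoints, assuming FastAPI; the demo's
# servers may be implemented with a different framework.
from fastapi import FastAPI

app = FastAPI()
MODELS = {"distilbert-sentiment": "loaded", "gpt2-generate": "loaded"}

@app.get("/health")
def health() -> dict:
    # Basic liveness: the process is up and can answer requests.
    return {"status": "ok"}

@app.get("/models")
def models() -> dict:
    # Report which models are loaded and ready to serve.
    return {"models": MODELS}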

Key Metrics to Monitor

  • Request latency: p50, p95, p99 response times
  • Throughput: Requests per second by model type
  • Error rates: 4xx/5xx errors and model failures
  • Resource usage: CPU, memory, GPU utilization
  • Model performance: Inference time, queue depth
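
A hedged sketch of exporting the latency and throughput metrics above with prometheus_client; the metric names are illustrative, not the demo's:

# Hypothetical sketch: export request latency and throughput per model.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("model_requests_total", "Requests by model", ["model"])
LATENCY = Histogram("model_request_seconds", "Request latency", ["model"])

def timed_inference(model: str, run):
    """Record throughput and latency around a model call."""
    REQUESTS.labels(model=model).inc()
    start = time.perf_counter()
    try:
        return run()
    finally:
        LATENCY.labels(model=model).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)  # exposes GET /metrics for Prometheus to scrape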

Logging Best Practices

# Emit structured fields so logs can be filtered by model and request
logger.info("Processing request", extra={
    "model": model_name,
    "request_id": request_id,
    "user_id": user_id,
    "processing_time_ms": processing_time,
})

Security Considerations

Network Security

  • Service mesh: Istio/Linkerd for encrypted service-to-service communication
  • Network policies: Kubernetes NetworkPolicies for traffic isolation
  • API Gateway: Rate limiting, authentication, and request validation

Data Security

  • Input sanitization: Validate and sanitize all user inputs (see the sketch after this list)
  • Model security: Regular security scans of model dependencies
  • Secrets management: Use Kubernetes secrets or cloud secret managers
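
For input sanitization, a minimal validation model, assuming pydantic v2; field names follow the sample inputs, and the constraints and allowed mode values are illustrative:

# Hypothetical input validation with pydantic v2; constraints illustrative.
from typing import Optional
from pydantic import BaseModel, Field, field_validator

class PipelineInput(BaseModel):
    user_text: str = Field(min_length=1, max_length=10_000)
    user_image: Optional[str] = None
    processing_mode: str = Field(pattern="^(accurate|batch)$")
    prefer_gpu: bool = False

    @field_validator("user_image")
    @classmethod
    def must_be_data_uri(cls, v):
        # Reject anything that isn't an image data: URI before it
        # reaches a model server.
        if v is not None and not v.startswith("data:image/"):
            raise ValueError("user_image must be a data: image URI")
        return v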

Cost Optimization

Resource Management

  1. Right-sizing: Match instance types to workload requirements
  2. Auto-scaling: Scale down during low-traffic periods
  3. Spot instances: Use preemptible instances for batch processing
  4. Model caching: Share downloaded models across instances

Workload Optimization

  1. Batch processing: Group similar requests for efficiency
  2. Model selection: Use smaller models for simple tasks
  3. Caching: Cache frequently requested results (sketched below)
  4. Request routing: Route to least-loaded instances
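
For point 3, a minimal in-process result cache keyed on a hash of the request. This is illustrative only; a production system would typically use Redis or similar with an eviction policy:

# Hypothetical result cache: identical payloads skip repeat inference.
import hashlib
import json

_CACHE: dict[str, dict] = {}

def cached_inference(payload: dict, run_model) -> dict:
    key = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    if key not in _CACHE:
        _CACHE[key] = run_model(payload)  # only pay for inference on a miss
    return _CACHE[key]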

Next Steps

To adapt this demo for production:

  1. Replace mock implementations with real model deployments
  2. Implement authentication and authorization
  3. Add comprehensive monitoring and alerting
  4. Set up CI/CD pipelines for model deployments
  5. Configure auto-scaling policies
  6. Implement circuit breakers and retry logic (see the sketch below)
  7. Add integration tests for all model endpoints
  8. Set up log aggregation and distributed tracing
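
For step 6, a self-contained sketch of retry-with-backoff wrapped in a simple circuit breaker; the thresholds and timings are illustrative:

# Hypothetical circuit breaker: stop calling a failing service until a
# cool-down passes, with exponential-backoff retries inside the window.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.failures = 0
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.opened_at: float | None = None

    def call(self, fn, *args, retries: int = 2, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # half-open: allow one probe call
        last_exc = None
        for attempt in range(retries + 1):
            try:
                result = fn(*args, **kwargs)
                self.failures = 0  # success closes the circuit
                return result
            except Exception as exc:
                last_exc = exc
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()  # trip the breaker
                    break
                time.sleep(0.5 * 2 ** attempt)  # exponential backoff
        raise last_exc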

🎯 Key Takeaways

After running this demo, you'll understand how Stepflow transforms AI from prototype to production:

For Engineering Teams

  • Microservice patterns: Break monolithic AI systems into scalable, maintainable services
  • Resource optimization: Achieve 60-80% cost savings through intelligent compute allocation
  • Zero-downtime deployments: Ship AI model updates without service interruptions
  • Production observability: Built-in health checks, metrics, and failure isolation

For Business Teams

  • Cost predictability: Scale expensive GPU resources only when needed
  • Faster iteration: Deploy new AI capabilities multiple times per day safely
  • Risk mitigation: Service isolation prevents AI failures from cascading
  • Competitive advantage: Production-ready AI architecture from day one

The Stepflow Advantage

This isn't just another workflow engine - it's production-ready AI infrastructure that scales from prototype to enterprise. In one demo, you've seen patterns that typically take months to build and validate.

Ready to apply this to your AI workloads? This example shows you exactly how to architect, deploy, and scale production AI systems using Stepflow's proven patterns.