Production Model Serving with Stepflow
🚀 Experience Production-Ready AI Workflows in Minutes
This demo shows how Stepflow transforms AI workflow serving from monolithic deployments to scalable, cost-effective microservices. In just one command, you'll see:
- Independent model scaling - Text and vision models scale separately based on demand
- Smart resource allocation - CPU instances for text, GPU for vision, lightweight orchestration
- Zero-downtime deployments - Update models without affecting other services
- Built-in fault tolerance - Services fail gracefully without cascading errors
What You'll Experience:
- Instant Setup - No complex configuration, everything works out of the box
- Real Architecture - See how production AI systems actually work
- Cost Benefits - Understand how to optimize GPU spending and resource usage
- Scaling Patterns - Learn how to handle varying AI workload demands
Two Ways to Run This Demo
🔧 Development Mode (5 seconds to start)
Perfect for learning, experimentation, and development:
- Local processes - All services on your machine for easy debugging
- Mock AI models - No GPU or ML libraries required
- Live reloading - Change code and see results immediately
🏭 Production Mode (Docker Compose)
See real production architecture:
- Containerized services - Stepflow runtime + separate AI model servers
- Service orchestration - Health checks, dependencies, monitoring
- HTTP communication - Production-ready service-to-service calls
- Independent deployment - Each service can be updated separately
⚡ Quick Start
Try it now - zero installation required!
```bash
# Clone and run in 30 seconds
cd examples/production-model-serving
./scripts/run-dev-direct.sh
```
What happens when you run this:
- 🏥 Health checks verify all model servers are ready
- 🧠 AI pipeline analyzes your input and selects an optimal processing strategy
- 📊 Sentiment analysis processes the text using mock DistilBERT
- ✨ Results show processing times, model selections, and recommendations
Want to try different scenarios?
```bash
./scripts/run-dev-direct.sh multimodal   # Test with image processing
./scripts/run-dev-direct.sh batch        # Test batch processing efficiency
```
Why This Architecture Matters for Production AI
💰 Save Money on AI Infrastructure
Problem: Traditional AI deployments waste expensive GPU resources on simple tasks.

Solution: Stepflow routes text processing to cheap CPU instances and reserves GPUs for vision work.
```yaml
# Text models: $0.10/hour CPU instances
text_models_cluster:
  url: "http://text-models:8080"   # CPU-optimized

# Vision models: $2.50/hour GPU instances
vision_models_cluster:
  url: "http://vision-models:8081" # GPU-enabled
```
Real Impact: Companies report 60-80% cost reduction by right-sizing AI compute resources.
📈 Scale Each AI Service Independently
Problem: Monolithic AI systems can't handle varying workload patterns.

Solution: Scale text and vision processing based on actual demand.
```bash
# Black Friday: Scale text processing for customer reviews
kubectl scale deployment text-models --replicas=20

# Product launches: Scale vision for image analysis
kubectl scale deployment vision-models --replicas=5
```
Real Impact: Handle 10x traffic spikes without over-provisioning all services.
🚀 Deploy AI Models Without Downtime
Problem: Model updates require full system restarts, causing service interruptions.

Solution: Update individual model servers while others keep running.
```bash
# Update text models while vision keeps running
kubectl rollout restart deployment text-models
# Zero impact on vision processing
```
Real Impact: Deploy new models multiple times per day without user-facing outages.
🛡️ Built-in Fault Tolerance
Problem: One AI model failure brings down the entire system.

Solution: Services fail independently with graceful degradation, as sketched in the example after this list.
- Vision model crashes → Text processing continues unaffected
- Text model overloaded → Vision processing stays responsive
- Health checks automatically route around failed instances
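As a rough illustration of this pattern, here is a minimal Python sketch: probe a server's health endpoint before routing, and return a degraded response instead of failing the whole request. The endpoint path and response handling are assumptions for illustration, not the exact Stepflow mechanism.

```python
import urllib.request
import urllib.error

# Hypothetical service URLs; a real deployment would use service discovery.
SERVERS = {
    "text": "http://text-models:8080",
    "vision": "http://vision-models:8081",
}

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def route_request(kind: str) -> str:
    """Route to the requested service, degrading gracefully if it is down."""
    if is_healthy(SERVERS[kind]):
        return f"routed to {kind} service"
    # Only this capability degrades; the other services keep serving traffic.
    return f"{kind} service unavailable, returning degraded response"
```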
Real Impact: 99.9% uptime even with individual service failures.
What You'll Learn From This Demo
🎯 Production AI Architecture Patterns
- Microservice decomposition: See how to break monolithic AI into scalable services
- Resource optimization: Learn cost-effective GPU and CPU allocation strategies
- Service communication: Understand HTTP vs process-based model server integration
- Health monitoring: Implement production-ready AI system observability
🔧 Stepflow Development Patterns
- Serve/Submit vs Direct Run: Compare development speed vs production realism
- Configuration management: Environment-specific configs for dev/staging/prod
- Component routing: Dynamic service discovery and load balancing
- Error handling: Graceful degradation and fault isolation techniques
📊 Real Performance Insights
When you run this demo, you'll see actual metrics showing:
- Processing time differences between CPU and GPU routing
- Health check response times and service discovery
- Resource utilization patterns across different model types
- Batch processing efficiency gains
Try Different Execution Modes
🏃‍♂️ Development Mode: Direct Run (Fastest)
Perfect for workflow development and testing:
```bash
cd examples/production-model-serving
./scripts/run-dev-direct.sh

# Try different scenarios:
./scripts/run-dev-direct.sh multimodal   # With image processing
./scripts/run-dev-direct.sh batch        # Batch processing demo
```
Best for: Quick iterations, debugging workflows, learning Stepflow
🖥️ Development Mode: Serve/Submit (Production-like)
Test production patterns locally:
```bash
cd examples/production-model-serving
./scripts/run-dev.sh
```
What this shows:
- Persistent model servers (faster repeat executions)
- Service-to-service communication patterns
- How monitoring and health checks work
- Realistic production deployment simulation
Best for: Testing production behaviors, multiple workflow runs, server debugging
🐳 Production Mode: Full Container Deployment
Experience complete production architecture:
```bash
cd examples/production-model-serving
./scripts/run-prod.sh
```
What you get:
- Stepflow Runtime: Central orchestration server
- Text Models Service: CPU-optimized text processing
- Vision Models Service: GPU-ready image processing
- Monitoring Stack: Prometheus + Redis for production observability
Best for: Understanding production deployment, container orchestration, service scaling
Model Servers
Text Models Server (`text_models_server.py`)
Provides text processing capabilities optimized for CPU workloads:
- Text Generation: GPT-2 based text completion
- Sentiment Analysis: DistilBERT sentiment classification
- Batch Processing: Efficient multi-text processing
- Health Monitoring: Resource usage and model status
Components:
- `models/text/generate_text` - Generate text from prompts
- `models/text/analyze_sentiment` - Analyze text sentiment
- `models/text/batch_process_text` - Process multiple texts efficiently
- `models/text/model_health_check` - Server health and metrics
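To give a feel for calling one of these components, here is a hypothetical client sketch that POSTs text to the sentiment component over HTTP. The route and JSON shapes are assumptions for illustration; the real request format is defined by the Stepflow component protocol.

```python
import json
import urllib.request

def analyze_sentiment(text: str, base_url: str = "http://text-models:8080") -> dict:
    """POST text to the (assumed) sentiment route and return the JSON result."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/models/text/analyze_sentiment",  # assumed route
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example: analyze_sentiment("I'm excited about our new AI features!")
```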
Vision Models Server (`vision_models_server.py`)
Provides computer vision capabilities optimized for GPU workloads:
- Image Classification: ResNet, Vision Transformer models
- Batch Image Processing: Efficient multi-image processing
- Image Analysis: Metadata extraction and model recommendations
- GPU Monitoring: Memory usage and performance metrics
Components:
- `models/vision/classify_image` - Classify images with various models
- `models/vision/batch_classify_images` - Process multiple images
- `models/vision/analyze_image_metrics` - Image property analysis
- `models/vision/vision_health_check` - GPU status and model health
Workflow Capabilities
The `ai_pipeline_workflow.yaml` workflow demonstrates:
- Health Checks: Verify model server availability before processing
- Content Analysis: Determine optimal processing strategy based on input
- Model Selection: Choose appropriate models based on resource preferences
- Multi-modal Processing: Handle both text and image inputs
- Batch Processing: Demonstrate efficient multi-input processing
- Performance Monitoring: Track processing times and resource usage
- Production Insights: Generate recommendations for optimization
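To make the content-analysis step concrete, here is a simplified Python sketch of the kind of strategy decision involved, using the same input fields as the sample files below. It illustrates the decision logic only, not the workflow's actual implementation.

```python
def select_strategy(inp: dict) -> str:
    """Pick a processing strategy from the workflow input (illustrative only)."""
    if inp.get("processing_mode") == "batch":
        return "batch"        # group inputs for efficient multi-text processing
    if inp.get("user_image"):
        return "multimodal"   # image work goes to the GPU vision service
    return "text"             # plain text stays on cheap CPU instances

# select_strategy({"user_text": "hi", "processing_mode": "accurate"}) == "text"
```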
Sample Inputs
Text-only Processing (`sample_input_text.json`)
```json
{
  "user_text": "I'm excited about our new AI features!",
  "processing_mode": "accurate",
  "prefer_gpu": false
}
```
Multi-modal Processing (`sample_input_multimodal.json`)
```json
{
  "user_text": "What do you think about this image?",
  "user_image": "data:image/jpeg;base64,...",
  "processing_mode": "accurate",
  "prefer_gpu": true
}
```
Batch Processing (`sample_input_batch.json`)
```json
{
  "user_text": "Multiple\nlines\nof\ntext",
  "processing_mode": "batch",
  "batch_size": 8
}
```
Production Deployment Patterns
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-models
spec:
  replicas: 5
  selector:
    matchLabels:
      app: text-models
  template:
    metadata:
      labels:
        app: text-models   # must match the selector above
    spec:
      containers:
        - name: text-models
          image: company/text-models:v1.0
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
```
AWS ECS/Fargate
```json
{
  "family": "text-models",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [{
    "name": "text-models",
    "image": "company/text-models:v1.0",
    "essential": true,
    "portMappings": [{"containerPort": 8080}]
  }]
}
```
Google Cloud Run
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: text-models
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "10"
        run.googleapis.com/cpu-throttling: "false"
    spec:
      containers:
        - image: gcr.io/company/text-models:v1.0
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
```
Monitoring and Observability
Health Check Endpoints
- `GET /health` - Basic service health
- `GET /metrics` - Prometheus metrics
- `GET /models` - Available models status
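A small sketch of polling these endpoints from a readiness probe or dashboard, assuming they return JSON:

```python
import json
import urllib.request

def service_status(base_url: str) -> dict:
    """Collect health and model status from one model server (sketch)."""
    status = {}
    for path in ("/health", "/models"):
        with urllib.request.urlopen(f"{base_url}{path}", timeout=5) as resp:
            status[path] = json.load(resp)
    return status

# service_status("http://text-models:8080")
```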
Key Metrics to Monitor
- Request latency: p50, p95, p99 response times
- Throughput: Requests per second by model type
- Error rates: 4xx/5xx errors and model failures
- Resource usage: CPU, memory, GPU utilization
- Model performance: Inference time, queue depth
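One common way to expose these metrics from a Python model server is the prometheus_client library. A minimal sketch follows; the library calls are standard, but the metric names are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; align them with your dashboards.
REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds", "Inference request latency", ["model"]
)
REQUEST_ERRORS = Counter(
    "model_request_errors_total", "Failed inference requests", ["model"]
)

def handle_request(model: str) -> None:
    # Records latency observations per model label for p50/p95/p99 queries.
    with REQUEST_LATENCY.labels(model=model).time():
        try:
            ...  # run inference here
        except Exception:
            REQUEST_ERRORS.labels(model=model).inc()
            raise

start_http_server(9090)  # exposes GET /metrics for Prometheus to scrape
```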
Logging Best Practices
```python
import logging

logger = logging.getLogger("model_server")

def log_request(model_name, request_id, user_id, processing_time):
    # Structured fields make logs filterable in aggregators (e.g. ELK, Loki).
    logger.info("Processing request", extra={
        "model": model_name,
        "request_id": request_id,
        "user_id": user_id,
        "processing_time_ms": processing_time,
    })
```
Security Considerations
Network Security
- Service mesh: Istio/Linkerd for encrypted service-to-service communication
- Network policies: Kubernetes NetworkPolicies for traffic isolation
- API Gateway: Rate limiting, authentication, and request validation
Data Security
- Input sanitization: Validate and sanitize all user inputs
- Model security: Regular security scans of model dependencies
- Secrets management: Use Kubernetes secrets or cloud secret managers
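Input sanitization can start with strict structural checks before anything reaches a model. A minimal sketch using this demo's input fields (the limits are illustrative):

```python
MAX_TEXT_LEN = 10_000  # illustrative limit

def validate_input(inp: dict) -> dict:
    """Reject malformed or oversized inputs before they reach a model."""
    text = inp.get("user_text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("user_text must be a non-empty string")
    if len(text) > MAX_TEXT_LEN:
        raise ValueError("user_text exceeds the maximum length")
    image = inp.get("user_image")
    if image is not None and not image.startswith("data:image/"):
        raise ValueError("user_image must be a data: URI")
    return inp
```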
Cost Optimization
Resource Management
- Right-sizing: Match instance types to workload requirements
- Auto-scaling: Scale down during low-traffic periods
- Spot instances: Use preemptible instances for batch processing
- Model caching: Share downloaded models across instances
Workload Optimization
- Batch processing: Group similar requests for efficiency
- Model selection: Use smaller models for simple tasks
- Caching: Cache frequently requested results
- Request routing: Route to least-loaded instances
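As one example, result caching can be keyed by a hash of the input. The sketch below uses an in-process dict as a stand-in for a shared cache such as the Redis instance in this demo's monitoring stack; `run_sentiment_model` is a hypothetical model call.

```python
import hashlib

_CACHE: dict = {}  # in-process stand-in for a shared cache like Redis

def cached_sentiment(text: str) -> dict:
    """Return the cached result for repeated inputs, else run the model."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _CACHE:
        _CACHE[key] = run_sentiment_model(text)  # hypothetical model call
    return _CACHE[key]
```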
Next Steps
To adapt this demo for production:
- Replace mock implementations with real model deployments
- Implement authentication and authorization
- Add comprehensive monitoring and alerting
- Set up CI/CD pipelines for model deployments
- Configure auto-scaling policies
- Implement circuit breakers and retry logic (see the sketch after this list)
- Add integration tests for all model endpoints
- Set up log aggregation and distributed tracing
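For the circuit-breaker item above, here is a minimal, illustrative sketch of retry-with-backoff wrapped in a simple breaker. Production systems would typically reach for a library (e.g. tenacity) or a service mesh instead:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, retries: int = 2, backoff: float = 0.5):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.failures = 0  # half-open: allow one probe through
        for attempt in range(retries + 1):
            try:
                result = fn(*args)
                self.failures = 0
                return result
            except Exception:
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
        self.failures += 1
        self.opened_at = time.monotonic()
        raise RuntimeError("request failed after retries")
```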
🎯 Key Takeaways
After running this demo, you'll understand how Stepflow transforms AI from prototype to production:
For Engineering Teams
- Microservice patterns: Break monolithic AI systems into scalable, maintainable services
- Resource optimization: Achieve 60-80% cost savings through intelligent compute allocation
- Zero-downtime deployments: Ship AI model updates without service interruptions
- Production observability: Built-in health checks, metrics, and failure isolation
For Business Teams
- Cost predictability: Scale expensive GPU resources only when needed
- Faster iteration: Deploy new AI capabilities multiple times per day safely
- Risk mitigation: Service isolation prevents AI failures from cascading
- Competitive advantage: Production-ready AI architecture from day one
The Stepflow Advantage
This isn't just another workflow engine - it's production-ready AI infrastructure that scales from prototype to enterprise. In one demo, you've seen patterns that typically take months to build and validate.
Ready to apply this to your AI workloads? This example shows you exactly how to architect, deploy, and scale production AI systems using Stepflow's proven patterns.