skypilot-multi-cloud-orchestration by davila7
Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or batch jobs across multiple clouds, leverage spot instances with auto-recovery, or optimize GPU costs across providers.
DevOps
15.7K Stars
1.4K Forks
Updated Jan 12, 2026, 05:31 AM
Why Use This
This skill provides practical guidance for running ML workloads across clouds with SkyPilot, including automatic cost optimization and spot-instance recovery.
Use Cases
- Running training or batch jobs across multiple cloud providers (AWS, GCP, Azure, Kubernetes, and more)
- Leveraging spot instances with automatic recovery and checkpointing
- Optimizing GPU costs with automatic cloud/region selection
Skill Snapshot
Auto scan of skill assets. Informational only.
Valid SKILL.md
Checks against SKILL.md specification
Source & Community
Repository claude-code-templates
Skill Version
main
Community
15.7K Stars · 1.4K Forks
Updated At Jan 12, 2026, 05:31 AM
Skill Stats
SKILL.md 510 Lines
Total Files 1
Total Size 0 B
License MIT
---
name: skypilot-multi-cloud-orchestration
description: Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or batch jobs across multiple clouds, leverage spot instances with auto-recovery, or optimize GPU costs across providers.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Infrastructure, Multi-Cloud, Orchestration, GPU, Cost Optimization, SkyPilot]
dependencies: [skypilot>=0.7.0]
---
# SkyPilot Multi-Cloud Orchestration
Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.
## When to use SkyPilot
**Use SkyPilot when:**
- Running ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
- Need cost optimization with automatic cloud/region selection
- Running long jobs on spot instances with auto-recovery
- Managing distributed multi-node training
- Want unified interface for 20+ cloud providers
- Need to avoid vendor lock-in
**Key features:**
- **Multi-cloud**: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
- **Cost optimization**: Automatic cheapest cloud/region selection
- **Spot instances**: 3-6x cost savings with automatic recovery
- **Distributed training**: Multi-node jobs with gang scheduling
- **Managed jobs**: Auto-recovery, checkpointing, fault tolerance
- **Sky Serve**: Model serving with autoscaling
**Use alternatives instead:**
- **Modal**: For simpler serverless GPU with Python-native API
- **RunPod**: For single-cloud persistent pods
- **Kubernetes**: For existing K8s infrastructure
- **Ray**: For pure Ray-based orchestration
## Quick start
### Installation
```bash
pip install "skypilot[aws,gcp,azure,kubernetes]"
# Verify cloud credentials
sky check
```
### Hello World
Create `hello.yaml`:
```yaml
resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"
```
Launch:
```bash
sky launch -c hello hello.yaml
# SSH to cluster
ssh hello
# Terminate
sky down hello
```
## Core concepts
### Task YAML structure
```yaml
# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws            # Optional: auto-select if omitted
  region: us-west-2     # Optional: auto-select if omitted
  accelerators: A100:4  # GPU type and count
  cpus: 8+              # Minimum CPUs
  memory: 32+           # Minimum memory (GB)
  use_spot: true        # Use spot instances
  disk_size: 256        # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py
```
### Key commands
| Command | Purpose |
|---------|---------|
| `sky launch` | Launch cluster and run task |
| `sky exec` | Run task on existing cluster |
| `sky status` | Show cluster status |
| `sky stop` | Stop cluster (preserve state) |
| `sky down` | Terminate cluster |
| `sky logs` | View task logs |
| `sky queue` | Show job queue |
| `sky jobs launch` | Launch managed job |
| `sky serve up` | Deploy serving endpoint |
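A typical iteration loop chains these commands; in the sketch below, `dev` and `task.yaml` are placeholder names:
```bash
# Provision a cluster named "dev" and run the task
sky launch -c dev task.yaml

# Re-run work on the same cluster without re-provisioning or re-running setup
sky exec dev task.yaml

# Inspect the job queue and stream logs
sky queue dev
sky logs dev

# Stop the cluster (state preserved) or terminate it entirely
sky stop dev
sky down dev
```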
## GPU configuration
### Available accelerators
```yaml
# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8
# Cloud-specific
accelerators: V100:4 # AWS/GCP
accelerators: TPU-v4-8 # GCP TPUs
```
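To check which of these accelerators your enabled clouds offer, and at what hourly price, you can query SkyPilot's catalog from the CLI (a quick sketch; output depends on the clouds enabled via `sky check`):
```bash
# List common GPUs with per-cloud availability and pricing
sky show-gpus

# Show details for a specific accelerator and count
sky show-gpus A100:8
```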
### GPU fallbacks
```yaml
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure
```
### Spot instances
```yaml
resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER  # Auto-recover on preemption
```
## Cluster management
### Launch and execute
```bash
# Launch new cluster
sky launch -c mycluster task.yaml
# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml
# Interactive SSH
ssh mycluster
# Stream logs
sky logs mycluster
```
### Autostop
```yaml
resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true  # Terminate instead of stop
```
```bash
# Set autostop via CLI
sky autostop mycluster -i 30 --down
```
### Cluster status
```bash
# All clusters
sky status
# Detailed view
sky status -a
```
## Distributed training
### Multi-node setup
```yaml
resources:
  accelerators: A100:8

num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py
```
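Assuming the task above is saved as `distributed.yaml`, a single launch provisions all four nodes together and runs the `run` section on each node:
```bash
sky launch -c train-cluster distributed.yaml
```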
### Environment variables
| Variable | Description |
|----------|-------------|
| `SKYPILOT_NODE_RANK` | Node index (0 to num_nodes-1) |
| `SKYPILOT_NODE_IPS` | Newline-separated IP addresses |
| `SKYPILOT_NUM_NODES` | Total number of nodes |
| `SKYPILOT_NUM_GPUS_PER_NODE` | GPUs per node |
### Head-node-only execution
```yaml
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi
```
## Managed jobs
### Spot recovery
```bash
# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml
```
### Checkpointing
```yaml
name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
```
### Job management
```bash
# List jobs
sky jobs queue
# View logs
sky jobs logs my-job
# Cancel job
sky jobs cancel my-job
```
## File mounts and storage
### Local file sync
```yaml
workdir: ./my-project  # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc
```
### Cloud storage
```yaml
file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT          # Stream from S3
  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY           # Pre-fetch to disk
  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED
```
### Storage modes
| Mode | Description | Best For |
|------|-------------|----------|
| `MOUNT` | Stream from cloud | Large datasets, read-heavy |
| `COPY` | Pre-fetch to disk | Small files, random access |
| `MOUNT_CACHED` | Cache with async upload | Checkpoints, outputs |
## Sky Serve (Model Serving)
### Basic service
```yaml
# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000
```
```bash
# Deploy
sky serve up -n my-service service.yaml
# Check status
sky serve status
# Get endpoint
sky serve status my-service
```
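Because this service runs vLLM's OpenAI-compatible server, you can smoke-test it with a plain HTTP request once you have the endpoint address reported by `sky serve status my-service` (a sketch; substitute the real host and port):
```bash
# Placeholder: copy the endpoint shown by `sky serve status my-service`
ENDPOINT="<host:port reported by sky serve status>"

curl "http://$ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```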
### Autoscaling policies
```yaml
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
```
## Cost optimization
### Automatic cloud selection
```yaml
# SkyPilot finds cheapest option
resources:
  accelerators: A100:8
  # No cloud specified - auto-select cheapest
```
```bash
# Show optimizer decision
sky launch task.yaml --dryrun
```
### Cloud preferences
```yaml
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure
```
### Environment variables and secrets
```yaml
envs:
  HF_TOKEN: $HF_TOKEN              # Inherited from local env
  WANDB_API_KEY: $WANDB_API_KEY

# Or use secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY
```
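Values can also be supplied or overridden at launch time with `--env`; passing just the variable name forwards it from your local shell (a sketch with a placeholder `task.yaml`):
```bash
# Forward HF_TOKEN and WANDB_API_KEY from the local environment
sky launch task.yaml --env HF_TOKEN --env WANDB_API_KEY
```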
## Common workflows
### Workflow 1: Fine-tuning with checkpoints
```yaml
name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume
```
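Assuming this YAML is saved as `finetune.yaml`, run it as a managed job so preempted spot instances are relaunched and training resumes from the mounted checkpoints:
```bash
sky jobs launch -n llm-finetune finetune.yaml
```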
### Workflow 2: Hyperparameter sweep
```yaml
name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID
```
```bash
# Launch multiple jobs
for i in {1..10}; do
  sky jobs launch sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
```
## Debugging
```bash
# SSH to cluster
ssh mycluster
# View logs
sky logs mycluster
# Check job queue
sky queue mycluster
# View managed job logs
sky jobs logs my-job
```
## Common issues
| Issue | Solution |
|-------|----------|
| Quota exceeded | Request quota increase, try different region |
| Spot preemption | Use `sky jobs launch` for auto-recovery |
| Slow file sync | Use `MOUNT_CACHED` mode for outputs |
| GPU not available | Use `any_of` for fallback clouds |
## References
- **[Advanced Usage](references/advanced-usage.md)** - Multi-cloud, optimization, production patterns
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
## Resources
- **Documentation**: https://docs.skypilot.co
- **GitHub**: https://github.com/skypilot-org/skypilot
- **Slack**: https://slack.skypilot.co
- **Examples**: https://github.com/skypilot-org/skypilot/tree/master/examples