1. Deploy a multi-GPU instance
2. Verify GPU topology
3. PyTorch DDP training
Upload and run your training script withtorchrun:
4. DeepSpeed ZeRO (for larger models)
5. Monitor training
6. Download results
Tips
- DDP scales linearly with GPU count for data-parallel workloads. 4 GPUs = ~3.8x throughput.
- DeepSpeed ZeRO-3 shards model parameters, gradients, and optimizer states across GPUs — use it when the model does not fit on a single GPU.
- Use
--gradient_accumulation_stepsto simulate larger batch sizes without increasing per-GPU memory.