1. Deploy an instance
2. Install tools
3. Authenticate (for gated models)
Some models (Llama, Gemma, Mistral) require HuggingFace access tokens.
4. Download a model
Download to /workspace/ so the model persists if you attach a volume.
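As a minimal sketch of this step, the download can be scripted with the `huggingface_hub` Python package (assumed to be installed, for example via `pip install huggingface_hub`). The helper names `target_dir` and `download_model`, the `/workspace/models` subfolder layout, and reading the token from an `HF_TOKEN` environment variable are illustrative assumptions, not something this guide prescribes.

```python
import os


def target_dir(repo_id: str, root: str = "/workspace/models") -> str:
    """Map 'org/name' to '<root>/name' (hypothetical layout for this sketch)."""
    return os.path.join(root, repo_id.split("/")[-1])


def download_model(repo_id: str, root: str = "/workspace/models") -> str:
    """Download a full model snapshot under root and return the local path."""
    # Imported lazily so the helpers above can be used without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=target_dir(repo_id, root),
        # Gated models need a token. If HF_TOKEN is unset this passes None,
        # and the library falls back to a token saved by a previous login, if any.
        token=os.environ.get("HF_TOKEN"),
    )
```

Calling `download_model("org/model-name")` with a real repository id in place of the placeholder would then place the files under `/workspace/models/model-name`.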
5. Serve with vLLM
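The sketch below builds a `vllm serve` command for a model directory downloaded earlier. It assumes vLLM is installed and on the PATH; the helper name `vllm_serve_command` and the example path are hypothetical, and the right flag values depend on your GPU setup.

```python
import shlex


def vllm_serve_command(model_path: str, port: int = 8000,
                       tensor_parallel: int = 1) -> list[str]:
    """Build the argument list for vLLM's OpenAI-compatible server."""
    cmd = ["vllm", "serve", model_path, "--port", str(port)]
    if tensor_parallel > 1:
        # Shard the model across several GPUs on the same instance.
        cmd += ["--tensor-parallel-size", str(tensor_parallel)]
    return cmd


# Print the command so it can be inspected or pasted into a shell.
# The model path is a placeholder, not a real download.
print(shlex.join(vllm_serve_command("/workspace/models/model-name")))
```

To launch the server from Python rather than a shell, the same list can be passed to `subprocess.run(cmd, check=True)`.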
Persist models with a volume
Avoid re-downloading large models by using a persistent volume. Files under /workspace/models/ persist across instance restarts.
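One way to confirm what survived a restart is to list the model directories already on the volume with their sizes. This is a rough sketch: the helper name `cached_models` is hypothetical, and it assumes each model lives in its own subdirectory of `/workspace/models/`.

```python
from pathlib import Path


def cached_models(root: str = "/workspace/models") -> dict[str, float]:
    """Return {model directory name: size in GB} for each subdirectory of root."""
    base = Path(root)
    if not base.is_dir():
        # Volume not attached, or nothing downloaded yet.
        return {}
    sizes = {}
    for model_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        # Sum the sizes of all regular files below the model directory.
        total = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file())
        sizes[model_dir.name] = total / 1e9
    return sizes
```

An empty result after a restart suggests the volume was not attached or the model was saved outside it.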
Download specific file types
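A possible way to fetch only certain file types is the `allow_patterns` argument of `snapshot_download` in `huggingface_hub`. The pattern list and the helper names below are illustrative assumptions; `would_download` only approximates the library's glob matching so a pattern list can be previewed offline.

```python
from fnmatch import fnmatch

# Example selection: safetensors weights plus JSON configs and the tokenizer.
SAFETENSORS_ONLY = ["*.safetensors", "*.json", "tokenizer.model"]


def would_download(filename: str, patterns=SAFETENSORS_ONLY) -> bool:
    """Preview whether a filename matches any of the glob patterns."""
    return any(fnmatch(filename, p) for p in patterns)


def download_filtered(repo_id: str, local_dir: str, patterns=SAFETENSORS_ONLY) -> str:
    """Download only the files in the repository that match the patterns."""
    # Imported lazily so would_download works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=list(patterns),
    )
```

Skipping duplicate weight formats such as `.bin` files can cut the download roughly in half for repositories that ship both formats.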
Tips
- For gated models, create a token at huggingface.co/settings/tokens with read scope.
- Cloud download speeds are typically 1-5 GB/min, much faster than home connections.
- A 70B FP16 model is ~140 GB. An FP8 version is ~70 GB. Check VRAM before downloading.
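The size figures in the last tip follow from parameter count times bytes per parameter, and the download-speed range gives a rough time estimate. The sketch below works through that arithmetic; the function names are hypothetical, and the result covers weights only, since serving also needs VRAM for the KV cache and activations.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB: 1e9 params * bytes each / 1e9 bytes per GB."""
    return float(params_billion * BYTES_PER_PARAM[dtype])


def download_minutes(size_gb: float, low: float = 1.0, high: float = 5.0):
    """Return (best case, worst case) minutes at high and low GB/min speeds."""
    return size_gb / high, size_gb / low


size = weights_gb(70, "fp16")
print(size)                    # 140.0
print(weights_gb(70, "fp8"))   # 70.0
print(download_minutes(size))  # (28.0, 140.0)
```

At the 1-5 GB/min range quoted above, a 140 GB model would take somewhere between about half an hour and a little over two hours to download.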