1. Deploy and install
2. Start the server
3. Pull models
4. Test the API
5. Use the OpenAI-compatible endpoint
6. Larger models on A100
For 70B+ models, use an A100 80 GB:Monitoring
Tips
- Ollama quantizes models by default (Q4). For higher quality, use
:fp16tags if VRAM allows. - The first request after pulling a model is slower — Ollama loads into GPU memory on demand.
- For production workloads with high concurrency, use vLLM instead.