Which Llama model to pick
| Model | Parameters | GPU | VRAM needed | Approx. cost |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | RTX 4090 (24 GB) | ~16 GB (FP16) | ~$0.35/hr |
| Llama 4 Scout | 17B active (109B total MoE) | A100 80 GB | ~70 GB (BF16) | ~$1.60/hr |
| Llama 3.1 70B Instruct | 70B | A100 80 GB | ~70 GB (BF16) | ~$1.60/hr |
| Llama 3.1 70B Instruct | 70B | 2x A100 40 GB | ~35 GB each | ~$2.40/hr |
| Llama 3.1 405B Instruct (FP8) | 405B | 4x H100 80 GB | ~50 GB each | ~$10.00/hr |
Option 1: Models API (easiest — no GPU needed)
The fastest path. Hit the Runcrate Models API directly and pay per token. No instance to manage, no vLLM to install, no GPU to provision.curl
Python (OpenAI SDK)
TypeScript (OpenAI SDK)
Option 2: Self-host with vLLM (full control)
Run your own OpenAI-compatible endpoint on a dedicated GPU. You control the model, the context length, the quantization, and the scaling.Deploy Llama 3.1 8B (single RTX 4090)
Deploy Llama 3.1 70B (single A100 80 GB)
Deploy Llama 4 Scout (single A100 80 GB)
Deploy Llama 3.1 405B FP8 (4x H100)
The 405B model requires tensor parallelism across multiple GPUs:Test your endpoint
Point your app at it
Once the server is running, point any OpenAI-compatible SDK at your instance:Monitoring
Option 3: Self-host with Ollama (simpler, quantized)
Ollama runs quantized models with a single command. Good for development and prototyping — not recommended for production throughput.Deploy and set up
Pull and serve a model
Test it
/v1/chat/completions, so you can use the same OpenAI SDK pattern:
Limitations
- Quantized models (Q4/Q5) trade quality for memory efficiency. For production accuracy, use vLLM with FP16 or FP8.
- Ollama’s serving throughput is lower than vLLM — fine for single-user development, not for concurrent production traffic.
- Larger models (70B Q4) need an A100 80 GB even with quantization.
Benchmarks
Expected throughput for each model/GPU combination with vLLM, batch size 1, 2048-token output:| Model | GPU | Tokens/sec (output) | Time to first token |
|---|---|---|---|
| Llama 3.1 8B | RTX 4090 | ~90–110 tok/s | ~50 ms |
| Llama 3.1 8B | A100 80 GB | ~120–150 tok/s | ~35 ms |
| Llama 4 Scout | A100 80 GB | ~60–80 tok/s | ~80 ms |
| Llama 3.1 70B | A100 80 GB | ~25–35 tok/s | ~150 ms |
| Llama 3.1 70B | 2x A100 40 GB | ~20–30 tok/s | ~200 ms |
| Llama 3.1 405B FP8 | 4x H100 | ~15–25 tok/s | ~300 ms |
Which approach to choose
| Approach | Best for | Cost | Setup time |
|---|---|---|---|
| Models API | Production apps, no infra to manage | Per token | 60 seconds |
| vLLM self-host | Custom serving, max throughput, data privacy | Per hour (GPU) | ~10 minutes |
| Ollama self-host | Development, prototyping, experimentation | Per hour (GPU) | ~5 minutes |