1. Deploy benchmark instances
2. Install benchmark tools
3. FP16 matrix multiplication benchmark
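A minimal sketch of such a benchmark, assuming PyTorch with CUDA support is installed (the guide does not name its tooling, so PyTorch is an assumption here). The function names `matmul_tflops` and `bench_fp16_matmul`, the matrix size, and the iteration counts are illustrative choices, not prescribed values.

```python
import time


def matmul_tflops(n: int, seconds_per_iter: float) -> float:
    """TFLOPS for one n x n by n x n matmul, which costs 2 * n^3 floating-point ops."""
    return 2 * n**3 / seconds_per_iter / 1e12


def bench_fp16_matmul(n: int = 8192, iters: int = 50, warmup: int = 10) -> float:
    """Time FP16 matmuls on the current CUDA device and return achieved TFLOPS."""
    import torch  # assumption: PyTorch with CUDA; imported lazily so the helper above has no dependency

    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)

    for _ in range(warmup):  # let kernels load and clocks ramp up before timing
        c = a @ b
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        c = a @ b
    torch.cuda.synchronize()  # CUDA calls are asynchronous; wait before stopping the clock
    elapsed = time.perf_counter() - start

    return matmul_tflops(n, elapsed / iters)
```

On a GPU instance, `print(f"{bench_fp16_matmul():.1f} TFLOPS")` reports the measured figure. Measured throughput normally lands below the peak values a vendor quotes.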
This test measures raw compute throughput. Run it on each instance to compare.
4. Memory bandwidth test
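One way to approximate memory bandwidth is to time a large device-to-device copy. The sketch below assumes PyTorch with CUDA support; `copy_bandwidth_gbs`, `bench_memory_bandwidth`, and the 1 GiB buffer size are hypothetical choices made for illustration.

```python
import time


def copy_bandwidth_gbs(num_bytes: int, seconds_per_iter: float) -> float:
    """GB/s for a device-to-device copy: every byte is read once and written once."""
    return 2 * num_bytes / seconds_per_iter / 1e9


def bench_memory_bandwidth(size_gib: int = 1, iters: int = 20, warmup: int = 5) -> float:
    """Time buffer copies on the current CUDA device and return GB/s."""
    import torch  # assumption: PyTorch with CUDA; imported lazily

    num_bytes = size_gib * 1024**3
    # Buffers far larger than the GPU's L2 cache, so traffic goes to device memory.
    src = torch.empty(num_bytes, device="cuda", dtype=torch.uint8)
    dst = torch.empty_like(src)

    for _ in range(warmup):
        dst.copy_(src)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()  # wait for all queued copies to finish
    elapsed = time.perf_counter() - start

    return copy_bandwidth_gbs(num_bytes, elapsed / iters)
```

A copy-based measurement usually reaches only part of the theoretical bandwidth, so treat the result as a lower bound when comparing it with spec-sheet numbers.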
5. Check GPU specs
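The specs can also be read programmatically. This sketch assumes PyTorch with CUDA support and uses `torch.cuda.get_device_properties`; the helper names `format_gpu_specs` and `gpu_specs` are hypothetical.

```python
def format_gpu_specs(name: str, total_memory_bytes: int, sm_count: int,
                     capability: tuple[int, int]) -> str:
    """Render the key device properties on one line."""
    mem_gib = total_memory_bytes / 1024**3
    return (f"{name} | {mem_gib:.1f} GiB | {sm_count} SMs | "
            f"compute capability {capability[0]}.{capability[1]}")


def gpu_specs(device_index: int = 0) -> str:
    """Query the given CUDA device and return a one-line summary."""
    import torch  # assumption: PyTorch with CUDA; imported lazily

    props = torch.cuda.get_device_properties(device_index)
    return format_gpu_specs(
        props.name,
        props.total_memory,
        props.multi_processor_count,
        (props.major, props.minor),
    )
```

Calling `print(gpu_specs())` on each instance confirms that the benchmark is running on the GPU model you expect.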
Expected results
| Benchmark | RTX 4090 | A100 80 GB | H100 80 GB |
|---|---|---|---|
| FP16 matmul (TFLOPS) | ~165 | ~312 | ~990 |
| Memory bandwidth (GB/s) | ~1,000 | ~2,000 | ~3,350 |
| Llama 8B tok/s (batch=1) | ~90 | ~130 | ~180 |
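The first two rows of the table can be combined into a rough roofline-style "ridge point": the number of floating-point operations per byte of memory traffic above which a workload becomes compute-bound. The sketch below uses only the figures from the table; the names `PEAKS` and `ridge_point` are illustrative.

```python
# (FP16 TFLOPS, memory bandwidth in GB/s), copied from the table above.
PEAKS = {
    "RTX 4090": (165, 1000),
    "A100 80 GB": (312, 2000),
    "H100 80 GB": (990, 3350),
}


def ridge_point(tflops: float, bandwidth_gbs: float) -> float:
    """FLOPs per byte at which compute, not memory, becomes the limit."""
    return (tflops * 1e12) / (bandwidth_gbs * 1e9)


for gpu, (tflops, bandwidth) in PEAKS.items():
    print(f"{gpu}: {ridge_point(tflops, bandwidth):.0f} FLOP/byte")
```

Batch-1 token generation with FP16 weights performs roughly 1 FLOP per byte of weights read, far below any of these ridge points, so it is limited by memory bandwidth. Large matrix multiplications sit well above them and are limited by compute.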
Tips
- Run each benchmark 3 times and average the results, because GPU boost clocks vary between runs.
- FP16 matmul tests compute-bound workloads (training). Memory bandwidth tests memory-bound workloads (inference).
- Of the three GPUs compared here, the RTX 4090 offers the best price-to-performance for inference, and the H100 delivers the highest training throughput.
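The averaging tip can be sketched with the standard library alone. `average_runs` is a hypothetical helper that accepts any zero-argument benchmark function returning a number, and it also reports the spread so that noisy runs are easy to spot.

```python
from statistics import mean, stdev
from typing import Callable


def average_runs(benchmark: Callable[[], float], runs: int = 3) -> tuple[float, float]:
    """Call the benchmark `runs` times; return (mean, sample standard deviation)."""
    results = [benchmark() for _ in range(runs)]
    spread = stdev(results) if runs > 1 else 0.0  # stdev needs at least two samples
    return mean(results), spread
```

If the spread is large relative to the mean, the runs disagree too much for a reliable comparison, and more runs or a longer warm-up are worth trying.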