1. Deploy a GPU instance
2. Install lm-eval-harness
3. Run a benchmark
Evaluate Llama 3.1 8B on MMLU (5-shot):
4. Run a full benchmark suite
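The single-task run in step 3 and a suite run differ only in the task list passed to the harness. The sketch below builds both `lm_eval` command lines and prints them without executing anything. It assumes the Hugging Face (`hf`) backend, the model ID `meta-llama/Llama-3.1-8B`, and a hypothetical output directory; the helper `build_lm_eval_command` is illustrative and not part of the harness.

```python
import shlex


def build_lm_eval_command(model_id, tasks, num_fewshot=None, output_path=None):
    """Return the argv list for one lm_eval run (nothing is executed here)."""
    cmd = [
        "lm_eval",
        "--model", "hf",                          # default Hugging Face backend
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),               # comma-separated task names
        "--batch_size", "auto",
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


# Assumed model ID; substitute the checkpoint you actually evaluate.
MODEL_ID = "meta-llama/Llama-3.1-8B"

# Step 3: one task, 5-shot.
single = build_lm_eval_command(MODEL_ID, ["mmlu"], num_fewshot=5)

# Step 4: several tasks in one run, using each task's default few-shot setting.
suite = build_lm_eval_command(
    MODEL_ID,
    ["mmlu", "hellaswag", "arc_challenge", "gsm8k"],
    output_path="results/llama-3.1-8b",  # hypothetical directory
)

print(shlex.join(single))
print(shlex.join(suite))
```

Paste the printed lines into a shell on the GPU instance, or pass the list to `subprocess.run(single, check=True)` to launch the run from Python.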
5. Compare two models
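One way to compare two runs is to diff their result files metric by metric. The sketch below assumes each run wrote a JSON file whose top-level `results` key maps task names to metric dictionaries (for example a key such as `acc,none`); that layout is an assumption and may vary between harness versions. The function names and file paths are hypothetical.

```python
import json


def load_results(path):
    """Read one results JSON file and return its per-task metrics dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["results"]  # assumed top-level key


def compare_results(results_a, results_b):
    """Return {(task, metric): (a, b, b - a)} for numeric metrics both runs share."""
    rows = {}
    for task in sorted(set(results_a) & set(results_b)):
        metrics_a, metrics_b = results_a[task], results_b[task]
        for metric in sorted(set(metrics_a) & set(metrics_b)):
            a, b = metrics_a[metric], metrics_b[metric]
            is_number = isinstance(a, (int, float)) and isinstance(b, (int, float))
            # Skip standard-error entries and non-numeric fields such as aliases.
            if is_number and "stderr" not in metric:
                rows[(task, metric)] = (a, b, b - a)
    return rows


def format_rows(rows):
    """Render the comparison as aligned text: task, metric, A, B, delta."""
    lines = []
    for (task, metric), (a, b, delta) in rows.items():
        lines.append(f"{task:<16} {metric:<24} {a:8.4f} {b:8.4f} {delta:+8.4f}")
    return "\n".join(lines)


# Example call (paths are placeholders for your own result files):
# print(format_rows(compare_results(load_results("model_a.json"),
#                                   load_results("model_b.json"))))
```

Only tasks and metrics present in both files are reported, so a task that ran for one model but not the other is silently left out rather than shown with a missing value.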
6. Download results
Available benchmark tasks
| Task | Measures |
|---|---|
| `mmlu` | Knowledge across 57 subjects |
| `hellaswag` | Common-sense reasoning |
| `arc_challenge` | Science reasoning (hard) |
| `truthfulqa_mc2` | Truthfulness |
| `gsm8k` | Grade-school math |
| `humaneval` | Code generation |
Tips
- Use `--batch_size auto` to find the largest batch size that fits in VRAM.
- The vLLM backend is significantly faster than the default HuggingFace backend.
- For gated models, authenticate with `huggingface-cli login` first.
- Run the same tasks with the same `num_fewshot` across models for fair comparison.
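The last tip can be checked mechanically before comparing scores. The sketch below takes two task-to-shot-count mappings, one per run, and reports the shared tasks whose few-shot settings differ. Where those mappings come from is an assumption: recent harness versions appear to record them in the results file (for example under an `n-shot` key), but verify this against your own output. The function name is hypothetical.

```python
def fewshot_mismatches(n_shot_a, n_shot_b):
    """Return {task: (shots_a, shots_b)} for shared tasks whose few-shot counts differ."""
    return {
        task: (n_shot_a[task], n_shot_b[task])
        for task in sorted(set(n_shot_a) & set(n_shot_b))
        if n_shot_a[task] != n_shot_b[task]
    }
```

An empty dictionary means every task the two runs have in common used the same number of few-shot examples.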