What you’ll build
A self-hosted inference API that serves Llama 3.1 70B (or any model) on an A100/H100, accessible from anywhere via a public IP. You can point your existing OpenAI SDK code at it.Why vLLM
vLLM uses PagedAttention to manage GPU memory efficiently — on an 80GB H100 running a 7B FP16 model, this means serving 100+ concurrent requests instead of ~30. The V1 engine (default since v0.6.0) added disaggregated prefill/decode, preventing long prompts from blocking in-flight requests.Option A: CLI
1. Deploy the instance
2. Install vLLM and start the server
3. Test it
4. Point your app at it
Option B: Python SDK
Option C: MCP (via Claude Code / Cursor)
“Deploy an A100 instance called llm-server. Once it’s ready, install vLLM and start serving Llama 3.1 70B on port 8000. Give me the IP when it’s up.”Your AI assistant will:
- Call
create_instancewithname: "llm-server"andgpu: "A100" - Poll
instance_statusuntil deployed - Call
ssh_executeto install vLLM and start the server - Return the IP from
get_instance