“Run a quick benchmark comparing PyTorch and JAX matmul performance on an RTX 4090.”
The agent handles everything end-to-end:

1. `create_instance`: deploys `bench-run` with an RTX 4090
2. `instance_status`: waits until running
3. `ssh_execute`: `pip install torch jax[cuda12]`
4. `file_upload`: uploads the benchmark script
5. `ssh_execute`: `python /root/benchmark.py`
6. `delete_instance`: terminates immediately
Benchmark results (RTX 4090, 4096×4096 matmul, 1000 iterations):

| Framework | Mean time | Std dev | TFLOPS |
|---|---|---|---|
| PyTorch 2.4 | 2.31 ms | 0.04 ms | 59.4 |
| JAX 0.4.31 | 2.28 ms | 0.03 ms | 60.2 |

Both frameworks perform nearly identically on this workload. JAX is marginally faster (~1.3%), likely due to XLA compilation. Billed at $0.35/hr. Instance terminated. Nothing left running.
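The TFLOPS figures follow from the mean times by simple arithmetic. The sketch below assumes the usual convention of counting 2·n³ floating-point operations for an n×n by n×n matmul; the function name `matmul_tflops` is made up for illustration and is not part of the benchmark script.

```python
def matmul_tflops(n: int, mean_time_ms: float) -> float:
    """Throughput of an n x n by n x n matmul, counting 2 * n**3 operations."""
    flops = 2 * n ** 3                 # one multiply and one add per inner-product term
    seconds = mean_time_ms / 1000.0    # the table reports milliseconds
    return flops / seconds / 1e12      # convert FLOP/s to TFLOPS


# Mean times taken from the results table above.
for framework, mean_ms in [("PyTorch", 2.31), ("JAX", 2.28)]:
    print(f"{framework}: {matmul_tflops(4096, mean_ms):.1f} TFLOPS")
```

The computed values land within about 0.1 TFLOPS of the table, which is consistent with the reported mean times having been rounded to two decimals.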
“Test if my fine-tuned model loads correctly. Deploy a GPU, load it, run 5 prompts, and shut down.”
The agent runs the validation:

1. `create_instance`: deploys `model-test` with an RTX 4090
2. `instance_status`: waits until running
3. `ssh_execute`: `pip install torch transformers`
4. `file_upload`: uploads `test_model.py` with 5 test prompts
5. `ssh_execute`: `python /root/test_model.py`
6. `billing_usage`: checks cost
7. `delete_instance`: terminates
Model validation results:

| Prompt | Output | Quality |
|---|---|---|
| “Summarize this paper…” | 3-sentence summary, accurate | pass |
| “Translate to French…” | Correct translation | pass |
| “Extract entities from…” | Found 4/4 entities | pass |
| “Generate SQL for…” | Valid query, correct joins | pass |
| “Classify sentiment…” | Correct label, 0.94 confidence | pass |

All 5 test prompts passed. Model loads in 8.2 seconds, inference at ~45 tok/s. Cost: $0.06 (10 minutes). Instance terminated.
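The quoted cost can be sanity-checked with a rough estimate. This sketch assumes billing is linear in wall-clock time and that the RTX 4090 is charged at the $0.35/hr rate quoted in the benchmark example; `estimate_cost` is an illustrative helper, not one of the agent's tools.

```python
def estimate_cost(minutes: float, hourly_rate_usd: float) -> float:
    """Approximate charge for a run, assuming simple pro-rata billing."""
    return minutes / 60.0 * hourly_rate_usd


# 10 minutes on an RTX 4090 at the assumed $0.35/hr rate.
cost = estimate_cost(10, 0.35)
print(f"${cost:.2f}")
```

Under those assumptions the estimate rounds to the $0.06 reported above. Actual billing may use a minimum charge or coarser increments, so `billing_usage` remains the authoritative figure.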
“I want to try 3 different LoRA ranks and see which gives the best eval loss. Quick and dirty.”
The agent runs a fast comparison:

1. `create_instance`: deploys `lora-test` with an A100 80GB
2. `instance_status`: waits until running
3. `ssh_execute`: installs dependencies
4. `file_upload`: uploads the training script and a small eval dataset
5. `ssh_execute`: runs with `--lora-rank 8` (500 steps)
6. `ssh_execute`: runs with `--lora-rank 16` (500 steps)
7. `ssh_execute`: runs with `--lora-rank 32` (500 steps)
8. `delete_instance`: terminates
LoRA rank comparison (500 steps each, Llama 8B, 1K sample dataset):

| Rank | Trainable params | Eval loss | Train time |
|---|---|---|---|
| 8 | 3.4M | 1.842 | 4 min |
| 16 | 6.8M | 1.687 | 5 min |
| 32 | 13.6M | 1.691 | 7 min |

Rank 16 wins. It achieves the lowest eval loss; rank 32 doesn’t improve further and trains 40% slower. The extra parameters at rank 32 are likely overfitting on this small dataset. Billed at $1.60/hr. Instance terminated.
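The conclusion can be checked directly from the table. The snippet below is a minimal sketch that hard-codes the reported numbers; the dictionary layout and variable names are illustrative only.

```python
# Results copied from the LoRA rank comparison table.
results = [
    {"rank": 8, "eval_loss": 1.842, "train_min": 4},
    {"rank": 16, "eval_loss": 1.687, "train_min": 5},
    {"rank": 32, "eval_loss": 1.691, "train_min": 7},
]

# The best configuration is the one with the lowest eval loss.
best = min(results, key=lambda r: r["eval_loss"])

# Relative training-time increase of rank 32 over rank 16.
by_rank = {r["rank"]: r for r in results}
slowdown = by_rank[32]["train_min"] / by_rank[16]["train_min"] - 1

print(f"best rank: {best['rank']} (eval loss {best['eval_loss']})")
print(f"rank 32 vs rank 16 train time: +{slowdown:.0%}")
```

With a 0.004 gap in eval loss between ranks 16 and 32 on a 1K-sample dataset, the ranking of those two could plausibly change with a different seed, so the time cost is the more robust reason to prefer rank 16 here.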
Tools used in this workflow
| Tool | Purpose |
|---|---|
| `create_instance` / `instance_status` | Provision and wait for GPU |
| `ssh_execute` | Install packages, run experiments |
| `file_upload` | Transfer experiment scripts |
| `delete_instance` | Tear down immediately after results |
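All three examples share the same provision, run, tear-down shape. The sketch below shows one way to express that pattern so that teardown happens even when a step fails. It is not the agent's actual implementation: the `GpuClient` interface, its method signatures, the `"running"` status string, and `FakeClient` are all assumptions made for illustration, with only the method names borrowed from the tools in the table.

```python
import time
from typing import Protocol


class GpuClient(Protocol):
    """Hypothetical client; method names mirror the tools, signatures are assumed."""

    def create_instance(self, name: str, gpu: str) -> str: ...
    def instance_status(self, instance_id: str) -> str: ...
    def ssh_execute(self, instance_id: str, command: str) -> str: ...
    def file_upload(self, instance_id: str, local_path: str, remote_path: str) -> None: ...
    def delete_instance(self, instance_id: str) -> None: ...


def run_experiment(client: GpuClient, name: str, gpu: str, setup_cmd: str,
                   script_path: str, poll_seconds: float = 5.0) -> str:
    """Provision a GPU, run one script, and always tear the instance down."""
    instance_id = client.create_instance(name, gpu)
    try:
        # "running" is an assumed status value.
        while client.instance_status(instance_id) != "running":
            time.sleep(poll_seconds)
        client.ssh_execute(instance_id, setup_cmd)
        remote_path = "/root/" + script_path.rsplit("/", 1)[-1]
        client.file_upload(instance_id, script_path, remote_path)
        return client.ssh_execute(instance_id, f"python {remote_path}")
    finally:
        # Runs on success and on failure, so the GPU does not keep billing.
        client.delete_instance(instance_id)


class FakeClient:
    """In-memory stand-in that records calls, used only to exercise the sketch."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def create_instance(self, name: str, gpu: str) -> str:
        self.calls.append("create_instance")
        return "inst-1"

    def instance_status(self, instance_id: str) -> str:
        self.calls.append("instance_status")
        return "running"

    def ssh_execute(self, instance_id: str, command: str) -> str:
        self.calls.append("ssh_execute")
        return f"ran: {command}"

    def file_upload(self, instance_id: str, local_path: str, remote_path: str) -> None:
        self.calls.append("file_upload")

    def delete_instance(self, instance_id: str) -> None:
        self.calls.append("delete_instance")


client = FakeClient()
output = run_experiment(client, "bench-run", "RTX 4090",
                        "pip install torch jax[cuda12]", "benchmark.py")
print(output)
print(client.calls)
```

Putting `delete_instance` in a `finally` block is the design choice that matters: a failed install or a crashing script still ends with the instance terminated.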