Run Quick ML Experiments with AI Agents

Run a one-off experiment without any setup overhead. Tell your AI agent what you want to test, and it handles deployment, execution, result collection, and cleanup in a single conversation.

“Run a quick benchmark comparing PyTorch and JAX matmul performance on an RTX 4090.”

The agent handles everything end-to-end:

create_instance — deploys bench-run with RTX 4090
instance_status — waits until running
ssh_execute — pip install torch jax[cuda12]
file_upload — uploads the benchmark script
ssh_execute — python /root/benchmark.py
delete_instance — terminates immediately
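
The sequence above follows a provision, run, tear-down pattern. The sketch below shows one way to express that pattern in Python so the instance is deleted even when a step fails. The tool names come from the list above; the `tools` object, its method signatures, and the `"running"` status string are assumptions made for illustration, not a documented client API.

```python
import time


def run_experiment(tools, name, gpu, commands, uploads=(),
                   poll_seconds=5.0, timeout_seconds=600.0):
    """Deploy an instance, run commands in order, and always tear it down.

    `tools` is a hypothetical object exposing the tool names from this
    workflow as methods; the real tool signatures may differ.
    """
    instance_id = tools.create_instance(name=name, gpu=gpu)
    try:
        # Poll until the instance reports "running" or the deadline passes.
        deadline = time.monotonic() + timeout_seconds
        while tools.instance_status(instance_id) != "running":
            if time.monotonic() > deadline:
                raise TimeoutError(f"{name} did not reach 'running' in time")
            time.sleep(poll_seconds)

        # Transfer scripts, then run each command and collect its output.
        for local_path, remote_path in uploads:
            tools.file_upload(instance_id, local_path, remote_path)
        return [tools.ssh_execute(instance_id, command) for command in commands]
    finally:
        # Cleanup runs on success and on failure, so nothing keeps billing.
        tools.delete_instance(instance_id)
```

Putting the delete call in `finally` is the design choice that matters here: a failed install or a crashed script still ends with the instance terminated.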

Benchmark results (RTX 4090, 4096x4096 matmul, 1000 iterations):

Framework	Mean time	Std dev	TFLOPS
PyTorch 2.4	2.31 ms	0.04 ms	59.4
JAX 0.4.31	2.28 ms	0.03 ms	60.2

Both frameworks perform nearly identically on this workload. JAX is marginally faster (~1.3%), likely due to XLA compilation. **Total cost: $0.12** (20 minutes on RTX 4090 at $0.35/hr). Instance terminated. Nothing left running.
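
The TFLOPS and cost figures can be sanity-checked with a few lines of arithmetic. This sketch assumes the common convention of counting 2·n³ floating-point operations for an n×n matmul; the source does not state how the agent computed TFLOPS, so treat the formula as an assumption.

```python
def matmul_tflops(n, mean_ms):
    """Throughput of an n x n by n x n matmul, counting 2 * n**3 FLOPs."""
    flops = 2 * n**3
    return flops / (mean_ms / 1000.0) / 1e12


def run_cost(minutes, rate_per_hour):
    """Cost in dollars for `minutes` of instance time at an hourly rate."""
    return minutes / 60.0 * rate_per_hour


# Mean times and pricing copied from the benchmark results.
for name, mean_ms in [("PyTorch", 2.31), ("JAX", 2.28)]:
    print(f"{name}: {matmul_tflops(4096, mean_ms):.1f} TFLOPS")

print(f"cost: ${run_cost(20, 0.35):.2f}")
```

Under that convention the results come out at about 59.5 and 60.3 TFLOPS, within 0.1 of the reported values; the small gap is consistent with the mean times being rounded to two decimals. The cost works out to $0.12.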

“Test if my fine-tuned model loads correctly. Deploy a GPU, load it, run 5 prompts, and shut down.”

The agent runs the validation:

create_instance — deploys model-test with RTX 4090
instance_status — waits until running
ssh_execute — pip install torch transformers
file_upload — uploads test_model.py with 5 test prompts
ssh_execute — python /root/test_model.py
billing_usage — checks cost
delete_instance — terminates

Model validation results:

Prompt	Output	Quality
“Summarize this paper…”	3-sentence summary, accurate	pass
“Translate to French…”	Correct translation	pass
“Extract entities from…”	Found 4/4 entities	pass
“Generate SQL for…”	Valid query, correct joins	pass
“Classify sentiment…”	Correct label, 0.94 confidence	pass

All 5 test prompts passed. The model loads in 8.2 seconds and runs inference at ~45 tok/s. Cost: $0.06 (10 minutes). Instance terminated.
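
The contents of `test_model.py` are not shown. As a rough idea of its shape, here is a minimal smoke-test harness that measures load time and throughput and records a pass or fail per prompt. The `validate` function, the `load_model` callable, and the whitespace-based token count are all hypothetical; here "pass" only means the model returned non-empty text, whereas the quality judgments in the table would need per-prompt checks.

```python
import time


def validate(load_model, prompts, count_tokens=lambda text: len(text.split())):
    """Load a model once, run each prompt, and report timing.

    `load_model` returns a `generate(prompt) -> str` callable. Token counts
    use whitespace splitting by default, which only approximates a real
    tokenizer.
    """
    start = time.perf_counter()
    generate = load_model()
    load_seconds = time.perf_counter() - start

    results, total_tokens, gen_seconds = [], 0, 0.0
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        gen_seconds += time.perf_counter() - start
        total_tokens += count_tokens(output)
        # A prompt passes this smoke test if it yields non-empty text.
        results.append({"prompt": prompt, "output": output,
                        "passed": bool(output.strip())})

    return {
        "load_seconds": load_seconds,
        "tokens_per_second": total_tokens / gen_seconds if gen_seconds > 0 else 0.0,
        "all_passed": all(r["passed"] for r in results),
        "results": results,
    }
```

In a real script, `load_model` would wrap whatever loading and generation calls your framework provides, and `count_tokens` would use the model's own tokenizer.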

“I want to try 3 different LoRA ranks and see which gives the best eval loss. Quick and dirty.”

The agent runs a fast comparison:

create_instance — deploys lora-test with A100 80GB
instance_status — waits until running
ssh_execute — installs dependencies
file_upload — uploads training script and small eval dataset
ssh_execute — runs with --lora-rank 8 (500 steps)
ssh_execute — runs with --lora-rank 16 (500 steps)
ssh_execute — runs with --lora-rank 32 (500 steps)
delete_instance — terminates

LoRA rank comparison (500 steps each, Llama 8B, 1K sample dataset):

Rank	Trainable params	Eval loss	Train time
8	3.4M	1.842	4 min
16	6.8M	1.687	5 min
32	13.6M	1.691	7 min

Rank 16 wins. It achieves the lowest eval loss; rank 32 doesn’t improve further and takes 40% longer to train. The extra parameters at rank 32 are likely overfitting on this small dataset. **Total cost: $0.43** (16 minutes on A100 at $1.60/hr). Instance terminated.
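
The trainable-parameter column doubles with the rank because each adapted weight matrix of shape d_in × d_out gains rank × (d_in + d_out) LoRA weights. The sketch below reproduces the column and picks the best rank. The results do not say which modules were adapted; the shapes used here (Llama 3 8B dimensions, LoRA on `q_proj` and `v_proj` in all 32 layers) are an assumption that happens to match the reported counts.

```python
def lora_params(rank, shapes, num_layers):
    """Trainable LoRA parameters: each adapted d_in x d_out matrix adds
    rank * (d_in + d_out) weights (the A and B factors)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed, not stated in the results: Llama 3 8B dimensions with LoRA on
# q_proj (4096 -> 4096) and v_proj (4096 -> 1024) in all 32 layers.
SHAPES = [(4096, 4096), (4096, 1024)]

# Eval losses copied from the comparison table.
eval_loss = {8: 1.842, 16: 1.687, 32: 1.691}

for rank in eval_loss:
    millions = lora_params(rank, SHAPES, num_layers=32) / 1e6
    print(f"rank {rank}: {millions:.1f}M trainable params, eval loss {eval_loss[rank]}")

best_rank = min(eval_loss, key=eval_loss.get)
print(f"best rank: {best_rank}")
```

Under these assumptions the counts come out at 3.4M, 6.8M, and 13.6M, and the lowest eval loss selects rank 16.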

Tools used in this workflow

Tool	Purpose
`create_instance` / `instance_status`	Provision and wait for GPU
`ssh_execute`	Install packages, run experiments
`file_upload`	Transfer experiment scripts
`delete_instance`	Tear down immediately after results
`billing_usage`	Check cost before teardown
