“Deploy 4 RTX 4090 instances for a hyperparameter sweep.”
The agent provisions the fleet:

- `list_gpu_types`: confirms RTX 4090 availability and pricing
- `create_instance`: creates `sweep-1` with RTX 4090
- `create_instance`: creates `sweep-2` with RTX 4090
- `create_instance`: creates `sweep-3` with RTX 4090
- `create_instance`: creates `sweep-4` with RTX 4090
- `instance_status`: polls all four until running
Deployed 4 instances:

| Instance | GPU | Status | Cost |
|---|---|---|---|
| sweep-1 | RTX 4090 | running | $0.35/hr |
| sweep-2 | RTX 4090 | running | $0.35/hr |
| sweep-3 | RTX 4090 | running | $0.35/hr |
| sweep-4 | RTX 4090 | running | $0.35/hr |

Fleet burn rate: $1.40/hr. All 4 are ready for commands.
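The "poll until running" step can be sketched as a small loop. This is only an illustration: `get_status` is a hypothetical callable standing in for whatever `instance_status` returns, and the `"running"` string is taken from the status column above. The poll interval and timeout are assumed values, not documented defaults.

```python
import time
from typing import Callable, Dict, Iterable


def wait_until_running(
    names: Iterable[str],
    get_status: Callable[[str], str],  # hypothetical stand-in for instance_status
    poll_seconds: float = 5.0,         # assumed interval
    timeout_seconds: float = 600.0,    # assumed timeout
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll every pending instance until all report 'running' or time runs out."""
    pending = set(names)
    statuses: Dict[str, str] = {}
    deadline = time.monotonic() + timeout_seconds
    while pending:
        # Only re-check instances that are not running yet.
        for name in sorted(pending):
            statuses[name] = get_status(name)
        pending = {n for n in pending if statuses[n] != "running"}
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still not running: {sorted(pending)}")
        sleep(poll_seconds)
    return statuses
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting.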
“Install dependencies on all of them and start training with different learning rates.”
The agent runs commands across the fleet:

- `ssh_execute` on each instance: `pip install torch transformers datasets`
- `file_upload` on each instance: uploads `train.py`
- `ssh_execute` on each instance: starts training with different args:
  - sweep-1: `python train.py --lr 1e-4`
  - sweep-2: `python train.py --lr 5e-5`
  - sweep-3: `python train.py --lr 1e-5`
  - sweep-4: `python train.py --lr 5e-6`
All 4 instances are running training with different learning rates:

| Instance | Learning Rate | Status |
|---|---|---|
| sweep-1 | 1e-4 | training started |
| sweep-2 | 5e-5 | training started |
| sweep-3 | 1e-5 | training started |
| sweep-4 | 5e-6 | training started |

Check back in ~30 minutes for results, or ask me to monitor progress.
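The mapping from instance to learning rate can be generated instead of typed by hand. The sketch below is a hypothetical helper (`build_sweep_commands` is not part of any tool listed here). It only builds the command strings that would then be passed to `ssh_execute`. Learning rates are kept as strings so they appear in the command exactly as written.

```python
from typing import Dict, List


def build_sweep_commands(
    prefix: str,
    learning_rates: List[str],
    script: str = "train.py",
) -> Dict[str, str]:
    """Map instance names (prefix-1, prefix-2, ...) to one training command each."""
    return {
        f"{prefix}-{i}": f"python {script} --lr {lr}"
        for i, lr in enumerate(learning_rates, start=1)
    }


commands = build_sweep_commands("sweep", ["1e-4", "5e-5", "1e-5", "5e-6"])
for name, cmd in commands.items():
    print(f"{name}: {cmd}")
```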
“Check training status on all instances and tear down any that are done.”
The agent checks each instance and cleans up:

- `ssh_execute` on each: `tail -5 /root/train.log`
- `ssh_execute` on each: `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`
| Instance | GPU Util | Last Log Line | Status |
|---|---|---|---|
| sweep-1 | 0% | Training complete. Best loss: 0.312 | done |
| sweep-2 | 0% | Training complete. Best loss: 0.287 | done |
| sweep-3 | 94% | Epoch 4/5 Step 800/1000 Loss: 0.301 | running |
| sweep-4 | 91% | Epoch 3/5 Step 620/1000 Loss: 0.445 | running |

sweep-2 (lr=5e-5) has the best loss so far at 0.287. Terminating finished instances to save money.
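One way to turn the two checks above into a done/running decision is to combine the last log line with GPU utilization. The sketch below assumes the log format shown in this example (`Training complete. Best loss: …` and `… Loss: …`); a real `train.py` may log differently. The rule "complete message plus 0% utilization means done" is an illustrative heuristic, not the agent's documented logic.

```python
import re
from typing import Optional, Tuple

# Patterns match the example log lines only; adjust for a different train.py.
DONE_RE = re.compile(r"Training complete\. Best loss: (?P<loss>\d+\.\d+)")
PROGRESS_RE = re.compile(r"Loss: (?P<loss>\d+\.\d+)")


def classify(last_line: str, gpu_util_percent: int) -> Tuple[str, Optional[float]]:
    """Return (status, loss) from the last log line and GPU utilization."""
    done = DONE_RE.search(last_line)
    if done and gpu_util_percent == 0:
        return "done", float(done.group("loss"))
    progress = PROGRESS_RE.search(last_line)
    return "running", (float(progress.group("loss")) if progress else None)


observations = {
    "sweep-1": (0, "Training complete. Best loss: 0.312"),
    "sweep-2": (0, "Training complete. Best loss: 0.287"),
    "sweep-3": (94, "Epoch 4/5 Step 800/1000 Loss: 0.301"),
    "sweep-4": (91, "Epoch 3/5 Step 620/1000 Loss: 0.445"),
}

results = {name: classify(line, util) for name, (util, line) in observations.items()}
finished = sorted(name for name, (status, _) in results.items() if status == "done")
best = min(
    (name for name, (_, loss) in results.items() if loss is not None),
    key=lambda name: results[name][1],
)
print("finished:", finished)
print("best so far:", best)
```

Note that "best so far" compares final losses from finished runs with in-progress losses from running ones, so the ranking can still change once the remaining runs complete.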
- `file_download` on sweep-1 and sweep-2: downloads results
- `delete_instance`: terminates `sweep-1` and `sweep-2`
Downloaded results from sweep-1 and sweep-2. Terminated both. Fleet cost reduced to $0.70/hr (2 remaining).
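The burn-rate figures follow directly from the per-instance price. As a quick check, assuming every instance costs the $0.35/hr quoted in this example, the arithmetic can be reproduced with `Decimal` to avoid floating-point rounding in currency values:

```python
from decimal import Decimal

HOURLY_RATE = Decimal("0.35")  # per-instance price quoted in this example


def burn_rate(instance_names):
    """Hourly cost of a fleet in which every instance has the same price."""
    return HOURLY_RATE * len(instance_names)


fleet = ["sweep-1", "sweep-2", "sweep-3", "sweep-4"]
terminated = {"sweep-1", "sweep-2"}

before = burn_rate(fleet)
remaining = [name for name in fleet if name not in terminated]
after = burn_rate(remaining)
print(f"${before}/hr -> ${after}/hr ({len(remaining)} remaining)")
```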
Tools used in this workflow
| Tool | Purpose |
|---|---|
| `list_gpu_types` | Check availability before bulk deployment |
| `create_instance` / `instance_status` | Deploy fleet and wait for readiness |
| `file_upload` / `ssh_execute` | Distribute code and run commands |
| `file_download` | Retrieve results from completed runs |
| `delete_instance` | Tear down finished instances |