Cost optimization checklist
Chapter 3 · about 15 minutes to read
Stage 3 chapter 4 gave you the five levers. This chapter is the operating discipline that wraps around those levers — a checklist you can run weekly, quarterly, and any time the bill goes up unexpectedly.
Time: about 15 minutes to read; ~30 minutes to run end-to-end the first time.
Prerequisites: Stage 3 chapter 4 read; you can pull cost-advisor recommendations from /api/v1/cost-advisor/recommendations.
How cost flows on Zyra
For each Virtual Server (VS), total_cost = cost_per_hour × elapsed_hours_running, computed in exact Decimal math by backend/app/services/billing_calculator.py. The default cost_per_hour is 0.10. A stopped VS accrues nothing. The cost-advisor service (backend/app/services/cost_advisor/) runs the analysis you would otherwise do by hand and writes recommendations.
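The formula above can be illustrated with a minimal sketch. This is not the code in billing_calculator.py; the function name vs_total_cost is hypothetical, and the only things taken from this chapter are the formula, the 0.10 default, and the use of Decimal.

```python
from decimal import Decimal

# Default hourly rate stated in this chapter.
DEFAULT_COST_PER_HOUR = Decimal("0.10")


def vs_total_cost(elapsed_hours_running, cost_per_hour=DEFAULT_COST_PER_HOUR):
    """total_cost = cost_per_hour x elapsed_hours_running for one VS.

    Pass hours and rates as strings or Decimals; passing floats would
    carry binary rounding error into the Decimal values.
    """
    return Decimal(cost_per_hour) * Decimal(elapsed_hours_running)


# One VS running for a full week at the default rate.
week = vs_total_cost("168")

# A stopped VS has no running hours, so it accrues nothing.
stopped = vs_total_cost("0")
```

Keeping every operand a Decimal is what makes the result exact; a float anywhere in the chain would reintroduce the rounding drift that Decimal math is meant to avoid.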
The 20-item checklist
Items 1-10 weekly, 11-15 monthly, 16-20 quarterly.
Weekly
- Right-size every VS using the 7d metrics window. If CPU max < 40% or memory ceiling < 60%, lower the cap on next deploy.
- Terminate everything tagged "dev" or "test" that no one owns. The cost-advisor's suggest_scheduled_shutdowns flags exactly these.
- Stop overnight what does not need to run overnight. Schedule a stop call.
- Batch small jobs onto one VS instead of one-VS-per-job. Ten 30-second jobs on one VS beats ten boots.
- Pull GET /api/v1/cost-advisor/recommendations. Returns idle, oversized, and scheduled-shutdown recs. Apply or dismiss each.
- Pull GET /api/v1/cost-advisor/summary. Dollar-value savings opportunity in one number.
- Check Monitoring → Idle hours. Hours a VS ran with < 5% CPU. Pure waste.
- Audit fan-out coordinators. If a parent job died mid-run, its child VSs may still be billing.
- Confirm checkpoint logic on long jobs. A 4-hour job that crashed at 3h59m and restarted from zero just doubled its cost.
- Scan the per-environment cost split (Stage 4 chapter 6). If dev > prod, something is wrong.
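Items 1 and 7 of the weekly list reduce to threshold checks, which can be sketched as below. The WeekMetrics record and its field names are hypothetical, not a Zyra API shape, and "memory ceiling" is read here as peak memory use over the 7d window, which is an assumption. Only the 40%, 60%, and idle-hours rules come from the checklist.

```python
from dataclasses import dataclass


@dataclass
class WeekMetrics:
    """Hypothetical 7-day summary for one VS (field names are illustrative)."""
    name: str
    cpu_max_pct: float   # highest CPU utilisation seen in the window
    mem_max_pct: float   # highest memory utilisation seen in the window
    idle_hours: float    # hours the VS ran with < 5% CPU


def weekly_flags(m: WeekMetrics) -> list[str]:
    """Return the weekly checklist items this VS trips."""
    flags = []
    # Item 1: CPU max < 40% or memory ceiling < 60% means the cap is too high.
    if m.cpu_max_pct < 40 or m.mem_max_pct < 60:
        flags.append("right-size")
    # Item 7: any hour spent under 5% CPU is billed but unused.
    if m.idle_hours > 0:
        flags.append("idle")
    return flags


fleet = [
    WeekMetrics("api-prod", cpu_max_pct=85, mem_max_pct=72, idle_hours=0),
    WeekMetrics("batch-dev", cpu_max_pct=31, mem_max_pct=48, idle_hours=96),
]
report = {m.name: weekly_flags(m) for m in fleet}
```

The cost-advisor recommendations cover the same ground; a local check like this is mainly useful for spot-verifying a recommendation before applying it.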
Monthly
- Right-size disk. If disk_usage_gb is flat at 15 on a 100 GB cap, trim to 30 on next deploy.
- Move GPU pre/post-processing off the GPU VS. GPU minutes are scarce; spend them on the model only.
- Compare against the cloud baseline. Monthly $ ÷ (VS-hours running) should sit at ~$0.08-0.10.
- Audit storage volumes. Persistent volumes, snapshots, image storage > 30 days have their own line items.
[VERIFY: storage cost surface in UI vs invoice-line-only at MVP1 GA]
- Review the cost-advisor backlog. Recommendations stay PENDING until applied or dismissed. A growing pending list is unprocessed savings.
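The baseline comparison in item 13 is a single division. A minimal sketch, assuming you already have the month's total spend and the total VS-hours running from your own export; the function names are hypothetical and the $0.08-0.10 band is the one quoted in the checklist.

```python
from decimal import Decimal

# Baseline band quoted in item 13.
BASELINE_LOW = Decimal("0.08")
BASELINE_HIGH = Decimal("0.10")


def cost_per_vs_hour(monthly_spend, vs_hours_running):
    """Monthly $ divided by VS-hours running."""
    hours = Decimal(vs_hours_running)
    if hours <= 0:
        raise ValueError("vs_hours_running must be positive")
    return Decimal(monthly_spend) / hours


def within_baseline(rate):
    """True when the rate sits inside the $0.08-0.10 band."""
    return BASELINE_LOW <= rate <= BASELINE_HIGH


# Ten VSs running all month (10 x 720 h) on a $648.00 bill.
rate = cost_per_vs_hour("648.00", "7200")
ok = within_baseline(rate)
```

A rate above the band usually points back at the weekly items (oversized caps, idle hours) rather than at pricing.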
Quarterly
- Set or refresh spending alerts. [VERIFY: budget / spending alert endpoint; alerting infrastructure exists per /api/v1/sla and /api/v1/monitoring but a dedicated cost-threshold alert is on the roadmap, not GA today]. Until then, run a daily script that calls /cost-advisor/summary and emails when projected monthly cost crosses your threshold.
- Tag for cost allocation. Attribute costs to team / project / environment via VS name and audit log export (Stage 4 chapter 4).
- Run a unit-economics review. $ per VS-hour, $ per job (if tagged), $ per user / customer.
- Decommission stale enrolled devices. A device the placement engine never picks is still tying up enrolment records.
- Reconcile the invoice PDF against the dashboard. The PDF is authoritative. Divergence beyond rounding = file a ticket.
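The decision logic of the daily script in item 16 can be sketched as follows. Two things are assumptions: the projection is a simple linear run-rate (month-to-date spend scaled to the full month), and month-to-date spend is passed in directly, because this chapter does not document the response schema of the summary endpoint. Fetching the number and sending the email are left out for the same reason.

```python
import calendar
from datetime import date
from decimal import Decimal


def projected_monthly_cost(month_to_date, today):
    """Linear run-rate projection: average daily spend x days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return Decimal(month_to_date) / today.day * days_in_month


def should_alert(month_to_date, threshold, today):
    """True when the projection crosses the monthly threshold."""
    return projected_monthly_cost(month_to_date, today) > Decimal(threshold)


# $150 spent by 10 June projects to $450 for the 30-day month.
today = date(2026, 6, 10)
projection = projected_monthly_cost("150", today)
alert = should_alert("150", "400", today)
```

A linear projection overreacts early in the month, when one busy day dominates the average; starting the check after the first week is one way to reduce false alarms.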
Quick wins — the 5 items that usually move the bill the most
- Right-size every VS over a week of metrics (item 1).
- Terminate orphan dev VSs (item 2).
- Pull and apply cost-advisor recommendations (item 5).
- Audit Idle hours and stop the offenders (item 7).
- Compare against the $0.08-0.10/VS-hour baseline (item 13).
The cost-advisor service automates items 1, 2, and 7 — read its recommendations first, do the rest by hand.
What "good" looks like
- summary.total_potential_savings < 5% of monthly spend.
- Idle hours fleet-wide < 10% of running hours.
- Mean $ per VS-hour within 15% of the published baseline.
- No VS has been running > 30 days at its deploy-time size.
- Every environment has a known monthly target; actual within ±15% of it.
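The first three targets above are ratio checks and can be evaluated together. A minimal sketch: the function name, parameter names, and result keys are hypothetical, and the inputs are assumed to come from your own exports. The 30-day and per-environment targets are omitted because they need per-VS and per-environment data.

```python
from decimal import Decimal


def health_report(monthly_spend, total_potential_savings,
                  idle_hours, running_hours, mean_rate, baseline_rate):
    """Evaluate the savings, idle-hours, and $/VS-hour targets."""
    spend = Decimal(monthly_spend)
    savings = Decimal(total_potential_savings)
    idle = Decimal(idle_hours)
    running = Decimal(running_hours)
    rate = Decimal(mean_rate)
    baseline = Decimal(baseline_rate)
    return {
        # Potential savings below 5% of monthly spend.
        "savings_under_5pct": savings < spend * Decimal("0.05"),
        # Fleet-wide idle hours below 10% of running hours.
        "idle_under_10pct": idle < running * Decimal("0.10"),
        # Mean $ per VS-hour within 15% of the published baseline.
        "rate_within_15pct": abs(rate - baseline) <= baseline * Decimal("0.15"),
    }


healthy = health_report("1000", "40", "50", "720", "0.11", "0.10")
unhealthy = health_report("1000", "60", "100", "720", "0.12", "0.10")
```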
What just happened
You have a calendar-driven discipline, integrated with the cost-advisor service, that catches the things humans forget. The next chapter covers the last piece — how you give back, share patterns, and report issues.
Troubleshooting
- Bill went up but no obvious cause. Run the full checklist top-to-bottom; the answer is almost always items 2, 7, or 8.
- Cost-advisor returns no recommendations. Either you have nothing to optimize, or the analyzer hasn't run recently; POST /api/v1/cost-advisor/analyze triggers it.
- Invoice ≠ dashboard. Decimal precision is to 6 places. A difference smaller than that is rounding; anything larger is a bug, so file a ticket.
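The invoice-versus-dashboard rule can be sketched as a tolerance check at the sixth decimal place. The function name is hypothetical, and treating a difference of exactly 0.000001 as beyond rounding is an assumption, since the rule only says what happens below and above that precision.

```python
from decimal import Decimal

# Smallest difference visible at 6 decimal places.
SIX_PLACES = Decimal("0.000001")


def differs_beyond_rounding(invoice_total, dashboard_total):
    """True when the two totals disagree at or above the 6th decimal place."""
    diff = abs(Decimal(invoice_total) - Decimal(dashboard_total))
    return diff >= SIX_PLACES


# Disagreement only in the 7th place: rounding, no ticket needed.
rounding_only = differs_beyond_rounding("12.3456781", "12.3456784")

# Disagreement in the 5th place: worth a ticket.
real_gap = differs_beyond_rounding("12.345678", "12.345688")
```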
Last reviewed: 2026-05-21