Documentation › User Guides › Organizations › Stage 5 › Cost optimization checklist

Cost optimization checklist

Chapter 3 · about 15 minutes to read

Stage 3 chapter 4 gave you the five levers. This chapter is the operating discipline that wraps around those levers — a checklist you can run weekly, quarterly, and any time the bill goes up unexpectedly.

Time: about 15 minutes to read; ~30 minutes to run end-to-end the first time. Prerequisites: Stage 3 chapter 4 read; you can pull cost-advisor recommendations from /api/v1/cost-advisor/recommendations.

How cost flows on Zyra

total_cost = cost_per_hour × elapsed_hours_running per Virtual Server, computed in exact Decimal math by backend/app/services/billing_calculator.py. Default cost_per_hour is 0.10. A stopped VS accrues nothing. The cost-advisor service (backend/app/services/cost_advisor/) runs the analysis you would otherwise do by hand and writes recommendations.

The 20-item checklist

Items 1-10 weekly, 11-15 monthly, 16-20 quarterly.

Weekly

Right-size every VS using the 7d metrics window. If CPU max < 40% or memory ceiling < 60%, drop the cap on next deploy.
Terminate everything tagged "dev" or "test" that no one owns. The cost-advisor's suggest_scheduled_shutdowns flags exactly these.
Stop overnight what does not need to run overnight. Schedule a stop call.
Batch small jobs onto one VS instead of one-VS-per-job. Ten 30-second jobs on one VS beats ten boots.
Pull GET /api/v1/cost-advisor/recommendations. Returns idle, oversized, and scheduled-shutdown recs. Apply or dismiss each.
Pull GET /api/v1/cost-advisor/summary. Dollar-value savings opportunity in one number.
Check Monitoring → Idle hours. Hours a VS ran with < 5% CPU. Pure waste.
Audit fan-out coordinators. If a parent job died mid-run, its child VSs may still be billing.
Confirm checkpoint logic on long jobs. A 4-hour job that crashed at 3h59m and restarted from zero just doubled its cost.
Scan the per-environment cost split (Stage 4 chapter 6). If dev > prod, something is wrong.

Monthly

Right-size disk. If disk_usage_gb is flat at 15 on a 100 GB cap, trim to 30 on next deploy.
Move GPU pre/post-processing off the GPU VS. GPU minutes are scarce; spend them on the model only.
Compare against the cloud baseline. Monthly $ ÷ (VS-hours running) should sit at ~$0.08-0.10.
Audit storage volumes. Persistent volumes, snapshots, image storage > 30 days have their own line items. [VERIFY: storage cost surface in UI vs invoice-line-only at MVP1 GA]
Review the cost-advisor backlog. Recommendations stay PENDING until applied or dismissed. A growing pending list is unprocessed savings.

Quarterly

Set or refresh spending alerts. [VERIFY: budget / spending alert endpoint — alerting infrastructure exists per /api/v1/sla and /api/v1/monitoring but a dedicated cost-threshold alert is on the roadmap, not GA today]. Until then, run a daily script that calls /cost-advisor/summary and emails when projected monthly cost crosses your threshold.
Tag for cost allocation. Attribute costs to team / project / environment via VS name and audit log export (Stage 4 chapter 4).
Run a unit-economics review. $ per VS-hour, $ per job (if tagged), $ per user / customer.
Decommission stale enrolled devices. A device the placement engine never picks is still tying up enrolment records.
Reconcile the invoice PDF against the dashboard. The PDF is authoritative. Divergence beyond rounding = file a ticket.

Quick wins — the 5 items that usually move the bill the most

Right-size every VS over a week of metrics (item 1).
Terminate orphan dev VSs (item 2).
Pull and apply cost-advisor recommendations (item 5).
Audit Idle hours and stop the offenders (item 7).
Compare against the $0.08-0.10/device-hour baseline (item 13).

The cost-advisor service automates items 1, 2, and 7 — read its recommendations first, do the rest by hand.

What "good" looks like

summary.total_potential_savings < 5% of monthly spend.
Idle hours fleet-wide < 10% of running hours.
Mean $ per VS-hour within 15% of the published baseline.
No VS has been running > 30 days at its deploy-time size.
Every environment has a known monthly target; actual within ±15% of it.

What just happened

You have a calendar-driven discipline, integrated with the cost-advisor service, that catches the things humans forget. The next chapter covers the last piece — how you give back, share patterns, and report issues.

Troubleshooting

Bill went up but no obvious cause. Run the full checklist top-to-bottom; the answer is almost always items 2, 7, or 8.
Cost-advisor returns no recommendations. Either you have nothing to optimize, or the analyzer hasn't run recently — POST /api/v1/cost-advisor/analyze triggers it.
Invoice ≠ dashboard. Decimal precision is to 6 places. Below that is rounding, above is a bug — file a ticket.

Last reviewed: 2026-05-21