Monitor your Virtual Servers
Chapter 1 · about 10 minutes to read
Stage 2 ended with one Virtual Server running and one job completed. Stage 3 starts with the question every operator asks next: "is it actually healthy, and how do I know?" This chapter is the tour of every metric, log surface, and status signal Zyra exposes for a running VS.
Time: about 10 minutes to read, plus a few minutes clicking around your own dashboard. Prerequisites: at least one Virtual Server in running state.
The three places monitoring data lives
Zyra stores monitoring data in three layered surfaces. Open the dashboard, click any VS row, and you'll see all three:
- Live snapshot: the four columns on the VS detail header: cpu_usage_percent, memory_usage_mb, disk_usage_gb, uptime_seconds. Refreshed by the agent every 10 seconds and written straight onto the virtual_servers row.
- Time-series history: the virtual_server_metrics table. The agent appends a row every minute with CPU, memory, network rx/tx bytes, and block read/write bytes. Retained for 90 days per the database retention policy.
- Logs: the virtual_server_logs table. Stdout, stderr, and lifecycle events from the container. Retained 30 days for info, 90 days for errors.
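To get a feel for how much history those intervals produce, here is a rough back-of-the-envelope sketch. The 10-second, 1-minute, and 90-day figures come from the list above; the constant and function names are hypothetical, and the row count is an upper bound that assumes the agent never misses a report.

```python
# Sketch only: intervals and retention windows are taken from the text above;
# the constant and function names are hypothetical.
SECONDS_PER_DAY = 24 * 60 * 60

SNAPSHOT_INTERVAL_S = 10      # live snapshot refresh on the virtual_servers row
METRICS_INTERVAL_S = 60       # one virtual_server_metrics row per minute
METRICS_RETENTION_DAYS = 90   # retention window for the metrics table


def retained_metric_rows(interval_s: int = METRICS_INTERVAL_S,
                         retention_days: int = METRICS_RETENTION_DAYS) -> int:
    """Upper bound on rows kept per VS, assuming no gaps in reporting."""
    return retention_days * SECONDS_PER_DAY // interval_s


def snapshot_refreshes_per_metric_row() -> int:
    """How many live-snapshot refreshes happen between two history rows."""
    return METRICS_INTERVAL_S // SNAPSHOT_INTERVAL_S


print(retained_metric_rows())               # 129600
print(snapshot_refreshes_per_metric_row())  # 6
```

In other words, the live snapshot is about six times fresher than the newest history row, which is why the header can move before the graphs do.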
The "Overview" tab — what to look at first
Open Virtual Servers, click your VS, land on Overview. Five panels:
- Status pill. One of creating / starting / running / stopping / stopped / restarting / terminating / terminated / error, per VirtualServerStatus. Green = running. Yellow = a transition state. Red = error.
- Uptime. Wall-clock seconds since started_at, formatted as "3d 4h 12m".
- CPU graph. Last 60 minutes of cpu_usage_percent from virtual_server_metrics.
- Memory graph. Last 60 minutes of memory_usage_mb plotted against the configured memory_mb cap from the VS spec.
- Network throughput. Last 60 minutes of bytes-per-second derived from network_rx_bytes / network_tx_bytes deltas.
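Two of these panels are derived values rather than raw columns, so a small sketch may help when you reproduce them outside the dashboard. This is not Zyra's implementation. It assumes the byte counters are cumulative, that a negative delta means the counter reset (for example after a restart), and that zero-valued units are still printed in the uptime string. Both function names are hypothetical.

```python
from typing import List, Tuple


def format_uptime(uptime_seconds: int) -> str:
    """Render seconds as days, hours, and minutes, e.g. 274320 -> '3d 4h 12m'."""
    minutes_total = uptime_seconds // 60
    days, rem_minutes = divmod(minutes_total, 24 * 60)
    hours, minutes = divmod(rem_minutes, 60)
    return f"{days}d {hours}h {minutes}m"


def bytes_per_second(samples: List[Tuple[float, int]]) -> List[float]:
    """Turn (timestamp_seconds, cumulative_bytes) samples into rates.

    Assumes cumulative counters. Intervals with a negative delta (treated
    as a counter reset) or a non-positive elapsed time are skipped.
    """
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        delta = b1 - b0
        if elapsed <= 0 or delta < 0:
            continue
        rates.append(delta / elapsed)
    return rates


print(format_uptime(274320))                                    # 3d 4h 12m
print(bytes_per_second([(0, 1000), (60, 7000), (120, 13000)]))  # [100.0, 100.0]
```

Skipping reset intervals rather than plotting them avoids a misleading negative spike on the throughput graph right after a restart.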
The "Metrics" tab — historical trends
Click Metrics. Same data, longer windows: 1h / 6h / 24h / 7d / 30d. Useful for spotting:
- Steady-state drift. CPU was 20% last week, 60% this week, same workload — something is leaking.
- Periodic spikes. Memory climbs every hour and resets — a misbehaving cron or scheduled job inside the container.
- Network surprises. Outbound bytes climbing for no apparent reason: possible data exfiltration or a runaway client.
Each metric chart has a "compare to threshold" toggle that overlays any alert rule you've configured for that metric (see Chapter 2: SLA alerts).
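The steady-state drift check in the list above can be approximated by comparing the mean of two windows. The sketch below is one simple way to do that by hand on exported samples. The function names and the 1.5x threshold are assumptions chosen for illustration, not values Zyra uses.

```python
from statistics import mean
from typing import Sequence


def drift_ratio(previous: Sequence[float], current: Sequence[float]) -> float:
    """Ratio of the current window's mean to the previous window's mean."""
    baseline = mean(previous)
    if baseline == 0:
        raise ValueError("previous window has a zero mean; ratio is undefined")
    return mean(current) / baseline


def has_drifted(previous: Sequence[float], current: Sequence[float],
                max_ratio: float = 1.5) -> bool:
    """True when the current mean is at least max_ratio times the baseline."""
    return drift_ratio(previous, current) >= max_ratio


# cpu_usage_percent samples for the same workload, one week apart
last_week = [20.0, 21.0, 19.0, 20.0]
this_week = [60.0, 61.0, 59.0, 60.0]

print(drift_ratio(last_week, this_week))  # 3.0
print(has_drifted(last_week, this_week))  # True
```

A mean comparison catches slow leaks but hides periodic spikes, so it complements the charts rather than replacing them.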
The "Logs" tab — what your container is saying
Click Logs. The tab streams the virtual_server_logs table for this VS, newest first. Filters:
- Level: info / warning / error.
- Source: container_stdout / container_stderr / lifecycle / agent.
- Time window: last hour, last day, last week, or a custom range.
Click any line to expand its full payload (the details JSONB column). Lifecycle events — pulling_image, started, restarted, oom_killed — surface here first, often before they reach the status pill.
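If you export log lines and want to apply the same filters offline, the logic can be sketched as below. The LogLine shape and filter_logs function are hypothetical and do not mirror the real table schema. The sketch also assumes the level filter is an exact match rather than a minimum severity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class LogLine:
    """Hypothetical in-memory shape for one exported log line."""
    timestamp: datetime
    level: str    # info / warning / error
    source: str   # container_stdout / container_stderr / lifecycle / agent
    message: str


def filter_logs(lines: Iterable[LogLine],
                now: datetime,
                level: Optional[str] = None,
                source: Optional[str] = None,
                window: Optional[timedelta] = None) -> List[LogLine]:
    """Apply the level, source, and time-window filters, newest first."""
    selected = [
        line for line in lines
        if (level is None or line.level == level)
        and (source is None or line.source == source)
        and (window is None or now - window <= line.timestamp <= now)
    ]
    return sorted(selected, key=lambda line: line.timestamp, reverse=True)


now = datetime(2026, 5, 21, 12, 0, 0)
lines = [
    LogLine(now - timedelta(minutes=5), "info", "lifecycle", "started"),
    LogLine(now - timedelta(minutes=2), "error", "lifecycle", "oom_killed"),
    LogLine(now - timedelta(hours=3), "info", "container_stdout", "ready"),
]

recent = filter_logs(lines, now, source="lifecycle", window=timedelta(hours=1))
print([line.message for line in recent])  # ['oom_killed', 'started']
```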
The fleet-wide view
Per-VS detail pages are great for one server. When you have a dozen, open Monitoring in the sidebar (the org-wide alerting page wired to /api/v1/monitoring/alerts). It shows:
- A heatmap of every VS in your org by CPU and memory load.
- A list of triggered alerts you haven't acknowledged.
- A list of currently breaching SLAs (see Chapter 2).
Where Zyra itself monitors your fleet
Beyond what you see in the UI, Zyra runs Stage 11 continuous monitoring in the background:
- An external health probe hits every public URL hourly.
- The Observability Engineer publishes a daily health report at roughly 09:00 UTC summarising your VSs alongside platform health.
- An anomaly detector watches for sudden CPU / memory cliffs and writes them to the alert history.
You don't have to configure this — it runs for every organisation by default. [VERIFY: confirm anomaly detector is exposed in customer-facing UI vs internal-only at launch]
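To make "sudden CPU / memory cliffs" concrete, here is a deliberately naive sketch that flags any sample that jumps or drops sharply relative to the one before it. This is not how Zyra's anomaly detector works, which this chapter does not describe. The function name and the 40-point threshold are assumptions for illustration only.

```python
from typing import List, Sequence


def find_cliffs(samples: Sequence[float], max_jump: float) -> List[int]:
    """Return indices whose value differs from the previous sample by >= max_jump."""
    return [
        i for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) >= max_jump
    ]


# Hypothetical per-minute cpu_usage_percent samples
cpu = [22.0, 24.0, 23.0, 91.0, 90.0, 12.0]
print(find_cliffs(cpu, max_jump=40.0))  # [3, 5]
```

Index 3 is the sudden climb and index 5 is the sudden drop; both directions count as a cliff in this sketch.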
What just happened
You now know the four places Zyra exposes VS health: the live snapshot on the row, the time-series in the Metrics tab, the log stream in the Logs tab, and the fleet-wide rollup on the Monitoring page. The next chapter turns these signals into alerts that wake someone up.
Troubleshooting
- Charts say "no data". The agent hasn't reported metrics yet. Wait 90 seconds after running status; if still empty, check Logs → Source: agent.
- Uptime resets to 0. The container restarted. Look at Logs → Source: lifecycle for an oom_killed or exit_code != 0 event.
- Status flips to error. Hover the pill for the error_message field. Common values: image-pull failure, port already in use, capability cap exceeded.
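The "uptime resets to 0" check above amounts to scanning lifecycle events for the first suspicious one. A minimal sketch follows, assuming each event has been exported as a dictionary with an "event" name and an optional integer "exit_code". That shape is hypothetical; the real details payload may differ.

```python
from typing import Iterable, Mapping, Optional


def restart_cause(events: Iterable[Mapping[str, object]]) -> Optional[str]:
    """Return a short explanation for the first suspicious lifecycle event.

    Events are scanned in the order given (newest first, as in the Logs tab).
    Returns None when nothing suspicious is found.
    """
    for event in events:
        if event.get("event") == "oom_killed":
            return "container was OOM-killed"
        exit_code = event.get("exit_code")
        if isinstance(exit_code, int) and exit_code != 0:
            return f"container exited with code {exit_code}"
    return None


# Hypothetical lifecycle events, newest first
events = [
    {"event": "started"},
    {"event": "exited", "exit_code": 1},
    {"event": "pulling_image"},
]
print(restart_cause(events))  # container exited with code 1
```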
Last reviewed: 2026-05-21