Set up SLA alerts
Chapter 2 · about 12 minutes to read
You can see your Virtual Servers' metrics (Chapter 1). Now you want Zyra to wake someone up when a metric crosses a line that matters. That's two systems: alert rules (per-VS thresholds) and SLA agreements (organisation-wide service-level guarantees). This chapter covers both and shows where each surfaces in the UI.
Time: about 12 minutes. Prerequisites: at least one Virtual Server in running state and one notification channel configured.
Alert rules vs SLA breaches
- Alert rule. "When CPU on this VS averages over 85% for 5 minutes, fire an alert." Per-VS, per-metric, you author them. Stored in alert_rules, evaluated by backend/app/services/monitoring/alert_evaluator.py.
- SLA agreement. "VSs covered by this agreement must stay at 99.5% uptime over the calendar month." Organisation-wide, attached to a template, evaluated every 5 minutes by backend/app/services/sla/monitor.py. When a VS enters error or stopped while supposed to be running, an sla_breach row is recorded.
Use alert rules for tactical "investigate now" signals. Use SLA agreements for contractual or internal-policy guarantees that need a paper trail.
Step 1 — Configure a notification channel
Open Settings → Notification Channels → New channel. Four channel types are available, matching ChannelType:
- Email. Distribution list, e.g. oncall@yourco.example.
- Slack. Incoming webhook URL from your Slack workspace.
- Webhook. Generic HTTPS POST to any URL — wire it into PagerDuty, OpsGenie, your own bot, anything that speaks JSON.
- PagerDuty. Native integration via PagerDuty Events API v2 — provide a service routing key.
[VERIFY: Slack and PagerDuty front-end forms are wired in production at launch — backend channel types exist; UI completeness pending]
The SMS channel is on the roadmap but not shipping in MVP1. Use webhook → SMS gateway in the meantime.
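Before saving a channel, it can help to sanity-check the target you are about to paste in. The sketch below is only an illustration: the lowercase type strings, the validate_channel function, and its target parameter are hypothetical names, not Zyra's ChannelType definition or API. It checks only what this step states: email takes an address, Slack and generic webhooks take an HTTPS URL, and PagerDuty takes a routing key.

```python
from urllib.parse import urlparse

# Assumed spellings of the four channel types; the real ChannelType may differ.
CHANNEL_TYPES = ("email", "slack", "webhook", "pagerduty")


def validate_channel(channel_type: str, target: str) -> list[str]:
    """Return a list of problems; an empty list means the target looks plausible."""
    if channel_type not in CHANNEL_TYPES:
        return [f"unknown channel type: {channel_type}"]
    problems = []
    if not target.strip():
        # Covers every type, including a missing PagerDuty routing key.
        problems.append("target is empty")
    elif channel_type == "email":
        if target.count("@") != 1 or target.startswith("@") or target.endswith("@"):
            problems.append("email target should look like list@domain")
    elif channel_type in ("slack", "webhook"):
        parsed = urlparse(target)
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append("URL must be an absolute https:// URL")
    return problems


print(validate_channel("email", "oncall@yourco.example"))      # []
print(validate_channel("webhook", "http://example.com/hook"))  # one problem: not https
print(validate_channel("sms", "+15550100"))                    # unknown channel type
```

A check like this only catches typos. It does not prove the channel delivers, so send a test notification as well.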
Step 2 — Author an alert rule
Open Monitoring → Rules → New rule. Fields:
- Name. Plain English, e.g. prod-api CPU > 85%.
- Resource. A specific VS or "all VSs in org".
- Metric. One of cpu, memory, disk, latency.
- Threshold. Number + comparator (>, >=, <, <=).
- Evaluation window. 60 / 300 / 600 / 900 seconds. The rule fires only if the average over that window crosses the threshold, which protects against single-spike noise.
- Severity. info / warning / critical.
- Channels. One or more channels from Step 1.
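To see how the threshold, comparator, and evaluation window interact, here is a minimal sketch of window-averaged evaluation. It is not the code in alert_evaluator.py: the AlertRule dataclass, the should_fire function, and the (timestamp, value) sample format are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from statistics import fmean

# Comparators and windows offered by the rule form.
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
ALLOWED_WINDOWS = (60, 300, 600, 900)


@dataclass(frozen=True)
class AlertRule:  # hypothetical shape, not the real alert_rules schema
    name: str
    metric: str
    comparator: str
    threshold: float
    window_seconds: int


def should_fire(rule: AlertRule, samples: list[tuple[float, float]], now: float) -> bool:
    """samples are (unix_timestamp, value) pairs; only those inside the window count."""
    if rule.window_seconds not in ALLOWED_WINDOWS:
        raise ValueError(f"unsupported window: {rule.window_seconds}")
    in_window = [v for t, v in samples if now - rule.window_seconds < t <= now]
    if not in_window:
        return False  # assumption: no data means no alert
    return COMPARATORS[rule.comparator](fmean(in_window), rule.threshold)


rule = AlertRule("prod-api CPU > 85%", "cpu", ">", 85.0, 300)
# One sample every 30 s for 5 minutes: a single 95% spike among 40% readings.
spike = [(t, 95.0 if t == 150 else 40.0) for t in range(30, 301, 30)]
sustained = [(t, 90.0) for t in range(30, 301, 30)]
print(should_fire(rule, spike, now=300))      # False: the window average is 45.5
print(should_fire(rule, sustained, now=300))  # True: the window average is 90.0
```

The spike case shows why a longer window suppresses noise: one 95% reading among nine 40% readings averages 45.5%, well under the 85% threshold.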
Step 3 — Read alert history
Open Monitoring → Alerts. Lists every alert that has fired, newest first. Each row: Triggered at, Rule, Resource (click-through), Status (triggered / acknowledged / resolved), and an Acknowledge button that claims the alert under your user.
Resolution is automatic: when the underlying metric returns under threshold for the same window, status flips to resolved.
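The three statuses form a small state machine. The sketch below models it in memory, using a hypothetical Alert class and method names. It assumes that an acknowledged alert also resolves automatically and that a resolved alert cannot be re-claimed, neither of which this chapter states outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:  # hypothetical in-memory model, not the platform's alert table
    rule_name: str
    status: str = "triggered"
    acknowledged_by: Optional[str] = None

    def acknowledge(self, user: str) -> None:
        """Claim the alert under a user; only a triggered alert can be claimed."""
        if self.status == "triggered":
            self.status = "acknowledged"
            self.acknowledged_by = user

    def on_evaluation(self, still_breaching: bool) -> None:
        """Run after each window evaluation; resolve once the metric is back under threshold."""
        if not still_breaching and self.status != "resolved":
            self.status = "resolved"


alert = Alert("prod-api CPU > 85%")
alert.acknowledge("dana")
alert.on_evaluation(still_breaching=True)
print(alert.status)  # acknowledged
alert.on_evaluation(still_breaching=False)
print(alert.status)  # resolved
```

Acknowledging claims the alert but does not silence evaluation. Only the metric recovering moves the alert to resolved.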
Step 4 — Attach an SLA agreement
Open Settings → SLA Agreements → New agreement (org admins only). Pick a template (Standard 99.5%, Premium 99.9%, etc., from sla_templates) and a covered set of VSs or device groups. Save.
The SLA monitor evaluates every 5 minutes (INTERVAL_SECONDS = 300): lists active agreements, finds VSs in error or stopped when they shouldn't be, dedupes against existing breaches by the canonical key vs:<uuid>, and records a new sla_breach row with type UPTIME when needed.
The dedup key matters: a recent fix made sure affected_service formatting is consistent across the monitor and the storage layer, so a single downed VS produces one breach row, not one per cycle.
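The dedup step is easier to reason about as code. This is a simplified sketch of one monitor cycle, not the contents of monitor.py: the VirtualServer dataclass, the run_cycle function, and the in-memory dictionary of open breaches are assumptions, and the 5-minute scheduling loop is left out.

```python
from dataclasses import dataclass
from uuid import UUID

INTERVAL_SECONDS = 300  # from the text; the scheduler that uses it is omitted
BAD_STATES = {"error", "stopped"}


@dataclass(frozen=True)
class VirtualServer:  # hypothetical shape
    id: UUID
    state: str
    should_be_running: bool


def affected_service_key(vs_id: UUID) -> str:
    """Build the canonical dedup key; str(UUID) is lowercase and hyphenated."""
    return f"vs:{vs_id}"


def run_cycle(covered: list[VirtualServer], open_breaches: dict[str, dict]) -> list[dict]:
    """One monitor pass. open_breaches maps affected_service to an unresolved breach row."""
    created = []
    for vs in covered:
        if not (vs.should_be_running and vs.state in BAD_STATES):
            continue
        key = affected_service_key(vs.id)
        if key in open_breaches:
            continue  # already recorded in an earlier cycle
        row = {"type": "UPTIME", "affected_service": key,
               "description": f"VS in {vs.state} state"}
        open_breaches[key] = row
        created.append(row)
    return created


vs = VirtualServer(UUID("12345678-1234-5678-1234-567812345678"), "error", True)
breaches: dict[str, dict] = {}
print(len(run_cycle([vs], breaches)))  # 1: first cycle records the breach
print(len(run_cycle([vs], breaches)))  # 0: second cycle dedupes on the key
```

If the monitor and the storage layer built the key differently, the lookup would miss and every cycle would add a row, which is the failure the fix addressed.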
Step 5 — Read SLA breaches
Open SLA → Breaches. Same shape as alert history but scoped to contractual breaches:
- Type: uptime, response_time, error_rate. MVP1 fires uptime only.
- Affected service: the vs:<uuid> key.
- Started at: when the VS first entered the bad state.
- Resolved at: when it returned to running, or null if still down.
- Description: e.g. VS prod-api in error state.
The monthly SLA report (SLA → Report) rolls these into a credit calculation against the agreement template's targets.
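To check a report by hand, you can turn breach rows back into an uptime percentage. The sketch below covers only that arithmetic; the credit formula is not described in this chapter, so it is not attempted. The uptime_percent function and the (started_at, resolved_at) tuple format are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def uptime_percent(
    breaches: list[tuple[datetime, Optional[datetime]]],
    month_start: datetime,
    month_end: datetime,
) -> float:
    """Uptime over [month_start, month_end) from (started_at, resolved_at) pairs.

    Assumes breaches for one VS do not overlap. An unresolved breach
    (resolved_at is None) is counted as down until month_end.
    """
    total = (month_end - month_start).total_seconds()
    down = 0.0
    for started_at, resolved_at in breaches:
        start = max(started_at, month_start)
        end = min(resolved_at or month_end, month_end)
        down += max((end - start).total_seconds(), 0.0)
    return 100.0 * (total - down) / total


june_start = datetime(2026, 6, 1, tzinfo=timezone.utc)
july_start = datetime(2026, 7, 1, tzinfo=timezone.utc)
outage_start = datetime(2026, 6, 10, 8, 0, tzinfo=timezone.utc)
breaches = [(outage_start, outage_start + timedelta(hours=4))]

pct = uptime_percent(breaches, june_start, july_start)
print(f"{pct:.3f}% uptime, target met: {pct >= 99.5}")
# prints "99.444% uptime, target met: False"
```

A 30-day month has 720 hours, so a 99.5% target allows 3.6 hours of downtime. A single 4-hour outage is enough to miss it.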
What just happened
You wired one channel, one alert rule, and one SLA agreement. The platform will now ping your channel when a tactical threshold breaks, and record a permanent breach row when a covered VS goes down.
Troubleshooting
- Alert didn't fire when CPU hit 95%. Check the evaluation window — a 5-minute window won't fire on a 30-second spike. Drop to 60 seconds for spike-sensitive workloads.
- Slack channel doesn't receive messages. Test the webhook URL with curl and an empty JSON {} body; Slack returns invalid_payload if the URL is live, no_service if not.
- One downed VS produces multiple breach rows. Should be impossible after the dedup fix. File a ticket with the affected_service values; the dedup key is meant to be vs:<uuid> exactly.
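If you do see duplicate breach rows, it helps to screen the affected_service values before filing the ticket. The helper below is a hypothetical check, not a platform utility. It assumes "exactly vs:<uuid>" means a lowercase vs: prefix followed by the lowercase, hyphenated UUID form that Python's uuid module produces.

```python
from uuid import UUID


def is_canonical_key(value: str) -> bool:
    """True only for 'vs:' followed by a lowercase, hyphenated UUID."""
    prefix, sep, raw = value.partition(":")
    if prefix != "vs" or not sep:
        return False
    try:
        # UUID() accepts several spellings; round-tripping rejects all but the canonical one.
        return str(UUID(raw)) == raw
    except ValueError:
        return False


def non_canonical(values: list[str]) -> list[str]:
    """Return the values worth pasting into the ticket."""
    return [v for v in values if not is_canonical_key(v)]


rows = [
    "vs:12345678-1234-5678-1234-567812345678",
    "VS:12345678-1234-5678-1234-567812345678",   # wrong prefix case
    "vs:12345678-1234-5678-1234-567812345678 ",  # trailing space
    "vs:12345678123456781234567812345678",       # hyphens missing
]
print(len(non_canonical(rows)))  # 3
```

Any value this flags is a likely reason the dedup lookup missed.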
Last reviewed: 2026-05-21