Set up SLA alerts

Chapter 2 · about 12 minutes to read

You can see your Virtual Servers' metrics (Chapter 1). Now you want Zyra to wake someone up when a metric crosses a line that matters. That's two systems: alert rules (per-VS thresholds) and SLA agreements (organisation-wide service-level guarantees). This chapter covers both and shows where each surfaces in the UI.

Time: about 12 minutes. Prerequisites: at least one Virtual Server in running state and one notification channel configured.

Alert rules vs SLA breaches

Alert rule. "When CPU on this VS averages over 85% for 5 minutes, fire an alert." Scoped to one metric on one VS (or on every VS in the org), and authored by you. Stored in alert_rules, evaluated by backend/app/services/monitoring/alert_evaluator.py.
SLA agreement. "VSs covered by this agreement must stay at 99.5% uptime over the calendar month." Organisation-wide, attached to a template, evaluated every 5 minutes by backend/app/services/sla/monitor.py. When a covered VS enters error or stopped while it is supposed to be running, an sla_breach row is recorded.

Use alert rules for tactical "investigate now" signals. Use SLA agreements for contractual or internal-policy guarantees that need a paper trail.
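
To get a feel for what an uptime target actually permits, the arithmetic below converts a percentage into a monthly downtime budget. Treat it as a sketch: the function name is hypothetical, and it assumes the month is measured as whole calendar days, which may not match how the platform's own report counts time.

```python
import calendar


def downtime_budget_minutes(target_pct: float, year: int, month: int) -> float:
    """Minutes of downtime a calendar month allows at the given uptime target.

    Hypothetical helper for illustration; not part of the platform.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - target_pct) / 100.0


# May 2026 has 31 days, i.e. 44,640 minutes.
print(round(downtime_budget_minutes(99.5, 2026, 5), 1))
print(round(downtime_budget_minutes(99.9, 2026, 5), 1))
```

For a 31-day month that works out to roughly 223 minutes of allowed downtime at 99.5% and roughly 45 minutes at 99.9%.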

Step 1 — Configure a notification channel

Open Settings → Notification Channels → New channel. Four types are available, matching ChannelType:

Email. Distribution list, e.g. oncall@yourco.example.
Slack. Incoming webhook URL from your Slack workspace.
Webhook. Generic HTTPS POST to any URL — wire it into PagerDuty, OpsGenie, your own bot, anything that speaks JSON.
PagerDuty. Native integration via PagerDuty Events API v2 — provide a service routing key.

[VERIFY: Slack and PagerDuty front-end forms are wired in production at launch — backend channel types exist; UI completeness pending]

The SMS channel is on the roadmap but not shipping in MVP1. Until then, point a Webhook channel at an SMS gateway.

Step 2 — Author an alert rule

Open Monitoring → Rules → New rule. Fields:

Name. Plain English, e.g. prod-api CPU > 85%.
Resource. A specific VS or "all VSs in org".
Metric. One of cpu, memory, disk, latency.
Threshold. Number + comparator (>, >=, <, <=).
Evaluation window. 60, 300, 600, or 900 seconds. The rule fires only if the average over that window crosses the threshold, which filters out single-spike noise.
Severity. info / warning / critical.
Channels. One or more channels from Step 1.

[SCREENSHOT: New alert rule form with metric, threshold, and channels filled in]
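
The fields above, and the windowed-average behaviour in particular, can be sketched as follows. This is an illustration under stated assumptions, not the code in alert_evaluator.py: the AlertRule class, the should_fire function, and the 10-second sampling interval are all hypothetical.

```python
import operator
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple

# Comparators and windows come from the form; mapping them to Python is an assumption.
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
ALLOWED_WINDOWS = (60, 300, 600, 900)  # seconds


@dataclass
class AlertRule:  # hypothetical shape, not the alert_rules schema
    name: str
    metric: str
    comparator: str
    threshold: float
    window_seconds: int

    def __post_init__(self) -> None:
        if self.comparator not in COMPARATORS:
            raise ValueError(f"unsupported comparator: {self.comparator}")
        if self.window_seconds not in ALLOWED_WINDOWS:
            raise ValueError(f"unsupported window: {self.window_seconds}")


def should_fire(rule: AlertRule, samples: List[Tuple[int, float]]) -> bool:
    """Average the samples inside the window ending at the newest timestamp,
    then compare that average against the threshold."""
    if not samples:
        return False
    newest = max(t for t, _ in samples)
    in_window = [v for t, v in samples if newest - t < rule.window_seconds]
    return COMPARATORS[rule.comparator](mean(in_window), rule.threshold)


# Five minutes at 80% CPU ending in a 30-second spike to 95%, sampled every 10 s.
samples = [(t, 95.0 if t >= 270 else 80.0) for t in range(0, 300, 10)]
slow = AlertRule("prod-api CPU > 85%", "cpu", ">", 85.0, 300)
fast = AlertRule("prod-api CPU > 85% (fast)", "cpu", ">", 85.0, 60)
print(should_fire(slow, samples))  # False: the 300 s average is 81.5
print(should_fire(fast, samples))  # True: the 60 s average is 87.5
```

Note that a shorter window still averages: had the baseline been 50% instead of 80%, the same spike would not have fired the 60-second rule either.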

Step 3 — Read alert history

Open Monitoring → Alerts. The page lists every alert that has fired, newest first. Each row shows Triggered at, Rule, Resource (click-through), and Status (triggered / acknowledged / resolved), plus an Acknowledge button that claims the alert under your user.

Resolution is automatic: once the metric's average over the same evaluation window no longer crosses the threshold (in either direction, depending on the rule's comparator), status flips to resolved.
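
The status lifecycle described here can be pictured as a small state machine. The sketch below is hypothetical: the Alert class and both functions are invented for illustration, and it assumes an acknowledged alert still resolves automatically, which this page does not state outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:  # hypothetical in-memory stand-in for an alert row
    rule_name: str
    status: str = "triggered"
    acknowledged_by: Optional[str] = None


def acknowledge(alert: Alert, user: str) -> None:
    """Claim a triggered alert; acknowledged or resolved alerts are left alone."""
    if alert.status == "triggered":
        alert.status = "acknowledged"
        alert.acknowledged_by = user


def resolve_if_recovered(alert: Alert, still_breaching: bool) -> None:
    """Flip to resolved once the windowed average no longer crosses the threshold."""
    if alert.status != "resolved" and not still_breaching:
        alert.status = "resolved"


alert = Alert("prod-api CPU > 85%")
acknowledge(alert, "dana")
resolve_if_recovered(alert, still_breaching=False)
print(alert.status, alert.acknowledged_by)  # resolved dana
```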

Step 4 — Attach an SLA agreement

Open Settings → SLA Agreements → New agreement (org admins only). Pick a template (Standard 99.5%, Premium 99.9%, etc., from sla_templates) and a covered set of VSs or device groups. Save.

The SLA monitor runs every 5 minutes (INTERVAL_SECONDS = 300). Each cycle it lists active agreements, finds covered VSs that are in error or stopped when they should be running, dedupes against existing breaches by the canonical key vs:<uuid>, and records a new sla_breach row with type UPTIME only if no matching breach already exists.

The dedup key matters: a recent fix made sure affected_service formatting is consistent across the monitor and the storage layer, so a single downed VS produces one breach row, not one per cycle.
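
A minimal sketch of why a consistent key gives one row per outage rather than one per cycle. Everything here is a stand-in: BreachStore, run_cycle, and the in-memory dictionary are hypothetical, and the sketch assumes every VS passed in is covered by an agreement and expected to be running. It is not the logic in monitor.py.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

BAD_STATES = {"error", "stopped"}


def breach_key(vs_id: uuid.UUID) -> str:
    """Build the canonical dedup key in one place so every caller agrees on it."""
    return f"vs:{vs_id}"


@dataclass
class BreachStore:  # hypothetical in-memory stand-in for sla_breach storage
    open_breaches: Dict[str, dict] = field(default_factory=dict)

    def record_if_new(self, vs_id: uuid.UUID, state: str) -> bool:
        key = breach_key(vs_id)
        if key in self.open_breaches:
            return False  # this outage already has a breach row
        self.open_breaches[key] = {
            "type": "UPTIME",
            "affected_service": key,
            "description": f"VS in {state} state",
        }
        return True


def run_cycle(store: BreachStore, covered: Iterable[Tuple[uuid.UUID, str]]) -> int:
    """One monitor pass; returns how many new breach rows were recorded."""
    return sum(
        1
        for vs_id, state in covered
        if state in BAD_STATES and store.record_if_new(vs_id, state)
    )


store = BreachStore()
down_vs = uuid.uuid4()
fleet = [(down_vs, "error"), (uuid.uuid4(), "running")]
print(run_cycle(store, fleet))  # 1: first cycle records the breach
print(run_cycle(store, fleet))  # 0: second cycle finds the existing key
```

If the monitor and the storage layer formatted the key differently (say, one of them omitting the vs: prefix), the lookup would miss and every cycle would add a row.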

Step 5 — Read SLA breaches

Open SLA → Breaches. Same shape as alert history but scoped to contractual breaches:

Type — uptime, response_time, error_rate. MVP1 fires uptime only.
Affected service — the vs:<uuid> key.
Started at — when the VS first entered the bad state.
Resolved at — when it returned to running, or null if still down.
Description — e.g. VS prod-api in error state.

The monthly SLA report (SLA → Report) rolls these into a credit calculation against the agreement template's targets.
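
As a rough illustration of how breach rows relate to an uptime target, the sketch below sums the Started at / Resolved at intervals for one VS over a month and compares the result with the template target. The function is hypothetical, it assumes breaches for a single VS do not overlap, and it stops short of the credit calculation, whose formula this page does not describe.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple

Interval = Tuple[datetime, Optional[datetime]]  # (started_at, resolved_at or None)


def uptime_percent(
    breaches: Iterable[Interval],
    period_start: datetime,
    period_end: datetime,
    now: datetime,
) -> float:
    """Uptime over [period_start, period_end) given non-overlapping breach intervals.

    An unresolved breach (resolved_at is None) is counted as down until `now`.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for started_at, resolved_at in breaches:
        end = resolved_at if resolved_at is not None else now
        # Clip each breach to the reporting period.
        lo = max(started_at, period_start)
        hi = min(end, period_end)
        if hi > lo:
            down_seconds += (hi - lo).total_seconds()
    return 100.0 * (1.0 - down_seconds / period_seconds)


utc = timezone.utc
may_start = datetime(2026, 5, 1, tzinfo=utc)
june_start = datetime(2026, 6, 1, tzinfo=utc)
# One four-hour outage on 10 May.
breaches = [(datetime(2026, 5, 10, 2, 0, tzinfo=utc), datetime(2026, 5, 10, 6, 0, tzinfo=utc))]

uptime = uptime_percent(breaches, may_start, june_start, now=june_start)
print(round(uptime, 3), uptime >= 99.5)
```

Four hours of downtime in a 31-day month is about 99.46% uptime, which misses a 99.5% target but would only become visible once the month's breach rows are added up.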

What just happened

You wired one channel, one alert rule, and one SLA agreement. The platform will now ping your channel when a tactical threshold is crossed, and record a permanent breach row when a covered VS goes down.

Troubleshooting

Alert didn't fire when CPU hit 95%. Check the evaluation window. A 5-minute window averages a 30-second spike together with 270 seconds of normal load, so the average usually stays under the threshold. Drop to 60 seconds for spike-sensitive workloads.
Slack channel doesn't receive messages. Test the webhook URL with curl and an empty JSON {} body; Slack returns a payload error such as invalid_payload if the URL is live, and no_service if it is not.
One downed VS produces multiple breach rows. This should be impossible after the dedup fix. File a ticket that includes the affected_service values; the dedup key is meant to be exactly vs:<uuid>.

Last reviewed: 2026-05-21