
Architecture patterns for distributed workloads

Chapter 1 · about 12 minutes to read

You have the API, you have webhooks, you have CI/CD wired up. The question now is how to design workloads so they thrive on Zyra's underlying reality: a mesh of heterogeneous Compute Nodes, scheduled by the placement engine, occasionally churning. The patterns below are the ones that survive that environment cleanly.

Time: about 12 minutes. Prerequisites: Stages 1-4 complete; you can deploy a Virtual Server and call the API from a script.

Why distributed workloads need their own patterns

A Zyra Virtual Server is a Docker container scheduled onto one device by the placement engine (backend/app/services/placement/engine.py). The engine scores candidates on hardware, load, reliability, cost, and network, then picks the winner. When a device goes offline, the SLA loop and the auto-scaling engine notice and react, but your workload must still tolerate the gap between the failure and the recovery.

See docs/architecture/SYSTEM_ARCHITECTURE.md for the full topology.

Pattern 1 — Stateless, idempotent jobs

The cheapest insurance policy you will ever buy. Make the unit of work:

Stateless. Everything the job needs comes in as input. Outputs go to a known sink. Restarting from scratch is safe.
Idempotent. Running it twice produces the same result. Tag each invocation with a client-side job_id so consumers can deduplicate.

Concrete example — ML batch inference. Read 10,000 records from S3, score each with the model, write predictions back keyed by record ID. If the VS dies at record 7,234, the rerun re-scores everything; output is overwritten in place. No partial state to reconcile.
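The batch-inference example can be sketched as follows. This is a minimal illustration, not Zyra code: the dict `sink` stands in for the S3 prefix, and `run_batch`, `double`, and the `(record_id, feature)` record shape are hypothetical names chosen for the sketch.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Tuple[str, float]  # (record_id, feature) -- hypothetical shape


def run_batch(
    records: Iterable[Record],
    score: Callable[[float], float],
    sink: Dict[str, float],
) -> int:
    """Score every record and write the result keyed by record ID.

    Writes are overwrites, so running the job again after a crash
    leaves the sink in the same final state as a single clean run.
    """
    written = 0
    for record_id, feature in records:
        sink[record_id] = score(feature)  # overwrite in place, never append
        written += 1
    return written


def double(x: float) -> float:
    """Stand-in for a real model; deterministic so reruns agree."""
    return 2 * x
```

The property that matters is that the sink is keyed by record ID and written with overwrite semantics. An append-only sink would need the job_id tag so that consumers can discard duplicates.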

Pattern 2 — Checkpoint and resume

For long jobs (hours), restarting from zero is wasteful. Checkpoint:

Pick a natural checkpoint cadence — every N records, every M minutes, every batch boundary.
At each checkpoint, write progress to durable storage outside the VS (S3, your own Postgres, a Redis key you operate).
On startup, your container reads the checkpoint and resumes.

Concrete example — video transcoding. A 4-hour 4K source. Checkpoint per 1-minute output segment to S3. On restart, list the segments already written, resume from the next one.
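A minimal sketch of the list-then-resume step, assuming a local directory stands in for the S3 prefix. The `segment_NNNNN.bin` naming, `completed_segments`, and `transcode_all` are hypothetical; a real job would list objects in its own bucket instead of calling `os.listdir`.

```python
import os
from typing import Callable, List, Set


def completed_segments(out_dir: str) -> Set[int]:
    """Return indices of segments already present in the output location."""
    done = set()
    for name in os.listdir(out_dir):
        if name.startswith("segment_") and name.endswith(".bin"):
            done.add(int(name[len("segment_"):-len(".bin")]))
    return done


def transcode_all(total_segments: int, out_dir: str,
                  transcode: Callable[[int], bytes]) -> List[int]:
    """Transcode only the missing segments; return the indices done in this run."""
    done = completed_segments(out_dir)
    processed = []
    for index in range(total_segments):
        if index in done:
            continue  # checkpoint hit: this segment survived the last run
        final_path = os.path.join(out_dir, f"segment_{index:05d}.bin")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as fh:
            fh.write(transcode(index))
        # Rename only after the write finished, so a crash mid-write
        # never leaves a partial file that looks like a finished segment.
        os.replace(tmp_path, final_path)
        processed.append(index)
    return processed
```

Here the output itself doubles as the checkpoint, so there is no separate progress record to keep in sync with the data.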

The Zyra side stays simple: one VS, ordinary cost_per_hour accrual (backend/app/services/billing_calculator.py). The resilience lives in your application code.

Pattern 3 — Fan-out / fan-in for embarrassingly parallel work

When the work is naturally parallel, do not run it serially on one big VS. Fan out:

Split the input into N shards.
Fan out: deploy N small VSs via POST /api/v1/virtual-servers, each with the shard ID as an env var.
Each VS runs to completion, writes its shard output, then stops itself via the API.
Fan in: a final coordinator job (or a webhook handler triggered by the last vs.terminated event) merges the shard outputs.

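Concrete example: scientific compute. 500 Monte Carlo simulations. 50 VSs × 10 simulations each finish in roughly 1/50th of the wall-clock time of one VS running them serially, at about the same total cost, provided each VS has the same cost_per_hour as the single one and start-up overhead is small compared with the run time.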
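The shard split and the wall-clock arithmetic can be sketched as below. `split_into_shards` and `estimate` are hypothetical helpers; the estimate assumes every VS bills at the same hourly rate, stops as soon as its shard is done, and has negligible start-up time.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_shards(items: Sequence[T], shard_count: int) -> List[List[T]]:
    """Split items into contiguous shards whose sizes differ by at most one."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    base, extra = divmod(len(items), shard_count)
    shards, start = [], 0
    for i in range(shard_count):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        shards.append(list(items[start:start + size]))
        start += size
    return shards


def estimate(total_units: int, vs_count: int, hours_per_unit: float,
             cost_per_hour: float) -> Tuple[float, float]:
    """Return (wall_clock_hours, total_cost), ignoring start-up overhead."""
    units_on_busiest_vs = -(-total_units // vs_count)  # ceiling division
    wall_clock = units_on_busiest_vs * hours_per_unit
    # Every unit is billed once no matter which VS runs it.
    total_cost = total_units * hours_per_unit * cost_per_hour
    return wall_clock, total_cost
```

With 500 units at half an hour each, one VS takes 250 hours and 50 VSs take 5 hours, while the total cost is the same in both cases under these assumptions.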

Rules that make this stable:

Cap concurrency to what downstream sinks can absorb. 1,000 VSs hammering one S3 prefix will throttle.
Tag each VS with the parent job ID in name — easy to find and clean up if the coordinator crashes.
Use webhooks (Stage 4 chapter 2) for fan-in, not polling.
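The first two rules can be sketched with two small helpers. The `job-<id>-shard-<n>` naming scheme and the functions `vs_name` and `waves` are assumptions made for illustration, not a Zyra convention.

```python
from typing import Iterator, List


def vs_name(parent_job_id: str, shard_id: int) -> str:
    """Build a VS name that carries the parent job ID as a searchable prefix."""
    return f"job-{parent_job_id}-shard-{shard_id:04d}"


def waves(shard_ids: List[int], max_concurrent: int) -> Iterator[List[int]]:
    """Yield shard IDs in groups no larger than max_concurrent."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    for start in range(0, len(shard_ids), max_concurrent):
        yield shard_ids[start:start + max_concurrent]
```

An orchestrator would deploy one wave, wait for that wave's termination events, then deploy the next. A sliding window keeps more VSs busy, but waves are simpler to reason about when the limit comes from a downstream sink.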

Pattern 4 — Pipelines: chaining Virtual Servers

Real workflows are rarely one step. The Zyra-native way to chain steps:

Step 1 VS produces output to S3, then terminates.
A webhook fires on vs.terminated (or you poll a known sink).
Your orchestration code deploys Step 2 VS with Step 1's output as input.

Keep each step in its own VS, sized for that step's hardware needs. Pre/post-processing on a small CPU VS, the heavy model inference on a GPU VS, the final write on another small CPU VS. This is also the cheapest topology — you stop paying for GPU minutes the moment that stage finishes (see Stage 3 chapter 4).
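A minimal sketch of the orchestration step, assuming the webhook handler can tell which pipeline step just finished. The `"step"` key in the event, the hardware labels, the bucket URIs, and the `deploy` callable are all hypothetical; the real vs.terminated payload and deploy call may look different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    hardware: str      # hypothetical size label, e.g. "cpu-small" or "gpu"
    output_uri: str    # where this step writes its result


PIPELINE: List[Step] = [
    Step("preprocess", "cpu-small", "s3://example-bucket/pre/"),
    Step("inference", "gpu", "s3://example-bucket/pred/"),
    Step("postprocess", "cpu-small", "s3://example-bucket/final/"),
]


def on_vs_terminated(event: Dict[str, str],
                     deploy: Callable[[Step, str], None]) -> Optional[str]:
    """Deploy the step that follows the one named in the event, if any.

    Assumes the event carries the finished step's name under "step".
    """
    names = [s.name for s in PIPELINE]
    finished = event.get("step")
    if finished not in names:
        return None  # not one of ours; ignore
    index = names.index(finished)
    if index + 1 == len(PIPELINE):
        return None  # last step done; pipeline complete
    next_step = PIPELINE[index + 1]
    deploy(next_step, PIPELINE[index].output_uri)  # previous output is next input
    return next_step.name
```

If a webhook can be delivered more than once, this handler deploys the next step more than once, so the deploy call itself needs to be idempotent.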

Pattern 5 — Leader/follower for stateful coordination

Some workloads need a single source of truth across parallel workers: a queue head, a counter, a lock. Pick one of the three options below; do not invent a fourth:

External Redis/Postgres you operate, shared by all VSs. Simplest, most boring, works.
One designated "leader" VS that holds the state; followers call its endpoint. Fragile — leader death is a coordinated restart event.
A consensus library (Raft, etc.) inside the VSs. Heavy. Only do this if you have specific durability requirements.

Zyra does not ship a managed coordination primitive. [VERIFY: managed Redis or similar offering is on the long-tail roadmap, not MVP1]
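The external-store option usually comes down to one atomic "set if absent" operation that workers use to claim work. The sketch below uses an in-memory class as a stand-in so that it runs on its own; with a Redis instance you operate, the equivalent primitive is SET with the NX option. `InMemoryStore`, `claim_next`, and the `claim:` key prefix are hypothetical.

```python
import threading
from typing import Dict, List, Optional


class InMemoryStore:
    """Stand-in for an external store; only set_if_absent needs to be atomic."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key: str, value: str) -> bool:
        """Store value under key and return True only if key was unset."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def claim_next(store: InMemoryStore, task_ids: List[str],
               worker: str) -> Optional[str]:
    """Claim the first task nobody else holds; return None when all are taken."""
    for task_id in task_ids:
        if store.set_if_absent(f"claim:{task_id}", worker):
            return task_id
    return None
```

Claims in this sketch never expire. A production version would attach a time-to-live to each claim so that work held by a dead VS becomes available again.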

Pattern 6 — Idempotency keys at the API layer

When your own CI or orchestration code calls the Zyra API, network blips happen. A retried POST /api/v1/virtual-servers could create a duplicate VS. The defence:

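Generate a client-side ID that stays the same across retries of one logical action, and put it in the name field. A UUIDv4 works only if you generate it once and persist it before the first attempt, because it is random; a UUIDv5 derived from the action's inputs is deterministic and needs no storage.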
Before calling create, GET /api/v1/virtual-servers?name=<id>. If it exists, you already succeeded.
On retry, repeat — the GET tells you the truth.

This is application-layer idempotency. [VERIFY: server-side Idempotency-Key header support is not currently implemented in the Zyra API]
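The check-then-create loop can be sketched as below. `find_by_name` and `create` are hypothetical callables that would wrap GET /api/v1/virtual-servers?name=<id> and POST /api/v1/virtual-servers; their return shapes, the `vs-` prefix, and the namespace constant are assumptions made for the sketch.

```python
import uuid
from typing import Callable, Dict, List, Optional

# Fixed namespace for this sketch; any constant UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def action_name(logical_action: str) -> str:
    """Derive the same name every time for the same logical action."""
    return f"vs-{uuid.uuid5(NAMESPACE, logical_action)}"


def create_once(logical_action: str,
                find_by_name: Callable[[str], List[Dict]],
                create: Callable[[str], Dict],
                attempts: int = 3) -> Dict:
    """Create a VS for the action unless one with its name already exists."""
    name = action_name(logical_action)
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            existing = find_by_name(name)
            if existing:
                return existing[0]  # an earlier attempt already succeeded
            return create(name)
        except ConnectionError as exc:
            last_error = exc  # network blip: loop and check again
    raise RuntimeError(f"gave up creating {name}") from last_error
```

The check and the create are two separate requests, so two callers racing on the same action can still both create a VS. The pattern protects a single retrying caller, not concurrent ones.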

Picking the right pattern

Short retriable task — Pattern 1
Long-running single task — Pattern 2
N parallel independent units — Pattern 3
Multi-stage workflow — Pattern 4
Shared mutable state — Pattern 5
Your own retry logic — Pattern 6

What just happened

You now have a catalogue of six patterns that work well on a distributed compute mesh. None of them require Zyra-specific glue beyond the API surface you already know from Stage 4. The next chapter takes the fleet-wide view: how to size that mesh in the first place.

Troubleshooting

VS dies mid-job and I lose hours of work. You needed Pattern 2 (checkpoint). Retrofit by writing progress to S3 every N minutes.
Coordinator crashes leave orphan VSs running and accruing cost. Tag VSs with the parent job ID and run a daily cleanup script that terminates orphans.
Fan-out queues my workers indefinitely. Your fleet does not yet have the headroom for that many concurrent VSs; read Chapter 2 (Capacity planning).
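The orphan-cleanup script mentioned above needs a rule for deciding which VSs are orphans. A minimal sketch, assuming VS names follow a hypothetical `job-<parent_job_id>-shard-<n>` scheme and that each listed VS record exposes `id`, `name`, and a timezone-aware `created_at` (field names are assumptions, not the documented API response):

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set


def find_orphans(servers: Iterable[Dict], active_job_ids: Set[str],
                 now: datetime,
                 grace: timedelta = timedelta(hours=1)) -> List[str]:
    """Return IDs of fan-out VSs whose parent job is no longer active."""
    orphans = []
    for vs in servers:
        name = vs["name"]
        if not name.startswith("job-") or "-shard-" not in name:
            continue  # not created by a fan-out; leave it alone
        parent = name[len("job-"):name.rindex("-shard-")]
        # Skip very young VSs: their coordinator may not be registered yet.
        too_young = now - vs["created_at"] < grace
        if parent not in active_job_ids and not too_young:
            orphans.append(vs["id"])
    return orphans
```

The function only selects candidates. Terminating them is a separate step, which makes it easy to run the script in a dry-run mode first.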

Last reviewed: 2026-05-21