Architecture patterns for distributed workloads
Chapter 1 · about 12 minutes to read
You have the API, you have webhooks, you have CI/CD wired up. The question now is how to design workloads so they thrive on Zyra's underlying reality: a mesh of heterogeneous Compute Nodes, scheduled by the placement engine, occasionally churning. The patterns below are the ones that survive that environment cleanly.
Time: about 12 minutes. Prerequisites: Stages 1-4 complete; you can deploy a Virtual Server and call the API from a script.
Why distributed workloads need their own patterns
A Zyra Virtual Server is a Docker container scheduled onto one device by the placement engine (backend/app/services/placement/engine.py). The engine scores candidates on hardware, load, reliability, cost, and network, then picks the winner. When a device goes offline, the SLA loop and the auto-scaling engine notice and react, but your workload must be able to tolerate the gap.
See docs/architecture/SYSTEM_ARCHITECTURE.md for the full topology.
Pattern 1 — Stateless, idempotent jobs
The cheapest insurance policy you will ever buy. Make the unit of work:
- Stateless. Everything the job needs comes in as input. Outputs go to a known sink. Restarting from scratch is safe.
- Idempotent. Running it twice produces the same result. Tag each invocation with a client-side job_id so consumers can deduplicate.
Concrete example — ML batch inference. Read 10,000 records from S3, score each with the model, write predictions back keyed by record ID. If the VS dies at record 7,234, the rerun re-scores everything; output is overwritten in place. No partial state to reconcile.
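As a minimal sketch of the batch-inference example, the function below writes each prediction keyed by record ID, so a full rerun overwrites rather than appends. The names run_batch and toy_model, and the dict used as a sink, are hypothetical stand-ins for your own model and storage; this is an illustration of the idea, not Zyra code.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_batch(
    records: Iterable[Tuple[str, float]],
    score: Callable[[float], float],
    sink: Dict[str, float],
) -> int:
    """Score every record and write the result keyed by record ID.

    Each write is a plain overwrite, so rerunning the whole batch after a
    crash leaves the sink in the same final state as a single clean run.
    """
    written = 0
    for record_id, features in records:
        sink[record_id] = score(features)  # overwrite in place, never append
        written += 1
    return written


def toy_model(x: float) -> float:
    # Hypothetical stand-in for real inference; it must be deterministic
    # for the rerun to produce identical output.
    return round(x * 2.0, 6)
```

Calling run_batch twice with the same records and an S3-like keyed sink leaves exactly one prediction per record ID, which is the property the pattern relies on.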
Pattern 2 — Checkpoint and resume
For long jobs (hours), restarting from zero is wasteful. Checkpoint:
- Pick a natural checkpoint cadence — every N records, every M minutes, every batch boundary.
- At each checkpoint, write progress to durable storage outside the VS (S3, your own Postgres, a Redis key you operate).
- On startup, your container reads the checkpoint and resumes.
Concrete example — video transcoding. A 4-hour 4K source. Checkpoint per 1-minute output segment to S3. On restart, list the segments already written, resume from the next one.
The Zyra side stays simple: one VS, ordinary cost_per_hour accrual (backend/app/services/billing_calculator.py). The resilience lives in your application code.
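The checkpoint steps above can be sketched as follows. A local JSON file stands in for the durable store here purely so the example is runnable; in a real job the checkpoint would live outside the VS (S3, Postgres, Redis), as described above. The function names and the next_index field are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_checkpoint(path: str) -> int:
    """Return the index of the next unprocessed item (0 if no checkpoint)."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return int(json.load(fh)["next_index"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp, path)


def process_all(
    items: List[str],
    handle: Callable[[str], None],
    path: str,
    every: int = 100,
) -> int:
    """Process items from the last checkpoint onward; return the start index."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        handle(items[i])
        if (i + 1) % every == 0:  # checkpoint cadence: every N items
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(items))
    return start
```

Items processed after the last checkpoint but before a crash are handled again on restart, so the per-item handler should itself be idempotent (Pattern 1).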
Pattern 3 — Fan-out / fan-in for embarrassingly parallel work
When the work is naturally parallel, do not run it serially on one big VS. Fan out:
- Split the input into N shards.
- Fan out: deploy N small VSs via POST /api/v1/virtual-servers, each with the shard ID as an env var.
- Each VS runs to completion, writes its shard output, then stops itself via the API.
- Fan in: a final coordinator job (or a webhook handler triggered by the last vs.terminated event) merges the shard outputs.
Concrete example: scientific compute. 500 Monte Carlo simulations. 50 VSs × 10 simulations each finish in roughly 1/50th of the wall-clock time of one VS running them serially, at about the same total cost.
Rules that make this stable:
- Cap concurrency to what downstream sinks can absorb. 1,000 VSs hammering one S3 prefix will throttle.
- Tag each VS with the parent job ID in name, so it is easy to find and clean up if the coordinator crashes.
- Use webhooks (Stage 4 chapter 2) for fan-in, not polling.
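To make the split and the naming rule concrete, here is a small sketch that divides the input into shards and prepares one request body per shard. Only the name field and the use of an env var come from the text above; the exact body layout ("env", "SHARD_ID", "SHARD_COUNT") is a hypothetical shape, so check it against the API reference before use.

```python
from typing import Dict, List, Sequence


def split_into_shards(items: Sequence[str], n_shards: int) -> List[List[str]]:
    """Split items into n_shards contiguous shards whose sizes differ by at most one."""
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    base, extra = divmod(len(items), n_shards)
    shards: List[List[str]] = []
    start = 0
    for i in range(n_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        shards.append(list(items[start:start + size]))
        start += size
    return shards


def build_create_requests(job_id: str, n_shards: int) -> List[Dict[str, object]]:
    """Build one body per shard for POST /api/v1/virtual-servers.

    The body layout is an assumption for illustration. The parent job ID is
    embedded in `name` so orphaned VSs can be found and cleaned up later.
    """
    return [
        {
            "name": f"{job_id}-shard-{i}",
            "env": {"SHARD_ID": str(i), "SHARD_COUNT": str(n_shards)},
        }
        for i in range(n_shards)
    ]
```

For the Monte Carlo example, split_into_shards over 500 simulation IDs with n_shards=50 yields 50 shards of 10 each. Submitting the requests in bounded batches rather than all at once is one way to respect the concurrency cap.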
Pattern 4 — Pipelines: chaining Virtual Servers
Real workflows are rarely one step. The Zyra-native way to chain steps:
- Step 1 VS produces output to S3, then terminates.
- A webhook fires on vs.terminated (or you poll a known sink).
- Your orchestration code deploys Step 2 VS with Step 1's output as input.
Keep each step in its own VS, sized for that step's hardware needs. Pre/post-processing on a small CPU VS, the heavy model inference on a GPU VS, the final write on another small CPU VS. This is also the cheapest topology — you stop paying for GPU minutes the moment that stage finishes (see Stage 3 chapter 4).
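One way to wire the chain is a webhook handler that looks up the step that just finished and deploys the next one. The sketch below assumes a hypothetical event payload with "type" and "name" keys, a hypothetical "<run_id>--<step>" naming convention, and made-up hardware labels; the deploy callable is where your own call to the Zyra API would go.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stage table: each step gets its own VS, sized for that step.
STAGES: List[Dict[str, str]] = [
    {"step": "preprocess", "hardware": "cpu-small"},
    {"step": "inference", "hardware": "gpu"},
    {"step": "write", "hardware": "cpu-small"},
]


def next_stage(finished_step: str) -> Optional[Dict[str, str]]:
    """Return the stage after finished_step, or None if it is last or unknown."""
    names = [s["step"] for s in STAGES]
    if finished_step not in names:
        return None
    idx = names.index(finished_step)
    return STAGES[idx + 1] if idx + 1 < len(STAGES) else None


def handle_event(
    event: Dict[str, str],
    deploy: Callable[[str, Dict[str, str]], None],
) -> Optional[str]:
    """On vs.terminated, deploy the following step; return its name if deployed."""
    if event.get("type") != "vs.terminated":
        return None
    # Assumed naming convention: "<run_id>--<step>".
    run_id, _, step = event.get("name", "").rpartition("--")
    stage = next_stage(step)
    if stage is None:
        return None
    deploy(f"{run_id}--{stage['step']}", stage)
    return stage["step"]
```

If the same event can reach the handler more than once, the deploy call should go through a duplicate check such as the one in Pattern 6, so a repeated delivery does not start the same step twice.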
Pattern 5 — Leader/follower for stateful coordination
Some workloads need a single source of truth across parallel workers, such as a queue head, a counter, or a lock. Pick one of these options rather than inventing a fourth:
- External Redis/Postgres you operate, shared by all VSs. Simplest, most boring, works.
- One designated "leader" VS that holds the state; followers call its endpoint. Fragile — leader death is a coordinated restart event.
- A consensus library (Raft, etc.) inside the VSs. Heavy. Only do this if you have specific durability requirements.
Zyra does not ship a managed coordination primitive. [VERIFY: managed Redis or similar offering is on the long-tail roadmap, not MVP1]
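For the first option, the core primitive is usually an atomic "set if absent" on the shared store; Redis exposes this as SET with the NX option. The sketch below uses an in-memory class as a stand-in so it runs without a server. InMemoryStore and claim_task are hypothetical names, and a real deployment would replace the class with a client for the store you operate.

```python
import threading
from typing import Dict, Optional


class InMemoryStore:
    """Stand-in for an external store offering an atomic set-if-absent.

    This only illustrates the semantics inside one process; it does not
    coordinate across VSs.
    """

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key: str, value: str) -> bool:
        """Store value under key only if key is unset; report whether we won."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            return self._data.get(key)


def claim_task(store: InMemoryStore, task_id: str, worker_id: str) -> bool:
    """Try to claim a task for one worker; exactly one claimant succeeds."""
    return store.set_if_absent(f"claim:{task_id}", worker_id)
```

In practice a claim usually also carries an expiry, so that a worker that dies while holding it does not block the task forever.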
Pattern 6 — Idempotency keys at the API layer
When your own CI or orchestration code calls the Zyra API, network blips happen. A retried POST /api/v1/virtual-servers could create a duplicate VS. The defence:
- Generate a client-side ID once per logical action (for example a uuidv4, reused unchanged on every retry) and put it in the name field.
- Before calling create, GET /api/v1/virtual-servers?name=<id>. If it exists, you already succeeded.
- On retry, repeat the same check; the GET tells you the truth.
This is application-layer idempotency. [VERIFY: server-side Idempotency-Key header support is not currently implemented in the Zyra API]
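The check-then-create loop can be sketched as below. The two callables are placeholders for your own HTTP calls: list_by_name stands for GET /api/v1/virtual-servers?name=<id> and create for POST /api/v1/virtual-servers. Their return shapes (a list of dicts, a dict) are assumptions, not the documented response format.

```python
import uuid
from typing import Callable, Dict, List, Optional


def new_action_id() -> str:
    """Generate the ID once per logical action and reuse it on every retry."""
    return str(uuid.uuid4())


def create_once(
    action_id: str,
    list_by_name: Callable[[str], List[Dict[str, str]]],
    create: Callable[[str], Dict[str, str]],
    attempts: int = 3,
) -> Dict[str, str]:
    """Create a VS named action_id unless one already exists.

    Every attempt starts with the lookup, so a create that succeeded on the
    server but whose response was lost is detected instead of repeated.
    """
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            existing = list_by_name(action_id)
            if existing:
                return existing[0]
            return create(action_id)
        except ConnectionError as exc:  # network blip: loop and re-check
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The lookup and the create are two separate requests, so two callers racing with the same ID could still both create a VS. The sketch protects a single caller's retries, not concurrent callers.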
Picking the right pattern
- Short retriable task — Pattern 1
- Long-running single task — Pattern 2
- N parallel independent units — Pattern 3
- Multi-stage workflow — Pattern 4
- Shared mutable state — Pattern 5
- Your own retry logic — Pattern 6
What just happened
You now have a catalogue of six patterns that work well on a distributed compute mesh. None of them require Zyra-specific glue beyond the API surface you already know from Stage 4. The next chapter takes the fleet-wide view: how to size that mesh in the first place.
Troubleshooting
- VS dies mid-job and I lose hours of work. You needed Pattern 2 (checkpoint). Retrofit by writing progress to S3 every N minutes.
- Coordinator crashes leave orphan VSs running and accruing cost. Tag VSs with the parent job ID and run a daily cleanup script that terminates orphans.
- Fan-out queues my workers indefinitely. Read Chapter 2 — Capacity planning — your fleet does not yet have the headroom.
Last reviewed: 2026-05-21