Run your first job
Chapter 6 · about 8 minutes to read
This is the final chapter of Stage 2. Virtual Servers (chapter 5) are great for things that should keep running. Jobs are for things that should finish — a model training run, a batch ETL, a nightly report. Same Compute Nodes, same orchestrator, different lifecycle.
Time: about 15 minutes for a small training job; real workloads take as long as they need. Prerequisites: at least one Compute Node in Ready state, and Stage 2 chapter 5 completed (so you've seen the Virtual Server lifecycle).
Job vs Virtual Server — quick recap
| | Virtual Server | Job |
|---|---|---|
| Intent | Persistent service | Runs to completion |
| Lifecycle | creating → running → stopped | queued → running → completed (or failed) |
| Backing model | virtual_servers table | ml_jobs for ML; pipelines + pipeline_runs for data pipelines |
| Billing | Per second while running | Per second from started_at to completed_at |
| Output | The running service itself | A file/log/artifact you collect afterwards |
Zyra ships two job flavours today, both backed by real models in backend/app/models/:
- ML Job (`MLJob`): training and inference jobs with explicit GPU requirements, a framework hint (pytorch/tensorflow/jax), and a runtime cap.
- Pipeline run (`Pipeline` + `PipelineRun`): multi-step data pipelines built with the Pipelines UI. Each step is a container; the orchestrator wires their outputs together.
This chapter walks through the ML Job flow because it's the shortest path to a first finished job. Pipelines are covered in Stage 3.
Step 1 — Open the job form
In the sidebar, click ML Jobs → Submit Job. The page maps to POST /api/v1/ml-jobs/ and persists into the ml_jobs table.
Step 2 — Fill in the form
The required fields map to columns on MLJob:
- Name: up to 255 chars. `first-training-test` is fine.
- Job type: `training` (default) or `inference`. Stick with training for this walkthrough.
- Framework: `pytorch` (default), `tensorflow`, `jax`, or `custom`. Affects which base image is recommended and which GPU drivers the scheduler looks for.
- Container image: any Docker image. Two known-working starters:
  - `pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime`: official PyTorch with CUDA.
  - `python:3.11-slim`: plain Python if you just want to confirm scheduling works.
- Command: the entrypoint command for the container. For a smoke test: `python -c "import time; [print(f'step {i}') or time.sleep(1) for i in range(30)]; print('done')"`. This prints one `step` line per second for 30 seconds, prints `done`, then exits.
- Environment variables: JSON object. Common entries: `WANDB_API_KEY`, `HF_TOKEN`, `S3_BUCKET`.
- CPU cores / Memory (MB) / GPU requirements: resource asks. Default 4 cores, 8192 MB, no GPU. For the smoke test, leave defaults.
- Max runtime (seconds): hard timeout. Default 86400 (24 hours). The scheduler will mark the job as `timeout` if it runs longer. The `task_timeout_watchdog` background job enforces this every 60 seconds.
Optional:
- Output path: where the job writes results. For Zyra-hosted MinIO, use `s3://<your-bucket>/jobs/{job_id}/`. If left blank, output is captured to the job log only.
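The form fields above can be sketched as a request payload. This is a minimal sketch, not the actual Zyra schema: only `container_image`, `memory_mb`, and `max_runtime_seconds` are names that appear in this chapter, and every other key (`name`, `job_type`, `framework`, `command`, `environment`, `cpu_cores`, `output_path`) is an assumed name chosen for illustration. The defaults and allowed values follow the list above.

```python
from typing import Any, Dict, Optional

JOB_TYPES = {"training", "inference"}
FRAMEWORKS = {"pytorch", "tensorflow", "jax", "custom"}


def build_ml_job_payload(
    name: str,
    container_image: str,
    command: str,
    job_type: str = "training",
    framework: str = "pytorch",
    environment: Optional[Dict[str, str]] = None,
    cpu_cores: int = 4,
    memory_mb: int = 8192,
    max_runtime_seconds: int = 86400,
    output_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate form values and return a JSON-serializable payload.

    Key names other than container_image, memory_mb, and
    max_runtime_seconds are assumptions, not the documented schema.
    """
    if not 1 <= len(name) <= 255:
        raise ValueError("name must be 1 to 255 characters")
    if job_type not in JOB_TYPES:
        raise ValueError(f"job_type must be one of {sorted(JOB_TYPES)}")
    if framework not in FRAMEWORKS:
        raise ValueError(f"framework must be one of {sorted(FRAMEWORKS)}")
    if cpu_cores < 1 or memory_mb < 1 or max_runtime_seconds < 1:
        raise ValueError("resource asks and max runtime must be positive")

    payload: Dict[str, Any] = {
        "name": name,
        "job_type": job_type,
        "framework": framework,
        "container_image": container_image,
        "command": command,
        "environment": dict(environment or {}),
        "cpu_cores": cpu_cores,
        "memory_mb": memory_mb,
        "max_runtime_seconds": max_runtime_seconds,
    }
    # Leave the key out entirely when no output path is set, mirroring
    # the "left blank" case where output goes to the job log only.
    if output_path:
        payload["output_path"] = output_path
    return payload


# The smoke test from this chapter, with every default left in place.
SMOKE_TEST = build_ml_job_payload(
    name="first-training-test",
    container_image="python:3.11-slim",
    command=(
        "python -c \"import time; "
        "[print(f'step {i}') or time.sleep(1) for i in range(30)]; "
        "print('done')\""
    ),
)
```

Serializing `SMOKE_TEST` with `json.dumps` would give a body suitable for a POST to `/api/v1/ml-jobs/`, provided the real field names match; check the API reference before relying on them.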
Step 3 — Submit
Click Submit. The dashboard calls POST /api/v1/ml-jobs/ with your payload. The backend inserts an MLJob row with status = "queued", hands it to the scheduler, and redirects you to the job detail page.
The status pill cycles through:
- `queued`: accepted by the API; waiting for a matching Ready device.
- `scheduling`: scheduler picked a device, agent dispatch in flight.
- `running`: container started on the device. `started_at` set.
- `completed`: container exited with code 0. `completed_at` set, `cost_total` finalized.
- `failed`: container exited non-zero. A job that exceeds the runtime cap is marked `timeout` instead (see Max runtime in Step 2).
For the 30-second smoke test, you'll see the full cycle in about a minute (most of which is pulling the Python image if the device hasn't run Python yet).
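The status cycle above can be modelled as a small state machine, which is handy if you script against job status. This is a sketch under assumptions: the status strings come from this chapter (including `timeout`, which the Max runtime field and Troubleshooting describe), but the transition table is inferred from the walkthrough and may omit edges the real scheduler allows, such as a job returning to `queued`.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    SCHEDULING = "scheduling"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMEOUT = "timeout"


# Transitions inferred from the walkthrough; an assumption, not the
# scheduler's authoritative table.
ALLOWED_TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.SCHEDULING},
    JobStatus.SCHEDULING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {
        JobStatus.COMPLETED,
        JobStatus.FAILED,
        JobStatus.TIMEOUT,
    },
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
    JobStatus.TIMEOUT: set(),
}


def can_transition(current: JobStatus, nxt: JobStatus) -> bool:
    """Return True if the sketch's table allows current -> nxt."""
    return nxt in ALLOWED_TRANSITIONS[current]


def is_terminal(status: JobStatus) -> bool:
    """A terminal status has no outgoing transitions."""
    return not ALLOWED_TRANSITIONS[status]
```

A polling script could stop as soon as `is_terminal(JobStatus(status_string))` is true, rather than checking for `completed` alone and hanging on a failed job.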
Step 4 — Watch the logs
The detail page has a Logs tab that streams the container's stdout/stderr in near-real time via WebSocket. After the job completes, the same log is downloadable as a single text file. Info-level logs are retained for at least 30 days and error-level logs for at least 90 days.
For the smoke test you'll see step 0, step 1, … step 29, done. The status flips to completed.
Step 5 — Collect outputs
If you set an output path, the agent uploads any files written to /output/ inside the container to that path before exiting. The dashboard surfaces them as downloadable links on the Artifacts tab. For the smoke test there are no artifacts — you only need the log.
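To make the upload step concrete, here is a sketch of how a file under `/output/` might map to an object key. It assumes the `{job_id}` placeholder in the output path is expanded to the job's ID and that the directory layout under `/output/` is preserved; neither detail is confirmed by this chapter, and `artifact_destination` is a hypothetical helper, not part of the agent.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlsplit


def artifact_destination(
    output_path: str, job_id: str, container_file: str
) -> Tuple[str, str]:
    """Return (bucket, key) for a file written under /output/.

    Hypothetical mapping: expands {job_id}, then appends the file's
    path relative to /output to the prefix in the output path.
    """
    parts = urlsplit(output_path.replace("{job_id}", job_id))
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError("output path must look like s3://bucket/prefix/")

    rel = posixpath.relpath(container_file, "/output")
    # Reject /output itself and anything outside it.
    if rel in (".", "..") or rel.startswith("../"):
        raise ValueError("file must live under /output/")

    key = posixpath.join(parts.path.lstrip("/"), rel)
    return parts.netloc, key
```

For example, with the output path `s3://my-bucket/jobs/{job_id}/` and a job ID of `abc123`, the file `/output/model/weights.pt` would map to bucket `my-bucket` and key `jobs/abc123/model/weights.pt` under these assumptions.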
Step 6 — Check cost
The Cost panel shows cost_total from the ml_jobs row, computed from runtime × the device's rate. For the 30-second smoke test on a no-GPU device, this comes out to fractions of a cent. The point is: you only paid for the seconds the container was actually running.
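The arithmetic behind that figure can be sketched as follows. The chapter only says cost is runtime × the device's rate, so the hourly unit and the example rate of 0.12 per hour are assumptions for illustration, as is the absence of rounding.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def job_cost(
    started_at: datetime, completed_at: datetime, rate_per_hour: Decimal
) -> Decimal:
    """Per-second billing: seconds run times an (assumed) hourly rate."""
    seconds = Decimal(str((completed_at - started_at).total_seconds()))
    if seconds < 0:
        raise ValueError("completed_at must not precede started_at")
    return seconds * rate_per_hour / Decimal(3600)


# The 30-second smoke test on a device with a hypothetical 0.12/hour rate.
started = datetime(2026, 5, 21, 12, 0, 0, tzinfo=timezone.utc)
completed = started + timedelta(seconds=30)
smoke_test_cost = job_cost(started, completed, Decimal("0.12"))
```

Here 30 × 0.12 / 3600 = 0.001, a tenth of a cent. Time spent in `queued` or `scheduling` does not enter the calculation because billing starts at `started_at`.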
Concurrency — how many jobs can I run at once?
Concurrency is bounded by your fleet; the only plan-level cap applies to the Starter plan:
- A single Compute Node can run multiple jobs in parallel up to its capability score and the capacity caps you set in chapter 4.
- The org-wide concurrency limit is the sum across your Ready devices.
- The Starter plan caps you at 1 concurrent job for evaluation purposes; Pro removes that cap.
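The rules above reduce to a short calculation. In this sketch, `max_parallel_jobs` is a hypothetical single number standing in for whatever a node's capability score and chapter 4 capacity caps allow, and the plan names are plain strings chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Device:
    name: str
    ready: bool
    max_parallel_jobs: int  # hypothetical stand-in for score + caps


def org_concurrency_limit(devices: Iterable[Device], plan: str) -> int:
    """Sum capacity over Ready devices, then apply the Starter cap."""
    fleet_limit = sum(d.max_parallel_jobs for d in devices if d.ready)
    if plan == "starter":
        return min(fleet_limit, 1)
    return fleet_limit


fleet = [
    Device("workstation", ready=True, max_parallel_jobs=3),
    Device("laptop", ready=True, max_parallel_jobs=1),
    Device("old-server", ready=False, max_parallel_jobs=4),
]
```

With this example fleet, only the two Ready devices count, so the limit is 4 on Pro and 1 on Starter.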
What just happened
You submitted a containerized workload, the scheduler matched it to a Compute Node you control, the agent ran it under the resource caps you set, and you paid only for the seconds it ran. That's the entire Zyra value proposition compressed into one job: arbitrary compute, on hardware you own, billed by the second.
Troubleshooting
- `queued` forever. No Ready device meets the resource ask. Either reduce the ask (fewer cores, less RAM, no GPU) or enroll a beefier device.
- `failed` immediately with exit code 125. Docker couldn't start the container. Usually a bad image reference; double-check the `container_image` string.
- `failed` with exit code 137. OOM-killed. Bump the `memory_mb` cap.
- `timeout`. Container ran longer than `max_runtime_seconds`. The `task_timeout_watchdog` background job marks it as timeout; increase the cap or split the workload.
- No GPU detected inside container. You asked for GPU resources but the device's GPU exposure setting (chapter 4) is Hidden. Flip it to Visible and re-submit.
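The checks above can be sketched as a small triage helper. Both functions are hypothetical and not part of Zyra; `has_timed_out` only illustrates the comparison the watchdog is described as making, and the hint strings paraphrase this section. Exit code 137 is 128 + 9, meaning the process was killed with SIGKILL, which is what the kernel's OOM killer sends.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def has_timed_out(
    started_at: datetime, now: datetime, max_runtime_seconds: int = 86400
) -> bool:
    """True once the job has run longer than its runtime cap."""
    return (now - started_at).total_seconds() > max_runtime_seconds


def diagnose(status: str, exit_code: Optional[int] = None) -> str:
    """Map a job status and container exit code to a hint."""
    if status == "queued":
        return "No Ready device meets the resource ask; reduce it or enroll a bigger device."
    if status == "timeout":
        return "Exceeded max_runtime_seconds; raise the cap or split the workload."
    if status == "failed" and exit_code == 125:
        return "Docker could not start the container; check container_image."
    if status == "failed" and exit_code == 137:
        return "Killed with SIGKILL (128 + 9), usually out of memory; raise memory_mb."
    if status == "failed":
        return "Container exited non-zero; read the job log."
    return "No known issue for this status."
```

Because the watchdog is described as running every 60 seconds, a job may run up to about a minute past its cap before the status changes.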
Stage 2 complete
You've gone from a freshly verified account to a running Compute Node, a deployed Virtual Server with a connect URL, and a completed job with collected logs. That's the full "aha" arc.
Stage 3 — Common workflows — picks up from here with Virtual Server deep dive, job lifecycle in detail, device groups, billing, audit log, alerts, and exports. (Stage 3 chapters coming soon.)
Last reviewed: 2026-05-21