Run your first job

Chapter 6 · about 8 minutes to read

This is the final chapter of Stage 2. Virtual Servers (chapter 5) are great for things that should keep running. Jobs are for things that should finish — a model training run, a batch ETL, a nightly report. Same Compute Nodes, same orchestrator, different lifecycle.

Time: about 15 minutes for a small training job; real workloads take as long as they need. Prerequisites: at least one Compute Node in the Ready state, and Stage 2 chapter 5 completed (so you've seen the Virtual Server lifecycle).

Job vs Virtual Server — quick recap

	Virtual Server	Job
Intent	Persistent service	Runs to completion
Lifecycle	`creating → running → stopped`	`queued → running → completed` (or `failed`)
Backing model	`virtual_servers` table	`ml_jobs` for ML; `pipelines` + `pipeline_runs` for data pipelines
Billing	Per second while `running`	Per second from `started_at` to `completed_at`
Output	The running service itself	A file/log/artifact you collect afterwards

Zyra ships two job flavours today, both backed by real models in backend/app/models/:

ML Job (MLJob) — training and inference jobs with explicit GPU requirements, framework hint (pytorch/tensorflow/jax), runtime cap.
Pipeline run (Pipeline + PipelineRun) — multi-step data pipelines built with the Pipelines UI. Each step is a container; the orchestrator wires their outputs together.

This chapter walks through the ML Job flow because it's the shortest path to a first finished job. Pipelines are covered in Stage 3.

Step 1 — Open the job form

In the sidebar, click ML Jobs → Submit Job. The page maps to POST /api/v1/ml-jobs/ and persists into the ml_jobs table.

[SCREENSHOT: ML Jobs list with "Submit Job" button highlighted]

Step 2 — Fill in the form

The required fields map to columns on MLJob:

Name — up to 255 chars. first-training-test is fine.
Job type — training (default) or inference. Stick with training for this walkthrough.
Framework — pytorch (default), tensorflow, jax, or custom. Affects which base image is recommended and which GPU drivers the scheduler looks for.
Container image — any Docker image. Two known-working starters:
- pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime — official PyTorch with CUDA.
- python:3.11-slim — plain Python if you just want to confirm scheduling works.
Command — the entrypoint command for the container. For a smoke test: python -c "import time; [print(f'step {i}') or time.sleep(1) for i in range(30)]; print('done')". This prints 30 lines over 30 seconds, then exits.
Environment variables — JSON object. Common entries: WANDB_API_KEY, HF_TOKEN, S3_BUCKET.
CPU cores / Memory (MB) / GPU requirements — resource asks. Default 4 cores, 8192 MB, no GPU. For the smoke test, leave defaults.
Max runtime (seconds) — hard timeout. Default 86400 (24 hours). The task_timeout_watchdog background job checks every 60 seconds and marks the job as timeout once it has run longer than this cap.
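To make the runtime cap concrete, here is a minimal sketch of the kind of check a watchdog could perform on each pass. It is not the actual task_timeout_watchdog code: the job dictionaries, `has_exceeded_runtime`, and `watchdog_tick` are hypothetical names used only for illustration, and only the 60-second cadence, the `started_at` and `max_runtime_seconds` fields, and the `timeout` status come from this chapter.

```python
from datetime import datetime, timedelta, timezone

WATCHDOG_INTERVAL_SECONDS = 60  # check cadence stated in this chapter


def has_exceeded_runtime(started_at, max_runtime_seconds, now):
    """True once the job has been running longer than its cap."""
    return (now - started_at).total_seconds() > max_runtime_seconds


def watchdog_tick(jobs, now):
    """Mark running jobs past their cap as 'timeout'; return the names changed.

    `jobs` is a list of plain dicts standing in for ml_jobs rows (assumption).
    """
    timed_out = []
    for job in jobs:
        if job["status"] != "running":
            continue
        if has_exceeded_runtime(job["started_at"], job["max_runtime_seconds"], now):
            job["status"] = "timeout"
            timed_out.append(job["name"])
    return timed_out


start = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
jobs = [
    {"name": "short-cap", "status": "running", "started_at": start, "max_runtime_seconds": 30},
    {"name": "default-cap", "status": "running", "started_at": start, "max_runtime_seconds": 86400},
]
changed = watchdog_tick(jobs, start + timedelta(seconds=WATCHDOG_INTERVAL_SECONDS))
print(changed)
```

If the real watchdog works along these lines, a job may run up to roughly one check interval past its cap before it is marked, so treat the cap as accurate to about a minute rather than to the second.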

Optional:

Output path — where the job writes results. For Zyra-hosted MinIO, use s3://<your-bucket>/jobs/{job_id}/. If left blank, output is captured to the job log only.
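The form fields above can be pictured as a single JSON payload. The sketch below is illustrative only: `container_image`, `memory_mb`, and `max_runtime_seconds` appear elsewhere in this chapter, but the other key names (`name`, `job_type`, `framework`, `command`, `env`, `cpu_cores`, `gpu_count`, `output_path`) and the helper `build_ml_job_payload` are assumptions, so check them against the API schema before relying on them.

```python
import json

ALLOWED_JOB_TYPES = {"training", "inference"}
ALLOWED_FRAMEWORKS = {"pytorch", "tensorflow", "jax", "custom"}


def build_ml_job_payload(name, container_image, command, *, job_type="training",
                         framework="pytorch", env=None, cpu_cores=4,
                         memory_mb=8192, gpu_count=0,
                         max_runtime_seconds=86400, output_path=None):
    """Assemble a payload that mirrors the form defaults described above."""
    if not name or len(name) > 255:
        raise ValueError("name must be 1 to 255 characters")
    if job_type not in ALLOWED_JOB_TYPES:
        raise ValueError(f"unknown job type: {job_type}")
    if framework not in ALLOWED_FRAMEWORKS:
        raise ValueError(f"unknown framework: {framework}")
    payload = {
        "name": name,
        "job_type": job_type,
        "framework": framework,
        "container_image": container_image,
        "command": command,
        "env": dict(env or {}),          # JSON object of environment variables
        "cpu_cores": cpu_cores,
        "memory_mb": memory_mb,
        "gpu_count": gpu_count,          # hypothetical shape for the GPU ask
        "max_runtime_seconds": max_runtime_seconds,
    }
    if output_path is not None:          # blank means "log only"
        payload["output_path"] = output_path
    return payload


# The smoke-test command from the Command field, quoted for Python.
SMOKE_TEST_COMMAND = (
    'python -c "import time; '
    "[print(f'step {i}') or time.sleep(1) for i in range(30)]; "
    "print('done')\""
)

payload = build_ml_job_payload("first-training-test", "python:3.11-slim", SMOKE_TEST_COMMAND)
print(json.dumps(payload, indent=2))
```

Leaving every optional argument at its default reproduces the smoke-test form: 4 cores, 8192 MB, no GPU, a 24-hour cap, and no output path.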

[SCREENSHOT: ML Job submit form with fields filled for the smoke test above]

Step 3 — Submit

Click Submit. The dashboard calls POST /api/v1/ml-jobs/ with your payload. The backend inserts an MLJob row with status = "queued", hands it to the scheduler, and redirects you to the job detail page.
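If you would rather script the submission than click through the form, the request the dashboard makes could look roughly like the sketch below. The endpoint path comes from this chapter; the base URL, the bearer-token header, and the payload keys are assumptions, and the example only builds the request so you can inspect it before sending anything.

```python
import json
import urllib.request


def build_submit_request(base_url, token, payload):
    """Prepare (but do not send) a POST to /api/v1/ml-jobs/.

    base_url and token are placeholders; the auth scheme is an assumption.
    """
    url = base_url.rstrip("/") + "/api/v1/ml-jobs/"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


req = build_submit_request(
    "https://zyra.example.com",          # hypothetical host
    "YOUR_TOKEN",
    {"name": "first-training-test", "container_image": "python:3.11-slim"},
)
print(req.get_method(), req.full_url)
# To send it for real: urllib.request.urlopen(req)
```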

The status pill cycles through:

queued — accepted by the API; waiting for a matching Ready device.
scheduling — scheduler picked a device, agent dispatch in flight.
running — container started on the device. started_at set.
completed — container exited with code 0. completed_at set, cost_total finalized.
failed — container exited non-zero. If the runtime cap fires instead, the job is marked timeout (see Max runtime in Step 2).
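The status progression above can be read as a small state machine. The following sketch encodes it and polls until a terminal status appears. The transition table is inferred from the list above plus the `timeout` status described under Max runtime; it is an assumption rather than the scheduler's actual rules, and `fetch_status` is a hypothetical callable you would replace with a real lookup.

```python
import time

# Inferred from the status list in this chapter (assumption, not the scheduler's source).
TRANSITIONS = {
    "queued": {"scheduling"},
    "scheduling": {"running"},
    "running": {"completed", "failed", "timeout"},
}
TERMINAL = {"completed", "failed", "timeout"}


def is_valid_transition(old, new):
    """True if `new` is an allowed next status after `old`."""
    return new in TRANSITIONS.get(old, set())


def wait_for_terminal(fetch_status, poll_seconds=2.0, sleep=time.sleep):
    """Poll fetch_status() until a terminal status; return the distinct statuses seen."""
    seen = []
    while True:
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL:
            return seen
        sleep(poll_seconds)


# Simulated status feed so the example runs without a backend.
fake_feed = iter(["queued", "queued", "scheduling", "running", "running", "completed"])
history = wait_for_terminal(lambda: next(fake_feed), sleep=lambda seconds: None)
print(history)  # ['queued', 'scheduling', 'running', 'completed']
```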

For the 30-second smoke test, you'll see the full cycle in about a minute. Most of that time goes to pulling the Python image if the device hasn't pulled it before.

[SCREENSHOT: ML Job detail page mid-run with live log tail visible]

Step 4 — Watch the logs

The detail page has a Logs tab that streams the container's stdout/stderr in near-real time via WebSocket. After the job completes, the same log is downloadable as a single text file. Logs are retained for at least 30 days at info level and 90 days at error level.

For the smoke test you'll see step 0, step 1, … step 29, done. The status flips to completed.
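Once you download the log, a quick check like the one below can confirm that the smoke test produced every expected line. It is a sketch that assumes the downloaded file contains the container's output one line at a time with no added timestamps or prefixes; adjust the parsing if the real log format differs.

```python
def expected_smoke_test_lines(steps=30):
    """The lines the smoke-test command prints: step 0 .. step N-1, then done."""
    return [f"step {i}" for i in range(steps)] + ["done"]


def log_matches_smoke_test(log_text, steps=30):
    """Compare non-empty log lines against the expected smoke-test output."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return lines == expected_smoke_test_lines(steps)


sample_log = "\n".join(expected_smoke_test_lines()) + "\n"
print(log_matches_smoke_test(sample_log))  # True
```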

Step 5 — Collect outputs

If you set an output path, the agent uploads any files written to /output/ inside the container to that path before exiting. The dashboard surfaces them as downloadable links on the Artifacts tab. For the smoke test there are no artifacts — you only need the log.
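As a rough picture of how files under /output/ could map onto the output path, the sketch below fills in the `{job_id}` placeholder and derives a destination URI for each file. Both helpers are hypothetical, and the assumption that the directory layout under /output/ is preserved at the destination is not confirmed by this chapter.

```python
import posixpath


def resolve_output_path(template, job_id):
    """Fill the {job_id} placeholder in an output path template."""
    return template.replace("{job_id}", str(job_id))


def artifact_uri(output_path, container_file, output_dir="/output"):
    """Destination URI for one file written under output_dir (assumed layout)."""
    relative = posixpath.relpath(container_file, output_dir)
    if relative.startswith(".."):
        raise ValueError(f"{container_file} is not under {output_dir}")
    return output_path.rstrip("/") + "/" + relative


base = resolve_output_path("s3://my-bucket/jobs/{job_id}/", "1234")  # example bucket and id
uri = artifact_uri(base, "/output/checkpoints/model.pt")
print(uri)  # s3://my-bucket/jobs/1234/checkpoints/model.pt
```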

[SCREENSHOT: Job detail page with status "completed", duration, cost, and log download]

Step 6 — Check cost

The Cost panel shows cost_total from the ml_jobs row, computed from runtime × the device's rate. For the 30-second smoke test on a no-GPU device, this comes out to fractions of a cent. The point is: you only paid for the seconds the container was actually running.
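The arithmetic behind the Cost panel can be sketched in a few lines. This assumes the device rate is expressed per second and that cost is simply elapsed seconds multiplied by that rate; the hourly rate used here is a made-up figure for illustration, and the real computation may round or add other terms.

```python
from datetime import datetime, timezone
from decimal import Decimal


def job_cost(started_at, completed_at, rate_per_second):
    """Elapsed seconds between started_at and completed_at times the per-second rate."""
    elapsed = Decimal(str((completed_at - started_at).total_seconds()))
    return elapsed * rate_per_second


started = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
completed = datetime(2026, 1, 1, 12, 0, 30, tzinfo=timezone.utc)

# Hypothetical rate: 0.10 currency units per hour, converted to per second.
rate_per_second = Decimal("0.10") / Decimal(3600)

cost = job_cost(started, completed, rate_per_second)
print(f"{cost:.6f}")
```

Using `Decimal` rather than floats keeps sub-cent amounts from picking up binary rounding noise, which matters when many short jobs are summed.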

Concurrency — how many jobs can I run at once?

Concurrency is bounded by your fleet and, on the Starter plan only, by a plan cap:

A single Compute Node can run multiple jobs in parallel up to its capability score and the capacity caps you set in chapter 4.
The org-wide concurrency limit is the sum across your Ready devices.
The Starter plan caps you at 1 concurrent job for evaluation purposes; Pro removes that cap.
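Put together, the rules above amount to a sum over Ready devices with an optional plan cap on top. The sketch below shows that calculation; the device dictionaries and the `max_parallel_jobs` field are hypothetical stand-ins for whatever per-node limit results from the capability score and capacity caps.

```python
def org_concurrency_limit(devices, plan_cap=None):
    """Sum per-node limits across Ready devices, then apply a plan cap if any."""
    fleet_limit = sum(
        device["max_parallel_jobs"]
        for device in devices
        if device["state"] == "Ready"
    )
    return fleet_limit if plan_cap is None else min(fleet_limit, plan_cap)


fleet = [
    {"name": "node-a", "state": "Ready", "max_parallel_jobs": 3},
    {"name": "node-b", "state": "Ready", "max_parallel_jobs": 2},
    {"name": "node-c", "state": "Offline", "max_parallel_jobs": 4},  # not counted
]

pro_limit = org_concurrency_limit(fleet)                  # no plan cap
starter_limit = org_concurrency_limit(fleet, plan_cap=1)  # Starter: 1 concurrent job
print(pro_limit, starter_limit)  # 5 1
```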

What just happened

You submitted a containerized workload, the scheduler matched it to a Compute Node you control, the agent ran it under the resource caps you set, and you paid only for the seconds it ran. That's the entire Zyra value proposition compressed into one job: arbitrary compute, on hardware you own, billed by the second.

Troubleshooting

queued forever. No Ready device meets the resource ask. Either reduce the ask (fewer cores, less RAM, no GPU) or enroll a beefier device.
failed immediately with exit code 125. Docker couldn't start the container. Usually a bad image reference; double-check the container_image string.
failed with exit code 137. The container was killed with SIGKILL (128 + 9), which usually means it was OOM-killed. Raise the memory_mb cap.
timeout. Container ran longer than max_runtime_seconds. The task_timeout_watchdog background job marks it as timeout — increase the cap or split the workload.
No GPU detected inside container. You asked for GPU resources but the device's GPU exposure setting (chapter 4) is Hidden. Flip it to Visible and re-submit.
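If you triage jobs from a script, the checklist above can be turned into a small lookup. This is a sketch only: the `diagnose` helper is hypothetical, and it covers just the cases listed in this section.

```python
def diagnose(status, exit_code=None):
    """Map a job status and optional exit code to the hints from this section."""
    if status == "queued":
        return "No Ready device meets the resource ask: reduce it or enroll a larger device."
    if status == "timeout":
        return "Ran longer than max_runtime_seconds: increase the cap or split the workload."
    if status == "failed":
        if exit_code == 125:
            return "Docker could not start the container: check the container_image string."
        if exit_code == 137:
            return "Container was killed, usually by the OOM killer: raise memory_mb."
        return f"Container exited with code {exit_code}: check the job log."
    if status == "completed":
        return "Job finished successfully."
    return f"No hint for status {status!r}."


for status, code in [("failed", 125), ("failed", 137), ("timeout", None)]:
    print(status, code, "->", diagnose(status, code))
```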

Stage 2 complete

You've gone from a freshly verified account to a running Compute Node, a deployed Virtual Server with a connect URL, and a completed job with collected logs. That's the full "aha" arc.

Stage 3 — Common workflows — picks up from here with Virtual Server deep dive, job lifecycle in detail, device groups, billing, audit log, alerts, and exports. (Stage 3 chapters coming soon.)

Last reviewed: 2026-05-21