Syncropel Docs

Parallel Dev — Monday Morning Walkthrough

A 20-minute hands-on walkthrough of the fleet workflow — start a 3-instance fleet, fan out a real task to two workers, handle a mid-run failure with the kill switch, watch the barrier join, tear down cleanly.

This tutorial requires a build of spl with fleet support: the commands below (spl fleet, spl kill, spl task fan-out) exist only in such builds. If yours doesn't have them, start with the single-instance Quickstart.

What you'll learn

By the end of this tutorial you will have:

  1. Booted a 3-instance fleet on your laptop (1 coordinator + 2 workers)
  2. Seen the fleet register and report health via heartbeats
  3. Fanned out a real multi-subtask task across both workers
  4. Watched a live fleet status view refresh as workers executed in parallel
  5. Triggered a soft freeze mid-run to stop a subtask cleanly, then unfrozen and retried
  6. Observed the barrier join and the parent task's completion KNOW
  7. Torn the fleet down cleanly

Total time: about 20 minutes if everything works, 30 minutes if you want to explore.

Prerequisites: spl binary installed with fleet support (spl fleet --help works), a working ~/.syncro prod daemon (or be willing to set one up fresh), and ~2 GB of free disk for per-worker state roots.
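
A quick preflight if you want to confirm all three up front (the fleet check just exercises the documented spl fleet --help; df is standard coreutils):

spl fleet --help >/dev/null 2>&1 && echo "fleet support: ok"
df -h ~ | tail -1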

Part 1 — Start the fleet

Open a terminal. Start the fleet with two workers:

spl fleet start --workers 2
Starting local fleet: 1 coordinator + 2 workers

  [coordinator] spawned PID 429874 on :9100 (home=~/.syncro)
  [worker-a]  spawned PID 429875 on :9201 (home=~/.syncro-worker-a)
  [worker-b]  spawned PID 429888 on :9202 (home=~/.syncro-worker-b)

Waiting for fleet convergence (3 live)...
  ✓ fleet converged

Fleet ready. Inspect with:
  spl fleet list
  spl fleet status

This does three things:

  1. Verifies the coordinator (your existing prod daemon at ~/.syncro on port 9100) is healthy. If no coordinator is running, it starts one.
  2. Creates ~/.syncro-worker-a and ~/.syncro-worker-b if they don't exist (empty state roots, fresh init).
  3. Launches two worker daemons on ports 9201 and 9202, each with SPL_FLEET_COORDINATOR_URL=http://localhost:9100 so they know where to register.
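
If you ever launch a worker by hand instead of via spl fleet start, that environment variable is the only fleet-specific wiring. A sketch, with the caveat that the daemon-start flag spellings below are assumptions, not canonical (check spl --help on your build):

export SPL_FLEET_COORDINATOR_URL=http://localhost:9100
# hypothetical flags for illustration; spl fleet start normally does this for you
spl daemon start --home ~/.syncro-worker-a --port 9201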

Within 15 seconds, both workers POST their first heartbeat to the coordinator. Verify:

spl fleet list
  DID                                      ENDPOINT                  STATUS   VERSION  UPTIME     HEALTH
  did:sync:instance:923a4e6d10646e27       http://127.0.0.1:9201     live     0.X.Y    5s         healthy
  did:sync:instance:dfe40ff4272ad900       http://127.0.0.1:9202     live     0.X.Y    5s         healthy
  did:sync:instance:e6f09984f9ccd634       http://127.0.0.1:9100     live     0.X.Y    5s         healthy

  3 live · 0 stale · 0 archived

You should see 3 instances: 1 coordinator and 2 workers, all live, all with recent last_heartbeat timestamps. If any are stale or missing, see Worker not registering in the operator runbook.

The workers are persistent — their state roots at ~/.syncro-worker-{a,b} survive fleet restarts. Next time you run spl fleet start --workers 2, the same two workers come back with whatever records they had before. To start fresh, delete the worker home directories before launching.
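
A full reset of both workers looks like this (destructive; it deletes their record stores):

spl fleet stop --workers
rm -rf ~/.syncro-worker-a ~/.syncro-worker-b
spl fleet start --workers 2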

Part 2 — Create a fan-out candidate

On the coordinator, look at your task backlog for something parallelizable. For this tutorial we'll use a synthetic task, but in real work you'd pick an actual multi-subtask task from your director's proposals.

Create a practice task you'll fan out:

spl task add "Tutorial: parallel docs writing" \
  --priority medium \
  --domain docs \
  --hypothesis "Validate spl task fan-out end-to-end on a 3-instance fleet." \
  --criteria "All 3 subtasks complete within wall-clock 5 minutes" \
  --alias TUT-001
◇ Task created: Tutorial: parallel docs writing
  thread: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  alias:  TUT-001

The task is just a placeholder at this point — no subtasks, no fan-out spec. The decomposition lives in the next command's flags, not in the task body. (You can also pre-declare subtasks in the body file; see the fan-out reference for that form.)
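
For reference, the body form is shaped roughly like the flags you'll pass in Part 3. This sketch is reconstructed from those flag names and the depends_on field the reconciler reads; the exact schema is the fan-out reference's to define:

{
  "subtasks": [
    { "goal": "Write section A on Loki integration", "target": "worker-a",
      "budget": 0.50, "timeout": 300 },
    { "goal": "Write summary synthesizing A and B", "target": "least-loaded-worker",
      "budget": 0.50, "timeout": 300, "depends_on": [0, 1] }
  ]
}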

Part 3 — Fan out

Fan out the placeholder task into three subtasks — two parallel writers and one synthesis step that depends on both. Each --subtask flag is a comma-separated list of key=value pairs; the engine parses them into a body subtasks array before emitting the parent INTEND.

spl task fan-out TUT-001 \
  --subtask "goal=Write section A on Loki integration,target=worker-a,budget=0.50,timeout=300" \
  --subtask "goal=Write section B on Prometheus integration,target=worker-b,budget=0.50,timeout=300" \
  --subtask "goal=Write summary synthesizing A and B,target=least-loaded-worker,budget=0.50,timeout=300,depends=0+1" \
  --join all
◇ Fan-out parent INTEND created
  thread: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  record: f2bb6627c70d9e723bb8f04d53dc40ae6883fdfc7421b2d3b8ab0874e3093dba
  subtasks: 3
  join:     all

  The engine will spawn child INTENDs to the targeted workers
  and emit a join KNOW once the predicate is satisfied.

  Watch progress: spl task join-status TUT-001

What just happened:

  1. The coordinator's reconciler detected the fan-out parent INTEND (your task) and scanned body.subtasks.
  2. For each subtask with no unsatisfied depends_on, it resolved the target (worker-a, worker-b) via the instance registry.
  3. It computed content-addressed child thread IDs and POSTed child genesis INTENDs to each target worker's /v1/records endpoint.
  4. It emitted a DO topic: "fanout_spawned" record on the parent thread to record what was spawned.
  5. The third subtask (with depends_on: [0, 1]) hasn't spawned yet — the reconciler will pick it up once the first two complete.

On each worker, the child INTEND triggered its local dispatch pipeline. The worker's agent is now executing the subtask in isolation, just like any single-instance dispatch.
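
Each worker is a full daemon with its own HTTP API, so you can poke one directly at the endpoint spl fleet list reported. Whether /v1/records answers GET as well as the POST documented above is an assumption; adjust to your build:

curl -s http://127.0.0.1:9201/v1/records | head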

Part 4 — Monitor in real time

In a second terminal, check the join state on demand:

spl task join-status TUT-001
Fan-out join status: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  Subtasks:        3
  Spawned:         2 / 3
  Completed:       0 / 3
  Join predicate:  all
  Join complete:   no

  Per-subtask:
    [0] Write section A on Loki integration → specific:did:sync:instance:923a4e6d10646e27
        in progress
    [1] Write section B on Prometheus integration → specific:did:sync:instance:dfe40ff4272ad900
        in progress
    [2] Write summary synthesizing A and B → specific:did:sync:instance:923a4e6d10646e27
        pending spawn

Subtasks 0 and 1 spawned immediately because they had no dependencies. Subtask 2 sits in pending spawn because its depends=0+1 blocks it until both predecessors report accept. The reconciler picks it up automatically once that condition flips; you don't have to do anything.

The continuously-refreshing dashboard view is spl fleet status --live — it shows coordinator and worker heartbeats, active dispatches per instance (alias + elapsed time + spend so far), pending fan-outs, and per-instance load. Use it in place of polling spl fleet list by hand.
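
If you'd rather poll the join state specifically, standard watch(1) gives a similar loop:

watch -n 5 spl task join-status TUT-001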

Watch the two workers execute in parallel. You should see the wall-clock time of the fan-out track max(subtask wall-times) rather than the sum — that's the parallelism payoff.

Part 5 — Trigger a mid-run failure (the kill switch drill)

Halfway through execution, pretend you realize worker-a is going down a bad path. Freeze just that namespace cleanly:

spl kill --namespace docs/v0.11 --reason "wrong approach on section A"
✓ soft freeze of 'docs/v0.11' recorded

The namespace you pass to spl kill is whatever body.namespace your fan-out parent INTEND declared. A task with no explicit namespace falls under default — pick something specific in real workflows so a freeze on one slice doesn't lock out unrelated work.

What this does:

  • Writes a LEARN topic: "freeze" record on th_fleet_control with level: soft.
  • A CEL permission rule fires on every subsequent record write: new INTEND, DO, or CALL records in the frozen namespace get rejected with 403 NAMESPACE_FROZEN.
  • KNOW and LEARN records still pass — in-flight dispatches drain cleanly and write their completion records.
  • After the configured grace window (default 60s), the in-flight dispatch that was on worker-a completes or times out naturally.
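
The permission rule in the second bullet might look roughly like this in CEL; the identifiers (record.kind, record.namespace, frozen.soft) are assumptions about the record schema, not the shipped rule:

// evaluated per record write; false rejects the write with 403 NAMESPACE_FROZEN
!(record.kind in ['INTEND', 'DO', 'CALL'] && record.namespace in frozen.soft)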

Verify the freeze is live:

spl fleet status
Fleet status
  Coordinator URL:  http://127.0.0.1:9100
  Heartbeat:        every 5s
  Instances:        3 live · 0 stale · 0 archived

  Active freezes:
    [soft] docs/v0.11 — wrong approach on section A

  Emergency stop:   inactive

The namespace appears under "Active freezes" with the reason you supplied.

Now rethink, then unfreeze:

spl unkill --namespace docs/v0.11
✓ unfreeze of 'docs/v0.11' recorded

The fleet returns to normal operation. Re-issue the stuck subtask:

spl task retry TUT-001 --subtask 0
◇ Retry requested for subtask 0 on th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  Watch progress: spl task join-status TUT-001

The retry emits a LEARN topic: "fanout_retry_subtask" record on the parent thread; the reconciler's spawn-pending walk excludes index 0 from the "already spawned" set on its next pass and re-resolves the descriptor's target via the configured routing strategy. The content-addressed child thread ID makes the second POST idempotent against the original worker — if the original landed cleanly, the worker's local store deduplicates.
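
One plausible shape of that content addressing (the real derivation is internal to the engine; this is illustrative only) is hashing the parent thread ID together with the subtask index, so any retry of the same (parent, index) pair resolves to the same child thread:

printf '%s:%d' th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451 0 | sha256sum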

Part 6 — Watch the barrier join

Return to the spl fleet status --live view. As the subtasks complete, their completion records flow back to the coordinator:

  1. Worker-a emits its local KNOW (fulfills child INTEND) after completing section A (with the correction applied).
  2. Worker-a POSTs a DO topic: "fanout_child_done" record back to the coordinator's parent thread.
  3. Worker-b does the same for section B.
  4. The coordinator's reconciler sees both fanout_child_done records and now has enough to spawn subtask 2 (the dependent one). It POSTs the third child INTEND to the least-loaded worker.
  5. When the third subtask completes and reports back, the coordinator runs the join predicate (all, i.e. children.all(c, c.verdict == "accept")). If satisfied, it emits a parent KNOW fulfilling the original fan-out INTEND.
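
Other join predicates would swap that CEL body; an any-style join, for instance (if your build offers one; the name is an assumption), would be the existential form:

// hypothetical 'any' predicate; this tutorial only exercises --join all
children.exists(c, c.verdict == 'accept')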

Once the join lands, inspect the parent task:

spl task show TUT-001
Task: Tutorial: parallel docs writing
Thread: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
Alias: TUT-001
Status: approved
Priority: medium
Hypothesis: Validate spl task fan-out end-to-end on a 3-instance fleet.
Success Criteria:
  [ ] All 3 subtasks complete within wall-clock 5 minutes

Evaluation:
  Completed by: engine
  Verdict: accept (by engine)

Records: 9

For the per-subtask cost and wall-clock breakdown, run spl task join-status TUT-001 after the parent KNOW lands:

Fan-out join status: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  Subtasks:        3
  Spawned:         3 / 3
  Completed:       3 / 3
  Join predicate:  all
  Join complete:   yes

  Per-subtask:
    [0] Write section A on Loki integration → specific:did:sync:instance:923a4e6d10646e27
        accept ($0.42, 120s)
    [1] Write section B on Prometheus integration → specific:did:sync:instance:dfe40ff4272ad900
        accept ($0.42, 120s)
    [2] Write summary synthesizing A and B → specific:did:sync:instance:923a4e6d10646e27
        accept ($0.31, 78s)

The task is now in review or done state (depending on your evaluation gate config). Each subtask shows verdict, cost, and wall-clock — the raw inputs to the speedup ratio you'd compare against a serial baseline.
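
Using the example numbers above: a serial baseline would cost roughly 120 + 120 + 78 = 318 s of wall-clock, while the parallel run takes about max(120, 120) + 78 = 198 s, since subtask 2 can't start until 0 and 1 finish. That's a speedup of roughly 318 / 198 ≈ 1.6x, before accounting for the freeze-and-retry detour or scheduler overhead.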

Part 7 — Tear down

spl fleet stop --workers

This sends SIGTERM to both worker daemons. Each emits a final heartbeat with health: "draining" before exiting. The coordinator stays running for overnight director proposals (the common case at end-of-day).

If you want to stop everything including the coordinator:

spl fleet stop --all

Verify:

spl fleet list

Should show 0 live instances (or just the coordinator, depending on which stop you ran).

What you just demonstrated

You ran a real distributed workflow. The coordinator dispatched parallel work to two autonomous workers, handled a mid-run failure gracefully with the kill switch, and barrier-joined the results. The wall-clock vs serial ratio — how much faster the parallel run was than running the three subtasks sequentially on one instance — is the quantitative proof of the pet→cattle thesis.

More importantly, every line you typed is a protocol operation, not a script. The fan-out, the kill switch, the heartbeat, the join — all of those are records emitted on well-known threads and reconciled by CEL-configurable folds. You could rewrite any of the behavior by editing CEL rules, not by recompiling spl.
