Parallel Dev — Monday Morning Walkthrough
A 20-minute hands-on walkthrough of the fleet workflow — start a 3-instance fleet, fan out a real task to two workers, handle a mid-run failure with the kill switch, watch the barrier join, tear down cleanly.
This tutorial requires a build of spl with fleet support — the commands below (spl fleet, spl kill, spl task fan-out) are not available otherwise. If your build doesn't have them, see the single-instance Quickstart first.
What you'll learn
By the end of this tutorial you will have:
- Booted a 3-instance fleet on your laptop (1 coordinator + 2 workers)
- Seen the fleet register and report health via heartbeats
- Fanned out a real multi-subtask task across both workers
- Watched a live fleet status view refresh as workers executed in parallel
- Triggered a soft freeze mid-run to stop a subtask cleanly, then unfroze and retried
- Observed the barrier join and the parent task's completion KNOW
- Torn the fleet down cleanly
Total time: about 20 minutes if everything works, 30 minutes if you want to explore.
Prerequisites: spl binary installed with fleet support (spl fleet --help works), a working ~/.syncro prod daemon (or be willing to set one up fresh), and ~2 GB of free disk for per-worker state roots.
Part 1 — Start the fleet
Open a terminal. Start the fleet with two workers:
```
spl fleet start --workers 2
```
```
Starting local fleet: 1 coordinator + 2 workers
[coordinator] spawned PID 429874 on :9100 (home=~/.syncro)
[worker-a] spawned PID 429875 on :9201 (home=~/.syncro-worker-a)
[worker-b] spawned PID 429888 on :9202 (home=~/.syncro-worker-b)
Waiting for fleet convergence (3 live)...
✓ fleet converged

Fleet ready. Inspect with:
  spl fleet list
  spl fleet status
```
This does three things:
- Verifies the coordinator (your existing prod daemon at ~/.syncro on port 9100) is healthy. If no coordinator is running, it starts one.
- Creates ~/.syncro-worker-a and ~/.syncro-worker-b if they don't exist (empty state roots, fresh init).
- Launches two worker daemons on ports 9201 and 9202, each with SPL_FLEET_COORDINATOR_URL=http://localhost:9100 so they know where to register.
Within 15 seconds, both workers POST their first heartbeat to the coordinator. Verify:
```
spl fleet list
```
```
DID                                  ENDPOINT               STATUS  VERSION  UPTIME  HEALTH
did:sync:instance:923a4e6d10646e27   http://127.0.0.1:9201  live    0.X.Y    5s      healthy
did:sync:instance:dfe40ff4272ad900   http://127.0.0.1:9202  live    0.X.Y    5s      healthy
did:sync:instance:e6f09984f9ccd634   http://127.0.0.1:9100  live    0.X.Y    5s      healthy

3 live · 0 stale · 0 archived
```
You should see 3 instances: 1 coordinator and 2 workers, all live, all with recent last_heartbeat timestamps. If any are stale or missing, see "Worker not registering" in the operator runbook.
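The registration handshake is just records over HTTP. As a rough sketch of the heartbeat a worker might POST to the coordinator every few seconds — the field names here are illustrative assumptions, not the real wire format:

```python
import json
import time

def make_heartbeat(did: str, endpoint: str, version: str, health: str = "healthy") -> dict:
    """Illustrative heartbeat body; the actual spl wire format may differ."""
    return {
        "did": did,              # instance identity, as shown by spl fleet list
        "endpoint": endpoint,    # where the coordinator can reach this worker
        "version": version,
        "health": health,        # "healthy" normally, "draining" on shutdown
        "ts": int(time.time()),  # lets the coordinator compute staleness
    }

hb = make_heartbeat("did:sync:instance:923a4e6d10646e27",
                    "http://127.0.0.1:9201", "0.X.Y")
print(json.dumps(hb, sort_keys=True))
```

The coordinator's live/stale/archived buckets fall out of comparing each instance's last timestamp against the heartbeat interval.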
The workers are persistent — their state roots at ~/.syncro-worker-{a,b} survive fleet restarts. Next time you run spl fleet start --workers 2, the same two workers come back with whatever records they had before. To start fresh, delete the worker home directories before launching.
Part 2 — Create a fan-out candidate
On the coordinator, look at your task backlog for something parallelizable. For this tutorial we'll use a synthetic task, but in real work you'd pick an actual multi-subtask task from your director's proposals.
Create a practice task you'll fan out:
```
spl task add "Tutorial: parallel docs writing" \
  --priority medium \
  --domain docs \
  --hypothesis "Validate spl task fan-out end-to-end on a 3-instance fleet." \
  --criteria "All 3 subtasks complete within wall-clock 5 minutes" \
  --alias TUT-001
```
```
◇ Task created: Tutorial: parallel docs writing
  thread: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  alias: TUT-001
```
The task is just a placeholder at this point — no subtasks, no fan-out spec. The decomposition lives in the next command's flags, not in the task body. (You can also pre-declare subtasks in the body file; see the fan-out reference for that form.)
Part 3 — Fan out
Fan out the placeholder task into three subtasks — two parallel writers and one synthesis step that depends on both. Each --subtask flag is a comma-separated list of key=value pairs; the engine parses them into a body subtasks array before emitting the parent INTEND.
```
spl task fan-out TUT-001 \
  --subtask "goal=Write section A on Loki integration,target=worker-a,budget=0.50,timeout=300" \
  --subtask "goal=Write section B on Prometheus integration,target=worker-b,budget=0.50,timeout=300" \
  --subtask "goal=Write summary synthesizing A and B,target=least-loaded-worker,budget=0.50,timeout=300,depends=0+1" \
  --join all
```
```
◇ Fan-out parent INTEND created
  thread: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  record: f2bb6627c70d9e723bb8f04d53dc40ae6883fdfc7421b2d3b8ab0874e3093dba
  subtasks: 3
  join: all

The engine will spawn child INTENDs to the targeted workers
and emit a join KNOW once the predicate is satisfied.

Watch progress: spl task join-status TUT-001
```
What just happened:
- The coordinator's reconciler detected the fan-out parent INTEND (your task) and scanned body.subtasks.
- For each subtask with no unsatisfied depends_on, it resolved the target (worker-a, worker-b) via the instance registry.
- It computed content-addressed child thread IDs and POSTed child genesis INTENDs to each target worker's /v1/records endpoint.
- It emitted a DO topic: "fanout_spawned" record on the parent thread to record what was spawned.
- The third subtask (with depends_on: [0, 1]) did not spawn yet — the reconciler will pick it up once the first two complete.
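The --subtask descriptor format is simple enough to sketch. Assuming plain key=value pairs split on commas, with depends=0+1 meaning indices joined by "+" (a hypothetical reading of the syntax, not the engine's actual parser):

```python
def parse_subtask(descriptor: str) -> dict:
    """Parse one --subtask descriptor into a body.subtasks entry (sketch).

    Naive comma split: assumes values themselves contain no commas,
    which holds for the descriptors used in this tutorial.
    """
    sub = {}
    for pair in descriptor.split(","):
        key, _, value = pair.partition("=")
        if key == "depends":
            # "0+1" -> depends_on: [0, 1]
            sub["depends_on"] = [int(i) for i in value.split("+")]
        elif key in ("budget", "timeout"):
            sub[key] = float(value)
        else:
            sub[key] = value
    return sub

sub = parse_subtask(
    "goal=Write summary synthesizing A and B,"
    "target=least-loaded-worker,budget=0.50,timeout=300,depends=0+1"
)
print(sub["depends_on"])  # -> [0, 1]
```

Everything downstream — routing, spawning, the barrier — operates on the resulting subtasks array, not on the flag strings.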
On each worker, the child INTEND triggered its local dispatch pipeline. The worker's agent is now executing the subtask in isolation, just like any single-instance dispatch.
Part 4 — Monitor in real time
In a second terminal, check the join state on demand:
```
spl task join-status TUT-001
```
```
Fan-out join status: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451

Subtasks:       3
Spawned:        2 / 3
Completed:      0 / 3
Join predicate: all
Join complete:  no

Per-subtask:
  [0] Write section A on Loki integration → specific:did:sync:instance:923a4e6d10646e27
      in progress
  [1] Write section B on Prometheus integration → specific:did:sync:instance:dfe40ff4272ad900
      in progress
  [2] Write summary synthesizing A and B → specific:did:sync:instance:923a4e6d10646e27
      pending spawn
```
Subtasks 0 and 1 spawned immediately because they had no dependencies. Subtask 2 sits in pending spawn because its depends=0+1 blocks it until both predecessors report accept. The reconciler picks it up automatically once that condition flips; you don't have to do anything.
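The reconciler's gating decision reduces to a pure function over subtask state. A minimal sketch, assuming depends_on index lists and sets of spawned/completed indices (a simplification of the real fold):

```python
def spawnable(subtasks: list[dict], spawned: set[int], completed: set[int]) -> list[int]:
    """Indices whose dependencies are all satisfied and that haven't spawned yet."""
    ready = []
    for i, sub in enumerate(subtasks):
        if i in spawned:
            continue  # already dispatched; nothing to do
        if all(dep in completed for dep in sub.get("depends_on", [])):
            ready.append(i)
    return ready

subtasks = [
    {"goal": "Section A"},
    {"goal": "Section B"},
    {"goal": "Summary", "depends_on": [0, 1]},
]
print(spawnable(subtasks, spawned=set(), completed=set()))    # -> [0, 1]
print(spawnable(subtasks, spawned={0, 1}, completed={0, 1}))  # -> [2]
```

On each pass the reconciler spawns whatever this returns, which is why subtask 2 flips from "pending spawn" to dispatched without any operator action.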
The continuously-refreshing dashboard view is spl fleet status --live — it shows coordinator and worker heartbeats, active dispatches per instance (alias, elapsed time, spend so far), pending fan-outs, and per-instance load. Use it instead of re-running spl fleet list by hand.
Watch the two workers execute in parallel. You should see the wall-clock time of the fan-out track max(subtask wall-times) rather than the sum — that's the parallelism payoff.
Part 5 — Trigger a mid-run failure (the kill switch drill)
Halfway through execution, pretend you realize worker-a is going down a bad path. Freeze just that namespace cleanly:
```
spl kill --namespace docs/v0.11 --reason "wrong approach on section A"
```
```
✓ soft freeze of 'docs/v0.11' recorded
```
The namespace you pass to spl kill is whatever body.namespace your fan-out parent INTEND declared. For a task with no explicit namespace it defaults to default — pick something specific in real workflows so a freeze on one slice doesn't lock out unrelated work.
What this does:
- Writes a LEARN topic: "freeze" record on th_fleet_control with level: soft.
- A CEL permission rule fires on every subsequent record write: new INTEND, DO, or CALL records in the frozen namespace get rejected with 403 NAMESPACE_FROZEN. KNOW and LEARN records still pass — in-flight dispatches drain cleanly and write their completion records.
- After the configured grace window (default 60s), the in-flight dispatch that was on worker-a completes or times out naturally.
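The gate behaves roughly like the predicate below — a Python paraphrase of the idea, not the actual CEL rule:

```python
FROZEN_BLOCKS = {"INTEND", "DO", "CALL"}  # new work: rejected while frozen
FROZEN_PASSES = {"KNOW", "LEARN"}         # results and notes: still allowed

def write_allowed(record_type: str, namespace: str, frozen: set[str]) -> bool:
    """Soft-freeze gate: block new work in a frozen namespace, let results land.

    A False here is what surfaces to clients as 403 NAMESPACE_FROZEN.
    """
    if namespace in frozen and record_type in FROZEN_BLOCKS:
        return False
    return True

frozen = {"docs/v0.11"}
print(write_allowed("INTEND", "docs/v0.11", frozen))  # -> False
print(write_allowed("KNOW", "docs/v0.11", frozen))    # -> True
print(write_allowed("DO", "docs/other", frozen))      # -> True
```

Letting KNOW and LEARN through is the design choice that makes the freeze "soft": in-flight dispatches can still write their completion records instead of dying mid-sentence.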
Verify the freeze is live:
```
spl fleet status
```
```
Fleet status
  Coordinator URL: http://127.0.0.1:9100
  Heartbeat: every 5s
  Instances: 3 live · 0 stale · 0 archived

  Active freezes:
    [soft] docs/v0.11 — wrong approach on section A

  Emergency stop: inactive
```
The namespace appears under "Active freezes" with the reason you supplied.
Now rethink, then unfreeze:
```
spl unkill --namespace docs/v0.11
```
```
✓ unfreeze of 'docs/v0.11' recorded
```
The fleet returns to normal operation. Re-issue the stuck subtask:
```
spl task retry TUT-001 --subtask 0
```
```
◇ Retry requested for subtask 0 on th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
  Watch progress: spl task join-status TUT-001
```
The retry emits a LEARN topic: "fanout_retry_subtask" record on the parent thread; the reconciler's spawn-pending walk excludes index 0 from the "already spawned" set on its next pass and re-resolves the descriptor's target via the configured routing strategy. The content-addressed child thread ID makes the second POST idempotent against the original worker — if the original landed cleanly, the worker's local store deduplicates.
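The idempotency leans on the child thread ID being derived from content rather than randomness. A sketch of what "content-addressed" buys you here — the real derivation inputs are not specified in this tutorial, so this assumes parent thread plus subtask index:

```python
import hashlib

def child_thread_id(parent_thread: str, subtask_index: int) -> str:
    """Deterministic child thread ID (illustrative derivation).

    Same inputs -> same ID, so a retried POST targets the same thread
    and the worker's local store can deduplicate the genesis INTEND.
    """
    digest = hashlib.sha256(f"{parent_thread}:{subtask_index}".encode()).hexdigest()
    return f"th_{digest}"

first = child_thread_id("th_parent", 0)
retry = child_thread_id("th_parent", 0)
print(first == retry)  # -> True: the retry is idempotent by construction
```

With a random UUID instead, every retry would open a fresh thread and the "already landed" case would produce duplicate work.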
Part 6 — Watch the barrier join
Return to the spl fleet status --live view. As the subtasks complete, their completion records flow back to the coordinator:
- Worker-a emits its local KNOW (fulfills child INTEND) after completing section A (with the correction applied).
- Worker-a POSTs a DO topic: "fanout_child_done" record back to the coordinator's parent thread. Worker-b does the same for section B.
- The coordinator's reconciler sees both fanout_child_done records and now has enough to spawn subtask 2 (the dependent one). It POSTs the third child INTEND to the least-loaded worker.
- When the third subtask completes and reports back, the coordinator runs the join predicate (all → children.all(c, c.verdict == "accept")). If satisfied, it emits a parent KNOW fulfilling the original fan-out INTEND.
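That CEL expression is a fold over child verdicts. A Python equivalent of the all predicate used in this tutorial (the any variant is a plausible alternative, not one this tutorial exercises):

```python
def join_satisfied(predicate: str, children: list[dict]) -> bool:
    """Evaluate a join predicate over completed children (sketch).

    "all" mirrors the CEL expression children.all(c, c.verdict == "accept").
    """
    verdicts = [c["verdict"] == "accept" for c in children]
    if predicate == "all":
        return len(children) > 0 and all(verdicts)
    if predicate == "any":
        return any(verdicts)
    raise ValueError(f"unknown join predicate: {predicate}")

children = [{"verdict": "accept"}, {"verdict": "accept"}, {"verdict": "accept"}]
print(join_satisfied("all", children))  # -> True
```

Because the predicate is CEL-configurable, swapping in a different barrier condition (quorum, weighted, budget-capped) is a rule edit, not a code change — the point made in the closing section.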
```
spl task show TUT-001
```
```
Task: Tutorial: parallel docs writing
Thread: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451
Alias: TUT-001
Status: approved
Priority: medium
Hypothesis: Validate spl task fan-out end-to-end on a 3-instance fleet.
Success Criteria:
  [ ] All 3 subtasks complete within wall-clock 5 minutes
Evaluation:
  Completed by: engine
  Verdict: accept (by engine)
Records: 9
```
For the per-subtask cost and wall-clock breakdown, run spl task join-status TUT-001 after the parent KNOW lands:
```
Fan-out join status: th_4afffe301dd40d5d3080b932ea6dee71fe897fa823edd03243646ddf300a7451

Subtasks:       3
Spawned:        3 / 3
Completed:      3 / 3
Join predicate: all
Join complete:  yes

Per-subtask:
  [0] Write section A on Loki integration → specific:did:sync:instance:923a4e6d10646e27
      accept ($0.42, 120s)
  [1] Write section B on Prometheus integration → specific:did:sync:instance:dfe40ff4272ad900
      accept ($0.42, 120s)
  [2] Write summary synthesizing A and B → specific:did:sync:instance:923a4e6d10646e27
      accept ($0.31, 78s)
```
The task is now in review or done state (depending on your evaluation gate config). Each subtask shows verdict, cost, and wall-clock — the raw inputs to the speedup ratio you'd compare against a serial baseline.
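With these numbers the arithmetic is short: serially the three subtasks would take 120 + 120 + 78 = 318 s, while the parallel critical path is max(120, 120) + 78 = 198 s (A and B run concurrently; the summary waits for both) — roughly a 1.6× speedup. In code:

```python
def speedup(durations: list[float], critical_path: float) -> float:
    """Serial baseline (sum of subtask wall-times) over the parallel critical path."""
    return sum(durations) / critical_path

durations = [120.0, 120.0, 78.0]  # per-subtask wall-clock from join-status
critical = max(durations[0], durations[1]) + durations[2]  # A || B, then summary
print(f"{speedup(durations, critical):.2f}x")  # -> "1.61x"
```

With two workers and a serial tail (the summary), this is also the ceiling Amdahl's law predicts for this shape of fan-out.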
Part 7 — Tear down
```
spl fleet stop --workers
```
This sends SIGTERM to both worker daemons. Each emits a final heartbeat with health: "draining" before exiting. The coordinator stays running for overnight director proposals (the common case at end-of-day).
If you want to stop everything including the coordinator:
```
spl fleet stop --all
```
Verify:
```
spl fleet list
```
It should show 0 live instances (or just the coordinator, depending on which stop you ran).
What you just demonstrated
You ran a real distributed workflow. The coordinator dispatched parallel work to two autonomous workers, handled a mid-run failure gracefully with the kill switch, and barrier-joined the results. The wall-clock vs serial ratio — how much faster the parallel run was than running the three subtasks sequentially on one instance — is the quantitative proof of the pet→cattle thesis.
More importantly, every line you typed is a protocol operation, not a script. The fan-out, the kill switch, the heartbeat, the join — all of those are records emitted on well-known threads and reconciled by CEL-configurable folds. You could rewrite any of the behavior by editing CEL rules, not by recompiling spl.
Going further
- Operator Runbook: Multi-instance fleet operations — day-2 procedures for running a fleet in production: coordinator failover, worker crash recovery, fleet-wide upgrades, kill drills.
- Guide: CEL Expressions — how to write custom join predicates and routing rules for fan-out.
- Tutorial: Your First Fan-Out — the 5-minute version. Boot a 3-instance fleet, fan out a trivial task to two workers, watch the join, inspect the speedup ratio, tear down. No real work, no LLM spend — just the shape of the thing.
- Tutorial: Your First SDK Integration — a 15-minute walk from npm install to a working Node.js script that emits records, attaches canonical references, and queries them back. Zero magic — just the kernel speaking JSON over HTTP.