Operator Runbook
Day-2 operations for running Syncropel in production — instance lifecycle, recovery from corruption, backup discipline, in-place upgrades, and how to recognize the failure modes you're about to hit.
Audience
This page is for the person on the other end of a spl serve that's actually serving real work. It's not for first-time installation (see Quickstart) or feature exploration (see Guides). It assumes you have an instance running and you need to keep it healthy, recover when it isn't, and upgrade it without losing data.
If you're reading this because something just broke, jump to Recovery.
Instance lifecycle
Starting the instance
spl serve
This forks a background process, writes a PID file to ~/.syncro/run/spl.pid, opens the SQLite store at ~/.syncro/hub.db, takes a startup backup (see Backup discipline), binds 127.0.0.1:9100, and binds a Unix socket at ~/.syncro/run/spl.sock.
Verify it came up:
spl status
curl -fsS http://localhost:9100/health
Both should report ok with the current version and record count.
Stopping the instance
spl stop
This reads ~/.syncro/run/spl.pid, sends SIGTERM to the instance process, and waits for graceful shutdown. The instance flushes the SQLite WAL, closes the socket, removes the PID file, and exits.
When spl stop says "not running" but the instance clearly is
This happens after long-running instances lose their PID file (host reboot, WSL restart, or the file was clobbered by a runaway test). The instance is still bound to port 9100, but spl stop can't find it.
Recovery:
# Find the actual process
pgrep -af "spl serve"
# Send SIGTERM directly. Use SIGKILL only as last resort.
kill <pid>
# Verify the port is free
ss -tlnp 2>/dev/null | grep 9100
Then restart cleanly.
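Before killing anything by hand, it can help to confirm which situation you are in. The following is a minimal sketch, not part of spl itself: the helper name pid_file_state is hypothetical, and it only distinguishes a missing PID file, a stale one, and one that points at a live process.

```shell
# Sketch: classify a PID file before deciding how to stop the instance.
# Prints one of: missing, stale, live.
# pid_file_state is a hypothetical helper, not an spl command.
pid_file_state() {
  local pid_file="$1" pid
  # -s: the file exists and is not empty.
  if [ ! -s "$pid_file" ]; then
    echo "missing"
    return 0
  fi
  pid=$(tr -d '[:space:]' < "$pid_file")
  # kill -0 delivers no signal; it only tests whether the PID can be signalled.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "live"
  else
    echo "stale"
  fi
}
```

Calling pid_file_state ~/.syncro/run/spl.pid and getting missing or stale while the ss check still shows a listener on 9100 is the case this section describes.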
Reading instance logs
Logs live at ~/.syncro/logs/spl.log (JSON-line format). Tail them:
tail -f ~/.syncro/logs/spl.log
The structured fields you'll care about most:
| Field | Meaning |
|---|---|
| level | ERROR, WARN, INFO, DEBUG, TRACE |
| target | The Rust module emitting the log (filter on syncropel_engine::reconciler to see routing decisions) |
| actor | The DID involved in the operation |
| thread | The thread ID being touched |
| record | Record ID for ingest events |
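The level field in the table above lends itself to a quick filter. This is a rough sketch under one loud assumption: that each JSON line serializes the field as "level":"WARN", with at most whitespace after the colon. The helper name is hypothetical, and the exact serialization should be checked against a real log line first.

```shell
# Sketch: print only the lines at one level from a JSON-lines log.
# ASSUMPTION: the level is serialized as "level":"WARN" (optional space after the colon).
# log_lines_at_level is a hypothetical helper, not an spl command.
log_lines_at_level() {
  local level="$1" log_file="$2"
  grep -E "\"level\":[[:space:]]*\"${level}\"" "$log_file"
}
```

For example, log_lines_at_level ERROR ~/.syncro/logs/spl.log would print the error lines only, if the assumption about the field format holds.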
To filter for permission denials specifically:
grep "PERMISSION DENIED" ~/.syncro/logs/spl.log
Backup discipline
CRITICAL: read this section twice. Syncropel's backup mechanism is a safety net, not a backup system. It will fail to save you if you don't supplement it with off-host copies. The recovery drill at tests/drills/recovery.sh exposed exactly this failure mode.
What the instance does for you
On every startup, spl serve checks if ~/.syncro/hub.db exists. If it does, the instance copies it to ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak. The backup directory is outside ~/.syncro/ so rm -rf ~/.syncro/ doesn't kill it.
The instance key is one of:
- instance-<did-tail> if a content-addressed instance DID is bootstrapped.
- home-<short-hash> if SYNCROPEL_HOME is set but no DID exists yet.
- The default single path ~/.local/share/syncropel/backups/hub.db.bak for the default instance.
What the instance does NOT do — and the trap to avoid
The startup backup is destructive. Every spl serve invocation overwrites the backup file with a snapshot of the current hub.db, even if the current hub.db is empty, corrupt, or wrong.
Concretely: if you delete hub.db and restart the instance, the instance comes up with an empty database, then on the next restart writes that empty database OVER your good backup. By the time you realize what happened, the backup is gone.
The recovery drill (bash tests/drills/recovery.sh) demonstrates this: it captures the backup file off-host immediately after creation, then deliberately wipes the volume and restores from the off-host copy. You should do the same in production.
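One way to blunt the trap is to copy the rolling backup aside before any restart can overwrite it. The sketch below is an illustration only: preserve_rolling_backup is a hypothetical helper, and the destination layout simply mirrors the timestamped naming used elsewhere on this page.

```shell
# Sketch: copy the rolling backup aside before a restart can overwrite it.
# Prints the path of the preserved copy; returns 1 if the backup is missing or empty.
# preserve_rolling_backup is a hypothetical helper, not an spl command.
preserve_rolling_backup() {
  local bak="$1" dest_dir="$2" stamp copy
  # -s: the file exists and is larger than zero bytes.
  [ -s "$bak" ] || return 1
  stamp=$(date +%Y%m%d-%H%M%S)
  copy="$dest_dir/hub.db.$stamp.bak"
  mkdir -p "$dest_dir" && cp -p "$bak" "$copy" && echo "$copy"
}
```

Running it against the hub.db.bak path for your instance key immediately before spl serve keeps the previous snapshot even if the restart writes an empty or corrupt one over it. It does not replace an off-host copy.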
What you should do instead
Schedule a periodic off-host backup of BOTH ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak (records, trust, config) AND ~/.syncro-data/ (task content files and alias map). The instance auto-backs up only the former; the latter is referenced by path during dispatch but not auto-snapshotted.
# Daily snapshot to a directory you trust
DEST=$HOME/backups/syncropel
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$DEST"
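# Records + trust + engine config
# (the glob assumes exactly one instance-key directory; with several, cp fails
# instead of picking one, and the default instance's backups/hub.db.bak is not
# matched at all, so name the path explicitly in those cases)
cp ~/.local/share/syncropel/backups/*/hub.db.bak \
  "$DEST/hub.db.$DATE.bak"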
# Task content files + alias map (rich-text bodies referenced by spl task dispatch)
tar -czf "$DEST/syncro-data.$DATE.tar.gz" -C "$HOME" .syncro-data
# Keep last 14 days of each
find "$DEST" -name 'hub.db.*.bak' -mtime +14 -delete
find "$DEST" -name 'syncro-data.*.tar.gz' -mtime +14 -delete
Or, if you're running in Docker, mount a host directory into the backup path so the rolling backup lives outside the container's ephemeral filesystem:
docker run -d \
  -p 9100:9100 \
  -v spl-home:/syncropel \
  -v "$HOME/backups/syncropel:/home/syncropel/.local/share/syncropel/backups" \
  syncropic/spl:dev
For added safety, a daily cron or systemd timer snapshotting that host directory to remote storage (S3, B2, SFTP, etc.) is appropriate for any instance carrying real data.
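A snapshot is only useful if it arrives intact. As a hedged sketch, the two helpers below write a SHA-256 checksum next to each copy and verify it later, for example after pulling the file back from remote storage. Both function names are hypothetical, and the code assumes GNU coreutils (sha256sum).

```shell
# Sketch: snapshot a file together with a SHA-256 checksum, and verify it later.
# ASSUMPTION: GNU coreutils sha256sum is available. Both helpers are hypothetical.
snapshot_with_checksum() {
  local src="$1" dest="$2"
  cp -p "$src" "$dest" || return 1
  # Record the checksum relative to the snapshot's directory so it still
  # verifies after the pair of files is moved elsewhere together.
  ( cd "$(dirname "$dest")" \
      && sha256sum "$(basename "$dest")" > "$(basename "$dest").sha256" )
}

verify_snapshot() {
  local dest="$1"
  # Exits non-zero if the file no longer matches its recorded checksum.
  ( cd "$(dirname "$dest")" \
      && sha256sum -c --quiet "$(basename "$dest").sha256" )
}
```

Calling verify_snapshot before any restore is a cheap guard against restoring a truncated or damaged copy.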
What does NOT need to be backed up separately
- Trust scores: derived from KNOW/DO records on startup. Will rebuild from the record log.
- Engine config: derived from LEARN records on th_engine_config. Will rebuild from the record log.
- Routing rules, fold rules, health checks, AITL rules, permission rules: all stored as LEARN records. Will rebuild.
The record log IS the truth. Everything else is a fold over it. As long as hub.db is intact, all derived state recovers automatically.
What needs separate attention
- ~/.syncro/secrets/: API keys for any LLM provider you've configured (Anthropic, OpenAI, Google, etc.). Not in the record log. If you lose this, you lose your provider keys, so back them up to a password manager or secure vault.
- ~/.syncro-data/: task content files (the rich-Markdown bodies for each task, addressed by their alias, e.g., MY-001.md) and the alias-to-thread mapping. The metadata (titles, statuses, hashes) lives in records in hub.db; the canonical content files live here on disk and are referenced by path during spl task dispatch. Backing up hub.db without ~/.syncro-data/ leaves task content stranded. Both belong in your snapshot.
Recovery from corruption or data loss
Symptom: instance won't start, panics on first ingest
Likely cause: SQLite store is corrupt. Read the panic message in ~/.syncro/logs/spl.log for the exact error.
Recovery procedure:
# 1. Stop the instance (or kill any lingering process)
spl stop || pkill -f "spl serve"
# 2. Move the corrupt store aside (don't delete — you may want it for forensics)
# Capture one timestamp so all three files share the same suffix.
TS=$(date +%s)
mv ~/.syncro/hub.db ~/.syncro/hub.db.corrupt.$TS
mv ~/.syncro/hub.db-wal ~/.syncro/hub.db-wal.corrupt.$TS 2>/dev/null || true
mv ~/.syncro/hub.db-shm ~/.syncro/hub.db-shm.corrupt.$TS 2>/dev/null || true
# 3. Restore from your OFF-HOST backup (NOT the instance's auto-backup,
# which may have been overwritten — see "Backup discipline" above)
cp $HOME/backups/syncropel/hub.db.20260412-123000.bak ~/.syncro/hub.db
# 4. Restart and verify
spl serve
spl status
If you don't have an off-host backup, the instance's auto-backup MAY still be intact:
ls -la ~/.local/share/syncropel/backups/
# pick the most recent hub.db.bak with a non-trivial size
cp ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak ~/.syncro/hub.db
But only do this if the instance hasn't already started and overwritten the backup. If the corrupt instance already ran, the auto-backup is now a snapshot of the corrupt state.
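Picking "the most recent hub.db.bak with a non-trivial size" by eye is error-prone under pressure. A minimal sketch of that selection follows. The helper name and the size threshold are assumptions, and it relies on GNU find's -printf, so it is Linux-oriented.

```shell
# Sketch: print the newest hub.db.bak under a directory that is larger than
# min_bytes. Prints nothing if no file qualifies.
# ASSUMPTION: GNU find (-printf). newest_nontrivial_backup is a hypothetical helper.
newest_nontrivial_backup() {
  local dir="$1" min_bytes="$2"
  # %T@ is the modification time in seconds since the epoch; sort newest first.
  find "$dir" -type f -name 'hub.db.bak' -size +"${min_bytes}"c -printf '%T@ %p\n' \
    | sort -rn | head -n 1 | cut -d' ' -f2-
}
```

A call such as newest_nontrivial_backup ~/.local/share/syncropel/backups 4096 prints a candidate path to inspect. A file that passes the size check can still be a snapshot of corrupt state, so the caveat above applies.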
Symptom: hub.db deleted by accident
Same procedure as above, minus step 2. Restore from off-host backup, restart.
Symptom: trust scores wrong, routing rules missing
These are derived state. They were correct the last time the instance ran cleanly. To force a rebuild:
spl stop
spl serve # rebuild_from_store() runs on startup
spl trust # verify rebuild
spl config list-rules
If they're still wrong after a restart, the underlying records are wrong (or missing). Check the record log directly:
spl thread records th_engine_config | head -50 # config history
spl thread records <thread-id-of-concern> # specific thread
Symptom: permission enforcement locked you out
You ran spl config permissions-enable without an admin allow rule in place, and now spl config permissions-disable returns 403. A pre-flight check prevents this, but you can still end up here if you're on an older version or you disabled an admin rule by mistake:
spl stop # cached config in memory will shadow our write
spl config permissions-unlock # writes permissions_enabled=false direct to store
spl serve # restart; enforcement is off
See CEL Expressions → Permission Enforcement for the full lockout trap explanation.
In-place upgrades
The supported upgrade path
# 1. Snapshot first. Always.
cp ~/.syncro/hub.db $HOME/backups/syncropel/hub.db.pre-upgrade.$(date +%s).bak
# 2. Stop the instance
spl stop
# 3. Install the new binary (atomic mv via the install script)
curl -sSf https://get.syncropic.com/spl | sh
# 4. Verify the new binary version
spl version
# 5. Restart
spl serve
# 6. Verify health + record count match pre-upgrade
spl status
The Syncropel engine's SQLite schema has been backward-compatible across 0.8.x → 0.9.x. The startup path runs CREATE TABLE IF NOT EXISTS for any new tables; existing data is untouched. If a future release requires a non-backward-compatible migration, the release notes will say so explicitly and the instance will refuse to start until you run the migration.
The upgrade drill (bash tests/drills/upgrade.sh) verifies the data-preservation property end-to-end on every release: 5 user records + 1 routing rule + 1 permission rule are all confirmed intact across instance stop → container swap → instance start on the same volume. If you're maintaining a fork or shipping a custom build, run this drill before you cut a release.
What can go wrong, and how to recognize it
- Binary download fails partway: the install script downloads to a temp file and atomically mvs into place. A failed download leaves the old binary intact. You'll see the old version on spl version.
- Instance won't start on the new binary: a panic or graceful refusal. Read the log. Roll back by reinstalling the previous version (curl -sSf https://get.syncropic.com/spl?v=0.9.1 | sh) and restarting.
- Records present but trust scores empty: the new instance rebuilt trust from records on startup. Wait a moment for rebuild_from_store() to complete; check spl trust again.
Common issues and fixes
"Address already in use" on spl serve
Another instance is already bound to port 9100. Either you have a stale spl serve running (use pgrep -af "spl serve" then kill), or another process took the port. Find it with ss -tlnp | grep 9100.
~/.syncro/run/spl.pid exists but no process
Stale PID file. Delete it and start clean:
rm -f ~/.syncro/run/spl.pid
spl serve
spl task done returns "uncommitted changes"
The task completion gate refuses to mark a task complete when the working tree has uncommitted changes — it's protecting against attributing wrong commits to the task. Either commit your work first, or use --force if you're sure (e.g. for triage of pre-existing tasks).
Health check returning non-200 even though the instance is up
The HEALTHCHECK in the docker image hits /health. If you're behind a reverse proxy, make sure the proxy isn't injecting auth that breaks the request. The /health endpoint is exempt from permission enforcement specifically so liveness probes work.
High memory growth over multi-day uptime
The 4-loop engine holds working state in memory for active threads. If you're seeing unbounded growth, capture a snapshot via spl status -o json and file it. This hasn't been observed in practice, but it's a class of bug to watch for.
Dispatch under memory pressure: read the warn log
When you dispatch a worker — spl task dispatch <SKL>, spl run <goal>, or any other path that spawns a subprocess — the instance emits one of three log lines at spawn:
DEBUG dispatch pre-flight: host memory healthy avail_gb=8.1
INFO dispatch pre-flight: host approaching memory pressure avail_gb=3.2
WARN dispatch pre-flight: host under memory pressure — OOM-kill risk; consider sequential dispatch avail_gb=1.4
The lines are visible in ~/.syncro/logs/spl.log (filter for dispatch pre-flight) and in journalctl if you're running under systemd. They're driven by MemAvailable from /proc/meminfo, classified into three bands:
| Band | Threshold | Meaning |
|---|---|---|
| Healthy | ≥ 4 GiB available | Plenty of headroom. Parallel dispatch is fine. |
| Approaching | 2-4 GiB available | Workers may run, but you're trending toward pressure. Consider whether a second parallel worker is worth it. |
| Critical | < 2 GiB available | OOM-kill risk. Drop to sequential dispatch (max_concurrent = 1 on the adapter) before you spawn more workers. |
The Critical line is the actionable one. When you see it:
- Don't start more parallel workers. The currently-spawning worker may already be at risk.
- Free memory — close idle browser tabs, idle agent sessions, anything not load-bearing. Each idle agent process commonly consumes 200-500 MB resident.
- Verify free memory after cleanup: run free -h and check available.
- If a worker was killed during the pressure window, follow the salvage procedure below.
Linux-only — the pre-flight check no-ops gracefully on macOS and Windows (no /proc/meminfo). On those platforms, watch the OS process monitor instead.
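To check the band yourself before dispatching, the thresholds from the table can be applied to MemAvailable directly. This is a sketch of that classification, not the engine's implementation. The function names are hypothetical, and how the engine treats the exact 2 GiB and 4 GiB boundaries is an assumption read off the table.

```shell
# Sketch: classify available memory (KiB, as /proc/meminfo reports it) into
# the three bands from the table. Both helpers are hypothetical.
classify_mem_band() {
  local avail_kib="$1"
  local gib=$((1024 * 1024))   # KiB per GiB
  if [ "$avail_kib" -ge $((4 * gib)) ]; then
    echo "healthy"
  elif [ "$avail_kib" -ge $((2 * gib)) ]; then
    echo "approaching"
  else
    echo "critical"
  fi
}

# Read MemAvailable from a meminfo-style file (defaults to /proc/meminfo on Linux).
mem_available_kib() {
  awk '/^MemAvailable:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

On Linux, classify_mem_band "$(mem_available_kib)" prints the band the host is currently in.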
Dispatched worker died unexpectedly: salvage procedure
If spl task show <task-id> reports Status: failed with Prior attempts: 1 failed ($0.00 total) shortly after a parallel dispatch, the worker subprocess almost certainly hit a SIGKILL — most often from the kernel OOM-killer under host memory pressure. The good news: any work the worker had finished writing to disk before the kill is still in the worktree, even though commits ahead: 0. You can usually salvage it without re-dispatching.
Step 1 — Diagnose the failure mode
spl task diagnose <task-id> | tail -30
Look for the Subprocess and Completion blocks. The OOM-kill signature is:
Subprocess
pid: <pid>
runtime: <ms> ← typically <10min
exit_code: None
signal: Some(9) ← SIGKILL
Completion
codepath: stream_eof_fallback # see body-kind reference
failure_reason: result_missing
success: false
If signal: Some(15) (SIGTERM) instead of Some(9), you're in a different failure class: process-level shutdown rather than kernel OOM. The salvage steps below still apply, but the prevention guidance at the bottom of this section won't help; investigate budget exhaustion, agent-CLI signal handling, or container/cgroup limits instead.
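If you triage several failed dispatches at once, the signal check can be scripted. The sketch below reads saved diagnose output on stdin and reports the failure class. It assumes the signal appears on a line of the form signal: Some(<n>), as in the sample above, and the helper name is hypothetical.

```shell
# Sketch: classify a failed dispatch from saved `spl task diagnose` output on stdin.
# ASSUMPTION: the signal appears on a line of the form "signal: Some(<n>)".
# classify_worker_signal is a hypothetical helper, not an spl command.
classify_worker_signal() {
  local sig
  sig=$(sed -n 's/^[[:space:]]*signal:[[:space:]]*Some(\([0-9][0-9]*\)).*/\1/p' | head -n 1)
  case "$sig" in
    9)  echo "sigkill" ;;      # likely the kernel OOM-killer; confirm with dmesg
    15) echo "sigterm" ;;      # process-level shutdown, a different failure class
    "") echo "no-signal" ;;    # no "Some(<n>)" line was found
    *)  echo "other:$sig" ;;
  esac
}
```

For example, spl task diagnose <task-id> | classify_worker_signal prints sigkill for the OOM signature shown above.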
Step 2 — Inspect the worktree
The diagnose output reports worktree: and latest commit:. Switch in and check status:
cd /path/to/<your-project>-<task-id>
git status -s
Modified or untracked files = real work the worker produced before SIGKILL. Do not delete the worktree.
Step 3 — Read the diff + decide salvage path
git diff
ls -la <new-directories-the-worker-created>
Does the work look substantively complete (worker had finished, just hadn't committed)? Or partial (mid-flight)?
- Substantively complete: proceed to Step 4.
- Partial: read the task brief at ~/.syncro-data/tasks/<task-id>.md, identify the missing acceptance-criteria items, complete them in the worktree, then proceed.
Step 4 — Run the worker end-of-task gate
From the worktree, run whatever local-CI gate your project uses. For a Rust project:
cargo fmt --all -- --check # if it complains, run `cargo fmt --all` to fix
cargo check
cargo clippy -- -D warnings
cargo nextest run # or `cargo test` if you don't have nextest
Substitute the equivalent for your stack (npm test, pytest, go test ./..., etc.). All gates must pass; the salvage is only as trustworthy as the gate.
Step 5 — Commit on the task branch
The worktree is already on a task branch. Commit there:
git add <specific-files>
git commit -m "feat(scope): <change> (<task-id>)
<commit body>
Salvaged in-session after parallel-dispatch SIGKILL on the original
worker (host OOM, signal=9 at ~Xmin). Worker had completed <work
inventory>; only <missing-item> needed adding. Worker gate green."
Step 6 — Merge to main + run merge-time gate
cd /path/to/<your-project> # main worktree
git merge task/<task-id> --ff-only # or `git cherry-pick <hash>` if main moved
# Re-run your full-workspace gates
git push origin main
Step 7 — Reopen + done + approve
The original worker's dispatch_complete left the task in failed status. Reopen and walk it through the lifecycle:
spl task reopen <task-id> --reason "salvaged in-session after worker SIGKILL"
SPL_ACTOR=did:sync:agent:dev spl task done <task-id> --summary "..." --domain code
SPL_ACTOR=did:sync:agent:reviewer spl task approve <task-id> --domain code --notes "..."
Reference the salvage commit hash + workspace gate result in the reviewer's --notes so trust evidence accumulates correctly.
Step 8 — Clean up the worktree
cd /path/to/<your-project>
git worktree remove <your-project>-<task-id> --force
The task branch persists post-merge; that's fine, it's part of the git history.
Avoiding the OOM kill in the first place
- Cap parallel dispatch by available memory. On a 16GB host already running a few claude sessions and spl serve, even max_concurrent = 2 may be too many. If MemAvailable < 4GB × N_workers, drop to max_concurrent = 1.
- Close idle agent sessions before dispatching parallel. Each background agent-CLI session commonly consumes 200-500MB resident.
- Watch dmesg after a SIGKILL incident: dmesg | grep -i 'killed process' confirms whether the OOM-killer was the culprit (vs. other SIGKILL sources like signal injection or container limits).
- Set generous timeouts. OOM doesn't care about timeout, but a longer timeout gives the worker more chance to commit before any other failure mode fires.
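The first rule of thumb in the list above is simple arithmetic and can be sketched as a helper. It treats "4GB" as 4 GiB, which is an assumption, and the function name is hypothetical. It returns the requested worker count only when the per-worker headroom is there.

```shell
# Sketch of the rule of thumb: keep the requested worker count only if at
# least 4 GiB is available per worker; otherwise fall back to 1.
# ASSUMPTION: "4GB" is read as 4 GiB. recommended_max_concurrent is hypothetical.
recommended_max_concurrent() {
  local avail_kib="$1" n_workers="$2"
  local per_worker_kib=$((4 * 1024 * 1024))
  if [ "$avail_kib" -lt $((per_worker_kib * n_workers)) ]; then
    echo 1
  else
    echo "$n_workers"
  fi
}
```

The first argument is MemAvailable in KiB, as reported in /proc/meminfo; the result is a suggestion for max_concurrent, not something the engine reads.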
Multi-instance fleet operations
This section covers multi-instance deployments. If you're running a single instance, skip it.
A "fleet" is two or more spl instances that coordinate via the instance registry. One instance is the coordinator (holds the registry thread, receives heartbeats, processes fan-out barriers). The others are workers (execute dispatched work, report heartbeats, POST completion records back to the coordinator). Any instance can play either role; the distinction is config, not code.
This section covers the operations you run day-2 on a fleet. For the first-time walkthrough, see the Parallel Dev Tutorial.
Booting a fleet
spl fleet start --workers 2
Starting local fleet: 1 coordinator + 2 workers
[coordinator] spawned PID 414153 on :9100 (home=~/.syncro)
[worker-a] spawned PID 414154 on :9201 (home=~/.syncro-worker-a)
[worker-b] spawned PID 414161 on :9202 (home=~/.syncro-worker-b)
Waiting for fleet convergence (3 live)...
✓ fleet converged
Fleet ready. Inspect with:
spl fleet list
  spl fleet status
This boots the coordinator on ~/.syncro port 9100 plus N workers on ~/.syncro-worker-{a,b,...} ports 9201, 9202, ... Each worker inherits the coordinator URL from the [fleet] config section and begins emitting heartbeats within ~15 seconds. The wrapper polls /v1/fleet/status until all expected instances are live or 30 seconds pass; if convergence times out, fall through to Worker not registering.
Verify the fleet is healthy:
spl fleet list
DID ENDPOINT STATUS VERSION UPTIME HEALTH
did:sync:instance:923a4e6d10646e27 http://127.0.0.1:9201 live 0.X.Y 0s healthy
did:sync:instance:dfe40ff4272ad900 http://127.0.0.1:9202 live 0.X.Y 0s healthy
did:sync:instance:e6f09984f9ccd634 http://127.0.0.1:9100 live 0.X.Y 0s healthy
3 live · 0 stale · 0 archived
All declared instances should appear with status live and a recent last_heartbeat. If any show stale or are missing, see Worker not registering.
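For scripted health checks, the summary line can be turned into a pass/fail result. This is a sketch under an assumption about the output format: it expects a line shaped like "3 live · 0 stale · 0 archived", as in the sample above. The helper name is hypothetical.

```shell
# Sketch: succeed only if a fleet summary line reports the expected live
# count and zero stale instances.
# ASSUMPTION: the summary looks like "3 live · 0 stale · 0 archived".
# fleet_summary_ok is a hypothetical helper, not an spl command.
fleet_summary_ok() {
  local summary="$1" expected_live="$2" live stale
  live=$(printf '%s\n' "$summary" | grep -oE '[0-9]+ live' | head -n 1 | cut -d' ' -f1)
  stale=$(printf '%s\n' "$summary" | grep -oE '[0-9]+ stale' | head -n 1 | cut -d' ' -f1)
  [ "$live" = "$expected_live" ] && [ "$stale" = "0" ]
}
```

If the summary is the last line of the spl fleet list output (another assumption), fleet_summary_ok "$(spl fleet list | tail -n 1)" 3 would exit non-zero whenever a worker is stale or missing.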
Observing the fleet
spl fleet status # snapshot of live/stale/archived counts
spl fleet status --live # continuously refreshing view
spl fleet show <did> # detailed view of one instance
spl fleet ping <did> # reachability + latency check via HTTP /health
spl fleet status aggregates everything an operator usually wants to see at a glance:
Fleet status
Coordinator URL: http://127.0.0.1:9100
Heartbeat: every 5s
Instances: 3 live · 0 stale · 0 archived
Active freezes: (none)
Emergency stop: inactive
spl fleet show <did> drills into one instance's details: endpoint, version, uptime, current dispatch count, store size, and last heartbeat:
Instance: did:sync:instance:923a4e6d10646e27
Endpoint: http://127.0.0.1:9201
Status: live
Version: 0.X.Y
Uptime: 5s
Health: healthy
Active dispatches: 0
Store records: 7
Last heartbeat: 1776119018 (unix)
spl fleet ping <did> measures reachability + latency via the worker's /health endpoint:
✓ did:sync:instance:923a4e6d10646e27 reachable in 0ms (200 OK)
For deep inspection of a specific instance, SSH into its host and run spl doctor + spl status locally against that instance's port. The fleet-level commands aggregate over HTTP; the per-instance commands hit the local socket.
Coordinator replacement (manual failover)
The coordinator is a single writer for the registry thread. If it dies, workers continue operating locally but cannot register or coordinate fan-out until a new coordinator is nominated. Automatic failover is not yet supported.
To replace a dead coordinator:
- Pick a surviving worker to promote. Any worker can play the coordinator role; they differ only in config.
- Stop the chosen worker cleanly: spl fleet stop --instance worker-a (or send SIGTERM directly if spl fleet is unreachable).
- Edit its config at ~/.syncro-worker-a/config.toml: remove the [fleet] coordinator_url line (so it no longer reports to an external coordinator), or point it at the instance's own endpoint.
- Restart: SYNCROPEL_HOME=~/.syncro-worker-a spl serve --port 9201. The promoted instance now holds the registry.
- Update remaining workers to point their [fleet] coordinator_url at the new coordinator's endpoint. Restart each.
- Verify: spl fleet list against the new coordinator should show all workers live.
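The config edit in the promotion step can be scripted so it is repeatable and leaves a backup. The sketch below assumes the key sits on its own line as coordinator_url = "..." and that no other section uses the same key name. The helper name is hypothetical.

```shell
# Sketch: remove the coordinator_url key from a worker config, keeping a .bak copy.
# ASSUMPTION: the key sits on its own line as `coordinator_url = "..."`,
# and no other section of the file uses the same key name.
# drop_coordinator_url is a hypothetical helper, not an spl command.
drop_coordinator_url() {
  local config="$1"
  # -i.bak edits in place and saves the original next to it as <config>.bak.
  sed -i.bak '/^[[:space:]]*coordinator_url[[:space:]]*=/d' "$config"
}
```

Run it only while the worker is stopped, for example drop_coordinator_url ~/.syncro-worker-a/config.toml, and diff the result against config.toml.bak before restarting.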