Operator Runbook
Day-2 operations for running Syncropel in production — daemon lifecycle, recovery from corruption, backup discipline, in-place upgrades, and how to recognize the failure modes you're about to hit.
Audience
This page is for the person on the other end of a spl serve --daemon that's actually serving real work. It's not for first-time installation (see Quickstart) or feature exploration (see Guides). It assumes you have an instance running and you need to keep it healthy, recover when it isn't, and upgrade it without losing data.
If you're reading this because something just broke, jump to Recovery.
Daemon lifecycle
Starting the daemon
```bash
spl serve --daemon
```
This forks a background process, writes a PID file to `~/.syncro/run/spl.pid`, opens the SQLite store at `~/.syncro/hub.db`, takes a startup backup (see Backup discipline), binds `127.0.0.1:9100`, and binds a Unix socket at `~/.syncro/run/spl.sock`.
Verify it came up:
```bash
spl status
curl -fsS http://localhost:9100/health
```
Both should report ok with the current version and record count.
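If `spl status` itself is unresponsive, you can check the startup artifacts listed above directly. A minimal sketch, assuming the default paths from this section (adjust if `SYNCROPEL_HOME` points elsewhere):
```bash
# Startup artifacts spl serve --daemon is expected to leave behind (default paths).
test -f ~/.syncro/run/spl.pid && echo "pid file: $(cat ~/.syncro/run/spl.pid)"
test -S ~/.syncro/run/spl.sock && echo "unix socket present"
ss -tln 2>/dev/null | grep -q ':9100 ' && echo "port 9100 bound"
```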
Stopping the daemon
```bash
spl serve --stop
```
This reads `~/.syncro/run/spl.pid`, sends SIGTERM to the daemon process, and waits for graceful shutdown. The daemon flushes the SQLite WAL, closes the socket, removes the PID file, and exits.
When spl serve --stop says "not running" but the daemon clearly is
This happens after long-running daemons lose their PID file (host reboot, WSL restart, or the file was clobbered by a runaway test). The daemon is still bound to port 9100 but --stop can't find it.
Recovery:
```bash
# Find the actual process
pgrep -af "spl serve"
# Send SIGTERM directly. Use SIGKILL only as a last resort.
kill <pid>
# Verify the port is free
ss -tlnp 2>/dev/null | grep 9100
```
Then restart cleanly.
Reading daemon logs
Logs live at ~/.syncro/logs/spl.log (JSON-line format). Tail them:
```bash
tail -f ~/.syncro/logs/spl.log
```
The structured fields you'll care about most:
| Field | Meaning |
|---|---|
| `level` | ERROR, WARN, INFO, DEBUG, TRACE |
| `target` | The Rust module emitting the log (filter on `syncropel_engine::reconciler` to see routing decisions) |
| `actor` | The DID involved in the operation |
| `thread` | The thread ID being touched |
| `record` | Record ID for ingest events |
To filter for permission denials specifically:
grep "PERMISSION DENIED" ~/.syncro/logs/spl.logBackup discipline
Backup discipline
CRITICAL — read this section twice. Syncropel's backup mechanism is a safety net, not a backup system. It will fail to save you if you don't supplement it with off-host copies. The recovery drill at `tests/drills/recovery.sh` exposed exactly this failure mode.
What the daemon does for you
On every startup, spl serve checks if ~/.syncro/hub.db exists. If it does, the daemon copies it to ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak. The backup directory is outside ~/.syncro/ so rm -rf ~/.syncro/ doesn't kill it.
The instance key is one of:
- `instance-<did-tail>` if a content-addressed instance DID is bootstrapped.
- `home-<short-hash>` if `SYNCROPEL_HOME` is set but no DID exists yet.
- The default single path `~/.local/share/syncropel/backups/hub.db.bak` for the default instance.
What the daemon does NOT do — and the trap to avoid
The startup backup is destructive. Every spl serve --daemon invocation overwrites the backup file with a snapshot of the current hub.db, even if the current hub.db is empty, corrupt, or wrong.
Concretely: if you delete hub.db and restart the daemon, the daemon comes up with an empty database, then on the next restart writes that empty database OVER your good backup. By the time you realize what happened, the backup is gone.
The recovery drill (bash tests/drills/recovery.sh) demonstrates this: it captures the backup file off-host immediately after creation, then deliberately wipes the volume and restores from the off-host copy. You should do the same in production.
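Because a restart can silently replace a good backup with an empty snapshot, it can be worth a pre-flight check before starting the daemon after anything unusual. A minimal sketch, assuming the default store and backup paths from this section; the 4 KiB threshold is arbitrary:
```bash
# Pre-start guard: refuse to start if hub.db is missing/tiny while a larger backup exists.
# stat -c is GNU/Linux; use 'stat -f %z' on macOS.
DB=~/.syncro/hub.db
BAK=$(ls -S ~/.local/share/syncropel/backups/*/hub.db.bak \
            ~/.local/share/syncropel/backups/hub.db.bak 2>/dev/null | head -1)
db_size=$(stat -c %s "$DB" 2>/dev/null || echo 0)
bak_size=$(stat -c %s "$BAK" 2>/dev/null || echo 0)
if [ "$db_size" -lt 4096 ] && [ "$bak_size" -gt "$db_size" ]; then
  echo "hub.db looks empty but a larger backup exists; investigate before starting" >&2
else
  spl serve --daemon
fi
```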
What you should do instead
Schedule a periodic off-host backup of ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak:
```bash
# Daily snapshot to a directory you trust
DEST=$HOME/backups/syncropel
mkdir -p "$DEST"
cp ~/.local/share/syncropel/backups/*/hub.db.bak \
   "$DEST/hub.db.$(date +%Y%m%d-%H%M%S).bak"
# Keep last 14 days
find "$DEST" -name 'hub.db.*.bak' -mtime +14 -delete
```
Or, if you're running in Docker, mount a host directory into the backup path so the rolling backup lives outside the container's ephemeral filesystem:
```bash
docker run -d \
  -p 9100:9100 \
  -v spl-home:/syncropel \
  -v $HOME/backups/syncropel:/home/syncropel/.local/share/syncropel/backups \
  syncropic/spl:dev
```
For added safety, a daily cron or systemd timer snapshotting that host directory to remote storage (S3, B2, SFTP, etc.) is appropriate for any instance carrying real data.
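To push that host directory off the machine entirely, a cron entry plus rsync over SSH is usually enough. A sketch assuming the `$HOME/backups/syncropel` directory from the example above; `backup-host` and the destination path are placeholders, and an object-storage sync works just as well:
```bash
# Add via 'crontab -e': daily 02:15 mirror of the snapshot directory to a remote host.
# --delete propagates the local 14-day retention policy to the remote copy.
15 2 * * * rsync -az --delete "$HOME/backups/syncropel/" backup-host:backups/syncropel/
```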
What does NOT need to be backed up separately
- Trust scores: derived from KNOW/DO records on startup. Will rebuild from the record log.
- Engine config: derived from LEARN records on `th_engine_config`. Will rebuild from the record log.
- Routing rules, fold rules, health checks, AITL rules, permission rules: all stored as LEARN records. Will rebuild.
The record log IS the truth. Everything else is a fold over it. As long as hub.db is intact, all derived state recovers automatically.
What needs separate attention
- `~/.syncro/secrets/`: API keys for any LLM provider you've configured (Anthropic, OpenAI, Google, etc.). Not in the record log. If you lose this, you lose your provider keys — back them up to a password manager or secure vault.
- `~/.syncro-data/` (optional): task content files and aliases used by earlier CLI versions. The current Rust kernel stores all this in records. Only matters if you've been running an older CLI alongside.
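Since `~/.syncro/secrets/` is not covered by the record log or the hub.db backups, treat it like any other credential material. A sketch of an encrypted snapshot using gpg symmetric encryption, assuming gpg is installed; the destination path is a placeholder:
```bash
# Encrypted tarball of the secrets directory; you'll be prompted for a passphrase.
tar -C ~/.syncro -czf - secrets \
  | gpg --symmetric --cipher-algo AES256 \
        -o "$HOME/backups/syncropel/secrets.$(date +%Y%m%d).tar.gz.gpg"
# Restore with: gpg -d secrets.YYYYMMDD.tar.gz.gpg | tar -C ~/.syncro -xzf -
```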
Recovery from corruption or data loss
Symptom: daemon won't start, panics on first ingest
Likely cause: SQLite store is corrupt. Read the panic message in ~/.syncro/logs/spl.log for the exact error.
Recovery procedure:
```bash
# 1. Stop the daemon (or kill any lingering process)
spl serve --stop || pkill -f "spl serve"
# 2. Move the corrupt store aside (don't delete — you may want it for forensics)
mv ~/.syncro/hub.db ~/.syncro/hub.db.corrupt.$(date +%s)
mv ~/.syncro/hub.db-wal ~/.syncro/hub.db-wal.corrupt.$(date +%s) 2>/dev/null || true
mv ~/.syncro/hub.db-shm ~/.syncro/hub.db-shm.corrupt.$(date +%s) 2>/dev/null || true
# 3. Restore from your OFF-HOST backup (NOT the daemon's auto-backup,
#    which may have been overwritten — see "Backup discipline" above)
cp $HOME/backups/syncropel/hub.db.20260412-123000.bak ~/.syncro/hub.db
# 4. Restart and verify
spl serve --daemon
spl status
```
If you don't have an off-host backup, the daemon's auto-backup MAY still be intact:
```bash
ls -la ~/.local/share/syncropel/backups/
# pick the most recent hub.db.bak with a non-trivial size
cp ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak ~/.syncro/hub.db
```
But only do this if the daemon hasn't already started and overwritten the backup. If the corrupt daemon already ran, the auto-backup is now a snapshot of the corrupt state.
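Whichever copy you restore from, it's cheap to sanity-check it before handing it to the daemon. A sketch assuming the sqlite3 CLI is available on the host:
```bash
# Should print "ok"; anything else means the restored file is itself damaged.
sqlite3 ~/.syncro/hub.db 'PRAGMA integrity_check;'
# List the tables as a quick plausibility check against a known-good instance.
sqlite3 ~/.syncro/hub.db '.tables'
```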
Symptom: hub.db deleted by accident
Same procedure as above, minus step 2. Restore from off-host backup, restart.
Symptom: trust scores wrong, routing rules missing
These are derived state. They were correct the last time the daemon ran cleanly. To force a rebuild:
```bash
spl serve --stop
spl serve --daemon    # rebuild_from_store() runs on startup
spl trust             # verify rebuild
spl config list-rules
```
If they're still wrong after a restart, the underlying records are wrong (or missing). Check the record log directly:
```bash
spl thread records th_engine_config           # config history
spl thread records <thread-id-of-concern>     # specific thread
```
Symptom: permission enforcement locked you out
You enabled spl config permissions-enable without an admin allow rule, and now spl config permissions-disable returns 403. This is prevented by a pre-flight check, but if you're on an older version or you disabled an admin rule by mistake:
```bash
spl serve --stop                # cached config in memory will shadow our write
spl config permissions-unlock   # writes permissions_enabled=false direct to store
spl serve --daemon              # restart — enforcement is off
```
See CEL Expressions → Permission Enforcement for the full lockout trap explanation.
In-place upgrades
The supported upgrade path
```bash
# 1. Snapshot first. Always.
cp ~/.syncro/hub.db $HOME/backups/syncropel/hub.db.pre-upgrade.$(date +%s).bak
# 2. Stop the daemon
spl serve --stop
# 3. Install the new binary (atomic mv via the install script)
curl -sSf https://get.syncropic.com/spl | sh
# 4. Verify the new binary version
spl version
# 5. Restart
spl serve --daemon
# 6. Verify health + record count match pre-upgrade
spl status
```
The Rust kernel's SQLite schema has been backward-compatible across 0.8.x → 0.9.x. The startup path runs `CREATE TABLE IF NOT EXISTS` for any new tables; existing data is untouched. If a future release requires a non-backward-compatible migration, the release notes will say so explicitly and the daemon will refuse to start until you run the migration.
The upgrade drill (bash tests/drills/upgrade.sh) verifies the data-preservation property end-to-end on every release: 5 user records + 1 routing rule + 1 permission rule are all confirmed intact across daemon stop → container swap → daemon start on the same volume. If you're maintaining a fork or shipping a custom build, run this drill before you cut a release.
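If you want the step-6 verification to be mechanical rather than eyeballed, capture the record count before and after the upgrade. The JSON shape of `spl status -o json` isn't documented here, so the jq path below is an assumption; replace it with whatever key actually carries the count:
```bash
# Hypothetical field name (.record_count): check 'spl status -o json' for the real key.
before=$(spl status -o json | jq -r '.record_count')
# ... stop, install, restart as in steps 2-5 ...
after=$(spl status -o json | jq -r '.record_count')
[ "$before" = "$after" ] && echo "record count preserved ($before)" \
                         || echo "MISMATCH: $before -> $after"
```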
What can go wrong, and how to recognize it
- Binary download fails partway: the install script downloads to a temp file and atomically `mv`s into place. A failed download leaves the old binary intact. You'll see the old version on `spl version`.
- Daemon won't start on the new binary: a panic or graceful refusal. Read the log. Roll back by reinstalling the previous version (`curl -sSf https://get.syncropic.com/spl?v=0.9.1 | sh`) and restarting.
- Records present but trust scores empty: the new daemon rebuilt trust from records on startup. Wait a moment for `rebuild_from_store()` to complete; check `spl trust` again.
Common issues and fixes
"Address already in use" on spl serve
Another daemon is already bound to port 9100. Either you have a stale spl serve running (use pgrep -af "spl serve" then kill), or another process took the port. Find it with ss -tlnp | grep 9100.
~/.syncro/run/spl.pid exists but no process
Stale PID file. Delete it and start clean:
```bash
rm -f ~/.syncro/run/spl.pid
spl serve --daemon
```
spl task done returns "uncommitted changes"
The task completion gate refuses to mark a task complete when the working tree has uncommitted changes — it's protecting against attributing wrong commits to the task. Either commit your work first, or use --force if you're sure (e.g. for triage of pre-existing tasks).
Health check returning non-200 even though the daemon is up
The HEALTHCHECK in the docker image hits /health. If you're behind a reverse proxy, make sure the proxy isn't injecting auth that breaks the request. The /health endpoint is exempt from permission enforcement specifically so liveness probes work.
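A quick way to tell whether the proxy is the problem is to compare a proxied request with a direct one from the daemon's host. A sketch; `proxy-host` is a placeholder for your reverse proxy's address:
```bash
# Direct to the daemon (should return 200)
curl -s -o /dev/null -w 'direct:  %{http_code}\n' http://localhost:9100/health
# Through the reverse proxy (placeholder hostname)
curl -s -o /dev/null -w 'proxied: %{http_code}\n' http://proxy-host/health
```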
High memory growth over multi-day uptime
The 4-loop kernel holds working state in memory for active threads. If you're seeing unbounded growth, capture a snapshot via spl status -o json and file it. This hasn't been observed in practice, but it's a class of bug to watch for.
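To produce evidence for that report, sample the daemon's resident set size over time. A sketch assuming the PID file path from earlier in this page and a Linux-style ps; run it under nohup or a terminal multiplexer and stop it with Ctrl-C:
```bash
# Append an RSS sample (KiB) every 10 minutes; attach the resulting file to the report.
while sleep 600; do
  echo "$(date -Is) $(ps -o rss= -p "$(cat ~/.syncro/run/spl.pid)")" >> ~/spl-rss.log
done
```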
Multi-instance fleet operations
This section covers multi-instance deployments. If you're running a single daemon, skip it.
A "fleet" is two or more spl instances that coordinate via the instance registry. One instance is the coordinator (holds the registry thread, receives heartbeats, processes fan-out barriers). The others are workers (execute dispatched work, report heartbeats, POST completion records back to the coordinator). Any instance can play either role; the distinction is config, not code.
This section covers the operations you run day-2 on a fleet. For the first-time walkthrough, see the Parallel Dev Tutorial.
Booting a fleet
```bash
spl fleet start --workers 2
```
```
Starting local fleet: 1 coordinator + 2 workers
[coordinator] spawned PID 414153 on :9100 (home=~/.syncro)
[worker-a] spawned PID 414154 on :9201 (home=~/.syncro-worker-a)
[worker-b] spawned PID 414161 on :9202 (home=~/.syncro-worker-b)
Waiting for fleet convergence (3 live)...
✓ fleet converged
Fleet ready. Inspect with:
  spl fleet list
  spl fleet status
```
This boots the coordinator on `~/.syncro` port 9100 plus N workers on `~/.syncro-worker-{a,b,...}` ports 9201, 9202, ... Each worker inherits the coordinator URL from the `[fleet]` config section and begins emitting heartbeats within ~15 seconds. The wrapper polls `/v1/fleet/status` until all expected instances are live or 30 seconds pass; if convergence times out, fall through to Worker not registering.
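If convergence looks slow, you can poll the same endpoint the wrapper uses by hand. A sketch assuming the coordinator is on the default port and jq is installed; jq here only pretty-prints whatever `/v1/fleet/status` returns:
```bash
# Manually watch convergence; Ctrl-C once all expected instances show up as live.
watch -n 2 'curl -fsS http://127.0.0.1:9100/v1/fleet/status | jq .'
```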
Verify the fleet is healthy:
```bash
spl fleet list
```
```
DID                                  ENDPOINT               STATUS  VERSION  UPTIME  HEALTH
did:sync:instance:923a4e6d10646e27  http://127.0.0.1:9201  live    0.X.Y    0s      healthy
did:sync:instance:dfe40ff4272ad900  http://127.0.0.1:9202  live    0.X.Y    0s      healthy
did:sync:instance:e6f09984f9ccd634  http://127.0.0.1:9100  live    0.X.Y    0s      healthy

3 live · 0 stale · 0 archived
```
All declared instances should appear with status live and a recent last_heartbeat. If any show stale or are missing, see Worker not registering.
Observing the fleet
```bash
spl fleet status         # snapshot of live/stale/archived counts
spl fleet status --live  # continuously refreshing view
spl fleet show <did>     # detailed view of one instance
spl fleet ping <did>     # reachability + latency check via HTTP /health
```
`spl fleet status` aggregates everything an operator usually wants to see at a glance:
```
Fleet status
Coordinator URL: http://127.0.0.1:9100
Heartbeat: every 5s
Instances: 3 live · 0 stale · 0 archived
Active freezes: (none)
Emergency stop: inactive
```
`spl fleet show <did>` drills into one instance's details — endpoint, version, uptime, current dispatch count, store size, and last heartbeat:
```
Instance: did:sync:instance:923a4e6d10646e27
Endpoint: http://127.0.0.1:9201
Status: live
Version: 0.X.Y
Uptime: 5s
Health: healthy
Active dispatches: 0
Store records: 7
Last heartbeat: 1776119018 (unix)
```
`spl fleet ping <did>` measures reachability + latency via the worker's /health endpoint:
```
✓ did:sync:instance:923a4e6d10646e27 reachable in 0ms (200 OK)
```
For deep inspection of a specific instance, SSH into its host and run `spl doctor` + `spl status` locally against that instance's port. The fleet-level commands aggregate over HTTP; the per-instance commands hit the local socket.
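For a locally booted fleet (the `spl fleet start` layout above), you can run the per-instance commands against each worker's home without SSH. A sketch assuming the `~/.syncro-worker-*` homes created by the wrapper; for remote workers, run the same commands over SSH on their hosts:
```bash
# Local per-instance sweep for a fleet booted with 'spl fleet start'.
for home in ~/.syncro ~/.syncro-worker-*; do
  [ -d "$home" ] || continue
  echo "== $home =="
  SYNCROPEL_HOME="$home" spl doctor
  SYNCROPEL_HOME="$home" spl status
done
```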
Coordinator replacement (manual failover)
The coordinator is a single writer for the registry thread. If it dies, workers continue operating locally but cannot register or coordinate fan-out until a new coordinator is nominated. Automatic failover is not yet supported.
To replace a dead coordinator:
1. Pick a surviving worker to promote. Any worker can play the coordinator role — they differ only in config.
2. Stop the chosen worker cleanly: `spl fleet stop --instance worker-a` (or send SIGTERM directly if `spl fleet` is unreachable).
3. Edit its config at `~/.syncro-worker-a/config.toml`: remove the `[fleet] coordinator_url` line (so it no longer reports to an external coordinator), or leave it pointing at its own endpoint.
4. Restart: `SYNCROPEL_HOME=~/.syncro-worker-a spl serve --daemon --port 9201`. The promoted instance now holds the registry.
5. Update remaining workers to point their `[fleet] coordinator_url` at the new coordinator's endpoint. Restart each.
6. Verify: `spl fleet list` against the new coordinator should show all workers live. (A scripted sketch of steps 2-4 follows this list.)
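A scripted sketch of steps 2 through 4 for promoting worker-a, under the assumptions above (local worker home at `~/.syncro-worker-a`, port 9201, and a `coordinator_url` key in its `[fleet]` section). Review each command before running it against a real fleet:
```bash
# Promote worker-a: stop it, drop its coordinator_url, restart it as the registry holder.
WORKER_HOME=~/.syncro-worker-a
spl fleet stop --instance worker-a
# Stop reporting to the dead coordinator; sed keeps a .bak copy of the original config.
sed -i.bak 's/^coordinator_url/# coordinator_url/' "$WORKER_HOME/config.toml"
SYNCROPEL_HOME="$WORKER_HOME" spl serve --daemon --port 9201
# After repointing the remaining workers (step 5), verify against the new coordinator:
SYNCROPEL_HOME="$WORKER_HOME" spl fleet list
```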
Programmatic Agents — agents that write code
Integrate Syncropel with agents that accomplish tasks by writing and executing code. One context round-trip for N operations. Works with any code-execution-capable LLM, Workers agents, and any sandboxed code-generation harness.
Keeping Your Instance Running
Make `spl serve` start automatically at login, restart after crashes, and survive reboots — systemd user unit on Linux, launchd plist on macOS, Windows Service on Windows.