Operator Runbook

Day-2 operations for running Syncropel in production — daemon lifecycle, recovery from corruption, backup discipline, in-place upgrades, and how to recognize the failure modes you're about to hit.

Audience

This page is for the person on the other end of a spl serve --daemon that's actually serving real work. It's not for first-time installation (see Quickstart) or feature exploration (see Guides). It assumes you have an instance running and you need to keep it healthy, recover when it isn't, and upgrade it without losing data.

If you're reading this because something just broke, jump to Recovery.

Daemon lifecycle

Starting the daemon

spl serve --daemon

This forks a background process, writes a PID file to ~/.syncro/run/spl.pid, opens the SQLite store at ~/.syncro/hub.db, takes a startup backup (see Backup discipline), binds 127.0.0.1:9100, and binds a Unix socket at ~/.syncro/run/spl.sock.
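
A quick spot-check of those startup artifacts (paths as above; this is a sketch, not an official subcommand):

test -f ~/.syncro/run/spl.pid  && echo "pid file ok"
test -S ~/.syncro/run/spl.sock && echo "unix socket ok"
ss -tlnp 2>/dev/null | grep 9100   # TCP listener on 127.0.0.1:9100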

Verify it came up:

spl status
curl -fsS http://localhost:9100/health

Both should report ok with the current version and record count.

Stopping the daemon

spl serve --stop

This reads ~/.syncro/run/spl.pid, sends SIGTERM to the daemon process, and waits for graceful shutdown. The daemon flushes the SQLite WAL, closes the socket, removes the PID file, and exits.
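
If you ever need to replicate that sequence by hand (say, from a host where spl isn't on PATH), a minimal sketch — the 30-second wait is an assumption, not a documented timeout:

PID=$(cat ~/.syncro/run/spl.pid)
kill -TERM "$PID"
for i in $(seq 1 30); do               # give the WAL flush time to finish
  kill -0 "$PID" 2>/dev/null || break  # process gone = clean exit
  sleep 1
done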

When spl serve --stop says "not running" but the daemon clearly is

This happens after long-running daemons lose their PID file (host reboot, WSL restart, or the file was clobbered by a runaway test). The daemon is still bound to port 9100 but --stop can't find it.

Recovery:

# Find the actual process
pgrep -af "spl serve"

# Send SIGTERM directly. Use SIGKILL only as last resort.
kill <pid>

# Verify the port is free
ss -tlnp 2>/dev/null | grep 9100

Then restart cleanly.

Reading daemon logs

Logs live at ~/.syncro/logs/spl.log (JSON-line format). Tail them:

tail -f ~/.syncro/logs/spl.log

The structured fields you'll care about most:

Field     Meaning
level     ERROR, WARN, INFO, DEBUG, TRACE
target    The Rust module emitting the log (filter on syncropel_engine::reconciler to see routing decisions)
actor     The DID involved in the operation
thread    The thread ID being touched
record    Record ID for ingest events

To filter for permission denials specifically:

grep "PERMISSION DENIED" ~/.syncro/logs/spl.log

Backup discipline

CRITICAL — read this section twice. Syncropel's backup mechanism is a safety net, not a backup system. It will fail to save you if you don't supplement it with off-host copies. The recovery drill at tests/drills/recovery.sh exposed exactly this failure mode.

What the daemon does for you

On every startup, spl serve checks if ~/.syncro/hub.db exists. If it does, the daemon copies it to ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak. The backup directory is outside ~/.syncro/ so rm -rf ~/.syncro/ doesn't kill it.

The instance key is one of:

  1. instance-<did-tail> if a content-addressed instance DID is bootstrapped.
  2. home-<short-hash> if SYNCROPEL_HOME is set but no DID exists yet.
  3. The flat path ~/.local/share/syncropel/backups/hub.db.bak (no subdirectory) for the default instance.
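
Not sure which key your instance uses? List the backup root and look at what's there:

ls -la ~/.local/share/syncropel/backups/
# a subdirectory named instance-<did-tail> or home-<short-hash> means a keyed
# instance; a bare hub.db.bak at this level means the default instance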

What the daemon does NOT do — and the trap to avoid

The startup backup is destructive. Every spl serve --daemon invocation overwrites the backup file with a snapshot of the current hub.db, even if the current hub.db is empty, corrupt, or wrong.

Concretely: if you delete hub.db and restart the daemon, the daemon comes up with an empty database, then on the next restart writes that empty database OVER your good backup. By the time you realize what happened, the backup is gone.

The recovery drill (bash tests/drills/recovery.sh) demonstrates this: it captures the backup file off-host immediately after creation, then deliberately wipes the volume and restores from the off-host copy. You should do the same in production.

What you should do instead

Schedule a periodic off-host backup of ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak:

# Daily snapshot to a directory you trust
DEST=$HOME/backups/syncropel
mkdir -p "$DEST"
cp ~/.local/share/syncropel/backups/*/hub.db.bak \
   "$DEST/hub.db.$(date +%Y%m%d-%H%M%S).bak"

# Keep last 14 days
find "$DEST" -name 'hub.db.*.bak' -mtime +14 -delete

Or, if you're running in Docker, mount a host directory into the backup path so the rolling backup lives outside the container's ephemeral filesystem:

docker run -d \
  -p 9100:9100 \
  -v spl-home:/syncropel \
  -v $HOME/backups/syncropel:/home/syncropel/.local/share/syncropel/backups \
  syncropic/spl:dev

For added safety, a daily cron or systemd timer snapshotting that host directory to remote storage (S3, B2, SFTP, etc.) is appropriate for any instance carrying real data.
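
A minimal sketch of that remote step — the AWS CLI and bucket name are placeholders; any off-host sync tool works:

# crontab entry: push the snapshot directory off-host nightly at 03:10
10 3 * * * aws s3 sync "$HOME/backups/syncropel" s3://my-bucket/syncropel-backups/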

What does NOT need to be backed up separately

  • Trust scores: derived from KNOW/DO records on startup. Will rebuild from the record log.
  • Engine config: derived from LEARN records on th_engine_config. Will rebuild from the record log.
  • Routing rules, fold rules, health checks, AITL rules, permission rules: all stored as LEARN records. Will rebuild.

The record log IS the truth. Everything else is a fold over it. As long as hub.db is intact, all derived state recovers automatically.

What needs separate attention

  • ~/.syncro/secrets/: API keys for any LLM provider you've configured (Anthropic, OpenAI, Google, etc.). Not in the record log. If you lose this, you lose your provider keys — back them up to a password manager or secure vault.
  • ~/.syncro-data/ (optional): task content files and aliases used by earlier CLI versions. The current Rust kernel stores all this in records. Only matters if you've been running an older CLI alongside.

Recovery from corruption or data loss

Symptom: daemon won't start, panics on first ingest

Likely cause: SQLite store is corrupt. Read the panic message in ~/.syncro/logs/spl.log for the exact error.

Recovery procedure:

# 1. Stop the daemon (or kill any lingering process)
spl serve --stop || pkill -f "spl serve"

# 2. Move the corrupt store aside (don't delete — you may want it for forensics)
mv ~/.syncro/hub.db ~/.syncro/hub.db.corrupt.$(date +%s)
mv ~/.syncro/hub.db-wal ~/.syncro/hub.db-wal.corrupt.$(date +%s) 2>/dev/null || true
mv ~/.syncro/hub.db-shm ~/.syncro/hub.db-shm.corrupt.$(date +%s) 2>/dev/null || true

# 3. Restore from your OFF-HOST backup (NOT the daemon's auto-backup,
#    which may have been overwritten — see "Backup discipline" above)
cp $HOME/backups/syncropel/hub.db.20260412-123000.bak ~/.syncro/hub.db

# 4. Restart and verify
spl serve --daemon
spl status

If you don't have an off-host backup, the daemon's auto-backup MAY still be intact:

ls -la ~/.local/share/syncropel/backups/
# pick the most recent hub.db.bak with a non-trivial size
cp ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak ~/.syncro/hub.db

But only do this if the daemon hasn't already started and overwritten the backup. If the corrupt daemon already ran, the auto-backup is now a snapshot of the corrupt state.
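
Before trusting any backup file, sanity-check it — a sketch assuming the sqlite3 CLI is installed:

stat -c '%s %y' ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak   # size + mtime
sqlite3 ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak 'PRAGMA integrity_check;'
# should print: ok — anything else means this backup is damaged too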

Symptom: hub.db deleted by accident

Same procedure as above, minus step 2. Restore from off-host backup, restart.

Symptom: trust scores wrong, routing rules missing

These are derived state. They were correct the last time the daemon ran cleanly. To force a rebuild:

spl serve --stop
spl serve --daemon  # rebuild_from_store() runs on startup
spl trust           # verify rebuild
spl config list-rules

If they're still wrong after a restart, the underlying records are wrong (or missing). Check the record log directly:

spl thread records th_engine_config | head -50  # config history
spl thread records <thread-id-of-concern>       # specific thread

Symptom: permission enforcement locked you out

You ran spl config permissions-enable without an admin allow rule in place, and now spl config permissions-disable returns 403. This is prevented by a pre-flight check, but if you're on an older version or you disabled an admin rule by mistake:

spl serve --stop                  # cached config in memory will shadow our write
spl config permissions-unlock     # writes permissions_enabled=false direct to store
spl serve --daemon                # restart — enforcement is off

See CEL Expressions → Permission Enforcement for the full lockout trap explanation.

In-place upgrades

The supported upgrade path

# 1. Snapshot first. Always.
cp ~/.syncro/hub.db $HOME/backups/syncropel/hub.db.pre-upgrade.$(date +%s).bak

# 2. Stop the daemon
spl serve --stop

# 3. Install the new binary (atomic mv via the install script)
curl -sSf https://get.syncropic.com/spl | sh

# 4. Verify the new binary version
spl version

# 5. Restart
spl serve --daemon

# 6. Verify health + record count match pre-upgrade
spl status

The Rust kernel's SQLite schema has been backward-compatible across 0.8.x → 0.9.x. The startup path runs CREATE TABLE IF NOT EXISTS for any new tables; existing data is untouched. If a future release requires a non-backward-compatible migration, the release notes will say so explicitly and the daemon will refuse to start until you run the migration.

The upgrade drill (bash tests/drills/upgrade.sh) verifies the data-preservation property end-to-end on every release: 5 user records + 1 routing rule + 1 permission rule are all confirmed intact across daemon stop → container swap → daemon start on the same volume. If you're maintaining a fork or shipping a custom build, run this drill before you cut a release.

What can go wrong, and how to recognize it

  • Binary download fails partway: the install script downloads to a temp file and atomically mvs into place. A failed download leaves the old binary intact. You'll see the old version on spl version.
  • Daemon won't start on the new binary: a panic or graceful refusal. Read the log. Roll back by reinstalling the previous version (curl -sSf https://get.syncropic.com/spl?v=0.9.1 | sh) and restarting.
  • Records present but trust scores empty: the new daemon rebuilds trust from records on startup. Give rebuild_from_store() a moment to complete, then check spl trust again.

Common issues and fixes

"Address already in use" on spl serve

Another daemon is already bound to port 9100. Either you have a stale spl serve running (use pgrep -af "spl serve" then kill), or another process took the port. Find it with ss -tlnp | grep 9100.

~/.syncro/run/spl.pid exists but no process

Stale PID file. Delete it and start clean:

rm -f ~/.syncro/run/spl.pid
spl serve --daemon

spl task done returns "uncommitted changes"

The task completion gate refuses to mark a task complete when the working tree has uncommitted changes — it's protecting against attributing wrong commits to the task. Either commit your work first, or use --force if you're sure (e.g. for triage of pre-existing tasks).
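
The fix in command form — the commit message is a placeholder, and --force is the gate bypass described above:

git status --porcelain       # empty output = clean working tree
git add -A && git commit -m "finish task work"
spl task done                # or add --force to bypass the gate deliberately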

Health check returning non-200 even though the daemon is up

The HEALTHCHECK in the Docker image hits /health. If you're behind a reverse proxy, make sure the proxy isn't injecting auth that breaks the request. The /health endpoint is exempt from permission enforcement specifically so liveness probes work.

High memory growth over multi-day uptime

The 4-loop kernel holds working state in memory for active threads. If you're seeing unbounded growth, capture a snapshot via spl status -o json and file it. This hasn't been observed in practice, but it's a class of bug to watch for.
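
If you want snapshots over time rather than a single point, a crude loop works — the interval and output path are arbitrary:

while sleep 3600; do         # hourly snapshot; adjust to taste
  spl status -o json >> ~/spl-status-snapshots.jsonl
done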

Multi-instance fleet operations

This section covers multi-instance deployments. If you're running a single daemon, skip it.

A "fleet" is two or more spl instances that coordinate via the instance registry. One instance is the coordinator (holds the registry thread, receives heartbeats, processes fan-out barriers). The others are workers (execute dispatched work, report heartbeats, POST completion records back to the coordinator). Any instance can play either role; the distinction is config, not code.

This section covers the operations you run day-2 on a fleet. For the first-time walkthrough, see the Parallel Dev Tutorial.

Booting a fleet

spl fleet start --workers 2
Starting local fleet: 1 coordinator + 2 workers

  [coordinator] spawned PID 414153 on :9100 (home=~/.syncro)
  [worker-a]  spawned PID 414154 on :9201 (home=~/.syncro-worker-a)
  [worker-b]  spawned PID 414161 on :9202 (home=~/.syncro-worker-b)

Waiting for fleet convergence (3 live)...
  ✓ fleet converged

Fleet ready. Inspect with:
  spl fleet list
  spl fleet status

This boots the coordinator on ~/.syncro port 9100 plus N workers on ~/.syncro-worker-{a,b,...} ports 9201, 9202, ... Each worker inherits the coordinator URL from the [fleet] config section and begins emitting heartbeats within ~15 seconds. The wrapper polls /v1/fleet/status until all expected instances are live or 30 seconds pass; if convergence times out, see Worker not registering.
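
If the wrapper's 30-second wait times out but you suspect the fleet is merely slow, you can poll the same endpoint by hand — the grep pattern is an assumption about the response shape, so adjust it to what you actually see:

until curl -fsS http://127.0.0.1:9100/v1/fleet/status | grep -q '"live": *3'; do
  echo "waiting for 3 live instances..."; sleep 1
done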

Verify the fleet is healthy:

spl fleet list
  DID                                      ENDPOINT                  STATUS   VERSION  UPTIME     HEALTH
  did:sync:instance:923a4e6d10646e27       http://127.0.0.1:9201     live     0.X.Y    0s         healthy
  did:sync:instance:dfe40ff4272ad900       http://127.0.0.1:9202     live     0.X.Y    0s         healthy
  did:sync:instance:e6f09984f9ccd634       http://127.0.0.1:9100     live     0.X.Y    0s         healthy

  3 live · 0 stale · 0 archived

All declared instances should appear with status live and a recent last_heartbeat. If any show stale or are missing, see Worker not registering.

Observing the fleet

spl fleet status         # snapshot of live/stale/archived counts
spl fleet status --live  # continuously refreshing view
spl fleet show <did>     # detailed view of one instance
spl fleet ping <did>     # reachability + latency check via HTTP /health

spl fleet status aggregates everything an operator usually wants to see at a glance:

Fleet status
  Coordinator URL:  http://127.0.0.1:9100
  Heartbeat:        every 5s
  Instances:        3 live · 0 stale · 0 archived

  Active freezes:   (none)

  Emergency stop:   inactive

spl fleet show <did> drills into one instance's details — endpoint, version, uptime, current dispatch count, store size, and last heartbeat:

Instance: did:sync:instance:923a4e6d10646e27
  Endpoint:         http://127.0.0.1:9201
  Status:           live
  Version:          0.X.Y
  Uptime:           5s
  Health:           healthy
  Active dispatches: 0
  Store records:    7
  Last heartbeat:   1776119018 (unix)

spl fleet ping <did> measures reachability + latency via the worker's /health endpoint:

  ✓ did:sync:instance:923a4e6d10646e27 reachable in 0ms (200 OK)

For deep inspection of a specific instance, SSH into its host and run spl doctor + spl status locally against that instance's port. The fleet-level commands aggregate over HTTP; the per-instance commands hit the local socket.
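
For example — the hostname is a placeholder for wherever worker-a actually runs:

ssh worker-a-host 'spl doctor && spl status'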

Coordinator replacement (manual failover)

The coordinator is a single writer for the registry thread. If it dies, workers continue operating locally but cannot register or coordinate fan-out until a new coordinator is nominated. Automatic failover is not yet supported.

To replace a dead coordinator (a consolidated sketch of steps 2–4 follows the list):

  1. Pick a surviving worker to promote. Any worker can play the coordinator role — they differ only in config.
  2. Stop the chosen worker cleanly: spl fleet stop --instance worker-a (or send SIGTERM directly if spl fleet is unreachable).
  3. Edit its config at ~/.syncro-worker-a/config.toml: remove the [fleet] coordinator_url line (so it no longer reports to an external coordinator), or leave it pointing at its own endpoint.
  4. Restart: SYNCROPEL_HOME=~/.syncro-worker-a spl serve --daemon --port 9201. The promoted instance now holds the registry.
  5. Update remaining workers to point their [fleet] coordinator_url at the new coordinator's endpoint. Restart each.
  6. Verify: spl fleet list against the new coordinator should show all workers live.
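
A sketch of steps 2–4 for worker-a — the sed edit assumes coordinator_url sits on its own line in the [fleet] section, so check your config.toml before running it:

spl fleet stop --instance worker-a
sed -i '/^coordinator_url/d' ~/.syncro-worker-a/config.toml        # step 3: stop reporting out
SYNCROPEL_HOME=~/.syncro-worker-a spl serve --daemon --port 9201   # step 4: promote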
