Operator Runbook

Day-2 operations for running Syncropel in production — instance lifecycle, recovery from corruption, backup discipline, in-place upgrades, and how to recognize the failure modes you're about to hit.

Audience

This page is for the person on the other end of a spl serve that's actually serving real work. It's not for first-time installation (see Quickstart) or feature exploration (see Guides). It assumes you have an instance running and you need to keep it healthy, recover when it isn't, and upgrade it without losing data.

If you're reading this because something just broke, jump to Recovery.

Instance lifecycle

Starting the instance

spl serve

This forks a background process, writes a PID file to ~/.syncro/run/spl.pid, opens the SQLite store at ~/.syncro/hub.db, takes a startup backup (see Backup discipline), binds 127.0.0.1:9100, and binds a Unix socket at ~/.syncro/run/spl.sock.

Verify it came up:

spl status
curl -fsS http://localhost:9100/health

Both should report ok with the current version and record count.

Stopping the instance

spl stop

This reads ~/.syncro/run/spl.pid, sends SIGTERM to the instance process, and waits for graceful shutdown. The instance flushes the SQLite WAL, closes the socket, removes the PID file, and exits.

When `spl stop` says "not running" but the instance clearly is

This happens when a long-running instance loses its PID file (host reboot, WSL restart, or the file was clobbered by a runaway test). The instance is still bound to port 9100, but spl stop can't find it.

Recovery:

# Find the actual process
pgrep -af "spl serve"

# Send SIGTERM directly. Use SIGKILL only as last resort.
kill <pid>

# Verify the port is free (no output means nothing is listening on 9100)
ss -tlnp 2>/dev/null | grep 9100

Then restart cleanly.
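The SIGTERM-first, SIGKILL-last escalation from the recovery steps can be sketched as a small shell function. This is a minimal sketch, not something spl ships: the name stop_gracefully and the 10-second default are assumptions made for illustration.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Hypothetical helper, not part of spl. Sends SIGTERM, waits up to
# TIMEOUT_SECONDS (default 10), and sends SIGKILL only if the process
# is still alive after that.
stop_gracefully() {
  pid=$1
  timeout=${2:-10}
  # SIGTERM is kill's default signal. kill fails when the process does not
  # exist (or is not ours to signal), so there is nothing left to wait for.
  kill "$pid" 2>/dev/null || return 0
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "pid $pid still alive after ${timeout}s; sending SIGKILL" >&2
      kill -9 "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Call it with the PID reported by pgrep. A return value of 1 means the process ignored SIGTERM and had to be killed, which is worth noting before you restart.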

Reading instance logs

Logs live at ~/.syncro/logs/spl.log (JSON-line format). Tail them:

tail -f ~/.syncro/logs/spl.log

The structured fields you'll care about most:

Field	Meaning
`level`	`ERROR`, `WARN`, `INFO`, `DEBUG`, `TRACE`
`target`	The Rust module emitting the log (filter on `syncropel_engine::reconciler` to see routing decisions)
`actor`	The DID involved in the operation
`thread`	The thread ID being touched
`record`	Record ID for ingest events

To filter for permission denials specifically:

grep "PERMISSION DENIED" ~/.syncro/logs/spl.log
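Filtering on one of the structured fields above follows the same pattern. The sketch below assumes each field is serialized as a top-level "key":"value" string pair on a single line; the exact serialization is not documented here, so check a few lines of your own log first. The helper name logs_by_field is hypothetical.

```shell
# logs_by_field FIELD VALUE [LOGFILE]
# Hypothetical helper. Prints log lines whose JSON contains "FIELD":"VALUE",
# tolerating optional whitespace around the colon. VALUE is used inside a
# regular expression, so escape any regex metacharacters in it.
logs_by_field() {
  field=$1
  value=$2
  file=${3:-$HOME/.syncro/logs/spl.log}
  grep -E "\"$field\"[[:space:]]*:[[:space:]]*\"$value\"" "$file"
}

# Examples:
#   logs_by_field level ERROR
#   logs_by_field target 'syncropel_engine::reconciler'
```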

Backup discipline

CRITICAL — read this section twice. Syncropel's backup mechanism is a safety net, not a backup system. It will fail to save you if you don't supplement it with off-host copies. The recovery drill at tests/drills/recovery.sh exposed exactly this failure mode.

What the instance does for you

On every startup, spl serve checks if ~/.syncro/hub.db exists. If it does, the instance copies it to ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak. The backup directory is outside ~/.syncro/ so rm -rf ~/.syncro/ doesn't kill it.

The instance key is one of:

instance-<did-tail> if a content-addressed instance DID is bootstrapped.
home-<short-hash> if SYNCROPEL_HOME is set but no DID exists yet.
The default single path ~/.local/share/syncropel/backups/hub.db.bak for the default instance.

What the instance does NOT do — and the trap to avoid

The startup backup is destructive. Every spl serve invocation overwrites the backup file with a snapshot of the current hub.db, even if the current hub.db is empty, corrupt, or wrong.

Concretely: if you delete hub.db and restart the instance, the instance comes up with an empty database, then on the next restart writes that empty database OVER your good backup. By the time you realize what happened, the backup is gone.

The recovery drill (bash tests/drills/recovery.sh) demonstrates this: it captures the backup file off-host immediately after creation, then deliberately wipes the volume and restores from the off-host copy. You should do the same in production.

What you should do instead

Schedule a periodic off-host backup of BOTH ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak (records, trust, config) AND ~/.syncro-data/ (task content files and alias map). The instance auto-backs up only the former; the latter is referenced by path during dispatch but not auto-snapshotted.

# Daily snapshot to a directory you trust
DEST=$HOME/backups/syncropel
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$DEST"

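# Records + trust + engine config.
# The glob assumes exactly one <instance-key> directory; with several, cp fails.
# For the default instance the file is backups/hub.db.bak (no subdirectory).
cp ~/.local/share/syncropel/backups/*/hub.db.bak \
   "$DEST/hub.db.$DATE.bak"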

# Task content files + alias map (rich-text bodies referenced by spl task dispatch)
tar -czf "$DEST/syncro-data.$DATE.tar.gz" -C "$HOME" .syncro-data

# Keep last 14 days of each
find "$DEST" -name 'hub.db.*.bak' -mtime +14 -delete
find "$DEST" -name 'syncro-data.*.tar.gz' -mtime +14 -delete

Or, if you're running in Docker, mount a host directory into the backup path so the rolling backup lives outside the container's ephemeral filesystem:

docker run -d \
  -p 9100:9100 \
  -v spl-home:/syncropel \
  -v $HOME/backups/syncropel:/home/syncropel/.local/share/syncropel/backups \
  syncropic/spl:dev

For added safety, a daily cron or systemd timer snapshotting that host directory to remote storage (S3, B2, SFTP, etc.) is appropriate for any instance carrying real data.
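Because the rolling backup can itself be a copy of an empty database, a scheduled snapshot can end up archiving empty files until retention deletes the last good one. One cheap guard is to refuse to snapshot a source that is suspiciously small. This is a minimal sketch: snapshot_if_sane and its MIN_BYTES threshold are hypothetical, and a size check does not prove the file is a valid database.

```shell
# snapshot_if_sane SRC DEST MIN_BYTES
# Hypothetical helper. Copies SRC to DEST only when SRC exists and is at
# least MIN_BYTES long; otherwise leaves DEST untouched and returns 1.
snapshot_if_sane() {
  src=$1
  dest=$2
  min_bytes=$3
  if [ ! -f "$src" ]; then
    echo "skip: $src does not exist" >&2
    return 1
  fi
  size=$(wc -c < "$src" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "skip: $src is only $size bytes (< $min_bytes)" >&2
    return 1
  fi
  cp "$src" "$dest"
}
```

Pick the threshold from the size of a known-good hub.db.bak on your own instance, and alert on a non-zero return rather than ignoring it.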

What does NOT need to be backed up separately

Trust scores: derived from KNOW/DO records on startup. Will rebuild from the record log.
Engine config: derived from LEARN records on th_engine_config. Will rebuild from the record log.
Routing rules, fold rules, health checks, AITL rules, permission rules: all stored as LEARN records. Will rebuild.

The record log IS the truth. Everything else is a fold over it. As long as hub.db is intact, all derived state recovers automatically.

What needs separate attention

~/.syncro/secrets/: API keys for any LLM provider you've configured (Anthropic, OpenAI, Google, etc.). Not in the record log. If you lose this, you lose your provider keys — back them up to a password manager or secure vault.
~/.syncro-data/: task content files (the rich-Markdown bodies for each task — addressed by their alias, e.g., MY-001.md) and the alias-to-thread mapping. The metadata (titles, statuses, hashes) lives in records on hub.db; the canonical content files live here on disk and are referenced by path during spl task dispatch. Backing up hub.db without ~/.syncro-data/ leaves task content stranded. Both belong in your snapshot.

Recovery from corruption or data loss

Symptom: instance won't start, panics on first ingest

Likely cause: SQLite store is corrupt. Read the panic message in ~/.syncro/logs/spl.log for the exact error.

Recovery procedure:

# 1. Stop the instance (or kill any lingering process)
spl stop || pkill -f "spl serve"

# 2. Move the corrupt store aside (don't delete; you may want it for forensics).
#    A single timestamp keeps the three files grouped.
TS=$(date +%s)
mv ~/.syncro/hub.db ~/.syncro/hub.db.corrupt.$TS
mv ~/.syncro/hub.db-wal ~/.syncro/hub.db-wal.corrupt.$TS 2>/dev/null || true
mv ~/.syncro/hub.db-shm ~/.syncro/hub.db-shm.corrupt.$TS 2>/dev/null || true

# 3. Restore from your OFF-HOST backup (NOT the instance's auto-backup,
#    which may have been overwritten — see "Backup discipline" above)
cp $HOME/backups/syncropel/hub.db.20260412-123000.bak ~/.syncro/hub.db

# 4. Restart and verify
spl serve
spl status

If you don't have an off-host backup, the instance's auto-backup MAY still be intact:

ls -la ~/.local/share/syncropel/backups/
# pick the most recent hub.db.bak with a non-trivial size
cp ~/.local/share/syncropel/backups/<instance-key>/hub.db.bak ~/.syncro/hub.db

But only do this if the instance hasn't already started and overwritten the backup. If the corrupt instance already ran, the auto-backup is now a snapshot of the corrupt state.

Symptom: hub.db deleted by accident

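Same procedure as above, except there is no hub.db to move in step 2. Still move aside any leftover hub.db-wal and hub.db-shm files so SQLite does not pair a stale WAL with the restored database. Restore from off-host backup, restart.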

Symptom: trust scores wrong, routing rules missing

These are derived state. They were correct the last time the instance ran cleanly. To force a rebuild:

spl stop
spl serve  # rebuild_from_store() runs on startup
spl trust           # verify rebuild
spl config list-rules

If they're still wrong after a restart, the underlying records are wrong (or missing). Check the record log directly:

spl thread records th_engine_config | head -50  # config history
spl thread records <thread-id-of-concern>       # specific thread

Symptom: permission enforcement locked you out

You ran spl config permissions-enable without an admin allow rule, and now spl config permissions-disable returns 403. A pre-flight check prevents this, but you can still end up here on an older version or after disabling an admin rule by mistake. To recover:

spl stop                  # cached config in memory will shadow our write
spl config permissions-unlock     # writes permissions_enabled=false direct to store
spl serve                # restart — enforcement is off

See CEL Expressions → Permission Enforcement for the full lockout trap explanation.

In-place upgrades

The supported upgrade path

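# 1. Stop the instance. A clean shutdown flushes the SQLite WAL into hub.db,
#    so the snapshot in step 2 is complete.
spl stop

# 2. Snapshot before touching the binary. Always.
mkdir -p "$HOME/backups/syncropel"
cp ~/.syncro/hub.db "$HOME/backups/syncropel/hub.db.pre-upgrade.$(date +%s).bak"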

# 3. Install the new binary (atomic mv via the install script)
curl -sSf https://get.syncropic.com/spl | sh

# 4. Verify the new binary version
spl version

# 5. Restart
spl serve

# 6. Verify health + record count match pre-upgrade
spl status

The Syncropel engine's SQLite schema has been backward-compatible across 0.8.x → 0.9.x. The startup path runs CREATE TABLE IF NOT EXISTS for any new tables; existing data is untouched. If a future release requires a non-backward-compatible migration, the release notes will say so explicitly and the instance will refuse to start until you run the migration.

The upgrade drill (bash tests/drills/upgrade.sh) verifies the data-preservation property end-to-end on every release: 5 user records + 1 routing rule + 1 permission rule are all confirmed intact across instance stop → container swap → instance start on the same volume. If you're maintaining a fork or shipping a custom build, run this drill before you cut a release.

What can go wrong, and how to recognize it

Binary download fails partway: the install script downloads to a temp file and atomically mvs into place. A failed download leaves the old binary intact. You'll see the old version on spl version.
Instance won't start on the new binary: a panic or graceful refusal. Read the log. Roll back by reinstalling the previous version (curl -sSf 'https://get.syncropic.com/spl?v=0.9.1' | sh; quote the URL so the shell does not treat ? as a glob character) and restarting.
Records present but trust scores empty: the new instance rebuilds trust from records on startup, and the rebuild may not have finished yet. Wait a moment for rebuild_from_store() to complete, then check spl trust again.

Common issues and fixes

"Address already in use" on `spl serve`

Another instance is already bound to port 9100. Either you have a stale spl serve running (use pgrep -af "spl serve" then kill), or another process took the port. Find it with ss -tlnp | grep 9100.

`~/.syncro/run/spl.pid` exists but no process

Stale PID file. Delete it and start clean:

rm -f ~/.syncro/run/spl.pid
spl serve
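Before deleting the file, it is worth confirming that the PID it names is really gone. The sketch below assumes the PID file contains only the process ID; the helper name clear_stale_pidfile is hypothetical. A live PID can belong to an unrelated process if the number was reused, so treat a "still alive" result as a prompt to look, not as proof.

```shell
# clear_stale_pidfile [PIDFILE]
# Hypothetical helper. Removes PIDFILE only when the PID it names is not a
# running process. Returns 0 if the file was removed or absent, and 1 if
# the process is still alive (file left in place).
clear_stale_pidfile() {
  pidfile=${1:-$HOME/.syncro/run/spl.pid}
  [ -f "$pidfile" ] || return 0
  pid=$(tr -d '[:space:]' < "$pidfile")
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "pid $pid is alive; leaving $pidfile in place" >&2
    return 1
  fi
  rm -f "$pidfile"
}
```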

`spl task done` returns "uncommitted changes"

The task completion gate refuses to mark a task complete when the working tree has uncommitted changes — it's protecting against attributing wrong commits to the task. Either commit your work first, or use --force if you're sure (e.g. for triage of pre-existing tasks).

Health check returning non-200 even though the instance is up

The HEALTHCHECK in the docker image hits /health. If you're behind a reverse proxy, make sure the proxy isn't injecting auth that breaks the request. The /health endpoint is exempt from permission enforcement specifically so liveness probes work.

High memory growth over multi-day uptime

The 4-loop engine holds working state in memory for active threads. If you see unbounded growth, capture a snapshot via spl status -o json and include it in a bug report. This hasn't been observed in practice, but it's a class of bug to watch for.

Dispatch under memory pressure: read the warn log

When you dispatch a worker — spl task dispatch <SKL>, spl run <goal>, or any other path that spawns a subprocess — the instance emits one of three log lines at spawn:

DEBUG dispatch pre-flight: host memory healthy avail_gb=8.1
INFO  dispatch pre-flight: host approaching memory pressure avail_gb=3.2
WARN  dispatch pre-flight: host under memory pressure — OOM-kill risk; consider sequential dispatch avail_gb=1.4

The lines are visible in ~/.syncro/logs/spl.log (filter for dispatch pre-flight) and in journalctl if you're running under systemd. They're driven by MemAvailable from /proc/meminfo, classified into three bands:

Band	Threshold	Meaning
Healthy	`≥ 4 GiB` available	Plenty of headroom. Parallel dispatch is fine.
Approaching	`2-4 GiB` available	Workers may run, but you're trending toward pressure. Consider whether a second parallel worker is worth it.
Critical	`< 2 GiB` available	OOM-kill risk. Drop to sequential dispatch (`max_concurrent = 1` on the adapter) before you spawn more workers.

The Critical line is the actionable one. When you see it:

Don't start more parallel workers. The currently-spawning worker may already be at risk.
Free memory — close idle browser tabs, idle agent sessions, anything not load-bearing. Each idle agent process commonly consumes 200-500 MB resident.
Verify free memory after cleanup — free -h and check available.
If a worker was killed during the pressure window, follow the salvage procedure below.

Linux-only — the pre-flight check no-ops gracefully on macOS and Windows (no /proc/meminfo). On those platforms, watch the OS process monitor instead.
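To check which band a host is in before dispatching, the classification in the table can be reproduced by hand. This is a sketch under assumptions: the function name mem_band is hypothetical, the thresholds are taken as binary GiB, and the boundary handling (exactly 4 GiB is Healthy, exactly 2 GiB is Approaching) follows the table rather than the instance's source.

```shell
# mem_band AVAIL_KIB
# Maps an available-memory figure in KiB to the band names from the table:
# healthy (>= 4 GiB), approaching (>= 2 GiB and < 4 GiB), critical (< 2 GiB).
mem_band() {
  avail_kib=$1
  if [ "$avail_kib" -ge 4194304 ]; then      # 4 GiB = 4 * 1024 * 1024 KiB
    echo healthy
  elif [ "$avail_kib" -ge 2097152 ]; then    # 2 GiB
    echo approaching
  else
    echo critical
  fi
}

# Linux only: /proc/meminfo reports MemAvailable in kB.
if [ -r /proc/meminfo ]; then
  avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  [ -n "$avail" ] && echo "MemAvailable=${avail} kB band=$(mem_band "$avail")"
fi
```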

Dispatched worker died unexpectedly: salvage procedure

If spl task show <task-id> reports Status: failed with Prior attempts: 1 failed ($0.00 total) shortly after a parallel dispatch, the worker subprocess almost certainly received a SIGKILL, most often from the kernel OOM-killer under host memory pressure. The good news: any work the worker had finished writing to disk before the kill is still in the worktree, even though the branch shows commits ahead: 0. You can usually salvage it without re-dispatching.

Step 1 — Diagnose the failure mode

spl task diagnose <task-id> | tail -30

Look for the Subprocess and Completion blocks. The OOM-kill signature is:

Subprocess
  pid:        <pid>
  runtime:    <ms>            ← typically <10min
  exit_code:  None
  signal:     Some(9)         ← SIGKILL

Completion
  codepath:        stream_eof_fallback   # see body-kind reference
  failure_reason:  result_missing
  success:         false

If signal: Some(15) (SIGTERM) instead of Some(9), you're in a different failure class — process-level shutdown rather than kernel OOM. The salvage steps below still apply, but the prevention guidance at the bottom of this section won't help; investigate budget exhaustion, agent-CLI signal handling, or container/cgroup limits instead.
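The two failure classes can be told apart mechanically from the diagnose output. The sketch below parses the signal: Some(N) line in the format shown above; the helper name classify_worker_signal is hypothetical, and it assumes the output format stays as printed in this runbook.

```shell
# classify_worker_signal
# Hypothetical helper. Reads `spl task diagnose` output on stdin, extracts the
# number from the first "signal: Some(N)" line, and prints the failure class.
classify_worker_signal() {
  sig=$(sed -n 's/^[[:space:]]*signal:[[:space:]]*Some(\([0-9][0-9]*\)).*/\1/p' | head -n 1)
  case "$sig" in
    9)  echo "SIGKILL: likely kernel OOM-kill; confirm with dmesg" ;;
    15) echo "SIGTERM: process-level shutdown; check budget, signal handling, cgroup limits" ;;
    "") echo "no signal recorded" ;;
    *)  echo "signal $sig: not covered by this runbook" ;;
  esac
}

# Example:
#   spl task diagnose <task-id> | classify_worker_signal
```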

Step 2 — Inspect the worktree

The diagnose output reports worktree: and latest commit:. Change into the worktree and check its status:

cd /path/to/<your-project>-<task-id>
git status -s

Modified or untracked files = real work the worker produced before SIGKILL. Do not delete the worktree.
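If you are triaging several failed tasks, the same check can be scripted. This is a minimal sketch using git status --porcelain, which prints one line per modified, staged, or untracked path and nothing for a clean tree. The function name worktree_has_changes is hypothetical.

```shell
# worktree_has_changes DIR
# Returns 0 when the git worktree at DIR has modified, staged, or untracked
# files. Returns 1 when it is clean, and also when DIR is not a git worktree.
worktree_has_changes() {
  [ -n "$(git -C "$1" status --porcelain 2>/dev/null)" ]
}

# Example, with WORKTREE set to the path reported by diagnose:
#   if worktree_has_changes "$WORKTREE"; then
#     echo "uncommitted work present; do not delete $WORKTREE"
#   fi
```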

Step 3 — Read the diff + decide salvage path

git diff
ls -la <new-directories-the-worker-created>

Does the work look substantively complete (worker had finished, just hadn't committed)? Or partial (mid-flight)?

Substantively complete — proceed to Step 4.
Partial — read the task brief at ~/.syncro-data/tasks/<task-id>.md, identify the missing acceptance-criteria items, complete them in the worktree, then proceed.

Step 4 — Run the worker end-of-task gate

From the worktree, run whatever local-CI gate your project uses. For a Rust project:

cargo fmt --all -- --check          # if it complains, run `cargo fmt --all` to fix
cargo check
cargo clippy -- -D warnings
cargo nextest run                    # or `cargo test` if you don't have nextest

Substitute the equivalent for your stack (npm test, pytest, go test ./..., etc.). All gates must pass — the salvage is only as trustworthy as the gate.

Step 5 — Commit on the task branch

The worktree is already on a task branch. Commit there:

git add <specific-files>
git commit -m "feat(scope): <change> (<task-id>)

<commit body>

Salvaged in-session after parallel-dispatch SIGKILL on the original
worker (host OOM, signal=9 at ~Xmin). Worker had completed <work
inventory>; only <missing-item> needed adding. Worker gate green."

Step 6 — Merge to main + run merge-time gate

cd /path/to/<your-project>   # main worktree
git merge task/<task-id> --ff-only       # or `git cherry-pick <hash>` if main moved
# Re-run your full-workspace gates
git push origin main

Step 7 — Reopen + done + approve

The original worker's dispatch_complete left the task in failed status. Reopen and walk it through the lifecycle:

spl task reopen <task-id> --reason "salvaged in-session after worker SIGKILL"
SPL_ACTOR=did:sync:agent:dev      spl task done    <task-id> --summary "..." --domain code
SPL_ACTOR=did:sync:agent:reviewer spl task approve <task-id> --domain code --notes "..."

Reference the salvage commit hash + workspace gate result in the reviewer's --notes so trust evidence accumulates correctly.

Step 8 — Clean up the worktree

cd /path/to/<your-project>
git worktree remove <your-project>-<task-id> --force

The task branch persists post-merge; that's fine, it's part of the git history.

Avoiding the OOM kill in the first place

Cap parallel dispatch by available memory. On a 16GB host already running a few claude sessions and spl serve, even max_concurrent = 2 may be too many. If MemAvailable < 4GB × N_workers, drop to max_concurrent = 1.
Close idle agent sessions before dispatching parallel. Each background agent-CLI session commonly consumes 200-500MB resident.
Watch dmesg after a SIGKILL incident — dmesg | grep -i 'killed process' confirms whether the OOM-killer was the culprit (vs. other SIGKILL sources like signal injection or container limits).
Set generous timeouts. OOM doesn't care about timeout, but a longer timeout gives the worker more chance to commit before any other failure mode fires.
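The first rule in the list above, dropping to a single worker when available memory is below 4 GB per worker, is simple arithmetic. The sketch below treats the 4 GB figure as 4 GiB and takes available memory in KiB, as /proc/meminfo reports it. The function name safe_max_concurrent is hypothetical, and how you apply the result to the adapter's max_concurrent setting is left to your configuration.

```shell
# safe_max_concurrent AVAIL_KIB WANTED_WORKERS
# Keeps WANTED_WORKERS only if at least 4 GiB is available per worker;
# otherwise falls back to 1.
safe_max_concurrent() {
  avail_kib=$1
  wanted=$2
  per_worker_kib=4194304                     # 4 GiB per worker, in KiB
  if [ "$avail_kib" -lt $((per_worker_kib * wanted)) ]; then
    echo 1
  else
    echo "$wanted"
  fi
}

# Example (Linux):
#   safe_max_concurrent "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" 2
```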

Multi-instance fleet operations

This section covers multi-instance deployments. If you're running a single instance, skip it.

A "fleet" is two or more spl instances that coordinate via the instance registry. One instance is the coordinator (holds the registry thread, receives heartbeats, processes fan-out barriers). The others are workers (execute dispatched work, report heartbeats, POST completion records back to the coordinator). Any instance can play either role; the distinction is config, not code.

This section covers the operations you run day-2 on a fleet. For the first-time walkthrough, see the Parallel Dev Tutorial.

Booting a fleet

spl fleet start --workers 2

Starting local fleet: 1 coordinator + 2 workers

  [coordinator] spawned PID 414153 on :9100 (home=~/.syncro)
  [worker-a]  spawned PID 414154 on :9201 (home=~/.syncro-worker-a)
  [worker-b]  spawned PID 414161 on :9202 (home=~/.syncro-worker-b)

Waiting for fleet convergence (3 live)...
  ✓ fleet converged

Fleet ready. Inspect with:
  spl fleet list
  spl fleet status

This boots the coordinator on ~/.syncro (port 9100) plus N workers on ~/.syncro-worker-{a,b,...} (ports 9201, 9202, ...). Each worker inherits the coordinator URL from the [fleet] config section and begins emitting heartbeats within ~15 seconds. The wrapper polls /v1/fleet/status until all expected instances are live or 30 seconds pass; if convergence times out, see Worker not registering.
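If you start instances by hand instead of through the wrapper, the same poll-until-converged-or-timeout pattern is easy to reproduce. This is a generic sketch, not the wrapper's implementation: poll_until is a hypothetical helper, and the elapsed time it tracks counts only the sleep intervals, so the real wall-clock limit is slightly longer.

```shell
# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper. Re-runs COMMAND every INTERVAL_SECONDS until it
# succeeds (returns 0) or roughly TIMEOUT_SECONDS have elapsed (returns 1).
poll_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Example (assumes the "Instances: 3 live" line printed by spl fleet status):
#   fleet_is_live() { spl fleet status | grep -q 'Instances:[[:space:]]*3 live'; }
#   poll_until 30 1 fleet_is_live || echo "fleet did not converge" >&2
```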

Verify the fleet is healthy:

spl fleet list

  DID                                      ENDPOINT                  STATUS   VERSION  UPTIME     HEALTH
  did:sync:instance:923a4e6d10646e27       http://127.0.0.1:9201     live     0.X.Y    0s         healthy
  did:sync:instance:dfe40ff4272ad900       http://127.0.0.1:9202     live     0.X.Y    0s         healthy
  did:sync:instance:e6f09984f9ccd634       http://127.0.0.1:9100     live     0.X.Y    0s         healthy

  3 live · 0 stale · 0 archived

All declared instances should appear with status live and a recent last_heartbeat. If any show stale or are missing, see Worker not registering.

Observing the fleet

spl fleet status         # snapshot of live/stale/archived counts
spl fleet status --live  # continuously refreshing view
spl fleet show <did>     # detailed view of one instance
spl fleet ping <did>     # reachability + latency check via HTTP /health

spl fleet status aggregates everything an operator usually wants to see at a glance:

Fleet status
  Coordinator URL:  http://127.0.0.1:9100
  Heartbeat:        every 5s
  Instances:        3 live · 0 stale · 0 archived

  Active freezes:   (none)

  Emergency stop:   inactive

spl fleet show <did> drills into one instance's details — endpoint, version, uptime, current dispatch count, store size, and last heartbeat:

Instance: did:sync:instance:923a4e6d10646e27
  Endpoint:         http://127.0.0.1:9201
  Status:           live
  Version:          0.X.Y
  Uptime:           5s
  Health:           healthy
  Active dispatches: 0
  Store records:    7
  Last heartbeat:   1776119018 (unix)

spl fleet ping <did> measures reachability + latency via the worker's /health endpoint:

  ✓ did:sync:instance:923a4e6d10646e27 reachable in 0ms (200 OK)

For deep inspection of a specific instance, SSH into its host and run spl doctor + spl status locally against that instance's port. The fleet-level commands aggregate over HTTP; the per-instance commands hit the local socket.

Coordinator replacement (manual failover)

The coordinator is a single writer for the registry thread. If it dies, workers continue operating locally but cannot register or coordinate fan-out until a new coordinator is nominated. Automatic failover is not yet supported.

To replace a dead coordinator:

Pick a surviving worker to promote. Any worker can play the coordinator role — they differ only in config.
Stop the chosen worker cleanly: spl fleet stop --instance worker-a (or send SIGTERM directly if spl fleet is unreachable).
Edit its config at ~/.syncro-worker-a/config.toml: remove the [fleet] coordinator_url line (so it no longer reports to an external coordinator), or set it to the instance's own endpoint.
Restart: SYNCROPEL_HOME=~/.syncro-worker-a spl serve --port 9201. The promoted instance now holds the registry.
Update remaining workers to point their [fleet] coordinator_url at the new coordinator's endpoint. Restart each.
Verify: spl fleet list against the new coordinator should show all workers live.
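Repointing the remaining workers means editing the same key in each worker's config.toml. The sketch below assumes the key is written as coordinator_url = "..." on its own line and appears only under [fleet]; neither the exact TOML layout nor the helper name set_coordinator_url comes from Syncropel itself, so check one config file by hand first.

```shell
# set_coordinator_url CONFIG_FILE NEW_URL
# Hypothetical helper. Rewrites every line that starts with "coordinator_url"
# to point at NEW_URL. It is not section-aware, so it assumes the key appears
# only under [fleet]. NEW_URL must not contain "|" or "&", which are special
# in the sed expression below.
set_coordinator_url() {
  config=$1
  url=$2
  tmp=$(mktemp) || return 1
  # Write to a temp file first, then copy back so the original keeps its
  # permissions and is only replaced if sed succeeded.
  sed "s|^\([[:space:]]*coordinator_url[[:space:]]*=\).*|\1 \"$url\"|" "$config" > "$tmp" \
    && cp "$tmp" "$config"
  status=$?
  rm -f "$tmp"
  return $status
}

# Example:
#   set_coordinator_url ~/.syncro-worker-b/config.toml http://127.0.0.1:9201
```

Stop each worker before editing its config and restart it afterwards, as in the steps above.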
