Syncropel Docs

Cookbook: recovering from a partial dispatch failure

A working operator's walkthrough of what to do when a dispatched task fails mid-flight — diagnose, decide salvage vs fresh, resume, merge.

The scenario

You dispatched a task and came back an hour later to find:

$ spl task show TASK-0042
Task: TASK-0042 — [your task here]
Status: failed
Cost: $0.1847
Worktree: /home/you/projects/myproject-TASK-0042
Sub-thread: th_8c51e9c8...

The dispatch consumed ~$0.18 of real API cost. Work almost certainly happened. The question now is — can you recover it without redoing everything?

Step 1 — what happened?

Start with spl task diagnose, not with tail ~/.syncro/logs/spl.log. Structured records tell you in one screen what raw logs tell you in hundreds.

spl task diagnose TASK-0042

You're looking for three things:

  1. The completion codepath. Which of the four exit paths fired? result_event / line_timeout / budget_deadline / stream_eof_fallback.
  2. The failure reason. A specific value (subprocess_exited_nonzero, budget_exceeded, …) or None if it actually succeeded.
  3. Whether the worktree still exists. spl task diagnose includes the path. Check it's still on disk.
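The third check is easy to script. The sketch below is illustrative only and is not part of spl: the helper names worktree_present and commits_ahead are hypothetical, and commits_ahead assumes the task branch was cut from main. It relies on two git facts: a worktree contains a .git entry (a file for linked worktrees, a directory for a normal clone), and git rev-list --count counts commits in a range.

```python
import os
import subprocess


def worktree_present(path: str) -> bool:
    """Return True if `path` is a directory that contains a .git entry.

    A linked git worktree stores .git as a file and a normal clone stores
    it as a directory, so os.path.exists covers both cases.
    """
    return os.path.isdir(path) and os.path.exists(os.path.join(path, ".git"))


def commits_ahead(path: str, base: str = "main") -> int:
    """Count commits reachable from HEAD but not from `base`.

    Assumption: the task branch was created from `base` (main here).
    """
    out = subprocess.run(
        ["git", "-C", path, "rev-list", "--count", f"{base}..HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())
```

For the scenario above you would call worktree_present("/home/you/projects/myproject-TASK-0042") and, if it returns True, commits_ahead on the same path to learn whether any partial work was committed.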

Step 2 — the decision: salvage vs fresh

Not every failed dispatch is worth resuming. The question is whether the prior work has durable artifacts (commits in the worktree) that would be wasteful to throw away, or whether the subprocess died so early that resuming buys nothing.

Use this rubric:

| Situation | Recommendation |
| --- | --- |
| Completion codepath = result_event, failure_reason = result_reported_error | Read the result text → fix the task brief → fresh dispatch. |
| stream_eof_fallback + subprocess_exited_nonzero + no commits in worktree | Fresh dispatch. Subprocess died before anything shippable. |
| stream_eof_fallback + subprocess_exited_nonzero + commits in worktree | Salvage via --resume. |
| line_timeout on a long research/writing task | Salvage via --resume; bump --budget / --timeout. |
| budget_exceeded on a task that was under-budgeted | Salvage via --resume with a larger budget. |
| budget_exceeded on a task that spiraled (agent chased the wrong thing) | Fresh dispatch with a narrower brief. |
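The rubric can also be read as a small decision function. This is a sketch for reasoning about the rows, not something spl ships: recommend and its parameters are hypothetical names, and has_commits and spiraled are judgments you supply after inspecting the worktree and the task history.

```python
from typing import Optional


def recommend(codepath: str, failure_reason: Optional[str],
              has_commits: bool = False, spiraled: bool = False) -> str:
    """Map a diagnose result to 'fresh', 'resume', or 'inspect'."""
    if codepath == "result_event" and failure_reason == "result_reported_error":
        return "fresh"    # read the result text, fix the brief, dispatch again
    if (codepath == "stream_eof_fallback"
            and failure_reason == "subprocess_exited_nonzero"):
        # Salvage only if the subprocess lived long enough to commit something.
        return "resume" if has_commits else "fresh"
    if codepath == "line_timeout":
        return "resume"   # and bump --budget / --timeout
    if failure_reason == "budget_exceeded":
        # An under-budgeted task is worth resuming; a spiraled one is not.
        return "fresh" if spiraled else "resume"
    return "inspect"      # combination not covered by the rubric
```

Anything that falls through to "inspect" is outside the rubric and deserves a closer read of the diagnose output before you decide.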

Step 3 — the salvage workflow (spl task dispatch --resume)

The important thing about --resume: it does not restart the agent from scratch. It preserves the worktree, preserves the branch, preserves all prior commits, and prepends a "prior commits" notice to the agent briefing so the agent knows what has already been done.

cd /home/you/projects/myproject
spl task dispatch --resume TASK-0042

The resume notice surfaces the commits via git log:

You are resuming a previously-interrupted dispatch. The following commits
already exist on the task branch — do NOT redo this work:

  abc1234 partial: wired up enum variants
  def5678 partial: added unit tests for new variants

Continue from where the prior session left off. Read the commit messages
and diffs to understand what was done before proceeding.

The agent reads the log, understands the prior work, and continues. When it finishes, it creates a regular final commit (not --amend), and you merge normally.
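To make the shape of that notice concrete, here is a hedged sketch of how a notice like it could be assembled from git log --oneline output. It is not spl's actual implementation: build_resume_notice is a hypothetical helper, and the notice wording is abbreviated from the example above.

```python
def build_resume_notice(oneline_log: str) -> str:
    """Format `git log --oneline` output as a prior-commits notice.

    Returns an empty string when there are no commits to report.
    """
    commits = [ln.strip() for ln in oneline_log.splitlines() if ln.strip()]
    if not commits:
        return ""  # nothing to salvage, so no notice
    body = "\n".join(f"  {c}" for c in commits)
    return (
        "You are resuming a previously-interrupted dispatch.\n"
        "These commits already exist on the task branch:\n\n"
        f"{body}\n\n"
        "Continue from where the prior session left off."
    )
```

Because the input is whatever git log reports at resume time, a notice built this way reflects the branch as it stands now rather than as it stood when the dispatch failed.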

What if the worktree was manually modified?

If you touched the worktree after the failure (pulled main, made a fix, tested a theory), --resume still works — the agent sees whatever commits are actually there. The resume notice reflects the current git log, not the state at the time of failure.

What if you deleted the worktree?

You can't --resume without a worktree. Re-run without --resume, and a fresh worktree + branch will be created. You've lost the prior partial commits (but spl task diagnose still shows they existed — the records survive).

Step 4 — verify + merge

Once the resumed dispatch completes:

spl task show TASK-0042   # status should be 'review'
spl task diagnose TASK-0042   # final completion record should be success=true
cd ../myproject-TASK-0042
git log --oneline                # review all commits (prior + new)
cargo test                       # run gates

Then approve + merge as usual:

cd /home/you/projects/myproject
SPL_ACTOR=did:sync:agent:director spl task approve TASK-0042 --merge

The approval creates a KNOW verdict record (with judged_by != the executor, per the self-eval-prevention gate), then merges the worktree branch back to main.

Anti-pattern — don't reach for git first

Some operators' instinct is to cd into the worktree and manually inspect git log / git diff before running spl task diagnose. That's a trap.

The commits are only one of many artifacts — the records also tell you about the agent CLI's internal session, the tokens consumed, the tool_result records for each Bash / Read / Edit, and critically the completion codepath. You lose all of that if you jump straight to git.

The record log is the full narrative. git log is one thread of it.

When --resume isn't enough

If the resumed dispatch also fails (same codepath, same reason), something more fundamental is wrong. File a comment on the task with the spl task diagnose output from both runs, then consult the dispatch observability guide for the hypothesis table — each completion-codepath / failure-reason combination has a remediation pathway.

Recurrent stream_eof_fallback + subprocess_exited_nonzero with signal=9 is usually OS-level resource pressure (OOM from too many concurrent workers, too-large prompts). Recurrent line_timeout is usually a tool-call that takes longer than the configured per_line_timeout. Recurrent budget_exceeded is almost always an under-budgeted task.

None of those require code changes — they require tuning. Each maps to a record signature in the dispatch-observability stream that you can pattern-match for operational remediation.
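As a rough illustration of that pattern-matching, the sketch below groups failed runs by signature and suggests a tuning direction for the three recurrent cases described above. The dict-shaped records and the field names codepath, failure_reason, and signal are assumptions made for the example; the real record schema may differ, and tuning_hint and recurrent_signature are hypothetical helpers.

```python
from collections import Counter


def tuning_hint(codepath, failure_reason, signal=None):
    """Suggest a tuning direction for one failure signature."""
    if (codepath == "stream_eof_fallback"
            and failure_reason == "subprocess_exited_nonzero"
            and signal == 9):
        return "resource pressure: reduce concurrent workers or prompt size"
    if codepath == "line_timeout":
        return "raise per_line_timeout"
    if failure_reason == "budget_exceeded":
        return "raise the task budget"
    return "see the dispatch observability guide"


def recurrent_signature(runs, min_repeats=2):
    """Return (signature, hint) if one signature repeats, else None.

    `runs` is a list of dicts with assumed keys codepath, failure_reason,
    and (optionally) signal.
    """
    sigs = Counter(
        (r.get("codepath"), r.get("failure_reason"), r.get("signal"))
        for r in runs
    )
    if not sigs:
        return None
    sig, count = sigs.most_common(1)[0]
    if count < min_repeats:
        return None
    return sig, tuning_hint(*sig)
```

A single failure returns None by design: the hints only apply once the same signature has shown up at least min_repeats times.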
