Inference Overview
The mental model behind infer.query.v1 — query the substrate like you would a model, but over a heterogeneous pool of LLMs, patterns, systems, and humans with pluggable fold and soft-ranking relevance.
Overview
Syncropel introduces inference as a first-class record. Any record with body.kind: "infer.query.v1" becomes a query — the engine routes it to a pool of responders, collects their answers, folds them into a single result, and commits the result as a KNOW record on the query's thread.
You already know the substrate as a coordination medium. Records flow in, the engine folds threads, trust accumulates, tasks resolve. Inference re-uses that substrate. Nothing new gets added — queries are just INTEND records with a specific shape, responders are actors or system adapters you already register, fold is declarative, and the answer is just another KNOW record that other records can cite as a parent.
If you've used a chat model, you have a mental model for querying one model at a time. Inference generalises that to many responders of different kinds — LLMs, patterns, system adapters, and human actors — in a single query, with a declarative rule for how their answers combine.
Why queries are records
The 8-field record shape (parents, thread, actor, act, body, clock, data_type, judged_by) is the only thing the substrate ingests. That applies to inference too.
A query is an INTEND record whose body follows the infer.query.v1 schema. The executor listens for that body kind via an INGEST-time event trigger, runs the pipeline (responder filter → relevance-score → dispatch CALLs → await DOs → fold → validate → commit KNOW), and the answer lands as a new KNOW record on the same thread. No new HTTP endpoint. No orchestration daemon. The substrate does the work because the substrate already knows how to do this work.
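Concretely, a query never leaves that envelope. A minimal sketch — the id formats, clock shape, and actor name below are illustrative assumptions, not the substrate's wire format:

```rust
use serde_json::json;

fn main() {
    // Sketch of a query riding the 8-field envelope. All concrete
    // values (ids, clock shape, actor name) are illustrative assumptions.
    let query = json!({
        "parents":   ["rec_abc123"],        // records this query builds on
        "thread":    "thr_billing",         // thread the KNOW answer lands on
        "actor":     "actor:ops-bot",
        "act":       "INTEND",              // queries are INTEND records
        "body":      { "kind": "infer.query.v1" /* …query fields… */ },
        "clock":     { "lamport": 42 },
        "data_type": "application/json",
        "judged_by": null
    });
    println!("{query:#}");
}
```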
This has practical consequences:
- Everything is auditable — every CALL dispatched, every DO returned, every fold contributor is a record you can `spl thread records` and inspect.
- Everything is replayable — re-ingest the query's thread and you re-derive the same answer, because fold is deterministic over a canonical ordering.
- Everything composes — a query's KNOW can be an `input.record_id` for another query (sketched below). Inference chains are just record chains.
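That last point is worth a sketch. A follow-up query cites the first query's committed KNOW as its input — the record id here is made up:

```rust
use serde_json::json;

fn main() {
    // "know_q1" stands for the KNOW record committed for an earlier
    // query; the id is hypothetical. Chaining is just citing it.
    let follow_up = json!({
        "kind": "infer.query.v1",
        "input": { "record_id": "know_q1" }
        // …responders, answer_shape, fold as usual…
    });
    println!("{follow_up:#}");
}
```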
Query anatomy
An infer.query.v1 body has five optional shape components on top of four required ones.
```
infer.query.v1
├── input (required) — what you want answered
├── responders (required) — who may answer
├── answer_shape (required) — what the answer must look like
├── fold (required) — how multiple answers become one
├── dial (optional) — how much novelty is acceptable
├── orchestration (optional) — multi-step pattern: verify, waterfall, escalate…
├── side_effects (optional) — reversibility, cost ceiling, timeout
├── relevance (optional) — top-k filter over the responder pool
└── metadata (optional) — trace/correlation string bag
```

See Query Anatomy for the full schema with every field.
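Put together, a minimal consensus query body might look like the sketch below. The sub-field names (`prompt`, `fn`, `min_quorum`, `top_k`, and the keys under `answer_shape`) are assumptions drawn from this page's prose; Query Anatomy has the authoritative schema.

```rust
use serde_json::json;

fn main() {
    // Minimal sketch: the four required components plus an optional
    // relevance filter. Sub-field names are assumptions — see Query Anatomy.
    let body = json!({
        "kind": "infer.query.v1",
        "input": { "prompt": "Classify this ticket's severity." },
        "responders": [
            { "kind": "llm", "model": "~sonnet", "budget_usd": 0.05 },
            { "kind": "llm" },
            { "kind": "actor" }
        ],
        "answer_shape": { "type": "object", "required": ["severity"] },
        "fold": { "fn": "consensus", "min_quorum": 2 },
        "relevance": { "top_k": 3 }
    });
    println!("{body:#}");
}
```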
Heterogeneous responders
A responder is anything that can emit a DO record in response to a CALL. The engine doesn't care whether an answer comes from an LLM subprocess, a compiled pattern library, a system adapter, or a human typing in a form. The four responder kinds:
| Kind | What it is | Typical latency |
|---|---|---|
| `pattern` | Compiled pattern match (L1/L2/L3 hash lookup) | ~100ms |
| `system` | System adapter — a registered binary or HTTP adapter | seconds |
| `llm` | An LLM provider (Anthropic, OpenAI, etc.) via the proxy | seconds to minutes |
| `actor` | A human actor reached via SSE, webhook, digest email, or polling | hours to days |
You select responders with predicates — each entry in the responders array is an OR-of-classes predicate. `kind: "llm", model: "~sonnet", budget_usd: 0.05` selects any Claude Sonnet variant with a budget cap of 5 cents. Predicates can carry a CEL expression for complex cases; see the Responder Predicates section.
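In sketch form — the `expr` key for the CEL case is an assumption, so check Responder Predicates for the real field name:

```rust
use serde_json::json;

fn main() {
    let responders = json!([
        // The predicate from the prose: any Claude Sonnet variant,
        // capped at 5 cents.
        { "kind": "llm", "model": "~sonnet", "budget_usd": 0.05 },
        // A hypothetical CEL-carrying entry; "expr" is an assumed key.
        { "kind": "actor", "expr": "trust > 0.7 && region == 'eu'" }
    ]);
    println!("{responders:#}");
}
```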
Relevance — the softmax analogue
A query can target many responders, but you usually don't want to dispatch to all of them. The relevance scorer ranks candidates after the hard filter has run, picks the top-k by score, and discards the rest.
The default is a heuristic with an additive-with-floor formula:
```
score = 0.35 · max(trust, 0.05)
      + 0.25 · pattern_match
      + 0.15 · recency
      + 0.25 · semantic_sim
```

The scores are on [0, 1], the weights sum to 1.0, and all four signals are always on. The 0.05 floor is deliberate — multiplicative scoring caused 71% of real responders to score exactly 0.0 in the pre-implementation poll, which made cold-start candidates invisible forever. Additive-with-floor degrades gracefully.
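In code, the heuristic is small enough to restate — a sketch, not the engine's implementation, assuming the four signals arrive pre-normalised to [0, 1]:

```rust
/// Sketch of the default additive-with-floor heuristic. Assumes all
/// four signals are already normalised to [0, 1].
fn relevance_score(trust: f64, pattern_match: f64, recency: f64, semantic_sim: f64) -> f64 {
    0.35 * trust.max(0.05)      // the floor keeps cold-start candidates visible
        + 0.25 * pattern_match
        + 0.15 * recency
        + 0.25 * semantic_sim
}

fn main() {
    // A brand-new responder with zeros everywhere still scores
    // 0.35 * 0.05 = 0.0175 instead of an invisible 0.0.
    assert!((relevance_score(0.0, 0.0, 0.0, 0.0) - 0.0175).abs() < 1e-12);
}
```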
The scorer sits behind an `Arc<dyn RelevanceScorer>` trait object. A future release will swap in a learned two-tower + cross-encoder without changing any query bodies. You configure weights and threshold via a `syncropel.config.relevance_scorer.v1` config record; see the relevance section for details.
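A config record body might look like this sketch — the weight key names and the `threshold` field name are assumptions; the relevance section has the real schema:

```rust
use serde_json::json;

fn main() {
    // Hedged sketch of the scorer config; key names are assumptions.
    let config = json!({
        "kind": "syncropel.config.relevance_scorer.v1",
        "weights": {
            "trust": 0.35, "pattern_match": 0.25,
            "recency": 0.15, "semantic_sim": 0.25
        },
        "threshold": 0.2    // hypothetical minimum score to survive top-k
    });
    println!("{config:#}");
}
```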
Fold vs orchestration — pure vs policy
Two concepts people conflate. Keeping them separate is the whole reason this design works.
Fold is a pure function. Given a list of responses and a canonical order, it returns a single answer. Fold has no I/O, no side effects, no retries, no state. The 5 fold functions — `consensus`, `best_of`, `waterfall_first`, `ensemble_weighted`, `expression` — each have a one-line contract and a deterministic output. See Fold Functions.
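As a toy illustration of that contract — not the engine's implementation — here is a consensus-style fold over canonical-JSON strings, deterministic because a `BTreeMap` breaks count ties in key order:

```rust
use std::collections::BTreeMap;

/// Toy consensus fold: pure, no I/O, no state. `responses` are
/// canonical-JSON strings in canonical order; returns the majority
/// body if it reaches `min_quorum`, else None.
fn consensus(responses: &[String], min_quorum: usize) -> Option<String> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for body in responses {
        *counts.entry(body).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|&(_, n)| n >= min_quorum)
        .max_by_key(|&(_, n)| n)        // ties break by key order: deterministic
        .map(|(body, _)| body.to_owned())
}

fn main() {
    let answers = vec![
        r#"{"x":1}"#.to_string(),
        r#"{"x":1}"#.to_string(),
        r#"{"x":2}"#.to_string(),
    ];
    assert_eq!(consensus(&answers, 2), Some(r#"{"x":1}"#.to_string()));
}
```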
Orchestration is a policy. It decides when to dispatch, to whom, how many times, under what escalation rules. Patterns like `verify` (primary + verifier + tiebreaker), `waterfall` (stages tried in cost order), `escalate` (tiers of increasing authority with human fallback), or `ensemble_with_audit` (parallel fan-out with a trusted auditor) all orchestrate dispatch. They then call fold to combine the answers they collected. See Orchestration Patterns.
You can use fold without orchestration — the default `single_shot` orchestration is just "dispatch to top-k and fold once". You cannot use orchestration without fold — every pattern resolves to a fold call somewhere, even if it's `best_of` with `n=1`.
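For example, a verify pattern's orchestration component might be declared like this sketch — the sub-field names mirror the prose (primary, verifier, tiebreaker), but their exact schema is an assumption; see Orchestration Patterns:

```rust
use serde_json::json;

fn main() {
    // Hedged sketch of the verify pattern's orchestration component.
    // Field names mirror the prose; the real schema may differ.
    let orchestration = json!({
        "pattern": "verify",
        "primary":    { "kind": "llm", "model": "~sonnet" },
        "verifier":   { "kind": "llm" },
        "tiebreaker": { "kind": "actor" }   // a human breaks disagreements
    });
    println!("{orchestration:#}");
}
```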
When to reach for inference
If you're deciding whether a request should be an infer.query.v1 or a plain dispatch, the split is:
- Use a query when you care about which answer is best across multiple responders. Consensus of 3 LLMs, verify with a human tiebreaker, cost waterfall from cheap to expensive — all queries.
- Use plain dispatch when you know exactly who answers and you just want their output. A task routed to a specific dev agent via a routing rule is plain dispatch.
Inference is declarative: you write what you want, the engine picks the participants and aggregates. Dispatch is imperative: you pick the participant and get back what they say.
What inference doesn't do
- It doesn't make the responders faster. If you pool 3 LLMs at 5s each, you pay at least 5s (plus engine overhead). Fold is cheap; the bottleneck is the slowest responder that counts toward your `min_quorum`.
- It doesn't magically aggregate open-ended text. `consensus` keyed on canonical JSON assumes bodies are structured. For long-form prose, use `best_of` with trust or `ensemble_weighted` with a custom weight expression.
- It doesn't replace your evaluation loop. The fold is the aggregation; the trust substrate is what tells you which responders to weight. You still need KNOW records with verdicts to feed the trust model.
See also
- Query Anatomy — every field of `infer.query.v1`.
- Fold Functions — the 5 pure aggregators.
- Orchestration Patterns — the 6 multi-step flows.
- `spl infer` CLI — run queries from the shell.
- Records — the 8-field envelope inference rides on.
- Trust — how trust feeds fold and relevance.