Running an async-federation relay
Install, configure, monitor, and troubleshoot a Syncropel async-federation relay. Covers Docker and systemd deployment, Prometheus metrics, bearer-token auth for receivers, and the failure modes you'll hit in practice.
Audience
This guide is for operators running a relay for their own organization, a small community, or an infrastructure provider serving multiple tenants. It assumes you've read the async federation guide and understand what the relay is (a dumb store-and-forward mailbox) and what it isn't (an identity provider, a consensus layer, a trust substrate).
If you're a user configuring a client to use a relay that already exists, you don't need this page — use spl config relay set <url>.
What a relay is
An async-federation relay is a single HTTP service that accepts signed envelopes from senders and lets receivers poll for their mailbox contents when they come online. It implements four endpoints:
- POST /v1/mailbox/{did} — deposit an envelope addressed to a receiver DID
- GET /v1/mailbox/{did}/receive — list queued envelopes for a receiver
- POST /v1/mailbox/{did}/ack — remove acknowledged envelopes
- GET /health — liveness (no auth, for load balancers / Kubernetes)
Plus:
- GET /metrics — Prometheus text format
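A minimal end-to-end exchange against a local relay looks like the following sketch. The envelope and ack bodies here are illustrative only — the authoritative schema lives in the async federation guide:

# Deposit an envelope for a receiver (body shape is illustrative, not the real schema)
curl -fsS -X POST http://localhost:8080/v1/mailbox/did:sync:alice \
  -H 'Content-Type: application/json' \
  --data '{"from_did":"did:sync:bob","payload":"...","signature":"..."}'

# Poll the mailbox when the receiver comes online
curl -fsS http://localhost:8080/v1/mailbox/did:sync:alice/receive

# Acknowledge processed envelopes so the relay can drop them (field name assumed)
curl -fsS -X POST http://localhost:8080/v1/mailbox/did:sync:alice/ack \
  -H 'Content-Type: application/json' \
  --data '{"envelope_ids":["..."]}'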
The relay does not verify envelope signatures. Verification is end-to-end, performed by receivers after dequeue. The relay sees sender/receiver DIDs, envelope sizes, and timing; it does not see the records inside MLS-encrypted envelopes.
When to run your own
You want a relay of your own when any of the following apply:
- Sovereignty — you want to guarantee no third party sees your federation traffic metadata (even when envelopes are MLS-encrypted, timing and DID pairs are visible).
- Latency — your fleet is geographically clustered and relay.syncropel.com adds too much round-trip time.
- Self-hosting policy — your organization prohibits external dependencies for production coordination.
- Custom retention — default TTL is 30 days; if you need longer (or shorter for privacy), run your own.
If none of those apply, use relay.syncropel.com — it's free and operated by the Syncropel team.
Install
Docker (recommended)
docker run -d \
--name syncropel-relay \
--restart unless-stopped \
-p 8080:8080 \
  ghcr.io/syncropic/syncropel-relay:latest

The image is ~50 MB, runs as non-root (UID 10001), and exposes a HEALTHCHECK that hits /health every 10 seconds.
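If you deploy with Compose instead, an equivalent minimal service definition is sketched below (no healthcheck is declared because the image ships its own):

services:
  syncropel-relay:
    image: ghcr.io/syncropic/syncropel-relay:latest
    container_name: syncropel-relay
    restart: unless-stopped
    ports:
      - "8080:8080"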
Verify it came up:
curl -fsS http://localhost:8080/health
# {"status":"ok"}systemd (manual binary install)
Download the binary for your platform from releases.syncropic.com (Linux x86_64 and aarch64 are built; other platforms can build from source with cargo build --release --bin syncropel-relay).
Install the binary and create a systemd unit:
sudo install -m 0755 syncropel-relay /usr/local/bin/syncropel-relay
sudo useradd --system --home-dir /var/lib/syncropel-relay relay
# Create the state directory referenced by ReadWritePaths in the unit below
sudo install -d -o relay -g relay /var/lib/syncropel-relay
sudo tee /etc/systemd/system/syncropel-relay.service <<'EOF'
[Unit]
Description=Syncropel async-federation relay
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=relay
Group=relay
ExecStart=/usr/local/bin/syncropel-relay --bind 0.0.0.0:8080
Restart=on-failure
RestartSec=5s
# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/syncropel-relay
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now syncropel-relay
sudo systemctl status syncropel-relay

Configuration
The current release ships an in-memory queue and a single CLI flag. Durable SQLite-backed storage and runtime config reload land in a continuation task.
| Flag | Default | Description |
|---|---|---|
| --bind <addr:port> | 0.0.0.0:8080 | HTTP listener address |
Log levels via RUST_LOG:
RUST_LOG=info syncropel-relay --bind 0.0.0.0:8080 # default
RUST_LOG=syncropel_relay=debug syncropel-relay ... # verbose

Queue backend
The relay uses an in-memory queue (MemoryQueue). Process restart clears queued envelopes — this is acceptable because senders retain an outbound queue locally and will redeliver. Deployment implications:
- Do not run two relay instances behind a load balancer. Each has its own queue; envelopes deposited at one won't be visible at the other.
- Rolling restarts lose in-flight envelopes. Senders recover, but receivers see a gap until redelivery completes.
- The per-receiver cap is 10,000 envelopes. Deposits past the cap return 507 Insufficient Storage and increment relay_rate_limit_hits_total{kind="queue_full"}.
The SQLite-backed durable queue lands in a roadmap continuation task; Redis support will be tracked in a future ADR if demand materializes.
TTL
Envelopes expire 30 days after deposit by default. Senders can override this per envelope via the ttl_seconds field. The relay's expiry sweep is not yet wired in this release (the MemoryQueue.prune_expired method exists but runs on demand only) — a background sweep task lands in the continuation task alongside SQLite.
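A per-envelope override looks like the following sketch, assuming ttl_seconds is a top-level field of the deposit body (the exact placement is defined by the envelope schema in the async federation guide):

# Deposit with a 24-hour TTL instead of the 30-day default (body shape illustrative)
curl -fsS -X POST http://localhost:8080/v1/mailbox/did:sync:alice \
  -H 'Content-Type: application/json' \
  --data '{"from_did":"did:sync:bob","ttl_seconds":86400,"payload":"..."}'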
Auth
The relay runs unauthenticated on the deposit side (envelopes are signed end-to-end, so deposit authority is not load-bearing for correctness — only for abuse control). For receive and ack:
- Path-parameter DID is the receiver identity; there's no server-side challenge.
- The full challenge/response flow lands in a continuation task.
For production use before the challenge/response flow ships, front the relay with a reverse proxy that enforces bearer-token auth per receiver DID. Minimal nginx example:
# http context: map each receiver DID to its provisioned bearer token
map $mailbox_did $expected_token {
    default          "";                # unknown DIDs get an empty token; add an entry per receiver
    "did:sync:alice" "token-for-alice";
}

# server context:
location ~ ^/v1/mailbox/(?<mailbox_did>did:sync:[^/]+) {
    if ($http_authorization != "Bearer $expected_token") {
        return 401;
    }
    proxy_pass http://127.0.0.1:8080;
}

This is intentionally coarse — production operators should wait for the native challenge/response flow or run the relay inside a VPN with mutual TLS.
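A receiver behind this proxy then polls with its provisioned token:

curl -fsS https://relay.example.com/v1/mailbox/did:sync:alice/receive \
  -H 'Authorization: Bearer token-for-alice'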
Rate limits
The relay does not enforce per-sender or per-receiver rate limits at the application layer. Use a reverse proxy (nginx limit_req_zone, Cloudflare, or an API gateway) until native limits land.
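A sketch of the nginx variant, using the limit_req_zone directive mentioned above (rate and burst values are illustrative):

# http context: 10 requests/second per client IP, 10 MB of counter state
limit_req_zone $binary_remote_addr zone=relay_rl:10m rate=10r/s;

# in the relay's server block:
location /v1/ {
    limit_req zone=relay_rl burst=20 nodelay;
    proxy_pass http://127.0.0.1:8080;
}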
Monitoring
Prometheus metrics
GET /metrics returns the standard Prometheus text exposition format:
curl -fsS http://localhost:8080/metrics

Metrics exposed:
| Name | Type | Labels | Meaning |
|---|---|---|---|
| relay_deposits_total | counter | from_did, to_did | Envelopes accepted per sender → receiver pair |
| relay_receives_total | counter | did | Receive calls served per receiver |
| relay_acks_total | counter | did | Envelopes acknowledged per receiver |
| relay_rate_limit_hits_total | counter | kind | Abuse-control rejections by kind (queue_full, payload_too_large) |
| relay_envelopes_expired_total | counter | — | Envelopes dropped by the TTL sweep |
| relay_queue_size | gauge | did | Current queue length per receiver |
| relay_oldest_envelope_seconds | gauge | did | Age in seconds of the oldest queued envelope per receiver |
| relay_unique_senders | gauge | — | Distinct senders represented in the current queue state |
| relay_unique_receivers | gauge | — | Distinct receivers with at least one queued envelope |
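An illustrative excerpt of the exposition (values and HELP strings are invented here; names and labels follow the table above):

# HELP relay_queue_size Current queue length per receiver
# TYPE relay_queue_size gauge
relay_queue_size{did="did:sync:alice"} 12
relay_oldest_envelope_seconds{did="did:sync:alice"} 340
relay_deposits_total{from_did="did:sync:bob",to_did="did:sync:alice"} 87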
Scrape config
Add a scrape target to Prometheus:
scrape_configs:
  - job_name: 'syncropel-relay'
    scrape_interval: 30s
    static_configs:
      - targets: ['relay.example.com:8080']

Example Grafana panel queries
Deposit rate per receiver (per-second rate, averaged over the last minute):

sum by (to_did) (rate(relay_deposits_total[1m]))

Queue backlog (top 10 receivers with the longest-queued envelope):

topk(10, relay_oldest_envelope_seconds)

Capacity pressure (rate of queue_full rejections):

rate(relay_rate_limit_hits_total{kind="queue_full"}[5m])

Delivery lag (ack rate vs deposit rate):

sum(rate(relay_acks_total[5m])) / sum(rate(relay_deposits_total[5m]))

A healthy ratio is near 1.0 — values persistently below 1.0 mean receivers are falling behind.
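These translate directly into Prometheus alert rules; a sketch follows (thresholds are illustrative — tune them to your fleet's poll cadence):

groups:
  - name: syncropel-relay
    rules:
      - alert: RelayMailboxStale
        expr: max by (did) (relay_oldest_envelope_seconds) > 86400
        for: 15m
        annotations:
          summary: "Receiver {{ $labels.did }} has not drained its mailbox in over a day"
      - alert: RelayQueueFull
        expr: rate(relay_rate_limit_hits_total{kind="queue_full"}[5m]) > 0
        for: 10m
        annotations:
          summary: "Deposits are being rejected with 507 (per-receiver cap reached)"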
Troubleshooting
Queue is not draining
Symptom: relay_queue_size{did="..."} climbing, relay_acks_total{did="..."} flat.
Common causes:
- Receiver is offline or its relay URL is wrong. On the receiver, check spl config relay show. If the URL points elsewhere, the receiver isn't polling this relay.
- Receiver's long-poll loop is dead. Check spl fleet sync status on the receiver — look for a recent last_receive_ms timestamp per relay.
- Network path from receiver to relay is broken. From the receiver, run curl https://<relay>/v1/mailbox/<did>/receive. A timeout or connection refused points at the network, not relay logic.
/health returns 200 but deposits fail with 507
You've hit the per-receiver cap (default 10,000). This is a back-pressure signal — the receiver has been offline long enough that its mailbox overflowed.
Immediate action: the sender MUST handle 507 gracefully (retain outbound, retry later with exponential backoff). No operator intervention on the relay is required; once the receiver comes online and drains, capacity frees up.
Root-cause action: if 507s are chronic for a particular receiver, that receiver has either disappeared (permanent) or needs its poll cadence investigated.
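A sketch of that sender-side behavior as a shell loop (illustrative — real senders carry this logic in their outbound queue, not in shell):

# Retry a deposit with exponential backoff while the mailbox is full (507)
delay=5
while :; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST "https://relay.example.com/v1/mailbox/$TO_DID" \
    -H 'Content-Type: application/json' --data @envelope.json)
  [ "$code" != "507" ] && break   # delivered, or failed for a non-capacity reason
  sleep "$delay"
  delay=$(( delay * 2 ))
done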
Envelopes deposited but never received
- Verify the to_did on the envelope matches what the receiver is polling for. Mismatched DIDs queue the envelope under a DID the receiver never asks about.
- Check relay_queue_size{did="<receiver>"} — if it's non-zero, the envelope is queued correctly and the problem is on the receiver side.
- Check relay_oldest_envelope_seconds{did="<receiver>"} — very large values mean the receiver has been offline for at least that long.
Relay restarts lose in-flight envelopes
This is expected behavior: the in-memory queue doesn't persist across restarts. Senders retain outbound envelopes locally (default 7 days) and redeliver on their next federation sweep. No data is lost at the system level — only the in-flight window between deposit and receive is reset.
To eliminate this, wait for the SQLite-backed durable queue (on the roadmap).
High cardinality on relay_deposits_total
If you have many distinct sender DIDs, the {from_did, to_did} label pair can produce a cardinality explosion in Prometheus storage. Mitigations:
- Use Prometheus metric_relabel_configs to drop the from_did label and keep only to_did (see the snippet after this list).
- Run the relay inside a trust boundary (single-tenant deployments have a bounded set of sender DIDs).
- For multi-tenant public-relay deployments, consider switching to a histogram-based approach — a continuation task will add this.
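The relabel drop as a scrape-config fragment (extends the scrape job shown earlier):

scrape_configs:
  - job_name: 'syncropel-relay'
    static_configs:
      - targets: ['relay.example.com:8080']
    metric_relabel_configs:
      # Drop the high-cardinality sender label before ingestion; to_did remains
      - action: labeldrop
        regex: from_did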
Backup and restore
The in-memory queue has no durable state to back up. A persistent SQLite backend is on the roadmap; when that ships:
- Snapshot with sqlite3 <path> ".backup <dest>" — online-safe.
- Restore by stopping the relay, copying the backup file into place, and starting it again.
Senders retain outbound envelopes independently of the relay, so even total queue loss at the relay is recoverable from the sender side (within each sender's retention window).
Upgrading
# Docker
docker pull ghcr.io/syncropic/syncropel-relay:latest
docker stop syncropel-relay && docker rm syncropel-relay
docker run -d --name syncropel-relay --restart unless-stopped \
-p 8080:8080 ghcr.io/syncropic/syncropel-relay:latest
# systemd
curl -fsSL https://releases.syncropic.com/spl-rust/latest/syncropel-relay-linux-x86_64 \
-o /tmp/syncropel-relay
sudo install -m 0755 /tmp/syncropel-relay /usr/local/bin/syncropel-relay
sudo systemctl restart syncropel-relay

In-flight envelopes are lost during restart (in-memory queue). For zero-downtime upgrades, wait for the SQLite backend and run two instances behind a load balancer with sticky per-DID routing.
Privacy posture — what you can and can't see as operator
You can see (by virtue of serving traffic):
- Sender DID, receiver DID, envelope size, deposit timestamp, delivery timestamp
- Whether an envelope is plain or mls encoded (a one-bit leak)
- Queue depth per receiver, which implies rough liveness patterns
You cannot see (by protocol design):
- Envelope body contents when encoding: mls
- Record-level semantics (what threads, what actors, what acts)
- Correlation between envelopes and higher-level coordination — the relay is a dumb forwarder
Operators running public relays are expected to publish a transparency statement describing what metadata is retained and for how long. No default policy — this is your call.
See also
- Async federation (user-facing) — the "why" and "when" from the sender/receiver perspective
- Federation guide — the synchronous (no-relay) counterpart