wz-phone/docs/PRD-relay-federation.md
Siavash Sameni b97f32ce46
docs: PRD for relay federation (multi-relay mesh) + identity fix
Documents the relay TLS identity bug (cert regenerates on restart
because server_config() creates a new keypair every time, ignoring
the persisted Ed25519 seed) and the full federation design:

- YAML config with mutual peer trust (url + fingerprint)
- QUIC connections between peers, fingerprint verification
- Room bridging: media forwarding for shared room names
- Merged participant presence across relays
- Helpful log message for unconfigured peer connection attempts
- No transcoding, no re-encryption, no central coordinator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 21:33:05 +04:00


PRD: Relay Federation (Multi-Relay Mesh)

Problem

Currently all participants in a call must connect to the same relay. This creates:

  • Single point of failure — if the relay goes down, the entire call drops
  • Geographic latency — users far from the relay get high RTT
  • Capacity limits — one relay handles all traffic

Users should be able to connect to their nearest/preferred relay and still talk to users on other relays, as long as the relays are federated.

Prerequisite: Fix Relay Identity Persistence

Bug: TLS certificate regenerates on every restart

Root cause: wzp-transport/src/config.rs:17 calls rcgen::generate_simple_self_signed() which creates a new keypair every time. The relay's Ed25519 identity seed IS persisted to ~/.wzp/relay-identity, but the TLS certificate is not derived from it.

Impact: Clients see a different server fingerprint after every relay restart, triggering the "Server Key Changed" warning. This also breaks federation since relays identify each other by certificate fingerprint.

Fix: Derive the TLS certificate from the persisted relay seed:

  1. Add server_config_from_seed(seed: &[u8; 32]) to wzp-transport
  2. Use the seed to create a deterministic keypair (e.g., derive an ECDSA key via HKDF from the Ed25519 seed)
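  3. Generate a self-signed cert with that keypair, so that the same seed yields the same cert and the same fingerprint. This holds only if every input to the fingerprint is deterministic (for example a fixed serial number and validity dates, plus a deterministic signature scheme), or if the fingerprint is computed over the public key rather than the full certificate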
  4. The relay passes its loaded seed to server_config_from_seed() instead of server_config()

Effort: 0.5 day

Federation Design

Core Concept

Two or more relays form a federation mesh. Each relay is an independent SFU. When relays are configured to trust each other, they bridge rooms with matching names — participants on relay A in room "podcast" hear participants on relay B in room "podcast" as if everyone were on the same relay.

Configuration

Each relay reads a YAML config file (e.g., ~/.wzp/relay.yaml or --config relay.yaml):

# Relay identity (auto-generated if missing)
listen: 0.0.0.0:4433

# Federation peers — other relays we trust and bridge rooms with
# Both sides must configure each other for federation to work
peers:
  - url: "193.180.213.68:4433"
    fingerprint: "a5d6:e3c6:5ae7:185c:4eb1:af89:daed:4a43"
    label: "Pangolin EU"

  - url: "10.0.0.5:4433"
    fingerprint: "7f2a:b391:0c44:..."
    label: "Office LAN"

Key rules:

  • Both relays must configure each other — mutual trust required
  • A relay that receives a connection from an unknown peer logs: "Relay a5d6:e3c6:... (193.180.213.68) wants to federate. To accept, add to peers config: url: 193.180.213.68:4433, fingerprint: a5d6:e3c6:..."
  • Fingerprints are verified via the TLS certificate (requires the identity fix above)
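As a rough illustration of the fingerprint format and the log hint described above, the following Rust sketch renders 16 bytes in the colon-separated style of the config example and builds the "add to config" message. The names `format_fingerprint` and `unconfigured_peer_hint`, the 16-byte length, and the `advertised_url` parameter are assumptions made for illustration. The source does not say how the fingerprint bytes are derived from the certificate, nor how a relay learns the listen address of an inbound peer (an inbound connection normally arrives from an ephemeral port).

```rust
use std::net::IpAddr;

/// Renders a 16-byte fingerprint as eight colon-separated groups of four hex
/// digits, matching the style of the YAML example. The 16-byte length is an
/// assumption read off that example.
pub fn format_fingerprint(bytes: &[u8; 16]) -> String {
    bytes
        .chunks(2)
        .map(|pair| format!("{:02x}{:02x}", pair[0], pair[1]))
        .collect::<Vec<String>>()
        .join(":")
}

/// Builds the log line for a peer that is not in the local `peers` list.
/// `advertised_url` is hypothetical: the peer would have to announce its
/// listen address, since the inbound source port is usually ephemeral.
pub fn unconfigured_peer_hint(
    fingerprint: &str,
    remote_ip: IpAddr,
    advertised_url: &str,
) -> String {
    format!(
        "Relay {fp} ({ip}) wants to federate. To accept, add to peers config: \
         url: {url}, fingerprint: {fp}",
        fp = fingerprint,
        ip = remote_ip,
        url = advertised_url,
    )
}
```

Keeping the hint copy-pasteable into the YAML file is the point of the message, so the sketch repeats the fingerprint in exactly the form the config expects.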

Protocol

Peer Connection

  1. On startup, each relay attempts QUIC connections to all configured peers
  2. The connection uses the SNI value "_federation" (a reserved room name prefix) to distinguish it from client connections
  3. After QUIC handshake, verify the peer's certificate fingerprint matches the configured fingerprint
  4. If fingerprint mismatch → reject, log warning
  5. If peer connects but isn't in our config → log the helpful "add to config" message, reject
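The accept/reject rules in steps 3 to 5 can be sketched as two small checks, one for connections this relay dialed and one for connections it received. This is a minimal sketch under assumptions: `PeerConfig`, `Verdict`, `check_outbound`, and `check_inbound` are hypothetical names, fingerprints are compared as case-insensitive strings, and the inbound case assumes the connecting relay presents its own certificate during the handshake, which the source does not spell out.

```rust
/// One entry of the `peers` list (field names follow the YAML example).
#[derive(Debug, Clone, PartialEq)]
pub struct PeerConfig {
    pub url: String,
    pub fingerprint: String,
    pub label: String,
}

/// Outcome of a fingerprint check (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Accept { label: String },
    /// Step 4: we dialed a configured peer but it presented another identity.
    RejectMismatch { expected: String, presented: String },
    /// Step 5: an inbound peer that is not in our config.
    RejectUnknown { presented: String },
}

/// Outbound: we know which peer we dialed, so exactly one fingerprint is valid.
pub fn check_outbound(dialed: &PeerConfig, presented: &str) -> Verdict {
    if dialed.fingerprint.eq_ignore_ascii_case(presented) {
        Verdict::Accept { label: dialed.label.clone() }
    } else {
        Verdict::RejectMismatch {
            expected: dialed.fingerprint.clone(),
            presented: presented.to_string(),
        }
    }
}

/// Inbound: accept only if the presented fingerprint belongs to a configured peer.
pub fn check_inbound(peers: &[PeerConfig], presented: &str) -> Verdict {
    match peers
        .iter()
        .find(|p| p.fingerprint.eq_ignore_ascii_case(presented))
    {
        Some(peer) => Verdict::Accept { label: peer.label.clone() },
        None => Verdict::RejectUnknown { presented: presented.to_string() },
    }
}
```

Separating the two rejection variants lets the caller log a warning for a mismatch and the "add to config" hint for an unknown peer.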

Room Bridging

Once two relays are connected:

  1. Room discovery: When a local participant joins room "T", the relay sends a FederationSignal::RoomJoin { room: "T", participants } signal to all connected peers
  2. Room leave: When the last local participant leaves room "T", the relay sends FederationSignal::RoomLeave { room: "T" } to all connected peers
  3. Media forwarding: For each room that exists on both relays:
    • Relay A forwards all media packets from its local participants to relay B
    • Relay B forwards all media packets from its local participants to relay A
    • Each relay then fans out received federated media to its local participants (same as local SFU forwarding)
  4. Participant presence: RoomUpdate signals are merged: local participants + federated participants from all peers

Relay A (2 local users)          Relay B (1 local user)
┌─────────────────────┐          ┌─────────────────────┐
│ Room "T"            │          │ Room "T"            │
│  Alice (local)  ────┼──media──►│  Charlie (local)    │
│  Bob   (local)  ────┼──media──►│                     │
│                     │◄──media──┼── Charlie           │
│  Charlie (federated)│          │  Alice (federated)  │
│                     │          │  Bob   (federated)  │
└─────────────────────┘          └─────────────────────┘
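The bookkeeping behind room discovery, room leave, and the "room exists on both relays" condition might look like the sketch below, which tracks a single peer link. `BridgeState` and `RoomEvent` are hypothetical names. The sketch announces a room only when its first local participant joins, which is an assumption; the text above says a join signal is sent when a local participant joins, and later joins could equally be carried by participant updates.

```rust
use std::collections::{HashMap, HashSet};

/// Signals exchanged with the peer, reduced to the room name (hypothetical).
#[derive(Debug, PartialEq)]
pub enum RoomEvent {
    Join(String),
    Leave(String),
}

/// Room state for one federation link.
#[derive(Default)]
pub struct BridgeState {
    /// Number of local participants per room; rooms with zero are removed.
    local_counts: HashMap<String, usize>,
    /// Rooms the peer has announced as active on its side.
    peer_rooms: HashSet<String>,
}

impl BridgeState {
    /// Records a local join; returns the signal to send, if any.
    pub fn local_join(&mut self, room: &str) -> Option<RoomEvent> {
        let count = self.local_counts.entry(room.to_string()).or_insert(0);
        *count += 1;
        if *count == 1 {
            Some(RoomEvent::Join(room.to_string()))
        } else {
            None
        }
    }

    /// Records a local leave; signals only when the last participant is gone.
    pub fn local_leave(&mut self, room: &str) -> Option<RoomEvent> {
        let count = self.local_counts.get_mut(room)?;
        *count -= 1;
        if *count == 0 {
            self.local_counts.remove(room);
            Some(RoomEvent::Leave(room.to_string()))
        } else {
            None
        }
    }

    /// Applies a signal received from the peer.
    pub fn peer_event(&mut self, event: RoomEvent) {
        match event {
            RoomEvent::Join(room) => {
                self.peer_rooms.insert(room);
            }
            RoomEvent::Leave(room) => {
                self.peer_rooms.remove(&room);
            }
        }
    }

    /// Media is forwarded only for rooms active on both sides.
    pub fn is_bridged(&self, room: &str) -> bool {
        self.local_counts.contains_key(room) && self.peer_rooms.contains(room)
    }
}
```

With several peers, one such state per link would be enough to decide where each local packet has to be forwarded.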

Signal Messages (new)

enum FederationSignal {
    /// A room exists on this relay with active participants
    RoomJoin { room: String, participants: Vec<ParticipantInfo> },
    /// Room is empty on this relay
    RoomLeave { room: String },
    /// Participant update for a federated room
    ParticipantUpdate { room: String, participants: Vec<ParticipantInfo> },
}
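To illustrate how the participant lists carried by these signals could be merged into one presence view, here is a small sketch. The fields of `ParticipantInfo` are not given in the source, so the struct below is a hypothetical stand-in with only a name; `Origin`, `PresenceEntry`, and `merged_presence` are likewise illustrative names.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the PRD's `ParticipantInfo`; real fields unknown.
#[derive(Debug, Clone, PartialEq)]
pub struct ParticipantInfo {
    pub name: String,
}

/// Where a participant is connected, as seen from this relay.
#[derive(Debug, Clone, PartialEq)]
pub enum Origin {
    Local,
    Federated { peer: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct PresenceEntry {
    pub name: String,
    pub origin: Origin,
}

/// Merges local participants with the latest list reported by each peer.
/// `federated` is keyed by peer label; a BTreeMap keeps the order stable.
pub fn merged_presence(
    local: &[ParticipantInfo],
    federated: &BTreeMap<String, Vec<ParticipantInfo>>,
) -> Vec<PresenceEntry> {
    let mut merged: Vec<PresenceEntry> = local
        .iter()
        .map(|p| PresenceEntry {
            name: p.name.clone(),
            origin: Origin::Local,
        })
        .collect();
    for (peer, participants) in federated {
        for p in participants {
            merged.push(PresenceEntry {
                name: p.name.clone(),
                origin: Origin::Federated { peer: peer.clone() },
            });
        }
    }
    merged
}
```

For the diagram's relay A, this would yield Alice and Bob as local entries followed by Charlie as a federated entry. Dropping a peer's key from the map when its link goes down makes its participants disappear from presence.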

Media Forwarding

Federated media is forwarded as raw QUIC datagrams — the relay doesn't decode/re-encode. Each packet is prefixed with a room identifier so the receiving relay knows which room to fan it out to:

[room_hash: 8 bytes][original_media_packet]

The 8-byte room hash is computed once when the federation room bridge is established.
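A minimal sketch of this framing follows. The source does not name the hash function, so the sketch assumes 64-bit FNV-1a over the room name, chosen only because it is easy to implement identically on both relays; `room_hash`, `wrap_federated`, and `split_federated` are hypothetical names.

```rust
pub const ROOM_HASH_LEN: usize = 8;

/// 64-bit FNV-1a over the room name, big-endian. The choice of FNV-1a is an
/// assumption; any hash works as long as both relays compute the same one.
pub fn room_hash(room: &str) -> [u8; ROOM_HASH_LEN] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in room.bytes() {
        h ^= u64::from(byte);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h.to_be_bytes()
}

/// Prefixes a media packet with the room hash: [room_hash][original packet].
pub fn wrap_federated(hash: &[u8; ROOM_HASH_LEN], packet: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(ROOM_HASH_LEN + packet.len());
    out.extend_from_slice(hash);
    out.extend_from_slice(packet);
    out
}

/// Splits a received datagram into room hash and the untouched media packet.
/// Returns None if the datagram is too short to contain the prefix.
pub fn split_federated(datagram: &[u8]) -> Option<([u8; ROOM_HASH_LEN], &[u8])> {
    if datagram.len() < ROOM_HASH_LEN {
        return None;
    }
    let (head, body) = datagram.split_at(ROOM_HASH_LEN);
    let mut hash = [0u8; ROOM_HASH_LEN];
    hash.copy_from_slice(head);
    Some((hash, body))
}
```

The receiving relay would look the hash up in a table of bridged rooms and drop datagrams whose hash it does not recognize. The prefix also costs 8 bytes of datagram payload on the federation link, which matters for packets already close to the maximum datagram size.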

What Relays DON'T Do

  • No transcoding — media passes through as-is. If Alice sends Opus 64k, Charlie receives Opus 64k
  • No re-encryption — packets are already encrypted end-to-end between participants. Relays just forward opaque bytes
  • No central coordinator — each relay independently connects to its configured peers. No master/slave, no consensus protocol
  • No automatic peer discovery — peers must be explicitly configured in YAML

Failure Handling

  • If a peer relay goes down, the federation link drops. Local rooms continue to work. Federated participants disappear from presence.
  • Reconnection: retry after 30 seconds, then back off exponentially up to a maximum interval of 5 minutes
  • If a peer relay restarts with a new identity (i.e., the identity persistence bug above is not yet fixed), the fingerprint check fails and federation is rejected with a clear error log
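One way to read the reconnection rule is a delay that starts at 30 seconds and doubles after each failed attempt until it reaches the 5-minute ceiling. The sketch below assumes exactly that doubling schedule without jitter; `reconnect_delay` is a hypothetical helper name.

```rust
use std::time::Duration;

const INITIAL_DELAY: Duration = Duration::from_secs(30);
const MAX_DELAY: Duration = Duration::from_secs(300);

/// Delay before the next reconnection attempt, given how many attempts have
/// already failed since the link dropped. Doubling per attempt is an
/// assumption; the source only says "exponential backoff".
pub fn reconnect_delay(failed_attempts: u32) -> Duration {
    // 2^failed_attempts, saturating instead of overflowing for large counts.
    let factor = 1u64.checked_shl(failed_attempts).unwrap_or(u64::MAX);
    let secs = INITIAL_DELAY.as_secs().saturating_mul(factor);
    Duration::from_secs(secs.min(MAX_DELAY.as_secs()))
}
```

Under these assumptions the delays are 30 s, 60 s, 120 s, 240 s, and then 300 s for every later attempt. The counter would be reset to zero once a connection succeeds.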

Implementation Plan

Phase 0: Fix Relay Identity (prerequisite)

  • Derive TLS cert from persisted seed
  • Same seed → same cert → same fingerprint across restarts

Phase 1: YAML Config + Peer Connection

  • Add --config relay.yaml CLI flag
  • Parse peers config
  • On startup, connect to all configured peers via QUIC
  • Verify certificate fingerprints
  • Log helpful message for unconfigured peers
  • Reconnect on disconnect

Phase 2: Room Bridging

  • Track which rooms exist on each peer
  • Forward media for shared rooms
  • Merge participant presence across peers
  • Handle room join/leave signals

Phase 3: Resilience

  • Graceful handling of peer disconnect/reconnect
  • Don't duplicate packets if a participant is reachable via multiple paths
  • Rate limiting on federation links (prevent amplification)
  • Metrics: federated rooms, packets forwarded, peer latency
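For the rate-limiting item, a token bucket per federation link is one common option. The sketch below is not taken from the source: `TokenBucket`, its parameters, and the choice of one token per forwarded packet are all assumptions, and time is passed in explicitly as milliseconds so the logic stays easy to test.

```rust
/// Token bucket for one federation link (hypothetical; one token = one packet).
pub struct TokenBucket {
    capacity: u64,
    tokens: u64,
    refill_per_sec: u64,
    last_refill_ms: u64,
}

impl TokenBucket {
    /// Starts full, so a burst of up to `capacity` packets is allowed.
    pub fn new(capacity: u64, refill_per_sec: u64, now_ms: u64) -> Self {
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill_ms: now_ms,
        }
    }

    /// Returns true if `cost` tokens were available and have been consumed.
    pub fn try_consume(&mut self, cost: u64, now_ms: u64) -> bool {
        let elapsed_ms = now_ms.saturating_sub(self.last_refill_ms);
        let refill = elapsed_ms.saturating_mul(self.refill_per_sec) / 1000;
        if refill > 0 {
            // Fractions of a token are dropped when a refill is applied;
            // if nothing is refilled, the elapsed time keeps accumulating.
            self.tokens = self.tokens.saturating_add(refill).min(self.capacity);
            self.last_refill_ms = now_ms;
        }
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```

A relay could call `try_consume(1, now)` before forwarding each federated packet and drop the packet when it returns false, which bounds how much traffic a misbehaving peer can cause this relay to fan out.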

Effort Estimates

| Phase | Scope | Effort |
|-------|-------|--------|
| 0 | Fix relay TLS identity from seed | 0.5 day |
| 1 | YAML config + peer QUIC connections | 2 days |
| 2 | Room bridging + media forwarding + presence merge | 3-4 days |
| 3 | Resilience + metrics | 2 days |

Non-Goals (v1)

  • Automatic peer discovery (mDNS, DHT, etc.)
  • Cascading federation (relay A ↔ B ↔ C where A doesn't know C)
  • Load balancing across relays
  • Additional encryption between relays beyond what QUIC already provides at the transport layer (e2e encryption between participants is orthogonal)
  • Mapping differently named rooms across relays, or excluding individual rooms from federation (all federated rooms are bridged by name)