Documents the relay TLS identity bug (the cert regenerates on restart because server_config() creates a new keypair every time, ignoring the persisted Ed25519 seed) and the full federation design:
- YAML config with mutual peer trust (url + fingerprint)
- QUIC connections between peers, fingerprint verification
- Room bridging: media forwarding for shared room names
- Merged participant presence across relays
- Helpful log message for unconfigured peer connection attempts
- No transcoding, no re-encryption, no central coordinator
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PRD: Relay Federation (Multi-Relay Mesh)
Problem
Currently all participants in a call must connect to the same relay. This creates:
- Single point of failure — if the relay goes down, the entire call drops
- Geographic latency — users far from the relay get high RTT
- Capacity limits — one relay handles all traffic
Users should be able to connect to their nearest/preferred relay and still talk to users on other relays, as long as the relays are federated.
Prerequisite: Fix Relay Identity Persistence
Bug: TLS certificate regenerates on every restart
Root cause: wzp-transport/src/config.rs:17 calls rcgen::generate_simple_self_signed() which creates a new keypair every time. The relay's Ed25519 identity seed IS persisted to ~/.wzp/relay-identity, but the TLS certificate is not derived from it.
Impact: Clients see a different server fingerprint after every relay restart, triggering the "Server Key Changed" warning. This also breaks federation since relays identify each other by certificate fingerprint.
Fix: Derive the TLS certificate from the persisted relay seed:
- Add server_config_from_seed(seed: &[u8; 32]) to wzp-transport
- Use the seed to create a deterministic keypair (e.g., derive an ECDSA key via HKDF from the Ed25519 seed)
- Generate a self-signed cert with that keypair: same seed = same cert = same fingerprint
- The relay passes its loaded seed to server_config_from_seed() instead of server_config()
Effort: 0.5 day
Federation Design
Core Concept
Two or more relays form a federation mesh. Each relay is an independent SFU. When relays are configured to trust each other, they bridge rooms with matching names — participants on relay A in room "podcast" hear participants on relay B in room "podcast" as if everyone were on the same relay.
Configuration
Each relay reads a YAML config file (e.g., ~/.wzp/relay.yaml or --config relay.yaml):
# Relay identity (auto-generated if missing)
listen: 0.0.0.0:4433
# Federation peers — other relays we trust and bridge rooms with
# Both sides must configure each other for federation to work
peers:
  - url: "193.180.213.68:4433"
    fingerprint: "a5d6:e3c6:5ae7:185c:4eb1:af89:daed:4a43"
    label: "Pangolin EU"
  - url: "10.0.0.5:4433"
    fingerprint: "7f2a:b391:0c44:..."
    label: "Office LAN"
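The fingerprint strings in the config above are colon-separated groups of four hex digits. As a minimal sketch of converting between that text form and raw bytes, the helpers below may be useful; `format_fingerprint` and `parse_fingerprint` are hypothetical names, and how the bytes are derived from the certificate (hash function, truncation length) is not specified in this document.

```rust
/// Formats raw fingerprint bytes as colon-separated groups of four hex digits,
/// e.g. [0xa5, 0xd6, 0xe3, 0xc6] becomes "a5d6:e3c6".
fn format_fingerprint(bytes: &[u8]) -> String {
    bytes
        .chunks(2)
        .map(|pair| pair.iter().map(|b| format!("{:02x}", b)).collect::<String>())
        .collect::<Vec<_>>()
        .join(":")
}

/// Parses the text form back into bytes. Returns None if any group is not
/// exactly four hex digits (so a truncated "..." placeholder is rejected).
fn parse_fingerprint(text: &str) -> Option<Vec<u8>> {
    let mut bytes = Vec::new();
    for group in text.split(':') {
        if group.len() != 4 || !group.bytes().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        let value = u16::from_str_radix(group, 16).ok()?;
        bytes.extend_from_slice(&value.to_be_bytes());
    }
    Some(bytes)
}
```

Rejecting malformed fingerprints at config load time would surface typos before any peer connection is attempted.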
Key rules:
- Both relays must configure each other — mutual trust required
- A relay that receives a connection from an unknown peer logs:
  "Relay a5d6:e3c6:... (193.180.213.68) wants to federate. To accept, add to peers config: url: 193.180.213.68:4433, fingerprint: a5d6:e3c6:..."
- Fingerprints are verified via the TLS certificate (requires the identity fix above)
Protocol
Peer Connection
- On startup, each relay attempts QUIC connections to all configured peers
- The connection uses SNI "_federation" (reserved room name prefix) to distinguish from client connections
- After QUIC handshake, verify the peer's certificate fingerprint matches the configured fingerprint
- If fingerprint mismatch → reject, log warning
- If peer connects but isn't in our config → log the helpful "add to config" message, reject
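The accept/reject rules above could be sketched as two pure functions, one for connections we dial and one for connections we receive. All type and function names here (`PeerConfig`, `PeerDecision`, `check_outbound`, `check_inbound`) are hypothetical, and the sketch assumes the peer dials from its listening socket so that the remote address doubles as its federation URL in the hint.

```rust
use std::net::SocketAddr;

/// One entry of the `peers` list (field names follow the YAML keys).
#[derive(Debug, Clone, PartialEq)]
struct PeerConfig {
    url: String,
    fingerprint: String,
    label: String,
}

#[derive(Debug, PartialEq)]
enum PeerDecision {
    Accept { label: String },
    /// Known peer, wrong certificate: reject and log a warning.
    FingerprintMismatch { expected: String },
    /// Not in our config: reject and log the "add to config" hint.
    UnknownPeer { hint: String },
}

/// Outbound: we dialed `url`, so the certificate must match that entry.
fn check_outbound(peers: &[PeerConfig], url: &str, presented: &str) -> PeerDecision {
    match peers.iter().find(|p| p.url == url) {
        Some(p) if p.fingerprint.eq_ignore_ascii_case(presented) => {
            PeerDecision::Accept { label: p.label.clone() }
        }
        Some(p) => PeerDecision::FingerprintMismatch { expected: p.fingerprint.clone() },
        None => PeerDecision::UnknownPeer { hint: String::new() },
    }
}

/// Inbound: look the presented fingerprint up in our peers list.
fn check_inbound(peers: &[PeerConfig], presented: &str, remote: &SocketAddr) -> PeerDecision {
    match peers.iter().find(|p| p.fingerprint.eq_ignore_ascii_case(presented)) {
        Some(p) => PeerDecision::Accept { label: p.label.clone() },
        None => PeerDecision::UnknownPeer {
            hint: format!(
                "Relay {fp} ({ip}) wants to federate. To accept, add to peers config: url: {addr}, fingerprint: {fp}",
                fp = presented,
                ip = remote.ip(),
                addr = remote,
            ),
        },
    }
}
```

Keeping the decision separate from the QUIC plumbing makes the trust rules easy to unit test without a network.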
Room Bridging
Once two relays are connected:
- Room discovery: When a local participant joins room "T", the relay sends a FederationRoomJoin { room: "T" } signal to all connected peers
- Room leave: When the last local participant leaves room "T", send FederationRoomLeave { room: "T" }
- Media forwarding: For each room that exists on both relays:
  - Relay A forwards all media packets from its local participants to relay B
  - Relay B forwards all media packets from its local participants to relay A
  - Each relay then fans out received federated media to its local participants (same as local SFU forwarding)
- Participant presence: RoomUpdate signals are merged: local participants + federated participants from all peers
Relay A (2 local users)          Relay B (1 local user)
┌─────────────────────┐          ┌─────────────────────┐
│  Room "T"           │          │  Room "T"           │
│  Alice (local)  ────┼──media──►│  Charlie (local)    │
│  Bob (local)    ────┼──media──►│                     │
│                     │◄──media──┼── Charlie           │
│  Charlie (federated)│          │  Alice (federated)  │
│                     │          │  Bob (federated)    │
└─────────────────────┘          └─────────────────────┘
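The join/leave signalling described above can be sketched as a per-room counter of local participants. `LocalRooms` and `BridgeEvent` are hypothetical names, and this sketch assumes the join signal is only sent when a room becomes non-empty locally (the text could also be read as one signal per joining participant).

```rust
use std::collections::HashMap;

/// Signal that should be sent to all connected peers, if any.
#[derive(Debug, PartialEq)]
enum BridgeEvent {
    RoomJoin { room: String },
    RoomLeave { room: String },
}

/// Tracks how many local participants each room has.
#[derive(Default)]
struct LocalRooms {
    counts: HashMap<String, usize>,
}

impl LocalRooms {
    /// Records a local join; emits RoomJoin when the room becomes non-empty.
    fn join(&mut self, room: &str) -> Option<BridgeEvent> {
        let count = self.counts.entry(room.to_string()).or_insert(0);
        *count += 1;
        if *count == 1 {
            Some(BridgeEvent::RoomJoin { room: room.to_string() })
        } else {
            None
        }
    }

    /// Records a local leave; emits RoomLeave when the last participant is gone.
    /// Leaving a room with no local participants is ignored.
    fn leave(&mut self, room: &str) -> Option<BridgeEvent> {
        let count = self.counts.get_mut(room)?;
        *count -= 1;
        if *count == 0 {
            self.counts.remove(room);
            Some(BridgeEvent::RoomLeave { room: room.to_string() })
        } else {
            None
        }
    }
}
```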
Signal Messages (new)
enum FederationSignal {
    /// A room exists on this relay with active participants
    RoomJoin { room: String, participants: Vec<ParticipantInfo> },
    /// Room is empty on this relay
    RoomLeave { room: String },
    /// Participant update for a federated room
    ParticipantUpdate { room: String, participants: Vec<ParticipantInfo> },
}
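To illustrate how the participant lists carried by these signals might be merged into a single presence view, here is a small sketch. The fields of `ParticipantInfo` are not defined in this document, so `id` and `name` are placeholders; `Origin` and `merge_presence` are hypothetical, and de-duplicating by `id` is an assumption.

```rust
use std::collections::BTreeMap;

/// Placeholder shape; the real ParticipantInfo fields are not given here.
#[derive(Debug, Clone, PartialEq)]
struct ParticipantInfo {
    id: String,
    name: String,
}

/// Where a participant in the merged view is connected.
#[derive(Debug, Clone, PartialEq)]
enum Origin {
    Local,
    Federated { peer: String },
}

/// Merges local participants with the latest list received from each peer
/// (keyed by peer label). Local entries come first; an id that has already
/// been seen is skipped.
fn merge_presence(
    local: &[ParticipantInfo],
    federated: &BTreeMap<String, Vec<ParticipantInfo>>,
) -> Vec<(ParticipantInfo, Origin)> {
    let mut merged: Vec<(ParticipantInfo, Origin)> =
        local.iter().cloned().map(|p| (p, Origin::Local)).collect();
    for (peer, participants) in federated {
        for p in participants {
            if merged.iter().any(|(seen, _)| seen.id == p.id) {
                continue;
            }
            merged.push((p.clone(), Origin::Federated { peer: peer.clone() }));
        }
    }
    merged
}
```

When a peer link drops, removing that peer's entry from the map and re-running the merge would make its participants disappear from presence.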
Media Forwarding
Federated media is forwarded as raw QUIC datagrams — the relay doesn't decode/re-encode. Each packet is prefixed with a room identifier so the receiving relay knows which room to fan it out to:
[room_hash: 8 bytes][original_media_packet]
The 8-byte room hash is computed once when the federation room bridge is established.
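A minimal sketch of this framing follows. The document does not say which hash produces the 8 bytes, so 64-bit FNV-1a is used here purely as a stand-in that both relays can compute identically; `room_hash`, `frame`, and `unframe` are hypothetical names.

```rust
const ROOM_HASH_LEN: usize = 8;

/// Stand-in room hash: 64-bit FNV-1a over the room name, big-endian.
/// The actual hash function is an assumption, not taken from the design.
fn room_hash(room: &str) -> [u8; ROOM_HASH_LEN] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in room.as_bytes() {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h.to_be_bytes()
}

/// Prefixes an opaque media packet with the room hash.
fn frame(hash: &[u8; ROOM_HASH_LEN], media: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(ROOM_HASH_LEN + media.len());
    out.extend_from_slice(hash);
    out.extend_from_slice(media);
    out
}

/// Splits a received datagram into (room hash, media packet).
/// Returns None if the datagram is too short to carry the prefix.
fn unframe(datagram: &[u8]) -> Option<([u8; ROOM_HASH_LEN], &[u8])> {
    if datagram.len() < ROOM_HASH_LEN {
        return None;
    }
    let (head, media) = datagram.split_at(ROOM_HASH_LEN);
    let mut hash = [0u8; ROOM_HASH_LEN];
    hash.copy_from_slice(head);
    Some((hash, media))
}
```

The media bytes are never inspected, which matches the forwarding of opaque packets; the receiving relay only needs a map from room hash to local room.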
What Relays DON'T Do
- No transcoding — media passes through as-is. If Alice sends Opus 64k, Charlie receives Opus 64k
- No re-encryption — packets are already encrypted end-to-end between participants. Relays just forward opaque bytes
- No central coordinator — each relay independently connects to its configured peers. No master/slave, no consensus protocol
- No automatic peer discovery — peers must be explicitly configured in YAML
Failure Handling
- If a peer relay goes down, the federation link drops. Local rooms continue to work. Federated participants disappear from presence.
- Reconnection: retry after 30 seconds, then back off exponentially up to a maximum of 5 minutes between attempts
- If a peer relay restarts with a new identity (i.e., the identity bug above is still unfixed), the fingerprint check fails and federation is rejected with a clear error log
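One reading of the reconnection rule is a delay that starts at 30 seconds and doubles after each failed attempt until it reaches the 5-minute cap. The sketch below assumes exactly that schedule (doubling, no jitter); `reconnect_delay` and the constant names are hypothetical.

```rust
use std::time::Duration;

const INITIAL_DELAY_SECS: u64 = 30;
const MAX_DELAY_SECS: u64 = 300; // 5 minutes

/// Delay before reconnect attempt number `attempt` (0 = first retry).
/// Doubles each time and is capped at MAX_DELAY_SECS.
fn reconnect_delay(attempt: u32) -> Duration {
    // checked_shl avoids overflow for very large attempt counts.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let secs = INITIAL_DELAY_SECS.saturating_mul(factor).min(MAX_DELAY_SECS);
    Duration::from_secs(secs)
}
```

Resetting the attempt counter after a successful connection keeps a briefly restarted peer from waiting the full 5 minutes on its next outage.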
Implementation Plan
Phase 0: Fix Relay Identity (prerequisite)
- Derive TLS cert from persisted seed
- Same seed → same cert → same fingerprint across restarts
Phase 1: YAML Config + Peer Connection
- Add --config relay.yaml CLI flag
- Parse peers config
- On startup, connect to all configured peers via QUIC
- Verify certificate fingerprints
- Log helpful message for unconfigured peers
- Reconnect on disconnect
Phase 2: Room Bridging
- Track which rooms exist on each peer
- Forward media for shared rooms
- Merge participant presence across peers
- Handle room join/leave signals
Phase 3: Resilience
- Graceful handling of peer disconnect/reconnect
- Don't duplicate packets if a participant is reachable via multiple paths
- Rate limiting on federation links (prevent amplification)
- Metrics: federated rooms, packets forwarded, peer latency
Effort Estimates
| Phase | Scope | Effort |
|---|---|---|
| 0 | Fix relay TLS identity from seed | 0.5 day |
| 1 | YAML config + peer QUIC connections | 2 days |
| 2 | Room bridging + media forwarding + presence merge | 3-4 days |
| 3 | Resilience + metrics | 2 days |
Non-Goals (v1)
- Automatic peer discovery (mDNS, DHT, etc.)
- Cascading federation (relay A ↔ B ↔ C where A doesn't know C)
- Load balancing across relays
- Additional encryption between relays (QUIC already provides transport encryption; e2e encryption between participants is orthogonal)
- Bridging differently named rooms across relays (federated rooms are matched strictly by name)