diff --git a/docs/PRD-relay-federation.md b/docs/PRD-relay-federation.md new file mode 100644 index 0000000..5f00fc9 --- /dev/null +++ b/docs/PRD-relay-federation.md @@ -0,0 +1,170 @@ +# PRD: Relay Federation (Multi-Relay Mesh) + +## Problem + +Currently all participants in a call must connect to the same relay. This creates: +- **Single point of failure** — if the relay goes down, the entire call drops +- **Geographic latency** — users far from the relay get high RTT +- **Capacity limits** — one relay handles all traffic + +Users should be able to connect to their nearest/preferred relay and still talk to users on other relays, as long as the relays are federated. + +## Prerequisite: Fix Relay Identity Persistence + +### Bug: TLS certificate regenerates on every restart + +**Root cause:** `wzp-transport/src/config.rs:17` calls `rcgen::generate_simple_self_signed()` which creates a new keypair every time. The relay's Ed25519 identity seed IS persisted to `~/.wzp/relay-identity`, but the TLS certificate is not derived from it. + +**Impact:** Clients see a different server fingerprint after every relay restart, triggering the "Server Key Changed" warning. This also breaks federation since relays identify each other by certificate fingerprint. + +**Fix:** Derive the TLS certificate from the persisted relay seed: +1. Add `server_config_from_seed(seed: &[u8; 32])` to `wzp-transport` +2. Use the seed to create a deterministic keypair (e.g., derive an ECDSA key via HKDF from the Ed25519 seed) +3. Generate a self-signed cert with that keypair — same seed = same cert = same fingerprint +4. The relay passes its loaded seed to `server_config_from_seed()` instead of `server_config()` + +**Effort:** 0.5 day + +## Federation Design + +### Core Concept + +Two or more relays form a **federation mesh**. Each relay is an independent SFU. When relays are configured to trust each other, they bridge rooms with matching names — participants on relay A in room "podcast" hear participants on relay B in room "podcast" as if everyone were on the same relay. + +### Configuration + +Each relay reads a YAML config file (e.g., `~/.wzp/relay.yaml` or `--config relay.yaml`): + +```yaml +# Relay identity (auto-generated if missing) +listen: 0.0.0.0:4433 + +# Federation peers — other relays we trust and bridge rooms with +# Both sides must configure each other for federation to work +peers: + - url: "193.180.213.68:4433" + fingerprint: "a5d6:e3c6:5ae7:185c:4eb1:af89:daed:4a43" + label: "Pangolin EU" + + - url: "10.0.0.5:4433" + fingerprint: "7f2a:b391:0c44:..." + label: "Office LAN" +``` + +**Key rules:** +- Both relays must configure each other — **mutual trust** required +- A relay that receives a connection from an unknown peer logs: `"Relay a5d6:e3c6:... (193.180.213.68) wants to federate. To accept, add to peers config: url: 193.180.213.68:4433, fingerprint: a5d6:e3c6:..."` +- Fingerprints are verified via the TLS certificate (requires the identity fix above) + +### Protocol + +#### Peer Connection + +1. On startup, each relay attempts QUIC connections to all configured peers +2. The connection uses SNI `"_federation"` (reserved room name prefix) to distinguish from client connections +3. After QUIC handshake, verify the peer's certificate fingerprint matches the configured fingerprint +4. If fingerprint mismatch → reject, log warning +5. If peer connects but isn't in our config → log the helpful "add to config" message, reject + +#### Room Bridging + +Once two relays are connected: + +1. **Room discovery**: When a local participant joins room "T", the relay sends a `FederationRoomJoin { room: "T" }` signal to all connected peers +2. **Room leave**: When the last local participant leaves room "T", send `FederationRoomLeave { room: "T" }` +3. **Media forwarding**: For each room that exists on both relays: + - Relay A forwards all media packets from its local participants to relay B + - Relay B forwards all media packets from its local participants to relay A + - Each relay then fans out received federated media to its local participants (same as local SFU forwarding) +4. **Participant presence**: `RoomUpdate` signals are merged — local participants + federated participants from all peers + +``` +Relay A (2 local users) Relay B (1 local user) +┌─────────────────────┐ ┌─────────────────────┐ +│ Room "T" │ │ Room "T" │ +│ Alice (local) ────┼──media──►│ Charlie (local) │ +│ Bob (local) ────┼──media──►│ │ +│ │◄──media──┼── Charlie │ +│ Charlie (federated)│ │ Alice (federated) │ +│ │ │ Bob (federated) │ +└─────────────────────┘ └─────────────────────┘ +``` + +#### Signal Messages (new) + +```rust +enum FederationSignal { + /// A room exists on this relay with active participants + RoomJoin { room: String, participants: Vec }, + /// Room is empty on this relay + RoomLeave { room: String }, + /// Participant update for a federated room + ParticipantUpdate { room: String, participants: Vec }, +} +``` + +#### Media Forwarding + +Federated media is forwarded as raw QUIC datagrams — the relay doesn't decode/re-encode. Each packet is prefixed with a room identifier so the receiving relay knows which room to fan it out to: + +``` +[room_hash: 8 bytes][original_media_packet] +``` + +The 8-byte room hash is computed once when the federation room bridge is established. + +### What Relays DON'T Do + +- **No transcoding** — media passes through as-is. If Alice sends Opus 64k, Charlie receives Opus 64k +- **No re-encryption** — packets are already encrypted end-to-end between participants. Relays just forward opaque bytes +- **No central coordinator** — each relay independently connects to its configured peers. No master/slave, no consensus protocol +- **No automatic peer discovery** — peers must be explicitly configured in YAML + +### Failure Handling + +- If a peer relay goes down, the federation link drops. Local rooms continue to work. Federated participants disappear from presence. +- Reconnection: attempt every 30 seconds with exponential backoff up to 5 minutes +- If a peer relay restarts with a new identity (bug not fixed), the fingerprint check fails and federation is rejected with a clear error log + +## Implementation Plan + +### Phase 0: Fix Relay Identity (prerequisite) +- Derive TLS cert from persisted seed +- Same seed → same cert → same fingerprint across restarts + +### Phase 1: YAML Config + Peer Connection +- Add `--config relay.yaml` CLI flag +- Parse peers config +- On startup, connect to all configured peers via QUIC +- Verify certificate fingerprints +- Log helpful message for unconfigured peers +- Reconnect on disconnect + +### Phase 2: Room Bridging +- Track which rooms exist on each peer +- Forward media for shared rooms +- Merge participant presence across peers +- Handle room join/leave signals + +### Phase 3: Resilience +- Graceful handling of peer disconnect/reconnect +- Don't duplicate packets if a participant is reachable via multiple paths +- Rate limiting on federation links (prevent amplification) +- Metrics: federated rooms, packets forwarded, peer latency + +## Effort Estimates + +| Phase | Scope | Effort | +|-------|-------|--------| +| 0 | Fix relay TLS identity from seed | 0.5 day | +| 1 | YAML config + peer QUIC connections | 2 days | +| 2 | Room bridging + media forwarding + presence merge | 3-4 days | +| 3 | Resilience + metrics | 2 days | + +## Non-Goals (v1) + +- Automatic peer discovery (mDNS, DHT, etc.) +- Cascading federation (relay A ↔ B ↔ C where A doesn't know C) +- Load balancing across relays +- Encryption between relays (QUIC provides transport encryption; e2e encryption between participants is orthogonal) +- Different rooms on different relays (all federated rooms are bridged by name)