docs: PRD for relay federation (multi-relay mesh) + identity fix
Some checks failed
Mirror to GitHub / mirror (push) Failing after 36s
Build Release Binaries / build-amd64 (push) Failing after 1m53s

Documents the relay TLS identity bug (cert regenerates on restart
because server_config() creates a new keypair every time, ignoring
the persisted Ed25519 seed) and the full federation design:

- YAML config with mutual peer trust (url + fingerprint)
- QUIC connections between peers, fingerprint verification
- Room bridging: media forwarding for shared room names
- Merged participant presence across relays
- Helpful log message for unconfigured peer connection attempts
- No transcoding, no re-encryption, no central coordinator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Siavash Sameni
2026-04-07 21:33:05 +04:00
parent d66d583583
commit b97f32ce46

View File

@@ -0,0 +1,170 @@
# PRD: Relay Federation (Multi-Relay Mesh)
## Problem
Currently all participants in a call must connect to the same relay. This creates:
- **Single point of failure** — if the relay goes down, the entire call drops
- **Geographic latency** — users far from the relay get high RTT
- **Capacity limits** — one relay handles all traffic
Users should be able to connect to their nearest/preferred relay and still talk to users on other relays, as long as the relays are federated.
## Prerequisite: Fix Relay Identity Persistence
### Bug: TLS certificate regenerates on every restart
**Root cause:** `wzp-transport/src/config.rs:17` calls `rcgen::generate_simple_self_signed()` which creates a new keypair every time. The relay's Ed25519 identity seed IS persisted to `~/.wzp/relay-identity`, but the TLS certificate is not derived from it.
**Impact:** Clients see a different server fingerprint after every relay restart, triggering the "Server Key Changed" warning. This also breaks federation since relays identify each other by certificate fingerprint.
**Fix:** Derive the TLS certificate from the persisted relay seed:
1. Add `server_config_from_seed(seed: &[u8; 32])` to `wzp-transport`
2. Use the seed to create a deterministic keypair (e.g., derive an ECDSA key via HKDF from the Ed25519 seed)
3. Generate a self-signed cert with that keypair — same seed = same cert = same fingerprint
4. The relay passes its loaded seed to `server_config_from_seed()` instead of `server_config()`
**Effort:** 0.5 day
## Federation Design
### Core Concept
Two or more relays form a **federation mesh**. Each relay is an independent SFU. When relays are configured to trust each other, they bridge rooms with matching names — participants on relay A in room "podcast" hear participants on relay B in room "podcast" as if everyone were on the same relay.
### Configuration
Each relay reads a YAML config file (e.g., `~/.wzp/relay.yaml` or `--config relay.yaml`):
```yaml
# Relay identity (auto-generated if missing)
listen: 0.0.0.0:4433
# Federation peers — other relays we trust and bridge rooms with
# Both sides must configure each other for federation to work
peers:
- url: "193.180.213.68:4433"
fingerprint: "a5d6:e3c6:5ae7:185c:4eb1:af89:daed:4a43"
label: "Pangolin EU"
- url: "10.0.0.5:4433"
fingerprint: "7f2a:b391:0c44:..."
label: "Office LAN"
```
**Key rules:**
- Both relays must configure each other — **mutual trust** required
- A relay that receives a connection from an unknown peer logs: `"Relay a5d6:e3c6:... (193.180.213.68) wants to federate. To accept, add to peers config: url: 193.180.213.68:4433, fingerprint: a5d6:e3c6:..."`
- Fingerprints are verified via the TLS certificate (requires the identity fix above)
### Protocol
#### Peer Connection
1. On startup, each relay attempts QUIC connections to all configured peers
2. The connection uses SNI `"_federation"` (reserved room name prefix) to distinguish from client connections
3. After QUIC handshake, verify the peer's certificate fingerprint matches the configured fingerprint
4. If fingerprint mismatch → reject, log warning
5. If peer connects but isn't in our config → log the helpful "add to config" message, reject
#### Room Bridging
Once two relays are connected:
1. **Room discovery**: When a local participant joins room "T", the relay sends a `FederationRoomJoin { room: "T" }` signal to all connected peers
2. **Room leave**: When the last local participant leaves room "T", send `FederationRoomLeave { room: "T" }`
3. **Media forwarding**: For each room that exists on both relays:
- Relay A forwards all media packets from its local participants to relay B
- Relay B forwards all media packets from its local participants to relay A
- Each relay then fans out received federated media to its local participants (same as local SFU forwarding)
4. **Participant presence**: `RoomUpdate` signals are merged — local participants + federated participants from all peers
```
Relay A (2 local users) Relay B (1 local user)
┌─────────────────────┐ ┌─────────────────────┐
│ Room "T" │ │ Room "T" │
│ Alice (local) ────┼──media──►│ Charlie (local) │
│ Bob (local) ────┼──media──►│ │
│ │◄──media──┼── Charlie │
│ Charlie (federated)│ │ Alice (federated) │
│ │ │ Bob (federated) │
└─────────────────────┘ └─────────────────────┘
```
#### Signal Messages (new)
```rust
enum FederationSignal {
/// A room exists on this relay with active participants
RoomJoin { room: String, participants: Vec<ParticipantInfo> },
/// Room is empty on this relay
RoomLeave { room: String },
/// Participant update for a federated room
ParticipantUpdate { room: String, participants: Vec<ParticipantInfo> },
}
```
#### Media Forwarding
Federated media is forwarded as raw QUIC datagrams — the relay doesn't decode/re-encode. Each packet is prefixed with a room identifier so the receiving relay knows which room to fan it out to:
```
[room_hash: 8 bytes][original_media_packet]
```
The 8-byte room hash is computed once when the federation room bridge is established.
### What Relays DON'T Do
- **No transcoding** — media passes through as-is. If Alice sends Opus 64k, Charlie receives Opus 64k
- **No re-encryption** — packets are already encrypted end-to-end between participants. Relays just forward opaque bytes
- **No central coordinator** — each relay independently connects to its configured peers. No master/slave, no consensus protocol
- **No automatic peer discovery** — peers must be explicitly configured in YAML
### Failure Handling
- If a peer relay goes down, the federation link drops. Local rooms continue to work. Federated participants disappear from presence.
- Reconnection: attempt every 30 seconds with exponential backoff up to 5 minutes
- If a peer relay restarts with a new identity (bug not fixed), the fingerprint check fails and federation is rejected with a clear error log
## Implementation Plan
### Phase 0: Fix Relay Identity (prerequisite)
- Derive TLS cert from persisted seed
- Same seed → same cert → same fingerprint across restarts
### Phase 1: YAML Config + Peer Connection
- Add `--config relay.yaml` CLI flag
- Parse peers config
- On startup, connect to all configured peers via QUIC
- Verify certificate fingerprints
- Log helpful message for unconfigured peers
- Reconnect on disconnect
### Phase 2: Room Bridging
- Track which rooms exist on each peer
- Forward media for shared rooms
- Merge participant presence across peers
- Handle room join/leave signals
### Phase 3: Resilience
- Graceful handling of peer disconnect/reconnect
- Don't duplicate packets if a participant is reachable via multiple paths
- Rate limiting on federation links (prevent amplification)
- Metrics: federated rooms, packets forwarded, peer latency
## Effort Estimates
| Phase | Scope | Effort |
|-------|-------|--------|
| 0 | Fix relay TLS identity from seed | 0.5 day |
| 1 | YAML config + peer QUIC connections | 2 days |
| 2 | Room bridging + media forwarding + presence merge | 3-4 days |
| 3 | Resilience + metrics | 2 days |
## Non-Goals (v1)
- Automatic peer discovery (mDNS, DHT, etc.)
- Cascading federation (relay A ↔ B ↔ C where A doesn't know C)
- Load balancing across relays
- Encryption between relays (QUIC provides transport encryption; e2e encryption between participants is orthogonal)
- Different rooms on different relays (all federated rooms are bridged by name)