wz-phone/docs/PRD-p2p-direct.md
Siavash Sameni 7b4bce69d5
docs: update all docs for hard NAT detection + relay wiring
- PROGRESS.md: hard NAT Phase A, relay cross-wiring, 588 tests
- ARCHITECTURE.md: hard NAT port prediction diagram + pattern table
- PRD-p2p-direct.md: Phase 8.6 split into a/b/c/d with status
- PRD-hard-nat.md: Phase A done, B signal ready, effort table updated
- PRD-netcheck.md: port_allocation field + probe documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:33:12 +04:00

# PRD: Peer-to-Peer Direct Calls (No Relay)
## Problem
All calls currently route through a relay, even 1-on-1 calls between clients that could reach each other directly. This adds latency (two hops instead of one), makes the relay a single point of failure, and requires trusting the relay operator: media is encrypted, but the relay still sees call metadata.
## Solution
For 1-on-1 calls, clients attempt a direct QUIC connection using STUN-discovered addresses. If NAT traversal succeeds, media flows directly between peers. If it fails, fall back to relay-assisted mode (current behavior).
## Architecture
```
Preferred (P2P):
  Client A ←──QUIC direct──→ Client B
  (no relay in media path, true E2E)

Fallback (Relay):
  Client A ──→ Relay ──→ Client B
  (current model)

Hybrid discovery:
  Client A → Relay (signaling only) → Client B
     ↓                                   ↓
  STUN server                       STUN server
     ↓                                   ↓
  Discover public IP:port      Discover public IP:port
     ↓                                   ↓
  Exchange candidates via relay signaling
     ↓                                   ↓
  Attempt direct QUIC connection ←──→
```
## Why P2P = True E2E
- QUIC TLS handshake establishes encrypted tunnel directly between A and B
- No third party sees the traffic
- Certificate pinning via identity fingerprints: each client derives their TLS cert from their Ed25519 seed (same as relay identity). During QUIC handshake, both sides verify the peer's cert fingerprint against the known identity
- MITM elimination: if A knows B's fingerprint (from prior call, QR code, or identity server), any interceptor presents a different cert → fingerprint mismatch → connection rejected
- Stronger guarantee than relay-assisted: user doesn't need to trust relay operator
## Requirements
### Phase 1: STUN Discovery
1. **STUN client**: lightweight UDP-based STUN client to discover the public IP:port
   - Use existing public STUN servers (stun.l.google.com:19302, etc.)
   - Or run a STUN server alongside the relay
   - Discover: local addresses, server-reflexive addresses (STUN), relay candidates (the relay itself as fallback; no separate TURN server)
2. **Candidate gathering**: on call initiation, gather all candidates:
   - Host candidates: local network interfaces
   - Server-reflexive: STUN-discovered public IP:port
   - Relay candidate: the relay's address (fallback)
3. **Candidate exchange**: via relay signaling channel (existing `IceCandidate` signal message)
   - A sends candidates to relay → relay forwards to B
   - B sends candidates to relay → relay forwards to A
### Phase 2: Direct Connection
1. **QUIC hole punching**: both clients simultaneously attempt QUIC connections to each other's candidates
   - Quinn supports connecting to multiple addresses
   - First successful connection wins
   - Time out after 3 seconds, then fall back to relay
2. **Identity verification**: during the QUIC handshake, verify the peer's TLS cert fingerprint
   - `server_config_from_seed()` already exists; derive the client cert from the same identity seed
   - Both sides present certs (mutual TLS)
   - Verify that the fingerprint matches the expected identity
3. **Media flow**: once connected, use the existing `QuinnTransport` for media + signals
   - Same `send_media()` / `recv_media()` API
   - Same codec pipeline, FEC, jitter buffer
   - No code changes needed in the call engine
### Phase 3: Adaptive Quality (P2P)
P2P connections have direct quality visibility — no relay middleman:
1. Both clients observe RTT, loss, jitter directly from QUIC stats
2. Adapt codec quality based on direct observations
3. Since only 2 participants, coordinated switching is simple: propose → ack → switch
This is the simplest case for adaptive quality. Once proven, backport the logic to relay-assisted mode.
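
The propose → ack → switch exchange can be sketched as a small state machine. This is a minimal illustration, not the project's implementation: `Quality`, `SwitchMsg`, and `QualitySwitcher` are hypothetical names, and the sketch does not handle glare (both peers proposing at the same moment), which a real protocol would need to resolve.

```rust
/// Hypothetical quality tiers; the real codec profiles are not specified here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Quality {
    Low,
    Medium,
    High,
}

/// Hypothetical wire messages for coordinated switching.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SwitchMsg {
    Propose(Quality),
    Ack(Quality),
}

#[derive(Debug)]
struct QualitySwitcher {
    active: Quality,
    pending: Option<Quality>,
}

impl QualitySwitcher {
    fn new(active: Quality) -> Self {
        Self { active, pending: None }
    }

    /// Local side wants a new tier: remember it and emit a proposal.
    /// The active tier does not change until the peer acknowledges.
    fn propose(&mut self, target: Quality) -> Option<SwitchMsg> {
        if target == self.active || self.pending.is_some() {
            return None;
        }
        self.pending = Some(target);
        Some(SwitchMsg::Propose(target))
    }

    /// Handle a message from the peer; returns a reply to send, if any.
    fn on_message(&mut self, msg: SwitchMsg) -> Option<SwitchMsg> {
        match msg {
            SwitchMsg::Propose(q) => {
                // Accept the proposal, switch, and acknowledge.
                self.active = q;
                self.pending = None;
                Some(SwitchMsg::Ack(q))
            }
            SwitchMsg::Ack(q) if self.pending == Some(q) => {
                // Our proposal was accepted: switch now.
                self.active = q;
                self.pending = None;
                None
            }
            // Stale or unexpected ack: ignore.
            SwitchMsg::Ack(_) => None,
        }
    }
}
```

The proposer keeps its current tier until the ack arrives, so both sides change tier within one round trip of each other.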
### Phase 4: Hybrid Mode
1. **Call initiation**: always connect to relay for signaling
2. **Parallel attempt**: while relay call is active, attempt P2P in background
3. **Seamless migration**: if P2P succeeds, migrate media path from relay to direct
   - Both clients switch simultaneously
   - Relay connection kept alive for signaling (presence, room updates)
4. **Fallback**: if P2P connection drops, seamlessly fall back to relay
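
The migration and fallback rules above amount to a two-state media path with the relay signaling link held open throughout. A minimal sketch, assuming hypothetical names (`MediaPath`, `PathEvent`, `HybridCall`) that do not come from the codebase:

```rust
/// Where media currently flows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MediaPath {
    Relay,
    Direct,
}

/// Hypothetical events raised by the background P2P attempt.
#[derive(Debug, Clone, Copy)]
enum PathEvent {
    DirectEstablished,
    DirectLost,
}

#[derive(Debug)]
struct HybridCall {
    media_path: MediaPath,
    /// The relay connection stays up for signaling in both states.
    relay_signaling_alive: bool,
}

impl HybridCall {
    /// Calls always start on the relay.
    fn start() -> Self {
        Self {
            media_path: MediaPath::Relay,
            relay_signaling_alive: true,
        }
    }

    /// Apply an event and return the media path now in use.
    fn on_event(&mut self, event: PathEvent) -> MediaPath {
        self.media_path = match event {
            PathEvent::DirectEstablished => MediaPath::Direct,
            PathEvent::DirectLost => MediaPath::Relay,
        };
        self.media_path
    }
}
```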
## Security Properties
| Property | Relay Mode | P2P Mode |
|----------|-----------|----------|
| Encryption | ChaCha20-Poly1305 (app layer) | QUIC TLS 1.3 + ChaCha20-Poly1305 |
| Key exchange | Via relay signaling | Direct QUIC handshake |
| Identity verification | TOFU (server fingerprint) | Mutual TLS cert pinning |
| Metadata privacy | Relay sees who talks to whom | Relay sees signaling only; no third party on the media path |
| MITM resistance | Depends on relay trust | Strong (cert pinning) |
| Forward secrecy | ECDH ephemeral keys | QUIC built-in + app-layer rekey |
## Implementation Notes
### STUN in Rust
Use `stun-rs` or `webrtc-rs` crate for STUN client. Minimal: just need Binding Request/Response to discover server-reflexive address.
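
For a sense of how small the minimal case is, the sketch below encodes a Binding Request and decodes an IPv4 XOR-MAPPED-ADDRESS by hand, following the RFC 5389 wire format. It uses only the standard library; the function names are illustrative and are not taken from either crate, and IPv6, retransmission, and the UDP send/receive loop are left out.

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// STUN magic cookie (RFC 5389).
const MAGIC_COOKIE: u32 = 0x2112_A442;
const BINDING_REQUEST: u16 = 0x0001;
const BINDING_SUCCESS: u16 = 0x0101;
const ATTR_XOR_MAPPED_ADDRESS: u16 = 0x0020;

/// Build a 20-byte Binding Request with no attributes.
fn encode_binding_request(transaction_id: [u8; 12]) -> [u8; 20] {
    let mut msg = [0u8; 20];
    msg[0..2].copy_from_slice(&BINDING_REQUEST.to_be_bytes());
    // Bytes 2..4 stay zero: the message body is empty.
    msg[4..8].copy_from_slice(&MAGIC_COOKIE.to_be_bytes());
    msg[8..20].copy_from_slice(&transaction_id);
    msg
}

/// Decode the 8-byte value of an IPv4 XOR-MAPPED-ADDRESS attribute.
fn decode_xor_mapped_v4(value: &[u8]) -> Option<SocketAddrV4> {
    if value.len() != 8 || value[1] != 0x01 {
        return None; // wrong size or not the IPv4 family
    }
    let xport = u16::from_be_bytes([value[2], value[3]]);
    let xaddr = u32::from_be_bytes([value[4], value[5], value[6], value[7]]);
    // Port is XORed with the top 16 bits of the cookie, address with all 32.
    let port = xport ^ (MAGIC_COOKIE >> 16) as u16;
    let addr = xaddr ^ MAGIC_COOKIE;
    Some(SocketAddrV4::new(Ipv4Addr::from(addr), port))
}

/// Validate a Binding Success Response and extract the reflexive address.
fn parse_binding_response(msg: &[u8], transaction_id: &[u8; 12]) -> Option<SocketAddrV4> {
    if msg.len() < 20 {
        return None;
    }
    let msg_type = u16::from_be_bytes([msg[0], msg[1]]);
    let body_len = u16::from_be_bytes([msg[2], msg[3]]) as usize;
    let cookie = u32::from_be_bytes([msg[4], msg[5], msg[6], msg[7]]);
    if msg_type != BINDING_SUCCESS
        || cookie != MAGIC_COOKIE
        || msg[8..20] != transaction_id[..]
        || msg.len() < 20 + body_len
    {
        return None;
    }
    // Walk the attributes; each value is padded to a 4-byte boundary.
    let mut attrs = &msg[20..20 + body_len];
    while attrs.len() >= 4 {
        let attr_type = u16::from_be_bytes([attrs[0], attrs[1]]);
        let attr_len = u16::from_be_bytes([attrs[2], attrs[3]]) as usize;
        if attrs.len() < 4 + attr_len {
            return None;
        }
        if attr_type == ATTR_XOR_MAPPED_ADDRESS {
            return decode_xor_mapped_v4(&attrs[4..4 + attr_len]);
        }
        let padded = (attr_len + 3) & !3;
        if attrs.len() < 4 + padded {
            return None;
        }
        attrs = &attrs[4 + padded..];
    }
    None
}
```

Checking the transaction ID on the response matters because the socket is shared with other traffic and stray or spoofed datagrams should not be mistaken for the reflexive address.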
### Quinn Hole Punching
Quinn's `Endpoint` can both listen and connect. For hole punching:
```rust
let endpoint = create_endpoint(bind_addr, Some(server_config))?;
// Send connect to peer's address (opens NAT pinhole)
let conn = connect(&endpoint, peer_addr, "peer", client_config).await?;
// Simultaneously, peer connects to our address
// First successful handshake wins
```
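
The "first successful handshake wins, otherwise fall back after a timeout" rule can be illustrated without any QUIC dependency. In the sketch below, blocking threads and a channel stand in for Quinn's async connects; `race_candidates`, `RaceOutcome`, and the `attempt` closure are hypothetical and only model the control flow, not the project's `race()` function.

```rust
use std::net::SocketAddr;
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Outcome of the race: a direct path to one candidate, or relay fallback.
#[derive(Debug, PartialEq)]
enum RaceOutcome {
    Direct(SocketAddr),
    RelayFallback,
}

/// Try every candidate in parallel; the first successful attempt wins.
/// `attempt` is a hypothetical blocking stand-in for a QUIC handshake.
fn race_candidates<F>(candidates: &[SocketAddr], timeout: Duration, attempt: F) -> RaceOutcome
where
    F: Fn(SocketAddr) -> bool + Send + Sync + 'static,
{
    let attempt = Arc::new(attempt);
    let (tx, rx) = mpsc::channel();
    for &addr in candidates {
        let tx = tx.clone();
        let attempt = Arc::clone(&attempt);
        thread::spawn(move || {
            if (*attempt)(addr) {
                // The receiver may already be gone if another candidate won.
                let _ = tx.send(addr);
            }
        });
    }
    // Drop our sender so the channel closes once every attempt has failed.
    drop(tx);

    match rx.recv_timeout(timeout) {
        Ok(addr) => RaceOutcome::Direct(addr),
        // Either the timeout expired or all attempts failed.
        Err(_) => RaceOutcome::RelayFallback,
    }
}
```

With the 3-second budget from Phase 2, a caller would pass `Duration::from_secs(3)` and keep using the relay path whenever `RelayFallback` is returned.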
### Client TLS Certificate
`server_config_from_seed()` already exists for relays. Add a matching `client_config_from_seed()` that presents a TLS client certificate derived from the identity seed; the peer verifies that certificate's fingerprint.
### Signaling via Relay
The existing relay connection carries `IceCandidate` signals. No new infrastructure needed — just use the relay as a dumb signaling pipe for candidate exchange.
## Non-Goals (v1)
- SFU over P2P (P2P is 1-on-1 only; multi-party uses relay SFU)
- TURN server (relay acts as the fallback, no separate TURN)
- mDNS local discovery (future)
- Mesh P2P for multi-party (future, complex)
## Milestones
| Phase | Scope | Effort | Status |
|-------|-------|--------|--------|
| 1 | STUN client + candidate gathering | 2 days | Done |
| 2 | QUIC hole punching + identity verification | 3 days | Done |
| 3 | Adaptive quality on P2P connection | 2 days | Done (#23) |
| 4 | Hybrid mode (relay + P2P, seamless migration) | 3 days | Done |
| 5 | Single-socket Nebula (shared signal+direct endpoint) | 2 days | Done |
| 6 | ICE path negotiation + dual-path race | 3 days | Done |
| 7 | IPv6 dual-socket | 2 days | Done (but `dual_path.rs` integration tests broken — missing `ipv6_endpoint` arg) |
| 8.1 | Public STUN client (RFC 5389) | 1 day | Done |
| 8.2 | PCP/PMP/UPnP port mapping | 2 days | Done |
| 8.3 | Mid-call ICE re-gathering + CandidateUpdate signal | 2 days | Done (signal plane; transport hot-swap TODO) |
| 8.4 | Netcheck diagnostic | 1 day | Done |
| 8.5 | Region-based relay selection (data model) | 1 day | Done |
| 8.6a | Hard NAT: port allocation detection | 1 day | Done |
| 8.6b | Hard NAT: sequential port prediction signal | 1 day | Done (signal + prediction fn; dial integration pending) |
| 8.6c | Hard NAT: birthday attack (256×1024 probes) | 3 days | Not started |
| 8.6d | Hard NAT: hybrid waterfall + background upgrade | 2 days | Not started |
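
The 256×1024 figure in row 8.6c can be sanity-checked with a short calculation. The sketch below assumes a simplified model that is not stated in this document: the hard-NAT side holds 256 distinct external ports chosen uniformly from 65,535, and the other side sends 1,024 independent, uniformly random probes. Under that model the chance that at least one probe lands on an open port is about 98%.

```rust
/// Probability that at least one of `probes` independent, uniformly random
/// probes hits one of `open_ports` distinct ports out of `port_space`.
fn hit_probability(open_ports: u32, probes: u32, port_space: u32) -> f64 {
    // Chance that a single probe misses every open port.
    let miss_one = 1.0 - f64::from(open_ports) / f64::from(port_space);
    // All probes must miss for the attempt to fail.
    1.0 - miss_one.powi(probes as i32)
}
```

Calling `hit_probability(256, 1024, 65_535)` gives roughly 0.98; real NATs may draw from a narrower port range, which would raise the hit rate.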
## Implementation Status (2026-04-13)
Phases 1-2, 4-7 are implemented. First P2P call completed 2026-04-12.
### Known regression
Phase 7 added `ipv6_endpoint: Option<Endpoint>` parameter to `race()` in `crates/wzp-client/src/dual_path.rs` but the 3 test call sites in `crates/wzp-client/tests/dual_path.rs` (lines 111, 153, 191) were not updated — they pass 6 args instead of 7. Fix: add `None,` after the `shared_endpoint` arg in each call.
## Update (2026-04-13)
P2P adaptive quality (#23) now implemented:
- Both peers self-observe network quality from QUIC path stats
- Quality reports generated every ~1s and attached to outgoing packets
- AdaptiveQualityController drives codec switching on both P2P and relay calls
## Update (2026-04-14): Phase 8 — Tailscale-Inspired Enhancements
Added 5 new modules to bring NAT traversal capability close to Tailscale's:
### Phase 8.1: Public STUN Client (Done)
- `stun.rs`: RFC 5389 Binding Request/Response over raw UDP
- Independent reflexive discovery via public STUN servers (Google, Cloudflare)
- `detect_nat_type_with_stun()` combines relay + STUN probes for higher confidence
- STUN fallback in desktop's `try_reflect_own_addr()` when relay reflection fails
### Phase 8.2: PCP/PMP/UPnP Port Mapping (Done)
- `portmap.rs`: NAT-PMP (RFC 6886), PCP (RFC 6887), UPnP IGD
- Gateway discovery (macOS + Linux), try NAT-PMP → PCP → UPnP in sequence
- New candidate type: `PeerCandidates.mapped` + signal fields `caller_mapped_addr`/`callee_mapped_addr`/`peer_mapped_addr`
- Dial order: host → mapped → reflexive (mapped helps on symmetric NATs)
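
The dial order can be sketched as a flattening of the candidate set. Only the `mapped` field name comes from this document; the `host` and `reflexive` field names, their types, and the `dial_order()` method are assumptions made for illustration.

```rust
use std::net::SocketAddr;

/// Hypothetical shape of the peer's candidate set.
#[derive(Debug, Default)]
struct PeerCandidates {
    host: Vec<SocketAddr>,
    mapped: Option<SocketAddr>,
    reflexive: Option<SocketAddr>,
}

impl PeerCandidates {
    /// Addresses to dial, in priority order: host, then mapped, then reflexive.
    /// Duplicates are dropped so the same address is not dialed twice.
    fn dial_order(&self) -> Vec<SocketAddr> {
        let mut order = Vec::new();
        let all = self
            .host
            .iter()
            .chain(self.mapped.iter())
            .chain(self.reflexive.iter());
        for &addr in all {
            if !order.contains(&addr) {
                order.push(addr);
            }
        }
        order
    }
}
```

Deduplication matters when a port mapping and the STUN probe report the same external address, which is common on NATs that honor the requested port.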
### Phase 8.3: Mid-Call ICE Re-Gathering (Done — signal plane)
- `ice_agent.rs`: `IceAgent` with `gather()`, `re_gather()`, `apply_peer_update()`
- `SignalMessage::CandidateUpdate` with monotonic generation counter
- Relay forwards `CandidateUpdate` like `MediaPathReport`
- Desktop handles and emits to JS frontend
- Transport hot-swap: designed but not yet wired into live call engine
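
The purpose of the monotonic generation counter is to let the receiver discard reordered or duplicated updates. A minimal sketch of that rule follows; `PeerCandidateState` and `apply_update()` are hypothetical names, and the real `IceAgent::apply_peer_update()` signature is not shown in this document.

```rust
use std::net::SocketAddr;

/// Hypothetical receiver-side state for a peer's candidate updates.
#[derive(Debug, Default)]
struct PeerCandidateState {
    last_generation: Option<u64>,
    candidates: Vec<SocketAddr>,
}

impl PeerCandidateState {
    /// Apply an update only if its generation is newer than anything seen.
    /// Returns true when the candidate set was replaced.
    fn apply_update(&mut self, generation: u64, candidates: Vec<SocketAddr>) -> bool {
        if matches!(self.last_generation, Some(seen) if generation <= seen) {
            return false; // stale or duplicate update: ignore
        }
        self.last_generation = Some(generation);
        self.candidates = candidates;
        true
    }
}
```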
### Phase 8.4: Netcheck Diagnostic (Done)
- `netcheck.rs`: comprehensive network diagnostic (NAT type, reflexive addr, IPv4/v6, port mapping, relay latencies)
- CLI: `wzp-client --netcheck <relay>`
### Phase 8.5: Region-Based Relay Selection (Done — data model)
- `relay_map.rs`: `RelayMap` sorted by RTT with `preferred()` selection
- `RegisterPresenceAck` extended with `relay_region` + `available_relays`
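
A relay map sorted by RTT with a `preferred()` accessor could look roughly like the sketch below. Only the names `RelayMap` and `preferred()` come from this document; the entry fields, the `upsert()` method, and the choice to sort unmeasured relays last are assumptions.

```rust
use std::time::Duration;

/// Hypothetical relay entry; `rtt` is None until the relay has been probed.
#[derive(Debug, Clone, PartialEq)]
struct RelayEntry {
    region: String,
    rtt: Option<Duration>,
}

#[derive(Debug, Default)]
struct RelayMap {
    entries: Vec<RelayEntry>,
}

impl RelayMap {
    /// Insert or update a relay, keeping entries sorted by RTT.
    fn upsert(&mut self, region: &str, rtt: Option<Duration>) {
        if let Some(i) = self.entries.iter().position(|e| e.region == region) {
            self.entries[i].rtt = rtt;
        } else {
            self.entries.push(RelayEntry {
                region: region.to_string(),
                rtt,
            });
        }
        // Measured relays first (lowest RTT first), unmeasured relays last.
        self.entries.sort_by_key(|e| (e.rtt.is_none(), e.rtt));
    }

    /// The lowest-RTT relay, if any relay is known.
    fn preferred(&self) -> Option<&RelayEntry> {
        self.entries.first()
    }
}
```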
### Phase 8.6: Hard NAT Traversal (Phase A done, B-D pending)
- **Phase A (Done)**: Port allocation pattern detection — `PortAllocation` enum (`PortPreserving`/`Sequential{delta}`/`Random`/`Unknown`), `detect_port_allocation()` probes N STUN servers from single socket, `classify_port_allocation()` with wraparound + jitter tolerance, `predict_ports()` for sequential NATs
- **Phase B (signal ready)**: `HardNatProbe` signal message carries `port_sequence`, `allocation`, `external_ip` — relay forwarding implemented. Actual dial-to-predicted-ports integration into `dual_path::race()` pending.
- **Phase C (not started)**: Birthday attack (256 sockets × 1024 probes) for random NATs
- **Phase D (not started)**: Hybrid waterfall with background relay-to-direct upgrade
- `NetcheckReport.port_allocation` populated automatically from `detect_port_allocation()`
- See `docs/PRD-hard-nat.md` for full design
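
To make the Phase A idea concrete, the sketch below classifies a series of external ports observed from one socket and predicts the next ports for a sequential NAT. It is an illustration under stated assumptions, not the project's `classify_port_allocation()` or `predict_ports()`: the function names and signatures, the `i32` delta type, the median-based jitter check, and wraparound modulo 2^16 are all guesses. A stable mapping that differs from the local port comes out as `Sequential { delta: 0 }` in this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PortAllocation {
    PortPreserving,
    Sequential { delta: i32 },
    Random,
    Unknown,
}

/// Signed step from port `a` to port `b`, wrapping at 2^16.
fn port_step(a: u16, b: u16) -> i32 {
    b.wrapping_sub(a) as i16 as i32
}

/// Classify external ports observed, in probe order, from a single socket.
fn classify_ports(local_port: u16, observed: &[u16], jitter: i32) -> PortAllocation {
    if observed.is_empty() {
        return PortAllocation::Unknown;
    }
    if observed.iter().all(|&p| p == local_port) {
        return PortAllocation::PortPreserving;
    }
    if observed.len() < 3 {
        return PortAllocation::Unknown; // too few samples to confirm a pattern
    }
    let mut steps: Vec<i32> = observed
        .windows(2)
        .map(|w| port_step(w[0], w[1]))
        .collect();
    steps.sort_unstable();
    let median = steps[steps.len() / 2];
    // Sequential if every step stays within `jitter` of the median step.
    if steps.iter().all(|s| (s - median).abs() <= jitter) {
        PortAllocation::Sequential { delta: median }
    } else {
        PortAllocation::Random
    }
}

/// Predict the next `count` external ports after `last_port`.
fn predict_next_ports(last_port: u16, delta: i32, count: usize) -> Vec<u16> {
    (1..=count as i32)
        .map(|i| last_port.wrapping_add(delta.wrapping_mul(i) as u16))
        .collect()
}
```

For example, observing ports 40001, 40003, 40005 yields `Sequential { delta: 2 }`, and `predict_next_ports(40005, 2, 3)` returns 40007, 40009, 40011 as the ports to dial.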