docs: protocol audit 2026-05-25, update architecture + Obsidian vault

Audit:
- docs/AUDIT-2026-05-25.md: full protocol audit covering 8 findings (4 critical, 2 high, 5 medium, 4 low) with code references and fix effort estimates
- vault/Audit/Tasks.md: Obsidian Tasks plugin file tracking all audit items with priorities, due dates, and per-step checklists

Architecture docs updated for Wire format v2 and Wave 5/6 features:
- ARCHITECTURE.md: adds wzp-video to dependency graph and project structure; wire format updated to v2 (16B header, 5B MiniHeader); relay concurrency section corrected (DashMap+RwLock is current, not a future optimization); test count 571→702; Android note
- PROGRESS.md: Wave 5 and Wave 6 sections appended; test count 372→702; current status and open blockers as of 2026-05-25
- ROAD-TO-VIDEO.md: implementation status table inserted (✅/🟡/🔴/🔲 per phase); 6-step critical path to first video call
- WZP-SPEC.md: MediaHeader updated to v2 (16B byte-aligned); MiniHeader updated to 5B with seq_delta; codec IDs 9-12 added (H.264/H.265/AV1); version negotiation section added

Obsidian vault (vault/):
- 114 files across Architecture/, PRDs/, Reports/, Android/, Reference/, Audit/ with YAML frontmatter
- 00 - Home.md index note with wiki links
- .obsidian/app.json config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 06:00:17 +04:00
parent 12b0d9738f
commit ed8a7ae5aa
120 changed files with 22781 additions and 65 deletions
--- a/vault/PRDs/PRD-p2p-direct.md
+++ b/vault/PRDs/PRD-p2p-direct.md
@@ -0,0 +1,217 @@
+---
+tags: [prd, wzp]
+type: prd
+---
+
+# PRD: Peer-to-Peer Direct Calls (No Relay)
+
+## Problem
+
+All calls currently route through a relay, even 1-on-1 calls between clients that could reach each other directly. This adds latency (two hops instead of one), creates a single point of failure, and requires trusting the relay operator: media is encrypted, but the relay still sees call metadata.
+
+## Solution
+
+For 1-on-1 calls, clients attempt a direct QUIC connection using STUN-discovered addresses. If NAT traversal succeeds, media flows directly between peers. If it fails, fall back to relay-assisted mode (current behavior).
+
+## Architecture
+
+```
+Preferred (P2P):
+  Client A ←──QUIC direct──→ Client B
+  (no relay in media path, true E2E)
+
+Fallback (Relay):
+  Client A ──→ Relay ──→ Client B
+  (current model)
+
+Hybrid discovery:
+  Client A → Relay (signaling only) → Client B
+       ↓                                    ↓
+    STUN server                        STUN server
+       ↓                                    ↓
+    Discover public IP:port          Discover public IP:port
+       ↓                                    ↓
+    Exchange candidates via relay signaling
+       ↓                                    ↓
+    Attempt direct QUIC connection ←──→
+```
+
+## Why P2P = True E2E
+
+- QUIC TLS handshake establishes encrypted tunnel directly between A and B
+- No third party sees the traffic
+- Certificate pinning via identity fingerprints: each client derives their TLS cert from their Ed25519 seed (same as relay identity). During QUIC handshake, both sides verify the peer's cert fingerprint against the known identity
+- MITM elimination: if A knows B's fingerprint (from prior call, QR code, or identity server), any interceptor presents a different cert → fingerprint mismatch → connection rejected
+- Stronger guarantee than relay-assisted: user doesn't need to trust relay operator
+
+## Requirements
+
+### Phase 1: STUN Discovery
+
+1. **STUN client**: lightweight UDP-based STUN client to discover public IP:port
+   - Use existing public STUN servers (stun.l.google.com:19302, etc.)
+   - Or run a STUN server alongside the relay
+   - Discover: local addresses, server-reflexive addresses (STUN), and the relay candidate (the existing relay acts as the fallback; there is no separate TURN server)
+
+2. **Candidate gathering**: on call initiation, gather all candidates:
+   - Host candidates: local network interfaces
+   - Server-reflexive: STUN-discovered public IP:port
+   - Relay candidate: the relay's address (fallback)
+
+3. **Candidate exchange**: via relay signaling channel (existing `IceCandidate` signal message)
+   - A sends candidates to relay → relay forwards to B
+   - B sends candidates to relay → relay forwards to A
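+
+The candidate list described above can be sketched as a small data model. This is a minimal illustration, not the project's actual types: `CandidateKind`, `Candidate`, and `gather_candidates` are hypothetical names, and the only behavior assumed is that direct candidates are tried before the relay and that duplicate addresses are dropped.
+
+```rust
+use std::net::SocketAddr;
+
+/// Candidate kinds from the list above (hypothetical type, for illustration).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum CandidateKind {
+    Host,
+    ServerReflexive,
+    Relay,
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct Candidate {
+    pub kind: CandidateKind,
+    pub addr: SocketAddr,
+}
+
+/// Merge gathered addresses into one list: host first, then the
+/// STUN-discovered address, then the relay as the fallback entry.
+/// An address that was already added is skipped.
+pub fn gather_candidates(
+    host: &[SocketAddr],
+    reflexive: Option<SocketAddr>,
+    relay: SocketAddr,
+) -> Vec<Candidate> {
+    let all = host
+        .iter()
+        .map(|a| (CandidateKind::Host, *a))
+        .chain(reflexive.into_iter().map(|a| (CandidateKind::ServerReflexive, a)))
+        .chain(std::iter::once((CandidateKind::Relay, relay)));
+
+    let mut out: Vec<Candidate> = Vec::new();
+    for (kind, addr) in all {
+        if !out.iter().any(|c| c.addr == addr) {
+            out.push(Candidate { kind, addr });
+        }
+    }
+    out
+}
+```
+
+Keeping the relay as the last entry means the same list can drive both the direct attempts and the fallback decision.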
+
+### Phase 2: Direct Connection
+
+1. **QUIC hole punching**: both clients simultaneously attempt QUIC connections to each other's candidates
+   - A single Quinn `Endpoint` can have several outbound `connect()` attempts in flight at once, one per candidate address
+   - First successful connection wins
+   - Timeout after 3 seconds, fall back to relay
+
+2. **Identity verification**: during QUIC handshake, verify peer's TLS cert fingerprint
+   - `server_config_from_seed()` already exists — derive client cert from identity seed
+   - Both sides present certs (mutual TLS)
+   - Verify fingerprint matches expected identity
+
+3. **Media flow**: once connected, use existing `QuinnTransport` for media + signals
+   - Same `send_media()` / `recv_media()` API
+   - Same codec pipeline, FEC, jitter buffer
+   - No code changes needed in the call engine
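+
+The "first successful connection wins, otherwise fall back to relay after a timeout" rule from step 1 can be sketched without any networking. This is an assumption-laden stand-in: each dial attempt is a plain closure returning `true` on success, run on its own thread, whereas real code would await Quinn connection futures. `Dial`, `MediaPath`, and `race_dials` are hypothetical names.
+
+```rust
+use std::sync::mpsc;
+use std::thread;
+use std::time::Duration;
+
+/// One dial attempt; returns `true` if the handshake succeeded (stand-in).
+pub type Dial = Box<dyn FnOnce() -> bool + Send>;
+
+#[derive(Debug, PartialEq, Eq)]
+pub enum MediaPath {
+    /// Index of the candidate whose attempt finished first.
+    Direct(usize),
+    Relay,
+}
+
+/// Run every attempt concurrently. The first success wins; if none
+/// succeeds before `timeout`, report the relay fallback.
+pub fn race_dials(attempts: Vec<Dial>, timeout: Duration) -> MediaPath {
+    let (tx, rx) = mpsc::channel();
+    for (idx, attempt) in attempts.into_iter().enumerate() {
+        let tx = tx.clone();
+        thread::spawn(move || {
+            if attempt() {
+                // The receiver may already be gone if another attempt won.
+                let _ = tx.send(idx);
+            }
+        });
+    }
+    // Drop our own sender so `recv_timeout` returns early once every
+    // attempt has finished without success.
+    drop(tx);
+    match rx.recv_timeout(timeout) {
+        Ok(idx) => MediaPath::Direct(idx),
+        Err(_) => MediaPath::Relay,
+    }
+}
+```
+
+With the 3-second budget from the requirement, a call would look like `race_dials(attempts, Duration::from_secs(3))`.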
+
+### Phase 3: Adaptive Quality (P2P)
+
+P2P connections have direct quality visibility — no relay middleman:
+
+1. Both clients observe RTT, loss, jitter directly from QUIC stats
+2. Adapt codec quality based on direct observations
+3. Since only 2 participants, coordinated switching is simple: propose → ack → switch
+
+This is the simplest case for adaptive quality. Once proven, backport the logic to relay-assisted mode.
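+
+The propose → ack → switch handshake can be modeled as a two-state machine. The sketch below is illustrative only: `SwitchState`, `Event`, and `step` are hypothetical names, and codec profiles are represented as bare `u8` identifiers.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum SwitchState {
+    /// Both peers are using `profile`.
+    Stable { profile: u8 },
+    /// A switch to `target` was proposed and is awaiting the peer's ack.
+    Proposed { current: u8, target: u8 },
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum Event {
+    /// The local quality controller wants a different profile.
+    Propose(u8),
+    /// The peer acknowledged a proposal for this profile.
+    Ack(u8),
+}
+
+/// Advance the state machine by one event.
+pub fn step(state: SwitchState, event: Event) -> SwitchState {
+    match (state, event) {
+        (SwitchState::Stable { profile }, Event::Propose(target)) if target != profile => {
+            SwitchState::Proposed { current: profile, target }
+        }
+        (SwitchState::Proposed { target, .. }, Event::Ack(acked)) if acked == target => {
+            SwitchState::Stable { profile: target }
+        }
+        // Stale acks and redundant proposals leave the state unchanged.
+        (s, _) => s,
+    }
+}
+```
+
+Because the switch only happens on a matching ack, a lost or stale ack leaves both sides on the old profile rather than on mismatched ones.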
+
+### Phase 4: Hybrid Mode
+
+1. **Call initiation**: always connect to relay for signaling
+2. **Parallel attempt**: while relay call is active, attempt P2P in background
+3. **Seamless migration**: if P2P succeeds, migrate media path from relay to direct
+   - Both clients switch simultaneously
+   - Relay connection kept alive for signaling (presence, room updates)
+4. **Fallback**: if P2P connection drops, seamlessly fall back to relay
+
+## Security Properties
+
+| Property | Relay Mode | P2P Mode |
+|----------|-----------|----------|
+| Encryption | ChaCha20-Poly1305 (app layer) | QUIC TLS 1.3 + ChaCha20-Poly1305 |
+| Key exchange | Via relay signaling | Direct QUIC handshake |
+| Identity verification | TOFU (server fingerprint) | Mutual TLS cert pinning |
+| Metadata privacy | Relay sees who talks to whom | Relay sees signaling only; no third party in the media path |
+| MITM resistance | Depends on relay trust | Strong (cert pinning) |
+| Forward secrecy | ECDH ephemeral keys | QUIC built-in + app-layer rekey |
+
+## Implementation Notes
+
+### STUN in Rust
+
+Use the `stun-rs` or `webrtc-rs` crate for the STUN client. Only the Binding Request/Response exchange is needed, to discover the server-reflexive address.
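+
+For reference, the Binding exchange is small enough to sketch by hand. The example below follows the RFC 5389 wire layout (message type, length, magic cookie `0x2112A442`, 12-byte transaction ID, and the XOR-MAPPED-ADDRESS attribute) and handles IPv4 only. The function names are hypothetical and this is not the API of either crate named above.
+
+```rust
+use std::net::{Ipv4Addr, SocketAddrV4};
+
+const MAGIC_COOKIE: u32 = 0x2112_A442;
+const BINDING_REQUEST: u16 = 0x0001;
+const BINDING_SUCCESS: u16 = 0x0101;
+const XOR_MAPPED_ADDRESS: u16 = 0x0020;
+
+/// Build the 20-byte header of a Binding Request with no attributes.
+pub fn binding_request(txid: [u8; 12]) -> [u8; 20] {
+    let mut msg = [0u8; 20];
+    msg[0..2].copy_from_slice(&BINDING_REQUEST.to_be_bytes());
+    // Bytes 2..4 stay zero: message length is 0 without attributes.
+    msg[4..8].copy_from_slice(&MAGIC_COOKIE.to_be_bytes());
+    msg[8..20].copy_from_slice(&txid);
+    msg
+}
+
+/// Extract the IPv4 XOR-MAPPED-ADDRESS from a Binding Success Response.
+pub fn parse_xor_mapped_v4(resp: &[u8], txid: [u8; 12]) -> Option<SocketAddrV4> {
+    if resp.len() < 20 {
+        return None;
+    }
+    let msg_type = u16::from_be_bytes([resp[0], resp[1]]);
+    let msg_len = u16::from_be_bytes([resp[2], resp[3]]) as usize;
+    if msg_type != BINDING_SUCCESS
+        || resp[4..8] != MAGIC_COOKIE.to_be_bytes()
+        || resp[8..20] != txid
+    {
+        return None;
+    }
+    let body = resp.get(20..20 + msg_len)?;
+    let mut i = 0;
+    while i + 4 <= body.len() {
+        let attr_type = u16::from_be_bytes([body[i], body[i + 1]]);
+        let attr_len = u16::from_be_bytes([body[i + 2], body[i + 3]]) as usize;
+        let value = body.get(i + 4..i + 4 + attr_len)?;
+        // value[1] is the address family; 0x01 means IPv4.
+        if attr_type == XOR_MAPPED_ADDRESS && attr_len == 8 && value[1] == 0x01 {
+            // Port is XORed with the top 16 bits of the cookie, address with all 32.
+            let port = u16::from_be_bytes([value[2], value[3]]) ^ (MAGIC_COOKIE >> 16) as u16;
+            let ip = u32::from_be_bytes([value[4], value[5], value[6], value[7]]) ^ MAGIC_COOKIE;
+            return Some(SocketAddrV4::new(Ipv4Addr::from(ip), port));
+        }
+        // Attribute values are padded to a 4-byte boundary.
+        i += 4 + ((attr_len + 3) & !3);
+    }
+    None
+}
+```
+
+The request bytes would be sent over a UDP socket to the STUN server, and the reply passed to `parse_xor_mapped_v4` together with the same transaction ID.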
+
+### Quinn Hole Punching
+
+Quinn's `Endpoint` can both listen and connect. For hole punching:
+```rust
+let endpoint = create_endpoint(bind_addr, Some(server_config))?;
+// Send connect to peer's address (opens NAT pinhole)
+let conn = connect(&endpoint, peer_addr, "peer", client_config).await?;
+// Simultaneously, peer connects to our address
+// First successful handshake wins
+```
+
+### Client TLS Certificate
+
+Already have `server_config_from_seed()` for relays. Create `client_config_from_seed()` that presents a TLS client certificate derived from the identity seed. The peer verifies this cert's fingerprint.
+
+### Signaling via Relay
+
+The existing relay connection carries `IceCandidate` signals. No new infrastructure needed — just use the relay as a dumb signaling pipe for candidate exchange.
+
+## Non-Goals (v1)
+
+- SFU over P2P (P2P is 1-on-1 only; multi-party uses relay SFU)
+- TURN server (relay acts as the fallback, no separate TURN)
+- mDNS local discovery (future)
+- Mesh P2P for multi-party (future, complex)
+
+## Milestones
+
+| Phase | Scope | Effort | Status |
+|-------|-------|--------|--------|
+| 1 | STUN client + candidate gathering | 2 days | Done |
+| 2 | QUIC hole punching + identity verification | 3 days | Done |
+| 3 | Adaptive quality on P2P connection | 2 days | Done (#23) |
+| 4 | Hybrid mode (relay + P2P, seamless migration) | 3 days | Done |
+| 5 | Single-socket Nebula (shared signal+direct endpoint) | 2 days | Done |
+| 6 | ICE path negotiation + dual-path race | 3 days | Done |
+| 7 | IPv6 dual-socket | 2 days | Done (but `dual_path.rs` integration tests broken — missing `ipv6_endpoint` arg) |
+| 8.1 | Public STUN client (RFC 5389) | 1 day | Done |
+| 8.2 | PCP/PMP/UPnP port mapping | 2 days | Done |
+| 8.3 | Mid-call ICE re-gathering + CandidateUpdate signal | 2 days | Done (signal plane; transport hot-swap TODO) |
+| 8.4 | Netcheck diagnostic | 1 day | Done |
+| 8.5 | Region-based relay selection (data model) | 1 day | Done |
+| 8.6a | Hard NAT: port allocation detection | 1 day | Done |
+| 8.6b | Hard NAT: sequential port prediction signal | 1 day | Done (signal + prediction fn; dial integration pending) |
+| 8.6c | Hard NAT: birthday attack (256×1024 probes) | 3 days | Not started |
+| 8.6d | Hard NAT: hybrid waterfall + background upgrade | 2 days | Not started |
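+
+The 256×1024 figure in row 8.6c can be sanity-checked with a short calculation. The sampling model here is an assumption, not something this PRD specifies: one side holds `open_ports` distinct ports open, and the other side sends `probes` independent, uniformly random probes across all 65,536 ports. `hit_probability` is a hypothetical helper.
+
+```rust
+/// Probability that at least one probe lands on an open port, assuming
+/// independent uniform probes over the full 16-bit port range.
+pub fn hit_probability(open_ports: u32, probes: u32) -> f64 {
+    let miss_one = 1.0 - f64::from(open_ports) / 65_536.0;
+    1.0 - miss_one.powi(probes as i32)
+}
+```
+
+Under that model, `hit_probability(256, 1024)` comes out at roughly 98%.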
+
+## Implementation Status (2026-04-13)
+
+Phases 1-2, 4-7 are implemented. First P2P call completed 2026-04-12.
+
+### Known regression
+
+Phase 7 added `ipv6_endpoint: Option<Endpoint>` parameter to `race()` in `crates/wzp-client/src/dual_path.rs` but the 3 test call sites in `crates/wzp-client/tests/dual_path.rs` (lines 111, 153, 191) were not updated — they pass 6 args instead of 7. Fix: add `None,` after the `shared_endpoint` arg in each call.
+
+## Update (2026-04-13)
+
+P2P adaptive quality (#23) now implemented:
+- Both peers self-observe network quality from QUIC path stats
+- Quality reports generated every ~1s and attached to outgoing packets
+- AdaptiveQualityController drives codec switching on both P2P and relay calls
+
+## Update (2026-04-14): Phase 8 — Tailscale-Inspired Enhancements
+
+Added 5 new modules to bring NAT traversal capability close to Tailscale's:
+
+### Phase 8.1: Public STUN Client (Done)
+- `stun.rs`: RFC 5389 Binding Request/Response over raw UDP
+- Independent reflexive discovery via public STUN servers (Google, Cloudflare)
+- `detect_nat_type_with_stun()` combines relay + STUN probes for higher confidence
+- STUN fallback in desktop's `try_reflect_own_addr()` when relay reflection fails
+
+### Phase 8.2: PCP/PMP/UPnP Port Mapping (Done)
+- `portmap.rs`: NAT-PMP (RFC 6886), PCP (RFC 6887), UPnP IGD
+- Gateway discovery (macOS + Linux), try NAT-PMP → PCP → UPnP in sequence
+- New candidate type: `PeerCandidates.mapped` + signal fields `caller_mapped_addr`/`callee_mapped_addr`/`peer_mapped_addr`
+- Dial order: host → mapped → reflexive (mapped helps on symmetric NATs)
+
+### Phase 8.3: Mid-Call ICE Re-Gathering (Done — signal plane)
+- `ice_agent.rs`: `IceAgent` with `gather()`, `re_gather()`, `apply_peer_update()`
+- `SignalMessage::CandidateUpdate` with monotonic generation counter
+- Relay forwards `CandidateUpdate` like `MediaPathReport`
+- Desktop handles and emits to JS frontend
+- Transport hot-swap: designed but not yet wired into live call engine
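+
+A monotonic generation counter lets the receiver discard reordered or duplicated `CandidateUpdate` messages. The sketch below shows one way that check could work. It is not the actual `apply_peer_update()` implementation: `PeerCandidateState` and `apply_update` are hypothetical, and the rule "accept only strictly newer generations, starting above 0" is an assumption.
+
+```rust
+use std::net::SocketAddr;
+
+/// Last accepted candidate set for the remote peer (hypothetical type).
+#[derive(Debug, Default)]
+pub struct PeerCandidateState {
+    generation: u64,
+    addrs: Vec<SocketAddr>,
+}
+
+impl PeerCandidateState {
+    /// Apply an update only if its generation is strictly newer.
+    /// Returns `true` when the stored candidates were replaced.
+    pub fn apply_update(&mut self, generation: u64, addrs: Vec<SocketAddr>) -> bool {
+        if generation <= self.generation {
+            return false; // stale or duplicate update
+        }
+        self.generation = generation;
+        self.addrs = addrs;
+        true
+    }
+
+    pub fn addrs(&self) -> &[SocketAddr] {
+        &self.addrs
+    }
+}
+```
+
+Since the relay only forwards these messages, the ordering check has to live in the client, as it does here.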
+
+### Phase 8.4: Netcheck Diagnostic (Done)
+- `netcheck.rs`: comprehensive network diagnostic (NAT type, reflexive addr, IPv4/v6, port mapping, relay latencies)
+- CLI: `wzp-client --netcheck <relay>`
+
+### Phase 8.5: Region-Based Relay Selection (Done — data model)
+- `relay_map.rs`: `RelayMap` sorted by RTT with `preferred()` selection
+- `RegisterPresenceAck` extended with `relay_region` + `available_relays`
+
+### Phase 8.6: Hard NAT Traversal (Phase A done, B-D pending)
+- **Phase A (Done)**: Port allocation pattern detection — `PortAllocation` enum (`PortPreserving`/`Sequential{delta}`/`Random`/`Unknown`), `detect_port_allocation()` probes N STUN servers from single socket, `classify_port_allocation()` with wraparound + jitter tolerance, `predict_ports()` for sequential NATs
+- **Phase B (signal ready)**: `HardNatProbe` signal message carries `port_sequence`, `allocation`, `external_ip` — relay forwarding implemented. Actual dial-to-predicted-ports integration into `dual_path::race()` pending.
+- **Phase C (not started)**: Birthday attack (256 sockets × 1024 probes) for random NATs
+- **Phase D (not started)**: Hybrid waterfall with background relay-to-direct upgrade
+- `NetcheckReport.port_allocation` populated automatically from `detect_port_allocation()`
+- See `docs/PRD-hard-nat.md` for full design
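+
+The Phase A classification can be illustrated with a simplified sketch. The enum variants mirror the names listed above, but the function bodies and signatures are assumptions: `classify` and `predict` are hypothetical, jitter tolerance is omitted, and a NAT that reuses a single non-preserved port is reported as `Unknown` because none of the four variants describes it.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum PortAllocation {
+    PortPreserving,
+    Sequential { delta: u16 },
+    Random,
+    Unknown,
+}
+
+/// Classify the external ports reported by successive STUN probes that
+/// were all sent from one local socket bound to `local_port`.
+pub fn classify(local_port: u16, observed: &[u16]) -> PortAllocation {
+    if observed.len() < 2 {
+        return PortAllocation::Unknown;
+    }
+    if observed.iter().all(|&p| p == local_port) {
+        return PortAllocation::PortPreserving;
+    }
+    // `wrapping_sub` keeps the step correct across the 65535 -> 0 boundary.
+    let delta = observed[1].wrapping_sub(observed[0]);
+    let constant = observed.windows(2).all(|w| w[1].wrapping_sub(w[0]) == delta);
+    match (constant, delta) {
+        (true, 0) => PortAllocation::Unknown,
+        (true, d) => PortAllocation::Sequential { delta: d },
+        (false, _) => PortAllocation::Random,
+    }
+}
+
+/// Predict the next `count` external ports of a sequential NAT.
+pub fn predict(last: u16, delta: u16, count: usize) -> Vec<u16> {
+    (1..=count)
+        .map(|i| last.wrapping_add(delta.wrapping_mul(i as u16)))
+        .collect()
+}
+```
+
+For a `Sequential { delta }` result, the output of `predict` is the kind of list a `port_sequence` field could carry to the peer.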