docs: protocol audit 2026-05-25, update architecture + Obsidian vault

Audit:
- docs/AUDIT-2026-05-25.md: full protocol audit covering 8 findings
  (4 critical, 2 high, 5 medium, 4 low) with code references and fix
  effort estimates
- vault/Audit/Tasks.md: Obsidian Tasks plugin file tracking all audit
  items with priorities, due dates, and per-step checklists

Architecture docs updated for Wire format v2 and Wave 5/6 features:
- ARCHITECTURE.md: adds wzp-video to dependency graph and project
  structure; wire format updated to v2 (16B header, 5B MiniHeader);
  relay concurrency section corrected (DashMap+RwLock is current, not
  a future optimization); test count 571→702; Android note
- PROGRESS.md: Wave 5 and Wave 6 sections appended; test count 372→702;
  current status and open blockers as of 2026-05-25
- ROAD-TO-VIDEO.md: implementation status table inserted (✅/🟡/🔴/🔲
  per phase); 6-step critical path to first video call
- WZP-SPEC.md: MediaHeader updated to v2 (16B byte-aligned); MiniHeader
  updated to 5B with seq_delta; codec IDs 9-12 added (H.264/H.265/AV1);
  version negotiation section added

Obsidian vault (vault/):
- 114 files across Architecture/, PRDs/, Reports/, Android/,
  Reference/, Audit/ with YAML frontmatter
- 00 - Home.md index note with wiki links
- .obsidian/app.json config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Siavash Sameni
2026-05-25 06:00:17 +04:00
parent 12b0d9738f
commit ed8a7ae5aa
120 changed files with 22781 additions and 65 deletions


@@ -0,0 +1,219 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Adaptive Quality Control (Auto Codec)
## Problem
When a user selects "Auto" quality, the system currently just starts at Opus 24k (GOOD) and never changes. There is no runtime adaptation — if the network degrades mid-call, audio breaks up instead of gracefully stepping down to a lower bitrate codec. Conversely, if the network is excellent, the user stays on 24k when they could have studio-quality 64k.
The relay already sends `QualityReport` messages with loss % and RTT, and a `QualityAdapter` exists in `call.rs` that classifies network conditions into GOOD/DEGRADED/CATASTROPHIC — but none of this is wired into the Android or desktop engines.
## Solution
Wire the existing `QualityAdapter` into both engines so that "Auto" mode continuously monitors network quality and switches codecs mid-call. The full quality range should be used:
```
Excellent network → Studio 64k (best quality)
Good network → Opus 24k (default)
Degraded network → Opus 6k (lower bitrate, more FEC)
Poor network → Codec2 3.2k (vocoder, heavy FEC)
Catastrophic → Codec2 1.2k (minimum viable voice)
```
## Architecture
```
                    ┌─────────────────────┐
  Relay ──────────► │   QualityReport     │  loss %, RTT, jitter
                    │   (every ~1s)       │
                    └────────┬────────────┘
                             │
                             ▼
                    ┌─────────────────────┐
                    │   QualityAdapter    │  classify + hysteresis
                    │   (3-report window) │
                    └────────┬────────────┘
                             │ recommend new profile
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
      ┌────────────────┐            ┌────────────────┐
      │ Encoder        │            │ Decoder        │
      │ set_profile()  │            │ (auto-switch   │
      │ + FEC update   │            │  already works)│
      └────────────────┘            └────────────────┘
```
## Existing Infrastructure
### What already exists (in `crates/wzp-client/src/call.rs`)
1. **`QualityAdapter`** (lines 97-196):
- Sliding window of `QualityReport` messages
- `classify()`: loss > 15% or RTT > 200ms → CATASTROPHIC, loss > 5% or RTT > 100ms → DEGRADED, else → GOOD
- `should_switch()`: hysteresis — requires 3 consecutive reports recommending the same profile before switching
- Prevents oscillation between profiles
2. **`QualityReport`** (in `wzp-proto/src/packet.rs`):
- Sent by relay piggy-backed on media packets
- Fields: `loss_pct` (u8, 0-255 scaled), `rtt_4ms` (u8, RTT in 4ms units), `jitter_ms`, `bitrate_cap_kbps`
3. **`CallEncoder::set_profile()`** / **`CallDecoder` auto-switch**:
- Encoder can switch codec mid-stream
- Decoder already auto-detects incoming codec from packet headers
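The classification thresholds and the 3-report hysteresis described above can be sketched as follows. This is a minimal illustration, not the code in `call.rs`: the `Tier` and `Hysteresis` types are hypothetical, and the linear mapping of `loss_pct` from 0-255 onto 0-100% is an assumption about how the field is scaled.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Good,
    Degraded,
    Catastrophic,
}

/// Subset of the report fields named in the PRD.
pub struct QualityReport {
    pub loss_pct: u8, // 0-255 scaled
    pub rtt_4ms: u8,  // RTT in 4 ms units
}

impl QualityReport {
    /// Assumption: 0-255 maps linearly onto 0-100 %.
    pub fn loss_percent(&self) -> f32 {
        self.loss_pct as f32 * 100.0 / 255.0
    }

    pub fn rtt_ms(&self) -> u32 {
        self.rtt_4ms as u32 * 4
    }
}

/// Thresholds as listed for `classify()` above.
pub fn classify(r: &QualityReport) -> Tier {
    let (loss, rtt) = (r.loss_percent(), r.rtt_ms());
    if loss > 15.0 || rtt > 200 {
        Tier::Catastrophic
    } else if loss > 5.0 || rtt > 100 {
        Tier::Degraded
    } else {
        Tier::Good
    }
}

/// Hypothetical stand-in for the adapter's sliding window.
pub struct Hysteresis {
    window: VecDeque<Tier>,
    needed: usize,
}

impl Hysteresis {
    pub fn new(needed: usize) -> Self {
        Hysteresis { window: VecDeque::new(), needed }
    }

    /// Keep only the most recent `needed` classifications.
    pub fn ingest(&mut self, tier: Tier) {
        self.window.push_back(tier);
        if self.window.len() > self.needed {
            self.window.pop_front();
        }
    }

    /// Recommend a switch only when the whole window agrees on a tier
    /// that differs from the current one.
    pub fn should_switch(&self, current: Tier) -> Option<Tier> {
        if self.window.len() < self.needed {
            return None;
        }
        let first = self.window[0];
        if first != current && self.window.iter().all(|t| *t == first) {
            Some(first)
        } else {
            None
        }
    }
}
```

A single outlier report resets the agreement, which is what prevents oscillation between profiles.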
### What's been implemented since PRD was written
1. **QualityReport ingestion**: ~~neither Android engine nor desktop engine reads quality reports from the relay~~ **Done**: both Android (`crates/wzp-android/src/engine.rs`) and desktop (`desktop/src-tauri/src/engine.rs`) recv tasks ingest quality reports and feed `AdaptiveQualityController`
2. **Profile switch loop**: ~~no periodic check~~ **Done**: `pending_profile` AtomicU8 bridges recv→send task in both engines; send task applies profile switch at frame boundary
3. **Notification to UI**: ~~when quality changes, the UI should show the current active codec~~ **Done**: `tx_codec`/`rx_codec` in desktop `EngineStatus`; `currentCodec`/`peerCodec` in Android `CallStats`
### What's still missing
1. **Upward adaptation**: `QualityAdapter` only classifies into 3 tiers (GOOD/DEGRADED/CATASTROPHIC). Needs extension to recommend studio tiers when conditions are excellent (loss < 1%, RTT < 50ms). See Phase 2 below.
2. **Relay QualityDirective handling** — relay broadcasts coordinated quality directives but neither engine processes them (signals are silently discarded). See PRD-coordinated-codec.md for details.
## Requirements
### Phase 1: Basic Adaptive (3-tier)
**Both Android and Desktop:**
1. **Ingest QualityReports**: In the recv loop, extract `quality_report` from incoming `MediaPacket`s when present. Feed to `QualityAdapter`.
2. **Periodic quality check**: Every 1 second (or on each QualityReport), call `adapter.should_switch(&current_profile)`. If it returns `Some(new_profile)`:
- Switch the encoder: `encoder.set_profile(new_profile)`
- Update FEC encoder: `fec_enc = create_encoder(&new_profile)`
- Update frame size if changed (e.g., 20ms → 40ms)
- Log the switch
3. **Frame size adaptation on switch**: When switching from 20ms to 40ms frames (or vice versa):
- Android: update `frame_samples` variable, resize `capture_buf`
- Desktop: same — the send loop reads `frame_samples` dynamically
4. **UI indicator**: Show current active codec in the call screen stats line.
- Android: add to `CallStats` and display in stats text
- Desktop: add to `get_status` response and display in stats div
5. **Only in Auto mode**: Adaptive switching should only happen when the user selected "Auto". If they manually selected a profile, respect their choice.
### Phase 2: Extended Range (5-tier)
Extend `QualityAdapter::classify()` to use the full codec range:
| Condition | Profile | Codec |
|-----------|---------|-------|
| loss < 1% AND RTT < 30ms | STUDIO_64K | Opus 64k |
| loss < 1% AND RTT < 50ms | STUDIO_48K | Opus 48k |
| loss < 2% AND RTT < 80ms | STUDIO_32K | Opus 32k |
| loss < 5% AND RTT < 100ms | GOOD | Opus 24k |
| loss < 15% AND RTT < 200ms | DEGRADED | Opus 6k |
| loss >= 15% OR RTT >= 200ms | CATASTROPHIC | Codec2 1.2k |
With hysteresis:
- **Downgrade**: 3 consecutive reports (fast reaction to degradation)
- **Upgrade**: 5 consecutive reports (slow, cautious improvement)
- **Studio upgrade**: 10 consecutive reports (very conservative — avoid bouncing to 64k on brief good patches)
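The table and the asymmetric hysteresis above translate fairly directly into code. The sketch below is illustrative only: the `Profile` enum and the `reports_required` helper are hypothetical names, and the rows are evaluated top to bottom so the first matching row wins.

```rust
/// Variants are ordered from worst to best so that `<` means "lower quality".
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Profile {
    Catastrophic,
    Degraded,
    Good,
    Studio32k,
    Studio48k,
    Studio64k,
}

/// Rows of the Phase 2 table, checked in order.
pub fn classify(loss_pct: f32, rtt_ms: u32) -> Profile {
    if loss_pct < 1.0 && rtt_ms < 30 {
        Profile::Studio64k
    } else if loss_pct < 1.0 && rtt_ms < 50 {
        Profile::Studio48k
    } else if loss_pct < 2.0 && rtt_ms < 80 {
        Profile::Studio32k
    } else if loss_pct < 5.0 && rtt_ms < 100 {
        Profile::Good
    } else if loss_pct < 15.0 && rtt_ms < 200 {
        Profile::Degraded
    } else {
        Profile::Catastrophic
    }
}

/// Consecutive agreeing reports needed before moving from `current` to
/// `target`. Callers are expected to pass `target != current`.
pub fn reports_required(current: Profile, target: Profile) -> u32 {
    if target < current {
        3 // downgrade: react fast
    } else if target >= Profile::Studio32k {
        10 // upgrade into a studio tier: very conservative
    } else {
        5 // ordinary upgrade: cautious
    }
}
```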
### Phase 3: Bandwidth Probing
Rather than relying solely on loss/RTT:
1. Start at GOOD
2. After 10 seconds of stable call, probe upward by switching to STUDIO_32K
3. If no quality degradation after 5 seconds, probe to STUDIO_48K
4. If degradation detected, immediately fall back
5. This discovers the true available bandwidth rather than guessing from loss stats
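One way to picture the probing steps above is a small state machine ticked once per second. This is a sketch under assumptions: the `Prober` type is hypothetical, "fall back" is taken to mean returning to GOOD (the PRD does not say whether it returns to the previous tier instead), and probing stops at STUDIO_48K because that is the last step listed.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Profile {
    Good,
    Studio32k,
    Studio48k,
}

pub struct Prober {
    current: Profile,
    stable_secs: u32,
}

impl Prober {
    /// Step 1: start at GOOD.
    pub fn new() -> Self {
        Prober { current: Profile::Good, stable_secs: 0 }
    }

    /// Call once per second. `degraded` is true if any quality degradation
    /// was observed during that second. Returns the profile to encode with.
    pub fn tick(&mut self, degraded: bool) -> Profile {
        if degraded {
            // Step 4: immediate fallback (assumed target: GOOD).
            self.current = Profile::Good;
            self.stable_secs = 0;
            return self.current;
        }
        self.stable_secs += 1;
        let next = match (self.current, self.stable_secs) {
            // Step 2: 10 stable seconds, then probe to STUDIO_32K.
            (Profile::Good, s) if s >= 10 => Some(Profile::Studio32k),
            // Step 3: 5 more clean seconds, then probe to STUDIO_48K.
            (Profile::Studio32k, s) if s >= 5 => Some(Profile::Studio48k),
            _ => None,
        };
        if let Some(p) = next {
            self.current = p;
            self.stable_secs = 0;
        }
        self.current
    }
}
```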
## Implementation Plan
### Android (`crates/wzp-android/src/engine.rs`)
```rust
// In the recv loop, after decoding:
if let Some(ref qr) = pkt.quality_report {
    quality_adapter.ingest(qr);
}
// Periodic check (every 50 frames ≈ 1 second at 20 ms per frame):
if auto_profile && frames_decoded % 50 == 0 {
    if let Some(new_profile) = quality_adapter.should_switch(&current_profile) {
        info!(from = ?current_profile.codec, to = ?new_profile.codec, "auto: switching quality");
        let _ = encoder_ref.lock().set_profile(new_profile);
        // Dereference the guard to replace the FEC encoder behind the lock
        // (assumes a `lock()` that returns the guard directly).
        *fec_enc_ref.lock() = create_encoder(&new_profile);
        current_profile = new_profile;
        frame_samples = frame_samples_for(&new_profile);
        // Resize capture buffer if needed
    }
}
```
**Challenge**: The encoder is in the send task and the quality reports arrive in the recv task. Need shared state (AtomicU8 for profile index, or a channel).
**Recommended approach**: Use an `AtomicU8` that the recv task writes and the send task reads:
```rust
let pending_profile = Arc::new(AtomicU8::new(0xFF)); // 0xFF = no change
// Recv task: when adapter recommends switch
pending_profile.store(new_profile_index, Ordering::Release);
// Send task: check at frame boundary
let p = pending_profile.swap(0xFF, Ordering::Acquire);
if p != 0xFF { /* apply switch */ }
```
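For reference, the same recv→send handoff can be wrapped in a small type so the sentinel value stays in one place. This is a sketch only: `ProfileMailbox` and `NO_CHANGE` are hypothetical names, and the profile index encoding is whatever the engines already use.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

/// Sentinel meaning "no switch requested".
pub const NO_CHANGE: u8 = 0xFF;

/// One-slot mailbox shared between the recv task (writer) and the
/// send task (reader). Cloning shares the same slot.
#[derive(Clone)]
pub struct ProfileMailbox(Arc<AtomicU8>);

impl ProfileMailbox {
    pub fn new() -> Self {
        ProfileMailbox(Arc::new(AtomicU8::new(NO_CHANGE)))
    }

    /// Recv task: request a switch. A newer request overwrites an older
    /// one that has not been picked up yet.
    pub fn request(&self, profile_index: u8) {
        self.0.store(profile_index, Ordering::Release);
    }

    /// Send task: call at a frame boundary. Takes the pending request,
    /// if any, and resets the slot in the same atomic operation.
    pub fn take(&self) -> Option<u8> {
        match self.0.swap(NO_CHANGE, Ordering::Acquire) {
            NO_CHANGE => None,
            idx => Some(idx),
        }
    }
}
```

Because the slot holds only the latest request, the send task never has to drain a backlog: if two recommendations arrive between frames, only the most recent one is applied.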
### Desktop (`desktop/src-tauri/src/engine.rs`)
Same pattern. The desktop engine already has separate send/recv tasks with shared atomics for mic_muted, etc. Add a `pending_profile: Arc<AtomicU8>` following the same pattern.
### Desktop CLI (`crates/wzp-client/src/call.rs`)
The `CallEncoder` already has `set_profile()`. The `CallDecoder` already auto-switches. Just need to:
1. Add `QualityAdapter` to `CallDecoder`
2. Feed quality reports in `ingest()`
3. Check `should_switch()` in `decode_next()`
4. Emit the recommendation via a callback or return value
## Testing
1. **Local test with tc/netem**: Use Linux traffic control to simulate loss/latency:
```bash
# Simulate 10% loss, 150ms RTT
tc qdisc add dev lo root netem loss 10% delay 75ms
# Run 2 clients in auto mode, verify they switch to DEGRADED
```
2. **CLI test**: Run `wzp-client --profile auto` between two instances with simulated network conditions
3. **Relay quality reports**: Verify the relay actually sends QualityReport messages. If it doesn't yet, that needs to be implemented first (check relay code).
## Open Questions
1. **Does the relay currently send QualityReports?** If not, Phase 1 is blocked until the relay implements per-client loss/RTT tracking and report generation. The relay sees all packets and can compute loss % per sender.
2. **Codec2 3.2k placement**: Should auto mode use Codec2 3.2k between DEGRADED and CATASTROPHIC? It's 20ms frames (lower latency than Opus 6k's 40ms) but speech-only quality.
3. **Cross-client adaptation**: If client A is on GOOD and client B auto-adapts to CATASTROPHIC, client A still sends Opus 24k. Client B can decode it fine (auto-switch on recv). But should A also be told to lower quality to save B's bandwidth? This requires signaling between clients.
## Milestones
| Phase | Scope | Effort | Status |
|-------|-------|--------|--------|
| 0 | Verify relay sends QualityReports | 0.5 day | Done |
| 1a | Wire QualityAdapter in Android engine | 1 day | Done |
| 1b | Wire QualityAdapter in desktop engine | 1 day | Done |
| 1c | UI indicator (current codec) | 0.5 day | Done |
| 2 | Extended 5-tier classification (Studio64k→Catastrophic) | 0.5 day | Done (2026-04-13) |
| 3 | Bandwidth probing | 2 days | Pending (task #10) |
## Implementation Status Update (2026-04-13)
Phases 1 and 2 are implemented; Phase 3 is still pending:
- Phase 1: QualityAdapter with 3-tier classification — DONE
- Phase 2: Extended 5-tier (Studio 64k/48k/32k + GOOD + DEGRADED + CATASTROPHIC) — DONE
- Phase 3: Bandwidth probing — NOT DONE (see remaining tasks)
- P2P adaptive quality: QualityReport::from_path_stats() + self-observation from quinn stats — DONE
- Both relay and P2P calls now have full adaptive quality switching


@@ -0,0 +1,110 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Bluetooth Audio Routing
> Phase: Implemented
> Status: Ready for testing
> Platforms: Android (native Kotlin app + Tauri desktop app)
## Problem
WarzonePhone had `AudioRouteManager.kt` with complete Bluetooth SCO support, but it was disconnected from both UIs. Users with Bluetooth headsets had no way to route call audio to them.
## Solution
Wire Bluetooth SCO routing end-to-end through both app variants, replacing the binary speaker toggle with a 3-way audio route cycle: **Earpiece → Speaker → Bluetooth**.
## Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│ Native Kotlin App (com.wzp)                                          │
│                                                                      │
│ InCallScreen ──► CallViewModel ──────► AudioRouteManager             │
│ (Compose UI)     cycleAudioRoute()     setSpeaker()                  │
│ "Ear/Spk/BT"     audioRoute Flow       setBluetoothSco()             │
│                                        isBluetoothAvailable()        │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ Tauri Desktop App (com.wzp.desktop)                                  │
│                                                                      │
│ main.ts ────────────► Tauri Commands ────────────► android_audio.rs  │
│ cycleAudioRoute()     set_bluetooth_sco()          JNI calls         │
│ "Ear/Spk/BT"          is_bluetooth_available()                       │
│                       get_audio_route()                              │
│                                                                      │
│ After each route change: Oboe stop + start                           │
│ (spawn_blocking to avoid stalling tokio)                             │
└──────────────────────────────────────────────────────────────────────┘
```
## Components Modified
### Native Kotlin App
| File | Change |
|------|--------|
| `CallViewModel.kt` | Added `audioRoute: StateFlow<AudioRoute>`, `cycleAudioRoute()`, wired `onRouteChanged` callback |
| `InCallScreen.kt` | `ControlRow` now takes `audioRoute: AudioRoute` + `onCycleRoute`, displays Ear/Spk/BT with distinct colors |
### Tauri App
| File | Change |
|------|--------|
| `android_audio.rs` | `setCommunicationDevice()` (API 31+) with `startBluetoothSco()` fallback; `set_audio_mode_communication/normal()` for call lifecycle |
| `lib.rs` | `set_bluetooth_sco`, `is_bluetooth_available`, `get_audio_route` Tauri commands; SCO polling + 500ms route delay |
| `wzp_native.rs` | Added `audio_start_bt()` for BT-mode Oboe (skips 48kHz + VoiceCommunication preset) |
| `oboe_bridge.cpp` | `bt_active` flag: capture skips sample rate + input preset; playout uses `Usage::Media`; both use `Shared` mode + `SampleRateConversionQuality::Best` |
| `engine.rs` | `set_audio_mode_communication()` before `audio_start()`; `set_audio_mode_normal()` after `audio_stop()` |
| `MainActivity.kt` | Removed `MODE_IN_COMMUNICATION` from app launch — deferred to call start |
| `main.ts` | Replaced `speakerphoneOn` toggle with `currentAudioRoute` cycling logic |
| `style.css` | Added `.bt-on` CSS class (blue-400 highlight) |
## Audio Route Lifecycle
1. **App launch** → `MODE_NORMAL` (other apps' audio unaffected; BT A2DP music keeps playing)
2. **Call starts** → `MODE_IN_COMMUNICATION` set via JNI, Oboe opens with earpiece routing
3. **User taps route button** → cycles to next available route
4. **Route changes** → `setCommunicationDevice()` (API 31+) + Oboe restart in BT mode or normal mode
5. **BT device disconnects mid-call** → `AudioDeviceCallback.onAudioDevicesRemoved` fires → auto-fallback to Earpiece/Speaker
6. **Call ends** → route reset, `MODE_NORMAL` restored
## Route Cycling Logic
```
Available routes = [Earpiece, Speaker] + [Bluetooth] if SCO device connected
Tap cycle:
Earpiece → Speaker → Bluetooth (if available) → Earpiece → ...
If BT not available:
Earpiece → Speaker → Earpiece → ...
```
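The cycle above reduces to a single pure function. The sketch below is written in Rust to match the Tauri side; the `AudioRoute` enum mirrors the route names in this PRD, while `next_route` is a hypothetical helper and not the actual `cycleAudioRoute()` implementation.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AudioRoute {
    Earpiece,
    Speaker,
    Bluetooth,
}

/// Returns the route selected by one tap of the route button.
/// `bt_available` is true when a Bluetooth SCO device is connected.
pub fn next_route(current: AudioRoute, bt_available: bool) -> AudioRoute {
    match current {
        AudioRoute::Earpiece => AudioRoute::Speaker,
        AudioRoute::Speaker if bt_available => AudioRoute::Bluetooth,
        AudioRoute::Speaker => AudioRoute::Earpiece,
        // Also covers the case where BT disappeared while it was selected.
        AudioRoute::Bluetooth => AudioRoute::Earpiece,
    }
}
```

Keeping the decision separate from the JNI calls makes the cycle order testable without a device.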
## Permissions
- `BLUETOOTH_CONNECT` (Android 12+) — already in `AndroidManifest.xml`
- `MODIFY_AUDIO_SETTINGS` — already in manifest
## Known Limitations
- **SCO only** — no A2DP (stereo music profile). SCO is correct for VoIP (bidirectional mono).
- **API 31+ required for modern path** — `setCommunicationDevice()` is the primary BT routing API. Fallback to deprecated `startBluetoothSco()` on API < 31 (untested).
- **BT SCO capture at 8/16kHz** — Oboe resamples to 48kHz via `SampleRateConversionQuality::Best`. Quality is inherently limited by the SCO codec (CVSD at 8kHz or mSBC at 16kHz).
- **No auto-switch on BT connect** — when a BT device connects mid-call, user must tap the route button.
- **500ms route switch delay** — after `setCommunicationDevice()` returns, the audio policy needs time to apply the bt-sco route. We wait 500ms before restarting Oboe.
## Testing
1. Pair a Bluetooth SCO headset with Android device
2. Start call → verify Earpiece is default
3. Tap route → Speaker (audio moves to loudspeaker, button shows "Spk")
4. Tap route → BT (audio moves to headset, button shows "BT", blue highlight)
5. Tap route → Earpiece (audio back to earpiece, button shows "Ear")
6. Disconnect BT mid-call → verify auto-fallback
7. Verify both app variants work identically
8. Verify no audio glitches during route transitions


@@ -0,0 +1,226 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Coordinated Codec Switching (Relay-Judged Quality)
## Problem
The current adaptive quality system (`QualityAdapter` in call.rs) exists but isn't wired into either engine. Clients encode at a fixed quality chosen at call start. When network conditions change mid-call, audio degrades instead of gracefully stepping down. When conditions improve, clients stay on low quality unnecessarily.
Additionally, in SFU mode with multiple participants, uncoordinated codec switching creates asymmetry: if client A upgrades to 64k while B stays on 24k, bandwidth is wasted. Participants should switch together.
## Solution
The **relay acts as the quality judge** since it sees both sides of every connection. It monitors packet loss, jitter, and RTT per participant, then signals quality recommendations. Clients react to these signals with coordinated codec switches.
## Architecture
```
┌──────────┐        ┌──────────┐        ┌──────────┐
│ Client A │◄──────►│  Relay   │◄──────►│ Client B │
│          │        │ (judge)  │        │          │
│ Encoder  │        │          │        │ Encoder  │
│ Decoder  │        │ Monitor  │        │ Decoder  │
└──────────┘        │ per-peer │        └──────────┘
                    │ quality  │
                    └────┬─────┘
                 Quality Signals:
                 - StableSignal (conditions good)
                 - DegradeSignal (conditions bad)
                 - UpgradeProposal (try higher quality?)
                 - UpgradeConfirm (all agreed, switch at T)
```
## Quality Classification (Relay-Side)
The relay monitors each participant's connection quality:
| Condition | Classification | Action |
|-----------|---------------|--------|
| loss >= 15% OR RTT >= 200ms | Critical | Immediate downgrade signal |
| loss >= 5% OR RTT >= 100ms | Degraded | Downgrade signal after 3 reports |
| loss < 2% AND RTT < 80ms | Good | Stable signal |
| loss < 1% AND RTT < 50ms for 30s | Excellent | Upgrade proposal |
| loss < 0.5% AND RTT < 30ms for 60s | Studio | Studio upgrade proposal |
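A possible shape for the relay-side classifier is shown below. It is a sketch under assumptions: `stable_secs` is a hypothetical counter of how long the link has continuously met the better-than-Good thresholds, and measurements that fall between the Good and Degraded rows (for example loss of 3%), which the table leaves unspecified, are treated as Good here.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum QualityClass {
    Critical,
    Degraded,
    Good,
    Excellent,
    Studio,
}

/// Classify one participant from its current loss and RTT plus the number
/// of seconds those values have stayed this good (assumed input).
pub fn classify(loss_pct: f32, rtt_ms: u32, stable_secs: u32) -> QualityClass {
    if loss_pct >= 15.0 || rtt_ms >= 200 {
        QualityClass::Critical
    } else if loss_pct >= 5.0 || rtt_ms >= 100 {
        QualityClass::Degraded
    } else if loss_pct < 0.5 && rtt_ms < 30 && stable_secs >= 60 {
        QualityClass::Studio
    } else if loss_pct < 1.0 && rtt_ms < 50 && stable_secs >= 30 {
        QualityClass::Excellent
    } else {
        // Includes the band the table does not cover (assumption).
        QualityClass::Good
    }
}
```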
## Coordinated Switching Protocol
### Downgrade (fast, safety-first)
1. Relay detects degradation for ANY participant
2. Relay sends `QualityDirective { recommended_profile: DEGRADED, reason: CoordinatedDowngrade }` to ALL participants
3. ALL participants immediately switch encoder to the recommended profile
4. No negotiation — downgrade is mandatory and instant
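The downgrade rule amounts to tracking the worst tier in the room and broadcasting whenever it drops. The sketch below illustrates that idea only; `Room`, `Tier`, and `observe` are hypothetical names, and upgrades are deliberately left out because they follow the consensual path described next.

```rust
use std::collections::HashMap;

/// Ordered from worst to best so `min()` yields the weakest link.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Catastrophic,
    Degraded,
    Good,
}

pub struct Room {
    participants: HashMap<u64, Tier>,
    /// Tier all participants are currently told to encode at.
    pub current: Tier,
}

impl Room {
    pub fn new() -> Self {
        Room { participants: HashMap::new(), current: Tier::Good }
    }

    /// Record a participant's latest tier. Returns `Some(tier)` when the
    /// room-wide worst tier dropped below the current level, meaning a
    /// downgrade directive should be broadcast to everyone.
    pub fn observe(&mut self, participant: u64, tier: Tier) -> Option<Tier> {
        self.participants.insert(participant, tier);
        let worst = self.participants.values().copied().min()?;
        if worst < self.current {
            self.current = worst;
            Some(worst)
        } else {
            None // unchanged or improved: no downgrade to send
        }
    }
}
```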
### Upgrade (slow, consensual)
1. Relay detects sustained good conditions for ALL participants (threshold: 30s stable)
2. Relay sends `UpgradeProposal { target_profile, switch_delay_ms }` to all
3. Each client responds with `UpgradeResponse { accepted: true }` or `UpgradeResponse { accepted: false }`
4. If ALL accept within 5s → Relay sends `UpgradeConfirm { profile, switch_at_session_ms }`
5. All clients switch encoder at the agreed timestamp (relative to session clock)
6. If ANY rejects or times out → upgrade cancelled, stay on current profile
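The relay-side bookkeeping for one upgrade round could look like the sketch below. All names (`UpgradeRound`, `Outcome`, `record`, `poll`) are hypothetical; the only behaviour taken from the steps above is that every participant must accept within the timeout, and that any rejection or timeout cancels the round.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Pending,
    Confirmed,
    Cancelled,
}

pub struct UpgradeRound {
    started: Instant,
    timeout: Duration,
    /// `None` = no answer yet, `Some(true)` = accepted, `Some(false)` = rejected.
    votes: HashMap<u64, Option<bool>>,
}

impl UpgradeRound {
    pub fn new(participants: &[u64], now: Instant) -> Self {
        UpgradeRound {
            started: now,
            timeout: Duration::from_secs(5),
            votes: participants.iter().map(|id| (*id, None)).collect(),
        }
    }

    /// Record a client's answer. Late answers and unknown senders are ignored.
    pub fn record(&mut self, participant: u64, accepted: bool, now: Instant) {
        if now.duration_since(self.started) > self.timeout {
            return;
        }
        if let Some(vote) = self.votes.get_mut(&participant) {
            *vote = Some(accepted);
        }
    }

    /// Current state of the round as seen at `now`.
    pub fn poll(&self, now: Instant) -> Outcome {
        if self.votes.values().any(|v| *v == Some(false)) {
            Outcome::Cancelled
        } else if self.votes.values().all(|v| *v == Some(true)) {
            Outcome::Confirmed
        } else if now.duration_since(self.started) >= self.timeout {
            Outcome::Cancelled
        } else {
            Outcome::Pending
        }
    }
}
```

Passing `now` explicitly keeps the round deterministic, so timeout behaviour can be exercised without sleeping.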
### Asymmetric Encoding (SFU optimization)
In SFU mode, each client encodes independently. The relay could allow:
- Client A (strong connection): encode at 64k
- Client B (weak connection): encode at 6k
- Relay forwards A's 64k to B's decoder (auto-switch handles it)
- B benefits from A's quality without needing to send at 64k
This requires NO protocol changes — just each client independently following the relay's recommendation for their own encoding quality. The decoder already handles any codec.
### Split Network Consideration
If participant A has great quality but participant C has terrible quality:
- Option 1: **Match weakest link** — everyone encodes at C's level (current approach, simple)
- Option 2: **Per-participant recommendations** — A encodes at 64k, C encodes at 6k. B (good connection) receives and decodes both. Works because decoders auto-switch per packet.
- Option 3: **Relay transcoding** — relay re-encodes A's 64k as 6k for C. Adds CPU on relay, but saves bandwidth for C. Future feature.
Recommended: start with Option 1 (match weakest), add Option 2 later.
## Signal Messages (New/Modified)
```rust
// New/modified `SignalMessage` variants (existing variants omitted).
enum SignalMessage {
    /// Quality signal from relay to client
    QualityDirective {
        /// Recommended profile to use for encoding
        recommended_profile: QualityProfile,
        /// Reason for the recommendation
        reason: QualityReason,
    },
    /// Upgrade proposal from relay
    UpgradeProposal {
        target_profile: QualityProfile,
        /// Milliseconds from now when the switch would happen
        switch_delay_ms: u32,
    },
    /// Client response to upgrade proposal
    UpgradeResponse {
        accepted: bool,
    },
    /// Confirmed upgrade: all clients switch at this time
    UpgradeConfirm {
        profile: QualityProfile,
        /// Session-relative timestamp to switch (ms since call start)
        switch_at_session_ms: u64,
    },
}

enum QualityReason {
    /// Network conditions require this quality level
    NetworkCondition,
    /// Coordinated upgrade: all participants agreed
    CoordinatedUpgrade,
    /// Coordinated downgrade: weakest link determines level
    CoordinatedDowngrade,
}
```
## Relay-Side Implementation
### Per-Participant Quality Tracking
```rust
struct ParticipantQuality {
    /// Sliding window of recent observations
    loss_samples: VecDeque<f32>,   // last 30 seconds
    rtt_samples: VecDeque<u32>,    // last 30 seconds
    jitter_samples: VecDeque<u32>,
    /// Current classification
    classification: QualityClass,
    /// How long current classification has been stable
    stable_since: Instant,
}
```
### Quality Monitor Task (on relay)
Runs alongside the SFU forwarding loop:
1. Every 1 second, compute per-participant quality from QUIC connection stats
2. Classify each participant
3. If ANY participant degrades → send downgrade to ALL
4. If ALL participants stable for threshold → propose upgrade
5. Track upgrade negotiation state
### Integration with Existing Code
The relay already has access to:
- `QuinnTransport::path_quality()` → loss, RTT, jitter, bandwidth estimates
- `QualityReport` embedded in media packet headers
- Per-session metrics in `RelayMetrics`
The quality monitor just needs to read these existing metrics and produce signals.
## Client-Side Implementation
### Handling Quality Signals
In the recv loop (both Android engine and desktop engine):
```rust
SignalMessage::QualityDirective { recommended_profile, .. } => {
    // Immediate: switch encoder to recommended profile
    encoder.set_profile(recommended_profile)?;
    fec_enc = create_encoder(&recommended_profile);
    frame_samples = frame_samples_for(&recommended_profile);
    info!(codec = ?recommended_profile.codec, "quality directive: switched");
}
```
### P2P Quality (simpler case)
For P2P calls (no relay), both clients directly observe quality:
1. Each client runs its own `QualityAdapter` on the direct connection
2. When quality changes, client proposes to peer via signal
3. Simpler negotiation: only 2 parties, no relay middleman
4. Same coordinated switching logic, just peer-to-peer signals
## Backporting P2P → Relay
The quality monitoring and codec switching logic is identical:
- **P2P**: client observes quality directly → proposes switch to peer
- **Relay**: relay observes quality → proposes switch to all clients
The only difference is WHO makes the decision (client vs relay) and HOW many participants need to agree (2 vs N).
Implementation strategy: build for P2P first (simpler, 2 parties), then wrap the same logic with relay-mediated signals for SFU mode.
## Milestones
| Phase | Scope | Effort |
|-------|-------|--------|
| 1 | Relay-side quality monitor (per-participant tracking) | 1 day |
| 2 | Downgrade signal (immediate, match weakest) | 1 day |
| 3 | Client handling of QualityDirective | 1 day (both engines) |
| 4 | Upgrade proposal + negotiation protocol | 2 days |
| 5 | P2P quality adaptation (direct observation) | 1 day |
| 6 | Per-participant asymmetric encoding (Option 2) | 1 day |
## Implementation Status (2026-04-13)
Phases 1-3 are implemented. Phase 3 initially had a critical gap, which has since been closed (see "Phase 3 completed" below).
### What was built
- **`QualityDirective` signal** (`crates/wzp-proto/src/packet.rs`): New `SignalMessage` variant with `recommended_profile` and optional `reason`
- **`ParticipantQuality`** (`crates/wzp-relay/src/room.rs`): Per-participant quality tracking using `AdaptiveQualityController`, created on join, removed on leave
- **Weakest-link broadcast**: `observe_quality()` method computes room-wide worst tier, broadcasts `QualityDirective` to all participants when tier changes
- **Desktop engine handling** (`desktop/src-tauri/src/engine.rs`): `AdaptiveQualityController` in recv task, `pending_profile` AtomicU8 bridge to send task, auto-mode profile switching based on **inbound quality reports**
### Phase 3 completed (2026-04-13)
Both engines now handle `QualityDirective` signals from the relay:
- **Desktop** (`engine.rs`): both P2P and relay signal tasks match `QualityDirective`, extract `recommended_profile`, store index via `sig_pending_profile.store(idx, Release)`. Send task picks it up at the next frame boundary.
- **Android** (`engine.rs`): signal task matches `QualityDirective`, stores via `pending_profile_recv.store(idx, Release)`.
Relay-coordinated codec switching is now end-to-end: relay monitors → broadcasts directive → clients switch.
### Phase remaining
- Phase 4: Upgrade proposal/negotiation protocol for quality recovery (task #28)


@@ -0,0 +1,175 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Delegated Trust for Relay Federation
## Problem
In the current federation model, when Relay 1 trusts Relay 2, and Relay 2 forwards media from Relay 3, Relay 1 has no way to know or control that Relay 3's traffic is reaching it. This is a trust gap — any relay in the chain can introduce untrusted traffic.
**Example:** Relay 1 (trusted zone) ←→ Relay 2 (hub) ←→ Relay 3 (unknown)
Relay 1 explicitly trusts Relay 2. But Relay 2 forwards Relay 3's media to Relay 1 without Relay 1's consent. Relay 1 receives media that originated from an entity it never approved.
## Solution
Add a `delegate` flag to `[[trusted]]` entries. When `delegate = true`, the relay accepts media forwarded through the trusted peer from relays that the trusted peer vouches for. When `delegate = false` (default), only media originating from explicitly trusted/peered relays is accepted.
## Trust Levels
| Config | Meaning |
|--------|---------|
| `[[peers]]` | "I connect to you and trust your identity" |
| `[[trusted]]` | "I accept connections from you" |
| `[[trusted]] delegate = true` | "I accept connections from you AND from relays you vouch for" |
| No entry | "I reject your connections and drop your forwarded media" |
## Configuration
```toml
# Relay 1: trusts Relay 2 and delegates trust
[[trusted]]
fingerprint = "relay-2-tls-fingerprint"
label = "Relay 2 (Hub)"
delegate = true # Accept relays that Relay 2 forwards from
# Without delegate (default = false):
[[trusted]]
fingerprint = "relay-4-tls-fingerprint"
label = "Relay 4"
# delegate = false (implicit default)
# Only direct media from Relay 4 is accepted
```
## Protocol Changes
### Relay-to-Relay Media Authorization
When Relay 2 forwards media from Relay 3 to Relay 1, the datagram needs to carry origin information so Relay 1 can decide whether to accept it.
**Option A: Origin tag in datagram** (recommended)
Extend the federation datagram format:
```
[room_hash: 8 bytes][origin_relay_fp: 8 bytes][media_packet]
```
The 8-byte origin fingerprint identifies which relay originally produced the media. The forwarding relay (Relay 2) sets this to the source relay's fingerprint. Relay 1 checks:
1. Is the origin relay directly trusted? → accept
2. Is the forwarding relay trusted with `delegate = true`? → accept
3. Otherwise → drop
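The Option A check can be sketched as a parser for the extended datagram plus a pure authorization function. This is illustrative only: `FederationDatagram`, `parse`, and `is_authorized` are hypothetical names, and the trust table is modeled as a map from an 8-byte fingerprint prefix to that entry's `delegate` flag, which is an assumption about how config fingerprints would be truncated.

```rust
use std::collections::HashMap;

/// `[room_hash: 8 bytes][origin_relay_fp: 8 bytes][media_packet]`
pub struct FederationDatagram<'a> {
    pub room_hash: [u8; 8],
    pub origin_fp: [u8; 8],
    pub media: &'a [u8],
}

/// Returns `None` if the buffer is too short to hold both 8-byte fields.
pub fn parse(buf: &[u8]) -> Option<FederationDatagram<'_>> {
    if buf.len() < 16 {
        return None;
    }
    let mut room_hash = [0u8; 8];
    let mut origin_fp = [0u8; 8];
    room_hash.copy_from_slice(&buf[..8]);
    origin_fp.copy_from_slice(&buf[8..16]);
    Some(FederationDatagram { room_hash, origin_fp, media: &buf[16..] })
}

/// `trusted` maps fingerprint prefix -> `delegate` flag for each
/// `[[trusted]]` entry (assumed representation).
pub fn is_authorized(
    trusted: &HashMap<[u8; 8], bool>,
    forwarder_fp: &[u8; 8],
    origin_fp: &[u8; 8],
) -> bool {
    // 1. Origin relay directly trusted: accept.
    if trusted.contains_key(origin_fp) {
        return true;
    }
    // 2. Forwarding relay trusted with delegate = true: accept.
    // 3. Otherwise: drop.
    trusted.get(forwarder_fp).copied().unwrap_or(false)
}
```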
**Option B: Trust announcement signal**
When Relay 2 connects to Relay 1, it sends a `FederationTrustChain` signal listing which relays it will forward from:
```rust
FederationTrustChain {
    /// Fingerprints of relays this peer may forward media from
    vouched_relays: Vec<String>,
}
```
Relay 1 checks each fingerprint against its policy:
- If Relay 2 has `delegate = true` in Relay 1's config → accept all listed relays
- If Relay 2 has `delegate = false` → reject, only accept direct media from Relay 2
Option B is simpler to implement (no datagram format change) but less granular.
### Recommended: Option B for v1, Option A for v2
Option B is simpler — the trust chain is established at connection time, not per-datagram. The forwarding relay announces what it will forward, and the receiving relay approves or rejects upfront.
## Implementation
### Config Changes
```rust
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct TrustedConfig {
    pub fingerprint: String,
    #[serde(default)]
    pub label: Option<String>,
    /// When true, also accept media forwarded through this relay from
    /// relays it vouches for. Default: false.
    #[serde(default)]
    pub delegate: bool,
}
```
### Federation Signal
```rust
/// Sent after FederationHello — lists relays this peer will forward from.
FederationTrustChain {
    /// TLS fingerprints of relays whose media may be forwarded through us.
    vouched_relays: Vec<String>,
}
```
### Forwarding Authorization
In `handle_datagram`, before forwarding media to local participants:
```rust
// Check if we should accept this forwarded media
let is_authorized = if source_is_direct_peer {
true // Direct peer, always accepted
} else {
// Check if the forwarding peer has delegate=true
let forwarding_peer = fm.find_trusted_by_fingerprint(forwarding_peer_fp);
forwarding_peer.map(|t| t.delegate).unwrap_or(false)
};
if !is_authorized {
warn!("dropping forwarded media from unauthorized relay chain");
return;
}
```
### Relay 2 (Hub) Behavior
When Relay 2 receives `FederationTrustChain` queries from peers:
1. Collect all directly connected peer fingerprints
2. Send `FederationTrustChain { vouched_relays }` to each peer
3. When a new relay connects, update all peers' trust chains
### Anti-Spam Properties
| Attack | Mitigation |
|--------|-----------|
| Unknown relay connects to hub | Hub rejects (not in `[[trusted]]`) |
| Hub forwards spam relay's media | Receiving relay checks delegate flag, drops if false |
| Relay spoofs origin fingerprint | Origin tag is set by the forwarding relay, not the source. The forwarding relay is trusted, so if it lies about origin, the trust is misplaced at the config level. |
| Chain amplification (A→B→C→D→...) | TTL on forwarded datagrams (decrement at each hop, drop at 0). Default TTL=2 (one intermediate relay). |
## TTL for Chain Length
Add a TTL byte to the federation datagram to limit chain depth:
```
[room_hash: 8 bytes][ttl: 1 byte][media_packet]
```
- Default TTL = 2 (allows one intermediate relay: A→B→C)
- Each forwarding relay decrements TTL
- When TTL = 0, don't forward further (only deliver to local participants)
- Configurable per-relay: `max_federation_hops = 2`
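The hop rule can be captured in one function. The sketch below assumes one particular reading of the bullets above: the origin relay stamps the datagram with the configured TTL, each receiving relay decrements it, always delivers to its local participants, and forwards onward only while the decremented value is still above zero. With a TTL of 2 this yields the A→B→C chain from the example. `HopDecision` and `on_receive` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HopDecision {
    /// Deliver to local participants only; the chain ends here.
    DeliverOnly,
    /// Deliver locally and forward to peers with the decremented TTL.
    DeliverAndForward { ttl: u8 },
}

/// Decide what to do with a federation datagram that arrived with `ttl`.
pub fn on_receive(ttl: u8) -> HopDecision {
    // saturating_sub guards against a malformed datagram carrying TTL 0.
    let remaining = ttl.saturating_sub(1);
    if remaining == 0 {
        HopDecision::DeliverOnly
    } else {
        HopDecision::DeliverAndForward { ttl: remaining }
    }
}
```

Under this reading, relay B receives TTL 2 from A and forwards with TTL 1, and relay C receives TTL 1 and stops the chain.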
## Milestones
| Phase | Scope | Effort |
|-------|-------|--------|
| 1 | Add `delegate` field to `TrustedConfig` | 0.5 day |
| 2 | `FederationTrustChain` signal + announcement | 1 day |
| 3 | Authorization check in `handle_datagram` | 0.5 day |
| 4 | TTL in federation datagrams | 0.5 day |
| 5 | Testing: authorized vs unauthorized forwarding | 0.5 day |
## Non-Goals (v1)
- Per-room trust policies (trust Relay X only for room "android")
- Dynamic trust negotiation (relays negotiate trust level at runtime)
- Revocation (removing a relay from trust chain requires config edit + restart)
- Cryptographic proof of origin (signed datagrams from source relay)


@@ -0,0 +1,407 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: DRED Integration & Opus-Tier FEC Simplification
## Problem
WarzonePhone's audio loss-recovery stack is built around classical Opus + application-level RaptorQ FEC. It was the right answer when WZP was designed, but libopus 1.5 (December 2023) introduced **Deep REDundancy (DRED)** — a neural speech-recovery feature that is strictly better than classical FEC for the loss patterns VoIP calls actually experience. We are paying real latency, bitrate, and complexity costs for protection that DRED now does better and cheaper.
Concretely, on every Opus call today we pay:
- **~40-100 ms of receiver-side latency** waiting for RaptorQ block completion before decode
- **10-20% bitrate overhead** from RaptorQ repair symbols (more on studio profiles)
- **~20-40% codec-internal overhead** from Opus inband FEC (LBRR)
- Classical Opus PLC on loss bursts exceeding the RaptorQ block size — which sounds robotic and gap-ridden
…in exchange for bit-exact recovery of isolated single-frame losses, which is perceptually indistinguishable from classical Opus PLC for 20 ms of speech. The protection is misaligned with the failure modes.
DRED delivers:
- **Zero added receive latency** — reconstruction runs only on detected loss
- **~1 kbps flat bitrate overhead** regardless of base bitrate
- **Plausible reconstruction of bursts up to ~1 second** — DRED's headline capability, exactly the regime RaptorQ can't touch
- Neural PLC that sounds like continuous speech, not a gap
We also have a second, unrelated problem blocking adoption: our FFI crate `audiopus_sys 0.2.2` vendors **libopus 1.3**, predating DRED entirely. We cannot enable DRED without first swapping the FFI layer. The naïve choice (`opus` crate from SpaceManiac) is a trap — it depends on the same dead `audiopus_sys`. The real target is `opusic-c 1.5.5` by DoumanAsh, which vendors libopus 1.5.2 with full DRED support and documents Android NDK cross-compile.
This PRD covers the FFI swap, DRED enablement, the decision to **remove RaptorQ and Opus inband FEC from the Opus tiers entirely** (keeping RaptorQ only for Codec2 where DRED is N/A), and the jitter buffer refactor that the DRED lookahead/backfill pattern requires.
## Goals
- Replace `audiopus 0.3.0-rc.0` + `audiopus_sys 0.2.2` (dead upstream, libopus 1.3) with `opusic-c 1.5.5` + `opusic-sys 0.6.0` (active upstream, libopus 1.5.2)
- Enable DRED on every Opus profile with a tiered duration policy, lower at studio bitrates and higher at degraded bitrates
- Disable Opus inband FEC (LBRR) on all Opus profiles — opusic-c's own docs recommend this, and it overlaps DRED's job
- Remove `wzp-fec` (RaptorQ) from the Opus tiers entirely — the latency and bitrate savings are real, and DRED strictly dominates it on speech
- Keep RaptorQ + current FEC ratios on the Codec2 tiers unchanged — DRED is libopus-only, Codec2 has no neural equivalent
- Refactor `wzp-transport::jitter` to a lookahead/backfill pattern that lets DRED reconstruct loss windows when the next packet arrives, instead of the current "wait for block completion or fall through to classical PLC" policy
- Ship behind a runtime escape hatch (`AUDIO_USE_LEGACY_FEC`) for the first rollout window so we can revert to RaptorQ if DRED has surprises in real-world conditions
## Non-goals
- Changing Codec2 at all. Codec2 1200 / 3200 are outside the DRED lineage and keep their current RaptorQ protection, block sizes, and PLC path.
- Adding new Opus bitrate tiers or changing the quality adaptation thresholds. This PRD is about the protection layer, not the bitrate ladder.
- Enabling OSCE (Opus Speech Coding Enhancement — a separate libopus 1.5 neural post-processor that opusic-c exposes via an `osce` feature flag). Valuable, complementary, and free once opusic-c is in — but out of scope here to keep the PRD focused. Track as follow-up.
- Video, audio-over-MoQ, or any protocol-layer changes discussed in prior conversations.
- Touching the wzp-web / browser client. Browser Opus is a separate codepath via WebAudio / WASM libopus and is not affected by the native FFI swap.
## Background
### How the three protection mechanisms actually differ
| | Opus inband FEC (LBRR) | RaptorQ (wzp-fec) | DRED |
|---|---|---|---|
| Layer | codec-internal | application, across Opus packets | codec-internal |
| What it sends | low-bitrate copy of the *previous* frame, embedded in every packet | fountain-code repair symbols across a block | neural-coded history of the recent past |
| Protection horizon | 1 packet back | block duration (currently 100 ms, proposed 40 ms) | configurable, 0–1040 ms |
| Recovery granularity | 1 frame (lower quality) | 1 frame (bit-exact) | 10 ms frames (plausible reconstruction) |
| Latency cost | 0 ms | block duration on receive | 0 ms |
| Bitrate cost | ~20–40% of base | `fec_ratio × base` (currently +20% GOOD, +50% DEGRADED) | ~1 kbps flat |
| Effective loss tolerance | ~single-packet losses | up to `(repair symbols / block)` losses, cliff beyond | bursts up to the configured duration |
| Content assumption | any Opus audio | any | speech (DRED model is speech-trained) |
### Why DRED dominates on the Opus tiers
Loss-scenario walkthrough (verified against opusic-c and libopus 1.5 docs):
- **1-frame loss (20 ms)**: RaptorQ recovers bit-exactly, DRED wouldn't run (classical Opus PLC is perceptually indistinguishable for single 20 ms frames). RaptorQ "wins" on paper but not on ears.
- **2–3 frame burst (40–60 ms)**: RaptorQ at current ratio 0.2 hits its tolerance cliff. DRED handles this trivially, well within a 200 ms window.
- **5–10 frame burst (100–200 ms)**: RaptorQ completely overwhelmed at any reasonable ratio. DRED's sweet spot.
- **10+ frame burst (>200 ms)**: RaptorQ useless. DRED at 500–1000 ms still recovers.
The only scenario where RaptorQ strictly beats DRED is bit-exact recovery of isolated single-frame losses — which is perceptually irrelevant for speech. In every other scenario DRED either ties or wins.
### Why Codec2 keeps RaptorQ
DRED lives inside libopus — it does not help Codec2 at all. Codec2's classical PLC is a parametric-vocoder interpolation that produces noticeably robotic artifacts on loss. On the Codec2 tiers, RaptorQ is the only protection we have, and it should stay at current ratios (1.0 on CATASTROPHIC, 0.5 on the Codec2 3200 tier).
### The opusic-c / opusic-sys situation
- `opusic-sys 0.6.0` — FFI crate, published 2026-03-17, vendors libopus 1.5.2 via its `bundled` feature (on by default), documents Android NDK cross-compile via `ANDROID_NDK_HOME` (which our `wzp-android/build.rs` already sets). Exposes raw bindings to `opus_dred_parse`, `opus_decoder_dred_decode`, and the `OpusDRED` state struct.
- `opusic-c 1.5.5`: high-level safe wrapper. Its **encoder** side is fine: it exposes `Encoder::set_dred_duration(value: u8) -> Result<(), ErrorCode>` with range `0..=104` (each unit is 10 ms, so 0–1040 ms is configurable). It also exposes `set_bitrate`, `set_inband_fec`, `set_dtx`, `set_packet_loss`, `set_signal`, `set_complexity`, `set_bandwidth`, `set_application` on the encoder.
- **opusic-c's decoder-side DRED wrapper is NOT sufficient for our architecture.** Confirmed by reading the source of `opusic-c/src/dred.rs`:
  1. `Dred::decode_to` ignores the `dred_end` output of `opus_dred_parse` (prefixed `_dred_end`), so the caller cannot know how much DRED history a given packet actually carried.
  2. In `opus_decoder_dred_decode(decoder, dred, dred_offset, pcm, frame_size)`, the wrapper passes `frame_size` to BOTH the `dred_offset` and `frame_size` arguments. This looks like a bug: it means reconstruction always starts at offset `frame_size` into the DRED window, not at an arbitrary caller-chosen offset. Arbitrary-gap reconstruction (which we need for the lookahead/backfill pattern) requires proper offset control.
  3. `DredPacket` is owned internally by a `Dred` instance; its internal buffer is overwritten on every `decode_to` call. We cannot hold a ring of parsed DredPackets from multiple recent arrivals, which is exactly what the lookahead/backfill jitter buffer pattern requires.
- **Decision**: use opusic-c for the encoder path (its wrapper is correct and saves work), and drop to `opusic-sys` raw FFI for the entire decoder path AND the DRED reconstruction path. Both use a single shared `DecoderHandle` so internal decoder state stays consistent. **Verified at pre-flight**: `opusic_c::Decoder.inner` is `pub(crate)`, so there is no way to reach the raw `*mut OpusDecoder` from outside opusic-c. Running two parallel decoders (one from opusic-c for audio, one from opusic-sys for DRED) would cause state drift because the DRED-only decoder wouldn't see the normal decode calls. Single unified decoder via opusic-sys is the only correct architecture.
- **Three FFI handles required** per decode session: `opusic_c::Encoder` (encoder side, unchanged), our own `DecoderHandle` wrapping `*mut OpusDecoder` from opusic-sys (for normal decode AND for the `OpusDecoder` pointer passed to `opus_decoder_dred_decode`), and a new `DredDecoderHandle` wrapping `*mut OpusDREDDecoder` from opusic-sys (passed to `opus_dred_parse`). Note: `OpusDREDDecoder` is a **separate struct** from `OpusDecoder` in libopus 1.5 — verified from opus.h. Allocation via `opus_dred_decoder_create()` (confirm exact symbol name at Phase 3a start).
- The `opus` crate from SpaceManiac (0.3.1, published 2026-01-03) is a trap: it depends on `audiopus_sys ^0.2.0` — the same dead FFI crate we're trying to get away from. Do not use.
- **Follow-up (out of scope for this PRD)**: upstream the fixes to `opusic-c/src/dred.rs` (preserve `dred_end`, fix the `dred_offset` double-pass, expose `DredPacket` externally). Worth a GitHub PR once our own implementation has proven correct. Would let us eventually delete our internal FFI wrapper.
### Critical note from opusic-c docs
From the `dred` module documentation: *"The documentation recommends disabling in-band FEC and using `Application::Voip` for optimal results."* This applies to the **codec-internal** Opus inband FEC (LBRR), not our application-level RaptorQ. The two are independent layers. This PRD disables both on Opus tiers, but for different reasons — inband FEC per upstream recommendation, RaptorQ per the analysis above.
### The libopus 1.5 loss-percentage gating quirk
In libopus 1.5, both inband FEC and DRED are gated on `OPUS_SET_PACKET_LOSS_PERC` being non-zero. If the encoder thinks loss is 0%, it will not emit DRED data even when `set_dred_duration` is configured. We must plumb a meaningful loss percentage into the encoder continuously, floored at a small non-zero value so DRED stays active even when the network is perfect. Planned floor: **5%**, overridden upward by the real `QualityReport` loss value when it exceeds the floor.
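
The floor-and-override rule can be sketched as a pure helper. This is only an illustration: it assumes the `QualityReport` loss value arrives as a fraction in `0.0..=1.0`, which this PRD does not specify, and `encoder_loss_pct` is a hypothetical name.

```rust
/// Planned floor from this PRD: keeps DRED active even on a perfect network.
pub const DRED_LOSS_FLOOR_PCT: u8 = 5;

/// Convert a reported loss fraction (assumed 0.0..=1.0) into the percentage
/// handed to the Opus encoder, never going below the floor.
pub fn encoder_loss_pct(reported_loss: f32) -> u8 {
    // Clamp first so out-of-range reports cannot exceed 100%.
    // A NaN report survives the clamp, casts to 0, and is then lifted to the floor.
    let pct = (reported_loss.clamp(0.0, 1.0) * 100.0).round() as u8;
    pct.max(DRED_LOSS_FLOOR_PCT)
}
```

The result would be passed to the encoder's loss setter (`set_packet_loss` in the opusic-c mapping described later) each time a new report arrives.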
## Solution
### High-level architecture change
**Before** (per Opus frame encode path):
```
PCM → AdaptiveEncoder.encode (Opus)
    → inband FEC embedded in packet
    → wzp-fec FEC encoder (accumulate into block, generate repair symbols)
    → DATAGRAM out
```
**Before** (per Opus frame decode path):
```
DATAGRAM in → wzp-fec block assembly (wait for block, recover if possible)
            → AdaptiveDecoder.decode (Opus) / decode_lost (classical PLC)
            → PCM
```
**After** (Opus tiers):
```
PCM → OpusEncoder.encode (opusic-c, DRED enabled via set_dred_duration, inband FEC off)
    → DATAGRAM out directly (no RaptorQ block)
```
```
DATAGRAM in → jitter buffer (lookahead/backfill)
            → on frame arrival: OpusDecoder.decode
            → on detected gap: if a later packet's DRED state covers it → DredReconstructor::reconstruct(gap)
                               else → OpusDecoder.decode_lost (classical PLC)
            → PCM
```
**After** (Codec2 tiers): unchanged. RaptorQ block encoding + classical Codec2 decode path stay exactly as they are today.
### New per-profile protection matrix
| Profile | Codec | Inband FEC | RaptorQ ratio | DRED duration | Total overhead |
|---|---|---|---|---|---|
| `STUDIO_64K` | Opus 64k | **off** | **none** | **10 frames (100 ms)** | +1 kbps |
| `STUDIO_48K` | Opus 48k | **off** | **none** | **10 frames (100 ms)** | +1 kbps |
| `STUDIO_32K` | Opus 32k | **off** | **none** | **10 frames (100 ms)** | +1 kbps |
| `GOOD` | Opus 24k | **off** | **none** | **20 frames (200 ms)** | +1 kbps |
| `NORMAL_16K` | Opus 16k | **off** | **none** | **20 frames (200 ms)** | +1 kbps |
| `DEGRADED` | Opus 6k | **off** | **none** | **50 frames (500 ms)** | +1 kbps |
| `CODEC2_3200` | Codec2 3200 | N/A | **0.5 (unchanged)** | N/A | +50% |
| `CATASTROPHIC` | Codec2 1200 | N/A | **1.0 (unchanged)** | N/A | +100% |
| `COMFORT_NOISE` | CN | — | — | — | — |
DRED duration rationale:
- **Studio tiers (100 ms)**: loss is rare on the networks where users pick studio quality. Short DRED window keeps decode-side CPU modest. Still covers multi-frame bursts that classical PLC can't touch.
- **Normal tiers (200 ms)**: balanced baseline. Handles the common VoIP loss pattern (20–150 ms bursts from wifi roam, transient congestion).
- **Degraded tier (500 ms)**: users on Opus 6k are by definition on a bad link. Long DRED window buys maximum burst resilience where it matters most. Still well under the 1040 ms cap.
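
The matrix above maps onto a single lookup. The sketch below is an assumption-laden illustration: the `CodecId` variants are hypothetical stand-ins for the real enum, and the returned value is in the 10 ms units that `set_dred_duration` is described as taking (range `0..=104`).

```rust
/// Hypothetical stand-in for the real codec identifier enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CodecId {
    Opus64k,
    Opus48k,
    Opus32k,
    Opus24k,
    Opus16k,
    Opus6k,
    Codec2At3200,
    Codec2At1200,
    ComfortNoise,
}

/// Upper bound of the DRED duration setter, in 10 ms units (1040 ms).
pub const MAX_DRED_UNITS: u8 = 104;

/// DRED duration per profile, in 10 ms units, following the protection matrix.
/// Non-Opus codecs get 0 because DRED does not apply to them.
pub fn dred_duration_for(codec: CodecId) -> u8 {
    match codec {
        CodecId::Opus64k | CodecId::Opus48k | CodecId::Opus32k => 10, // studio: 100 ms
        CodecId::Opus24k | CodecId::Opus16k => 20,                    // normal: 200 ms
        CodecId::Opus6k => 50,                                        // degraded: 500 ms
        CodecId::Codec2At3200 | CodecId::Codec2At1200 | CodecId::ComfortNoise => 0,
    }
}

/// Same value expressed in milliseconds, convenient for logs and debug reports.
pub fn dred_duration_ms(codec: CodecId) -> u32 {
    u32::from(dred_duration_for(codec)) * 10
}
```

Keeping the table in one function means a profile switch only has to call it again with the new codec.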
### Runtime escape hatch
Ship with a single environment variable / settings flag: **`AUDIO_USE_LEGACY_FEC`**. When set, the entire Opus-tier path reverts to the pre-PRD behavior: RaptorQ re-enabled at the old ratios, Opus inband FEC re-enabled, DRED disabled (`set_dred_duration(0)`). This is the rollback safety valve for the first production window.
Escape hatch semantics:
- Read once at `CallEncoder::new` / `CallDecoder::new` time. Call-scoped, not re-read mid-call.
- Exposed via Android Settings UI as a hidden "Legacy FEC (debug)" toggle, and as a CLI flag `--legacy-fec` on the desktop client.
- Logged in `DebugReporter` so we can tell which mode a call was in when diagnosing.
- Removed entirely after 2 months of stable production with no regressions reported. Removal is a follow-up PR, not part of this PRD's scope.
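
A minimal sketch of the read-once semantics follows. `ProtectionMode` and its methods are hypothetical names, and the parsing rule (any non-empty value other than `0` enables legacy mode) is an assumption; the PRD only says "when set".

```rust
/// Which loss-protection path a call uses (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProtectionMode {
    /// New path: DRED on, inband FEC off, no RaptorQ on Opus tiers.
    Dred,
    /// Escape hatch: RaptorQ and inband FEC re-enabled, DRED disabled.
    LegacyFec,
}

impl ProtectionMode {
    /// Interpret a raw flag value. Assumed rule: unset, empty, or "0" means DRED.
    pub fn from_flag(value: Option<&str>) -> Self {
        match value.map(str::trim) {
            Some(v) if !v.is_empty() && v != "0" => ProtectionMode::LegacyFec,
            _ => ProtectionMode::Dred,
        }
    }

    /// Read the environment once; intended to be called at encoder/decoder
    /// construction and stored for the lifetime of the call.
    pub fn from_env() -> Self {
        Self::from_flag(std::env::var("AUDIO_USE_LEGACY_FEC").ok().as_deref())
    }
}
```

Storing the resulting value in the call objects, rather than re-reading the environment, is what makes the mode call-scoped.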
## Detailed design
### Phase 0 — FFI crate swap (prerequisite, no behavior change)
**Files touched:**
- `Cargo.toml` (workspace root) — replace `audiopus = "0.3.0-rc.0"` with `opusic-c = { version = "1.5.5", features = ["bundled", "dred"] }` and `opusic-sys = { version = "0.6.0", features = ["bundled"] }`. The `opusic-sys` direct dep is for the DRED decoder path below.
- `crates/wzp-codec/Cargo.toml` — update `audiopus = { workspace = true }` to `opusic-c = { workspace = true }`, add `opusic-sys = { workspace = true }`, add `bytemuck = "1"` for the i16↔u16 slice cast.
- `crates/wzp-codec/src/opus_enc.rs` — rewrite against opusic-c. API mapping:
  - `audiopus::coder::Encoder::new(SampleRate::Hz48000, Channels::Mono, Application::Voip)` → `opusic_c::Encoder::new(Channels::Mono, SampleRate::Hz48000, Application::Voip)` (argument order swapped)
  - `set_bitrate(Bitrate::BitsPerSecond(bps))` → `set_bitrate(Bitrate::Bits(bps))` or equivalent variant (verify at implementation time)
  - `set_inband_fec(true/false)` → `set_inband_fec(InbandFec::On/Off)` (now an enum)
  - `set_packet_loss_perc(u8)` → `set_packet_loss(u8)` (method renamed)
  - `set_dtx(bool)`, `set_signal(Signal::Voice)`, `set_complexity(u8)`: names match
  - `encode(&[i16], &mut [u8])` → `encode_to_slice(&[u16], &mut [u8])` with `bytemuck::cast_slice::<i16, u16>(pcm)` at the call site
- `crates/wzp-codec/src/opus_dec.rs` — same-style rewrite for the `Decoder` path. Note that opusic-c's decoder methods take `decode_fec: bool` as a parameter directly (not a separate ctl).
- `vendor/audiopus_sys/` — delete the directory (only exists on `feat/desktop-audio-rewrite`, not on `android-rewrite`, so this is a no-op on the current branch but do remove the `[patch.crates-io]` block from Cargo.toml when merging back).
**Acceptance criteria:**
- `cargo check --workspace` passes on Linux x86_64, macOS, and Android NDK cross-compile.
- All existing codec unit tests in `crates/wzp-codec/src/adaptive.rs` pass unchanged. DRED is still disabled at this phase (default `set_dred_duration(0)`), so behavior is equivalent to pre-swap libopus 1.3 for call quality purposes.
- A short real-call smoke test produces audio identical to current behavior (no audible regression).
- `opusic_c::version()` at startup logs libopus version containing `1.5.2` — hard signal that the swap landed correctly.
### Phase 1 — DRED encoder enable on all Opus profiles
**Files touched:**
- `crates/wzp-codec/src/opus_enc.rs`:
- Add `fn dred_duration_for(codec: CodecId) -> u8` returning the per-profile value from the matrix above (10 / 20 / 50 frames).
- In `OpusEncoder::new`, after the existing `set_bitrate`/`set_signal`/`set_complexity` block: call `inner.set_inband_fec(InbandFec::Off)`, then `inner.set_dred_duration(dred_duration_for(profile.codec))`, then `inner.set_packet_loss(5)` as the default floor.
- Add `pub fn set_dred_duration(&mut self, frames: u8)` to allow the adaptive ladder to update DRED duration on profile switch.
- In the existing `set_profile` impl, call `set_dred_duration(dred_duration_for(profile.codec))` after `apply_bitrate`.
- `crates/wzp-codec/src/adaptive.rs`:
- `AdaptiveEncoder::set_profile` already delegates to `self.opus.set_profile` — no changes needed. DRED update rides along.
- `crates/wzp-client/src/call.rs` (and equivalent on `wzp-android/src/pipeline.rs`):
- In the `QualityReport` handler (wherever we currently call `set_expected_loss` / `set_packet_loss_perc`), also ensure the loss value is floored at 5% before passing to the Opus encoder. This is a 1-line change.
**Acceptance criteria:**
- Encoder produces DRED-enabled Opus packets. Verifiable via libopus's reference decoder in debug mode, or by wire capture + inspection. Note that `opus_packet_get_nb_frames` only counts the audio frames in a packet and is not a size measure; DRED data travels in the packet's padding extension area, so the check should look there (confirm the exact inspection method at implementation time).
- Total outgoing bitrate on Opus 24k is ~25 kbps (up from ~24 kbps) — confirms ~1 kbps DRED overhead.
- On a lossless path, decoder output is audibly identical to Phase 0.
- Escape hatch `AUDIO_USE_LEGACY_FEC=1` cleanly reverts the DRED enable (calls `set_dred_duration(0)` and `set_inband_fec(InbandFec::On)` instead).
### Phase 2 — RaptorQ removal on Opus tiers
**Files touched:**
- `crates/wzp-client/src/call.rs`:
- In `CallEncoder::encode_frame` (or wherever `wzp_fec::Encoder::add_source_symbol` is called), gate the RaptorQ path on `!profile.codec.is_opus()` — Opus frames go straight to DATAGRAM emit, Codec2 frames continue through RaptorQ.
- When a profile switch crosses the Opus↔Codec2 boundary, flush/reset the RaptorQ encoder state.
- `crates/wzp-android/src/pipeline.rs`:
- Mirror the same gate in the Android encode path.
- `crates/wzp-proto/src/packet.rs`:
- `MediaHeader.fec_block` and `fec_symbol` are still valid fields on the wire. For Opus packets we emit `fec_block = 0`, `fec_symbol = 0`, `fec_ratio_encoded = 0`. No wire format change; the receiver just sees all-zeros in the FEC fields for Opus packets and skips the FEC decoder path.
- Bump protocol version to v1 → v2? **No** — the change is semantically backward compatible because existing RaptorQ decoders handle a zero ratio correctly (ratio 0.0 means "no repair symbols expected"). Old receivers can still decode new Opus packets; they just won't see any DRED benefit because their libopus is old. This is a property we want: the opposite (new receiver, old sender) is the more common mixed-version case during rollout and also Just Works.
- `crates/wzp-client/src/call.rs`, `CallDecoder`:
- Symmetric change: Opus frames bypass the RaptorQ block assembly, go straight to the decoder. Only Codec2 frames (`codec_id.is_codec2()`) feed through `wzp-fec` block decoding.
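
The gate and the boundary reset can be sketched together. Everything here is illustrative: `CodecId` is reduced to four hypothetical variants, and `FecGate` / `SendPath` are invented names rather than existing `wzp-client` types.

```rust
/// Reduced, hypothetical codec enum for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CodecId {
    Opus24k,
    Opus6k,
    Codec2At3200,
    Codec2At1200,
}

impl CodecId {
    pub fn is_opus(self) -> bool {
        matches!(self, CodecId::Opus24k | CodecId::Opus6k)
    }
}

/// Where an encoded frame goes next.
#[derive(Debug, PartialEq, Eq)]
pub enum SendPath {
    /// Opus: emit the datagram directly, FEC header fields all zero.
    DirectDatagram,
    /// Codec2: accumulate into a RaptorQ block as today.
    RaptorQBlock,
}

/// Tracks the previous frame's codec family so a switch across the
/// Opus/Codec2 boundary can trigger a RaptorQ state reset.
pub struct FecGate {
    last_was_opus: Option<bool>,
}

impl FecGate {
    pub fn new() -> Self {
        Self { last_was_opus: None }
    }

    /// Returns the path for this frame and whether RaptorQ state must be reset first.
    pub fn route(&mut self, codec: CodecId) -> (SendPath, bool) {
        let is_opus = codec.is_opus();
        let crossed = matches!(self.last_was_opus, Some(prev) if prev != is_opus);
        self.last_was_opus = Some(is_opus);
        let path = if is_opus {
            SendPath::DirectDatagram
        } else {
            SendPath::RaptorQBlock
        };
        (path, crossed)
    }
}
```

The same decision applies on the receive side, keyed on the codec ID in the incoming header instead of the local profile.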
**Acceptance criteria:**
- Outgoing Opus packets have `fec_ratio_encoded == 0` (verifiable with the existing wire capture tooling in `wzp-client/src/echo_test.rs`).
- On a clean network, receiver latency (measured as encode-to-playout one-way delay) drops by ~40 ms versus Phase 1. This is the primary win and should be directly measurable with the existing telemetry.
- Codec2 calls show no latency change and no packet-format change. Regression-test Codec2 3200 and Codec2 1200 specifically.
- Total outgoing bitrate on Opus 24k drops from ~28.8 kbps (24k base + 0.2 RaptorQ ratio) to ~25 kbps (24k base + ~1 kbps DRED). Direct savings observable in network telemetry.
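
The bitrate figures in the last criterion follow from simple arithmetic, reproduced here as a sketch so the expected telemetry values can be recomputed for other profiles. The function names are hypothetical, and the model ignores packet header overhead.

```rust
/// Wire bitrate with RaptorQ: base plus `fec_ratio` worth of repair symbols.
pub fn legacy_opus_kbps(base_kbps: f64, fec_ratio: f64) -> f64 {
    base_kbps * (1.0 + fec_ratio)
}

/// Wire bitrate with DRED: base plus a flat overhead (about 1 kbps per this PRD).
pub fn dred_opus_kbps(base_kbps: f64, dred_overhead_kbps: f64) -> f64 {
    base_kbps + dred_overhead_kbps
}

/// Absolute saving in kbps when moving from the legacy path to DRED.
pub fn saving_kbps(base_kbps: f64, fec_ratio: f64, dred_overhead_kbps: f64) -> f64 {
    legacy_opus_kbps(base_kbps, fec_ratio) - dred_opus_kbps(base_kbps, dred_overhead_kbps)
}
```

For Opus 24k at ratio 0.2 this gives 28.8 kbps before and 25 kbps after, a saving of 3.8 kbps (about 13%).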
### Phase 3 — DRED reconstruction wrapper + jitter buffer lookahead/backfill refactor
This phase is larger than originally estimated because opusic-c's decoder-side DRED wrapper is unusable for our architecture (see Background). We write our own safe wrapper over `opusic-sys` raw FFI first, then plumb it through the jitter buffer.
**Step 3a — Safe DRED reconstruction wrapper in `wzp-codec`:**
New file `crates/wzp-codec/src/dred_ffi.rs`. Wraps the raw libopus 1.5 DRED API:
- `pub struct DredState` — owns an `OpusDRED` buffer (allocated via `opusic_sys::opus_dred_alloc` or equivalent; size is fixed at 10,592 bytes per libopus 1.5). `Clone` is intentionally NOT implemented — the state is heap-owned and non-trivial to copy.
- `pub fn parse_from_packet(&mut self, decoder: &opusic_c::Decoder, packet: &[u8], max_dred_samples: i32) -> Result<DredParseResult, DredError>` — wraps `opus_dred_parse`, preserves the `dred_end` output (number of samples of history the packet carried), returns it in `DredParseResult { samples_available: i32, frames_available: u8 }`.
- `pub fn reconstruct_into(&self, decoder: &mut opusic_c::Decoder, dred_offset_samples: i32, output: &mut [i16]) -> Result<usize, DredError>` — wraps `opus_decoder_dred_decode`, takes the offset explicitly, decodes `output.len()` samples starting from that offset in the DRED window.
- All `unsafe` contained here, strict bounds checking on offsets, Rust-level panic safety. Unit tests use a reference encoder + known-good reference decoder to verify that reconstruction at specific offsets produces expected output.
- Depends on `opusic-sys` directly. The decoder handle must be reachable as a raw `*mut OpusDecoder`. Pre-flight (see Background) found that `opusic_c::Decoder.inner` is `pub(crate)`, so opusic-c does not expose the pointer; we therefore create our own thin decoder wrapper (`DecoderHandle`) in `dred_ffi.rs` using raw opusic-sys, losing the convenience of opusic-c's decoder but keeping its encoder. The `opusic_c::Decoder` parameters in the two signatures above are placeholders for that wrapper.
New `pub trait DredReconstructor` in `wzp-codec/src/lib.rs`:
```rust
pub trait DredReconstructor: Send {
    /// Parse DRED state from an arriving Opus packet into `state`.
    /// Returns number of 48 kHz samples of history available, or 0 if the packet has no DRED.
    fn parse(&mut self, state: &mut DredState, packet: &[u8]) -> Result<i32, DredError>;

    /// Reconstruct `output.len()` samples from `state`, starting at the given
    /// sample offset (measured from the end of the DRED window going backward).
    fn reconstruct(
        &mut self,
        state: &DredState,
        offset_samples: i32,
        output: &mut [i16],
    ) -> Result<usize, DredError>;
}
```
Implement `DredReconstructor` over the `dred_ffi::DredState` + opusic-c Decoder combination. This is the clean boundary the jitter buffer will talk to.
**Step 3b — Jitter buffer refactor in `crates/wzp-transport/src/jitter.rs`:**
- Current behavior: buffer waits a fixed number of frames of jitter before emitting; on a missing slot, after a timeout it gives up and signals the decoder to run `decode_lost()` (classical Opus PLC or Codec2 PLC).
- New behavior on Opus tiers: when a frame arrives (in-order or late), first call `DredReconstructor::parse` on it to update a rolling ring of `DredState` instances tagged with their originating sequence number. When a gap is detected (missing sequence number between last-emitted and current arrival), and the ring contains a `DredState` from a nearby packet that covers the gap's sample offset, call `DredReconstructor::reconstruct` with the correct offset to synthesize the missing frames, splice them into playout, then continue normal decode.
- If no DRED state covers the gap (e.g., gap too far back, or every nearby packet was dropped), fall through to classical PLC exactly as today. The classical path stays intact as the ultimate fallback.
- Codec2 packets bypass the entire DRED ring. They are not inspected for DRED state and take the unchanged classical PLC path.
- Ring sizing: `max_dred_duration_frames` + `jitter_depth_frames` worth of `DredState` instances. At 500 ms DRED on degraded tier + 60 ms jitter depth, that's ~28 DredState instances × 10,592 bytes ≈ 300 KB. Acceptable. On studio tier with 100 ms DRED it's only ~80 KB.
- The jitter buffer takes a `Box<dyn DredReconstructor>` at construction, passed in by the call engine. `wzp-transport` does NOT take a direct dep on `opusic-c` or `opusic-sys` — it only knows about the trait defined in `wzp-codec`.
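
The ring sizing and gap lookup described above can be sketched without any FFI. This is an illustration only: `DredRing`, `RingEntry`, and `find_cover` are hypothetical names, entries hold just the metadata (the real ring would also own the parsed `DredState`), frames are assumed to be 20 ms (960 samples at 48 kHz) with one frame per sequence number, and the coverage test assumes the offset is counted backward from the packet that carried the DRED data.

```rust
use std::collections::VecDeque;

pub const SAMPLES_PER_FRAME: i32 = 960; // 20 ms at 48 kHz (assumed frame size)
pub const DRED_STATE_BYTES: usize = 10_592; // per-state size quoted in this PRD
const FRAME_MS: u32 = 20;

/// Number of ring slots needed for a DRED window plus the jitter depth.
pub fn ring_capacity(max_dred_ms: u32, jitter_depth_ms: u32) -> usize {
    ((max_dred_ms + jitter_depth_ms) / FRAME_MS) as usize
}

/// Metadata kept per arrived packet.
pub struct RingEntry {
    pub seq: u16,
    /// Samples of DRED history the packet carried (the preserved `dred_end`).
    pub samples_available: i32,
}

pub struct DredRing {
    entries: VecDeque<RingEntry>,
    capacity: usize,
}

impl DredRing {
    pub fn new(capacity: usize) -> Self {
        Self { entries: VecDeque::with_capacity(capacity), capacity }
    }

    /// O(1) insert: evict the oldest entry once the ring is full.
    pub fn push(&mut self, entry: RingEntry) {
        if self.entries.len() >= self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }

    /// Find the nearest later packet whose DRED history reaches back to
    /// `missing_seq`. Returns (carrier seq, backward offset in samples).
    pub fn find_cover(&self, missing_seq: u16) -> Option<(u16, i32)> {
        self.entries
            .iter()
            .filter_map(|e| {
                // Wrapping distance; values above i16::MAX mean "older than the gap".
                let distance = i32::from(e.seq.wrapping_sub(missing_seq));
                if distance == 0 || distance > i32::from(i16::MAX) {
                    return None;
                }
                let offset = distance * SAMPLES_PER_FRAME;
                (offset <= e.samples_available).then_some((e.seq, offset))
            })
            .min_by_key(|&(_, offset)| offset)
    }
}
```

For the degraded tier, `ring_capacity(500, 60)` yields 28 slots, or 296,576 bytes of `DredState`, which is the roughly 300 KB quoted above. When `find_cover` returns `None`, the caller falls through to classical PLC.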
**Files touched:**
- `crates/wzp-codec/src/dred_ffi.rs` (new, ~150–300 lines)
- `crates/wzp-codec/src/lib.rs` — expose `DredReconstructor`, `DredState`, `DredError` types
- `crates/wzp-codec/Cargo.toml` — add `opusic-sys = { workspace = true }` as a direct dep (already done in Phase 0)
- `crates/wzp-transport/src/jitter.rs` — lookahead/backfill refactor, DRED ring
- `crates/wzp-transport/Cargo.toml` — add `wzp-codec = { workspace = true }` (likely already present) for the trait import
- `crates/wzp-client/src/call.rs` — construct a `DredReconstructor` and pass into `CallDecoder`'s jitter buffer
- `crates/wzp-android/src/pipeline.rs` — same on Android
**Acceptance criteria:**
- Unit tests in `dred_ffi.rs`: round-trip a known speech waveform through an encoder with DRED enabled, parse the resulting packets, reconstruct at several different offsets, verify the reconstructed samples are within an energy/spectral threshold of the original. (Not bit-exact — DRED reconstruction is lossy by design.)
- Synthetic loss test on the full pipeline: inject 200 ms bursts at 10% rate into a looped call, verify the DRED reconstruction rate on receiver telemetry is ≥95% of all loss events whose gaps fall within the configured DRED duration window.
- Reconstructed audio is audibly continuous on 40–200 ms bursts: no gaps, no classical-PLC robot artifact. Verified on real voice samples (not just sine tones), and on at least two distinct speaker profiles (male, female) because DRED can have voice-dependent quality.
- End-to-end latency metric is unchanged versus Phase 2 (no regression from adding the lookahead path). The DRED ring insertion on packet arrival must be O(1) in practice.
- Existing `echo_test.rs` and `drift_test.rs` pass with the new jitter buffer.
- Codec2 path uses classical PLC exclusively (no DRED invocation) because Codec2 packets don't carry DRED state. Verify by injecting loss on a Codec2 call and confirming zero DRED reconstruction telemetry events during that call.
- `wzp-transport` has no direct dependency on `opusic-sys` or `opusic-c` in its `Cargo.toml` after the refactor — only on `wzp-codec`. Verify by grepping the Cargo.toml file.
### Phase 4 — Telemetry and tooling updates
**Files touched:**
- `crates/wzp-proto/src/packet.rs`: `QualityReport` or equivalent telemetry message gains `dred_reconstructions: u32` as a new counter (frames reconstructed via DRED this reporting window) and `classical_plc_invocations: u32` (frames filled by Opus/Codec2 classical PLC). These are separate counters because they're different recovery mechanisms.
- `crates/wzp-relay/src/*` — relay telemetry pipeline surfaces both counters in Prometheus metrics: `wzp_dred_reconstructions_total{call_id}`, `wzp_classical_plc_total{call_id}`.
- `docs/grafana-dashboard.json` — new panel: "Loss recovery breakdown" stacked bar, DRED vs classical PLC vs clean decode, per call.
- `android/app/src/main/java/com/wzp/debug/DebugReporter.kt` — surfaces `dredReconstructions` and `classicalPlc` counts in the debug report; also logs active DRED duration and whether legacy-FEC mode is engaged.
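
The two counters and their Prometheus form might look like the sketch below. The struct and method names are hypothetical; only the field names and metric names come from the list above. The formatter assumes `call_id` contains no quotes or backslashes, since it performs no label escaping.

```rust
/// Per-reporting-window loss recovery counters (hypothetical container type).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct LossRecoveryCounters {
    /// Frames reconstructed via DRED.
    pub dred_reconstructions: u32,
    /// Frames filled by Opus/Codec2 classical PLC.
    pub classical_plc_invocations: u32,
}

impl LossRecoveryCounters {
    pub fn record_dred(&mut self, frames: u32) {
        self.dred_reconstructions = self.dred_reconstructions.saturating_add(frames);
    }

    pub fn record_classical_plc(&mut self, frames: u32) {
        self.classical_plc_invocations = self.classical_plc_invocations.saturating_add(frames);
    }

    /// Render both counters in Prometheus text exposition format.
    /// Assumes `call_id` needs no escaping.
    pub fn to_prometheus(&self, call_id: &str) -> String {
        format!(
            "wzp_dred_reconstructions_total{{call_id=\"{id}\"}} {dred}\n\
             wzp_classical_plc_total{{call_id=\"{id}\"}} {plc}\n",
            id = call_id,
            dred = self.dred_reconstructions,
            plc = self.classical_plc_invocations,
        )
    }
}
```

Keeping the counters separate in the struct is what lets the dashboard stack DRED and classical PLC as distinct series.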
**Acceptance criteria:**
- Grafana dashboard shows a clear visual distinction between DRED-recovered and classical-PLC-recovered frames across a test fleet of calls.
- Debug report includes the active protection mode ("DRED 200 ms" / "Legacy RaptorQ") and reconstruction counts, so incidents can be classified unambiguously.
### Phase 5 — Escape hatch removal (follow-up, ~2 months post-ship)
After 2 months of stable production with no rollbacks triggered:
- Delete `AUDIO_USE_LEGACY_FEC` handling in `opus_enc.rs` / `call.rs` / `pipeline.rs`
- Delete the Opus-tier paths of `wzp-fec` (the crate stays for Codec2)
- Delete the Android settings toggle and desktop CLI flag
- Remove the `--legacy-fec` path from smoke tests
## Critical files to modify (summary)
- `Cargo.toml` (workspace) — dep swap (audiopus → opusic-c + opusic-sys)
- `crates/wzp-codec/Cargo.toml` — dep swap + `bytemuck` for slice cast
- `crates/wzp-codec/src/opus_enc.rs` — opusic-c rewrite + DRED enable + inband FEC off
- `crates/wzp-codec/src/opus_dec.rs` — opusic-c rewrite
- `crates/wzp-codec/src/dred_ffi.rs` (**new file**): safe wrapper over opusic-sys raw DRED FFI
- `crates/wzp-codec/src/lib.rs` — expose `DredReconstructor` trait, `DredState`, `DredError`
- `crates/wzp-codec/src/adaptive.rs` — verify profile switch carries DRED duration
- `crates/wzp-client/src/call.rs` — Opus/Codec2 gate on RaptorQ path, loss floor, wire DredReconstructor into CallDecoder
- `crates/wzp-android/src/pipeline.rs` — same gate, same loss floor, wire DredReconstructor
- `crates/wzp-transport/src/jitter.rs` — lookahead/backfill refactor, DRED ring, reconstruction dispatch
- `crates/wzp-transport/Cargo.toml` — verify it depends only on `wzp-codec`, not directly on opusic-*
- `crates/wzp-proto/src/packet.rs` — new telemetry counters
- `crates/wzp-relay/` — Prometheus metric exposure
- `android/app/src/main/java/com/wzp/debug/DebugReporter.kt` — debug output
- `docs/grafana-dashboard.json` — loss-recovery panel
- (delete) `vendor/audiopus_sys/` on `feat/desktop-audio-rewrite` when merging back
## Existing utilities to reuse
- `wzp_codec::resample::Downsampler48to8` / `Upsampler8to48` — unchanged, only Codec2 path uses them
- `wzp_codec::adaptive::AdaptiveEncoder` / `AdaptiveDecoder` — existing profile-switching machinery, DRED duration changes ride along
- `wzp_codec::silence::SilenceDetector` / `ComfortNoise` — unchanged
- `wzp_codec::agc::AutoGainControl` — unchanged, runs before encode as today
- `wzp_fec::RaptorQFecEncoder` / decoder — unchanged, still used for Codec2 tiers
- `wzp_client::call::QualityAdapter` — unchanged; drives profile switching, which now also reconfigures DRED duration via the existing `set_profile` path
## Verification
End-to-end testing, in order:
1. **Unit**: `cargo test -p wzp-codec` — Opus encode/decode round-trip at every profile, DRED enabled. Verify `version()` reports libopus 1.5.2.
2. **Unit**: `cargo test -p wzp-transport` — jitter buffer lookahead/backfill behavior with injected loss patterns (0%, 5%, 15%, 30%, 50% loss; isolated losses, 40 ms bursts, 200 ms bursts, 500 ms bursts).
3. **Integration**: `crates/wzp-client/src/echo_test.rs` — existing echo test must pass on all Opus profiles with <5% perceived quality regression (measure via the time-window analysis already built into `echo_test.rs`).
4. **Integration**: `crates/wzp-client/src/drift_test.rs` — latency measurement. Must show ~40 ms reduction on Opus profiles versus pre-PRD baseline. Codec2 profiles unchanged.
5. **Manual**: Android release build, real call over bad wifi (or a shaped network via `tc netem` on Linux). Burst losses of 200 ms should be perceptually continuous speech, not robotic gaps.
6. **Manual**: Same call with `AUDIO_USE_LEGACY_FEC=1` — verify behavior reverts to current production behavior. This is the pre-ship rollback rehearsal.
7. **Cross-compile**: full build matrix — Android arm64-v8a + armeabi-v7a (via `scripts/build-and-notify.sh`), macOS universal, Linux x86_64 (via `scripts/build-linux-docker.sh`). Windows cross-compile via cargo-xwin should also pass — libopus 1.5 upstream fixed the clang-cl SIMD issue that required the vendor patch on `feat/desktop-audio-rewrite`.
8. **Telemetry smoke**: deploy to staging relay, make 10 test calls, verify Grafana's new "Loss recovery breakdown" panel shows DRED reconstruction events firing on injected loss and classical-PLC on packet-loss beyond DRED's window.
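
For the injected-loss steps above, a deterministic loss mask keeps test runs reproducible. The helper below is a hypothetical sketch, not part of the existing test tooling; it assumes 20 ms frames and places each burst in the middle of its period so every gap has packets on both sides.

```rust
/// Build a per-frame loss mask (`true` = drop the frame) with one burst of
/// `burst_frames` centered in every window of `period_frames`.
pub fn burst_loss_mask(total_frames: usize, burst_frames: usize, period_frames: usize) -> Vec<bool> {
    // Frames delivered before the burst starts within each period.
    let lead_in = period_frames.saturating_sub(burst_frames) / 2;
    (0..total_frames)
        .map(|i| {
            if period_frames == 0 {
                return false; // no periodic structure requested: drop nothing
            }
            let pos = i % period_frames;
            pos >= lead_in && pos < lead_in + burst_frames
        })
        .collect()
}

/// Fraction of frames marked as lost.
pub fn loss_rate(mask: &[bool]) -> f64 {
    if mask.is_empty() {
        return 0.0;
    }
    mask.iter().filter(|&&lost| lost).count() as f64 / mask.len() as f64
}
```

A 200 ms burst at a 10% overall rate corresponds to `burst_loss_mask(n, 10, 100)`: ten consecutive 20 ms frames dropped out of every hundred.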
## Risks and mitigations
- **Custom DRED FFI wrapper is WZP-maintained code with no second source.** opusic-c's decoder-side DRED wrapper is insufficient (see Background), so we carry our own `dred_ffi.rs` that calls `opus_dred_parse` and `opus_decoder_dred_decode` directly via opusic-sys. Bugs in this wrapper — offset arithmetic off-by-ones, lifetime errors on `OpusDRED` buffers, UB from misuse of the C API — could manifest as silent audio corruption on loss bursts, hard to diagnose. **Mitigation**: extensive unit tests in `dred_ffi.rs` using a reference encoder + reference decoder round-trip with known offsets; strict bounds checking on every `unsafe` boundary; Miri run in CI if feasible; the legacy-FEC escape hatch disables the entire DRED code path including our custom wrapper, giving us a single flag to revert any wrapper bug in production. Long-term: upstream the fixes to opusic-c (follow-up task, not blocking).
- **opusic-c's encoder-side API and internal Decoder pointer access**. Step 3a depends on being able to call opusic-sys raw functions that take an `*mut OpusDecoder` pointer while still using opusic-c's `Decoder` for normal decode. If opusic-c doesn't expose the raw pointer cleanly, we fall back to a thin opusic-sys-direct Decoder wrapper inside `dred_ffi.rs` and lose some of opusic-c's convenience. **Mitigation**: verify at the start of Phase 3 (one afternoon of reading opusic-c source). If the clean path doesn't work, the fallback is not difficult — it's what we'd have built anyway if opusic-c didn't exist.
- **DRED reconstruction quality varies by voice / content**. The neural model is trained on speech; edge cases (shouting, whispering, heavy accents, music-on-hold, cough, laughter) may reconstruct less cleanly than continuous speech. **Mitigation**: escape hatch ships from day one. If production telemetry shows perceptible quality regression on specific voice patterns, flip legacy mode for affected users while tuning. Also: classical Opus PLC remains as the third-tier fallback when DRED state is unavailable.
- **Removing RaptorQ removes bit-exact recovery**. Isolated single-packet losses are now reconstructed plausibly instead of bit-exactly. **Mitigation**: as argued in Background, bit-exactness on a single 20 ms speech frame is perceptually meaningless. The assumption is "speech is the workload" — if we ever add non-speech features (music bot, ringtones over the call path, DTMF-over-audio) we revisit.
- **libopus 1.5 DRED API stability**. **Verified at pre-flight**: opus.h in the upstream xiph/opus repository has no "experimental" marker on the DRED API declarations. The earlier characterization was incorrect. DRED shipped as a first-class feature in libopus 1.5.0 (Dec 2023) and has been iterated in 1.5.1 and 1.5.2. Google Meet and Duo ship it at scale. **Mitigation**: pin `opusic-sys` exactly (no `^` range) to ensure reproducible builds, follow upstream 1.5.x bugfixes as they land. No special stability concerns beyond normal dependency hygiene.
- **Jitter buffer refactor is the largest code change**. Jitter bugs are notoriously subtle (off-by-one on sequence wraparound, clock drift interactions, playout starvation corner cases). **Mitigation**: keep the classical-PLC path intact as the DRED fallback, so jitter bugs degrade to "current behavior" rather than "broken audio". Write targeted unit tests for the buffer at each loss-pattern scenario before touching production paths. Consider shipping Phase 3 behind a sub-flag separate from the main escape hatch, so we can independently toggle "DRED enabled but classical jitter buffer" for bisection.
- **Cross-compile surprises**. `opusic-sys` is actively maintained but our exact combination of Android NDK version / Docker builder environment / Windows cross-compile via cargo-xwin has not been tested by upstream. **Mitigation**: Phase 0 includes the full cross-compile matrix as an acceptance criterion. Any blockers surface before we touch loss-recovery behavior.
- **Wire-format compatibility during rollout**. Mixed-version calls (new sender + old receiver, or vice versa) need to keep working. **Verified at pre-flight**: traced both live receive paths (`wzp-client/src/call.rs::CallDecoder::ingest` and `wzp-android/src/engine.rs` the JNI-driven engine path), and both degrade gracefully: new-sender Opus packets with `fec_ratio_encoded=0` / `fec_block=0` / `fec_symbol=0` flow through to the jitter buffer and decode normally on old receivers. The RaptorQ decoder either ignores zero-FEC packets entirely (Android pipeline.rs gates on non-zero fec_block/fec_symbol) or accumulates them harmlessly until the 2-second staleness eviction (desktop call.rs). Old-sender packets with populated RaptorQ fields are handled by new receivers via the unchanged Codec2 path (new receivers keep wzp-fec for Codec2 tiers and simply ignore RaptorQ fields on Opus packets). **No wire format version bump required.**
- **Pre-existing desktop RaptorQ gap** (incidental finding, NOT caused by this PRD). The desktop `wzp-client/src/call.rs::CallDecoder` feeds packets into `fec_dec.add_symbol` but **never calls `fec_dec.try_decode`** — RaptorQ recovery is effectively dead code on the desktop path today. Main decode reads from the jitter buffer directly, falling through to classical Opus PLC on missing packets. The Android `engine.rs` path properly uses `try_decode` for recovery. This PRD does not fix the desktop gap — it's unrelated — but is noted here so nobody is surprised that removing RaptorQ from Opus tiers on the desktop client causes no measurable recovery regression (there was nothing to lose). Recommend filing a follow-up task to either fix or remove the vestigial desktop RaptorQ wiring independently of this work.
- **`AUDIO_USE_LEGACY_FEC` itself becoming permanent tech debt**. Escape hatches have a way of outliving their intended lifespan. **Mitigation**: put an explicit removal date in a `// TODO(2026-06-15): remove legacy FEC path` comment at the flag-handling site. Track in taskmaster.
## Open questions
- ~~**Does opusic-c expose `opusic_c::Decoder`'s raw inner pointer?**~~ **Resolved at pre-flight**: no, it's `pub(crate)`. We build a unified `DecoderHandle` over raw opusic-sys in `dred_ffi.rs` and use it for both normal decode and DRED reconstruction. Opusic-c is used only for the encoder side.
- **Exact opusic-sys symbol name for DRED decoder allocation**. opus.h documents the `OpusDREDDecoder` type and `opus_dred_parse`/`opus_decoder_dred_decode` functions, but the allocation function name is not in the fetched snippet. Expected to be `opus_dred_decoder_create` / `opus_dred_decoder_destroy` per libopus naming convention, but confirm at the very start of Phase 3a by reading the actual opusic-sys bindings. If the function is not exported by opusic-sys, we file a PR upstream to opusic-sys (small fix, trivially mergeable) and temporarily vendor the function declaration locally.
- **Should the 5% loss floor be configurable per profile?** Currently specified as a constant. A future refinement might make it higher at degraded tiers and lower at studio tiers, but without real telemetry we don't know if the constant is wrong. Keep as a constant for now, revisit after 1 month of production data.
- **OSCE enable**: opusic-c has an `osce` feature flag for Opus Speech Coding Enhancement, a separate libopus 1.5 neural post-processor. Out of scope for this PRD but should be the next audio-quality follow-up. Probably one-line enable once opusic-c is in.
- **Upstream PR to opusic-c**: our own `dred_ffi.rs` wrapper should be proven in production first, then the fixes upstreamed to `opusic-c/src/dred.rs` (preserve `dred_end`, fix `dred_offset` double-pass, expose `DredPacket` externally). Follow-up task, not blocking this PRD.
- **`feat/desktop-audio-rewrite` merge**: the vendored `audiopus_sys` patch on that branch becomes obsolete under this PRD. Coordinate removal with whoever owns that branch.
## Phase A: Continuous DRED Tuning (Implemented 2026-04-12)
Phase A extends the discrete tier-locked DRED durations from Phases 1-3 with continuous, network-driven tuning.
### What was built
- **`DredTuner`** (`crates/wzp-proto/src/dred_tuner.rs`): Maps `(loss_pct, rtt_ms, jitter_ms)` → `(dred_frames, expected_loss_pct)` continuously
- **Quinn stats exposure** (`crates/wzp-transport/src/quic.rs`): `QuinnPathSnapshot` provides quinn's internal RTT, loss, congestion events — more accurate than sequence-gap heuristics
- **Jitter variance window** (`crates/wzp-transport/src/path_monitor.rs`): 10-sample sliding window for RTT standard deviation, used for spike detection
- **`AudioEncoder` trait extensions** (`crates/wzp-proto/src/traits.rs`): `set_expected_loss()` and `set_dred_duration()` with default no-op, overridden by `OpusEncoder` and `AdaptiveEncoder`
- **Engine integration** (`desktop/src-tauri/src/engine.rs`): Both Android and desktop send tasks poll every 25 frames and apply tuning
### Opus6k DRED extended
`dred_duration_for(Opus6k)` changed from 50 (500ms) to 104 (1040ms) — the maximum libopus 1.5 supports. The RDO-VAE's quality-vs-offset curve makes this nearly free in bitrate terms while doubling burst resilience on the worst links.
### Jitter spike detection ("Sawtooth" prediction)
When instantaneous jitter exceeds the EWMA × 1.3 (asymmetric: fast-up α=0.3, slow-down α=0.05), the tuner enters spike-boost mode:
- DRED immediately jumps to the codec tier's ceiling
- Cooldown: 10 cycles (~5 seconds at 25 packets/cycle with 20 ms frames)
- Designed for Starlink satellite handover sawtooth jitter pattern
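The spike rule above can be sketched as a small state machine. This is a minimal illustration, not the shipped `DredTuner` code: the type name `SpikeDetector`, its fields, and the choice to compare each sample against the EWMA before folding it in are all assumptions. Only the constants (1.3 threshold, α=0.3 up, α=0.05 down, 10-cycle cooldown) come from the text.

```rust
/// Hypothetical spike detector; names and structure are illustrative only.
pub struct SpikeDetector {
    ewma_ms: f64,
    cooldown_left: u32,
}

const SPIKE_FACTOR: f64 = 1.3; // instantaneous jitter must exceed EWMA x 1.3
const ALPHA_UP: f64 = 0.3; // fast-up smoothing factor
const ALPHA_DOWN: f64 = 0.05; // slow-down smoothing factor
const COOLDOWN_CYCLES: u32 = 10;

impl SpikeDetector {
    pub fn new(initial_jitter_ms: f64) -> Self {
        Self { ewma_ms: initial_jitter_ms, cooldown_left: 0 }
    }

    /// Feed one jitter sample per tuning cycle.
    /// Returns true while spike-boost mode is active.
    pub fn update(&mut self, jitter_ms: f64) -> bool {
        // Assumption: the sample is compared against the EWMA before the update.
        let spike = jitter_ms > self.ewma_ms * SPIKE_FACTOR;
        // Asymmetric smoothing: rise quickly, decay slowly.
        let alpha = if jitter_ms > self.ewma_ms { ALPHA_UP } else { ALPHA_DOWN };
        self.ewma_ms += alpha * (jitter_ms - self.ewma_ms);
        if spike {
            // A new spike restarts the cooldown window.
            self.cooldown_left = COOLDOWN_CYCLES;
        } else if self.cooldown_left > 0 {
            self.cooldown_left -= 1;
        }
        self.cooldown_left > 0
    }
}
```

While `update()` returns true, the caller would pin DRED to the codec tier's ceiling; once it returns false, continuous tuning resumes.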
### Test coverage
- 10 unit tests for tuner math (baseline, scaling, spike, cooldown, codec switch, Codec2 no-op)
- 4 integration tests (encoder adjustment, spike boost, Codec2 no-op, profile switch with encode verification)
### Opus6k Frame Starvation Bug (Fixed 2026-04-13)
During testing of the extended 1040ms DRED window on Opus6k, the 40ms codec produced only ~11 frames/s instead of 25 — making audio choppy regardless of DRED quality.
**Root cause:** The Android capture ring read loop did partial reads that consumed samples from the ring but discarded them when retrying:
1. Ring has 960 samples (one Oboe burst)
2. `audio_read_capture(&mut buf[..1920])` reads 960 into `buf[0..960]`, returns 960
3. Loop sees 960 < 1920, sleeps, retries from `buf[0..]` → overwrites the consumed samples
4. ~50% of captured audio thrown away per frame
**Fix:** Added `wzp_native_audio_capture_available()` to check ring fill level before reading (same pattern as the desktop CPAL path's `capture_ring.available()`). Also made `frame_samples` mutable so codec switches update the read size.
**Affected codecs:** Only 40ms frame codecs (Opus6k, Codec2_1200). 20ms codecs (Opus24k, etc.) were unaffected because a single Oboe burst fills the entire request.
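The "check fill level before reading" pattern can be sketched with a toy ring. This is only an illustration under stated assumptions: `CaptureRing` and `read_frame` are hypothetical stand-ins (the real ring lives in wzp-native and is accessed through `wzp_native_audio_capture_available()`), and a `VecDeque` replaces the lock-free ring for brevity.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the capture ring.
pub struct CaptureRing {
    samples: VecDeque<i16>,
}

impl CaptureRing {
    pub fn new() -> Self {
        Self { samples: VecDeque::new() }
    }

    /// Producer side: append one capture burst.
    pub fn push(&mut self, burst: &[i16]) {
        self.samples.extend(burst.iter().copied());
    }

    /// Number of samples currently buffered.
    pub fn available(&self) -> usize {
        self.samples.len()
    }

    /// Fills `out` only when a whole frame is buffered.
    /// On a short ring nothing is consumed, so no samples are lost on retry.
    pub fn read_frame(&mut self, out: &mut [i16]) -> bool {
        let n = out.len();
        if self.available() < n {
            return false;
        }
        for (dst, src) in out.iter_mut().zip(self.samples.drain(..n)) {
            *dst = src;
        }
        true
    }
}
```

With a 1920-sample frame and 960-sample bursts, the first `read_frame` call returns false and leaves the burst in place; the second burst completes the frame.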


@@ -0,0 +1,145 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Engine.rs Deduplication — Extract Shared Send/Recv Helpers
## Problem
`desktop/src-tauri/src/engine.rs` is 1,705 lines with two nearly identical `CallEngine::start()` implementations — one for Android (880 lines) and one for desktop (430 lines). ~350 lines are copy-pasted between them. Every change to the encode/decode/adaptive-quality pipeline requires editing both places, and they've already diverged in subtle ways (Android has extensive first-join diagnostics that desktop lacks).
## Scope
Extract the duplicated logic into shared helper functions. The Android and desktop paths should only differ in their audio I/O mechanism (Oboe ring via wzp-native vs CPAL capture_ring/playout_ring).
## What's Duplicated
| Block | Description | Lines (each) |
|-------|-------------|------|
| `build_call_config()` | Resolve quality string → CallConfig | 23 |
| Codec-to-profile match | Map CodecId → QualityProfile for decoder switch | 19 |
| Adaptive quality switch | Read AtomicU8, index_to_profile, set_profile, update frame_samples + dred_tuner | 15 |
| DRED tuner poll | Check frame counter, poll quinn stats, apply tuning | 15 |
| Quality report ingestion | Extract quality_report, feed to AdaptiveQualityController, store to AtomicU8 | 8 |
| Signal task | Accept signals, handle RoomUpdate/QualityDirective/Hangup | 48 |
| **Total** | | **~128 lines × 2 copies = ~256 duplicated lines** |
## Implementation
### Phase 1: Top-Level Helper Functions
```rust
fn build_call_config(quality: &str) -> CallConfig {
    let profile = resolve_quality(quality);
    match profile {
        Some(p) => CallConfig {
            noise_suppression: false,
            suppression_enabled: false,
            ..CallConfig::from_profile(p)
        },
        None => CallConfig {
            noise_suppression: false,
            suppression_enabled: false,
            ..CallConfig::default()
        },
    }
}

fn codec_to_profile(codec: CodecId) -> QualityProfile {
    match codec {
        CodecId::Opus24k => QualityProfile::GOOD,
        CodecId::Opus6k => QualityProfile::DEGRADED,
        CodecId::Opus32k => QualityProfile::STUDIO_32K,
        CodecId::Opus48k => QualityProfile::STUDIO_48K,
        CodecId::Opus64k => QualityProfile::STUDIO_64K,
        CodecId::Codec2_1200 => QualityProfile::CATASTROPHIC,
        CodecId::Codec2_3200 => QualityProfile {
            codec: CodecId::Codec2_3200,
            fec_ratio: 0.5,
            frame_duration_ms: 20,
            frames_per_block: 5,
        },
        other => QualityProfile { codec: other, ..QualityProfile::GOOD },
    }
}

fn check_adaptive_switch(
    pending: &AtomicU8,
    encoder: &mut CallEncoder,
    tuner: &mut wzp_proto::DredTuner,
    frame_samples: &mut usize,
    tx_codec: &tokio::sync::Mutex<String>,
) -> bool {
    let p = pending.swap(PROFILE_NO_CHANGE, Ordering::Acquire);
    if p == PROFILE_NO_CHANGE { return false; }
    if let Some(new_profile) = index_to_profile(p) {
        let new_fs = (new_profile.frame_duration_ms as usize) * 48;
        if encoder.set_profile(new_profile).is_ok() {
            *frame_samples = new_fs;
            tuner.set_codec(new_profile.codec);
            // Caller updates tx_codec display string
            return true;
        }
    }
    false
}
```
### Phase 2: Shared Signal Task
Extract the signal task into a standalone async function:
```rust
async fn run_signal_task(
    transport: Arc<wzp_transport::QuinnTransport>,
    running: Arc<AtomicBool>,
    pending_profile: Arc<AtomicU8>,
    participants: Arc<Mutex<Vec<ParticipantInfo>>>,
) {
    loop {
        if !running.load(Ordering::Relaxed) { break; }
        match tokio::time::timeout(
            Duration::from_millis(SIGNAL_TIMEOUT_MS),
            transport.recv_signal(),
        ).await {
            Ok(Ok(Some(msg))) => {
                // Handle RoomUpdate, QualityDirective, Hangup...
            }
            _ => {}
        }
    }
}
```
### Phase 3: Shared DRED Poll + Quality Ingestion
These are small blocks but appear in both send and recv tasks. Extract as inline helpers or closures.
## Verification
1. `cargo check --workspace` — must compile
2. `cargo test -p wzp-proto -p wzp-relay -p wzp-client --lib` — must pass
3. Manual test: place a call Android↔Desktop, verify audio works in both directions
4. Verify adaptive quality still switches (set one side to auto, degrade network)
## Effort
- Phase 1: 1 hour (extract 3 functions, update 6 call sites)
- Phase 2: 30 min (extract signal task, update 2 spawn sites)
- Phase 3: 30 min (cleanup remaining small duplicates)
- Total: ~2 hours
## Not In Scope
- Audio I/O trait abstraction (Oboe vs CPAL) — different project, different risk profile
- Moving Android-specific diagnostics (first-join, PCM recorder) into a feature flag
- Splitting engine.rs into multiple files
## Implementation Status (2026-04-13)
All phases implemented:
- build_call_config(): shared CallConfig construction — DONE
- codec_to_profile(): shared CodecId → QualityProfile mapping — DONE
- run_signal_task(): shared signal handler — DONE
- Net reduction: ~39 lines, 6 duplicated blocks → single-line calls

225
vault/PRDs/PRD-hard-nat.md Normal file

@@ -0,0 +1,225 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Hard NAT Traversal (Port Prediction + Birthday Attack)
> Phase: Partial implementation
> Status: Phase A done, Phase B signal ready, C-D not started (2026-04-14)
> Crate: wzp-client, wzp-proto, wzp-relay
## Problem
When both peers are behind **symmetric NATs** (endpoint-dependent mapping), standard hole-punching fails because the external port changes per destination. Our Phase 8.2 port mapping (NAT-PMP/PCP/UPnP) solves this when the router supports it (~70% of consumer routers), but the remaining ~30% — plus corporate firewalls, cloud NATs (AWS/Azure), and carrier-grade NATs — fall back to relay.
Tailscale tackles this with two techniques:
1. **Port prediction** for NATs with sequential allocation patterns
2. **Birthday attack** for NATs with random allocation
Both are viable when **at least one peer has a predictable NAT** (easy+hard pair). When **both** peers have fully random symmetric NATs, even Tailscale falls back to relay.
## Background: How Symmetric NATs Allocate Ports
| Pattern | Behavior | Prevalence | Traversal |
|---------|----------|------------|-----------|
| **Sequential** | port N, N+1, N+2... per new flow | ~40% of symmetric NATs (home routers) | Port prediction viable |
| **Random** | truly random port per flow | ~50% (enterprise, cloud, CGNAT) | Birthday attack only |
| **Port-preserving** | same as source port when possible | ~10% (behaves like cone NAT) | Standard hole-punch works |
## Solution Overview
### Phase A: NAT Port Allocation Pattern Detection
Before attempting hard NAT traversal, detect whether the NAT allocates ports sequentially or randomly. This determines which strategy to use.
**Method**: Send 5 STUN Binding Requests from the same source socket to 5 different STUN servers. Collect the 5 observed external ports. Analyze:
```
Ports: [40001, 40002, 40003, 40004, 40005] → Sequential (delta=1)
Ports: [40001, 40003, 40005, 40007, 40009] → Sequential (delta=2)
Ports: [40001, 52847, 19432, 61203, 8847] → Random
Ports: [4433, 4433, 4433, 4433, 4433] → Port-preserving (cone-like)
```
Classification:
- All same port → `PortPreserving` (use standard hole-punch)
- Consistent delta between consecutive ports → `Sequential { delta: i16 }`
- No pattern → `Random`
**New enum**:
```rust
pub enum PortAllocation {
PortPreserving,
Sequential { delta: i16 },
Random,
Unknown,
}
```
Add to `NetcheckReport` and `NatDetection`.
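The classification rules can be sketched as follows. This is an illustrative classifier, not the shipped `classify_port_allocation()`: the function name `classify_ports_sketch`, the three-sample minimum, and the requirement that every consecutive delta match exactly are assumptions (a production classifier would probably tolerate a skipped port caused by unrelated traffic).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PortAllocation {
    PortPreserving,
    Sequential { delta: i16 },
    Random,
    Unknown,
}

/// Illustrative classifier over observed external ports, in probe order.
pub fn classify_ports_sketch(ports: &[u16]) -> PortAllocation {
    // Assumption: fewer than three samples cannot establish a pattern.
    if ports.len() < 3 {
        return PortAllocation::Unknown;
    }
    // Wrapping difference reinterpreted as i16, so a sequence that wraps
    // past 65535 still yields a small signed delta.
    let first = ports[1].wrapping_sub(ports[0]) as i16;
    let consistent = ports
        .windows(2)
        .all(|w| w[1].wrapping_sub(w[0]) as i16 == first);
    match (consistent, first) {
        (true, 0) => PortAllocation::PortPreserving,
        (true, delta) => PortAllocation::Sequential { delta },
        (false, _) => PortAllocation::Random,
    }
}
```

Feeding it the four example sequences listed above yields `Sequential { delta: 1 }`, `Sequential { delta: 2 }`, `Random`, and `PortPreserving` respectively.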
### Phase B: Port Prediction (Sequential NATs)
When the NAT is sequential, we can **predict** the next external port:
1. Client sends a STUN probe → observes external port P
2. Client knows the NAT will assign P+delta for the next outbound flow
3. Client tells peer (via relay or chat): "dial me at `my_ip:(P + delta * N)`" where N is the number of flows the client will open before the peer's packet arrives
4. Client opens a QUIC connection to the peer's predicted port at the same time
5. If the prediction lands within a small window, the QUIC handshake succeeds
**Timing is critical**: both peers must probe, predict, and dial within a tight window (~500ms) so the port prediction doesn't drift.
**Coordination via relay** (or out-of-band chat):
```
SignalMessage::HardNatProbe {
call_id: String,
/// My observed port sequence (last 3 ports, most recent first)
port_sequence: Vec<u16>,
/// My detected allocation pattern
allocation: PortAllocation,
/// Timestamp (ms since epoch) — for synchronization
probe_time_ms: u64,
/// My external IP (from STUN)
external_ip: String,
}
```
Both peers exchange `HardNatProbe`, then simultaneously:
1. Each predicts the other's next port: `peer_ip:(peer_last_port + peer_delta * offset)`
2. Each opens N parallel QUIC connections to predicted port range: `[predicted - 2, predicted + 2]`
3. First successful handshake wins
**Expected success rate**: ~80% for sequential NATs with consistent delta, within 2-3 seconds.
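The prediction window `[predicted - 2, predicted + 2]` can be computed as in the sketch below. The function name `predicted_port_window` and its parameters are hypothetical (the real helper is `predict_ports()`, whose signature is not shown here), and dropping candidates that fall outside 1..=65535 rather than wrapping them is an assumption.

```rust
/// Candidate ports to dial: `last_port + delta * offset`, plus or minus `radius`.
pub fn predicted_port_window(last_port: u16, delta: i16, offset: u16, radius: u16) -> Vec<u16> {
    // Use i64 so no combination of inputs can overflow.
    let center = i64::from(last_port) + i64::from(delta) * i64::from(offset);
    let radius = i64::from(radius);
    ((center - radius)..=(center + radius))
        // Assumption: out-of-range candidates are dropped, not wrapped.
        .filter(|p| (1..=i64::from(u16::MAX)).contains(p))
        .map(|p| p as u16)
        .collect()
}
```

For a peer last seen on port 40003 with delta 1, an offset of 2 and radius of 2 gives five candidates, 40003 through 40007, matching the five parallel QUIC connection attempts described above.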
### Phase C: Birthday Attack (Random NATs)
When the NAT is random, port prediction is impossible. Instead, exploit the **birthday paradox**:
**Math**: With N ports open on side A and M probes from side B into a 65536-port space:
- N=256, M=256: P(collision) ≈ 1 - e^(-256*256/65536) ≈ 63%
- N=256, M=512: P(collision) ≈ 1 - e^(-256*512/65536) ≈ 87%
- N=256, M=1024: P(collision) ≈ 1 - e^(-256*1024/65536) ≈ 98%
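The approximation used in these figures is easy to check numerically. The helper below is a sketch with a hypothetical name; it simply evaluates 1 - e^(-N·M/S) for N open ports, M probes, and a port space of S.

```rust
/// Approximate probability that at least one of `probes` random probes
/// hits one of `open_ports` random ports in a space of `port_space` ports.
pub fn collision_probability(open_ports: u32, probes: u32, port_space: u32) -> f64 {
    let exponent = (f64::from(open_ports) * f64::from(probes)) / f64::from(port_space);
    1.0 - (-exponent).exp()
}
```

Calling it with `(256, 256, 65536)`, `(256, 512, 65536)`, and `(256, 1024, 65536)` reproduces the three percentages listed above to within rounding.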
**Implementation**:
1. **Acceptor side** (easy NAT or the side with more ports available):
- Open 256 UDP sockets bound to random ports
- For each socket, send one STUN probe to learn its external port
- Report all 256 external ports to the peer
2. **Dialer side** (hard NAT):
- Send 1024 QUIC Initial packets to random ports on the Acceptor's external IP
- Rate: 100-200 packets/sec to avoid triggering rate limits
- Duration: ~5-10 seconds
3. **Collision detection**:
- When one of the Dialer's packets hits one of the Acceptor's open ports, the QUIC handshake begins
- The Acceptor sees an incoming Initial on one of its 256 sockets
**Problem for VoIP**: This takes 5-10 seconds even at high probe rates. For a phone call, this means a long "connecting..." phase. Acceptable as a last resort before relay fallback.
### Phase D: Hybrid Strategy
Combine all techniques in a waterfall:
```
1. Port mapping (NAT-PMP/PCP/UPnP) → <100ms [Phase 8.2, done]
↓ failed
2. Standard hole-punch (cone NAT) → <500ms [Phase 3-6, done]
↓ failed (symmetric NAT detected)
3. Port prediction (sequential NAT) → <2s [Phase A+B, new]
↓ failed (random NAT detected)
4. Birthday attack (one side random) → <10s [Phase C, new]
↓ failed (both sides random)
5. Relay fallback → always [Phase 1, done]
```
The relay path starts **immediately in parallel** with all direct attempts (existing 500ms head-start architecture). The user hears audio via relay while the harder traversal techniques probe in the background. If a direct path is found, the call seamlessly upgrades (using the Phase 8.3 transport hot-swap mechanism).
## QUIC-Specific Challenges
### 1. Connection ID Mismatch
QUIC's Initial packet contains a random Destination Connection ID. When birthday-attack probes land on the Acceptor's socket, the CID won't match any expected value. Quinn handles this via its `Endpoint` which accepts any incoming Initial — but we need to ensure the Endpoint is in server mode on all 256 ports.
**Solution**: Use quinn's `Endpoint` with a server config on each socket. Quinn's accept logic handles unknown CIDs correctly.
### 2. Probe Packet Format
Birthday attack probes must be valid QUIC Initial packets (not raw UDP). Quinn's `Endpoint::connect()` sends a proper Initial, so each probe is a real connection attempt. Failed probes time out naturally.
### 3. Stateful Connections
Unlike WireGuard (stateless), each QUIC probe creates connection state. With 1024 probes, that's 1024 half-open connections. Must aggressively abort losers once one succeeds.
**Solution**: Use `JoinSet` (existing pattern in `dual_path.rs`) and `abort_all()` on first success.
### 4. NAT Pinhole Lifetime
QUIC Initial retransmission timer (1s default) may exceed the NAT pinhole lifetime on aggressive NATs. One probe per port may not be enough.
**Solution**: Send 2-3 Initials per predicted port, 200ms apart.
## Signal Protocol
New variants:
```rust
/// Hard NAT probe coordination — exchanged before birthday attack.
HardNatProbe {
call_id: String,
/// Last 5 observed external ports (most recent first).
port_sequence: Vec<u16>,
/// Detected allocation pattern.
allocation: String, // "sequential:1", "sequential:2", "random", "preserving"
/// Probe timestamp for synchronization (ms since epoch).
probe_time_ms: u64,
/// External IP from STUN.
external_ip: String,
}
/// Hard NAT birthday attack coordination.
HardNatBirthdayStart {
call_id: String,
/// Number of ports opened by the acceptor side.
acceptor_port_count: u16,
/// External ports the acceptor has open (for targeted probing).
/// Only sent if port_count is small enough to enumerate.
acceptor_ports: Vec<u16>,
/// "start probing now" timestamp.
start_at_ms: u64,
}
```
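Because `allocation` travels as a string in this variant, both sides need a stable mapping to and from `PortAllocation`. The sketch below shows one way to do it; the function names are hypothetical, the `"unknown"` token is an assumption (only `"sequential:N"`, `"random"`, and `"preserving"` are listed above), and unparseable input is mapped to `Unknown` rather than treated as an error.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PortAllocation {
    PortPreserving,
    Sequential { delta: i16 },
    Random,
    Unknown,
}

/// Encode the pattern in the string form carried by `HardNatProbe`.
pub fn allocation_to_wire(a: PortAllocation) -> String {
    match a {
        PortAllocation::PortPreserving => "preserving".to_string(),
        PortAllocation::Sequential { delta } => format!("sequential:{delta}"),
        PortAllocation::Random => "random".to_string(),
        // Assumed token; the PRD does not list a string for Unknown.
        PortAllocation::Unknown => "unknown".to_string(),
    }
}

/// Decode the wire string; anything unrecognized becomes `Unknown`.
pub fn allocation_from_wire(s: &str) -> PortAllocation {
    match s {
        "preserving" => PortAllocation::PortPreserving,
        "random" => PortAllocation::Random,
        other => other
            .strip_prefix("sequential:")
            .and_then(|d| d.parse::<i16>().ok())
            .map(|delta| PortAllocation::Sequential { delta })
            .unwrap_or(PortAllocation::Unknown),
    }
}
```

Falling back to `Unknown` keeps an older peer from failing the whole probe exchange if a newer peer introduces another pattern name.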
## Integration with Existing Architecture
- **Netcheck**: `NetcheckReport` gains `port_allocation: PortAllocation` field
- **IceAgent**: `gather()` includes port allocation detection; `re_gather()` re-probes on network change
- **dual_path**: `race()` extended with hard-NAT probe phase between standard hole-punch timeout and relay commitment
- **Desktop**: `place_call` / `answer_call` exchange `HardNatProbe` when both sides report `SymmetricPort` NAT type
## Effort Estimate
| Phase | Scope | Effort | Status |
|-------|-------|--------|--------|
| A | Port allocation pattern detection | 1 day | **Done**: `PortAllocation` enum, `detect_port_allocation()`, `classify_port_allocation()`, `predict_ports()`, 17 tests |
| B | Sequential port prediction + coordination | 2 days | **Signal ready**: `HardNatProbe` signal + relay forwarding done. `dual_path::race()` integration pending |
| C | Birthday attack (256 sockets + 1024 probes) | 3 days | Not started |
| D | Hybrid waterfall + background upgrade | 2 days | Not started |
**Total**: ~8 days. Phase A is done and feeds into netcheck. Phase B has signal plumbing complete — needs `dual_path::race()` integration to actually dial predicted ports. Phase C (birthday) is the most complex and lowest ROI.
## Success Criteria
- Port allocation detection correctly classifies sequential vs random on test routers
- Sequential port prediction achieves >70% direct connection rate on sequential-NAT routers
- Birthday attack achieves >90% within 10 seconds when one peer has cone NAT
- Relay-to-direct upgrade is seamless (no audio gap) via Phase 8.3 transport hot-swap
- No regression in call setup time for cone-NAT pairs (the common case)
## References
- [Tailscale: How NAT traversal works](https://tailscale.com/blog/how-nat-traversal-works)
- [Tailscale: NAT traversal improvements pt.1](https://tailscale.com/blog/nat-traversal-improvements-pt-1)
- [Tailscale: NAT traversal improvements pt.2 — cloud environments](https://tailscale.com/blog/nat-traversal-improvements-pt-2-cloud-environments)
- RFC 4787: NAT Behavioral Requirements for Unicast UDP
- RFC 5245: ICE (Interactive Connectivity Establishment)
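- Birthday problem: the classic single-set form is P(collision) ≈ 1 - e^(-n²/2m) for n samples in a space of size m; the two-sided form used in Phase C is P(collision) ≈ 1 - e^(-N·M/m) for N open ports and M probes in a port space of size m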


@@ -0,0 +1,121 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Mid-Call ICE Re-Gathering
> Phase: Implemented (signal plane); transport hot-swap deferred
> Status: Partial (2026-04-14)
> Crate: wzp-client, wzp-proto, wzp-relay
## Problem
When a mobile device transitions between networks (WiFi -> cellular, IP address change), the active QUIC connection dies. The call stays on a dead path until timeout, then the user experiences silence. There is no mechanism to re-discover candidates and re-establish a direct path mid-call.
Android's `NetworkMonitor.onIpChanged` already fires on `onLinkPropertiesChanged`, but nothing consumes it for candidate re-gathering or path migration.
## Solution
Implement an `IceAgent` that manages the full candidate lifecycle — initial gathering, mid-call re-gathering on network change, and peer candidate application. A new `CandidateUpdate` signal message carries refreshed candidates to the peer through the relay.
## Implementation
### New Module: `crates/wzp-client/src/ice_agent.rs`
**IceAgent struct**:
- Owns `IceAgentConfig` (STUN config, portmap toggle, gather timeout, local ports)
- Monotonic `generation: AtomicU32` — incremented on each re-gather, peers reject stale updates
- `peer_generation: AtomicU32` — tracks last-seen peer generation for ordering
**Public API**:
- `gather()` -> `CandidateSet` — runs STUN + portmap + host candidates in parallel with timeout
- `re_gather()` -> `(CandidateSet, SignalMessage)` — increments generation, returns update to send
- `apply_peer_update(signal)` -> `Option<PeerCandidates>` — parses `CandidateUpdate`, rejects if generation <= last-seen
**CandidateSet**:
```rust
pub struct CandidateSet {
pub reflexive: Option<SocketAddr>,
pub local: Vec<SocketAddr>,
pub mapped: Option<SocketAddr>,
pub generation: u32,
}
```
### New Signal: `CandidateUpdate`
```rust
CandidateUpdate {
call_id: String,
reflexive_addr: Option<String>,
local_addrs: Vec<String>,
mapped_addr: Option<String>,
generation: u32,
}
```
- All address fields use `#[serde(default, skip_serializing_if)]` for backward compat
- Generation counter is mandatory — prevents stale updates from network reordering
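The stale-update rejection can be expressed with a single atomic operation. The sketch below is hypothetical (the type `PeerGenerationGate` does not appear in the codebase as described) and assumes generations start at 1, so an initial value of 0 means "nothing seen yet".

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical wrapper around the last-seen peer generation.
pub struct PeerGenerationGate {
    last_seen: AtomicU32,
}

impl PeerGenerationGate {
    pub fn new() -> Self {
        // Assumption: 0 is never a valid generation on the wire.
        Self { last_seen: AtomicU32::new(0) }
    }

    /// Returns true only if `generation` is strictly newer than every
    /// generation accepted so far; otherwise the update is stale.
    pub fn accept(&self, generation: u32) -> bool {
        // fetch_max stores max(old, new) and returns the old value atomically,
        // so two racing updates cannot both win with the same generation.
        let previous = self.last_seen.fetch_max(generation, Ordering::AcqRel);
        generation > previous
    }
}
```

An update with generation 2 that arrives after generation 3 is rejected, which is the reordering case the counter exists to handle.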
### Relay Forwarding
`CandidateUpdate` is forwarded to the call peer using the same pattern as `MediaPathReport`:
1. Look up peer fingerprint + `peer_relay_fp` from `CallRegistry`
2. If cross-relay: wrap in `FederatedSignalForward` and forward via federation link
3. If local: send via `signal_hub.send_to()`
### Desktop Handling
Signal recv loop handles `CandidateUpdate`:
- Logs generation, reflexive, mapped, local count
- Emits `recv:CandidateUpdate` debug event
- Emits `signal-event` type `candidate_update` to JS frontend
- TODO: wire into `IceAgent.apply_peer_update()` + `race_upgrade()` for transport hot-swap
### Deferred: Transport Hot-Swap
The actual mid-call transport replacement is not yet wired. The designed approach:
- `Arc<RwLock<Arc<QuinnTransport>>>` — send/recv tasks clone inner Arc per frame
- On upgrade, swap inner Arc under write lock — next frame picks up new transport
- Android: `pending_ice_regather: AtomicBool` polled in recv task, triggers re-gather + swap
- Requires live testing to validate seamless audio continuity during swap
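The swap pattern can be sketched generically. This is an illustration under assumptions: `SwappableTransport` is a hypothetical name, `T` stands in for `QuinnTransport`, and `std::sync::RwLock` is used here although the real code may prefer an async-aware lock.

```rust
use std::sync::{Arc, RwLock};

/// Holder whose inner transport can be replaced while tasks keep running.
pub struct SwappableTransport<T> {
    inner: RwLock<Arc<T>>,
}

impl<T> SwappableTransport<T> {
    pub fn new(initial: T) -> Self {
        Self { inner: RwLock::new(Arc::new(initial)) }
    }

    /// Called once per frame by send/recv tasks. The read lock is held only
    /// long enough to clone the Arc; I/O then proceeds without any lock.
    pub fn current(&self) -> Arc<T> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&guard)
    }

    /// Installs the new transport and returns the old one, which stays alive
    /// until in-flight frames drop their clones.
    pub fn swap(&self, next: T) -> Arc<T> {
        let mut guard = self.inner.write().unwrap();
        std::mem::replace(&mut *guard, Arc::new(next))
    }
}
```

A frame that already cloned the old transport finishes on it, while the next call to `current()` sees the new one, so no frame observes a half-swapped state.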
## Signal Flow
```
Network change (WiFi -> cellular)
|
v
IceAgent::re_gather()
|-- stun::discover_reflexive()
|-- portmap::acquire_port_mapping()
|-- local_host_candidates()
|
v
SignalMessage::CandidateUpdate { generation: N+1 }
|
v (via relay)
Peer IceAgent::apply_peer_update()
|
v
PeerCandidates { reflexive, local, mapped }
|
v
dual_path::race() with new candidates [NOT YET WIRED]
```
## Files
| File | Change |
|------|--------|
| `crates/wzp-client/src/ice_agent.rs` | New — IceAgent + CandidateSet |
| `crates/wzp-proto/src/packet.rs` | `CandidateUpdate` variant |
| `crates/wzp-relay/src/main.rs` | Forward `CandidateUpdate` to peer |
| `crates/wzp-client/src/featherchat.rs` | Map `CandidateUpdate` to `IceCandidate` type |
| `desktop/src-tauri/src/lib.rs` | Handle `CandidateUpdate` in signal recv loop |
## Testing
- 10 unit tests: generation monotonicity, apply_peer_update (all fields, empty fields, unparseable addrs, stale rejection, wrong signal type), default config, gather with no STUN, re_gather produces signal with incrementing generation
- 2 protocol roundtrip tests: CandidateUpdate full + minimal


@@ -0,0 +1,146 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Local Recording + Cloud Mixer for Podcast-Quality Interviews
## Problem
WarzonePhone delivers real-time encrypted voice, but the audio quality is limited by network conditions (codec compression, packet loss, jitter). Podcasters and interviewers need pristine, studio-grade recordings of each participant — independent of what the network delivers.
## Solution
**Dual-path architecture**: each client simultaneously (1) participates in the live call at whatever codec quality the network supports, and (2) records their own microphone locally as lossless PCM. After the session, all local recordings are uploaded to a self-hosted mixer service that aligns, normalizes, and outputs a final multi-track or mixed file.
## Architecture
```
                          ┌──────────────────┐
Mic ──┬── Opus/Codec2 ──► │  Network (live)  │  ← real-time call
      │                   └──────────────────┘
      └── WAV 48kHz ────► Local File            ← pristine recording
                          (timestamped)
                               ▼ (after hangup)
                          ┌──────────────────┐
                          │  Mixer Service   │  ← self-hosted
                          │  (align + mix)   │
                          └──────────────────┘
                          Final MP3/WAV/FLAC
```
## Requirements
### Phase 1: Local Recording (MVP)
**All clients (Desktop, Android, Web):**
1. **Record toggle**: User can enable "Record this call" before or during a call
2. **Recording pipeline**: Tap raw PCM from the microphone capture path *before* it enters the codec encoder
3. **File format**: WAV (48kHz, 16-bit, mono) — simple, universally supported, lossless
4. **Sync markers**: Embed a monotonic timestamp (ms since call start) at the beginning of the recording, and periodically (every 10s) write a sync marker packet into a sidecar JSON file:
```json
{"ts_ms": 30000, "seq": 1500, "wall_clock_utc": "2026-04-07T12:00:30Z"}
```
This allows the mixer to align recordings from different participants even if they join at different times.
5. **Storage**:
- Desktop: `~/.wzp/recordings/{room}_{timestamp}.wav`
- Android: `Documents/WarzonePhone/{room}_{timestamp}.wav`
- Web: IndexedDB blob or File System Access API
6. **File size estimate**: 48 kHz × 16-bit × mono = 96 KB/s ≈ 5.8 MB/min ≈ 345 MB/hour
7. **UI indicator**: Red dot + timer showing recording is active and file size growing
8. **On hangup**: Close the WAV file, show "Recording saved" with file path/size
### Phase 2: Upload to Mixer
1. **Upload endpoint**: Self-hosted HTTP service (Rust or Go) that accepts WAV uploads with metadata
2. **Chunked/resumable upload**: Large files need resumable uploads (tus protocol or simple chunked POST)
3. **Upload metadata**:
```json
{
  "session_id": "uuid",
  "participant_fingerprint": "xxxx:xxxx:...",
  "alias": "Alice",
  "room": "podcast-ep-42",
  "duration_secs": 3600,
  "sync_markers": [...],
  "sample_rate": 48000,
  "channels": 1,
  "bit_depth": 16
}
```
4. **Upload UI**: Progress bar after hangup, option to upload now or later
5. **Retry on failure**: Queue uploads for retry if network is unavailable
### Phase 3: Mixer Service
1. **Alignment**: Use sync markers (wall clock + sequence numbers) to align recordings from all participants to a common timeline
2. **Silence trimming**: Detect and optionally trim leading/trailing silence
3. **Normalization**: Per-track loudness normalization (LUFS-based)
4. **Noise reduction**: Optional per-track noise gate or RNNoise pass
5. **Output formats**:
   - Multi-track: ZIP of individual WAVs (aligned, normalized)
   - Mixed: Single stereo or mono WAV/MP3/FLAC with all participants
   - Podcast-ready: Loudness-normalized to -16 LUFS (podcast standard)
6. **Web UI**: Simple dashboard to see sessions, download outputs, preview waveforms
7. **Self-hosted**: Docker image, single binary, SQLite for metadata
## Implementation Notes
### Recording tap point
The recording must tap *after* AGC (so levels are normalized) but *before* the codec encoder (to avoid compression artifacts). In the current architecture:
```
Mic → Ring Buffer → AGC → [TAP HERE for recording] → Opus/Codec2 → Network
```
**Desktop** (`engine.rs`): After `capture_agc.process_frame()`, before `encoder.encode()`
**Android** (`engine.rs`): Same location — after AGC, before encode
**CLI** (`call.rs`): After `self.agc.process_frame()` in `CallEncoder::encode_frame()`
### WAV writer
Use a simple streaming WAV writer that:
- Writes the WAV header with placeholder data length
- Appends PCM samples as they come
- On close, seeks back to update the data length in the header
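A minimal sketch of such a writer, using only the standard library. `WavWriter` is a hypothetical name, not an existing type in the codebase, and the sketch assumes 16-bit PCM and a recording below the 4 GiB limit of the 32-bit size fields.

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Streaming writer for 16-bit PCM WAV (illustrative helper).
pub struct WavWriter<W: Write + Seek> {
    out: W,
    data_bytes: u32,
}

impl<W: Write + Seek> WavWriter<W> {
    /// Writes the 44-byte header with zeroed size fields as placeholders.
    pub fn new(mut out: W, sample_rate: u32, channels: u16) -> io::Result<Self> {
        let block_align = channels * 2; // bytes per sample frame at 16 bits
        let byte_rate = sample_rate * u32::from(block_align);
        out.write_all(b"RIFF")?;
        out.write_all(&0u32.to_le_bytes())?; // RIFF chunk size, patched in finish()
        out.write_all(b"WAVEfmt ")?;
        out.write_all(&16u32.to_le_bytes())?; // fmt chunk size
        out.write_all(&1u16.to_le_bytes())?; // format tag 1 = PCM
        out.write_all(&channels.to_le_bytes())?;
        out.write_all(&sample_rate.to_le_bytes())?;
        out.write_all(&byte_rate.to_le_bytes())?;
        out.write_all(&block_align.to_le_bytes())?;
        out.write_all(&16u16.to_le_bytes())?; // bits per sample
        out.write_all(b"data")?;
        out.write_all(&0u32.to_le_bytes())?; // data size, patched in finish()
        Ok(Self { out, data_bytes: 0 })
    }

    /// Appends samples as little-endian 16-bit PCM.
    pub fn write_samples(&mut self, samples: &[i16]) -> io::Result<()> {
        for s in samples {
            self.out.write_all(&s.to_le_bytes())?;
        }
        self.data_bytes += (samples.len() * 2) as u32;
        Ok(())
    }

    /// Seeks back and fills in the two size fields, then returns the sink.
    pub fn finish(mut self) -> io::Result<W> {
        self.out.seek(SeekFrom::Start(4))?;
        self.out.write_all(&(36 + self.data_bytes).to_le_bytes())?;
        self.out.seek(SeekFrom::Start(40))?;
        self.out.write_all(&self.data_bytes.to_le_bytes())?;
        self.out.flush()?;
        Ok(self.out)
    }
}
```

Being generic over `Write + Seek` lets the same code target a `std::fs::File` in the client and an in-memory `Cursor` in tests. If the process dies before `finish()`, the size fields stay zero, so a recovery pass that derives the length from the file size may be worth considering.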
### Sync mechanism
Wall-clock UTC alone is insufficient (clocks drift). The sync strategy:
1. Each participant records their local monotonic time + wall clock at call start
2. Periodically (every 10s), each participant writes: `{local_mono_ms, seq_number, utc_iso}`
3. The mixer uses sequence numbers (which are shared via the wire protocol) as ground truth for alignment, with wall clock as a fallback
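The periodic marker in step 2 could be produced along these lines. This is a sketch under assumptions: `SyncMarker` and `MarkerSchedule` are hypothetical names, the field names follow the triple listed above, and the UTC string is passed in by the caller rather than generated here.

```rust
/// One sidecar entry holding the `{local_mono_ms, seq_number, utc_iso}` triple.
#[derive(Debug, Clone, PartialEq)]
pub struct SyncMarker {
    pub local_mono_ms: u64,
    pub seq_number: u32,
    pub utc_iso: String,
}

impl SyncMarker {
    /// Serializes the marker as one JSON line. No escaping is done, which
    /// assumes `utc_iso` is a plain ISO 8601 timestamp.
    pub fn to_json_line(&self) -> String {
        format!(
            "{{\"local_mono_ms\":{},\"seq_number\":{},\"utc_iso\":\"{}\"}}",
            self.local_mono_ms, self.seq_number, self.utc_iso
        )
    }
}

/// Decides when the next marker is due on a fixed grid (e.g. every 10 s).
pub struct MarkerSchedule {
    interval_ms: u64,
    next_due_ms: u64,
}

impl MarkerSchedule {
    pub fn new(interval_ms: u64) -> Self {
        // Guard against a zero interval, which would divide by zero below.
        Self { interval_ms: interval_ms.max(1), next_due_ms: 0 }
    }

    /// Returns true when a marker should be written for a frame at `now_ms`.
    pub fn due(&mut self, now_ms: u64) -> bool {
        if now_ms < self.next_due_ms {
            return false;
        }
        // Jump past any missed intervals so markers stay on the grid.
        self.next_due_ms = (now_ms / self.interval_ms + 1) * self.interval_ms;
        true
    }
}
```

The capture loop would call `due()` once per frame with the monotonic time since call start and append `to_json_line()` to the sidecar file whenever it returns true.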
### Privacy
- Local recordings never leave the device without explicit user action
- Upload is manual, not automatic
- The mixer service processes files and can delete originals after mixing
- No recording data flows through the relay — only the user's own mic
## Non-Goals (v1)
- Live transcription (future)
- Video recording (audio only)
- Automatic upload without user consent
- Recording other participants' audio (only your own mic)
- Real-time mixing (post-session only)
## Milestones
| Phase | Scope | Effort |
|-------|-------|--------|
| 1a | Local WAV recording on Desktop | 1-2 days |
| 1b | Local WAV recording on Android | 1-2 days |
| 1c | Sync markers + metadata sidecar | 1 day |
| 2a | Upload service (HTTP + storage) | 2-3 days |
| 2b | Upload UI in clients | 1-2 days |
| 3a | Mixer: alignment + normalization | 2-3 days |
| 3b | Mixer: web dashboard | 2-3 days |
| 3c | Docker packaging | 1 day |


@@ -0,0 +1,89 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: QUIC Path MTU Discovery
## Problem
WarzonePhone uses conservative 1200-byte QUIC datagrams. Many network paths support larger MTUs (1400+ bytes), so the fixed size wastes bandwidth on per-packet overhead. Some broken paths (VPNs, tunnels, double-NAT, cellular) have an MTU below 1200, causing silent packet drops; this may explain why Opus 64k fails on some paths while 24k works (larger encoded frames plus FEC repair packets).
## Solution
Enable Quinn's built-in Path MTU Discovery (PMTUD) and handle edge cases:
1. PMTUD probes larger packet sizes and discovers the actual path MTU
2. Graceful fallback when datagrams exceed discovered MTU
3. Expose MTU in metrics for debugging
## Implementation
### Phase 1: Enable PMTUD in Quinn
`crates/wzp-transport/src/config.rs` — update `transport_config()`:
```rust
// Enable PMTUD (Quinn default is enabled, but we should ensure it)
config.mtu_discovery_config(Some(quinn::MtuDiscoveryConfig::default()));
// Set minimum MTU for safety (some paths can't handle 1200)
// Quinn default min is 1200, which is the QUIC spec minimum
```
Quinn's `MtuDiscoveryConfig` has:
- `interval`: how often to probe (default: 600s)
- `upper_bound`: max MTU to probe (default: 1452, chosen to stay within a 1500-byte Ethernet MTU for both IPv4 and IPv6)
- `minimum_change`: min MTU increase to be worth probing (default: 20)
### Phase 2: Handle MTU-related Failures
In federation forwarding (`send_raw_datagram`), if the datagram exceeds the connection's current MTU, Quinn returns an error. Handle gracefully:
- Log warning with packet size vs MTU
- Drop the packet (don't crash)
- Track in metrics: `wzp_relay_mtu_exceeded_total`
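The drop-and-count behaviour can be sketched without any Quinn types. Everything here is illustrative: `forward_datagram`, `ForwardOutcome`, and the static counter stand in for the real forwarding path and the `wzp_relay_mtu_exceeded_total` metric, and `eprintln!` stands in for the tracing warning.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the relay metric `wzp_relay_mtu_exceeded_total`.
pub static MTU_EXCEEDED_TOTAL: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, PartialEq)]
pub enum ForwardOutcome {
    Sent,
    DroppedTooLarge { len: usize, max: usize },
}

/// Checks a datagram against the current size limit before calling `send`.
/// Oversized datagrams are logged, counted, and dropped instead of panicking.
pub fn forward_datagram(
    payload: &[u8],
    max_datagram_size: usize,
    mut send: impl FnMut(&[u8]),
) -> ForwardOutcome {
    if payload.len() > max_datagram_size {
        MTU_EXCEEDED_TOTAL.fetch_add(1, Ordering::Relaxed);
        eprintln!(
            "warn: dropping datagram, {} bytes exceeds limit of {}",
            payload.len(),
            max_datagram_size
        );
        return ForwardOutcome::DroppedTooLarge { len: payload.len(), max: max_datagram_size };
    }
    send(payload);
    ForwardOutcome::Sent
}
```

In the relay the limit would come from the connection's current maximum datagram size rather than a parameter.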
### Phase 3: Codec-Aware MTU
When the path MTU is small, the relay or client should:
- Prefer lower-bitrate codecs (smaller packets)
- Reduce FEC ratio (fewer repair packets)
- This feeds into the adaptive quality system
### Phase 4: Expose MTU in Stats
- Add `path_mtu` to relay metrics (per peer)
- Add `path_mtu` to client stats (visible in UI)
- Log MTU on connection establishment
## Non-Goals (v1)
- Datagram fragmentation (QUIC datagrams are atomic — either fit or don't)
- Manual MTU override per relay config
- MTU-based codec selection (future, needs adaptive quality)
## Effort: 1 day
## Implementation Status (2026-04-12)
Phase 1 is now implemented:
### What was built
- **Transport config** (`crates/wzp-transport/src/config.rs`):
  - `MtuDiscoveryConfig` with `upper_bound=1452`, `interval=300s`, `black_hole_cooldown=30s`
  - `initial_mtu=1200` (safe QUIC minimum)
  - Quinn's PLPMTUD binary-searches from 1200 up to 1452 automatically
- **`QuinnPathSnapshot::current_mtu`** (`crates/wzp-transport/src/quic.rs`):
  - Reads `connection.max_datagram_size()` which reflects the PMTUD-discovered value
  - Available to all callers via `transport.quinn_path_stats()`
- **Trunk batcher MTU-aware** (`crates/wzp-relay/src/room.rs`):
  - `TrunkedForwarder::new()` initializes `max_bytes` from discovered MTU
  - `send()` refreshes `max_bytes` on every call (cheap atomic read in quinn)
  - Federation trunk frames grow automatically as PMTUD discovers larger paths
### Phases 2-3 status
- Phase 2 (handle MTU failures): Already handled — `send_media()`/`send_trunk()` check `max_datagram_size()` and return `DatagramTooLarge` errors. These are logged and the packet is dropped gracefully.
- Phase 3 (codec-aware MTU): Not yet implemented. Future video frames will need application-layer fragmentation when they exceed the discovered MTU.


@@ -0,0 +1,82 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Network Diagnostic (Netcheck)
> Phase: Implemented
> Status: Done (2026-04-14)
> Crate: wzp-client
## Problem
When P2P connections fail or call quality is poor, there is no diagnostic tool to understand why. Users and developers must manually probe STUN, check NAT type, test relay connectivity, and verify port mapping support — all separately. Tailscale's `netcheck` consolidates all of this into a single diagnostic report.
## Solution
A comprehensive `run_netcheck()` function that probes all network capabilities in parallel and produces a structured `NetcheckReport`. Exposed as a CLI flag (`wzp-client --netcheck`) and available for in-app diagnostics.
## Implementation
### New Module: `crates/wzp-client/src/netcheck.rs`
**NetcheckReport**:
```rust
pub struct NetcheckReport {
    pub nat_type: NatType,
    pub reflexive_addr: Option<String>,
    pub ipv4_reachable: bool,
    pub ipv6_reachable: bool,
    pub hairpin_works: Option<bool>,
    pub port_mapping: Option<PortMapProtocol>,
    pub relay_latencies: Vec<RelayLatency>,
    pub preferred_relay: Option<String>,
    pub stun_latency_ms: Option<u32>,
    pub upnp_available: bool,
    pub pcp_available: bool,
    pub nat_pmp_available: bool,
    pub gateway: Option<String>,
    pub duration_ms: u32,
    pub stun_probes: Vec<NatProbeResult>,
    pub port_allocation: Option<PortAllocation>,
}
```
**Probes (all parallel via `tokio::join!`)**:
1. **STUN probes**: `probe_stun_servers()` to all configured STUN servers
2. **Relay latencies**: `probe_reflect_addr()` to each configured relay
3. **Port mapping**: `acquire_port_mapping()` to detect NAT-PMP/PCP/UPnP
4. **Gateway**: `default_gateway()` for the router address
5. **IPv6**: attempt to bind `[::]:0` and send to an IPv6 STUN server
6. **Port allocation**: `detect_port_allocation()` probes STUN servers from a single socket to classify the NAT pattern as PortPreserving/Sequential/Random (feeds into hard NAT prediction)
**Derived fields**:
- `nat_type` / `reflexive_addr` — from `classify_nat()` on STUN probes
- `ipv4_reachable` — true if any STUN probe succeeded
- `preferred_relay` — relay with lowest RTT
- `port_mapping` / `nat_pmp_available` / `pcp_available` / `upnp_available` — from portmap result
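As an illustration of the "lowest RTT" rule for `preferred_relay`, here is a small sketch. The `RelayLatency` fields shown are assumptions made for the example; the real struct may differ.

```rust
/// Hypothetical shape of one relay probe result.
#[derive(Debug, Clone, PartialEq)]
pub struct RelayLatency {
    pub relay: String,
    /// `None` means the probe failed or timed out.
    pub rtt_ms: Option<u32>,
}

/// Picks the reachable relay with the lowest RTT; ties go to the first entry.
pub fn preferred_relay(latencies: &[RelayLatency]) -> Option<String> {
    latencies
        .iter()
        .filter_map(|r| r.rtt_ms.map(|rtt| (rtt, &r.relay)))
        .min_by_key(|(rtt, _)| *rtt)
        .map(|(_, relay)| relay.clone())
}
```

Relays whose probe failed are skipped entirely, so the result is `None` only when no relay answered.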
**Human-readable output**: `format_report()` produces a formatted text report with sections for NAT info, port mapping, STUN probes, relay latencies.
### CLI Integration
`wzp-client --netcheck <relay-addr>` — runs the diagnostic using the specified relay plus default STUN servers, prints the report, and exits.
### Deferred
- **Hairpin test** — send packet from shared endpoint to own reflexive addr to test NAT hairpinning. Architecture is in place (`hairpin_works: Option<bool>`) but the actual probe is not yet implemented.
- **Android/Desktop in-app UI** — expose via JNI (Android) and Tauri command (desktop) for user-facing diagnostics.
## Files
| File | Change |
|------|--------|
| `crates/wzp-client/src/netcheck.rs` | New — NetcheckReport + run_netcheck + format_report |
| `crates/wzp-client/src/lib.rs` | Add `pub mod netcheck` |
| `crates/wzp-client/src/cli.rs` | `--netcheck` flag + handler |
## Testing
- 5 unit tests: default config, report JSON serialization + roundtrip, RelayLatency serialization, format_report with empty relays, format_report with full data (STUN probes, relay latencies, preferred relay, port mapping)
- 1 integration test (`#[ignore]`): full netcheck run


@@ -0,0 +1,144 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Network Awareness
> Phase: Implemented (core path)
> Status: Ready for testing
> Platform: Android native Kotlin app (com.wzp)
## Problem
WarzonePhone's quality controller (`AdaptiveQualityController`) had a `signal_network_change()` API for proactive adaptation to WiFi↔cellular transitions, but nothing called it. Network handoffs during calls were only detected reactively via jitter spikes — by which time the user had already experienced degraded audio.
## Solution
Integrate Android's `ConnectivityManager.NetworkCallback` to detect network transport changes in real-time and feed them to the quality controller. This enables:
1. **Preemptive quality downgrade** when switching from WiFi to cellular
2. **FEC boost** (10-second window with +0.2 ratio) after any network change
3. **Faster downgrade thresholds** on cellular (2 consecutive reports vs 3 on WiFi)
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ Android │
│ │
│ ConnectivityManager │
│ │ NetworkCallback │
│ ▼ │
│ NetworkMonitor.kt │
│ │ onNetworkChanged(type, bandwidthKbps) │
│ ▼ │
│ CallViewModel.kt ──► WzpEngine.onNetworkChanged() │
│ │ JNI │
│ ▼ │
│ jni_bridge.rs: nativeOnNetworkChanged(handle, type, bw) │
│ │ │
│ ▼ │
│ engine.rs: state.pending_network_type.store(type) │
│ │ AtomicU8 (lock-free) │
│ ▼ │
│ recv task: quality_ctrl.signal_network_change(ctx) │
│ │ │
│ ├─ Preemptive downgrade (WiFi → cellular) │
│ ├─ FEC boost 10s │
│ └─ Faster cellular thresholds │
└──────────────────────────────────────────────────────────────┘
```
## Network Classification
`NetworkMonitor` classifies the active transport without requiring `READ_PHONE_STATE` permission by using bandwidth heuristics:
| Transport / downstream bandwidth | Classification | Rust `NetworkContext` |
|----------------------------------|---------------|----------------------|
| WiFi transport (bandwidth not used) | WiFi | `WiFi` |
| Cellular, >= 100 Mbps | 5G NR | `Cellular5g` |
| Cellular, >= 10 Mbps | LTE | `CellularLte` |
| Cellular, < 10 Mbps | 3G or worse | `Cellular3g` |
| Ethernet | WiFi (equivalent) | `WiFi` |
| Network lost | None | `Unknown` |
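The mapping in the table can be written as a single `match`. The real classification lives in `NetworkMonitor.kt`; this Rust version is only a sketch of the same rules, and the `Transport` enum and the `classify` function are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkContext {
    WiFi,
    Cellular5g,
    CellularLte,
    Cellular3g,
    Unknown,
}

/// Hypothetical stand-in for the transport reported by the platform.
#[derive(Debug, Clone, Copy)]
pub enum Transport {
    WiFi,
    Ethernet,
    Cellular,
    Lost,
}

/// Applies the bandwidth heuristic: thresholds only matter on cellular.
pub fn classify(transport: Transport, downstream_kbps: u32) -> NetworkContext {
    match transport {
        Transport::WiFi | Transport::Ethernet => NetworkContext::WiFi,
        Transport::Lost => NetworkContext::Unknown,
        Transport::Cellular if downstream_kbps >= 100_000 => NetworkContext::Cellular5g,
        Transport::Cellular if downstream_kbps >= 10_000 => NetworkContext::CellularLte,
        Transport::Cellular => NetworkContext::Cellular3g,
    }
}
```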
## Cross-Task Signaling
The network type is communicated from the JNI thread to the recv task via `AtomicU8` — the same pattern used for `pending_profile` (adaptive quality profile switches):
```
JNI thread                      recv task (tokio)
│                               │
│  store(type, Release)         │
│──────────────────────────────►│
│                               │  swap(0xFF, Acquire)
│                               │  if != 0xFF:
│                               │    quality_ctrl.signal_network_change(ctx)
│                               │
```
Sentinel value `0xFF` means "no change pending". The recv task polls on every received packet (~20-40ms), so latency is bounded by the inter-packet interval.
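The store/swap handoff can be captured in a few lines of standard-library Rust. `PendingNetworkType` is a hypothetical wrapper written for this sketch; the engine is described as holding a bare `AtomicU8` field.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Sentinel meaning "no change pending".
pub const NO_CHANGE: u8 = 0xFF;

/// Single-slot mailbox: the writer overwrites, the reader takes and resets.
pub struct PendingNetworkType(AtomicU8);

impl PendingNetworkType {
    pub const fn new() -> Self {
        Self(AtomicU8::new(NO_CHANGE))
    }

    /// Called from the JNI thread when the platform reports a new network type.
    pub fn publish(&self, network_type: u8) {
        self.0.store(network_type, Ordering::Release);
    }

    /// Called from the recv task on each packet; returns the latest pending
    /// value at most once and resets the slot to the sentinel.
    pub fn take(&self) -> Option<u8> {
        match self.0.swap(NO_CHANGE, Ordering::Acquire) {
            NO_CHANGE => None,
            t => Some(t),
        }
    }
}
```

If several changes arrive between two polls, only the most recent one is seen, which matches the intent of reacting to the current network rather than replaying history.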
## Components
### New File
| File | Purpose |
|------|---------|
| `android/.../net/NetworkMonitor.kt` | ConnectivityManager callback, transport classification, deduplication |
### Modified Files
| File | Change |
|------|--------|
| `android/.../engine/WzpEngine.kt` | Added `onNetworkChanged()` method + `nativeOnNetworkChanged` external |
| `android/.../ui/call/CallViewModel.kt` | Instantiates NetworkMonitor, wires callback, register/unregister lifecycle |
| `crates/wzp-android/src/jni_bridge.rs` | Added `Java_com_wzp_engine_WzpEngine_nativeOnNetworkChanged` JNI entry |
| `crates/wzp-android/src/engine.rs` | Added `pending_network_type: AtomicU8` to EngineState, recv task polls it |
### Unchanged (already implemented)
| File | API |
|------|-----|
| `crates/wzp-proto/src/quality.rs` | `AdaptiveQualityController::signal_network_change(NetworkContext)` |
| `crates/wzp-transport/src/path_monitor.rs` | `PathMonitor::detect_handoff()` (available for future use) |
## Deferred Work
### Tauri Desktop App (com.wzp.desktop)
~~The Tauri engine doesn't use `AdaptiveQualityController` — quality is resolved once at call start.~~ **Update (2026-04-13):** Desktop now has `AdaptiveQualityController` wired into the recv task with `pending_profile` AtomicU8 bridge. Network monitoring on desktop is now feasible — the blocker was adaptive quality, which is done. Remaining work: platform-specific network change detection (macOS: `SCNetworkReachability` or `NWPathMonitor`; Linux: `netlink` socket).
### Mid-Call ICE Re-gathering — PARTIALLY IMPLEMENTED (2026-04-14)
When the device's IP address changes, the intended flow is:
1. Re-gather local host candidates (`local_host_candidates()`) ✅
2. Re-probe STUN (`stun::discover_reflexive()` + `portmap::acquire_port_mapping()`) ✅
3. Send updated candidates to the peer (`CandidateUpdate` signal message) ✅
4. Relay forwards `CandidateUpdate` to peer (same pattern as `MediaPathReport`) ✅
5. Peer receives and can parse via `IceAgent::apply_peer_update()`
6. Attempt new dual-path race for path upgrade — **NOT YET WIRED** (transport hot-swap)
`NetworkMonitor.onIpChanged` fires on `onLinkPropertiesChanged` — the hook is ready.
The signaling plane is fully implemented via `IceAgent` + `CandidateUpdate`.
Remaining: wire `onIpChanged` → JNI → `pending_ice_regather` AtomicBool → recv task → `ice_agent.re_gather()` → transport swap.
New modules added in Phase 8 (Tailscale-inspired):
- `crates/wzp-client/src/ice_agent.rs` — candidate lifecycle management
- `crates/wzp-client/src/stun.rs` — public STUN server probing (independent of relay)
- `crates/wzp-client/src/portmap.rs` — NAT-PMP/PCP/UPnP port mapping
- `crates/wzp-client/src/netcheck.rs` — comprehensive network diagnostic
## Testing
1. Build native APK
2. Start a call on WiFi
3. Verify logcat: `quality controller: network context updated` with `ctx=WiFi`
4. Disable WiFi → device falls to cellular
5. Verify logcat: `ctx=CellularLte` (or `Cellular5g`/`Cellular3g`)
6. Verify FEC boost activates (check quality_ctrl logs)
7. Verify preemptive quality downgrade (tier drops one level on WiFi→cellular)
8. Re-enable WiFi → verify transition back
9. Rapid WiFi toggle (5x in 10s) → verify no crashes, deduplication works
10. Airplane mode → verify `onLost` fires with `TYPE_NONE`


@@ -0,0 +1,217 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Peer-to-Peer Direct Calls (No Relay)
## Problem
All calls currently route through a relay, even 1-on-1 calls between clients that could reach each other directly. This adds latency (2x hop), creates a single point of failure, and requires trusting the relay operator (even though media is encrypted, the relay sees metadata).
## Solution
For 1-on-1 calls, clients attempt a direct QUIC connection using STUN-discovered addresses. If NAT traversal succeeds, media flows directly between peers. If it fails, fall back to relay-assisted mode (current behavior).
## Architecture
```
Preferred (P2P):
  Client A ←──QUIC direct──→ Client B
  (no relay in media path, true E2E)

Fallback (Relay):
  Client A ──→ Relay ──→ Client B
  (current model)

Hybrid discovery:
  Client A → Relay (signaling only) → Client B
     ↓                                   ↓
  STUN server                       STUN server
     ↓                                   ↓
  Discover public IP:port     Discover public IP:port
     ↓                                   ↓
  Exchange candidates via relay signaling
     ↓                                   ↓
  Attempt direct QUIC connection ←──→
```
## Why P2P = True E2E
- QUIC TLS handshake establishes encrypted tunnel directly between A and B
- No third party sees the traffic
- Certificate pinning via identity fingerprints: each client derives their TLS cert from their Ed25519 seed (same as relay identity). During QUIC handshake, both sides verify the peer's cert fingerprint against the known identity
- MITM elimination: if A knows B's fingerprint (from prior call, QR code, or identity server), any interceptor presents a different cert → fingerprint mismatch → connection rejected
- Stronger guarantee than relay-assisted: user doesn't need to trust relay operator
## Requirements
### Phase 1: STUN Discovery
1. **STUN client**: lightweight UDP-based STUN client to discover public IP:port
   - Use existing public STUN servers (stun.l.google.com:19302, etc.)
   - Or run a STUN server alongside the relay
   - Discover: local addresses, server-reflexive addresses (STUN), relay candidates (TURN/relay fallback)
2. **Candidate gathering**: on call initiation, gather all candidates:
   - Host candidates: local network interfaces
   - Server-reflexive: STUN-discovered public IP:port
   - Relay candidate: the relay's address (fallback)
3. **Candidate exchange**: via relay signaling channel (existing `IceCandidate` signal message)
   - A sends candidates to relay → relay forwards to B
   - B sends candidates to relay → relay forwards to A
### Phase 2: Direct Connection
1. **QUIC hole punching**: both clients simultaneously attempt QUIC connections to each other's candidates
   - Quinn supports connecting to multiple addresses
   - First successful connection wins
   - Timeout after 3 seconds, fall back to relay
2. **Identity verification**: during QUIC handshake, verify peer's TLS cert fingerprint
   - `server_config_from_seed()` already exists — derive client cert from identity seed
   - Both sides present certs (mutual TLS)
   - Verify fingerprint matches expected identity
3. **Media flow**: once connected, use existing `QuinnTransport` for media + signals
   - Same `send_media()` / `recv_media()` API
   - Same codec pipeline, FEC, jitter buffer
   - No code changes needed in the call engine
### Phase 3: Adaptive Quality (P2P)
P2P connections have direct quality visibility — no relay middleman:
1. Both clients observe RTT, loss, jitter directly from QUIC stats
2. Adapt codec quality based on direct observations
3. Since only 2 participants, coordinated switching is simple: propose → ack → switch
This is the simplest case for adaptive quality. Once proven, backport the logic to relay-assisted mode.
### Phase 4: Hybrid Mode
1. **Call initiation**: always connect to relay for signaling
2. **Parallel attempt**: while relay call is active, attempt P2P in background
3. **Seamless migration**: if P2P succeeds, migrate media path from relay to direct
   - Both clients switch simultaneously
   - Relay connection kept alive for signaling (presence, room updates)
4. **Fallback**: if P2P connection drops, seamlessly fall back to relay
## Security Properties
| Property | Relay Mode | P2P Mode |
|----------|-----------|----------|
| Encryption | ChaCha20-Poly1305 (app layer) | QUIC TLS 1.3 + ChaCha20-Poly1305 |
| Key exchange | Via relay signaling | Direct QUIC handshake |
| Identity verification | TOFU (server fingerprint) | Mutual TLS cert pinning |
| Metadata privacy | Relay sees who talks to whom | No third party sees anything |
| MITM resistance | Depends on relay trust | Strong (cert pinning) |
| Forward secrecy | ECDH ephemeral keys | QUIC built-in + app-layer rekey |
## Implementation Notes
### STUN in Rust
Use `stun-rs` or `webrtc-rs` crate for STUN client. Minimal: just need Binding Request/Response to discover server-reflexive address.
### Quinn Hole Punching
Quinn's `Endpoint` can both listen and connect. For hole punching:
```rust
let endpoint = create_endpoint(bind_addr, Some(server_config))?;
// Send connect to peer's address (opens NAT pinhole)
let conn = connect(&endpoint, peer_addr, "peer", client_config).await?;
// Simultaneously, peer connects to our address
// First successful handshake wins
```
### Client TLS Certificate
Already have `server_config_from_seed()` for relays. Create `client_config_from_seed()` that presents a TLS client certificate derived from the identity seed. The peer verifies this cert's fingerprint.
### Signaling via Relay
The existing relay connection carries `IceCandidate` signals. No new infrastructure needed — just use the relay as a dumb signaling pipe for candidate exchange.
## Non-Goals (v1)
- SFU over P2P (P2P is 1-on-1 only; multi-party uses relay SFU)
- TURN server (relay acts as the fallback, no separate TURN)
- mDNS local discovery (future)
- Mesh P2P for multi-party (future, complex)
## Milestones
| Phase | Scope | Effort | Status |
|-------|-------|--------|--------|
| 1 | STUN client + candidate gathering | 2 days | Done |
| 2 | QUIC hole punching + identity verification | 3 days | Done |
| 3 | Adaptive quality on P2P connection | 2 days | Done (#23) |
| 4 | Hybrid mode (relay + P2P, seamless migration) | 3 days | Done |
| 5 | Single-socket Nebula (shared signal+direct endpoint) | 2 days | Done |
| 6 | ICE path negotiation + dual-path race | 3 days | Done |
| 7 | IPv6 dual-socket | 2 days | Done (but `dual_path.rs` integration tests broken — missing `ipv6_endpoint` arg) |
| 8.1 | Public STUN client (RFC 5389) | 1 day | Done |
| 8.2 | PCP/PMP/UPnP port mapping | 2 days | Done |
| 8.3 | Mid-call ICE re-gathering + CandidateUpdate signal | 2 days | Done (signal plane; transport hot-swap TODO) |
| 8.4 | Netcheck diagnostic | 1 day | Done |
| 8.5 | Region-based relay selection (data model) | 1 day | Done |
| 8.6a | Hard NAT: port allocation detection | 1 day | Done |
| 8.6b | Hard NAT: sequential port prediction signal | 1 day | Done (signal + prediction fn; dial integration pending) |
| 8.6c | Hard NAT: birthday attack (256×1024 probes) | 3 days | Not started |
| 8.6d | Hard NAT: hybrid waterfall + background upgrade | 2 days | Not started |
## Implementation Status (2026-04-13)
Phases 1-2, 4-7 are implemented. First P2P call completed 2026-04-12.
### Known regression
Phase 7 added `ipv6_endpoint: Option<Endpoint>` parameter to `race()` in `crates/wzp-client/src/dual_path.rs` but the 3 test call sites in `crates/wzp-client/tests/dual_path.rs` (lines 111, 153, 191) were not updated — they pass 6 args instead of 7. Fix: add `None,` after the `shared_endpoint` arg in each call.
## Update (2026-04-13)
P2P adaptive quality (#23) now implemented:
- Both peers self-observe network quality from QUIC path stats
- Quality reports generated every ~1s and attached to outgoing packets
- AdaptiveQualityController drives codec switching on both P2P and relay calls
## Update (2026-04-14): Phase 8 — Tailscale-Inspired Enhancements
Added 5 new modules to bring NAT traversal capability close to Tailscale's:
### Phase 8.1: Public STUN Client (Done)
- `stun.rs`: RFC 5389 Binding Request/Response over raw UDP
- Independent reflexive discovery via public STUN servers (Google, Cloudflare)
- `detect_nat_type_with_stun()` combines relay + STUN probes for higher confidence
- STUN fallback in desktop's `try_reflect_own_addr()` when relay reflection fails
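For reference, the wire format behind a Binding Request and the XOR-MAPPED-ADDRESS attribute is small enough to sketch with the standard library. The function names are made up for this example and are not the API of `stun.rs`; only IPv4 is handled, and no socket I/O or retransmission is shown.

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// Fixed magic cookie from RFC 5389.
const MAGIC_COOKIE: u32 = 0x2112_A442;

/// Builds a 20-byte Binding Request with no attributes.
pub fn binding_request(txn_id: [u8; 12]) -> [u8; 20] {
    let mut msg = [0u8; 20];
    msg[0..2].copy_from_slice(&0x0001u16.to_be_bytes()); // Binding Request
    // Bytes 2..4 stay zero: message length is 0 without attributes.
    msg[4..8].copy_from_slice(&MAGIC_COOKIE.to_be_bytes());
    msg[8..20].copy_from_slice(&txn_id);
    msg
}

/// Extracts the IPv4 XOR-MAPPED-ADDRESS from a Binding Success Response.
pub fn parse_xor_mapped_v4(resp: &[u8], txn_id: &[u8; 12]) -> Option<SocketAddrV4> {
    if resp.len() < 20 || resp[0..2] != [0x01, 0x01] || resp[8..20] != txn_id[..] {
        return None;
    }
    if resp[4..8] != MAGIC_COOKIE.to_be_bytes() {
        return None;
    }
    let body_len = u16::from_be_bytes([resp[2], resp[3]]) as usize;
    let body = resp.get(20..20 + body_len)?;
    let mut i = 0;
    while i + 4 <= body.len() {
        let attr_type = u16::from_be_bytes([body[i], body[i + 1]]);
        let attr_len = u16::from_be_bytes([body[i + 2], body[i + 3]]) as usize;
        let value = body.get(i + 4..i + 4 + attr_len)?;
        // 0x0020 = XOR-MAPPED-ADDRESS, family 0x01 = IPv4.
        if attr_type == 0x0020 && attr_len == 8 && value[1] == 0x01 {
            let port = u16::from_be_bytes([value[2], value[3]]) ^ (MAGIC_COOKIE >> 16) as u16;
            let addr = u32::from_be_bytes([value[4], value[5], value[6], value[7]]) ^ MAGIC_COOKIE;
            return Some(SocketAddrV4::new(Ipv4Addr::from(addr), port));
        }
        i += 4 + ((attr_len + 3) & !3); // attribute values are padded to 4 bytes
    }
    None
}
```

Checking the transaction ID ties each response to the request that triggered it, which matters when several STUN servers are probed from the same socket.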
### Phase 8.2: PCP/PMP/UPnP Port Mapping (Done)
- `portmap.rs`: NAT-PMP (RFC 6886), PCP (RFC 6887), UPnP IGD
- Gateway discovery (macOS + Linux), try NAT-PMP → PCP → UPnP in sequence
- New candidate type: `PeerCandidates.mapped` + signal fields `caller_mapped_addr`/`callee_mapped_addr`/`peer_mapped_addr`
- Dial order: host → mapped → reflexive (mapped helps on symmetric NATs)
### Phase 8.3: Mid-Call ICE Re-Gathering (Done — signal plane)
- `ice_agent.rs`: `IceAgent` with `gather()`, `re_gather()`, `apply_peer_update()`
- `SignalMessage::CandidateUpdate` with monotonic generation counter
- Relay forwards `CandidateUpdate` like `MediaPathReport`
- Desktop handles and emits to JS frontend
- Transport hot-swap: designed but not yet wired into live call engine
### Phase 8.4: Netcheck Diagnostic (Done)
- `netcheck.rs`: comprehensive network diagnostic (NAT type, reflexive addr, IPv4/v6, port mapping, relay latencies)
- CLI: `wzp-client --netcheck <relay>`
### Phase 8.5: Region-Based Relay Selection (Done — data model)
- `relay_map.rs`: `RelayMap` sorted by RTT with `preferred()` selection
- `RegisterPresenceAck` extended with `relay_region` + `available_relays`
### Phase 8.6: Hard NAT Traversal (Phase A done, B-D pending)
- **Phase A (Done)**: Port allocation pattern detection — `PortAllocation` enum (`PortPreserving`/`Sequential{delta}`/`Random`/`Unknown`), `detect_port_allocation()` probes N STUN servers from single socket, `classify_port_allocation()` with wraparound + jitter tolerance, `predict_ports()` for sequential NATs
- **Phase B (signal ready)**: `HardNatProbe` signal message carries `port_sequence`, `allocation`, `external_ip` — relay forwarding implemented. Actual dial-to-predicted-ports integration into `dual_path::race()` pending.
- **Phase C (not started)**: Birthday attack (256 sockets × 1024 probes) for random NATs
- **Phase D (not started)**: Hybrid waterfall with background relay-to-direct upgrade
- `NetcheckReport.port_allocation` populated automatically from `detect_port_allocation()`
- See `docs/PRD-hard-nat.md` for full design
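A simplified sketch of the Phase A classification and the sequential prediction follows. It reuses the names from the bullets above but is not the project's implementation: the signatures are assumptions, wraparound is handled, and the jitter tolerance mentioned above is deliberately left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PortAllocation {
    PortPreserving,
    Sequential { delta: i32 },
    Random,
    Unknown,
}

/// Classifies the external ports seen by successive STUN probes sent from
/// one local socket, listed in probe order.
pub fn classify_port_allocation(local_port: u16, observed: &[u16]) -> PortAllocation {
    if observed.len() < 2 {
        return PortAllocation::Unknown;
    }
    if observed.iter().all(|&p| p == local_port) {
        return PortAllocation::PortPreserving;
    }
    // Signed step between consecutive ports, wrapping at the 16-bit boundary.
    let deltas: Vec<i32> = observed
        .windows(2)
        .map(|w| i32::from(w[1].wrapping_sub(w[0]) as i16))
        .collect();
    let first = deltas[0];
    if !deltas.iter().all(|&d| d == first) {
        return PortAllocation::Random;
    }
    if first == 0 {
        // Stable mapping on a port other than the local one; the enum has
        // no dedicated variant for this case, so report Unknown.
        PortAllocation::Unknown
    } else {
        PortAllocation::Sequential { delta: first }
    }
}

/// Predicts the next `count` external ports of a sequential NAT.
pub fn predict_ports(last_observed: u16, delta: i32, count: usize) -> Vec<u16> {
    (1..=count as i32)
        .map(|k| (i32::from(last_observed) + k * delta).rem_euclid(65_536) as u16)
        .collect()
}
```

A production version would tolerate small deviations in the step (other flows on the same NAT consume ports between probes), which is what the jitter tolerance refers to.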

vault/PRDs/PRD-portmap.md

@@ -0,0 +1,97 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: NAT Port Mapping (PCP/PMP/UPnP)
> Phase: Implemented
> Status: Done (2026-04-14)
> Crate: wzp-client, wzp-proto, wzp-relay
## Problem
WarzonePhone falls back to relay-only when the client is behind a symmetric NAT (different external port per destination). The STUN-discovered reflexive address won't match what a peer sees, so direct hole-punching fails. Tailscale reports ~70% of consumer routers support NAT-PMP, PCP, or UPnP — protocols that let clients request explicit port mappings, making symmetric NATs traversable.
## Solution
Implement all three port mapping protocols, tried in sequence (NAT-PMP -> PCP -> UPnP). When a mapping is acquired, advertise the mapped address as a new candidate type alongside reflexive and host candidates. The relay cross-wires it into `CallSetup.peer_mapped_addr` so the peer can dial it.
## Implementation
### New Module: `crates/wzp-client/src/portmap.rs`
**NAT-PMP (RFC 6886)**:
- UDP to gateway:5351
- External address request (opcode 0) -> returns router's public IP
- Map UDP request (opcode 1) -> returns mapped external port + lifetime
- Mapping request is 12 bytes and its response is 16 bytes (the external address request and response are 2 and 12 bytes)
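The mapping exchange is compact enough to show byte by byte. This sketch follows the RFC 6886 layout; the function and struct names are hypothetical and do not claim to match `portmap.rs`.

```rust
/// Encodes a NAT-PMP "map UDP" request (12 bytes, network byte order).
pub fn encode_map_udp_request(
    internal_port: u16,
    suggested_external_port: u16,
    lifetime_secs: u32,
) -> [u8; 12] {
    let mut req = [0u8; 12];
    req[0] = 0; // version 0 = NAT-PMP
    req[1] = 1; // opcode 1 = map UDP
    // Bytes 2..4 are reserved and stay zero.
    req[4..6].copy_from_slice(&internal_port.to_be_bytes());
    req[6..8].copy_from_slice(&suggested_external_port.to_be_bytes());
    req[8..12].copy_from_slice(&lifetime_secs.to_be_bytes());
    req
}

#[derive(Debug, PartialEq, Eq)]
pub struct MapResponse {
    pub internal_port: u16,
    pub external_port: u16,
    pub lifetime_secs: u32,
}

/// Parses the 16-byte response to a "map UDP" request.
/// Returns `None` on a malformed packet or a non-zero result code.
pub fn parse_map_udp_response(resp: &[u8]) -> Option<MapResponse> {
    // Response opcode is 128 + request opcode.
    if resp.len() != 16 || resp[0] != 0 || resp[1] != 129 {
        return None;
    }
    if u16::from_be_bytes([resp[2], resp[3]]) != 0 {
        return None; // result code 0 = success
    }
    // Bytes 4..8 carry the gateway's seconds-since-epoch counter (unused here).
    Some(MapResponse {
        internal_port: u16::from_be_bytes([resp[8], resp[9]]),
        external_port: u16::from_be_bytes([resp[10], resp[11]]),
        lifetime_secs: u32::from_be_bytes([resp[12], resp[13], resp[14], resp[15]]),
    })
}
```

Sending the same request with a lifetime of 0 asks the gateway to delete the mapping, which is the release path mentioned under the public API.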
**PCP (RFC 6887)**:
- Same gateway:5351, version 2
- MAP opcode with client IP as IPv4-mapped IPv6
- 60-byte request/response with 12-byte nonce for anti-spoofing
- Superset of NAT-PMP, supports IPv6
**UPnP IGD**:
- SSDP M-SEARCH to 239.255.255.250:1900 for InternetGatewayDevice discovery
- Parse LOCATION header -> fetch device description XML -> find WANIPConnection controlURL
- SOAP `GetExternalIPAddress` -> router's public IP
- SOAP `AddPortMapping` -> maps the QUIC port
**Gateway discovery**:
- macOS: `route -n get default` (parse `gateway:` line)
- Linux/Android: `/proc/net/route` (parse hex gateway for 00000000 destination)
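The Linux/Android path amounts to finding the row whose destination is `00000000` and decoding the hex gateway column. A sketch, assuming a little-endian host (true for the usual x86 and ARM targets); `parse_default_gateway` is a hypothetical helper that takes the file contents as a string so it can be tested without touching `/proc`.

```rust
use std::net::Ipv4Addr;

/// Finds the default gateway in the text of `/proc/net/route`.
pub fn parse_default_gateway(route_table: &str) -> Option<Ipv4Addr> {
    // The first line is the column header.
    route_table.lines().skip(1).find_map(|line| {
        let mut cols = line.split_whitespace();
        let _iface = cols.next()?;
        let destination = cols.next()?;
        let gateway = cols.next()?;
        if destination != "00000000" {
            return None; // not the default route
        }
        let raw = u32::from_str_radix(gateway, 16).ok()?;
        if raw == 0 {
            return None; // default route without a gateway
        }
        // The kernel prints the address as a host-order integer, so on a
        // little-endian host the bytes come out reversed: "0101A8C0" is
        // 192.168.1.1.
        Some(Ipv4Addr::from(raw.to_le_bytes()))
    })
}
```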
**Public API**:
- `acquire_port_mapping(internal_port, local_ip)` -> tries all 3, first success wins
- `release_port_mapping(mapping)` -> best-effort cleanup (lifetime=0 for NAT-PMP)
- `spawn_refresh(mapping)` -> background task renewing at half-lifetime
- `default_gateway()` -> cross-platform gateway discovery
### Signal Protocol Extensions
| Message | New Field | Purpose |
|---------|-----------|---------|
| `DirectCallOffer` | `caller_mapped_addr: Option<String>` | Caller's port-mapped address |
| `DirectCallAnswer` | `callee_mapped_addr: Option<String>` | Callee's port-mapped address |
| `CallSetup` | `peer_mapped_addr: Option<String>` | Relay cross-wires peer's mapped addr |
All fields use `#[serde(default, skip_serializing_if)]` for backward compatibility.
### Relay Cross-Wiring
`CallRegistry` extended with `caller_mapped_addr` / `callee_mapped_addr` fields + setter methods. The relay:
1. Extracts `caller_mapped_addr` from `DirectCallOffer`, stores in registry
2. Extracts `callee_mapped_addr` from `DirectCallAnswer`, stores in registry
3. Cross-wires into `CallSetup`: caller gets callee's mapped addr as `peer_mapped_addr`, and vice versa
### Candidate Priority
`PeerCandidates.mapped` added to `dual_path.rs`. Dial order:
1. Host (LAN) candidates — fastest on same-LAN
2. **Port-mapped** — stable even behind symmetric NATs
3. Server-reflexive (STUN) — standard hole-punching
4. Relay — always-available fallback
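The ordering and de-duplication of direct candidates could look like the sketch below. The struct layout is an assumption made for illustration (the real `PeerCandidates` may hold different field types), and the relay is left out because it is the fallback path rather than a dial target in this list.

```rust
use std::net::SocketAddr;

/// Hypothetical shape of the peer's direct candidates.
#[derive(Debug, Default, Clone)]
pub struct PeerCandidates {
    pub host: Vec<SocketAddr>,
    pub mapped: Option<SocketAddr>,
    pub reflexive: Option<SocketAddr>,
}

impl PeerCandidates {
    /// Returns addresses in dial order (host, then mapped, then reflexive),
    /// keeping only the first occurrence of each address.
    pub fn dial_order(&self) -> Vec<SocketAddr> {
        let mut order: Vec<SocketAddr> = Vec::new();
        let all = self
            .host
            .iter()
            .chain(self.mapped.iter())
            .chain(self.reflexive.iter());
        for addr in all {
            if !order.contains(addr) {
                order.push(*addr);
            }
        }
        order
    }

    /// True when there is nothing to dial directly.
    pub fn is_empty(&self) -> bool {
        self.dial_order().is_empty()
    }
}
```

De-duplication matters because a port-preserving NAT often yields identical mapped and reflexive addresses, and dialing the same address twice only wastes a probe.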
### Desktop Integration
Both `place_call()` and `answer_call()` call `acquire_port_mapping()` using the signal endpoint's local port. Privacy-mode answers (`AcceptGeneric`) skip portmap to keep the address hidden.
## Files
| File | Change |
|------|--------|
| `crates/wzp-client/src/portmap.rs` | New — NAT-PMP/PCP/UPnP client |
| `crates/wzp-client/src/dual_path.rs` | `PeerCandidates.mapped` field + dial_order update |
| `crates/wzp-proto/src/packet.rs` | `caller/callee_mapped_addr` + `peer_mapped_addr` fields |
| `crates/wzp-relay/src/call_registry.rs` | `caller/callee_mapped_addr` fields + setters |
| `crates/wzp-relay/src/main.rs` | Extract, store, cross-wire mapped addrs |
| `desktop/src-tauri/src/lib.rs` | Call portmap in place_call/answer_call |
## Testing
- 18 unit tests: NAT-PMP encoding, UPnP XML parsing (5 variants including real-world router XML), URL host extraction, error Display, protocol serde, PortMapping serialization, gateway detection, constants verification
- 2 integration tests (`#[ignore]`): gateway discovery, acquire_mapping
- 9 PeerCandidates tests: dial_order with all types, dedup, is_empty edge cases
- 12 protocol roundtrip tests: offer/answer/setup with mapped addr, backward compat without


@@ -0,0 +1,205 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Protocol Analyzer & Debug Tap
## 1. Relay-Side Metadata Tap (`--debug-tap`)
### Problem
When debugging federation, codec issues, or packet flow problems, there's no visibility into what's actually flowing through the relay. You have to guess from client-side logs.
### Solution
A `--debug-tap <room>` flag on the relay that logs every packet's **header metadata** for a specific room (or all rooms with `--debug-tap *`). No decryption needed — the MediaHeader is not encrypted, only the audio payload is.
### Output Format
```
[12:00:00.123] TAP room=test dir=in src=192.168.1.5:54321 seq=1234 codec=Opus24k ts=24000 fec_block=5 fec_sym=2 repair=false len=87
[12:00:00.123] TAP room=test dir=out dst=192.168.1.6:54322 seq=1234 codec=Opus24k ts=24000 fec_block=5 fec_sym=2 repair=false len=87 fan_out=2
[12:00:00.143] TAP room=test dir=in src=192.168.1.5:54321 seq=1235 codec=Opus24k ts=24960 fec_block=5 fec_sym=3 repair=false len=91
[12:00:00.500] TAP room=test dir=in src=192.168.1.6:54322 seq=0042 codec=Codec2_1200 ts=40000 fec_block=1 fec_sym=0 repair=false len=6
[12:00:01.000] TAP room=test SIGNAL type=RoomUpdate count=3 participants=[Alice,Bob,Charlie]
[12:00:05.000] TAP room=test STATS period=5s in_pkts=250 out_pkts=500 fan_out_avg=2.0 loss_detected=0 codecs_seen=[Opus24k,Codec2_1200]
```
### What it shows
- **Per-packet**: direction, source/dest, sequence number, codec ID, timestamp, FEC block/symbol, repair flag, payload size
- **Signals**: RoomUpdate, FederationRoomJoin/Leave, handshake events
- **Periodic stats**: packets in/out, average fan-out, codecs seen, detected sequence gaps (loss)
- **Federation**: room-hash tagged datagrams with source/dest relay
### Implementation
**File:** `crates/wzp-relay/src/room.rs` — in `run_participant_plain()` and `run_participant_trunked()`
After receiving a packet and before forwarding:
```rust
if debug_tap_enabled {
    let h = &pkt.header;
    info!(
        room = %room_name,
        dir = "in",
        src = %addr,
        seq = h.seq,
        codec = ?h.codec_id,
        ts = h.timestamp,
        fec_block = h.fec_block,
        fec_sym = h.fec_symbol,
        repair = h.is_repair,
        len = pkt.payload.len(),
        "TAP"
    );
}
```
**Activation:** `--debug-tap <room_name>` CLI flag, or `debug_tap = "test"` / `debug_tap = "*"` in TOML config.
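The room-matching rule implied by the flag (a specific room name, or `*` for every room) can be sketched as below. This is a minimal illustration, not the relay's code: `tap_matches` is a hypothetical helper, and the sketch assumes the configured value is carried as an `Option<&str>`.

```rust
/// Hypothetical helper (not from the relay source): decides whether a room
/// is covered by the debug tap. `filter` is the value of `--debug-tap` /
/// `debug_tap`, or `None` when the tap is disabled.
pub fn tap_matches(filter: Option<&str>, room_name: &str) -> bool {
    match filter {
        None => false,                    // tap disabled
        Some("*") => true,                // wildcard: every room
        Some(name) => name == room_name,  // exact room-name match
    }
}
```

With this shape the per-packet check stays a cheap string comparison, and a disabled tap costs one `None` branch.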
**Performance:** Only active when enabled. When enabled, adds one `info!()` log per packet per direction. At 50 packets/s × 5 participants × 2 directions = 500 log lines/sec, which is acceptable for debugging but not for production.
**Output options:**
- Default: tracing log (stderr)
- `--debug-tap-file <path>`: write to a dedicated file (JSONL format for machine parsing)
### Effort: 0.5 day
### Implementation Status (2026-04-13)
Fully implemented. `--debug-tap <room>` (or `*` for all rooms) logs:
- **Per-packet metadata** (`TAP`): direction, addr, seq, codec, timestamp, FEC fields, payload size, fan_out
- **Signal events** (`TAP SIGNAL`): `RoomUpdate` (count + participant names), `QualityDirective` (codec + reason), other signals by discriminant
- **Lifecycle events** (`TAP EVENT`): participant join (id, addr, alias), participant leave (id, addr, forwarded count, or room closed)
All output uses tracing `target: "debug_tap"` so it can be filtered with `RUST_LOG=debug_tap=info`.
---
## 2. Full Protocol Analyzer (Standalone Tool)
### Problem
The metadata tap shows packet flow but can't inspect audio content, verify encryption, or measure audio quality. For deep debugging (codec issues, resampling bugs, encryption mismatches), you need to see the actual decrypted audio.
### Solution
A standalone `wzp-analyzer` binary that either:
- **A)** Acts as a transparent proxy between client and relay (MITM mode)
- **B)** Reads a pcap/capture file with QUIC session keys (passive mode)
- **C)** Runs as a special "observer" client that joins a room in listen-only mode with all participants' consent
### Architecture
**Option C (recommended — simplest, no MITM):**
```
                  ┌──────────────┐
Client A ────────►│    Relay     │◄──────── Client B
                  │              │
                  │    (SFU)     │◄──────── wzp-analyzer
                  └──────────────┘          (observer mode)
                                          ┌──────────────────┐
                                          │ Decode + Analyze │
                                          │ - Packet timing  │
                                          │ - Codec decode   │
                                          │ - Audio quality  │
                                          │ - Jitter stats   │
                                          │ - Waveform plot  │
                                          └──────────────────┘
```
The analyzer joins the room as a regular participant (receives all media via SFU forwarding) but doesn't send audio. It decodes everything it receives and produces analysis.
**Limitation:** End-to-end encrypted payloads can't be decoded without session keys. The analyzer would either:
1. Need the session key (shared out-of-band for debugging)
2. Or only analyze unencrypted headers + timing (same as the relay tap, but from client perspective with jitter buffer simulation)
For now, since encryption is not fully enforced in the current codebase (the crypto session is established but the actual ChaCha20 encryption of payloads is TODO in some paths), the analyzer can decode raw Opus/Codec2 payloads directly.
### Features
**Real-time display (TUI):**
```
┌─ wzp-analyzer: room "podcast" on 193.180.213.68:4433 ─────────────┐
│ │
│ Participants: Alice (Opus24k), Bob (Codec2_3200) │
│ │
│ Alice ──────────────────────────────────────── │
│ seq: 5234 codec: Opus24k ts: 125760 loss: 0.2% jitter: 3ms │
│ RMS: 4521 peak: 15280 silence: no │
│ FEC blocks: 1046/1046 complete (0 recovered) │
│ ▁▂▃▅▇█▇▅▃▂▁▁▂▃▅▇█▇▅▃▂▁ (waveform last 1s) │
│ │
│ Bob ────────────────────────────────────── │
│ seq: 2617 codec: Codec2_3200 ts: 62800 loss: 1.5% jitter: 8ms│
│ RMS: 1250 peak: 6800 silence: no │
│ FEC blocks: 523/525 complete (4 recovered) │
│ ▁▁▂▃▅▇▅▃▂▁▁▁▂▃▅▇▅▃▂▁▁ (waveform last 1s) │
│ │
│ Total: 7851 pkts recv, 0 pkts sent, 2 participants │
│ Uptime: 2m 35s │
└──────────────────────────────────────────────────────────────────────┘
```
**Recorded analysis:**
- Save all received packets to a capture file
- Post-session report: per-participant stats, quality timeline, codec switches, packet loss patterns
- Export decoded audio as WAV per participant (if decryptable)
**Quality metrics per participant:**
- Packet loss % (from sequence gaps)
- Jitter (inter-arrival time variance)
- Codec switches (timestamps + reasons)
- RMS audio level over time
- Silence detection
- FEC recovery rate
- Round-trip estimates (from Ping/Pong if available)
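The first two metrics can be derived from header fields and arrival times alone. The sketch below is illustrative rather than the analyzer's actual code: `StreamStats` is a hypothetical type, the sequence number is assumed to be a wrapping `u16`, and jitter is taken as the standard deviation of inter-arrival times computed with Welford's running variance.

```rust
/// Hypothetical per-participant accumulator (names are illustrative).
#[derive(Default)]
pub struct StreamStats {
    last_seq: Option<u16>,        // assumed 16-bit wrapping sequence number
    received: u64,
    lost: u64,
    last_arrival_ms: Option<f64>,
    iat_count: u64,               // Welford state for inter-arrival times
    iat_mean: f64,
    iat_m2: f64,
}

impl StreamStats {
    pub fn observe(&mut self, seq: u16, arrival_ms: f64) {
        self.received += 1;
        match self.last_seq {
            None => self.last_seq = Some(seq),
            Some(prev) => {
                let gap = seq.wrapping_sub(prev);
                // A forward jump larger than 1 means packets were skipped.
                // Duplicates and reordered packets (gap == 0 or "negative")
                // are ignored here; a late packet is not un-counted.
                if gap >= 1 && gap < 0x8000 {
                    self.lost += (gap - 1) as u64;
                    self.last_seq = Some(seq);
                }
            }
        }
        if let Some(prev) = self.last_arrival_ms {
            let iat = arrival_ms - prev;
            self.iat_count += 1;
            let delta = iat - self.iat_mean;
            self.iat_mean += delta / self.iat_count as f64;
            self.iat_m2 += delta * (iat - self.iat_mean);
        }
        self.last_arrival_ms = Some(arrival_ms);
    }

    /// Loss percentage estimated from sequence gaps.
    pub fn loss_pct(&self) -> f64 {
        let expected = self.received + self.lost;
        if expected == 0 { 0.0 } else { 100.0 * self.lost as f64 / expected as f64 }
    }

    /// Jitter as the sample standard deviation of inter-arrival times (ms).
    pub fn jitter_ms(&self) -> f64 {
        if self.iat_count < 2 { 0.0 } else { (self.iat_m2 / (self.iat_count - 1) as f64).sqrt() }
    }
}
```

Feeding sequence numbers 10, 11, 13, 14 at 0, 20, 60 and 80 ms yields one inferred loss out of five expected packets (20 %) and a jitter of roughly 11.5 ms.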
### Implementation
**Binary:** `wzp-analyzer` (new crate or subcommand of `wzp-client`)
```
wzp-analyzer 193.180.213.68:4433 --room podcast
wzp-analyzer 193.180.213.68:4433 --room podcast --record capture.wzp
wzp-analyzer --replay capture.wzp --report report.html
```
**Dependencies:**
- Existing: `wzp-transport`, `wzp-proto`, `wzp-codec`, `wzp-crypto`
- New: `ratatui` for TUI display (optional)
### Phases
| Phase | Scope | Effort |
|-------|-------|--------|
| 1 | Header-only analysis: join room, log packet metadata, show per-participant stats (TUI) | 2 days |
| 2 | Audio decode: decode Opus/Codec2 payloads (unencrypted path), show waveform + RMS | 1-2 days |
| 3 | Capture/replay: save packets to file, replay offline with full analysis | 1 day |
| 4 | HTML report: post-session quality report with charts | 2 days |
| 5 | Encrypted payload support: accept session keys, decrypt ChaCha20 | 1 day |
### Non-Goals (v1)
- Active probing (sending test patterns)
- Modifying packets in transit
- Automated quality scoring (MOS estimation)
- Video support
## Implementation Status (2026-04-13)
All phases implemented:
- Phase 1 (Observer + stats): wzp-analyzer binary, passive room observer, per-participant stats — DONE
- Phase 2 (TUI): ratatui display with color-coded loss severity — DONE
- Phase 3 (Capture/Replay): Binary .wzp format + CaptureReader for offline replay — DONE
- Phase 4 (HTML report): Self-contained with Chart.js loss/jitter timelines — DONE
- Phase 5 (Encrypted decode): Stub — SFU E2E encryption requires session context. Header-only analysis works. — PARTIAL
Binary: `cargo build --bin wzp-analyzer`
Usage: `wzp-analyzer relay:4433 --room test [--capture out.wzp] [--html report.html] [--no-tui]`


@@ -0,0 +1,114 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Protocol Hardening Batch
> **Status:** proposed
> **Resolves:** Audit W2 (fec_block_id width), W3 (timestamp rebase doc), W5 (QualityReport AEAD binding), W11 (per-stream anti-replay), W12 (signal version byte), W13 (RoomManager lock).
> **Depends on:** PRD #1 (wire format v2 already widens block_id field).
## Problem
A handful of medium-priority audit findings that don't individually justify a PRD but together represent the long tail of protocol correctness and concurrency. Batching them avoids version churn.
## Items
### H1 — W5: `QualityReport` trailer must be inside AEAD
**Current risk.** If the 4-byte trailer sits *outside* the encrypted payload, anything stripping the last 4 bytes corrupts AEAD verification on legitimate packets and creates a quality-feedback downgrade vector. Even if it's correctly inside today, the v2 wire format change is the right moment to assert this explicitly.
**Action.**
- Audit `crates/wzp-proto/src/packet.rs` for `QualityReport` placement.
- Move inside AEAD payload if currently outside.
- Document: "QualityReport, when Q-flag set, is appended to plaintext payload before encryption."
- Test: tamper with trailer → AEAD decrypt fails.
**Severity.** Security correctness. Do this in Wave 1.
### H2 — W2: `fec_block_id` width
Resolved by v2 wire format (`u16` instead of `u8`). PRD #1 carries the wire change; this PRD just confirms semantics:
- Wraps at 2^16 blocks. At 5-frame blocks and 50 pps (10 blocks/s) → ~109 min between collisions, vs. ~25 s in v1.
- Late-joining peers must still discard FEC blocks older than 2 s; widening is defense in depth.
**Action.** Update `wzp-fec` to operate on u16 block_id end-to-end. Test reconstruction across a synthetic 22-min session.
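The wrap interval follows directly from the block-id width, the block size, and the packet rate. A small sketch of that arithmetic, assuming one source packet per frame and ignoring repair packets (`block_id_wrap_secs` is a hypothetical helper, not part of `wzp-fec`):

```rust
/// Seconds until a block-id counter of `id_bits` bits wraps, assuming each
/// block spans `frames_per_block` source packets sent at `packets_per_sec`.
pub fn block_id_wrap_secs(id_bits: u32, frames_per_block: u64, packets_per_sec: u64) -> f64 {
    let blocks = 1u64 << id_bits; // number of distinct block ids
    (blocks * frames_per_block) as f64 / packets_per_sec as f64
}
```

Under these assumptions, `block_id_wrap_secs(8, 5, 50)` gives 25.6 s for the v1 `u8` field and `block_id_wrap_secs(16, 5, 50)` gives 6553.6 s (about 109 minutes) for the v2 `u16` field.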
### H3 — W11: Per-stream, per-`MediaType` anti-replay window
**Current.** 64-packet sliding window globally.
**Problem.** Video keyframe burst (100+ packets) can stall the window behind one reordered prior packet.
**Action.**
- Anti-replay state is per (stream_id, media_type).
- Window size: 64 for audio, 1024 for video, 256 for data.
- Window size selected at session setup based on declared profile; tunable via `QualityProfile`.
**Severity.** Required before video. Wave 1.
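One way to realize the per-(stream_id, media_type) window is a ring bitmap keyed by that pair. The sketch below is a hedged illustration, not the `wzp-crypto` implementation: `MediaType`, `ReplayWindow`, and `ReplayGuard` are hypothetical names, and sequence numbers are assumed to be already extended to `u64`.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum MediaType { Audio, Video, Data }

impl MediaType {
    /// Window sizes taken from the action list above.
    pub fn window_size(self) -> u64 {
        match self { MediaType::Audio => 64, MediaType::Video => 1024, MediaType::Data => 256 }
    }
}

/// Sliding anti-replay window: accepts each sequence number at most once,
/// and only if it is within `size` of the highest number seen so far.
pub struct ReplayWindow { size: u64, highest: Option<u64>, bits: Vec<u64> }

impl ReplayWindow {
    pub fn new(size: u64) -> Self {
        Self { size, highest: None, bits: vec![0u64; ((size + 63) / 64) as usize] }
    }

    /// Ring-bitmap slot for a sequence number: (word index, bit mask).
    fn slot(&self, seq: u64) -> (usize, u64) {
        let i = seq % self.size;
        ((i / 64) as usize, 1u64 << (i % 64))
    }

    pub fn check_and_update(&mut self, seq: u64) -> bool {
        match self.highest {
            Some(h) if seq <= h => {
                if h - seq >= self.size { return false; } // too old
                let (w, m) = self.slot(seq);
                if self.bits[w] & m != 0 { return false; } // replay
                self.bits[w] |= m;
                true
            }
            prev => {
                match prev {
                    // Small advance: clear only the slots that fall out of the window.
                    Some(h) if seq - h < self.size => {
                        for s in (h + 1)..seq {
                            let (w, m) = self.slot(s);
                            self.bits[w] &= !m;
                        }
                    }
                    // First packet or a jump past the whole window: reset.
                    _ => self.bits.iter_mut().for_each(|b| *b = 0),
                }
                let (w, m) = self.slot(seq);
                self.bits[w] |= m;
                self.highest = Some(seq);
                true
            }
        }
    }
}

/// Independent replay state per (stream_id, media_type).
#[derive(Default)]
pub struct ReplayGuard { windows: HashMap<(u32, MediaType), ReplayWindow> }

impl ReplayGuard {
    pub fn accept(&mut self, stream_id: u32, media: MediaType, seq: u64) -> bool {
        self.windows
            .entry((stream_id, media))
            .or_insert_with(|| ReplayWindow::new(media.window_size()))
            .check_and_update(seq)
    }
}
```

Because the state is keyed per stream and media type, a 900-packet video burst leaves a reordered video packet acceptable inside its 1024-slot window while the audio stream keeps its own 64-slot window.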
### H4 — W12: `SignalMessage` versioning
**Current.** Bincode-serialized enum. `#[serde(default, skip_serializing_if)]` handles field additions; variant removals or semantic changes are unsafe.
**Action.**
- Every variant gains `version: u8` as its first field.
- Add `SignalMessage::Unknown { version, raw: Bytes }` to absorb future unknown variants gracefully.
- Decode path: unknown variant → log + drop, do not close session.
**Severity.** Future-proofing. Wave 3.
### H5 — W3: `timestamp_ms` rebase documentation
**Current.** Behavior at rekey (every 65,536 packets, ~22 min) is not documented.
**Decision (this PRD).** `timestamp_ms` is **monotonic across rekeys** — it does not reset. Rekey changes only the cryptographic key material; sequence and timestamp are session-scoped, not key-scoped.
**Action.**
- Document in `WZP-SPEC.md` and inline in `packet.rs` doc comments.
- Add a test that performs a rekey mid-session and asserts `timestamp_ms` continuity.
**Severity.** Doc + test. Wave 3.
### H6 — W13: `RoomManager` lock concurrency
**Current.** Single `Mutex<RoomManager>` acquired per packet by every participant for fan-out peer list. Serializes packet processing within a room.
**Problem.** At 1500 pps/sender for video, this is the dominant bottleneck.
**Action.**
- Migrate to `DashMap<RoomId, Arc<RwLock<Room>>>`.
- Per-room `RwLock` allows concurrent reads (fan-out peer list) and exclusive writes (join/leave/quality changes).
- Fan-out path holds read lock; participant churn holds write lock.
- Federation manager updated to match.
**Severity.** Required for video scale. Wave 3.
**Migration safety.**
- Integration test suite (40 + 4 relay tests) must pass.
- Federation tests must pass.
- Trunking tests must pass.
- Property-test: 100-participant room, 500 join/leave events, 10k packets — no panics, no missed forwards.
## Implementation order
| Wave | Item | Task |
|---|---|---|
| 1 | H1 (W5 AEAD binding) | T1.4 |
| 1 | H3 (W11 anti-replay per-stream) | T1.5 |
| 1 | H2 (W2 block_id widening) | folded into PRD #1 |
| 3 | H4 (W12 signal versioning) | T3.3 |
| 3 | H5 (W3 timestamp doc) | T3.2 |
| 3 | H6 (W13 RoomManager lock) | T3.4 |
## Acceptance criteria
- All current tests pass post-hardening.
- New tests: AEAD trailer tampering, rekey timestamp continuity, 100-participant property test, signal forward-compat decode.
- No Prometheus regression in fan-out latency p99 after H6.
## Effort
~4.5 engineer-days total (1.5 in Wave 1, 3 in Wave 3).


@@ -0,0 +1,73 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Public STUN Client
> Phase: Implemented
> Status: Done (2026-04-14)
> Crate: wzp-client
## Problem
WarzonePhone's reflexive address discovery depends entirely on relay-based `Reflect` messages over an authenticated QUIC signal channel. If the relay is unreachable, overloaded, or not yet connected, the client cannot discover its public IP:port for P2P hole-punching. This single point of failure means call setup is delayed or falls back to relay-only unnecessarily.
Tailscale solves this by querying multiple public STUN servers in parallel, independent of its DERP relay infrastructure.
## Solution
Implement a minimal RFC 5389 STUN Binding client over raw UDP that queries public STUN servers (Google, Cloudflare) in parallel. This provides:
1. **Independent reflexive discovery** — works without any relay connection
2. **Redundancy** — STUN fallback when relay reflection fails
3. **Better NAT classification** — more probes = higher confidence in Cone vs Symmetric detection
4. **Faster call setup** — STUN can run before signal registration completes
## Implementation
### New Module: `crates/wzp-client/src/stun.rs`
**Wire format** (RFC 5389):
- 20-byte header: type (u16) + length (u16) + magic cookie (0x2112A442) + transaction ID (12 bytes)
- Binding Request (0x0001): no attributes, just the header
- Binding Response (0x0101): parses XOR-MAPPED-ADDRESS (0x0020, preferred) and MAPPED-ADDRESS (0x0001, fallback)
- XOR decoding: port XOR'd with top 16 bits of magic cookie, IPv4 XOR'd with cookie, IPv6 XOR'd with cookie || txn ID
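The header layout and the IPv4 XOR decoding described above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of `stun.rs`: the function names are hypothetical, and only the IPv4 case of XOR-MAPPED-ADDRESS is shown.

```rust
use std::net::{Ipv4Addr, SocketAddr, SocketAddrV4};

const MAGIC_COOKIE: u32 = 0x2112_A442;
const BINDING_REQUEST: u16 = 0x0001;

/// Build a 20-byte Binding Request: type, zero length, cookie, transaction ID.
pub fn encode_binding_request(txn_id: &[u8; 12]) -> [u8; 20] {
    let mut msg = [0u8; 20];
    msg[0..2].copy_from_slice(&BINDING_REQUEST.to_be_bytes());
    // msg[2..4] stays zero: a Binding Request carries no attributes.
    msg[4..8].copy_from_slice(&MAGIC_COOKIE.to_be_bytes());
    msg[8..20].copy_from_slice(txn_id);
    msg
}

/// Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute
/// (reserved byte, family, x-port, x-address).
pub fn decode_xor_mapped_v4(value: &[u8]) -> Option<SocketAddr> {
    if value.len() < 8 || value[1] != 0x01 {
        return None; // truncated, or not the IPv4 family
    }
    // Port is XOR'd with the top 16 bits of the magic cookie.
    let port = u16::from_be_bytes([value[2], value[3]]) ^ (MAGIC_COOKIE >> 16) as u16;
    // IPv4 address is XOR'd with the full cookie.
    let ip = u32::from_be_bytes([value[4], value[5], value[6], value[7]]) ^ MAGIC_COOKIE;
    Some(SocketAddr::V4(SocketAddrV4::new(Ipv4Addr::from(ip), port)))
}
```

Using the sample attribute bytes from the RFC 5769 test vectors, `00 01 a1 47 e1 12 a6 43` decodes to `192.0.2.1:32853`.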
**Public API**:
- `stun_reflect(socket, server, timeout)` — single-server probe with one retry on first-packet timeout
- `discover_reflexive(config)` — parallel probe of N servers, first success wins
- `probe_stun_servers(config)` — all-server probe returning `Vec<NatProbeResult>` for NAT classification
- `resolve_stun_server(host_port)` — DNS resolution preferring IPv4
**Default servers**: `stun.l.google.com:19302`, `stun1.l.google.com:19302`, `stun.cloudflare.com:3478`
**Error handling**: `StunError` enum — Io, Timeout, Malformed, TxnMismatch, ErrorResponse, NoMappedAddress, DnsError
### Integration Points
1. **`reflect.rs`**: New `detect_nat_type_with_stun()` runs relay probes and STUN probes concurrently via `tokio::join!`, merges results, re-classifies
2. **Desktop `lib.rs`**: `try_reflect_own_addr()` falls back to `try_stun_fallback()` when relay reflection fails or times out
3. **Desktop `detect_nat_type` command**: Uses `detect_nat_type_with_stun()` for combined relay + STUN classification
### Design Decisions
- **Separate UDP socket** per STUN probe — can't share the QUIC socket (quinn owns its I/O driver)
- **No external crate** — RFC 5389 Binding is ~200 lines of code, no need for `stun-rs` or `webrtc-rs`
- **Retry once** at half-timeout — handles the "first-packet problem" where some NATs drop the initial UDP packet to a new destination
- **IPv4 preferred** for DNS resolution — Phase 7 IPv6 is still flaky
## Files
| File | Change |
|------|--------|
| `crates/wzp-client/src/stun.rs` | New — STUN client |
| `crates/wzp-client/src/lib.rs` | Add `pub mod stun` |
| `crates/wzp-client/src/reflect.rs` | Add `detect_nat_type_with_stun()` |
| `crates/wzp-client/Cargo.toml` | Add `rand` dependency |
| `desktop/src-tauri/src/lib.rs` | STUN fallback in `try_reflect_own_addr()`, STUN in `detect_nat_type` |
## Testing
- 22 unit tests: encode/decode roundtrips, XOR-MAPPED-ADDRESS (IPv4, IPv6, high port), MAPPED-ADDRESS fallback (IPv4, IPv6), unknown family, attribute padding, unknown attributes skipped, truncated attributes, error response, bad cookie, txn mismatch, too short, no mapped address, XOR preferred over mapped, error Display, default config, empty servers
- 2 integration tests (`#[ignore]`): query `stun.l.google.com`, multi-server probe


@@ -0,0 +1,319 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Relay Concurrency — DashMap Room Sharding
## Problem
The relay's media forwarding hot path routes every packet through a single `Arc<Mutex<RoomManager>>`. In a room with N participants, all N per-participant tasks compete for this one lock on every packet. The lock hold time is short (~1ms, no I/O), but the serialization means a 100-participant room effectively runs single-threaded despite having a multi-core tokio runtime.
Separately, the federation manager holds `peer_links` locked across multiple network sends, meaning a slow federation peer blocks all others.
### Measured bottleneck (from code audit)
```
Per-packet hot path (room.rs:748-757, 968-976):
  lock(room_mgr)
    → observe_quality()    O(N) iterate qualities HashMap
    → others()             O(M) clone Vec<ParticipantSender>
  unlock
    → fan-out sends        sequential, no lock held
```
Lock contention = O(N) per room per packet, where N = participants in the room.
### Current lock inventory (hot path only)
| Lock | Location | Hold Duration | I/O While Locked | Frequency |
|------|----------|---------------|-------------------|-----------|
| `RoomManager` | room.rs:749, 968 | ~1ms | No | Every packet, every participant |
| `RoomManager` | room.rs:845, 1041 | <1ms | No | Every 5s per participant |
| `RoomManager` | room.rs:870 | ~1ms | No (explicit `drop` before broadcast) | On leave |
| `peer_links` | federation.rs:409 | N × send latency | **YES** (`send_raw_datagram` in loop) | Every federation packet |
| `peer_links` | federation.rs:216 | N × send latency | **YES** (`send_signal` in loop) | Every federation signal |
| `dedup` | federation.rs:1066 | <1ms | No | Every federation ingress packet |
| `rate_limiters` | federation.rs:1113 | <1ms | No | Every federation ingress packet |
### Scaling impact
| Room Size | Effective Core Usage | Bottleneck |
|-----------|---------------------|------------|
| 3 people × 100 rooms | All cores | None |
| 10 people × 10 rooms | Most cores | Mild contention per room |
| 100 people × 1 room | ~1 core | RoomManager lock |
| 1000 people × 1 room | ~1 core | Severely serialized |
## Goals
- Eliminate the global RoomManager Mutex as a serialization point for media forwarding
- Allow per-room parallelism: packets in room A don't block packets in room B
- Fix federation `peer_links` lock held across network sends
- Maintain correctness: no double-delivery, no stale participant lists
- Zero-copy or minimal-clone for fan-out participant lists
- Keep the refactor incremental — each phase independently shippable
## Non-Goals
- Lock-free data structures (overkill for our scale; DashMap or per-room Mutex is sufficient)
- Changing the SFU forwarding model (no mixing, no transcoding)
- Optimizing single-room beyond ~1000 participants (conferencing at that scale needs a different architecture)
- Changing the wire protocol or client behavior
## Design Options Evaluated
### Option A: Per-Room `Arc<Mutex<Room>>`
**Approach:** Replace `HashMap<String, Room>` inside RoomManager with `HashMap<String, Arc<Mutex<Room>>>`. The outer HashMap is protected by a short-lived lock for room lookup only; the per-room lock protects participant state.
```rust
struct RoomManager {
    rooms: Mutex<HashMap<String, Arc<Mutex<Room>>>>, // outer: room lookup
    // ...
}

// Hot path becomes:
let room_arc = {
    let rooms = room_mgr.rooms.lock().await;
    rooms.get(&room_name).cloned() // Arc clone, <1ns
}; // outer lock released
if let Some(room) = room_arc {
    let room = room.lock().await; // per-room lock
    let others = room.others(participant_id);
    drop(room);
    // fan-out sends...
}
```
**Pros:**
- Rooms are fully independent — room A's lock doesn't block room B
- Minimal code change (~50 lines)
- Per-room lock contention = O(participants in that room), not O(total participants)
- Outer lock held for <1μs (just a HashMap get + Arc clone)
**Cons:**
- Two-level locking (room lookup + room lock) — slightly more complex
- Room creation/deletion still serialized through outer lock (acceptable, rare operation)
- Quality tracking needs to move into the Room struct
**Verdict: Best option. Biggest win for least effort.**
### Option B: `DashMap<String, Room>`
**Approach:** Replace `Mutex<HashMap<String, Room>>` with `dashmap::DashMap<String, Room>`. DashMap uses internal sharding with per-shard RwLocks; the default shard count is derived from the number of CPUs (64 on a 16-core machine) and can be set explicitly.
```rust
struct RoomManager {
    rooms: DashMap<String, Room>,
}

// Hot path:
if let Some(room) = room_mgr.rooms.get(&room_name) {
    let others = room.others(participant_id); // read lock on shard
    drop(room); // release shard lock
    // fan-out sends...
}
```
**Pros:**
- No explicit locking in user code
- Built-in sharding (64 shards on a 16-core host; the default scales with CPU count)
- Read-heavy workload benefits from RwLock per shard
**Cons:**
- New dependency (`dashmap` crate)
- DashMap guards can't be held across `.await` points (not `Send`)
- Mutable operations (join/leave/quality update) need `get_mut()` which takes exclusive shard lock
- Less control over lock granularity than Option A
- Quality tracking across rooms becomes awkward (can't iterate all rooms while holding one shard)
**Verdict: Good but Option A is simpler and more explicit.**
### Option C: Channel-Based Fan-Out
**Approach:** Replace direct `send_media()` calls with per-participant `mpsc::Sender` channels. Room join registers a sender; the forwarding loop just does `tx.send(pkt)` which is lock-free.
```rust
struct Room {
    participants: Vec<(ParticipantId, mpsc::Sender<MediaPacket>)>,
}

// Each participant's task:
let (tx, mut rx) = mpsc::channel(64);
room_mgr.join(room, participant_id, tx);

// Forwarding in recv loop:
let senders = room.others(participant_id); // Vec<mpsc::Sender> clone
for tx in &senders {
    let _ = tx.try_send(pkt.clone()); // non-blocking, no lock
}
```
**Pros:**
- Fan-out is completely lock-free (channel send is atomic)
- Backpressure per participant (full channel = drop packet, not block others)
- Natural decoupling: recv task → channel → send task
**Cons:**
- Requires cloning MediaPacket per participant (currently we clone ParticipantSender Arc, much cheaper)
- Additional memory: 64-packet channel buffer × N participants
- Still need a lock to get the sender list (unless we snapshot on join/leave)
- Adds latency: channel hop + wake adds ~1-5μs vs direct send
**Verdict: Over-engineered for current scale. Consider for 1000+ participant rooms.**
### Option D: Snapshot-on-Change (Optimistic Read)
**Approach:** Maintain a read-optimized `Arc<Vec<ParticipantSender>>` snapshot per room. Updated atomically on join/leave (rare). Readers just `Arc::clone()` — no lock at all.
```rust
struct Room {
    participants: Vec<Participant>,
    /// Atomically-updated snapshot of all senders (rebuilt on join/leave).
    sender_snapshot: Arc<ArcSwap<Vec<ParticipantSender>>>,
}

// Hot path (zero locking!):
let senders = room.sender_snapshot.load(); // atomic load, ~1ns
for sender in senders.iter() {
    if sender.id != participant_id { ... }
}
```
**Pros:**
- Zero lock contention on hot path — just an atomic pointer load
- Rebuild cost amortized over all packets between joins/leaves
- `arc-swap` crate is battle-tested and tiny
**Cons:**
- New dependency (`arc-swap`)
- Quality tracking still needs a mutable path (separate concern)
- Snapshot doesn't include mutable room state (quality tiers)
- More complex join/leave (must rebuild snapshot atomically)
**Verdict: Best theoretical performance, but adds complexity. Consider if DashMap proves insufficient.**
## Recommended Implementation: Option B (DashMap) + Federation Fix
DashMap is the right tool here. The original objections don't hold up:
- "Guards can't be held across `.await`" — we already drop locks before any async sends
- "Less control" — DashMap's 64 internal shards give finer granularity than manual per-room locks
- "New dependency" — one crate, battle-tested, widely used in the Rust ecosystem
DashMap's advantages over manual per-room `Arc<Mutex<Room>>`:
- **No two-level locking** — single `rooms.get()` vs outer-lock → Arc clone → drop → inner-lock
- **Read/write separation** — `get()` is a shared shard lock, multiple rooms on the same shard can read concurrently
- **Less code** — no manual Arc/Mutex wrapping, no explicit lock choreography
- **Iteration without global lock** — federation room announcements don't block media forwarding
### Phase 1: DashMap Room Storage (Biggest Win)
1. Add `dashmap` dependency to `wzp-relay`
2. Replace `rooms: HashMap<String, Room>` with `rooms: DashMap<String, Room>`
3. Move `qualities` and `room_tiers` into the `Room` struct (per-room state, not global)
4. RoomManager no longer needs a wrapping Mutex — it becomes `Arc<RoomManager>` directly
5. Per-packet hot path: `rooms.get(&name)` takes a shared shard lock, releases on drop
```rust
pub struct RoomManager {
    rooms: DashMap<String, Room>,
    acl: Option<HashMap<String, HashSet<String>>>, // read-only after init
    event_tx: broadcast::Sender<RoomEvent>,
}

struct Room {
    participants: Vec<Participant>,
    qualities: HashMap<ParticipantId, ParticipantQuality>,
    current_tier: Tier,
}

// Hot path becomes:
let (others, directive) = if let Some(mut room) = room_mgr.rooms.get_mut(&room_name) {
    let directive = if let Some(ref qr) = pkt.quality_report {
        room.observe_quality(participant_id, qr)
    } else {
        None
    };
    let o = room.others(participant_id);
    (o, directive)
} else {
    (vec![], None)
};
// Shard lock released here — fan-out sends are lock-free
```
**Files to modify:**
- `crates/wzp-relay/Cargo.toml` — add `dashmap` dependency
- `crates/wzp-relay/src/room.rs` — RoomManager struct, Room struct, all methods
- `crates/wzp-relay/src/lib.rs` — change from `Arc<Mutex<RoomManager>>` to `Arc<RoomManager>`
- `crates/wzp-relay/src/main.rs` — update RoomManager construction and all `.lock().await` call sites
- `crates/wzp-relay/src/federation.rs` — update room_mgr usage (no more `.lock().await`)
**Key behavior change:** `Arc<Mutex<RoomManager>>` → `Arc<RoomManager>`. Every call site that does `room_mgr.lock().await.some_method()` becomes `room_mgr.some_method()` directly. The DashMap handles internal locking.
**Concurrency improvement:**
- Before: 100 rooms × 10 people = all 1000 tasks compete for 1 Mutex
- After: 100 rooms × 10 people = distributed across 64 shards, ~15 tasks per shard average
- Within a room: participants still serialize through the shard lock, but hold time is <0.1ms for `get()` and `others()` (just Vec clone of Arcs)
### Phase 2: Federation Lock Fix
Clone the peer list, release lock, then send:
```rust
pub async fn forward_to_peers(&self, room_hash: &[u8; 8], media_data: &Bytes) {
    let peers: Vec<_> = {
        let links = self.peer_links.lock().await;
        links.values().map(|l| (l.label.clone(), l.transport.clone())).collect()
    }; // lock released immediately
    for (label, transport) in &peers {
        // send without holding lock — slow peer doesn't block others
    }
}
```
Also apply to `broadcast_signal()` and `send_signal_to_peer()`.
**Files to modify:**
- `crates/wzp-relay/src/federation.rs` — 3 methods
**Concurrency improvement:** A slow federation peer no longer blocks all other peers' media delivery.
### Phase 3: Quality Tracking Optimization (Optional)
With DashMap, quality tracking uses `get_mut()` (exclusive shard lock) on every packet that carries a QualityReport. For rooms where quality reports are frequent, this creates write contention on the shard.
Option: Move quality observation to a background task:
1. Per-participant `AtomicU8` for latest loss/RTT (lock-free write from hot path)
2. Background task every 1s reads atomics, computes tiers, broadcasts directives
3. Hot path becomes read-only: `rooms.get()` (shared lock) → `others()` → done
**Reduces shard lock from exclusive (`get_mut`) to shared (`get`) on every packet.**
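A minimal sketch of step 1, assuming each participant owns one slot that the hot path writes and the background task reads. `QualitySlot` is a hypothetical type, and packing RTT into a single byte in 4 ms units is an assumption made here so both values fit in `AtomicU8`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical lock-free slot holding a participant's latest quality sample.
pub struct QualitySlot {
    loss_pct: AtomicU8, // 0..=100
    rtt_4ms: AtomicU8,  // RTT in 4 ms units, saturating at 1020 ms (assumed encoding)
}

impl QualitySlot {
    pub fn new() -> Self {
        Self { loss_pct: AtomicU8::new(0), rtt_4ms: AtomicU8::new(0) }
    }

    /// Hot path: overwrite the latest sample without taking any lock.
    pub fn record(&self, loss_pct: u8, rtt_ms: u32) {
        self.loss_pct.store(loss_pct.min(100), Ordering::Relaxed);
        self.rtt_4ms.store((rtt_ms / 4).min(255) as u8, Ordering::Relaxed);
    }

    /// Background task: read the most recent (loss %, RTT ms) pair.
    /// The two loads are independent, so the pair may mix adjacent samples.
    pub fn snapshot(&self) -> (u8, u32) {
        (
            self.loss_pct.load(Ordering::Relaxed),
            self.rtt_4ms.load(Ordering::Relaxed) as u32 * 4,
        )
    }
}
```

For tier decisions made once per second, a pair that occasionally mixes two adjacent samples is likely acceptable; if not, both values could be packed into a single wider atomic.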
## Verification
1. **Correctness:** `cargo test -p wzp-relay` — all existing tests must pass
2. **Compile check:** `cargo check --workspace` — no regressions
3. **Load test:** 10 rooms × 10 participants, verify rooms forward concurrently
4. **Large room:** 1 room × 50 participants, no deadlocks
5. **Federation:** 3 relays, media bridges correctly with new lock pattern
6. **Benchmark:** Before/after packets-per-second on multi-core with `wzp-bench`
## Effort
- Phase 1: 1 day (DashMap migration + test updates)
- Phase 2: 0.5 day (federation clone-and-release)
- Phase 3: 0.5 day (optional, quality tracking with atomics)
- Total: 1.5-2 days
## Implementation Status (2026-04-13)
Phase 1 (DashMap): DONE — global Mutex → DashMap<String, Room> with 64 shards
Phase 2 (Federation clone-before-send): DONE — forward_to_peers, broadcast_signal, send_signal_to_peer
Phase 3 (Quality atomics): NOT DONE — optional optimization
See also: docs/REFACTOR-relay-concurrency.md for the full post-refactor analysis.


@@ -0,0 +1,176 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Relay Conformance Enforcement (Abuse Mitigation Tiers A-G)
> **Status:** proposed
> **Resolves:** All in-scope vectors from `docs/ATTACK-SURFACE-RELAY-ABUSE.md`.
> **Depends on:** PRD #1 (wire format v2 — for `MediaType` separation in Tiers D/F).
## Problem
WZP relays forward E2E-encrypted ciphertext and cannot inspect payload content. A trivial PoC on another E2E SFU (LiveKit) showed that without conformance enforcement, the relay becomes a free arbitrary-data tunnel. WZP must enforce media-shape conformance against observable header and timing metadata, without breaking E2E.
## Goals
- Make bulk data tunneling through WZP infeasible.
- Bound aggregate per-user abuse blast radius.
- Make covert tunneling expensive (Tier F) without false-positiving real calls.
- Audio and video evaluated by **separate scorers** (statistical signatures don't overlap).
## Non-goals
- Content inspection (would break E2E).
- Detecting steganographic covert channels inside legitimate audio (information-theoretic limit; not worth chasing).
- CSAM / copyright detection (would require E2E break; explicit non-goal).
## Design — tiered enforcement
### Tier A — Codec-conformance bitrate caps
For each `CodecID`, compute math-derived ceiling and enforce sliding 1 s window per session:
```
ceiling_bps[CodecID] = nominal * (1 + max_FEC_ratio) * (1 + overhead_pct)
= nominal * 3.0 * 1.15
```
Hard violation (sustained > ceiling for 1 s) → close session with `Hangup::PolicyViolation { code: BITRATE }`.
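The ceiling formula and the 1 s window can be sketched as follows. This is an assumption-laden illustration, not `conformance.rs`: `ceiling_bps` and `BitrateWindow` are hypothetical names, time is passed in as milliseconds, and the "sustained for 1 s" rule is reduced to comparing the bits seen in the trailing second against the ceiling.

```rust
use std::collections::VecDeque;

/// ceiling = nominal * (1 + max_FEC_ratio) * (1 + overhead_pct)
pub fn ceiling_bps(nominal_bps: u32, max_fec_ratio: f64, overhead_pct: f64) -> f64 {
    nominal_bps as f64 * (1.0 + max_fec_ratio) * (1.0 + overhead_pct)
}

/// Sliding 1 s window over packet sizes for a single session.
#[derive(Default)]
pub struct BitrateWindow {
    samples: VecDeque<(u64, usize)>, // (arrival time in ms, packet length in bytes)
    bytes: usize,
}

impl BitrateWindow {
    /// Record a packet and return the bitrate over the trailing second.
    pub fn observe(&mut self, now_ms: u64, len: usize) -> f64 {
        self.samples.push_back((now_ms, len));
        self.bytes += len;
        // Evict samples that are one second old or older.
        while let Some(&(t, l)) = self.samples.front() {
            if now_ms.saturating_sub(t) >= 1000 {
                self.samples.pop_front();
                self.bytes -= l;
            } else {
                break;
            }
        }
        self.bytes as f64 * 8.0 // bits observed in the last 1 s
    }
}
```

For a nominal 24 kbps codec with `max_FEC_ratio = 2.0` and 15 % overhead, the ceiling is about 82.8 kbps; a sender pushing 1250-byte packets every 2 ms measures 5 Mbps in the window and would trip it, while 60-byte packets at 50 pps measure 24 kbps and would not.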
### Tier B — Packet-rate cap
Per `CodecID`, max `pps` known (25 or 50 base × up to 3× for FEC = ~150 pps for audio). Sustained > 200 pps audio → hard violation.
### Tier C — Timestamp-rate consistency
`Δtimestamp_ms / Δsequence` over rolling 200-packet window must match codec frame duration ± 2×. Violation → hard.
### Tier D — Per-codec packet-size sanity
EWMA(`payload_len`) per session; reject sustained mean > 2× codec typical. Per-codec table in spec.
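A possible shape for the size check, sketched under assumptions: `SizeEwma` is a hypothetical type, the smoothing factor is a tunable chosen by the caller, and "codec typical" is passed in as a plain byte count rather than looked up from the spec table.

```rust
/// Exponentially weighted moving average of payload sizes for one session.
pub struct SizeEwma {
    alpha: f64,        // smoothing factor in (0, 1]; smaller = slower to react
    mean: Option<f64>, // None until the first packet is observed
}

impl SizeEwma {
    pub fn new(alpha: f64) -> Self {
        Self { alpha, mean: None }
    }

    /// Fold one payload length into the average and return the new mean.
    pub fn observe(&mut self, payload_len: usize) -> f64 {
        let x = payload_len as f64;
        let m = match self.mean {
            None => x,
            Some(m) => m + self.alpha * (x - m),
        };
        self.mean = Some(m);
        m
    }

    /// True when the smoothed mean exceeds twice the codec's typical size.
    pub fn exceeds(&self, typical_len: f64) -> bool {
        self.mean.map_or(false, |m| m > 2.0 * typical_len)
    }
}
```

With `alpha = 0.05` and a typical size of 80 bytes, a single 1200-byte outlier moves the mean to 136 bytes and stays under the 160-byte limit, whereas a sustained run of such packets pushes the mean well past it.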
### Tier E — Per-fingerprint / per-IP token bucket
```
For each (fingerprint, src_ip):
    monthly_bytes_quota   authed = 50 GB (tunable)
                          anon   = 1 GB
    per-session bps cap   audio  = 256 kbps
                          video  = 5 Mbps
    burst                 = 30 s @ 2× cap
```
Anonymous quotas tight; authenticated (via featherChat) quotas generous. Soft enforcement: throttle, then close on persistent overage.
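The per-session cap with burst allowance maps naturally onto a token bucket. The sketch below is one interpretation rather than the relay's design: `TokenBucket` is a hypothetical type, "30 s @ 2× cap" is read as a bucket holding 30 s worth of tokens at the cap rate (so a sender at twice the cap drains it in about 30 s), and the monthly quota is not modeled.

```rust
/// Token bucket in bytes; refills at the session cap, holds the burst allowance.
pub struct TokenBucket {
    rate_bytes_per_s: f64,
    capacity: f64,
    tokens: f64,
    last_ms: u64,
}

impl TokenBucket {
    /// `cap_bps` is the per-session cap in bits/s; `burst_secs` is how long a
    /// sender at 2x the cap may continue before being throttled.
    pub fn for_cap(cap_bps: u64, burst_secs: f64, now_ms: u64) -> Self {
        let rate = cap_bps as f64 / 8.0;
        let capacity = rate * burst_secs;
        Self { rate_bytes_per_s: rate, capacity, tokens: capacity, last_ms: now_ms }
    }

    /// Refill for the elapsed time, then try to spend `bytes` tokens.
    /// Returns false when the packet should be throttled.
    pub fn try_consume(&mut self, now_ms: u64, bytes: usize) -> bool {
        let elapsed_s = now_ms.saturating_sub(self.last_ms) as f64 / 1000.0;
        self.tokens = (self.tokens + elapsed_s * self.rate_bytes_per_s).min(self.capacity);
        self.last_ms = now_ms;
        if self.tokens >= bytes as f64 {
            self.tokens -= bytes as f64;
            true
        } else {
            false
        }
    }
}
```

With the 256 kbps audio cap, a sender at exactly the cap is never throttled, while a sender at 512 kbps passes for roughly 30 s before `try_consume` starts returning `false`; soft enforcement would throttle at that point and close only on persistent overage.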
### Tier F — Behavioral entropy scoring (per `MediaType`)
Separate scorers for audio and video. Computed over 10-30 s windows.
**Audio scorer features:**
| Feature | Legitimate | Abusive |
|---|---|---|
| IAT coefficient of variation | 0.1-0.4 | > 1.0 |
| Payload-size bimodality | Bimodal (speech + silence) | Unimodal |
| Silence fraction | 10-40 % | < 2 % |
| 30 s bitrate vs. nominal | ± 20 % | Saturates ceiling |
| `Q` flag cadence | Periodic | Absent/random |
**Video scorer features (post-PRD #5):**
| Feature | Legitimate | Abusive |
|---|---|---|
| Keyframe periodicity | Regular (1-4 s or on PLI) | Absent / uniform KF=1 |
| I/P frame-size ratio | 5-20× | ~1× |
| Burst structure | I-frame in < 5 ms, then quiet | Uniform spacing |
| Bitrate response to BWE | Tracks `remb_bps` | Ignores |
| NACK/PLI responsiveness | Keyframe within 200 ms | No response |
Output: `legitimacy ∈ [0, 1]` per session per `MediaType`. < 0.3 for 60 s → Suspect; < 0.1 for 60 s → Abusive.
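The "below threshold for 60 s" rule can be tracked with two timestamps per session and media type. This sketch is illustrative only: `VerdictTracker` is a hypothetical type, time is passed in as whole seconds, and the `RepeatAbusive` escalation is left out.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Verdict { Legitimate, Suspect, Abusive }

/// Tracks how long the legitimacy score has stayed under each threshold.
#[derive(Default)]
pub struct VerdictTracker {
    below_suspect_since: Option<u64>, // first second with score < 0.3
    below_abusive_since: Option<u64>, // first second with score < 0.1
}

impl VerdictTracker {
    /// Feed the latest score; returns the verdict after the 60 s hold rule.
    pub fn update(&mut self, now_s: u64, legitimacy: f32) -> Verdict {
        // Keep the original start time while below; reset once back above.
        self.below_suspect_since =
            if legitimacy < 0.3 { self.below_suspect_since.or(Some(now_s)) } else { None };
        self.below_abusive_since =
            if legitimacy < 0.1 { self.below_abusive_since.or(Some(now_s)) } else { None };

        let held = |since: Option<u64>| since.map_or(false, |t| now_s - t >= 60);
        if held(self.below_abusive_since) {
            Verdict::Abusive
        } else if held(self.below_suspect_since) {
            Verdict::Suspect
        } else {
            Verdict::Legitimate
        }
    }
}
```

A single score above the threshold resets the corresponding timer, so brief dips (a long monologue, a music segment) do not accumulate toward a verdict.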
### Tier G — Reactive response
```
Verdict::Legitimate → no action
Verdict::Suspect → apply tighter Tier E quota; emit metric
Verdict::Abusive → close session with typed Hangup; cool-down fingerprint 1 h
Verdict::RepeatAbusive → relay-local block 24 h; (optional gossip)
```
Always typed close. No silent drops.
## Implementation outline
New module `wzp-relay/src/conformance.rs`:
```rust
pub struct ConformanceMeter {
    media_type: MediaType,
    declared_codec: AtomicU8,
    bytes_window: SlidingWindow<1000>,
    packet_window: SlidingWindow<1000>,
    iat_ewma: ExponentialMovingAverage,
    iat_variance: ExponentialMovingVariance,
    size_histogram: SizeBuckets<8>,
    silence_count: AtomicU32,
    speech_count: AtomicU32,
    quality_reports_seen: AtomicU32,
    last_timestamp_ms: AtomicU32,
    last_seq: AtomicU32,
    keyframe_intervals: RingBuffer<u32, 16>,
    violations: AtomicU32,
}

impl ConformanceMeter {
    pub fn observe(&self, h: &MediaHeader, payload_len: usize, now: Instant) -> Result<(), Violation>;
    pub fn legitimacy(&self) -> f32;
    pub fn verdict(&self) -> Verdict;
}
```
Hooked into the per-participant forwarding loop in `RoomManager`. Tiers A-D run synchronously (cheap). Tier F runs on a periodic task (every 1 s per session).
Prometheus exports:
```
wzp_relay_conformance_violations_total{tier,codec_id,media_type,verdict}
wzp_relay_conformance_legitimacy{media_type} histogram
wzp_relay_conformance_iat_cov{media_type} histogram
wzp_relay_conformance_silence_fraction histogram
```
## Rollout
1. Deploy with all tiers in **observe-only** mode (Prometheus only, no enforcement).
2. Collect 1-2 weeks of baseline traffic.
3. Set thresholds at observed 99.9th percentile of legitimate traffic + headroom.
4. Flip Tier A enforcement first (highest confidence, lowest false-positive risk).
5. Flip B, C, D over 2 weeks.
6. Tune Tier F thresholds against the baseline; flip Suspect first, then Abusive.
## Acceptance criteria
- Synthetic abuse test (5 Mbps random bytes declared as Opus 24 k) closed within 1 s.
- Synthetic abuse test (audio-rate small packets with stuffed payload) closed within 5 s by Tier D.
- Synthetic abuse test (audio-rate, audio-sized, but no silence and CoV=2.0 IAT) flagged Suspect within 60 s.
- Real-call false-positive rate < 0.1 % over a week of production baseline.
- All verdict transitions emit Prometheus counters.
## Risks
- **False positives on edge cases** (long lectures with little silence, ambient-music calls). Mitigation: Tier F floor at Suspect for 30 s minimum; manual review channel for repeat-flagged authed users.
- **Threshold drift** as codecs evolve. Mitigation: ceilings are math-derived from codec table; updated when codec table updates.
- **Federated abuse moving between relays.** Mitigation: Tier G optional gossip (post-Wave 5).
## Effort
- Tier A + B + C: 1.5 d (T2.4 + T2.5)
- Tier D: 0.5 d (T3.6)
- Tier E: 1.5 d (T3.5)
- Tier F audio: 3 d (T5.7)
- Tier F video: 3 d (T6.2)
- Tier G: 1 d (T5.8)
Total: ~10 engineer-days, spread across Waves 2-6.


@@ -0,0 +1,307 @@
---
tags: [prd, wzp]
type: prd
---
# Design Exploration: Federated Reputation Gossip (T6.3)
> **Status:** Design exploration — no approach selected.
> **Blocked on:** Reviewer design call (needs operator-trust model decision).
> **Scope:** How should WZP relays share abuse verdicts across a federation mesh so that abusers cannot hop between relays?
## Background
WZP relays are E2E-encrypted SFUs. They cannot inspect payload content, but they **can** observe metadata: packet rates, payload sizes, timestamps, keyframe patterns, and BWE responsiveness. Tiers A-F of the conformance pipeline observe these signals and produce a `Verdict ∈ {Legitimate, Suspect, Abusive}`.
Tier G (`ResponsePolicy`) escalates:
- `Abusive` → typed `Hangup` + 1 h fingerprint cool-down
- Repeat `Abusive` within 24 h → relay-local `Block` for 24 h
**The gap:** Block state is in-memory only and per-relay. A blocked fingerprint on relay A can reconnect to relay B (same room, same mesh) and resume abuse immediately. T6.3 explores closing this gap.
**What is being gossiped?** A *reputation event*: "fingerprint `F` produced violation `V` with verdict `Abusive` at time `T` on relay `R`."
---
## Assumptions
1. Relays trust each other at the *connection level* (TLS fingerprints in `PeerConfig` / `TrustedConfig`) but are **not** guaranteed to share the same abuse-detection thresholds or calibration.
2. The federation mesh is small (tens of relays, not thousands).
3. False positives happen — a legitimate user on a long lecture call can trigger `Suspect` or even `Abusive` on an aggressively-tuned relay.
4. A compromised relay is a realistic threat model (stolen credentials, operator coercion, buggy calibration).
5. Relays are operated by different entities — there is no single administrative root of trust.
---
## Approach 1: Push Gossip
### Summary
When a relay issues a `Block` action (repeat abusive), it immediately broadcasts a `ReputationEvent` to all peer relays via the existing federation QUIC channels. Peers incorporate the event into their local block lists.
### Wire format
```rust
// New SignalMessage variant
ReputationEvent {
    version: u8,
    /// Fingerprint being reported (the abusive party, not the reporter).
    fingerprint: String,
    /// Which violation code triggered the block.
    violation: ViolationCode,
    /// When the block was issued (Unix epoch seconds, u64).
    issued_at: u64,
    /// TTL in seconds (default 86400 = 24 h).
    ttl_secs: u32,
    /// Relay that issued the block (TLS fingerprint hex).
    origin_relay_fp: String,
    /// Ed25519 signature over (fingerprint || violation || issued_at || ttl_secs || origin_relay_fp).
    /// The signing key is the relay's long-term identity key (reused from client handshake identity).
    signature: [u8; 64],
}
```
**What is signed?** The canonical serialization of the payload fields (excluding the signature itself). This prevents replay of old events and binds the event to a specific origin relay.
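One possible canonical serialization is sketched below. It length-prefixes the two string fields so that a plain `||` concatenation cannot be parsed two ways, and uses big-endian integers. The `u16` violation code and the function name `canonical_bytes` are assumptions; the source does not define `ViolationCode`'s wire form, and the Ed25519 signing step itself is left out.

```rust
/// Append a string as a 4-byte big-endian length followed by its UTF-8 bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_be_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Bytes to be signed for a reputation event (hypothetical layout).
/// Field order follows the signature comment in the wire format.
pub fn canonical_bytes(
    fingerprint: &str,
    violation_code: u16, // assumed numeric form of ViolationCode
    issued_at: u64,
    ttl_secs: u32,
    origin_relay_fp: &str,
) -> Vec<u8> {
    let mut out = Vec::new();
    put_str(&mut out, fingerprint);
    out.extend_from_slice(&violation_code.to_be_bytes());
    out.extend_from_slice(&issued_at.to_be_bytes());
    out.extend_from_slice(&ttl_secs.to_be_bytes());
    put_str(&mut out, origin_relay_fp);
    out
}
```

Signer and verifier would both call the same function, so any change to the layout needs a `version` bump.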
**Key distribution:** Each relay's Ed25519 public key is published in a well-known endpoint (e.g., `/.well-known/wzp-relay.pub`) or embedded in the `FederationHello` handshake. Verification happens on receipt.
### Sybil resistance
- **Signing requirement:** Only relays with a known Ed25519 pubkey can produce valid events. A rogue relay must first be in the `TrustedConfig` to even connect.
- **Origin attribution:** Every event is traceable to a specific relay. If relay R starts flooding false positives, peers can blacklist R's pubkey or reduce R's event weight to zero.
- **No aggregate thresholding:** This design trusts every signed event equally. A single malicious relay can block any fingerprint across the mesh.
**Mitigation option (not implemented):** Require *k-of-n* independent relay reports before applying a cross-relay block. This introduces complexity (tracking per-fingerprint report counts, decaying counters, handling churn).
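The k-of-n idea can be illustrated with a counter of distinct reporting relays per fingerprint. This sketch deliberately ignores the decay and churn handling mentioned above; `ReportQuorum` is a hypothetical name.

```rust
use std::collections::{HashMap, HashSet};

/// Counts distinct origin relays that reported each fingerprint.
pub struct ReportQuorum {
    k: usize,
    reporters: HashMap<String, HashSet<String>>,
}

impl ReportQuorum {
    pub fn new(k: usize) -> Self {
        Self { k, reporters: HashMap::new() }
    }

    /// Record one verified report. Returns true once at least `k`
    /// distinct relays have reported the fingerprint.
    pub fn record(&mut self, fingerprint: &str, origin_relay_fp: &str) -> bool {
        let set = self.reporters.entry(fingerprint.to_string()).or_default();
        // A repeated report from the same relay does not grow the set.
        set.insert(origin_relay_fp.to_string());
        set.len() >= self.k
    }
}
```

With `k = 2`, a single rogue relay can no longer trigger a cross-relay block on its own.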
### Convergence model
- **Eventual consistency:** Events propagate via multi-hop flood (same mechanism as `GlobalRoomActive`).
- **Bounded staleness:** Events carry TTL. Stale events (> TTL) are ignored.
- **No ordering guarantee:** Two relays may issue conflicting events (relay A blocks F, relay B clears F). Last-writer-wins based on `issued_at`.
### Storage
- **In-memory only:** `HashMap<(fingerprint, origin_relay), ReputationEntry>` with TTL-based eviction.
- **No persistence:** Restarting a relay loses all gossiped state. Re-blocking requires the origin relay to re-gossip or the abuser to re-offend.
- **Memory bound:** ~100 bytes per entry × 10k entries = ~1 MB. Trivial.
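A rough sketch of the in-memory store, keyed by `(fingerprint, origin_relay)` with TTL eviction. Times are Unix epoch seconds passed in by the caller, and the 300 s skew limit mirrors the ±5 min tolerance listed under failure modes. All type and method names are hypothetical.

```rust
use std::collections::HashMap;

/// Assumed clock-skew tolerance for `issued_at` (5 minutes).
const MAX_CLOCK_SKEW_SECS: u64 = 300;

pub struct ReputationEntry {
    pub issued_at: u64,
    pub ttl_secs: u32,
}

impl ReputationEntry {
    fn expires_at(&self) -> u64 {
        self.issued_at.saturating_add(self.ttl_secs as u64)
    }
}

#[derive(Default)]
pub struct ReputationStore {
    entries: HashMap<(String, String), ReputationEntry>,
}

impl ReputationStore {
    pub fn new() -> Self {
        Self::default()
    }

    /// Store an already-verified event. Rejects events that are expired
    /// or issued too far in the future.
    pub fn insert(&mut self, fingerprint: &str, origin_relay: &str,
                  issued_at: u64, ttl_secs: u32, now: u64) -> bool {
        let entry = ReputationEntry { issued_at, ttl_secs };
        if issued_at > now.saturating_add(MAX_CLOCK_SKEW_SECS) || entry.expires_at() <= now {
            return false;
        }
        self.entries.insert((fingerprint.to_string(), origin_relay.to_string()), entry);
        true
    }

    /// Drop every entry whose TTL has elapsed.
    pub fn evict_expired(&mut self, now: u64) {
        self.entries.retain(|_, e| e.expires_at() > now);
    }

    /// True if any origin relay has a live block for this fingerprint.
    pub fn is_blocked(&self, fingerprint: &str, now: u64) -> bool {
        self.entries
            .iter()
            .any(|((fp, _), e)| fp.as_str() == fingerprint && e.expires_at() > now)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

`is_blocked` scans all entries, which is fine at the 10k scale quoted above; a secondary index by fingerprint would be the obvious next step if it were not.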
### Partition tolerance
- **Partitioned relay A** blocks fingerprint F locally but cannot push to peers. When partition heals, A floods the backlog. Peers apply events with `issued_at` within TTL; expired backlog is ignored.
- **Partitioned relay B** never hears about F's block. F can abuse B freely during the partition. This is acceptable for a best-effort gossip system.
### Failure modes
| Scenario | Impact | Mitigation |
|---|---|---|
| Compromised relay floods false blocks | All fingerprints blocked mesh-wide | Manual pubkey blacklist; no automatic mitigation in this design |
| Buggy relay false-positives a popular fingerprint | Legitimate users blocked everywhere | Operator contact + pubkey downgrade |
| Network partition | Split-brain block lists | Acceptable; partition healing replays backlog |
| Clock skew | Events from future/past rejected or mis-ordered | Use `issued_at` with ±5 min tolerance; NTP assumed |
| Replay attack | Old event re-broadcast after TTL | Signature binds `issued_at`; verify TTL at receipt |
### Complexity
- **Low-medium:** Reuses existing federation broadcast infrastructure. Adds one `SignalMessage` variant, Ed25519 signing/verification, and an in-memory TTL map.
---
## Approach 2: Pull Gossip (Reputation Oracle)
### Summary
One relay in the mesh is designated the **reputation oracle** (or a small quorum of 3). Other relays query the oracle periodically (e.g., every 60 s) for the current block list. The oracle maintains the authoritative reputation state.
### Wire format
```rust
// Pull request
ReputationQuery {
    version: u8,
    /// Last checkpoint the requester has seen (opaque cursor).
    since_cursor: Option<String>,
}

// Pull response
ReputationSnapshot {
    version: u8,
    /// Opaque cursor for delta pagination.
    cursor: String,
    /// List of active blocks at the oracle.
    blocks: Vec<ReputationBlock>,
    /// Oracle's Ed25519 signature over the serialized snapshot.
    signature: [u8; 64],
}

struct ReputationBlock {
    fingerprint: String,
    violation: ViolationCode,
    issued_at: u64,
    ttl_secs: u32,
    /// Which relay originally reported this (for audit).
    reported_by: String,
}
```
**What is signed?** The entire `ReputationSnapshot` serialized canonically. The oracle is the sole signer.
**Oracle selection:** Config-based. Each relay's config names its oracle(s):
```toml
[reputation]
oracle = "https://relay-oracle.example.com"
oracle_pubkey = "AA:BB:CC:..."
```
### Sybil resistance
- **Centralized trust:** The oracle is a single point of trust. If the oracle is honest, no rogue relay can poison the mesh.
- **Oracle compromise:** A compromised oracle can block or unblock any fingerprint across all querying relays. This is a **catastrophic** failure mode.
- **Quorum variant:** 3 oracles with 2-of-3 signing. Queries return the intersection of block lists. More resilient but adds latency and complexity.
### Convergence model
- **Bounded staleness:** Worst-case = query interval (60 s) + network RTT.
- **Strong consistency within staleness bound:** All querying relays see the same oracle state (modulo query timing skew).
- **No multi-hop gossip:** Direct query/response only.
### Storage
- **Oracle side:** In-memory + optional persistence (SQLite or flat file). Oracle restarts reload from disk.
- **Querying relays:** In-memory cache of the last snapshot. No local state between restarts.
- **Memory bound:** Same as Approach 1 (~1 MB for 10k entries).
### Partition tolerance
- **Partitioned querying relay:** Falls back to local-only blocking (current behavior). When partition heals, next query re-syncs.
- **Partitioned oracle:** All querying relays lose cross-relay reputation and fall back to local-only. Oracle restart without persistence = same.
- **No split-brain:** Either you have the oracle snapshot or you don't. No conflicting states.
### Failure modes
| Scenario | Impact | Mitigation |
|---|---|---|
| Oracle compromise | Global block/unblock of any fingerprint | 2-of-3 quorum; operator out-of-band alert |
| Oracle downtime | All querying relays lose cross-relay reputation | Fallback to local-only; no amplification |
| Rogue relay reports to oracle | Oracle may incorporate false positives | Oracle applies local thresholding (e.g., require 2 independent reports) |
| Query amplification | N relays each polling every 60 s add steady load on the oracle | Oracle caches; responses are cheap |
| Man-in-the-middle on query | Attacker injects fake snapshot | TLS + signature verification on response |
### Complexity
- **Medium:** Adds an HTTP/QUIC query endpoint, snapshot pagination, oracle election/config, and a new failure mode (oracle as SPOF).
- **Operational burden:** Someone must run the oracle. Small federations may not want this.
---
## Approach 3: No Gossip — Explicit Ban-List Distribution
### Summary
Relays do **not** gossip reputation automatically. Instead, a human admin (or an automated audit pipeline) curates an authoritative ban list, signs it, and pushes it to all relays. Relays apply the list verbatim.
### Wire format
```rust
// Signed ban list (distributed out-of-band: HTTPS, SCP, S3, etc.)
BanList {
    version: u8,
    /// Issued at (Unix epoch seconds).
    issued_at: u64,
    /// Expires at (Unix epoch seconds). After this, the list is ignored.
    expires_at: u64,
    /// Entries.
    entries: Vec<BanEntry>,
    /// Admin Ed25519 signature over canonical serialization.
    signature: [u8; 64],
}

struct BanEntry {
    fingerprint: String,
    /// Human-readable reason (not machine-parsed).
    reason: String,
    /// Optional: which relay originally reported.
    source_relay: Option<String>,
}
```
**What is signed?** The entire `BanList`. The admin (not a relay) is the signer.
**Distribution:** Out-of-band from the federation mesh. Could be:
- Admin `scp`s JSON to each relay's config directory
- Relays poll an HTTPS URL every 5 min
- Shared object storage (S3, GCS)
**Key distribution:** Admin pubkey is baked into each relay's config at provisioning time:
```toml
[ban_list]
admin_pubkey = "AA:BB:CC:..."
url = "https://ops.example.com/banlist.json"
refresh_secs = 300
```
### Sybil resistance
- **Strong:** Only the admin can produce a valid ban list. No relay can poison another relay.
- **Admin compromise:** Catastrophic — attacker controls the global ban list. But this is a standard operational threat model (same as DNS root, certificate transparency log, etc.).
- **No relay-to-relay trust required:** Relays don't need to trust each other's calibration or behaviour.
### Convergence model
- **Poll-based bounded staleness:** Worst-case = `refresh_secs` (default 300 s = 5 min).
- **Strong consistency:** All relays that successfully fetch the list see identical state.
- **No event propagation:** No flood, no multi-hop, no deduplication needed.
### Storage
- **On-disk cache:** Each relay stores the latest fetched ban list to survive restart.
- **In-memory lookup:** `HashSet<fingerprint>` for O(1) block checks.
- **Memory bound:** Same as other approaches.
### Partition tolerance
- **Partitioned relay:** Continues using its last cached ban list until `expires_at`. After expiry, falls back to local-only blocking.
- **No split-brain:** Either you have the signed list or you don't.
### Failure modes
| Scenario | Impact | Mitigation |
|---|---|---|
| Admin key compromise | Global block/unblock of any fingerprint | Key rotation + out-of-band alert |
| Distribution channel down | Relays use stale cached list until expiry | Short expiry + monitoring |
| Admin false-positives a popular fingerprint | Global block of legitimate users | Human review process; short expiry allows quick recovery |
| Relay never fetches list | Local-only blocking only | Monitoring alert on relay ops dashboard |
| List too large | Fetch latency, memory bloat | Pagination; but 10k entries = ~500 KB JSON, trivial |
### Complexity
- **Low:** No changes to federation mesh at all. Adds an HTTP polling loop, a file watcher, or an S3 sync. The signing/verification is trivial (one Ed25519 verify on fetch).
- **Operational burden:** Requires a human or automated pipeline to maintain the ban list. This is acceptable for small federations (the WZP target audience) but does not scale to open meshes.
---
## Comparative Summary
| Dimension | Approach 1: Push Gossip | Approach 2: Pull Oracle | Approach 3: Ban-List Distribution |
|---|---|---|---|
| **Trust model** | Every relay trusts every other relay's verdict equally | Trusts a designated oracle (or 2-of-3 quorum) | Trusts a single admin key |
| **Sybil resistance** | Weak — one rogue relay can poison the mesh | Medium-strong — oracle is gatekeeper | Strong — only admin can sign |
| **Convergence** | Eventual; multi-hop flood | Bounded (query interval); direct pull | Bounded (poll interval); out-of-band |
| **Partition tolerance** | Acceptable (backlog replay on heal) | Acceptable (fallback to local) | Good (cached list + expiry) |
| **False-positive blast radius** | Mesh-wide from one relay | Mesh-wide from oracle | Mesh-wide from admin |
| **Operational burden** | Low — fully automatic | Medium — must run oracle | Medium — must curate list |
| **Federation code changes** | Medium — broadcast loop, dedup, signatures | Medium — query endpoint, snapshot pagination | Low — out-of-band, no mesh changes |
| **Scaling** | Poor — flood doesn't scale past ~50 relays | Good — O(N) queries, oracle is O(1) | Good — O(N) fetches, no mesh load |
| **Audit trail** | Good — every event attributed to origin relay | Good — oracle logs all reports | Good — list is a snapshot |
| **Rollback / correction** | Hard — events spread everywhere; need counter-events | Easy — oracle updates snapshot | Easy — admin publishes new list |
## Open Questions (Blockers for Implementation)
1. **Trust model:** Do we trust all peer relays equally (Approach 1), trust a designated oracle (Approach 2), or trust a human admin (Approach 3)? This is a product/operations decision, not a technical one.
2. **Key infrastructure:** The federation layer currently has **no message-level signing**. All three approaches require adding Ed25519 signing/verification to relay-relay messages (or out-of-band fetches). The `wzp-crypto` crate already has Ed25519 identity support (used in client handshake) — it can be reused.
3. **Fingerprint scope:** Is the block scoped to a fingerprint (Ed25519 pubkey), an IP address, or both? Current `ResponsePolicy` uses a string fingerprint. Gossip should probably scope to the cryptographic identity, not IP, to prevent NAT rebind evasion.
4. **Privacy leakage:** Gossiping "fingerprint F is abusive" leaks that F was on a call. Is this acceptable? WZP is not anonymous by design (participants know each other's fingerprints), but operators learning about calls they are not in is a privacy concern.
5. **TTL vs. persistent bans:** Should a gossiped block expire automatically (TTL), or should it require manual review for extension? Automatic expiry limits false-positive damage but allows persistent abusers to cycle.
6. **Rate limiting on gossip:** A compromised relay could flood the mesh with `ReputationEvent` messages. The existing per-room rate limiter (500 pps) doesn't apply to control messages. A per-peer gossip rate limit is needed regardless of approach.
## Recommendation
**Do not implement any approach yet.** The blockers above (especially #1 trust model and #4 privacy) need a reviewer/operator design call. T6.3 should remain **Blocked** until then.
If forced to pick a default for a small, closed federation (the current WZP target audience), **Approach 3 (Ban-List Distribution)** has the lowest complexity, the strongest Sybil resistance, and the easiest rollback. It sacrifices automation but small federations typically have an ops person who can review blocks anyway. Approach 1 (Push Gossip) is more elegant for larger meshes but requires solving the rogue-relay problem first (e.g., k-of-n thresholding).


@@ -0,0 +1,175 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Relay Federation (Multi-Relay Mesh)
## Problem
Currently all participants in a call must connect to the same relay. This creates:
- **Single point of failure** — if the relay goes down, the entire call drops
- **Geographic latency** — users far from the relay get high RTT
- **Capacity limits** — one relay handles all traffic
Users should be able to connect to their nearest/preferred relay and still talk to users on other relays, as long as the relays are federated.
## Prerequisite: Fix Relay Identity Persistence
### Bug: TLS certificate regenerates on every restart
**Root cause:** `wzp-transport/src/config.rs:17` calls `rcgen::generate_simple_self_signed()` which creates a new keypair every time. The relay's Ed25519 identity seed IS persisted to `~/.wzp/relay-identity`, but the TLS certificate is not derived from it.
**Impact:** Clients see a different server fingerprint after every relay restart, triggering the "Server Key Changed" warning. This also breaks federation since relays identify each other by certificate fingerprint.
**Fix:** Derive the TLS certificate from the persisted relay seed:
1. Add `server_config_from_seed(seed: &[u8; 32])` to `wzp-transport`
2. Use the seed to create a deterministic keypair (e.g., derive an ECDSA key via HKDF from the Ed25519 seed)
3. Generate a self-signed cert with that keypair — same seed = same cert = same fingerprint
4. The relay passes its loaded seed to `server_config_from_seed()` instead of `server_config()`
**Effort:** 0.5 day
## Federation Design
### Core Concept
Two or more relays form a **federation mesh**. Each relay is an independent SFU. When relays are configured to trust each other, they bridge rooms with matching names — participants on relay A in room "podcast" hear participants on relay B in room "podcast" as if everyone were on the same relay.
### Configuration
Each relay reads a YAML config file (e.g., `~/.wzp/relay.yaml` or `--config relay.yaml`):
```yaml
# Relay identity (auto-generated if missing)
listen: 0.0.0.0:4433
# Federation peers: other relays we trust and bridge rooms with
# Both sides must configure each other for federation to work
peers:
  - url: "193.180.213.68:4433"
    fingerprint: "a5d6:e3c6:5ae7:185c:4eb1:af89:daed:4a43"
    label: "Pangolin EU"
  - url: "10.0.0.5:4433"
    fingerprint: "7f2a:b391:0c44:..."
    label: "Office LAN"
```
**Key rules:**
- Both relays must configure each other — **mutual trust** required
- A relay that receives a connection from an unknown peer logs: `"Relay a5d6:e3c6:... (193.180.213.68) wants to federate. To accept, add to peers config: url: 193.180.213.68:4433, fingerprint: a5d6:e3c6:..."`
- Fingerprints are verified via the TLS certificate (requires the identity fix above)
### Protocol
#### Peer Connection
1. On startup, each relay attempts QUIC connections to all configured peers
2. The connection uses SNI `"_federation"` (reserved room name prefix) to distinguish from client connections
3. After QUIC handshake, verify the peer's certificate fingerprint matches the configured fingerprint
4. If fingerprint mismatch → reject, log warning
5. If peer connects but isn't in our config → log the helpful "add to config" message, reject
#### Room Bridging
Once two relays are connected:
1. **Room discovery**: When a local participant joins room "T", the relay sends a `FederationRoomJoin { room: "T" }` signal to all connected peers
2. **Room leave**: When the last local participant leaves room "T", send `FederationRoomLeave { room: "T" }`
3. **Media forwarding**: For each room that exists on both relays:
- Relay A forwards all media packets from its local participants to relay B
- Relay B forwards all media packets from its local participants to relay A
- Each relay then fans out received federated media to its local participants (same as local SFU forwarding)
4. **Participant presence**: `RoomUpdate` signals are merged — local participants + federated participants from all peers
```
Relay A (2 local users)          Relay B (1 local user)
┌─────────────────────┐          ┌─────────────────────┐
│ Room "T"            │          │ Room "T"            │
│  Alice (local)  ────┼──media──►│  Charlie (local)    │
│  Bob (local)    ────┼──media──►│                     │
│                     │◄──media──┼── Charlie           │
│  Charlie (federated)│          │  Alice (federated)  │
│                     │          │  Bob (federated)    │
└─────────────────────┘          └─────────────────────┘
```
#### Signal Messages (new)
```rust
enum FederationSignal {
    /// A room exists on this relay with active participants
    RoomJoin { room: String, participants: Vec<ParticipantInfo> },
    /// Room is empty on this relay
    RoomLeave { room: String },
    /// Participant update for a federated room
    ParticipantUpdate { room: String, participants: Vec<ParticipantInfo> },
}
```
#### Media Forwarding
Federated media is forwarded as raw QUIC datagrams — the relay doesn't decode/re-encode. Each packet is prefixed with a room identifier so the receiving relay knows which room to fan it out to:
```
[room_hash: 8 bytes][original_media_packet]
```
The 8-byte room hash is computed once when the federation room bridge is established.
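The framing above might look like the following sketch. The source does not say which hash produces the 8 bytes, so FNV-1a (64-bit) is used here purely as a stand-in; `room_hash`, `frame`, and `unframe` are hypothetical names.

```rust
pub const ROOM_HASH_LEN: usize = 8;

/// Stand-in hash (FNV-1a, 64-bit). The real relay may use something else.
pub fn room_hash(room: &str) -> [u8; ROOM_HASH_LEN] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in room.as_bytes() {
        h ^= *b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h.to_be_bytes()
}

/// Prefix an opaque media packet with the room hash.
pub fn frame(hash: &[u8; ROOM_HASH_LEN], packet: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(ROOM_HASH_LEN + packet.len());
    out.extend_from_slice(hash);
    out.extend_from_slice(packet);
    out
}

/// Split a federated datagram into (room hash, original media packet).
/// Returns None when the datagram is too short to carry the prefix.
pub fn unframe(datagram: &[u8]) -> Option<([u8; ROOM_HASH_LEN], &[u8])> {
    if datagram.len() < ROOM_HASH_LEN {
        return None;
    }
    let (head, rest) = datagram.split_at(ROOM_HASH_LEN);
    let mut hash = [0u8; ROOM_HASH_LEN];
    hash.copy_from_slice(head);
    Some((hash, rest))
}
```

The receiving relay would look up the hash in a table built when the room bridge was established and fan out `rest` unchanged.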
### What Relays DON'T Do
- **No transcoding** — media passes through as-is. If Alice sends Opus 64k, Charlie receives Opus 64k
- **No re-encryption** — packets are already encrypted end-to-end between participants. Relays just forward opaque bytes
- **No central coordinator** — each relay independently connects to its configured peers. No master/slave, no consensus protocol
- **No automatic peer discovery** — peers must be explicitly configured in YAML
### Failure Handling
- If a peer relay goes down, the federation link drops. Local rooms continue to work. Federated participants disappear from presence.
- Reconnection: first retry after 30 seconds, then exponential backoff capped at 5 minutes
- If a peer relay restarts with a new identity (bug not fixed), the fingerprint check fails and federation is rejected with a clear error log
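The reconnection schedule can be sketched as follows, assuming the delay starts at 30 s, doubles per failed attempt, and is capped at 5 minutes. `reconnect_delay` is a hypothetical helper, and jitter is left out.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based).
/// Assumed schedule: 30 s, 60 s, 120 s, 240 s, then 300 s forever.
pub fn reconnect_delay(attempt: u32) -> Duration {
    const BASE_SECS: u64 = 30;
    const MAX_SECS: u64 = 300;
    // 2^attempt, saturating instead of overflowing for large attempt counts.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_secs(BASE_SECS.saturating_mul(factor).min(MAX_SECS))
}
```

The attempt counter would reset to zero after a successful federation handshake.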
## Implementation Plan
### Phase 0: Fix Relay Identity (prerequisite)
- Derive TLS cert from persisted seed
- Same seed → same cert → same fingerprint across restarts
### Phase 1: YAML Config + Peer Connection
- Add `--config relay.yaml` CLI flag
- Parse peers config
- On startup, connect to all configured peers via QUIC
- Verify certificate fingerprints
- Log helpful message for unconfigured peers
- Reconnect on disconnect
### Phase 2: Room Bridging
- Track which rooms exist on each peer
- Forward media for shared rooms
- Merge participant presence across peers
- Handle room join/leave signals
### Phase 3: Resilience
- Graceful handling of peer disconnect/reconnect
- Don't duplicate packets if a participant is reachable via multiple paths
- Rate limiting on federation links (prevent amplification)
- Metrics: federated rooms, packets forwarded, peer latency
## Effort Estimates
| Phase | Scope | Effort |
|-------|-------|--------|
| 0 | Fix relay TLS identity from seed | 0.5 day |
| 1 | YAML config + peer QUIC connections | 2 days |
| 2 | Room bridging + media forwarding + presence merge | 3-4 days |
| 3 | Resilience + metrics | 2 days |
## Non-Goals (v1)
- Automatic peer discovery (mDNS, DHT, etc.)
- Cascading federation (relay A ↔ B ↔ C where A doesn't know C)
- Load balancing across relays
- Additional encryption between relays (QUIC already provides transport encryption; e2e encryption between participants is orthogonal)
- Different rooms on different relays (all federated rooms are bridged by name)


@@ -0,0 +1,93 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Region-Based Relay Selection
> Phase: Implemented (data model)
> Status: Done (2026-04-14)
> Crate: wzp-client, wzp-proto, wzp-relay
## Problem
Clients are configured with a single relay address. With multiple relays in the federation mesh, the client should automatically discover all available relays and select the lowest-latency one. Currently there is no mechanism for the relay to advertise its mesh peers to clients, and no client-side data structure to track relay health over time.
## Solution
1. Relays advertise their region and mesh peers in `RegisterPresenceAck`
2. Clients maintain a `RelayMap` sorted by measured RTT
3. `preferred()` returns the best relay for call setup
## Implementation
### New Module: `crates/wzp-client/src/relay_map.rs`
**RelayEntry**:
```rust
pub struct RelayEntry {
    pub name: String,
    pub addr: SocketAddr,
    pub region: Option<String>,
    pub rtt_ms: Option<u32>,
    pub last_probed: Option<Instant>,
    pub reachable: bool,
}
```
**RelayMap API**:
- `upsert(name, addr, region)` — add or update a relay entry
- `update_rtt(addr, rtt_ms)` — record probe result, marks reachable, re-sorts
- `mark_unreachable(addr)` — sorts unreachable entries to end
- `preferred()` -> `Option<&RelayEntry>` — lowest RTT reachable relay
- `populate_from_ack(relays, region)` — parse `RegisterPresenceAck.available_relays` (format: `"name|addr"`)
- `needs_reprobe(max_age)` — true if any entry has stale or missing probe
- `stale_entries(max_age)` — list of entries needing fresh probes
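A trimmed-down sketch of how the sort-then-pick behaviour listed above could work. It is not the actual `relay_map.rs`: region and probe timestamps are dropped, the names `MiniRelayMap` and `Entry` are hypothetical, and new entries are assumed to start unreachable until the first probe succeeds.

```rust
use std::net::SocketAddr;

#[derive(Debug, Clone)]
pub struct Entry {
    pub name: String,
    pub addr: SocketAddr,
    pub rtt_ms: Option<u32>,
    pub reachable: bool,
}

#[derive(Default)]
pub struct MiniRelayMap {
    entries: Vec<Entry>,
}

impl MiniRelayMap {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add a relay, or rename it if the address is already known.
    pub fn upsert(&mut self, name: &str, addr: SocketAddr) {
        if let Some(e) = self.entries.iter_mut().find(|e| e.addr == addr) {
            e.name = name.to_string();
            return;
        }
        // Assumption: unprobed relays are not yet considered reachable.
        self.entries.push(Entry { name: name.to_string(), addr, rtt_ms: None, reachable: false });
    }

    /// Record a successful probe, mark the relay reachable, and re-sort.
    pub fn update_rtt(&mut self, addr: SocketAddr, rtt_ms: u32) {
        if let Some(e) = self.entries.iter_mut().find(|e| e.addr == addr) {
            e.rtt_ms = Some(rtt_ms);
            e.reachable = true;
        }
        self.resort();
    }

    pub fn mark_unreachable(&mut self, addr: SocketAddr) {
        if let Some(e) = self.entries.iter_mut().find(|e| e.addr == addr) {
            e.reachable = false;
        }
        self.resort();
    }

    /// Lowest-RTT reachable relay, if any.
    pub fn preferred(&self) -> Option<&Entry> {
        self.entries.first().filter(|e| e.reachable)
    }

    /// Reachable entries first, then ascending RTT. `sort_by_key` is stable,
    /// so relays with equal RTT keep their insertion order.
    fn resort(&mut self) {
        self.entries.sort_by_key(|e| (!e.reachable, e.rtt_ms.unwrap_or(u32::MAX)));
    }
}
```

Keeping the vector sorted on every update makes `preferred()` a constant-time peek at the first element.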
### Signal Protocol Extension
`RegisterPresenceAck` extended:
```rust
RegisterPresenceAck {
    success: bool,
    error: Option<String>,
    relay_build: Option<String>,
    relay_region: Option<String>,     // NEW
    available_relays: Vec<String>,    // NEW: "name|addr" format
}
```
### Relay Config Extension
`RelayConfig` extended:
```rust
pub region: Option<String>, // e.g., "us-east", "eu-west"
pub advertised_addr: Option<SocketAddr>, // for available_relays population
```
### Relay Population
When building `RegisterPresenceAck`, the relay populates:
- `relay_region` from `config.region`
- `available_relays` from `config.peers` (label|url format)
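On the client side, each `"name|addr"` string has to be split and validated. A minimal sketch of that parsing step, assuming malformed entries are simply skipped, could look like this. `parse_relay_entry` is a hypothetical helper, not the actual `populate_from_ack` code.

```rust
use std::net::SocketAddr;

/// Parse one `available_relays` element of the form "name|addr".
/// Returns None for entries with no separator, an empty name,
/// or an address that is not a valid `ip:port`.
pub fn parse_relay_entry(s: &str) -> Option<(String, SocketAddr)> {
    let (name, addr) = s.split_once('|')?;
    let name = name.trim();
    if name.is_empty() {
        return None;
    }
    let addr: SocketAddr = addr.trim().parse().ok()?;
    Some((name.to_string(), addr))
}
```

A caller would typically run `relays.iter().filter_map(|s| parse_relay_entry(s))` and upsert each result into the relay map.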
### Deferred
- **Automatic relay switching** — using `preferred()` to select relay during call setup instead of hardcoded config
- **Background reprobing** — periodic RTT measurements to keep the relay map fresh
- **Cross-relay RTT estimation** — using mesh probe data to estimate combined caller-RTT + callee-RTT for optimal relay placement
## Files
| File | Change |
|------|--------|
| `crates/wzp-client/src/relay_map.rs` | New — RelayMap + RelayEntry |
| `crates/wzp-client/src/lib.rs` | Add `pub mod relay_map` |
| `crates/wzp-proto/src/packet.rs` | `relay_region` + `available_relays` on RegisterPresenceAck |
| `crates/wzp-relay/src/config.rs` | `region` + `advertised_addr` fields |
| `crates/wzp-relay/src/main.rs` | Populate RegisterPresenceAck from config + peers |
## Testing
- 15 unit tests: preferred by RTT, unreachable not preferred, preferred empty/all-unreachable, populate_from_ack (valid + malformed entries), upsert updates/preserves region, needs_reprobe (empty/never/fresh), stale_entries, sort stability with equal RTT, mark_unreachable sorts to end, RelayEntry serialization
- 2 protocol tests: RegisterPresenceAck roundtrip with new fields, backward compat without new fields


@@ -0,0 +1,61 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Studio Quality Tiers (Opus 32k/48k/64k)
## Status: Implemented
Studio quality tiers have been added to the wire protocol and all clients.
## What Was Added
### Wire Protocol (codec_id.rs)
Three new `CodecId` variants using the 4-bit header space (values 6-8):
| CodecId | Wire Value | Bitrate | Frame | Use Case |
|---------|-----------|---------|-------|----------|
| Opus32k | 6 | 32 kbps | 20ms | Studio low — noticeable improvement over 24k for voice |
| Opus48k | 7 | 48 kbps | 20ms | Studio — excellent voice, captures nuance |
| Opus64k | 8 | 64 kbps | 20ms | Studio high — near-transparent quality |
### Quality Profiles
| Profile | Codec | FEC | Bandwidth (with FEC) |
|---------|-------|-----|---------------------|
| STUDIO_32K | Opus 32k | 10% | ~35 kbps |
| STUDIO_48K | Opus 48k | 10% | ~53 kbps |
| STUDIO_64K | Opus 64k | 10% | ~70 kbps |
FEC is set to 10% (vs 20% for GOOD) — studio assumes a good network.
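The "Bandwidth (with FEC)" column follows from scaling the codec bitrate by the FEC overhead. A small sketch of that arithmetic, ignoring header and transport overhead (`bandwidth_with_fec_bps` is a hypothetical helper):

```rust
/// Codec bitrate plus FEC overhead, in bits per second.
/// Headers and QUIC overhead are not included.
pub fn bandwidth_with_fec_bps(codec_bps: u32, fec_percent: u32) -> u32 {
    // Widen to u64 so the multiplication cannot overflow.
    (codec_bps as u64 * (100 + fec_percent as u64) / 100) as u32
}
```

For the three studio profiles this gives 35 200, 52 800, and 70 400 bps, which round to the ~35, ~53, and ~70 kbps shown in the table.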
### Client Support
| Client | Selection | Status |
|--------|-----------|--------|
| Desktop (Tauri) | Quality slider in Settings (8 levels) | Done |
| CLI | `--profile studio-64k` / `studio-48k` / `studio-32k` | Done |
| Android | Needs codec picker update in SettingsScreen.kt | TODO |
| Web | Needs UI | TODO |
### Cross-Codec Interop
All decoder auto-switch paths (call.rs, desktop engine.rs) handle the new codec IDs. A studio-64k client can talk to a codec2-1200 client — the receiver auto-switches.
## When to Use Studio Tiers
- **Podcast recording sessions**: Use studio-64k for best quality (combined with local WAV recording for pristine output)
- **Music collaboration**: Opus at 48-64k captures instrument harmonics much better than 24k
- **Good network conditions**: Only useful when bandwidth isn't constrained; the extra bits are wasted on lossy networks
## When NOT to Use
- **Mobile data**: Stick with Auto/GOOD — studio tiers use 2-3x the bandwidth
- **High packet loss**: Studio profiles use minimal FEC (10%); degraded networks need DEGRADED or CATASTROPHIC profiles with 50-100% FEC
- **Large group calls**: Each participant's stream multiplies bandwidth; 64k * 10 participants = 640 kbps incoming
## Backward Compatibility
Old clients (before this change) will receive packets with CodecId 6/7/8, which they don't recognize. `from_wire()` returns `None` for unknown values, causing the packet to be dropped. Old clients can still *send* to new clients fine (they use CodecId 0-5). This is acceptable for a pre-release protocol.


@@ -0,0 +1,121 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Transport Feedback & Bandwidth Estimator
> **Status:** proposed
> **Resolves:** Audit W6 (no BWE), W14 (no receiver→sender feedback channel).
> **Depends on:** PRD #1 (wire format v2 — for u32 seq).
## Problem
`AdaptiveQualityController` decides tier transitions from loss% and RTT only. Quinn exposes congestion-window and bytes-in-flight, but we don't consume them. There is no receiver→sender feedback channel beyond the inline 4-byte `QualityReport`.
Consequences:
- On stable links with spare capacity, we never upgrade past the declared profile (audio stuck at Opus 24 k when 64 k is available).
- Oscillation between adjacent tiers on the boundary.
- **No bandwidth-aware adaptation = no usable video.** Video without BWE either oscillates wildly or never uses available capacity.
## Goals
- Continuous bandwidth estimate per session, surfaced to adaptation controllers.
- Receiver→sender feedback at ~50 ms cadence carrying ack/nack/remb.
- Audio benefits immediately (smarter upgrades, fewer oscillations).
- Video uses BWE as its primary input (PRD #7).
## Non-goals
- Replacing Quinn's congestion controller — we ride on top.
- Cross-stream BWE (each session estimates independently for v1).
## Design
### `SignalMessage::TransportFeedback`
New signal variant, sent on the existing signal stream every 50 ms or every N media packets, whichever comes first:
```rust
pub struct TransportFeedback {
    pub version: u8,            // PRD #4 W12: always present
    pub stream_id: u8,          // 0 for session-wide; >0 for per-stream
    pub acked_seqs: Vec<u32>,   // recent seqs received OK (RLE-compressed)
    pub nacked_seqs: Vec<u32>,  // recent seqs missing (RLE-compressed)
    pub remb_bps: u32,          // receiver's estimated max bandwidth
    pub recv_time_us: u64,      // arrival-time for sender-side jitter calc
}
```
RLE compression keeps the wire size bounded (typical payload ~50 B).
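The PRD does not fix the RLE encoding. One plausible shape, shown here as a sketch only, is a list of `(start, run_length)` pairs; the function names and the pair representation are assumptions for illustration.

```rust
/// Run-length encode sequence numbers into (start, run_length) pairs.
/// The input is sorted and deduplicated first; this sketch assumes a single
/// report does not straddle the u32 wrap point.
pub fn rle_encode(seqs: &[u32]) -> Vec<(u32, u16)> {
    let mut sorted = seqs.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut runs: Vec<(u32, u16)> = Vec::new();
    for seq in sorted {
        // Extend the current run only if `seq` is the next consecutive value.
        let extend = match runs.last() {
            Some(&(start, len)) => {
                len < u16::MAX && start.wrapping_add(u32::from(len)) == seq
            }
            None => false,
        };
        if extend {
            runs.last_mut().unwrap().1 += 1;
        } else {
            runs.push((seq, 1));
        }
    }
    runs
}

/// Expand (start, run_length) pairs back into individual sequence numbers.
pub fn rle_decode(runs: &[(u32, u16)]) -> Vec<u32> {
    runs.iter()
        .flat_map(|&(start, len)| (0..u32::from(len)).map(move |i| start.wrapping_add(i)))
        .collect()
}
```

On a mostly clean link the acked list collapses to a handful of runs, which is what keeps the payload near the quoted size.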
### `BandwidthEstimator` (in `wzp-proto`)
```rust
pub struct BandwidthEstimator {
    cwnd_bps: AtomicU64,         // from Quinn path stats
    bytes_in_flight: AtomicU64,  // from Quinn path stats
    peer_remb_bps: AtomicU64,    // from TransportFeedback
    smoothed_bps: AtomicU64,     // EWMA output
}

impl BandwidthEstimator {
    pub fn update_from_quinn(&self, stats: &QuinnPathStats);
    pub fn update_from_peer(&self, fb: &TransportFeedback);
    pub fn target_send_bps(&self) -> u64 {
        // 0.9 × min(cwnd_bps, peer_remb_bps), EWMA-smoothed
    }
}
```
Three signals fused:
1. **Quinn cwnd.** Conservative ceiling — sending faster than cwnd just drops or queues.
2. **Peer REMB.** Receiver's perspective on what they can actually consume (after their own jitter buffer, decode budget, etc.).
3. **EWMA smoothing.** Half-life ~2 s; avoids oscillation.
Target = 90 % of `min(cwnd, remb)`, leaving headroom for probing upward.
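The fusion rule can be sketched in a single-threaded form. `BweSketch` is a hypothetical stand-in that drops the atomics and the Quinn plumbing; only the 0.9 factor, the `min`, and the ~2 s half-life come from the text, and the time-based EWMA weight is one reasonable way to realize that half-life.

```rust
/// Simplified, single-threaded stand-in for the estimator (no atomics).
pub struct BweSketch {
    smoothed_bps: f64,
    half_life_s: f64,
}

impl BweSketch {
    pub fn new(initial_bps: f64) -> Self {
        Self { smoothed_bps: initial_bps, half_life_s: 2.0 }
    }

    /// Feed one observation taken `dt_s` seconds after the previous one and
    /// return the smoothed send target in bits per second.
    pub fn update(&mut self, cwnd_bps: u64, peer_remb_bps: u64, dt_s: f64) -> u64 {
        // Raw target: 90 % of the tighter of the two ceilings.
        let raw = 0.9 * cwnd_bps.min(peer_remb_bps) as f64;
        // Weight kept by the old value; it halves every `half_life_s` seconds.
        let keep = 0.5_f64.powf(dt_s / self.half_life_s);
        self.smoothed_bps = keep * self.smoothed_bps + (1.0 - keep) * raw;
        self.smoothed_bps as u64
    }
}
```

With this weighting, one update after exactly one half-life moves the estimate halfway to the raw target, and repeated updates converge on `0.9 × min(cwnd, remb)`.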
### Adaptation controller integration
`AdaptiveQualityController::tick()` already consumes loss/RTT/jitter. Add BWE input:
```rust
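// Require 30 % headroom over the current tier. Integer form of
// `> ceiling * 1.3`, assuming both calls return integer bps values.
if self.bwe.target_send_bps() * 10 > self.current_tier_ceiling_bps() * 13
    && consecutive_upgrade_reports >= UPGRADE_THRESHOLD
{
    self.upgrade_one_tier();
}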
```
Upgrade gated on BWE *headroom*, not just clean reports. Eliminates the "always at Opus 24 k on a fiber link" pathology.
### Probing
To detect unused capacity, the sender occasionally adds 5-10 % padding/FEC during otherwise-clean windows. If `cwnd` doesn't drop and `remb` doesn't fall, the headroom is real and the controller upgrades. If signals degrade, it backs off. Cheap and standard.
## Implementation outline
1. New `wzp-proto::bwe::BandwidthEstimator`.
2. `wzp-transport` exposes `QuinnPathStats { cwnd_bps, bytes_in_flight, rtt_ms }`; already partially there via `QuinnPathSnapshot`.
3. `SignalMessage::TransportFeedback` variant + serde.
4. Receiver-side: track recent seqs in a ring buffer; emit feedback every 50 ms.
5. Sender-side: BWE consumes own Quinn stats + incoming feedback.
6. `AdaptiveQualityController::set_bwe(&BandwidthEstimator)`.
7. Prometheus: `wzp_session_bwe_bps`, `wzp_session_remb_bps`, `wzp_session_cwnd_bps`.
8. Probing logic behind a flag for first deployment.
## Acceptance criteria
- On a shaped 5 Mbps link with Opus 24 k, controller upgrades to Opus 64 k within 30 s.
- On a shaped 50 kbps link, controller stays at Opus 6 k and does not oscillate.
- Feedback wire size < 100 B per 50 ms (= < 2 kB/s, i.e. < 16 kbps overhead).
- Probing finds headroom on a 10 Mbps link in < 60 s.
## Risks
- **Probing-induced loss on already-saturated links.** Mitigation: probe only when smoothed loss < 1 % over 10 s.
- **Feedback storm under heavy loss.** Mitigation: feedback rate capped at 20 Hz independent of media rate.
- **Quinn cwnd lies on QUIC-over-some-VPNs.** Mitigation: REMB serves as cross-check; take min of the two.
## Effort
~4 engineer-days (Wave 2 tasks T2.1-T2.3).


@@ -0,0 +1,116 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Multi-Codec Video Negotiation (H.264 + H.265 + AV1)
> **Status:** proposed
> **Resolves:** Road-to-video Phase V3 codec rollout; reserves `CodecID` slots 9-13.
> **Depends on:** PRD #5 (video v1 working with H.264).
## Problem
H.264 baseline ships first because it has universal hardware encode coverage. H.265 offers ~30 % better compression efficiency at equal quality and is now broadly supported in HW (Apple A10+, Snapdragon since ~2017, NVENC since GTX 9xx). AV1 is the long-term target but hardware encode is limited (Apple M3/A17+, Snapdragon 8 Gen 3+, RTX 40+).
We need codec negotiation so each session uses the best mutually-supported codec without manual configuration, and so we can roll AV1 in gated on real telemetry.
## Goals
- `CodecID` assignments for H.264 baseline (9), H.264 main (10), H.265 main (11), AV1 (12), VP9 reserved (13).
- Capability declaration in `CallOffer.supported_codecs`.
- Picker logic: highest mutually-supported codec from a deterministic preference cascade.
- Hardware-encode detection at session start; refuse codecs requiring SW encode on battery-powered devices.
- Existing framer/depacketizer reused — only the codec wrapper changes.
## Non-goals
- New codecs beyond this list.
- Per-receiver codec selection (one codec per stream for v1; could be revisited with simulcast).
## Design
### Codec capability declaration
```rust
pub struct CodecCapability {
    pub codec_id: u8,
    pub max_resolution: (u16, u16),
    pub max_fps: u8,
    pub hardware: bool,  // true if HW encode available
}

pub struct CallOffer {
    ...
    pub supported_codecs: Vec<CodecCapability>,
}
```
### Preference cascade
```
preference: [AV1, H.265 main, H.264 main, H.264 baseline]

pick = first codec in `preference` where:
    caller.supported.contains(codec)
    AND callee.supported.contains(codec)
    AND (codec.hardware on both sides OR codec.allow_software)
```
`allow_software` defaults to `false` for AV1 (battery cost too high), `true` for H.264 (cheap SW fallback).
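A minimal sketch of the cascade, assuming the codec IDs from the Goals section (9 to 12). `Cap`, `allow_software`, and `pick_codec` are hypothetical names; treating H.265 as HW-only by default follows the "x265 SW (slow, disabled by default)" note and is otherwise an assumption.

```rust
/// Trimmed-down capability entry (hypothetical; the real struct also carries
/// resolution and fps limits).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Cap {
    pub codec_id: u8,
    pub hardware: bool,
}

/// Preference order: AV1 (12), H.265 main (11), H.264 main (10), H.264 baseline (9).
const PREFERENCE: [u8; 4] = [12, 11, 10, 9];

/// Default software policy: only H.264 may fall back to software encode.
fn allow_software(codec_id: u8) -> bool {
    matches!(codec_id, 9 | 10)
}

/// Return the first codec in preference order that both sides declare and
/// that is either hardware-backed on both sides or allowed in software.
pub fn pick_codec(caller: &[Cap], callee: &[Cap]) -> Option<u8> {
    PREFERENCE.iter().copied().find(|&id| {
        let a = caller.iter().find(|c| c.codec_id == id);
        let b = callee.iter().find(|c| c.codec_id == id);
        match (a, b) {
            (Some(a), Some(b)) => (a.hardware && b.hardware) || allow_software(id),
            _ => false,
        }
    })
}
```

Because the preference list is fixed and the search is a pure function of the two capability lists, the result is deterministic for a given pair of offers.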
### Per-codec details
| ID | Codec | Encoder priority |
|---|---|---|
| 9 | H.264 baseline | VideoToolbox / MediaCodec / NVENC / QSV / AMF / VAAPI; OpenH264 SW |
| 10 | H.264 main | Same HW; same SW |
| 11 | H.265 main | VideoToolbox A10+ / MediaCodec / NVENC GTX 9xx+ / QSV Skylake+; x265 SW (slow, disabled by default) |
| 12 | AV1 | VideoToolbox M3+/A17+ / MediaCodec SD8G3+ / NVENC RTX 40+; SVT-AV1 SW (gated) |
| 13 | VP9 | Reserved; may not implement |
### Framer reuse
The 16 B `MediaHeader` carries `codec_id`. The framer doesn't care which codec — it fragments NALs (for H.264/H.265) or OBUs (for AV1) into MTU-sized chunks, sets `KeyFrame`/`FrameEnd` bits, and passes payload through. Per-codec parameter sets (SPS/PPS for H.264/H.265, sequence header OBU for AV1) ship on the signal stream.
### Mid-call codec switch
Optional in v1. If implemented:
- Sender sends `SignalMessage::CodecSwitch { stream_id, new_codec_id, parameter_sets }`.
- Receiver swaps decoder and emits PLI to force a clean keyframe.
## Implementation outline
1. `CodecCapability` declaration + serde (additive change).
2. HW probe at session start (per platform).
3. Picker logic in `CallOffer`/`CallAnswer` flow.
4. H.265 encoder/decoder wrappers (VideoToolbox + MediaCodec).
5. AV1 encoder/decoder wrappers, gated on HW (SVT-AV1 fallback behind flag).
6. Prometheus: `wzp_session_codec_id_total{codec}` for telemetry on actual codec usage.
## Acceptance criteria
- Two macOS clients (M1 + M3) pick H.265 by default; M3 + iPhone 15 Pro pick AV1.
- M1 + Android device without H.265 HW picks H.264.
- Codec selection is deterministic given both sides' capabilities.
- AV1 refused on devices without HW unless `allow_software` flag explicitly set.
## Rollout gates
- H.264 baseline + main: ship with PRD #5.
- H.265: enable by default once HW probe accuracy verified on 5+ macOS + 5+ Android devices.
- AV1: 20 % of session-start probes must report HW encode capability before enabling by default. Until then, available only via debug flag.
## Risks
- **AV1 SW encode torches battery.** Mitigation: HW gate is mandatory; SW fallback off by default.
- **H.265 patent surface.** Mitigation: rely on platform-provided HW encoders (license covered upstream); avoid shipping x265 binary.
- **HW probe lies on some Android devices.** Mitigation: in-session fallback if encoder errors at start; degrade one codec tier.
## Effort
- H.265 wrappers: 3 d (T5.4)
- AV1 wrappers + HW gate: 5 d (T6.1)
- Picker + capability declaration: 1 d
Total: ~9 engineer-days, in Waves 5-6.


@@ -0,0 +1,165 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Video Quality Controller + PriorityMode
> **Status:** proposed
> **Resolves:** Road-to-video Phase V5 (video adaptive controller, audio-priority gate, ScreenShare slide-mode).
> **Depends on:** PRD #3 (BWE), PRD #5 (video v1).
## Problem
Audio and video share a finite bandwidth budget. The FaceTime model — audio absolute priority, video elastic on top — is right for the default voice/video call, but it's wrong for screen-share / presentation where a frozen slide deck is worse than slightly degraded audio.
We need: a single `VideoQualityController` consuming BWE, with a policy gate driven by a user/product-selectable `PriorityMode`.
## Goals
- `PriorityMode` enum carried on `QualityProfile`.
- Per-mode allocation gates: `AudioFirst`, `VideoFirst`, `ScreenShare`, `Balanced`.
- Mid-call `SetPriorityMode` signal for runtime override.
- ScreenShare slide-fallback: when bandwidth drops below SD video floor, encoder switches to single-I-frame-every-N-seconds mode (no wire format change).
- Sensible defaults per call type (voice/video call → AudioFirst; presentation app → ScreenShare).
## Non-goals
- Multi-stream priority (e.g., one HD + one screen-share in the same session — separate work).
- Custom user-defined modes; only the four enum variants.
## Design
### `PriorityMode`
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum PriorityMode {
    AudioFirst,   // default for voice/video calls
    VideoFirst,   // user override
    ScreenShare,  // video + slide fallback; audio = intelligible speech only
    Balanced,     // proportional split
}
```
Carried on `QualityProfile`:
```rust
pub struct QualityProfile {
    ...
    pub priority_mode: PriorityMode,  // default AudioFirst
    pub video_bitrate_kbps: Option<u32>,
    pub video_resolution: Option<(u16, u16)>,
    pub video_fps: Option<u8>,
}
```
Mid-call change:
```rust
SignalMessage::SetPriorityMode {
    version: u8,
    mode: PriorityMode,
}
```
### Allocation gates
```
let bwe = bandwidth_estimator.target_send_bps();
match priority_mode {
    AudioFirst => {
        audio_budget = max(24_kbps, audio_tier_min);  // audio floor first
        video_budget = bwe.saturating_sub(audio_budget);
        // video → 0 before audio degrades below floor
    }
    VideoFirst => {
        video_budget = max(video_floor, target_video_bps);
        audio_budget = bwe.saturating_sub(video_budget);
        // audio degrades to Opus 16k floor first
    }
    ScreenShare => {
        // Audio gets just enough for intelligible speech.
        audio_budget = 16_kbps;
        video_budget = bwe.saturating_sub(audio_budget);
        if video_budget < SD_VIDEO_FLOOR {
            encoder.set_mode(EncoderMode::SlideFallback);
        }
    }
    Balanced => {
        audio_budget = (bwe as f64 * 0.15) as u64;
        video_budget = bwe - audio_budget;
    }
}
```
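The gates above are pseudocode. A compilable transcription might look like the sketch below, which returns the split instead of mutating an encoder. The `Budget` struct, the constant names, the 150 kbps floor (borrowed from the slide-fallback example), and the use of one floor for both `VideoFirst` and `ScreenShare` are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PriorityMode {
    AudioFirst,
    VideoFirst,
    ScreenShare,
    Balanced,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Budget {
    pub audio_bps: u64,
    pub video_bps: u64,
    pub slide_fallback: bool,
}

// Assumed values for illustration.
const AUDIO_FLOOR_BPS: u64 = 24_000;
const SPEECH_FLOOR_BPS: u64 = 16_000;
const SD_VIDEO_FLOOR_BPS: u64 = 150_000;

/// Split the bandwidth estimate between audio and video for one mode.
pub fn allocate(
    mode: PriorityMode,
    bwe_bps: u64,
    audio_tier_min_bps: u64,
    target_video_bps: u64,
) -> Budget {
    match mode {
        PriorityMode::AudioFirst => {
            // Audio floor is reserved first; video gets whatever is left.
            let audio = AUDIO_FLOOR_BPS.max(audio_tier_min_bps);
            Budget { audio_bps: audio, video_bps: bwe_bps.saturating_sub(audio), slide_fallback: false }
        }
        PriorityMode::VideoFirst => {
            // Video target is reserved first; audio gets the remainder.
            let video = SD_VIDEO_FLOOR_BPS.max(target_video_bps);
            Budget { audio_bps: bwe_bps.saturating_sub(video), video_bps: video, slide_fallback: false }
        }
        PriorityMode::ScreenShare => {
            // Audio gets just enough for intelligible speech.
            let audio = SPEECH_FLOOR_BPS;
            let video = bwe_bps.saturating_sub(audio);
            Budget { audio_bps: audio, video_bps: video, slide_fallback: video < SD_VIDEO_FLOOR_BPS }
        }
        PriorityMode::Balanced => {
            // 15 % to audio, computed in integers to avoid float rounding.
            let audio = bwe_bps * 15 / 100;
            Budget { audio_bps: audio, video_bps: bwe_bps - audio, slide_fallback: false }
        }
    }
}
```

This is a literal transcription: in `VideoFirst` the video budget can exceed the estimate when the floor is above it, exactly as in the pseudocode, so a real controller would still need to clamp or degrade further.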
### `VideoQualityController`
```rust
pub struct VideoQualityController {
    bwe: Arc<BandwidthEstimator>,
    mode: AtomicU8,  // PriorityMode
    encoder: Arc<dyn VideoEncoder>,
    loss_pct: AtomicU8,
    rtt_ms: AtomicU32,
    encoder_queue_ms: AtomicU32,
}

impl VideoQualityController {
    pub fn tick(&self) {
        let budget = self.allocate();
        let target = self.derive_target(budget);  // (bitrate, fps, resolution, layer)
        self.encoder.set_target(target);
    }
}
```
`derive_target` maps `(budget, loss, rtt, queue)` to encoder parameters via a step table. Smoothed; no jumps larger than 2× per second.
### ScreenShare slide-fallback
Pure encoder policy:
- Normal video: continuous frames, target fps (5-15 for screen content).
- When `video_budget < SD_VIDEO_FLOOR` (e.g., 150 kbps): switch to slide mode.
- Slide mode: emit one high-quality I-frame every 2-5 s. No P-frames. Encoder prefers H.265 or AV1 (text legibility).
- Wire format: `KeyFrame=1` on every packet, `FrameEnd=1` on last packet of slide. No new fields.
Receiver doesn't know slide mode is on — just sees keyframes arriving slowly.
### Defaults
| Product flow | Default mode |
|---|---|
| Voice call | AudioFirst (no video) |
| Video call | AudioFirst |
| Screen share | ScreenShare |
| User toggle in settings | VideoFirst or Balanced |
## Implementation outline
1. `PriorityMode` enum + serde + `QualityProfile` field (T5.1).
2. `SetPriorityMode` signal variant (T5.1).
3. `VideoQualityController::new` + `tick` (T5.2).
4. Per-mode allocation gates (T5.2).
5. `EncoderMode::SlideFallback` in `wzp-video` (T5.3).
6. Integration: `CallEngine` honors `SetPriorityMode` within 1 s.
7. UI plumbing for runtime toggle (out of scope here; tracked by platform team).
## Acceptance criteria
- 100 kbps shaped link, `AudioFirst`: audio holds Opus 24 k, video drops to 0.
- 100 kbps shaped link, `ScreenShare`: audio holds Opus 16 k, video in slide mode emits 1 I-frame / 3 s.
- 100 kbps shaped link, `VideoFirst`: audio drops to Opus 16 k, video holds floor.
- 5 Mbps link, `AudioFirst`: video reaches HD within 10 s.
- `SetPriorityMode` mid-call applied within 1 s.
## Risks
- **Mode flapping under unstable BWE.** Mitigation: 10 s dwell time before allowing mode-driven encoder reconfiguration.
- **Slide mode mistaken for poor connection by users.** Mitigation: UI indicator distinguishing "slide mode active" from "poor connection".
- **AudioFirst floor too aggressive for low-bandwidth music calls.** Mitigation: when audio profile is `Opus 64k music`, floor raised to 48 k.
## Effort
~6 engineer-days (Wave 5 tasks T5.1-T5.3).


@@ -0,0 +1,111 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Simulcast + Per-Receiver Layer Selection
> **Status:** proposed
> **Resolves:** Road-to-video Phases V5 + V6 (simulcast at sender, layer selection at SFU).
> **Depends on:** PRD #5 (video v1), PRD #7 (VideoQualityController).
## Problem
In a multi-peer video room, peers have wildly different link quality. A single uplink stream forces a choice: encode for the worst peer (everyone sees SD) or encode for the best peer (poor peers drop out). Simulcast solves this — sender uploads multiple independent layers, and the SFU forwards the appropriate layer to each receiver based on their current quality.
WZP's v2 wire format already reserves `stream_id: u8` for this. This PRD wires it up.
## Goals
- Sender emits 2-3 simultaneous H.264/H.265/AV1 streams per source (different bitrate/resolution).
- Each layer tagged by `stream_id` (0 = base/SD, 1 = mid/HD, 2 = high/FHD).
- SFU selects per-receiver which layer to forward, based on that receiver's last `QualityReport` / BWE.
- Layer switches are seamless (next keyframe boundary) and don't require sender involvement.
- Mixed-quality rooms work: best peer gets FHD, worst peer gets SD, no peer holds the room back.
## Non-goals
- SVC (per-layer temporal scalability within one bitstream). Simulcast achieves the same outcome with a simpler encoder.
- Audio simulcast (audio is small; not worth the encode cost).
## Design
### Sender side
Three encoder instances per source:
| `stream_id` | Resolution | Target bitrate | Frame rate |
|---|---|---|---|
| 0 (low) | 480×270 | 150 kbps | 15 fps |
| 1 (mid) | 960×540 | 600 kbps | 30 fps |
| 2 (high) | 1920×1080 | 2.5 Mbps | 30 fps |
Resolution/bitrate ladder configurable per profile. Encoders share input frames (downsample for low/mid).
Each layer is an independent stream with its own `sequence`, `timestamp_ms`, and FEC blocks. Identified on the wire by `stream_id` byte in `MediaHeader` v2.
### SFU forwarding
`RoomManager` per-receiver state:
```rust
pub struct ReceiverState {
    fingerprint: Fingerprint,
    bwe_kbps: AtomicU32,
    loss_pct: AtomicU8,
    selected_layer: AtomicU8,  // per (sender, source_stream)
}
```
Layer selection logic (run periodically per receiver):
```
if receiver.bwe_kbps > HIGH_THRESHOLD && receiver.loss_pct < 2:
    selected_layer = high
elif receiver.bwe_kbps > MID_THRESHOLD:
    selected_layer = mid
else:
    selected_layer = low
```
Hysteresis: must hold new tier for 3 s before switching.
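The selection rule and the 3 s hold can be combined into a small state machine. This is a sketch under assumptions: the threshold values are placeholders (the PRD names `HIGH_THRESHOLD` and `MID_THRESHOLD` without numbers), and `LayerSelector` is a hypothetical single-threaded type rather than the atomics-based `ReceiverState`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    Low = 0,
    Mid = 1,
    High = 2,
}

// Placeholder thresholds; the PRD does not give values.
const HIGH_THRESHOLD_KBPS: u32 = 3_000;
const MID_THRESHOLD_KBPS: u32 = 800;
const HOLD_MS: u64 = 3_000;

/// Stateless rule: which layer the current measurements ask for.
pub fn desired_layer(bwe_kbps: u32, loss_pct: u8) -> Layer {
    if bwe_kbps > HIGH_THRESHOLD_KBPS && loss_pct < 2 {
        Layer::High
    } else if bwe_kbps > MID_THRESHOLD_KBPS {
        Layer::Mid
    } else {
        Layer::Low
    }
}

pub struct LayerSelector {
    selected: Layer,
    /// Layer waiting to be adopted and the time it was first requested.
    candidate: Option<(Layer, u64)>,
}

impl LayerSelector {
    pub fn new() -> Self {
        Self { selected: Layer::Low, candidate: None }
    }

    /// Call periodically; switches only after the new layer has been
    /// requested continuously for `HOLD_MS`.
    pub fn tick(&mut self, now_ms: u64, bwe_kbps: u32, loss_pct: u8) -> Layer {
        let want = desired_layer(bwe_kbps, loss_pct);
        if want == self.selected {
            self.candidate = None;
            return self.selected;
        }
        let candidate = self.candidate;
        match candidate {
            Some((layer, since)) if layer == want => {
                if now_ms.saturating_sub(since) >= HOLD_MS {
                    self.selected = want;
                    self.candidate = None;
                }
            }
            // New or changed request: restart the hold timer.
            _ => self.candidate = Some((want, now_ms)),
        }
        self.selected
    }
}
```

A brief dip or spike resets the pending candidate, so the forwarded layer only changes after three consecutive seconds of agreement.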
On layer switch:
- SFU continues forwarding the old layer until the next keyframe arrives on the new layer.
- If no keyframe on the new layer within 500 ms, SFU emits PLI to sender for that layer.
### Per-layer keyframe cache
PRD #5 keyframe cache extended: one cache entry per `(room, sender, stream_id)`. New joiner gets the most recent keyframe from the layer matched to their BWE.
### Layer-aware PLI suppression
PLI is layer-scoped. Sender refreshes only the requested layer, not all three.
## Implementation outline
1. `VideoQualityController` extended to drive 3 encoder instances per source (T5.5).
2. Frame distributor: downsample input frame for low/mid layers before encode.
3. Per-layer state on `MediaHeader` (already in v2 via `stream_id`).
4. SFU `ReceiverState` and selection logic (T5.6).
5. Per-layer keyframe cache (extension of PRD #5).
6. Per-layer PLI plumbing.
7. Telemetry: `wzp_room_layer_distribution{stream_id}` histogram.
## Acceptance criteria
- 3-encoder uplink works on M1 within 8 % CPU at 1080p30 / 540p30 / 270p15.
- 4-peer room with shaped links (5 Mbps, 1 Mbps, 500 kbps, 100 kbps): each peer receives the highest layer their link supports.
- Layer switch under improving link conditions occurs within 5 s of bandwidth recovery.
- No peer's bandwidth degradation holds back any other peer.
## Risks
- **3-encoder CPU cost on mid/low-end Android.** Mitigation: dynamic layer count — drop high layer if encoder queue grows; some devices may only support 2 layers.
- **Frame-rate drift between layers** (independent encoders running). Mitigation: shared frame clock; low/mid layers drop frames if needed to stay aligned.
- **SFU per-receiver state bloat.** Mitigation: only allocate state for active receivers; 80 B/receiver/sender bound.
- **Layer switch causing brief visible flicker.** Mitigation: switch only at keyframes; UI may show momentary resolution change but no glitch.
## Effort
~7 engineer-days (Wave 5 tasks T5.5 + T5.6).

vault/PRDs/PRD-video-v1.md Normal file

@@ -0,0 +1,137 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Video v1 — H.264 Single-Layer
> **Status:** proposed
> **Resolves:** Road-to-video Phases V3 + V4 (encoder/decoder, framer, NACK, keyframe cache).
> **Depends on:** PRD #1 (wire format v2), PRD #3 (TransportFeedback + BWE).
## Problem
WZP has no video path. Add a working unidirectional video call (macOS↔macOS first, then Android↔macOS) using H.264 baseline, with loss recovery appropriate for lossy mobile links.
## Goals
- New `wzp-video` crate parallel to `wzp-codec`.
- H.264 baseline encode/decode using platform hardware encoders.
- NAL fragmentation and access-unit reassembly conformant to our 16 B `MediaHeader` v2.
- NACK loop for P-frame loss (RTT-gated).
- Dynamic FEC ratio boost on I-frame packets.
- SFU keyframe cache for fast join-to-first-frame.
- PLI suppression at SFU to bound upstream keyframe-request traffic.
## Non-goals
- Multi-codec negotiation (PRD #6).
- Simulcast or per-receiver layer selection (PRD #8).
- VideoQualityController logic beyond a fixed bitrate target (PRD #7).
- Native camera capture pipelines (separate platform work).
## Design
### `wzp-video` crate
```
wzp-video/
  src/
    encoder.rs       # trait VideoEncoder
                     #   VideoToolboxEncoder (macOS)
                     #   MediaCodecEncoder (Android, JNI)
                     #   OpenH264Encoder (software fallback)
    decoder.rs       # trait VideoDecoder; mirror per-platform
    framer.rs        # H.264 NAL fragmentation to MTU-sized chunks
    depacketizer.rs  # Reassemble NALs, emit access units
    keyframe.rs      # Keyframe request handling, sender + receiver
    config.rs        # SPS/PPS shipment over signal stream
```
### Framing
One access unit (frame) → N packets, each ≤ `MTU - 16 (header) - 16 (AEAD tag)`.
- `sequence` global per (session, stream_id), advances per packet.
- `timestamp_ms` is presentation time, equal across all packets of a single access unit.
- `KeyFrame` bit set on every packet of an I-frame.
- `FrameEnd` bit set on the last packet of the access unit.
- `fec_block_id` per access unit (u16 in v2, large blocks).
Parameter sets (SPS/PPS) ride on the **signal stream**, not media datagrams. Sent at session start and on codec change. Reliable, ordered, one-time.
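The framing rules can be illustrated with a short sketch. `Fragment` and `fragment_access_unit` are hypothetical names, the flag bit positions follow the `MediaHeader` v2 layout (bit 5 KeyFrame, bit 4 FrameEnd), and the sketch splits on raw byte boundaries without looking at NAL structure or FEC.

```rust
pub const FLAG_KEYFRAME: u8 = 1 << 5;
pub const FLAG_FRAME_END: u8 = 1 << 4;

const HEADER_LEN: usize = 16;
const AEAD_TAG_LEN: usize = 16;

/// One outgoing packet before header serialization and encryption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fragment {
    pub sequence: u32,
    pub timestamp_ms: u32,
    pub flags: u8,
    pub payload: Vec<u8>,
}

/// Split one access unit into packets of at most `mtu - 16 - 16` payload bytes.
/// `next_seq` is the per-(session, stream_id) counter and advances per packet.
pub fn fragment_access_unit(
    au: &[u8],
    mtu: usize,
    next_seq: &mut u32,
    timestamp_ms: u32,
    is_keyframe: bool,
) -> Vec<Fragment> {
    let max_payload = mtu.saturating_sub(HEADER_LEN + AEAD_TAG_LEN).max(1);
    let total = (au.len() + max_payload - 1) / max_payload;
    let mut out = Vec::with_capacity(total);
    for (i, chunk) in au.chunks(max_payload).enumerate() {
        let mut flags = 0u8;
        if is_keyframe {
            flags |= FLAG_KEYFRAME; // set on every packet of an I-frame
        }
        if i + 1 == total {
            flags |= FLAG_FRAME_END; // set only on the last packet
        }
        out.push(Fragment {
            sequence: *next_seq,
            timestamp_ms, // identical across the whole access unit
            flags,
            payload: chunk.to_vec(),
        });
        *next_seq = next_seq.wrapping_add(1);
    }
    out
}
```

For example, with an assumed MTU of 1232 bytes a 3000-byte I-frame becomes three packets of 1200, 1200, and 600 payload bytes, all carrying `KeyFrame` and only the last carrying `FrameEnd`.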
### NACK loop
```
SignalMessage::Nack {
    version: u8,
    stream_id: u8,
    seqs: Vec<u32>,  // missing P-frame packets
}
```
Receiver behavior:
- If access unit incomplete after `frame_interval` ms:
  - If `RTT < 2 × frame_interval`: emit `Nack`.
  - Else: emit `PictureLossIndication`.
- Backoff: max 1 Nack per (stream, seq) per 2 × RTT.
Sender behavior:
- On `Nack`: re-transmit if packet is still in send buffer (last 500 ms).
- On `PictureLossIndication`: emit a fresh I-frame within 200 ms.
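The receiver-side decision reduces to two small predicates. The names below (`LossAction`, `choose_recovery`, `may_send_nack`) are hypothetical; the thresholds are the ones stated in the receiver behavior list.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LossAction {
    Nack,
    PictureLossIndication,
}

/// Retransmission only helps if the repair can arrive in time:
/// NACK when RTT < 2 × frame interval, otherwise ask for a keyframe.
pub fn choose_recovery(rtt_ms: u32, frame_interval_ms: u32) -> LossAction {
    if rtt_ms < frame_interval_ms.saturating_mul(2) {
        LossAction::Nack
    } else {
        LossAction::PictureLossIndication
    }
}

/// Backoff: at most one Nack per (stream, seq) per 2 × RTT.
/// `last_nack_ms` is when this seq was last NACKed, if ever.
pub fn may_send_nack(last_nack_ms: Option<u64>, now_ms: u64, rtt_ms: u32) -> bool {
    match last_nack_ms {
        None => true,
        Some(t) => now_ms.saturating_sub(t) >= 2 * u64::from(rtt_ms),
    }
}
```

At 30 fps (a frame interval of about 33 ms) a 50 ms RTT selects `Nack` and a 300 ms RTT selects `PictureLossIndication`, which matches the two loss scenarios in the acceptance criteria.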
### Dynamic FEC on I-frames
Encoder marks packets belonging to I-frames. FEC layer applies a higher ratio (default 0.5) to I-frame blocks, vs. nominal (0.1) for P-frames. Configurable.
### SFU keyframe cache
`RoomManager` maintains per `(room, sender, stream_id)`:
```rust
struct KeyframeCache {
    packets: Vec<Bytes>,  // most recent complete I-frame
    timestamp_ms: u32,
    sequence_first: u32,
}
```
On new participant join, cache is replayed before live forwarding starts. Eliminates 2 s black-screen-on-join.
Cache TTL: replaced whenever a new complete I-frame arrives.
### PLI suppression
If ≥ 2 receivers PLI within 200 ms for the same `(sender, stream_id)`, the SFU emits one `KeyframeRequest` upstream, not N. Tracked per-(sender, stream).
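One simple way to get this coalescing, sketched here under assumptions, is to forward the first PLI for a `(sender, stream_id)` key and swallow any further PLIs for that key until the window has elapsed. `PliSuppressor` and the `u64` sender key are hypothetical.

```rust
use std::collections::HashMap;

/// Coalesces PLIs per (sender, stream_id) within a fixed time window.
pub struct PliSuppressor {
    window_ms: u64,
    /// Time of the last request forwarded upstream for each key.
    last_forwarded: HashMap<(u64, u8), u64>,
}

impl PliSuppressor {
    pub fn new(window_ms: u64) -> Self {
        Self { window_ms, last_forwarded: HashMap::new() }
    }

    /// Returns true if this PLI should go upstream as a keyframe request,
    /// false if an earlier request inside the window already covers it.
    pub fn on_pli(&mut self, sender: u64, stream_id: u8, now_ms: u64) -> bool {
        let key = (sender, stream_id);
        let window = self.window_ms;
        let recent = self
            .last_forwarded
            .get(&key)
            .map_or(false, |&t| now_ms.saturating_sub(t) < window);
        if recent {
            return false;
        }
        self.last_forwarded.insert(key, now_ms);
        true
    }
}
```

With a 200 ms window, eight receivers reporting loss at nearly the same moment produce a single upstream request per affected stream.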
## Implementation outline
1. `wzp-video` crate scaffold (T4.1).
2. Framer/depacketizer with property tests (T4.1).
3. VideoToolbox encoder/decoder (macOS) (T4.2).
4. MediaCodec encoder/decoder (Android, JNI) (T4.3).
5. NACK signal + sender/receiver state machines (T4.4).
6. I-frame FEC ratio hint plumbed from encoder to FEC layer (T4.5).
7. SFU keyframe cache (T4.6).
8. PLI suppression (T4.7).
9. End-to-end test: macOS sender → relay → macOS receiver, 5 min call, < 1 % loss network.
## Acceptance criteria
- Unidirectional H.264 720p30 call macOS↔macOS, CPU < 5 % on M1.
- Android↔macOS works with MediaCodec (surface-texture path).
- Black-screen-on-join < 200 ms when keyframe cache is warm.
- Under 5 % synthetic packet loss at 50 ms RTT: NACK recovery keeps video smooth, < 1 keyframe / 2 s.
- Under 5 % synthetic packet loss at 300 ms RTT: PLI fallback fires, keyframe rate ~ 1 / s.
- Upstream PLI traffic at SFU < 2 / s under simulated mass packet loss with 8 receivers.
## Risks
- **MediaCodec surface-texture edge cases.** Per-device matrix; software fallback path mandatory.
- **VideoToolbox H.264 baseline restrictions** (some profiles are main-only in HW). Mitigation: profile detection at session start.
- **NACK storm under heavy loss.** Mitigation: rate cap (max 50 Nacks/s/receiver) and exponential backoff.
- **Keyframe cache memory footprint** (one I-frame per active stream per room). Mitigation: cap cache at 200 KB; if exceeded, drop and rely on PLI.
## Effort
~3 weeks (Wave 4 tasks T4.1-T4.7).


@@ -0,0 +1,119 @@
---
tags: [prd, wzp]
type: prd
---
# PRD: Wire Format v2
> **Status:** proposed
> **Resolves:** Audit W1, W4, W9, W10. Keystone prerequisite for video and per-`MediaType` conformance enforcement.
> **References:** `docs/WZP-SPEC.md`, `docs/ROAD-TO-VIDEO.md` Phase V1, `docs/PROTOCOL-AUDIT.md`.
## Problem
v1 wire format has four structural problems that compound the moment video lands:
- 16-bit sequence wraps in ~21 min at 50 pps (W1)
- MiniHeader has no sequence delta, so a missed full header desyncs (W4)
- CodecID is 4 bits → 16 codec slots, 9 used; video will exhaust it (W9)
- No `MediaType` field → SFU cannot distinguish audio/video/data without a codec lookup (W10)
Fixing these post-deployment is a multi-client coordinated break. Fix once, before video.
## Goals
- One wire-format change resolves W1, W4, W9, W10 and reserves headroom for the next decade.
- v1 and v2 can co-exist briefly during rollout via explicit version handshake (typed rejection, not silent corruption).
- All 571 audio tests pass under v2.
## Non-goals
- Backward wire compatibility (we will not encode v2 atop v1 — it is a clean break).
- Video framing rules themselves (covered by PRD #5).
- New codec IDs beyond reservation (covered by PRDs #5, #6).
## Design
### `MediaHeader` v2 (16 bytes, byte-aligned)
```
Byte 0:      version      (u8)      0x02
Byte 1:      flags        (u8)      bit 7: T (FEC repair)
                                    bit 6: Q (QualityReport trailer present, inside AEAD)
                                    bit 5: KeyFrame (video I-frame packet)
                                    bit 4: FrameEnd (last packet of access unit)
                                    bits 3-0: reserved (must be 0)
Byte 2:      media_type   (u8)      0=audio, 1=video, 2=data, 3=control
Byte 3:      codec_id     (u8)
Byte 4:      stream_id    (u8)      0=base; simulcast layers 1..N
Byte 5:      fec_ratio    (u8)      0..200 → 0.0..2.0
Bytes 6-9:   sequence     (u32 BE)
Bytes 10-13: timestamp_ms (u32 BE)
Bytes 14-15: fec_block_id (u16 BE)
                                    audio: low 8 bits = block_id, high 8 = symbol_idx
                                    video: full u16 block_id (large FEC blocks for I-frames)
```
Justification for byte alignment (16 B over 12 B packed) is in `ROAD-TO-VIDEO.md` Phase V1; benchmarks showed ≤ 0.32 % stream overhead delta across all scenarios.
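A round-trip sketch of the 16-byte layout, using only the standard library. `MediaHeaderV2` here is an illustrative struct, not the type in `wzp-proto`; rejecting non-zero reserved bits on decode is one possible reading of "must be 0" and is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MediaHeaderV2 {
    pub flags: u8,
    pub media_type: u8,
    pub codec_id: u8,
    pub stream_id: u8,
    pub fec_ratio: u8,
    pub sequence: u32,
    pub timestamp_ms: u32,
    pub fec_block_id: u16,
}

impl MediaHeaderV2 {
    pub const LEN: usize = 16;
    pub const VERSION: u8 = 0x02;

    /// Serialize into the byte-aligned layout; multi-byte fields are big-endian.
    pub fn encode(&self) -> [u8; 16] {
        let mut b = [0u8; 16];
        b[0] = Self::VERSION;
        b[1] = self.flags;
        b[2] = self.media_type;
        b[3] = self.codec_id;
        b[4] = self.stream_id;
        b[5] = self.fec_ratio;
        b[6..10].copy_from_slice(&self.sequence.to_be_bytes());
        b[10..14].copy_from_slice(&self.timestamp_ms.to_be_bytes());
        b[14..16].copy_from_slice(&self.fec_block_id.to_be_bytes());
        b
    }

    /// Parse a header; returns `None` on short input, wrong version byte,
    /// or non-zero reserved flag bits (assumed policy).
    pub fn decode(b: &[u8]) -> Option<Self> {
        if b.len() < Self::LEN || b[0] != Self::VERSION || (b[1] & 0x0F) != 0 {
            return None;
        }
        Some(Self {
            flags: b[1],
            media_type: b[2],
            codec_id: b[3],
            stream_id: b[4],
            fec_ratio: b[5],
            sequence: u32::from_be_bytes([b[6], b[7], b[8], b[9]]),
            timestamp_ms: u32::from_be_bytes([b[10], b[11], b[12], b[13]]),
            fec_block_id: u16::from_be_bytes([b[14], b[15]]),
        })
    }
}
```

Because every field sits on a byte boundary, a relay can read `media_type` at offset 2 without any bit shifting or codec lookup.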
### `MiniHeader` v2 (5 bytes)
```
[FRAME_TYPE_MINI = 0x01]
Byte 0:    seq_delta          (u8)      ← new; resolves W4
Bytes 1-2: timestamp_delta_ms (u16 BE)
Bytes 3-4: payload_len        (u16 BE)
Audio only. Video pays the full 16 B header per packet (no clean periodic structure to compress).
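A sketch of the 5-byte body and of how `seq_delta` lets a receiver rebuild absolute values from the last full header. The struct and method names are hypothetical, and the leading `FRAME_TYPE_MINI` byte is treated as framing outside this struct, which is an assumption about the layout.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MiniHeaderV2 {
    pub seq_delta: u8,
    pub timestamp_delta_ms: u16,
    pub payload_len: u16,
}

impl MiniHeaderV2 {
    pub const LEN: usize = 5;

    pub fn encode(&self) -> [u8; 5] {
        let ts = self.timestamp_delta_ms.to_be_bytes();
        let len = self.payload_len.to_be_bytes();
        [self.seq_delta, ts[0], ts[1], len[0], len[1]]
    }

    pub fn decode(b: &[u8]) -> Option<Self> {
        if b.len() < Self::LEN {
            return None;
        }
        Some(Self {
            seq_delta: b[0],
            timestamp_delta_ms: u16::from_be_bytes([b[1], b[2]]),
            payload_len: u16::from_be_bytes([b[3], b[4]]),
        })
    }

    /// Rebuild absolute (sequence, timestamp_ms) from the reference values
    /// carried by the last full header. Wrapping arithmetic handles u32 rollover.
    pub fn expand(&self, base_seq: u32, base_timestamp_ms: u32) -> (u32, u32) {
        (
            base_seq.wrapping_add(u32::from(self.seq_delta)),
            base_timestamp_ms.wrapping_add(u32::from(self.timestamp_delta_ms)),
        )
    }
}
```

Carrying the delta explicitly means a receiver that misses an intermediate packet can still place the next one correctly, which is the desync case W4 describes.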
### Version negotiation
`CallOffer` and `CallAnswer` already carry supported profiles. Add:
```rust
struct CallOffer {
    ...
    protocol_version: u8,         // 2 in v2 clients
    supported_versions: Vec<u8>,  // e.g. [2]
}
```
Relay/peer side:
- If `protocol_version` is supported → proceed.
- If unsupported → close with `Hangup::ProtocolVersionMismatch { server_supported: Vec<u8> }`.
No silent fallback. No mixed-version session.
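The relay-side check is small enough to sketch in full. `Hangup` is reduced to the one variant named above and `check_version` is a hypothetical helper; how the result is delivered on the wire is out of scope here.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Hangup {
    ProtocolVersionMismatch { server_supported: Vec<u8> },
}

/// Accept the offered version only if it is in the local supported set;
/// otherwise return a typed rejection listing what is supported.
pub fn check_version(offered: u8, server_supported: &[u8]) -> Result<u8, Hangup> {
    if server_supported.contains(&offered) {
        Ok(offered)
    } else {
        Err(Hangup::ProtocolVersionMismatch {
            server_supported: server_supported.to_vec(),
        })
    }
}
```

Returning the supported list in the error gives the rejected client enough information to show a meaningful upgrade prompt instead of a generic failure.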
### Sequencing semantics
- `sequence` is per-stream, monotonic, u32, wraps at 2^32. At 1000 pps that is ~50 days — effectively no wrap.
- `timestamp_ms` is per-stream, milliseconds since session start, u32, ~49.7 days range. Rebase behavior at rekey: **does not reset** — kept monotonic across rekeys (documented as a separate hardening item in PRD #4, W3).
- `fec_block_id` is per-stream, u16, wraps at 2^16. With ≥ 5-frame blocks that is at least ~109 minutes at 50 pps (2^16 blocks × 5 frames ÷ 50 pps ≈ 6554 s), which is adequate, but PRD #4 (W2) covers an epoch counter if needed.
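The wrap horizons for these counters follow from one division, and a small helper makes them easy to recheck for other packet rates. The function names are hypothetical; the only inputs are the counter width and how often it increments.

```rust
/// Seconds until a counter of `bits` width wraps, given how many times it
/// increments per second.
pub fn wrap_seconds(bits: u32, increments_per_second: f64) -> f64 {
    2.0_f64.powi(bits as i32) / increments_per_second
}

/// Block-ID variant: the counter advances once per FEC block of
/// `frames_per_block` packets rather than once per packet.
pub fn block_id_wrap_seconds(bits: u32, pps: f64, frames_per_block: f64) -> f64 {
    wrap_seconds(bits, pps / frames_per_block)
}
```

For instance, `wrap_seconds(16, 50.0)` gives about 1311 s (the ~21 min v1 sequence wrap), and `wrap_seconds(32, 1000.0)` gives about 4.29 million seconds, or roughly 49.7 days, for both the u32 sequence at 1000 pps and the millisecond timestamp.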
## Implementation outline
1. New types in `wzp-proto/src/packet.rs` behind a `proto-v2` feature flag.
2. Round-trip tests for `MediaHeader v2` and `MiniHeader v2` (encode → decode → assert equal).
3. Migrate `wzp-codec` encode path to emit v2 headers.
4. Migrate `wzp-client` and `wzp-relay` parse paths.
5. `CallOffer`/`CallAnswer` carry `protocol_version` and `supported_versions`.
6. Typed `Hangup::ProtocolVersionMismatch` reason.
7. Remove v1 emission path once all 571 tests pass under v2 (drop the feature flag default).
8. Add migration note to `WZP-SPEC.md`.
## Acceptance criteria
- All 571 audio tests pass with v2 headers.
- A v1 client connecting to a v2 relay receives `Hangup::ProtocolVersionMismatch` within 1 RTT.
- Wire-level capture confirms 16 B `MediaHeader` and 5 B `MiniHeader` on real audio calls.
- `media_type` byte readable by relay without parsing `codec_id` (enables PRD #2 Tier A separation).
## Risks
- **Stranding old clients.** Force-update prompt in UI; release notes; staged rollout (relays accept v1 for 2 weeks before flipping to reject).
- **MiniHeader 5 B vs 4 B regression check.** Trunking math reconfirmed (cap of 10 binds before MTU — no change).
## Effort
~2.5 engineer-days (Wave 1 tasks T1.1-T1.3 in the index).

vault/PRDs/README.md Normal file

@@ -0,0 +1,156 @@
---
tags: [prd, wzp]
type: prd
---
# PRD Index — Protocol v2, Video, Abuse Mitigation
> Coordinated worklist that addresses (a) the P0/P1 findings in `docs/PROTOCOL-AUDIT.md`, (b) the video roadmap in `docs/ROAD-TO-VIDEO.md`, and (c) the relay abuse vectors in `docs/ATTACK-SURFACE-RELAY-ABUSE.md`. Each item below links to its own PRD.
## Why a combined plan
The three documents share substantial structure:
- **Wire format v2** (audit P0: W1, W4, W9, W10) is the prerequisite for video framing **and** for per-`MediaType` conformance enforcement against abuse. One change resolves three pressures.
- **TransportFeedback + BWE** (audit P1: W6, W14) is mandatory for video, materially improves audio adaptation, and gives the relay another observable for abuse detection.
- **Relay conformance enforcement** (attack surface Tiers A-G) is independently valuable for audio today, and the v2 `MediaType` field lets it scale cleanly to video.
Sequencing matters. Implementing v2 wire format **before** any video work or any deep abuse mitigation avoids two compatibility breaks.
## PRD catalog
| # | PRD | Resolves | Status |
|---|---|---|---|
| 1 | [PRD-wire-format-v2](./PRD-wire-format-v2.md) | Audit W1, W4, W9, W10; prereq for #5/#6/#7/#8 and Tier F of #2 | proposed |
| 2 | [PRD-relay-conformance](./PRD-relay-conformance.md) | Attack-surface Tiers A-G | proposed |
| 3 | [PRD-transport-feedback-bwe](./PRD-transport-feedback-bwe.md) | Audit W6, W14 | proposed |
| 4 | [PRD-protocol-hardening](./PRD-protocol-hardening.md) | Audit W2, W3, W5, W11, W12, W13 (security + correctness batch) | proposed |
| 5 | [PRD-video-v1](./PRD-video-v1.md) | Road-to-video Phases V3 + V4 (H.264 single-layer, NACK, keyframe cache) | proposed |
| 6 | [PRD-video-multicodec](./PRD-video-multicodec.md) | H.265 + AV1 negotiation (road-to-video Phase V3 codec rollout) | proposed |
| 7 | [PRD-video-quality-priority](./PRD-video-quality-priority.md) | Road-to-video Phase V5 (VideoQualityController + PriorityMode + ScreenShare) | proposed |
| 8 | [PRD-video-simulcast](./PRD-video-simulcast.md) | Road-to-video Phases V5 + V6 (simulcast, per-receiver layer selection at SFU) | proposed |
Native capture pipelines (road-to-video Phase V7) are out of scope here — they sit downstream of #5 and are platform team work; tracked separately.
## Dependency graph
```
                  ┌───────────────────────────────┐
                  │ #1 Wire format v2 (keystone)  │
                  └────────┬──────────────────────┘
       ┌───────────────────┼──────────────────────┐
       │                   │                      │
       ▼                   ▼                      ▼
┌───────────────┐ ┌──────────────────┐ ┌──────────────────────┐
│ #2 Conformance│ │ #3 Transport     │ │ #4 Protocol          │
│    Tier A-G   │ │ Feedback + BWE   │ │ Hardening            │
└──────┬────────┘ └────────┬─────────┘ └──────────────────────┘
       │ Tier A-D first    │
       │ Tier F needs      │
       │ traffic baseline  │
       │                   │
       │           ┌───────▼────────┐
       │           │ #5 Video v1    │
       │           │ (H.264 + NACK) │
       │           └───────┬────────┘
       │                   │
       │    ┌──────────────┼──────────────┐
       │    │              │              │
       │    ▼              ▼              ▼
       │ ┌────────┐ ┌──────────────┐ ┌──────────────┐
       │ │ #6     │ │ #7 Video     │ │ #8 Simulcast │
       │ │ Multi- │ │ Quality +    │ │              │
       │ │ codec  │ │ Priority     │ │              │
       │ └────────┘ └──────────────┘ └──────────────┘
       └──> #2 Tier F (video): needs #5 in production traffic to baseline
```
## Combined task list
Ordered by dependency and risk. Each task references its PRD.
### Wave 1 — Foundation (week 1)
| Task | PRD | Effort | Output |
|---|---|---|---|
| T1.1 Land 16 B MediaHeader v2 + 5 B MiniHeader v2 in `wzp-proto` | #1 | 1 d | New types behind feature flag; old paths still work |
| T1.2 Update `wzp-codec` + `wzp-client` + `wzp-relay` to emit v2 | #1 | 1 d | All audio tests pass under v2 |
| T1.3 Protocol version negotiation in `CallOffer/CallAnswer` (typed `Hangup::ProtocolVersionMismatch`) | #1 + #4 (W12) | 0.5 d | v1 clients rejected with clear reason |
| T1.4 `QualityReport` trailer moved inside AEAD payload (or AAD-bound) | #4 (W5) | 0.5 d | Security fix, audit log |
| T1.5 Anti-replay window made per-stream and per-MediaType configurable | #4 (W11) | 0.5 d | Audio=64, video=1024 ready |
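
To make T1.5 concrete, here is a minimal sketch of a per-stream replay filter whose window size depends on the media type. The window sizes (audio = 64, video = 1024) come from the table above; everything else is an assumption for illustration: the `MediaType` and `ReplayWindow` names, the `accept` method, and the use of an extended, non-wrapping `u64` sequence number rather than whatever width the v2 header actually carries.

```rust
/// Media class of a stream. Hypothetical type, named for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MediaType {
    Audio,
    Video,
}

impl MediaType {
    /// Window sizes from T1.5: audio = 64, video = 1024.
    pub fn replay_window(self) -> u64 {
        match self {
            MediaType::Audio => 64,
            MediaType::Video => 1024,
        }
    }
}

/// Sliding-window replay filter for a single stream (hypothetical sketch).
/// Assumes sequence numbers are already extended to a non-wrapping u64.
pub struct ReplayWindow {
    window: u64,
    highest: Option<u64>,
    seen: Vec<bool>, // ring buffer indexed by seq % window
}

impl ReplayWindow {
    pub fn new(media: MediaType) -> Self {
        let window = media.replay_window();
        Self { window, highest: None, seen: vec![false; window as usize] }
    }

    /// Returns true and records `seq` if it is fresh;
    /// returns false if it is a replay or older than the window.
    pub fn accept(&mut self, seq: u64) -> bool {
        let highest = match self.highest {
            Some(h) => h,
            None => {
                // First packet on this stream.
                self.highest = Some(seq);
                self.seen[(seq % self.window) as usize] = true;
                return true;
            }
        };
        if seq > highest {
            // Slide forward: clear the slots of every sequence number we skip.
            // Capping at `window` clears the whole ring on a large jump.
            let advance = (seq - highest).min(self.window);
            for i in 1..=advance {
                self.seen[((highest + i) % self.window) as usize] = false;
            }
            self.seen[(seq % self.window) as usize] = true;
            self.highest = Some(seq);
            return true;
        }
        if highest - seq >= self.window {
            return false; // too old to judge, reject
        }
        let slot = (seq % self.window) as usize;
        if self.seen[slot] {
            false // already seen: replay
        } else {
            self.seen[slot] = true; // late but fresh
            true
        }
    }
}
```

One filter instance per stream keeps a bursty video stream from pushing late audio packets out of the window, which is the motivation for making the window per-stream rather than per-session.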
### Wave 2 — Feedback + abuse mitigation (week 2)
| Task | PRD | Effort | Output |
|---|---|---|---|
| T2.1 `SignalMessage::TransportFeedback` variant | #3 | 1 d | Wire path; not yet consumed |
| T2.2 `BandwidthEstimator` in `wzp-proto` (cwnd + remb fusion) | #3 | 2 d | Prometheus output |
| T2.3 `AdaptiveQualityController` consumes BWE | #3 | 1 d | Audio upgrade decisions use bandwidth, not just loss |
| T2.4 `wzp-relay/src/conformance.rs` — Tier A (bitrate ceilings per CodecID) | #2 | 1 d | Bulk-tunnel abuse killed |
| T2.5 Tier B (packet-rate cap) + Tier C (timestamp consistency) | #2 | 1 d | Loud abuse caught |
| T2.6 Prometheus: `relay_conformance_*` counters + observable histograms | #2 | 0.5 d | Baseline data collection starts |
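
As an illustration of T2.4, the following sketch checks one measurement window against a per-codec bitrate ceiling. The 1.5× margin and the counter-only rollout mode are taken from the risk register in this document; the `BitrateCeiling`, `Mode`, and `Action` names, the idea of passing in a nominal bitrate per codec, and the choice to count payload bytes only are assumptions, not the contents of `conformance.rs`.

```rust
use std::time::Duration;

/// Rollout mode: count violations only, or also drop traffic (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    CountOnly,
    Enforce,
}

/// What the relay should do with the offending window's traffic (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Action {
    Forward,
    Drop,
}

/// Tier A style check: observed bitrate versus nominal codec bitrate x 1.5.
pub struct BitrateCeiling {
    ceiling_bps: u64,
    mode: Mode,
    /// Number of windows that exceeded the ceiling (would feed a counter).
    pub violations: u64,
}

impl BitrateCeiling {
    pub fn new(nominal_bps: u64, mode: Mode) -> Self {
        // 1.5x margin in integer arithmetic.
        Self { ceiling_bps: nominal_bps * 3 / 2, mode, violations: 0 }
    }

    /// Evaluate one window: `bytes` observed over `window`.
    pub fn check(&mut self, bytes: u64, window: Duration) -> Action {
        // bits/s > ceiling  <=>  bytes * 8 * 1000 > ceiling * millis
        let millis = window.as_millis().max(1);
        let observed = bytes as u128 * 8 * 1000;
        let allowed = self.ceiling_bps as u128 * millis;
        if observed > allowed {
            self.violations += 1;
            if self.mode == Mode::Enforce {
                return Action::Drop;
            }
        }
        Action::Forward
    }
}
```

Running with `Mode::CountOnly` first corresponds to the one-week counter-only period listed as the mitigation for Tier A false positives: violations are recorded but nothing is dropped.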
### Wave 3 — Protocol hardening (week 3)
| Task | PRD | Effort | Output |
|---|---|---|---|
| T3.1 `fec_block_id` widened to u16 in v2 | #4 (W2) | 0.5 d | No FEC collisions on slow joiners |
| T3.2 Document `timestamp_ms` rebase behavior at rekey | #4 (W3) | 0.5 d | Spec clarity |
| T3.3 `SignalMessage` variants prefixed with `version: u8` | #4 (W12) | 0.5 d | Future-proof signaling |
| T3.4 `RoomManager` migrated to `DashMap<RoomId, Arc<RwLock<Room>>>` | #4 (W13) | 2 d | No per-packet global lock |
| T3.5 Tier E (per-fingerprint / per-IP token bucket) wired to featherChat auth | #2 | 1.5 d | Aggregate quota enforced |
| T3.6 Tier D (per-codec packet-size sanity) | #2 | 0.5 d | Sneaky-payload class caught |
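
The per-fingerprint / per-IP quota in T3.5 could be sketched as a table of token buckets keyed by identity. This is a generic token bucket, not the project's implementation: the `TokenBucket` and `QuotaTable` names, the string key, and the capacity and refill numbers are all placeholders, and the caller supplies `now` so the behaviour is deterministic in tests.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Classic token bucket: refills continuously up to `capacity`.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    pub fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: now }
    }

    /// Refill for the elapsed time, then try to spend `cost` tokens.
    pub fn try_consume(&mut self, cost: f64, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = self.last.max(now); // never move the clock backwards
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}

/// One bucket per identity (fingerprint or IP); the key format is hypothetical.
pub struct QuotaTable {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, TokenBucket>,
}

impl QuotaTable {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Charge `cost` (for example bytes) against the bucket for `key`.
    pub fn allow(&mut self, key: &str, cost: f64, now: Instant) -> bool {
        let (capacity, refill) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(key.to_string())
            .or_insert_with(|| TokenBucket::new(capacity, refill, now))
            .try_consume(cost, now)
    }
}
```

Separate capacity and refill settings for anonymous and authenticated keys would be one way to express the quota split that is still listed as an open product question.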
### Wave 4 — Video v1 (weeks 4–6)
| Task | PRD | Effort | Output |
|---|---|---|---|
| T4.1 `wzp-video` crate scaffold; H.264 framer + depacketizer | #5 | 4 d | NAL fragmentation, access-unit reassembly |
| T4.2 VideoToolbox encoder + decoder (macOS) | #5 | 3 d | Unidirectional video macOS↔macOS |
| T4.3 MediaCodec encoder + decoder (Android, via JNI) | #5 | 5 d | Android video path |
| T4.4 NACK loop (`SignalMessage::Nack`) + RTT-gated policy | #5 | 2 d | P-frame loss recovery |
| T4.5 Dynamic FEC ratio on I-frames (encoder hint to FEC layer) | #5 | 1 d | I-frame survivability without round trip |
| T4.6 SFU keyframe cache per (room, sender, stream) | #5 | 2 d | < 200 ms join-to-first-frame |
| T4.7 PLI suppression at SFU | #5 | 1 d | Bounded upstream PLI rate |
### Wave 5 — Quality, codecs, simulcast (weeks 7–9)
| Task | PRD | Effort | Output |
|---|---|---|---|
| T5.1 `PriorityMode` enum on `QualityProfile` + `SignalMessage::SetPriorityMode` | #7 | 1 d | Wire path |
| T5.2 `VideoQualityController` with per-mode allocation gates | #7 | 3 d | AudioFirst / VideoFirst / Balanced live |
| T5.3 ScreenShare mode: slide-fallback encoder policy | #7 | 2 d | Presentation use case viable |
| T5.4 H.265 encoder/decoder (reuse framer) | #6 | 3 d | Codec negotiation cascade live |
| T5.5 Simulcast: encoder emits 3 layers; `stream_id` carries layer | #8 | 4 d | Layer-tagged uplink |
| T5.6 Per-receiver layer selection at SFU | #8 | 3 d | Mixed-quality rooms work |
| T5.7 Tier F (entropy scorer) — audio variant first, baselined from Wave 2/3 data | #2 | 3 d | Covert-tunnel pressure |
| T5.8 Tier G (response policy + audit log) | #2 | 1 d | Operational |
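
For T5.6, a very small sketch of per-receiver layer selection: the SFU forwards the highest simulcast layer whose bitrate fits the receiver's estimated bandwidth. The `Layer` struct, the `select_layer` function, and the idea that the estimate arrives as a single bits-per-second figure are assumptions; the only detail taken from the table is that `stream_id` identifies the layer.

```rust
/// One simulcast layer as advertised by the sender (hypothetical shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Layer {
    /// Layer tag carried in `stream_id` (per T5.5).
    pub stream_id: u8,
    /// Target bitrate of this layer in bits per second.
    pub bitrate_bps: u64,
}

/// Pick the highest-bitrate layer that fits within `available_bps`.
/// Returns None when even the lowest layer does not fit.
pub fn select_layer(layers: &[Layer], available_bps: u64) -> Option<Layer> {
    layers
        .iter()
        .copied()
        .filter(|l| l.bitrate_bps <= available_bps)
        .max_by_key(|l| l.bitrate_bps)
}
```

A production selector would likely add hysteresis so receivers do not flap between layers when the estimate hovers near a boundary; that is left out here to keep the core rule visible.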
### Wave 6 — AV1 + Tier F video (weeks 10+)
| Task | PRD | Effort | Output |
|---|---|---|---|
| T6.1 AV1 encoder/decoder with HW detection (SVT-AV1 fallback) | #6 | 5 d | Top-tier efficiency on capable HW |
| T6.2 Tier F video scorer (keyframe periodicity, I/P frame-size ratio, BWE responsiveness) | #2 | 3 d | Video abuse detection |
| T6.3 Federated reputation gossip (optional) | #2 | 4 d | Cross-relay abuse mitigation |
## Risk register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| v2 wire format break strands old clients | High | High | Typed `Hangup::ProtocolVersionMismatch`, clear UI, force update prompt |
| BWE oscillation regresses audio adaptation | Med | Med | Behind feature flag; A/B with shadow Prometheus before flipping default |
| Conformance Tier A false positives | Low | High | Math-derived ceilings × 1.5; counter-only mode for 1 week before enforcement |
| `DashMap` migration regresses room semantics | Med | Med | Integration tests for federation + trunking before merging |
| Android MediaCodec edge cases (Nothing A059 baseline) | High | Med | Per-device test matrix; software fallback path |
| AV1 software encode torches battery | High | Low | HW probe at session start; refuse AV1 if no HW encode |
| Tier F false positives on edge cases (e.g., long silences in lectures) | Med | High | Verdict-only mode + 30 s window minimum + Suspect tier escalation |
## Open product questions (not blocking)
- Anonymous vs. authenticated quota split — numbers TBD pending Prometheus baseline.
- Whether to expose `PriorityMode` UI for end users or only via product preset (call vs. screen-share).
- AV1 rollout gate: what share of sessions (5 %? 20 %?) must report HW support before AV1 is enabled by default.
- Federated reputation gossip is powerful but introduces a poisoning surface; decision deferred to after Wave 5.

1907
vault/PRDs/TASKS.md Normal file

File diff suppressed because it is too large