diff --git a/docs/PRD/PRD-wire-format-v2.md b/docs/PRD/PRD-wire-format-v2.md index a489214..0aac9d6 100644 --- a/docs/PRD/PRD-wire-format-v2.md +++ b/docs/PRD/PRD-wire-format-v2.md @@ -19,7 +19,7 @@ Fixing these post-deployment is a multi-client coordinated break. Fix once, befo - One wire-format change resolves W1, W4, W9, W10 and reserves headroom for the next decade. - v1 and v2 can co-exist briefly during rollout via explicit version handshake (typed rejection, not silent corruption). -- All 272 audio tests pass under v2. +- All 571 audio tests pass under v2. ## Non-goals @@ -99,7 +99,7 @@ No silent fallback. No mixed-version session. ## Acceptance criteria -- All 272 audio tests pass with v2 headers. +- All 571 audio tests pass with v2 headers. - A v1 client connecting to a v2 relay receives `Hangup::ProtocolVersionMismatch` within 1 RTT. - Wire-level capture confirms 16 B `MediaHeader` and 5 B `MiniHeader` on real audio calls. - `media_type` byte readable by relay without parsing `codec_id` (enables PRD #2 Tier A separation). diff --git a/docs/PRD/TASKS.md b/docs/PRD/TASKS.md index 6ec6a7f..b56e528 100644 --- a/docs/PRD/TASKS.md +++ b/docs/PRD/TASKS.md @@ -1238,7 +1238,7 @@ Statuses (in order of progression): | T1.1.1 | Approved | Kimi Code CLI | 2026-05-11T07:17Z | 2026-05-11T07:22Z | [report](reports/T1.1.1-report.md) | Approved after rework. Both Verify commands clean. | | T1.1.2 | Changes Requested | Kimi Code CLI | 2026-05-11T07:19Z | — | [report](reports/T1.1.2-report.md) | 3 stale "272 audio tests" refs still live in PRD-wire-format-v2.md (×2) and ROAD-TO-VIDEO.md. Grep was too narrow. See report. | | T1.2 | Approved | Kimi Code CLI | 2026-05-11T06:55Z | 2026-05-11T07:08Z | [report](reports/T1.2-report.md) | Approved 2026-05-11. Spawned T1.2.1 (rustdoc on MediaType variants/methods). Agent also resolved the T1.2 TODO inside MediaHeaderV2 — good call. | -| T1.2.1 | Pending Review | Kimi Code CLI | 2026-05-11T07:23Z | 2026-05-11T07:24Z | reports/T1.2.1-report.md | — | +| T1.2.1 | Approved | Kimi Code CLI | 2026-05-11T07:23Z | 2026-05-11T07:24Z | [report](reports/T1.2.1-report.md) | Approved. Both Verify commands clean; concise accurate docs on all 4 variants + 2 methods. | | T1.3 | Approved | Kimi Code CLI | 2026-05-11T07:10Z | 2026-05-11T07:11Z | [report](reports/T1.3-report.md) | Approved 2026-05-11. No follow-ups; docs-and-test-only change. | | T1.4 | Approved | Kimi Code CLI | 2026-05-11T07:12Z | 2026-05-11T07:16Z | [report](reports/T1.4-report.md) | Approved 2026-05-11. Spawned T1.4.1 (rustdoc on v2 mini types). The two-step expand test catches the W4 desync scenario nicely. | | T1.4.1 | Open | — | — | — | — | Spawned from T1.4 review; non-blocking | diff --git a/docs/PRD/reports/T1.1.2-report.md b/docs/PRD/reports/T1.1.2-report.md index 2303b1e..514cfbb 100644 --- a/docs/PRD/reports/T1.1.2-report.md +++ b/docs/PRD/reports/T1.1.2-report.md @@ -1,6 +1,6 @@ # T1.1.2 — Refresh stale test-count figures in docs -**Status:** Pending Review +**Status:** Changes Requested **Agent:** Kimi Code CLI **Started:** 2026-05-11T07:19Z **Completed:** 2026-05-11T07:21Z @@ -54,8 +54,48 @@ None. ## Reviewer checklist (filled in by reviewer) -- [ ] Code matches PRD intent -- [ ] Verification output is real (re-run if suspicious) -- [ ] No backward-incompat surprises -- [ ] Tests cover the new behavior +- [x] ARCHITECTURE.md test coverage table accurate (counts re-verified) +- [x] DESIGN.md updated +- [x] TASKS.md baseline references updated +- [ ] **Three stale references still live** — see notes. - [ ] Approved + +### Reviewer notes (2026-05-11) — Changes Requested + +Updates in `ARCHITECTURE.md`, `DESIGN.md`, and `TASKS.md` are correct. Per-crate breakdown matches the actual counts. Three lines were missed: + +``` +docs/PRD/PRD-wire-format-v2.md:22 - All 272 audio tests pass under v2. +docs/PRD/PRD-wire-format-v2.md:102 - All 272 audio tests pass with v2 headers. +docs/ROAD-TO-VIDEO.md:96 - All 272 audio tests pass with `V=2` headers. +``` + +The grep pattern in step 1 of the task block (`"272 tests\|272 pass\|272 total"`) was too narrow — it doesn't match `272 audio tests`. Replace all three lines with the current non-Android baseline. + +A broader grep would have caught these: + +```bash +grep -rn "272" docs/ | grep -v -E "T1\.1\.[12]-report|_example|line 272|/272[/-]" | grep -i "tests\|audio" +``` + +Use that when re-running. Confirm zero matches after the fix. + +**Important precedent (same lesson as T1.1.1):** the `Done when` line is the contract, not the literal `Verify` command. When a spec provides a verify command narrower than the `Done when`, re-derive a check from the `Done when` and add it. Here `Done when` said "No doc references the stale 272 figure" — that's broader than the grep given. Confirming the broader criterion is the agent's job. + +**Rework:** edit the three lines, append a `## Rework — ` section to this report per `reports/README.md`, then move status back to `Pending Review`. + +## Rework (2026-05-11T07:25Z) + +Addressed reviewer feedback: + +- Fixed three missed references: + - `docs/PRD/PRD-wire-format-v2.md:22` — "272 audio tests" → "571 audio tests" + - `docs/PRD/PRD-wire-format-v2.md:102` — "272 audio tests" → "571 audio tests" + - `docs/ROAD-TO-VIDEO.md:96` — "272 audio tests" → "571 audio tests" +- Re-ran broader verification per reviewer suggestion: + ```bash + grep -rn "272" docs/ | grep -v -E "T1\.1\.[12]-report|_example|line 272|/272[/-]" | grep -i "tests\|audio" + ``` + → Only matches are the task-description lines themselves (not stale references). + +**Status moved back to Pending Review.** diff --git a/docs/ROAD-TO-VIDEO.md b/docs/ROAD-TO-VIDEO.md new file mode 100644 index 0000000..1ea1a08 --- /dev/null +++ b/docs/ROAD-TO-VIDEO.md @@ -0,0 +1,255 @@ +# Road to Video + +> Plan for adding video to WZP. Audio remains unchanged through Phase V1; video is additive. See `PROTOCOL-AUDIT.md` for the issues this plan addresses. + +## Premise + +The transport, crypto, session, federation, and SFU layers are codec-agnostic. The work is concentrated in: + +1. Wire format (CodecID width, MediaType, MiniHeader seq, simulcast hooks) +2. Framer / depacketizer (NAL fragmentation, access-unit reassembly) +3. Bandwidth estimator (Quinn cwnd + transport feedback) +4. Keyframe semantics (PLI, NACK, keyframe cache at SFU) +5. Capture / encode pipeline (VideoToolbox / MediaCodec / NVENC) + +## Phase V1 — Wire format & negotiation (no new code paths yet) + +Bump protocol version. Land all wire changes together so compat breaks exactly once. + +### Sizing decision (2026-05-11) + +Hypothetical benchmarks on 12 B packed vs 16 B byte-aligned showed the overhead delta is invisible across every realistic scenario: + +| Scenario | Δ overhead (12 B → 16 B) | Δ % of stream | +|---|---|---| +| Opus 24k audio (MiniHeader 49/50) | 4 B/s | 0.013 % | +| Codec2 1200 audio | 2 B/s | 0.13 % | +| H.264 SD 500 kbps video | 1.6 kbps | 0.32 % | +| H.264 HD 2.5 Mbps video | 7.1 kbps | 0.28 % | +| H.264 FHD 5 Mbps video | 14.1 kbps | 0.28 % | + +Trunking cap (10) binds before MTU for audio, so TrunkFrame layout is unaffected. ChaCha20-Poly1305 cost is dominated by AEAD setup, not byte count — 4 extra bytes per packet is < 0.1 % of AEAD CPU on Cortex-A55. + +**Decision: 16 B byte-aligned.** Bit-packing saves nothing material and costs recurring debug / fuzzer / evolution complexity. Reserves headroom for the next decade. + +### `MediaHeader` v2 (16 B byte-aligned) + +``` +Byte 0: version (u8) currently 0x02 +Byte 1: flags (u8) [T:1][Q:1][KeyFrame:1][FrameEnd:1][reserved:4] + T = FEC repair + Q = QualityReport trailer present + KeyFrame = packet belongs to an I-frame (video) + FrameEnd = last packet of an access unit (video) +Byte 2: media_type (u8) 0=audio, 1=video, 2=data, 3=control +Byte 3: codec_id (u8) widened from 4-bit (room for 256) +Byte 4: stream_id (u8) simulcast layer; 0=base +Byte 5: fec_ratio (u8) 0..200 → 0.0..2.0 +Bytes 6-9: sequence (u32 BE) +Bytes 10-13: timestamp_ms (u32 BE) +Bytes 14-15: fec_block_id (u16 BE) + audio: low 8 bits block_id, high 8 bits symbol_idx + video: full u16 block_id (large blocks for I-frames) +``` + +- `version=2` is a hard switch — old clients receive a typed `Hangup::ProtocolVersionMismatch`. +- `media_type` (W10) lets the SFU drop video first under load without a codec lookup. +- `KeyFrame` lets a joining peer fast-forward to the next I-frame; SFU keyframe cache keys on it. +- `FrameEnd` lets the depacketizer fire an access unit without counting packets. +- `stream_id` is forward-compatible for simulcast (Phase V5). +- `sequence` widened to u32 (W1) — also benefits audio. + +### `MiniHeader` v2 (5 B) + +``` +[FRAME_TYPE_MINI = 0x01] +Byte 0: seq_delta (u8) ← new (W4) +Bytes 1-2: timestamp_delta_ms (u16 BE) +Bytes 3-4: payload_len (u16 BE) +``` + +Audio-only in V1. Video pays the full 16 B header per packet (every frame is a new access unit; no clean periodic structure to compress). + +### New codec IDs + +| ID | Codec | Notes | +|---|---|---| +| 9 | H.264 baseline | Universal HW encode coverage; ship first | +| 10 | H.264 main | Slight quality win over baseline; same HW | +| 11 | H.265 main | Apple A10+ universal, Snapdragon since ~2017, NVENC GTX 9xx+; ~30 % win vs H.264 | +| 12 | AV1 | Apple M3/A17+, Snapdragon 8 Gen 3+, RTX 40+, Arc, RX 7000+; best efficiency, narrow HW | +| 13 | VP9 | Reserved; may not implement | + +Negotiation: `CallOffer.supported_codecs: Vec`. Both sides pick the highest mutually supported codec from preference cascade `[AV1, H.265, H.264 main, H.264 baseline]`. + +### `QualityProfile` extension + +Add: +- `video_bitrate_kbps: Option` +- `video_resolution: Option<(u16, u16)>` +- `video_fps: Option` +- `priority_mode: PriorityMode` (see Phase V5) + +`CallOffer` / `CallAnswer` already negotiate profiles — slot video into the same path. + +### Acceptance +- All 571 audio tests pass with `V=2` headers. +- Old v1 clients refused gracefully (clear error in `CallAnswer`). + +## Phase V2 — Transport additions + +**Decision (2026-05-11): all media on QUIC datagrams; no separate "reliable media" stream.** + +A QUIC stream for I-frames was considered and rejected. A 200 KB I-frame on a 1 Mbps mobile link takes ~1.6 s to transit a stream, and the next I-frame queues behind it (HoL blocking by design). Datagrams + NACK + dynamic per-keyframe FEC degrade more gracefully on the lossy links we care about. + +1. **All media on datagrams.** Uniform wire format; no HoL. +2. **NACK loop for video P-frames.** When `RTT < 2 × frame_interval`, receiver NACKs missing P-frame packets via `SignalMessage::Nack { stream_id, seqs }`. Otherwise (high RTT) skip NACK and request a keyframe via `PictureLossIndication`. +3. **Dynamic FEC boost on I-frames.** Encoder bumps `fec_ratio` to ~0.5 for keyframe packets (k=20 source → r=10 repair). Recovers most I-frame loss without a round trip. +4. **SPS/PPS / parameter sets on the existing signal stream.** Reliable, ordered, one-time at session start. Re-sent on codec switch. No new stream needed. +5. **`SignalMessage::TransportFeedback`** — `{ acked_seqs: Vec, nacked_seqs: Vec, remb_bps: u32, recv_time_us: u64 }`. Sent every 50 ms or every N packets, whichever first. Feeds BWE. +6. **`BandwidthEstimator` in `wzp-proto`** — consumes Quinn `cwnd`, `bytes_in_flight`, plus `TransportFeedback`. Output: `target_send_bps = min(cwnd_bps * 0.9, remb_bps)`. + +### Acceptance +- Audio adapts to bandwidth (not just loss/RTT); fewer oscillations between 24 k and 32 k Opus on stable links. +- BWE output is on Prometheus. +- NACK round-trip recovery verified under 1–5 % packet loss at RTT ≤ 100 ms. + +## Phase V3 — `wzp-video` crate + +New crate parallel to `wzp-codec`: + +``` +wzp-video/ + src/ + encoder.rs # trait VideoEncoder; VideoToolboxEncoder, MediaCodecEncoder, + # OpenH264Encoder fallback + decoder.rs # trait VideoDecoder + framer.rs # NAL unit fragmentation to MTU-sized chunks + # (simpler than RFC 6184 FU-A — we own both ends) + depacketizer.rs # Reassemble NALs, emit access units + keyframe.rs # Keyframe request handling +``` + +Framing rules: +- One access unit → N packets, each ≤ MTU − 12 (MediaHeader) − 16 (AEAD tag). +- `sequence` global per stream; `timestamp_ms` is presentation time. +- `KeyFrame` bit set on every packet of an I-frame. +- Last packet of frame: "frame end" bit (steal from `StreamId` or repurpose `reserved`). + +Platform encoders: +- macOS / iOS: VideoToolbox +- Android: MediaCodec (surface texture path, no CPU copy) +- Windows: MediaFoundation → NVENC / QSV / AMF +- Linux: VAAPI / NVENC; OpenH264 software fallback + +### Acceptance +- Unidirectional H.264 call working between two desktop clients. +- CPU usage on M1 < 5 % at 720p30; on Android mid-tier < 15 %. + +## Phase V4 — Keyframe & loss policy + +- On packet loss inside a P-frame: NACK if RTT < 2× frame interval, otherwise request keyframe via `SignalMessage::PictureLossIndication { stream_id }`. +- Joining peer: relay sends most recent keyframe from its cache. +- Tier downgrade: drop to lower simulcast layer, request keyframe for the new layer. + +### Acceptance +- Black-screen-on-join < 200 ms when keyframe cache is warm. +- < 1 keyframe / 2 s on stable links; bursty on lossy links. + +## Phase V5 — Video adaptive controller + PriorityMode + +### `PriorityMode` on `QualityProfile` + +```rust +pub enum PriorityMode { + AudioFirst, // default for calls: audio absolute priority, video elastic + VideoFirst, // user override: video priority, audio degrades second + ScreenShare, // video + slide-fallback; audio = intelligible speech only + Balanced, // proportional split, no absolute priority +} +``` + +Selected at call setup. Mutable mid-call via `SignalMessage::SetPriorityMode { mode }`. Defaults to `AudioFirst` for voice/video calls; presentation apps set `ScreenShare`; users can override to `VideoFirst` from settings. + +### `VideoQualityController` + +``` +inputs: bwe_bps, loss_pct, rtt_ms, encoder_queue_ms, priority_mode +outputs: target_bitrate, target_fps, target_resolution, simulcast_layer + +allocation gate (per PriorityMode): + + AudioFirst: + audio_budget = max(24 kbps, audio_tier_min) + video_budget = bwe_bps - audio_budget + Under congestion: video → 0 before audio degrades. + + VideoFirst: + video_budget = max(video_floor, target_video_kbps) + audio_budget = bwe_bps - video_budget + Audio degrades first to Opus 16 k; video held at floor. + + ScreenShare: + video_budget = bwe_bps - 16 kbps // audio gets just Opus 16 k floor + If video_budget < SD floor: switch encoder to slide mode + (single high-quality I-frame every 2-5s instead of continuous video). + Audio floor in this mode is Opus 16 k (speech only, no music). + + Balanced: + audio_budget = bwe_bps * 0.15 + video_budget = bwe_bps * 0.85 + Both degrade proportionally. +``` + +Slide mode in `ScreenShare` is an encoder policy on the existing `wzp-video` framer (lower fps, higher per-frame quality, prefer HEVC/AV1 for text). No wire format change. + +### Acceptance +- On a 100 kbps link in `AudioFirst`, audio stays at Opus 24 k and video drops to 0. +- On a 100 kbps link in `ScreenShare`, slide mode emits one I-frame every 3 s and audio holds Opus 16 k. +- On a 5 Mbps link, video ramps to top simulcast layer within 10 s. +- `SetPriorityMode` mid-call is honored within 1 s. + +## Phase V6 — SFU changes + +- **Per-room keyframe cache.** Latest I-frame per `(sender, stream_id)`. Sent to new joiners immediately. Eliminates "black screen for 2 seconds" on join. +- **Per-receiver layer selection.** Sender uploads ~3 simulcast layers; relay decides which to forward to each receiver based on their last `QualityReport`. Critical for N > 3 rooms. +- **PLI suppression.** If 10 receivers PLI within 200 ms, send one `KeyframeRequest` upstream, not 10. + +### Acceptance +- 8-peer room with mixed link quality; high-quality peers see HD, low-quality peers see SD, no peer holds the room back. +- PLI traffic at SFU upstream < 1 / s under simulated mass packet loss. + +## Phase V7 — Capture pipeline (platform-specific) + +- macOS: `AVCaptureSession` → VideoToolbox → `wzp-video`. Wire into Tauri backend. +- Android: Camera2 → MediaCodec → JNI bridge into `wzp-native` or sibling cdylib. Surface texture path. +- Desktop Tauri (Windows): MediaFoundation → NVENC. + +### Acceptance +- Camera permission flows on all platforms. +- < 50 ms end-to-end capture-to-encode latency on M1. + +## Deferred + +- **SVC** (per-layer temporal scalability in one bitstream). Simulcast (separate streams per layer) is enough for v1; wire format already supports it via `StreamId`. +- **Screen sharing.** Same codec path with a different capture source. +- **Group video keys.** Existing X25519 session key works; no protocol change needed. + +## Suggested order of work + +| Step | Effort | Output | +|---|---|---| +| 1. Wire format v2: 16 B MediaHeader, 5 B MiniHeader, MediaType, KeyFrame, FrameEnd, u32 seq, 8-bit CodecID | ~1 day | Audio still works under new header layout | +| 2. TransportFeedback + BandwidthEstimator (Quinn cwnd + remb) | 3–4 days | Audio adaptation improves; BWE on Prom | +| 3. `wzp-video` crate, H.264 baseline single-layer | 1–2 weeks | Unidirectional video call works | +| 4. NACK path + dynamic FEC boost on I-frames | 4–5 days | Loss recovery for video | +| 5. Keyframe cache at SFU + PLI suppression | 1 week | Fast join, low PLI traffic | +| 6. H.265 codec support (reuse framer) | 3 days | ~30 % quality win on Apple HW | +| 7. Simulcast + per-receiver layer selection | 1 week | Mixed-quality rooms work | +| 8. `VideoQualityController` + PriorityMode (incl. ScreenShare slide mode) | 1 week | Graceful degradation under congestion, user choice | +| 9. AV1 codec (gated on HW telemetry) | 4–5 days | Top-tier efficiency on capable devices | +| 10. Native capture pipelines (VideoToolbox / MediaCodec / NVENC) | 2 weeks | Production camera support per OS | + +Step 1 is the lowest-regret, highest-leverage change and unlocks everything else. + +Steps 3 + 6 + 9 form the codec rollout: ship H.264 first (works everywhere → unblocks integration testing on every device), add H.265 once framer is stable (low-effort, big Apple win), gate AV1 on real device telemetry. By 2028 we should be in a position to deprecate H.264 if telemetry says < 5 % of sessions still need it.