wz-phone/vault/PRDs/PRD-protocol-hardening.md
Siavash Sameni ed8a7ae5aa docs: protocol audit 2026-05-25, update architecture + Obsidian vault
Audit:
- docs/AUDIT-2026-05-25.md: full protocol audit covering 8 findings
  (4 critical, 2 high, 5 medium, 4 low) with code references and fix
  effort estimates
- vault/Audit/Tasks.md: Obsidian Tasks plugin file tracking all audit
  items with priorities, due dates, and per-step checklists

Architecture docs updated for Wire format v2 and Wave 5/6 features:
- ARCHITECTURE.md: adds wzp-video to dependency graph and project
  structure; wire format updated to v2 (16B header, 5B MiniHeader);
  relay concurrency section corrected (DashMap+RwLock is current, not
  a future optimization); test count 571→702; Android note
- PROGRESS.md: Wave 5 and Wave 6 sections appended; test count 372→702;
  current status and open blockers as of 2026-05-25
- ROAD-TO-VIDEO.md: implementation status table inserted (✅/🟡/🔴/🔲
  per phase); 6-step critical path to first video call
- WZP-SPEC.md: MediaHeader updated to v2 (16B byte-aligned); MiniHeader
  updated to 5B with seq_delta; codec IDs 9-12 added (H.264/H.265/AV1);
  version negotiation section added

Obsidian vault (vault/):
- 114 files across Architecture/, PRDs/, Reports/, Android/,
  Reference/, Audit/ with YAML frontmatter
- 00 - Home.md index note with wiki links
- .obsidian/app.json config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 06:00:17 +04:00


---
tags: [prd, wzp]
type: prd
---
# PRD: Protocol Hardening Batch
> **Status:** proposed
> **Resolves:** Audit W2 (fec_block_id width), W3 (timestamp rebase doc), W5 (QualityReport AEAD binding), W11 (per-stream anti-replay), W12 (signal version byte), W13 (RoomManager lock).
> **Depends on:** PRD #1 (wire format v2 already widens block_id field).
## Problem
This PRD batches a handful of medium-priority audit findings. None justifies a PRD on its own, but together they form the long tail of protocol correctness and concurrency work. Shipping them as one batch avoids repeated version churn.
## Items
### H1 — W5: `QualityReport` trailer must be inside AEAD
**Current risk.** If the 4-byte trailer sits *outside* the encrypted payload, anything stripping the last 4 bytes corrupts AEAD verification on legitimate packets and creates a quality-feedback downgrade vector. Even if it's correctly inside today, the v2 wire format change is the right moment to assert this explicitly.
**Action.**
- Audit `crates/wzp-proto/src/packet.rs` for `QualityReport` placement.
- Move inside AEAD payload if currently outside.
- Document: "QualityReport, when Q-flag set, is appended to plaintext payload before encryption."
- Test: tamper with trailer → AEAD decrypt fails.
**Severity.** Security correctness. Do this in Wave 1.
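The placement rule for H1 can be sketched as plain framing code. This is a minimal illustration, not the contents of `packet.rs`: the four `QualityReport` fields, `TRAILER_LEN`, `build_plaintext`, and `split_plaintext` are hypothetical names, and the AEAD call itself is left out. The point is only that the trailer is appended to the plaintext before sealing and split off after opening, so tampering with it falls under the AEAD tag.

```rust
/// Hypothetical 4-byte quality trailer. The real field layout lives in
/// `crates/wzp-proto/src/packet.rs` and may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QualityReport {
    pub loss_pct: u8,
    pub rtt_4ms: u8,
    pub jitter_ms: u8,
    pub bitrate_cap: u8,
}

pub const TRAILER_LEN: usize = 4;

impl QualityReport {
    pub fn to_bytes(self) -> [u8; TRAILER_LEN] {
        [self.loss_pct, self.rtt_4ms, self.jitter_ms, self.bitrate_cap]
    }

    pub fn from_bytes(b: [u8; TRAILER_LEN]) -> Self {
        Self { loss_pct: b[0], rtt_4ms: b[1], jitter_ms: b[2], bitrate_cap: b[3] }
    }
}

/// Build the buffer that is handed to the AEAD `seal` step:
/// media payload first, then the trailer when the Q-flag is set.
pub fn build_plaintext(payload: &[u8], report: Option<QualityReport>) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + TRAILER_LEN);
    out.extend_from_slice(payload);
    if let Some(r) = report {
        out.extend_from_slice(&r.to_bytes());
    }
    out
}

/// Split an already-decrypted plaintext back into payload and trailer.
/// Returns `None` when the Q-flag is set but the plaintext is too short.
pub fn split_plaintext(
    plaintext: &[u8],
    q_flag: bool,
) -> Option<(&[u8], Option<QualityReport>)> {
    if !q_flag {
        return Some((plaintext, None));
    }
    let split = plaintext.len().checked_sub(TRAILER_LEN)?;
    let (payload, t) = plaintext.split_at(split);
    let report = QualityReport::from_bytes([t[0], t[1], t[2], t[3]]);
    Some((payload, Some(report)))
}
```

Because `split_plaintext` only ever sees bytes that have passed AEAD verification, a receiver built this way never parses an unauthenticated trailer.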
### H2 — W2: `fec_block_id` width
Resolved by v2 wire format (`u16` instead of `u8`). PRD #1 carries the wire change; this PRD just confirms semantics:
- Wraps at 2^16 blocks. At 5-frame blocks and 50 pps, that is 2^16 × 5 / 50 ≈ 6,554 s (~109 min) between collisions, vs. 2^8 × 5 / 50 ≈ 25.6 s (~25 s) in v1.
- Late-joining peers must still discard FEC blocks older than 2 s; widening is defense in depth.
**Action.** Update `wzp-fec` to operate on a `u16` block_id end-to-end. Test reconstruction across a synthetic session long enough to cross the `u16` wrap (~109 min of simulated traffic at the rates above).
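The wrap interval is simple arithmetic over ID width, block size, and packet rate, and a wrap-aware comparison is what keeps reconstruction correct across the boundary. The sketch below shows both. The function names are hypothetical, and the half-range comparison is a common serial-number technique, not necessarily what `wzp-fec` does today.

```rust
/// Seconds until a block-ID counter of `id_bits` bits wraps, given
/// `frames_per_block` packets per FEC block and `pps` packets per second.
pub fn wrap_period_secs(id_bits: u32, frames_per_block: u64, pps: u64) -> f64 {
    let blocks = 1u64 << id_bits;
    (blocks * frames_per_block) as f64 / pps as f64
}

/// Wrap-aware "is `candidate` newer than `reference`?" for u16 block IDs.
/// IDs less than half the ID space ahead count as newer.
pub fn block_is_newer(candidate: u16, reference: u16) -> bool {
    let diff = candidate.wrapping_sub(reference);
    diff != 0 && diff < 0x8000
}
```

With 5-frame blocks at 50 pps, `wrap_period_secs(8, 5, 50)` evaluates to 25.6 s and `wrap_period_secs(16, 5, 50)` to 6553.6 s. `block_is_newer(2, 65534)` is true because the counter has wrapped, which is the case a long synthetic session needs to exercise.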
### H3 — W11: Per-stream, per-`MediaType` anti-replay window
**Current.** 64-packet sliding window globally.
**Problem.** Video keyframe burst (100+ packets) can stall the window behind one reordered prior packet.
**Action.**
- Anti-replay state is per (stream_id, media_type).
- Window size: 64 for audio, 1024 for video, 256 for data.
- Window size selected at session setup based on declared profile; tunable via `QualityProfile`.
**Severity.** Required before video. Wave 1.
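A per-(stream_id, media_type) window as described in H3 could look roughly like this. It is a sketch under assumptions: sequence numbers are modeled as a session-wide `u64`, `stream_id` as `u32`, and the type and method names (`ReplayWindow`, `AntiReplay`, `accept`) are hypothetical. Only the window sizes come from the list above.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MediaType { Audio, Video, Data }

impl MediaType {
    /// Default window sizes from H3.
    pub fn window_size(self) -> u64 {
        match self {
            MediaType::Audio => 64,
            MediaType::Video => 1024,
            MediaType::Data => 256,
        }
    }
}

/// Sliding window over the last `size` sequence numbers.
pub struct ReplayWindow {
    size: u64,
    highest: Option<u64>,
    seen: Vec<bool>, // ring buffer indexed by seq % size
}

impl ReplayWindow {
    pub fn new(size: u64) -> Self {
        assert!(size > 0);
        Self { size, highest: None, seen: vec![false; size as usize] }
    }

    /// Returns true if `seq` is fresh and records it; false if it is a
    /// replay or older than the window.
    pub fn check_and_update(&mut self, seq: u64) -> bool {
        let slot = (seq % self.size) as usize;
        let highest = self.highest;
        match highest {
            Some(h) if seq <= h => {
                if h - seq >= self.size || self.seen[slot] {
                    return false;
                }
            }
            Some(h) => {
                if seq - h >= self.size {
                    // Jumped past the whole window: forget everything.
                    self.seen.iter_mut().for_each(|s| *s = false);
                } else {
                    // Clear slots for the skipped sequence numbers.
                    for n in (h + 1)..seq {
                        self.seen[(n % self.size) as usize] = false;
                    }
                }
                self.highest = Some(seq);
            }
            None => self.highest = Some(seq),
        }
        self.seen[slot] = true;
        true
    }
}

/// One independent window per (stream_id, media_type).
#[derive(Default)]
pub struct AntiReplay {
    windows: HashMap<(u32, MediaType), ReplayWindow>,
}

impl AntiReplay {
    pub fn accept(&mut self, stream_id: u32, media: MediaType, seq: u64) -> bool {
        self.windows
            .entry((stream_id, media))
            .or_insert_with(|| ReplayWindow::new(media.window_size()))
            .check_and_update(seq)
    }
}
```

With separate state per key, a 200-packet video burst leaves the audio window untouched, and a reordered video packet that arrives after the burst is still accepted because it sits inside the 1024-packet window.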
### H4 — W12: `SignalMessage` versioning
**Current.** Bincode-serialized enum. `#[serde(default, skip_serializing_if)]` handles field additions; variant removals or semantic changes are unsafe.
**Action.**
- Every variant gains `version: u8` as its first field.
- Add `SignalMessage::Unknown { version, raw: Bytes }` to absorb future unknown variants gracefully.
- Decode path: unknown variant → log + drop, do not close session.
**Severity.** Future-proofing. Wave 3.
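The decode rule in H4 (unknown variant is logged and dropped, session stays open) can be illustrated with a hand-rolled frame. This is only a sketch: the real messages are bincode-serialized, and the `[tag, version, body...]` layout, the `Ping` and `Hangup` variants, `SIGNAL_VERSION`, and `Disposition` are all hypothetical stand-ins. `raw` uses `Vec<u8>` here in place of `Bytes` to stay dependency-free.

```rust
pub const SIGNAL_VERSION: u8 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum SignalMessage {
    Ping { version: u8, nonce: u8 }, // hypothetical variant
    Hangup { version: u8 },          // hypothetical variant
    /// Absorbs unknown tags, unknown versions, and unexpected bodies.
    Unknown { version: u8, raw: Vec<u8> },
}

#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Handle,
    /// Log and drop the message; the session is NOT closed.
    LogAndDrop,
}

/// Decode an assumed `[tag, version, body...]` frame.
/// Returns `None` only when the frame is too short to carry tag and version.
pub fn decode(frame: &[u8]) -> Option<SignalMessage> {
    let (&tag, rest) = frame.split_first()?;
    let (&version, body) = rest.split_first()?;
    let msg = match (tag, version, body) {
        (0, SIGNAL_VERSION, [nonce]) => SignalMessage::Ping { version, nonce: *nonce },
        (1, SIGNAL_VERSION, []) => SignalMessage::Hangup { version },
        _ => SignalMessage::Unknown { version, raw: frame.to_vec() },
    };
    Some(msg)
}

pub fn disposition(msg: &SignalMessage) -> Disposition {
    match msg {
        SignalMessage::Unknown { .. } => Disposition::LogAndDrop,
        _ => Disposition::Handle,
    }
}
```

A frame from a newer peer, such as `[9, 2, 0xAA]`, decodes to `Unknown` with the original bytes preserved, so the caller can log it without tearing down the session.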
### H5 — W3: `timestamp_ms` rebase documentation
**Current.** Behavior at rekey (every 65,536 packets, ~22 min at 50 pps) is not documented.
**Decision (this PRD).** `timestamp_ms` is **monotonic across rekeys** — it does not reset. Rekey changes only the cryptographic key material; sequence and timestamp are session-scoped, not key-scoped.
**Action.**
- Document in `WZP-SPEC.md` and inline in `packet.rs` doc comments.
- Add a test that performs a rekey mid-session and asserts `timestamp_ms` continuity.
**Severity.** Doc + test. Wave 3.
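The H5 decision can be expressed as a small sender model that a continuity test could drive. Everything here is a hypothetical sketch: `SenderState`, `Stamp`, and the `u64` session-wide counter are illustrative, and only the 65,536-packet rekey interval comes from this PRD. Rekeying bumps `key_epoch` and nothing else.

```rust
pub const REKEY_INTERVAL: u64 = 65_536;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Stamp {
    pub key_epoch: u32,
    pub seq: u64,          // session-scoped, never reset by a rekey
    pub timestamp_ms: u64, // session-scoped, never reset by a rekey
}

pub struct SenderState {
    session_start_ms: u64,
    packets_sent: u64,
    key_epoch: u32,
}

impl SenderState {
    pub fn new(session_start_ms: u64) -> Self {
        Self { session_start_ms, packets_sent: 0, key_epoch: 0 }
    }

    /// Stamp the next outgoing packet. `now_ms` is the caller's clock.
    pub fn next(&mut self, now_ms: u64) -> Stamp {
        // Rekey every REKEY_INTERVAL packets: only key material changes.
        if self.packets_sent > 0 && self.packets_sent % REKEY_INTERVAL == 0 {
            self.key_epoch += 1;
        }
        let stamp = Stamp {
            key_epoch: self.key_epoch,
            seq: self.packets_sent,
            timestamp_ms: now_ms.saturating_sub(self.session_start_ms),
        };
        self.packets_sent += 1;
        stamp
    }
}
```

A continuity test then sends `REKEY_INTERVAL + 1` packets at 20 ms spacing and asserts that `timestamp_ms` keeps increasing while `key_epoch` changes exactly at packet 65,536.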
### H6 — W13: `RoomManager` lock concurrency
**Current.** Single `Mutex<RoomManager>` acquired per packet by every participant for fan-out peer list. Serializes packet processing within a room.
**Problem.** At 1500 pps/sender for video, this is the dominant bottleneck.
**Action.**
- Migrate to `DashMap<RoomId, Arc<RwLock<Room>>>`.
- Per-room `RwLock` allows concurrent reads (fan-out peer list) and exclusive writes (join/leave/quality changes).
- Fan-out path holds read lock; participant churn holds write lock.
- Federation manager updated to match.
**Severity.** Required for video scale. Wave 3.
**Migration safety.**
- Integration test suite (40 + 4 relay tests) must pass.
- Federation tests must pass.
- Trunking tests must pass.
- Property-test: 100-participant room, 500 join/leave events, 10k packets — no panics, no missed forwards.
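The H6 locking scheme can be sketched with the standard library alone. Here an outer `RwLock<HashMap<..>>` stands in for `DashMap<RoomId, Arc<RwLock<Room>>>` so the example has no external dependencies; the real migration would use `DashMap`. `RoomId` and `PeerId` as `u64`, the `Room` fields, and the method names are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub type RoomId = u64;
pub type PeerId = u64;

#[derive(Default)]
pub struct Room {
    peers: Vec<PeerId>,
}

#[derive(Default)]
pub struct RoomManager {
    // Stand-in for DashMap<RoomId, Arc<RwLock<Room>>>.
    rooms: RwLock<HashMap<RoomId, Arc<RwLock<Room>>>>,
}

impl RoomManager {
    /// Clone the room handle and release the map lock immediately.
    fn room(&self, id: RoomId) -> Option<Arc<RwLock<Room>>> {
        self.rooms.read().unwrap().get(&id).cloned()
    }

    /// Participant churn: exclusive write lock on one room only.
    pub fn join(&self, id: RoomId, peer: PeerId) {
        let room = {
            let mut rooms = self.rooms.write().unwrap();
            rooms.entry(id).or_default().clone()
        };
        let mut guard = room.write().unwrap();
        if !guard.peers.contains(&peer) {
            guard.peers.push(peer);
        }
    }

    pub fn leave(&self, id: RoomId, peer: PeerId) {
        if let Some(room) = self.room(id) {
            room.write().unwrap().peers.retain(|p| *p != peer);
        }
    }

    /// Fan-out path: shared read lock, so senders in the same room
    /// can compute their peer lists concurrently.
    pub fn fan_out_targets(&self, id: RoomId, sender: PeerId) -> Vec<PeerId> {
        let room = match self.room(id) {
            Some(r) => r,
            None => return Vec::new(),
        };
        let guard = room.read().unwrap();
        let targets: Vec<PeerId> =
            guard.peers.iter().copied().filter(|p| *p != sender).collect();
        targets
    }
}
```

Cloning the `Arc` out of the map before locking the room keeps the map-level lock hold time short, which is the property that lets unrelated rooms proceed independently.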
## Implementation order
| Wave | Item | Task |
|---|---|---|
| 1 | H1 (W5 AEAD binding) | T1.4 |
| 1 | H3 (W11 anti-replay per-stream) | T1.5 |
| 1 | H2 (W2 block_id widening) | folded into PRD #1 |
| 3 | H4 (W12 signal versioning) | T3.3 |
| 3 | H5 (W3 timestamp doc) | T3.2 |
| 3 | H6 (W13 RoomManager lock) | T3.4 |
## Acceptance criteria
- All current tests pass post-hardening.
- New tests: AEAD trailer tampering, rekey timestamp continuity, 100-participant property test, signal forward-compat decode.
- No regression in p99 fan-out latency, as reported by the Prometheus metrics, after H6.
## Effort
~4.5 engineer-days total (1.5 in Wave 1, 3 in Wave 3).