wz-phone/vault/PRDs/PRD-protocol-hardening.md
Siavash Sameni ed8a7ae5aa docs: protocol audit 2026-05-25, update architecture + Obsidian vault
Audit:
- docs/AUDIT-2026-05-25.md: full protocol audit covering 8 findings
  (4 critical, 2 high, 5 medium, 4 low) with code references and fix
  effort estimates
- vault/Audit/Tasks.md: Obsidian Tasks plugin file tracking all audit
  items with priorities, due dates, and per-step checklists

Architecture docs updated for Wire format v2 and Wave 5/6 features:
- ARCHITECTURE.md: adds wzp-video to dependency graph and project
  structure; wire format updated to v2 (16B header, 5B MiniHeader);
  relay concurrency section corrected (DashMap+RwLock is current, not
  a future optimization); test count 571→702; Android note
- PROGRESS.md: Wave 5 and Wave 6 sections appended; test count 372→702;
  current status and open blockers as of 2026-05-25
- ROAD-TO-VIDEO.md: implementation status table inserted (/🟡/🔴/🔲
  per phase); 6-step critical path to first video call
- WZP-SPEC.md: MediaHeader updated to v2 (16B byte-aligned); MiniHeader
  updated to 5B with seq_delta; codec IDs 9-12 added (H.264/H.265/AV1);
  version negotiation section added

Obsidian vault (vault/):
- 114 files across Architecture/, PRDs/, Reports/, Android/,
  Reference/, Audit/ with YAML frontmatter
- 00 - Home.md index note with wiki links
- .obsidian/app.json config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 06:00:17 +04:00


tags: prd, wzp
type: prd

PRD: Protocol Hardening Batch

Status: proposed
Resolves: Audit W2 (fec_block_id width), W3 (timestamp rebase doc), W5 (QualityReport AEAD binding), W11 (per-stream anti-replay), W12 (signal version byte), W13 (RoomManager lock).
Depends on: PRD #1 (wire format v2 already widens block_id field).

Problem

This PRD batches a handful of medium-priority audit findings that do not individually justify a PRD but together make up the long tail of protocol correctness and concurrency work. Batching them avoids version churn.

Items

H1 — W5: QualityReport trailer must be inside AEAD

Current risk. If the 4-byte trailer sits outside the encrypted payload, anything stripping the last 4 bytes corrupts AEAD verification on legitimate packets and creates a quality-feedback downgrade vector. Even if it's correctly inside today, the v2 wire format change is the right moment to assert this explicitly.

Action.

  • Audit crates/wzp-proto/src/packet.rs for QualityReport placement.
  • Move inside AEAD payload if currently outside.
  • Document: "QualityReport, when Q-flag set, is appended to plaintext payload before encryption."
  • Test: tamper with trailer → AEAD decrypt fails.
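The placement rule in the documentation bullet can be sketched as a pair of helpers that assemble and split the AEAD plaintext. This is a minimal illustration, not the code in packet.rs: the names `QualityReport`, `build_plaintext`, and `split_plaintext` are hypothetical, the 4-byte trailer is treated as opaque, and the encryption step itself is left out.

```rust
/// Opaque 4-byte QualityReport trailer (assumption: field layout not modelled here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QualityReport(pub [u8; 4]);

/// Build the AEAD plaintext: media payload first, then the trailer if present.
/// The caller sets the Q flag in the header exactly when `report` is `Some`.
/// The returned buffer is what gets passed to the AEAD seal call, so the
/// trailer is covered by the authentication tag.
pub fn build_plaintext(payload: &[u8], report: Option<QualityReport>) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 4);
    out.extend_from_slice(payload);
    if let Some(QualityReport(bytes)) = report {
        out.extend_from_slice(&bytes);
    }
    out
}

/// Split decrypted plaintext back into payload and optional trailer.
/// Returns `None` if the Q flag is set but fewer than 4 bytes are available.
pub fn split_plaintext(
    plaintext: &[u8],
    q_flag: bool,
) -> Option<(&[u8], Option<QualityReport>)> {
    if !q_flag {
        return Some((plaintext, None));
    }
    let split = plaintext.len().checked_sub(4)?;
    let (payload, tail) = plaintext.split_at(split);
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(tail);
    Some((payload, Some(QualityReport(bytes))))
}
```

Because `split_plaintext` only ever runs on successfully decrypted plaintext, any modification of the trailer bytes in transit surfaces as an AEAD failure before this code is reached, which is the property the tamper test is meant to assert.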

Severity. Security correctness. Do this in Wave 1.

H2 — W2: fec_block_id width

Resolved by v2 wire format (u16 instead of u8). PRD #1 carries the wire change; this PRD just confirms semantics:

  • Wraps at 2^16 blocks. At 5-frame blocks and 50 pps, that is 2^16 × 5 / 50 ≈ 6,554 s (~109 min) between collisions, vs. ~25 s in v1 (2^8 × 5 / 50 ≈ 25.6 s).
  • Late-joining peers must still discard FEC blocks older than 2 s; widening is defense in depth.

Action. Update wzp-fec to operate on u16 block_id end-to-end. Test reconstruction across a synthetic session long enough to cross the u16 wrap (~109 min at the rates above).
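The wrap interval follows directly from the ID width, the block size, and the packet rate. The sketch below computes it and adds one common way to compare block IDs across a wrap (serial-number style comparison). Both function names are hypothetical, and the comparison rule is an assumption for illustration, not something wzp-fec is confirmed to use.

```rust
/// Seconds until a block ID counter of `id_bits` bits wraps, given the
/// number of frames per FEC block and the packet rate in packets per second.
pub fn wrap_interval_secs(id_bits: u32, frames_per_block: u32, pps: u32) -> f64 {
    let blocks = 2f64.powi(id_bits as i32);
    blocks * frames_per_block as f64 / pps as f64
}

/// True if block `a` is newer than block `b` under u16 wraparound:
/// the forward distance from `b` to `a` is non-zero and below half the ID space.
pub fn is_newer(a: u16, b: u16) -> bool {
    let d = a.wrapping_sub(b);
    d != 0 && d < 0x8000
}
```

With 5-frame blocks at 50 pps, `wrap_interval_secs(8, 5, 50)` gives 25.6 s for the v1 u8 field and `wrap_interval_secs(16, 5, 50)` gives 6553.6 s (about 109 min) for the v2 u16 field. A wrap-aware comparison such as `is_newer(0, 65535)` returning true is the kind of case a wrap-crossing reconstruction test would exercise.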

H3 — W11: Per-stream, per-MediaType anti-replay window

Current. A single 64-packet sliding window shared globally across all streams and media types.

Problem. Video keyframe burst (100+ packets) can stall the window behind one reordered prior packet.

Action.

  • Anti-replay state is per (stream_id, media_type).
  • Window size: 64 for audio, 1024 for video, 256 for data.
  • Window size selected at session setup based on declared profile; tunable via QualityProfile.
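A minimal sketch of the per-(stream_id, media_type) state described above, using only the standard library. The type names, the `u32` stream ID, and the `u64` sequence number are assumptions for illustration; the real wire sequence is narrower and would need wrap handling, and the QualityProfile tuning hook is not modelled.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MediaType { Audio, Video, Data }

impl MediaType {
    /// Default window sizes taken from the list above.
    pub fn window_size(self) -> u64 {
        match self {
            MediaType::Audio => 64,
            MediaType::Video => 1024,
            MediaType::Data => 256,
        }
    }
}

/// Sliding anti-replay window backed by a ring of "seen" flags.
pub struct ReplayWindow { size: u64, highest: Option<u64>, seen: Vec<bool> }

impl ReplayWindow {
    pub fn new(size: u64) -> Self {
        Self { size, highest: None, seen: vec![false; size as usize] }
    }

    /// Returns true if `seq` is fresh (and records it), false if it is a
    /// replay or has fallen out of the window.
    pub fn accept(&mut self, seq: u64) -> bool {
        let slot = (seq % self.size) as usize;
        if let Some(h) = self.highest {
            if seq <= h {
                if h - seq >= self.size || self.seen[slot] {
                    return false; // too old, or already seen
                }
                self.seen[slot] = true;
                return true;
            }
            // Window advances: clear slots for sequence numbers skipped over.
            if seq - h < self.size {
                for s in (h + 1)..seq {
                    self.seen[(s % self.size) as usize] = false;
                }
            } else {
                self.seen.fill(false);
            }
        }
        self.seen[slot] = true;
        self.highest = Some(seq);
        true
    }
}

/// One independent window per (stream_id, media_type).
#[derive(Default)]
pub struct AntiReplay { windows: HashMap<(u32, MediaType), ReplayWindow> }

impl AntiReplay {
    pub fn accept(&mut self, stream_id: u32, media: MediaType, seq: u64) -> bool {
        self.windows
            .entry((stream_id, media))
            .or_insert_with(|| ReplayWindow::new(media.window_size()))
            .accept(seq)
    }
}
```

With this layout, a 200-packet video burst on one stream leaves the audio window of another stream untouched, and a video packet reordered behind that burst is still accepted because it falls inside the 1024-packet window.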

Severity. Required before video. Wave 1.

H4 — W12: SignalMessage versioning

Current. Bincode-serialized enum. #[serde(default, skip_serializing_if)] handles field additions; variant removals or semantic changes are unsafe.

Action.

  • Every variant gains version: u8 as its first field.
  • Add SignalMessage::Unknown { version, raw: Bytes } to absorb future unknown variants gracefully.
  • Decode path: unknown variant → log + drop, do not close session.
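The decode rule above can be illustrated with a hand-rolled frame so the example needs no external crates. Everything here is an assumption made for the sketch: the `[tag][version][body]` layout, the `Ping` variant, the `Action` type, and `Vec<u8>` standing in for `Bytes`. The real messages are bincode-serialized and would need their own handling for unknown variant indices.

```rust
#[derive(Debug, PartialEq)]
pub enum SignalMessage {
    /// Hypothetical variant; `version` is its first field, as proposed above.
    Ping { version: u8, nonce: u8 },
    /// Catch-all for variants or versions this build does not understand.
    Unknown { version: u8, raw: Vec<u8> },
}

pub const CURRENT_VERSION: u8 = 1;
const TAG_PING: u8 = 0;

/// Decode a frame laid out as [tag: u8][version: u8][body...].
/// Returns `None` only for frames too short to carry a tag and a version.
pub fn decode(frame: &[u8]) -> Option<SignalMessage> {
    let (&tag, rest) = frame.split_first()?;
    let (&version, body) = rest.split_first()?;
    match (tag, version, body) {
        (TAG_PING, CURRENT_VERSION, [nonce]) => {
            Some(SignalMessage::Ping { version, nonce: *nonce })
        }
        // Unknown tag, unknown version, or unexpected body shape.
        _ => Some(SignalMessage::Unknown { version, raw: body.to_vec() }),
    }
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Handle(SignalMessage),
    Drop,
}

/// Unknown or truncated messages are logged and dropped; there is no
/// "close session" outcome on this path.
pub fn dispatch(frame: &[u8]) -> Action {
    match decode(frame) {
        Some(SignalMessage::Unknown { version, raw }) => {
            eprintln!("dropping unknown signal message: version={version}, {} bytes", raw.len());
            Action::Drop
        }
        Some(msg) => Action::Handle(msg),
        None => Action::Drop,
    }
}
```

The design point is that `dispatch` has only two outcomes, so a peer running a newer build cannot tear down the session of an older one merely by sending a message the older build does not recognize.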

Severity. Future-proofing. Wave 3.

H5 — W3: timestamp_ms rebase documentation

Current. Behavior at rekey (every 65,536 packets, ~22 min at 50 pps) is not documented.

Decision (this PRD). timestamp_ms is monotonic across rekeys — it does not reset. Rekey changes only the cryptographic key material; sequence and timestamp are session-scoped, not key-scoped.

Action.

  • Document in WZP-SPEC.md and inline in packet.rs doc comments.
  • Add a test that performs a rekey mid-session and asserts timestamp_ms continuity.
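A sketch of what the continuity test could assert, using a toy sender in place of the real session type. The `Sender` struct, its fields, the `u16` sequence width, and the 20 ms frame interval (50 pps) are assumptions for illustration only.

```rust
/// Rekey interval from the text above.
pub const REKEY_INTERVAL: u64 = 65_536;

/// Toy sender state: the key epoch is key-scoped, while the packet counter
/// and timestamp are session-scoped.
#[derive(Debug, Default)]
pub struct Sender {
    pub key_epoch: u32,
    pub packets_sent: u64,
    pub timestamp_ms: u32,
}

impl Sender {
    /// Emit one header as (key_epoch, seq, timestamp_ms), then advance the
    /// session clock by `frame_ms`.
    pub fn next_header(&mut self, frame_ms: u32) -> (u32, u16, u32) {
        if self.packets_sent > 0 && self.packets_sent % REKEY_INTERVAL == 0 {
            // Rekey: only the key material (modelled as an epoch) changes.
            self.key_epoch += 1;
        }
        let header = (self.key_epoch, self.packets_sent as u16, self.timestamp_ms);
        self.packets_sent += 1;
        self.timestamp_ms = self.timestamp_ms.wrapping_add(frame_ms);
        header
    }
}
```

Driving this through 65,536 packets and then one more shows the key epoch stepping from 0 to 1 while `timestamp_ms` advances by exactly one frame interval across the boundary, which is the invariant the real test would check against the actual rekey path.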

Severity. Doc + test. Wave 3.

H6 — W13: RoomManager lock concurrency

Current. A single Mutex<RoomManager> is acquired per packet by every participant to read the fan-out peer list, which serializes packet processing within a room.

Problem. At 1500 pps/sender for video, this is the dominant bottleneck.

Action.

  • Migrate to DashMap<RoomId, Arc<RwLock<Room>>>.
  • Per-room RwLock allows concurrent reads (fan-out peer list) and exclusive writes (join/leave/quality changes).
  • Fan-out path holds read lock; participant churn holds write lock.
  • Federation manager updated to match.
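The locking shape described above can be sketched with the standard library alone. To keep the example free of external crates, `RwLock<HashMap<..>>` stands in for `DashMap`; that substitution, the `RoomId` and `PeerId` aliases, and the `Room` fields are assumptions, not the relay's actual types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub type RoomId = u64;
pub type PeerId = u32;

#[derive(Default)]
pub struct Room {
    peers: Vec<PeerId>,
}

/// Outer map of rooms, each behind its own RwLock.
/// (Stand-in: the PRD proposes DashMap for the outer map.)
#[derive(Default)]
pub struct RoomManager {
    rooms: RwLock<HashMap<RoomId, Arc<RwLock<Room>>>>,
}

impl RoomManager {
    /// Hold the outer lock only long enough to clone the room handle.
    fn room(&self, id: RoomId) -> Option<Arc<RwLock<Room>>> {
        let rooms = self.rooms.read().unwrap();
        rooms.get(&id).cloned()
    }

    /// Participant churn: exclusive write lock on the one affected room.
    pub fn join(&self, id: RoomId, peer: PeerId) {
        let room = {
            let mut rooms = self.rooms.write().unwrap();
            rooms.entry(id).or_default().clone()
        };
        let mut room = room.write().unwrap();
        if !room.peers.contains(&peer) {
            room.peers.push(peer);
        }
    }

    pub fn leave(&self, id: RoomId, peer: PeerId) {
        if let Some(room) = self.room(id) {
            room.write().unwrap().peers.retain(|p| *p != peer);
        }
    }

    /// Fan-out path: shared read lock, so many senders can run concurrently.
    pub fn fan_out_targets(&self, id: RoomId, sender: PeerId) -> Vec<PeerId> {
        let Some(room) = self.room(id) else { return Vec::new() };
        let guard = room.read().unwrap();
        let targets: Vec<PeerId> =
            guard.peers.iter().copied().filter(|p| *p != sender).collect();
        targets
    }
}
```

In this shape the per-packet path takes two read locks and no write locks, so senders in the same room no longer block each other, and a join or leave in one room does not stall fan-out in another.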

Severity. Required for video scale. Wave 3.

Migration safety.

  • Integration test suite (40 + 4 relay tests) must pass.
  • Federation tests must pass.
  • Trunking tests must pass.
  • Property-test: 100-participant room, 500 join/leave events, 10k packets — no panics, no missed forwards.

Implementation order

| Wave | Item | Task |
|------|------|------|
| 1 | H1 (W5 AEAD binding) | T1.4 |
| 1 | H3 (W11 anti-replay per-stream) | T1.5 |
| 1 | H2 (W2 block_id widening) | folded into PRD #1 |
| 3 | H4 (W12 signal versioning) | T3.3 |
| 3 | H5 (W3 timestamp doc) | T3.2 |
| 3 | H6 (W13 RoomManager lock) | T3.4 |

Acceptance criteria

  • All current tests pass post-hardening.
  • New tests: AEAD trailer tampering, rekey timestamp continuity, 100-participant property test, signal forward-compat decode.
  • No regression in the Prometheus-reported fan-out latency p99 after H6.

Effort

~4.5 engineer-days total (1.5 in Wave 1, 3 in Wave 3).