docs: protocol audit 2026-05-25, update architecture + Obsidian vault

Audit:
- docs/AUDIT-2026-05-25.md: full protocol audit covering 8 findings (4 critical, 2 high, 5 medium, 4 low) with code references and fix effort estimates
- vault/Audit/Tasks.md: Obsidian Tasks plugin file tracking all audit items with priorities, due dates, and per-step checklists

Architecture docs updated for Wire format v2 and Wave 5/6 features:
- ARCHITECTURE.md: adds wzp-video to dependency graph and project structure; wire format updated to v2 (16B header, 5B MiniHeader); relay concurrency section corrected (DashMap+RwLock is current, not a future optimization); test count 571→702; Android note
- PROGRESS.md: Wave 5 and Wave 6 sections appended; test count 372→702; current status and open blockers as of 2026-05-25
- ROAD-TO-VIDEO.md: implementation status table inserted (✅/🟡/🔴/🔲 per phase); 6-step critical path to first video call
- WZP-SPEC.md: MediaHeader updated to v2 (16B byte-aligned); MiniHeader updated to 5B with seq_delta; codec IDs 9-12 added (H.264/H.265/AV1); version negotiation section added

Obsidian vault (vault/):
- 114 files across Architecture/, PRDs/, Reports/, Android/, Reference/, Audit/ with YAML frontmatter
- 00 - Home.md index note with wiki links
- .obsidian/app.json config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
114
vault/PRDs/PRD-protocol-hardening.md
Normal file
@@ -0,0 +1,114 @@
---
tags: [prd, wzp]
type: prd
---

# PRD: Protocol Hardening Batch

> **Status:** proposed
> **Resolves:** Audit W2 (fec_block_id width), W3 (timestamp rebase doc), W5 (QualityReport AEAD binding), W11 (per-stream anti-replay), W12 (signal version byte), W13 (RoomManager lock).
> **Depends on:** PRD #1 (wire format v2 already widens block_id field).

## Problem

Several medium-priority audit findings do not individually justify a PRD, but together they make up the long tail of protocol correctness and concurrency work. Batching them avoids version churn.

## Items

### H1 — W5: `QualityReport` trailer must be inside AEAD

**Current risk.** If the 4-byte trailer sits *outside* the encrypted payload, anything stripping the last 4 bytes corrupts AEAD verification on legitimate packets and creates a quality-feedback downgrade vector. Even if it's correctly inside today, the v2 wire format change is the right moment to assert this explicitly.

**Action.**
- Audit `crates/wzp-proto/src/packet.rs` for `QualityReport` placement.
- Move inside AEAD payload if currently outside.
- Document: "QualityReport, when Q-flag set, is appended to plaintext payload before encryption."
- Test: tamper with trailer → AEAD decrypt fails.

**Severity.** Security correctness. Do this in Wave 1.
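
The framing rule in H1 can be sketched as below. This is a minimal illustration, not the code in `packet.rs`: the `QualityReport` field names, the `build_plaintext`/`split_plaintext` helpers, and the assumption that the Q-flag travels in an authenticated header are all hypothetical, and the AEAD call itself is omitted because it depends on the cipher crate in use.

```rust
/// Hypothetical 4-byte quality trailer. Field names and layout are
/// illustrative assumptions, not the real `QualityReport` definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct QualityReport {
    loss_pct: u8,
    rtt_4ms: u8,
    jitter_ms: u8,
    bitrate_cap: u8,
}

const TRAILER_LEN: usize = 4;

impl QualityReport {
    fn to_bytes(self) -> [u8; TRAILER_LEN] {
        [self.loss_pct, self.rtt_4ms, self.jitter_ms, self.bitrate_cap]
    }

    fn from_bytes(b: [u8; TRAILER_LEN]) -> Self {
        QualityReport { loss_pct: b[0], rtt_4ms: b[1], jitter_ms: b[2], bitrate_cap: b[3] }
    }
}

/// Builds the plaintext that is handed to the AEAD: the media payload,
/// followed by the trailer when a report is attached (Q-flag set).
fn build_plaintext(payload: &[u8], report: Option<QualityReport>) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + TRAILER_LEN);
    out.extend_from_slice(payload);
    if let Some(r) = report {
        out.extend_from_slice(&r.to_bytes());
    }
    out
}

/// Splits plaintext that has already passed AEAD verification.
/// `q_flag` is assumed to come from the authenticated header.
/// Returns `None` if the Q-flag is set but the trailer cannot fit.
fn split_plaintext(plaintext: &[u8], q_flag: bool) -> Option<(&[u8], Option<QualityReport>)> {
    if !q_flag {
        return Some((plaintext, None));
    }
    if plaintext.len() < TRAILER_LEN {
        return None;
    }
    let (payload, tail) = plaintext.split_at(plaintext.len() - TRAILER_LEN);
    let mut bytes = [0u8; TRAILER_LEN];
    bytes.copy_from_slice(tail);
    Some((payload, Some(QualityReport::from_bytes(bytes))))
}
```

Because the trailer is part of the buffer passed to the cipher, any modification or truncation of it on the wire changes the ciphertext and is rejected at tag verification, which is the property the tamper test in the action list is meant to pin down.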

### H2 — W2: `fec_block_id` width

Resolved by v2 wire format (`u16` instead of `u8`). PRD #1 carries the wire change; this PRD just confirms semantics:

- Wraps at 2^16. At 5-frame blocks and 50 pps → ~109 min between collisions (65,536 × 5 / 50 ≈ 6,554 s), vs. ~25 s in v1.
- Late-joining peers must still discard FEC blocks older than 2 s; widening is defense in depth.

**Action.** Update `wzp-fec` to operate on u16 block_id end-to-end. Test reconstruction across a synthetic session that crosses the u16 wrap (~109 min of traffic at these rates).
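
The wrap interval under the stated assumptions (5 frames per block, 50 packets per second) can be checked with a few lines of arithmetic. The sketch below also shows one common way to compare block IDs across the wrap, serial-number style; both function names are hypothetical and nothing here is taken from `wzp-fec`.

```rust
/// Seconds until a block-id counter of `id_bits` bits wraps, assuming one
/// block per `frames_per_block` packets at `packets_per_sec`.
fn wrap_interval_secs(id_bits: u32, frames_per_block: u64, packets_per_sec: u64) -> f64 {
    let blocks = 1u64 << id_bits;
    (blocks * frames_per_block) as f64 / packets_per_sec as f64
}

/// Wrap-aware ordering for u16 block IDs: `a` counts as newer than `b`
/// when it is ahead by less than half the ID space.
fn block_id_is_newer(a: u16, b: u16) -> bool {
    let diff = a.wrapping_sub(b);
    diff != 0 && diff < 0x8000
}
```

With these inputs, an 8-bit ID wraps after 25.6 s and a 16-bit ID after 6,553.6 s (about 109 minutes). A wrap-aware comparison of this kind lets a receiver treat block 0 as the successor of block 65,535 rather than as a stale block.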

### H3 — W11: Per-stream, per-`MediaType` anti-replay window

**Current.** 64-packet sliding window globally.

**Problem.** Video keyframe burst (100+ packets) can stall the window behind one reordered prior packet.

**Action.**
- Anti-replay state is per (stream_id, media_type).
- Window size: 64 for audio, 1024 for video, 256 for data.
- Window size selected at session setup based on declared profile; tunable via `QualityProfile`.

**Severity.** Required before video. Wave 1.
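
A minimal sketch of per-(stream_id, media_type) anti-replay state follows. It assumes a 64-bit, non-wrapping sequence number and a `u32` stream ID, neither of which is confirmed by this PRD, and the `AntiReplay`, `ReplayWindow`, and `default_window` names are hypothetical. Selecting the window size from `QualityProfile` is left out.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum MediaType {
    Audio,
    Video,
    Data,
}

impl MediaType {
    /// Window sizes taken from the action list above.
    fn default_window(self) -> u64 {
        match self {
            MediaType::Audio => 64,
            MediaType::Video => 1024,
            MediaType::Data => 256,
        }
    }
}

/// Sliding window over the last `size` sequence numbers, stored as a ring
/// of flags indexed by `seq % size`.
struct ReplayWindow {
    size: u64,
    highest: Option<u64>,
    seen: Vec<bool>,
}

impl ReplayWindow {
    fn new(size: u64) -> Self {
        ReplayWindow { size, highest: None, seen: vec![false; size as usize] }
    }

    /// Returns true if `seq` is fresh and records it; false for replays and
    /// for packets that have fallen out of the window.
    fn accept(&mut self, seq: u64) -> bool {
        let slot = (seq % self.size) as usize;
        let highest = self.highest;
        match highest {
            Some(h) if seq <= h => {
                if h - seq >= self.size || self.seen[slot] {
                    return false;
                }
                self.seen[slot] = true;
                true
            }
            Some(h) => {
                // Slide forward: clear the slots that now belong to new sequence numbers.
                let advance = (seq - h).min(self.size);
                for i in 1..=advance {
                    let stale = ((h + i) % self.size) as usize;
                    self.seen[stale] = false;
                }
                self.seen[slot] = true;
                self.highest = Some(seq);
                true
            }
            None => {
                self.seen[slot] = true;
                self.highest = Some(seq);
                true
            }
        }
    }
}

/// One independent window per (stream_id, media_type) pair.
#[derive(Default)]
struct AntiReplay {
    windows: HashMap<(u32, MediaType), ReplayWindow>,
}

impl AntiReplay {
    fn accept(&mut self, stream_id: u32, media: MediaType, seq: u64) -> bool {
        self.windows
            .entry((stream_id, media))
            .or_insert_with(|| ReplayWindow::new(media.default_window()))
            .accept(seq)
    }
}
```

With separate windows, a 150-packet video burst that arrives ahead of one delayed packet still lets that packet through (149 < 1024), while the same pattern on an audio stream would correctly reject it as too old (149 ≥ 64).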

### H4 — W12: `SignalMessage` versioning

**Current.** Bincode-serialized enum. `#[serde(default, skip_serializing_if)]` handles field additions; variant removals or semantic changes are unsafe.

**Action.**
- Every variant gains `version: u8` as its first field.
- Add `SignalMessage::Unknown { version, raw: Bytes }` to absorb future unknown variants gracefully.
- Decode path: unknown variant → log + drop, do not close session.

**Severity.** Future-proofing. Wave 3.
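
The decode behaviour described in H4 can be illustrated without the real serde/bincode path. The sketch below assumes a hand-rolled `[version][tag][body...]` framing, two made-up variants (`Ping`, `Hangup`), `Vec<u8>` in place of `Bytes`, and a hypothetical `Disposition` type; it only shows the shape of "unknown variant → log + drop, session stays open".

```rust
/// Illustrative message set. `Ping` and `Hangup` are hypothetical variants;
/// only `Unknown` corresponds to something named in this PRD.
#[derive(Debug, PartialEq)]
enum SignalMessage {
    Ping { version: u8, nonce: u8 },
    Hangup { version: u8 },
    Unknown { version: u8, raw: Vec<u8> },
}

/// Decodes an assumed `[version][tag][body...]` frame.
/// Frames too short to carry a version and tag yield `None`; anything with
/// an unrecognised tag or body shape is preserved as `Unknown`.
fn decode(frame: &[u8]) -> Option<SignalMessage> {
    let (&version, rest) = frame.split_first()?;
    let (&tag, body) = rest.split_first()?;
    Some(match (tag, body) {
        (0x01, [nonce]) => SignalMessage::Ping { version, nonce: *nonce },
        (0x02, []) => SignalMessage::Hangup { version },
        _ => SignalMessage::Unknown { version, raw: frame.to_vec() },
    })
}

/// What the session loop does with a decoded message. Neither outcome
/// closes the session.
#[derive(Debug, PartialEq)]
enum Disposition {
    Handled,
    Dropped,
}

fn dispatch(msg: &SignalMessage) -> Disposition {
    match msg {
        SignalMessage::Unknown { version, raw } => {
            eprintln!("dropping unknown signal (version {}, {} bytes)", version, raw.len());
            Disposition::Dropped
        }
        _ => Disposition::Handled,
    }
}
```

Keeping the raw bytes in `Unknown` means a relay that does not understand a message could still log or forward it, which is one reason to carry `raw` rather than discarding it at decode time.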

### H5 — W3: `timestamp_ms` rebase documentation

**Current.** Behavior at rekey (every 65,536 packets, ~22 min) is not documented.

**Decision (this PRD).** `timestamp_ms` is **monotonic across rekeys** — it does not reset. Rekey changes only the cryptographic key material; sequence and timestamp are session-scoped, not key-scoped.

**Action.**
- Document in `WZP-SPEC.md` and inline in `packet.rs` doc comments.
- Add a test that performs a rekey mid-session and asserts `timestamp_ms` continuity.

**Severity.** Doc + test. Wave 3.
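
The decision in H5 amounts to keeping the sequence counter and timestamp base in session state that a rekey never touches. A minimal sketch, assuming a hypothetical `SendState` type, a 32-byte key, a `u64` session sequence, and a `u32` `timestamp_ms` (none of these widths are specified here):

```rust
/// Sender-side session state. Only `key` and `key_epoch` are key-scoped;
/// `seq` and `start_ms` are session-scoped.
struct SendState {
    key_epoch: u32,
    key: [u8; 32],
    seq: u64,
    start_ms: u64,
}

impl SendState {
    fn new(key: [u8; 32], start_ms: u64) -> Self {
        SendState { key_epoch: 0, key, seq: 0, start_ms }
    }

    /// Swaps key material only. Sequence and timestamp base are left alone,
    /// so both stay monotonic across the rekey.
    fn rekey(&mut self, new_key: [u8; 32]) {
        self.key = new_key;
        self.key_epoch += 1;
    }

    /// Returns (sequence, timestamp_ms) for the next outgoing packet.
    fn next_header(&mut self, now_ms: u64) -> (u64, u32) {
        let seq = self.seq;
        self.seq += 1;
        let timestamp_ms = now_ms.saturating_sub(self.start_ms) as u32;
        (seq, timestamp_ms)
    }
}
```

A continuity test then sends a packet, calls `rekey`, sends another, and asserts that the second timestamp and sequence number are strictly greater than the first.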

### H6 — W13: `RoomManager` lock concurrency

**Current.** A single `Mutex<RoomManager>` is acquired per packet by every participant to read the fan-out peer list. This serializes packet processing within a room.

**Problem.** At 1500 pps/sender for video, this is the dominant bottleneck.

**Action.**
- Migrate to `DashMap<RoomId, Arc<RwLock<Room>>>`.
- Per-room `RwLock` allows concurrent reads (fan-out peer list) and exclusive writes (join/leave/quality changes).
- Fan-out path holds read lock; participant churn holds write lock.
- Federation manager updated to match.

**Severity.** Required for video scale. Wave 3.

**Migration safety.**
- Integration test suite (40 + 4 relay tests) must pass.
- Federation tests must pass.
- Trunking tests must pass.
- Property test: 100-participant room, 500 join/leave events, 10k packets, with no panics and no missed forwards.
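
The locking split in H6 can be sketched with the standard library alone. To stay free of external crates, the outer `DashMap` is replaced here by `RwLock<HashMap<..>>`, which is a stand-in and not the proposed design; the `Room` fields, the ID types, and the method names are likewise assumptions.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};

type RoomId = u64;
type ParticipantId = u64;

#[derive(Default)]
struct Room {
    participants: HashSet<ParticipantId>,
}

/// Outer map is a std stand-in for `DashMap<RoomId, Arc<RwLock<Room>>>`.
#[derive(Default)]
struct RoomManager {
    rooms: RwLock<HashMap<RoomId, Arc<RwLock<Room>>>>,
}

impl RoomManager {
    /// Clones the room handle so the outer lock is released immediately.
    fn room(&self, id: RoomId) -> Option<Arc<RwLock<Room>>> {
        self.rooms.read().unwrap().get(&id).cloned()
    }

    /// Participant churn: takes the per-room write lock.
    fn join(&self, id: RoomId, who: ParticipantId) {
        let room = {
            let mut rooms = self.rooms.write().unwrap();
            rooms.entry(id).or_default().clone()
        };
        room.write().unwrap().participants.insert(who);
    }

    fn leave(&self, id: RoomId, who: ParticipantId) {
        if let Some(room) = self.room(id) {
            room.write().unwrap().participants.remove(&who);
        }
    }

    /// Fan-out path: takes only the per-room read lock, so senders in the
    /// same room can compute their peer lists concurrently.
    fn fan_out_targets(&self, id: RoomId, sender: ParticipantId) -> Vec<ParticipantId> {
        let room = match self.room(id) {
            Some(room) => room,
            None => return Vec::new(),
        };
        let guard = room.read().unwrap();
        let mut targets: Vec<ParticipantId> = guard
            .participants
            .iter()
            .copied()
            .filter(|p| *p != sender)
            .collect();
        targets.sort_unstable();
        targets
    }
}
```

The point of the shape is that the per-packet path never takes an exclusive lock: only join, leave, and similar churn events block readers, and only within the affected room.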

## Implementation order

| Wave | Item | Task |
|---|---|---|
| 1 | H1 (W5 AEAD binding) | T1.4 |
| 1 | H3 (W11 anti-replay per-stream) | T1.5 |
| 1 | H2 (W2 block_id widening) | folded into PRD #1 |
| 3 | H4 (W12 signal versioning) | T3.3 |
| 3 | H5 (W3 timestamp doc) | T3.2 |
| 3 | H6 (W13 RoomManager lock) | T3.4 |

## Acceptance criteria

- All current tests pass post-hardening.
- New tests: AEAD trailer tampering, rekey timestamp continuity, 100-participant property test, signal forward-compat decode.
- No Prometheus regression in fan-out latency p99 after H6.

## Effort

~4.5 engineer-days total (1.5 in Wave 1, 3 in Wave 3).