wz-phone/docs/PRD/PRD-protocol-hardening.md
2026-05-11 12:37:32 +04:00


# PRD: Protocol Hardening Batch
> **Status:** proposed
> **Resolves:** Audit W2 (fec_block_id width), W3 (timestamp rebase doc), W5 (QualityReport AEAD binding), W11 (per-stream anti-replay), W12 (signal version byte), W13 (RoomManager lock).
> **Depends on:** PRD #1 (wire format v2 already widens block_id field).
## Problem
A handful of medium-priority audit findings do not individually justify a PRD, but together they make up the long tail of protocol correctness and concurrency issues. Batching them avoids version churn.
## Items
### H1 — W5: `QualityReport` trailer must be inside AEAD
**Current risk.** If the 4-byte trailer sits *outside* the encrypted payload, anything stripping the last 4 bytes corrupts AEAD verification on legitimate packets and creates a quality-feedback downgrade vector. Even if it's correctly inside today, the v2 wire format change is the right moment to assert this explicitly.
**Action.**
- Audit `crates/wzp-proto/src/packet.rs` for `QualityReport` placement.
- Move inside AEAD payload if currently outside.
- Document: "QualityReport, when Q-flag set, is appended to plaintext payload before encryption."
- Test: tamper with trailer → AEAD decrypt fails.
**Severity.** Security correctness. Do this in Wave 1.
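
The placement rule documented above can be sketched as a pair of framing helpers. This is a minimal illustration, not the code in `packet.rs`: the function names are hypothetical, the trailer is treated as an opaque 4-byte array, and the AEAD call itself is left to the caller. The tamper test from the action list needs the real cipher and is not shown.

```rust
/// Length of the QualityReport trailer in bytes (the PRD states 4).
pub const QUALITY_REPORT_LEN: usize = 4;

/// Build the AEAD plaintext: media payload first, then the optional trailer.
/// The caller encrypts the returned buffer as a whole, so the trailer is
/// covered by the authentication tag. (Hypothetical helper name.)
pub fn build_plaintext(payload: &[u8], report: Option<[u8; QUALITY_REPORT_LEN]>) -> Vec<u8> {
    let mut plaintext = Vec::with_capacity(payload.len() + QUALITY_REPORT_LEN);
    plaintext.extend_from_slice(payload);
    if let Some(trailer) = report {
        plaintext.extend_from_slice(&trailer);
    }
    plaintext
}

/// Split decrypted plaintext back into payload and optional trailer.
/// `q_flag` is the header bit announcing that a trailer is present.
/// Returns `None` if the flag is set but the plaintext is too short.
pub fn split_plaintext(
    plaintext: &[u8],
    q_flag: bool,
) -> Option<(&[u8], Option<[u8; QUALITY_REPORT_LEN]>)> {
    if !q_flag {
        return Some((plaintext, None));
    }
    let split_at = plaintext.len().checked_sub(QUALITY_REPORT_LEN)?;
    let (payload, trailer) = plaintext.split_at(split_at);
    let mut report = [0u8; QUALITY_REPORT_LEN];
    report.copy_from_slice(trailer);
    Some((payload, Some(report)))
}
```

Because the split happens only after decryption succeeds, a middlebox cannot strip or alter the trailer without invalidating the tag.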
### H2 — W2: `fec_block_id` width
Resolved by v2 wire format (`u16` instead of `u8`). PRD #1 carries the wire change; this PRD just confirms semantics:
- Wraps at 2^16. At 5-frame blocks and 50 pps, block IDs advance at 10 per second → ~109 min between collisions (65,536 blocks ÷ 10 blocks/s ≈ 6,554 s), vs. ~25 s in v1.
- Late-joining peers must still discard FEC blocks older than 2 s; widening is defense in depth.
**Action.** Update `wzp-fec` to operate on u16 block_id end-to-end. Test reconstruction across a synthetic session long enough to cross the u16 wrap (~109 min of traffic at the rates above).
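
The wrap arithmetic and a wrap-safe ordering check can be sketched as follows. Both function names are hypothetical, and the interval calculation assumes one packet per frame with no repair packets counted. The comparison uses serial-number arithmetic, which is one common way to order wrapping counters; nothing here confirms that `wzp-fec` does it this way.

```rust
/// Seconds until a block-id counter of `id_bits` bits wraps, assuming one
/// packet per frame (repair packets are not counted).
pub fn wrap_interval_secs(id_bits: u32, frames_per_block: u32, packets_per_sec: u32) -> f64 {
    let blocks = 2f64.powi(id_bits as i32);
    blocks * f64::from(frames_per_block) / f64::from(packets_per_sec)
}

/// True if block `a` is newer than block `b` under serial-number arithmetic:
/// `a` is ahead of `b` by less than half the u16 range, wraps included.
pub fn block_is_newer(a: u16, b: u16) -> bool {
    let diff = a.wrapping_sub(b);
    diff != 0 && diff < 0x8000
}
```

With 5-frame blocks at 50 pps, `wrap_interval_secs(8, 5, 50)` gives 25.6 s for the v1 `u8` field and `wrap_interval_secs(16, 5, 50)` gives 6553.6 s (about 109 min) for the v2 `u16` field. `block_is_newer(0, 65535)` is true, so block 0 is treated as following block 65535 across the wrap.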
### H3 — W11: Per-stream, per-`MediaType` anti-replay window
**Current.** A single global 64-packet sliding window.
**Problem.** Video keyframe burst (100+ packets) can stall the window behind one reordered prior packet.
**Action.**
- Anti-replay state is per (stream_id, media_type).
- Window size: 64 for audio, 1024 for video, 256 for data.
- Window size selected at session setup based on declared profile; tunable via `QualityProfile`.
**Severity.** Required before video. Wave 1.
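
A minimal sketch of per-(stream_id, media_type) anti-replay state, using the window sizes listed above. It makes several assumptions that the PRD does not state: sequence numbers are already extended to a non-wrapping `u64`, `stream_id` is a `u32`, and the type names `ReplayWindow` and `ReplayGuard` are hypothetical. The `QualityProfile` tuning hook is omitted.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum MediaType { Audio, Video, Data }

impl MediaType {
    /// Default window sizes from the PRD.
    pub fn window_size(self) -> u64 {
        match self {
            MediaType::Audio => 64,
            MediaType::Video => 1024,
            MediaType::Data => 256,
        }
    }
}

/// Sliding anti-replay window over an extended (non-wrapping) sequence number.
pub struct ReplayWindow {
    size: u64,
    highest: Option<u64>,
    seen: Vec<bool>, // ring buffer indexed by seq % size
}

impl ReplayWindow {
    pub fn new(size: u64) -> Self {
        ReplayWindow { size, highest: None, seen: vec![false; size as usize] }
    }

    /// Returns true if `seq` is fresh (and records it); false if it is a
    /// replay or has fallen behind the window.
    pub fn accept(&mut self, seq: u64) -> bool {
        let size = self.size;
        if let Some(h) = self.highest {
            if seq <= h {
                if h - seq >= size {
                    return false; // too old: outside the window
                }
                let slot = &mut self.seen[(seq % size) as usize];
                if *slot {
                    return false; // already seen: replay
                }
                *slot = true;
                return true;
            }
            // New highest: forget slots that belonged to skipped numbers.
            if seq - h >= size {
                self.seen.fill(false);
            } else {
                for skipped in (h + 1)..seq {
                    self.seen[(skipped % size) as usize] = false;
                }
            }
        }
        self.seen[(seq % size) as usize] = true;
        self.highest = Some(seq);
        true
    }
}

/// One window per (stream_id, media_type), created on first use.
#[derive(Default)]
pub struct ReplayGuard {
    windows: HashMap<(u32, MediaType), ReplayWindow>,
}

impl ReplayGuard {
    pub fn accept(&mut self, stream_id: u32, media: MediaType, seq: u64) -> bool {
        self.windows
            .entry((stream_id, media))
            .or_insert_with(|| ReplayWindow::new(media.window_size()))
            .accept(seq)
    }
}
```

With a 1024-packet video window, a packet that arrives 149 positions late during a keyframe burst is still accepted. The same packet would be rejected by a 64-packet window.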
### H4 — W12: `SignalMessage` versioning
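**Current.** Bincode-serialized enum. `#[serde(default, skip_serializing_if)]` is relied on for field additions; variant removals or semantic changes are unsafe. Because bincode is not a self-describing format, the field-addition path should itself be confirmed with a round-trip test against an older build.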
**Action.**
- Every variant gains `version: u8` as its first field.
- Add `SignalMessage::Unknown { version, raw: Bytes }` to absorb future unknown variants gracefully.
- Decode path: unknown variant → log + drop, do not close session.
**Severity.** Future-proofing. Wave 3.
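
The decode behavior above can be illustrated with a hand-rolled decoder. Everything concrete here is an assumption made for the sketch: the `[tag][version][body]` byte layout is not the bincode encoding, the `Ping` and `Hangup` variants are invented, `Vec<u8>` stands in for `Bytes`, and `SessionAction` is a hypothetical return type.

```rust
/// Hypothetical subset of signal messages; variant names and fields are
/// illustrative only.
#[derive(Debug, PartialEq)]
pub enum SignalMessage {
    Ping { version: u8, nonce: u32 },
    Hangup { version: u8 },
    /// Catch-all for tags or versions this build does not understand.
    Unknown { version: u8, raw: Vec<u8> },
}

#[derive(Debug, PartialEq)]
pub enum DecodeError { Truncated }

#[derive(Debug, PartialEq)]
pub enum SessionAction { Continue, Close }

/// Decode a frame laid out as [tag: u8][version: u8][body...].
/// This layout is an assumption for the sketch, not the bincode encoding.
pub fn decode(frame: &[u8]) -> Result<SignalMessage, DecodeError> {
    let (&tag, rest) = frame.split_first().ok_or(DecodeError::Truncated)?;
    let (&version, body) = rest.split_first().ok_or(DecodeError::Truncated)?;
    Ok(match (tag, version) {
        (0, 1) => {
            let b = body.get(..4).ok_or(DecodeError::Truncated)?;
            let nonce = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
            SignalMessage::Ping { version, nonce }
        }
        (1, 1) => SignalMessage::Hangup { version },
        // Unknown tag or unknown version of a known tag: keep the bytes.
        _ => SignalMessage::Unknown { version, raw: body.to_vec() },
    })
}

/// Unknown messages are logged and dropped; the session stays open.
pub fn handle(msg: &SignalMessage) -> SessionAction {
    match msg {
        SignalMessage::Unknown { version, raw } => {
            eprintln!("dropping unknown signal message: version={} len={}", version, raw.len());
            SessionAction::Continue
        }
        SignalMessage::Hangup { .. } => SessionAction::Close,
        SignalMessage::Ping { .. } => SessionAction::Continue,
    }
}
```

Only a frame too short to carry a tag and version is a hard decode error. Anything else that is unrecognized becomes `Unknown` and does not close the session.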
### H5 — W3: `timestamp_ms` rebase documentation
**Current.** Behavior at rekey (every 65,536 packets, ~22 min at 50 pps) is not documented.
**Decision (this PRD).** `timestamp_ms` is **monotonic across rekeys** — it does not reset. Rekey changes only the cryptographic key material; sequence and timestamp are session-scoped, not key-scoped.
**Action.**
- Document in `WZP-SPEC.md` and inline in `packet.rs` doc comments.
- Add a test that performs a rekey mid-session and asserts `timestamp_ms` continuity.
**Severity.** Doc + test. Wave 3.
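
The decision above can be sketched as sender-side state in which a rekey touches only the key epoch. The struct, its field names, the `u64` sequence counter, and the `u32` width of `timestamp_ms` are all assumptions for illustration. The clock is passed in as a parameter so the continuity check is deterministic.

```rust
/// Hypothetical sender-side state; names and field widths are assumptions.
pub struct SendState {
    key_epoch: u32,        // key-scoped: bumped on every rekey
    next_seq: u64,         // session-scoped: never reset
    session_start_ms: u64, // session-scoped: never reset
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stamp {
    pub key_epoch: u32,
    pub seq: u64,
    pub timestamp_ms: u32,
}

impl SendState {
    pub fn new(session_start_ms: u64) -> Self {
        SendState { key_epoch: 0, next_seq: 0, session_start_ms }
    }

    /// Stamp the next outgoing packet using the caller-supplied clock.
    pub fn stamp(&mut self, now_ms: u64) -> Stamp {
        let seq = self.next_seq;
        self.next_seq += 1;
        let elapsed = now_ms.saturating_sub(self.session_start_ms);
        // Truncating cast: a u32 millisecond counter wraps after ~49.7 days.
        Stamp { key_epoch: self.key_epoch, seq, timestamp_ms: elapsed as u32 }
    }

    /// Rekey changes key material only; sequence and timestamp base stay put.
    pub fn rekey(&mut self) {
        self.key_epoch += 1;
    }
}
```

A continuity test then stamps a packet, calls `rekey()`, stamps another, and asserts that `timestamp_ms` and `seq` kept advancing while `key_epoch` changed.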
### H6 — W13: `RoomManager` lock concurrency
**Current.** A single `Mutex<RoomManager>` is acquired per packet by every participant to read the fan-out peer list. This serializes packet processing within a room.
**Problem.** At 1500 pps/sender for video, this is the dominant bottleneck.
**Action.**
- Migrate to `DashMap<RoomId, Arc<RwLock<Room>>>`.
- Per-room `RwLock` allows concurrent reads (fan-out peer list) and exclusive writes (join/leave/quality changes).
- Fan-out path holds read lock; participant churn holds write lock.
- Federation manager updated to match.
**Severity.** Required for video scale. Wave 3.
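
A std-only sketch of the locking scheme described in the action list. To stay self-contained it uses `RwLock<HashMap<..>>` as a stand-in for `DashMap`, so the outer map is still one lock here, held only long enough to clone an `Arc`. The `Room` contents, the ID types, and the method names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub type RoomId = u64;
pub type ParticipantId = u64;

#[derive(Default)]
pub struct Room {
    peers: Vec<ParticipantId>,
}

/// Stand-in for `DashMap<RoomId, Arc<RwLock<Room>>>` using std only.
#[derive(Default)]
pub struct RoomManager {
    rooms: RwLock<HashMap<RoomId, Arc<RwLock<Room>>>>,
}

impl RoomManager {
    /// Look up a room handle; the outer lock is released on return.
    fn room(&self, id: RoomId) -> Option<Arc<RwLock<Room>>> {
        self.rooms.read().unwrap().get(&id).cloned()
    }

    /// Participant churn: takes the per-room write lock.
    pub fn join(&self, id: RoomId, peer: ParticipantId) {
        let room = {
            let mut rooms = self.rooms.write().unwrap();
            rooms.entry(id).or_default().clone()
        };
        let mut guard = room.write().unwrap();
        if !guard.peers.contains(&peer) {
            guard.peers.push(peer);
        }
    }

    /// Participant churn: takes the per-room write lock.
    pub fn leave(&self, id: RoomId, peer: ParticipantId) {
        if let Some(room) = self.room(id) {
            room.write().unwrap().peers.retain(|p| *p != peer);
        }
    }

    /// Fan-out path: per-room read lock only, so concurrent senders in the
    /// same room do not block each other.
    pub fn fan_out_targets(&self, id: RoomId, sender: ParticipantId) -> Vec<ParticipantId> {
        let room = match self.room(id) {
            Some(room) => room,
            None => return Vec::new(),
        };
        let guard = room.read().unwrap();
        guard.peers.iter().copied().filter(|p| *p != sender).collect()
    }
}
```

Cloning the `Arc` before taking the per-room lock keeps the map-level lock out of the per-packet critical section. A sharded map such as `DashMap` would further reduce contention on the lookup itself.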
**Migration safety.**
- Integration test suite (40 + 4 relay tests) must pass.
- Federation tests must pass.
- Trunking tests must pass.
- Property-test: 100-participant room, 500 join/leave events, 10k packets — no panics, no missed forwards.
## Implementation order
| Wave | Item | Task |
|---|---|---|
| 1 | H1 (W5 AEAD binding) | T1.4 |
| 1 | H3 (W11 anti-replay per-stream) | T1.5 |
| 1 | H2 (W2 block_id widening) | folded into PRD #1 |
| 3 | H4 (W12 signal versioning) | T3.3 |
| 3 | H5 (W3 timestamp doc) | T3.2 |
| 3 | H6 (W13 RoomManager lock) | T3.4 |
## Acceptance criteria
- All current tests pass post-hardening.
- New tests: AEAD trailer tampering, rekey timestamp continuity, 100-participant property test, signal forward-compat decode.
- No regression in Prometheus-reported fan-out latency p99 after H6.
## Effort
~4.5 engineer-days total (1.5 in Wave 1, 3 in Wave 3).