T3.1 — Confirm RoomManager concurrency (W13)

Status: Pending Review
Agent: Kimi Code CLI
Started: 2026-05-11T20:55Z
Completed: 2026-05-11T21:05Z
Commit: (see git log)
PRD: ../PRD-protocol-hardening.md

What I changed

  • crates/wzp-relay/src/room.rs (RoomManager concurrency refactor):

    • Changed rooms: DashMap<String, Room> → rooms: DashMap<String, Arc<RwLock<Room>>>.
    • Updated RoomManager::others() — now acquires arc.read() on the room-level RwLock after retrieving the Arc from DashMap. The DashMap shard guard is dropped before cloning senders.
    • Updated RoomManager::observe_quality(), which now acquires arc.write() on the room-level RwLock instead of DashMap::get_mut(). Quality updates no longer hold the DashMap shard write lock, so they contend only with operations on the same room, not with every room that hashes to the same shard.
    • Updated RoomManager::join() / leave() — same pattern: brief DashMap access to get/insert the Arc, then room-level write lock for mutation.
    • Updated room_size(), local_participant_list(), local_senders(), list() — all use arc.read().
  • docs/PROTOCOL-AUDIT.md — Marked W13 as RESOLVED with a one-line explanation of the fix.
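
The locking pattern described in the list above can be sketched as follows. This is a minimal illustration and not the code in room.rs: the fields of Room and ParticipantSender, the room() helper, and the quality() accessor are hypothetical, and the outer map is a std-only RwLock<HashMap<..>> stand-in for the DashMap that the relay uses, so that the sketch builds without external crates.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the relay's per-participant sender handle.
#[derive(Clone, Debug, PartialEq)]
pub struct ParticipantSender {
    pub id: u64,
}

/// Hypothetical room state: senders plus one quality score per participant.
#[derive(Default)]
pub struct Room {
    senders: Vec<ParticipantSender>,
    qualities: HashMap<u64, f32>,
}

/// Outer map from room name to a shared, individually locked room.
/// Assumption: RwLock<HashMap> replaces DashMap to keep the sketch std-only.
#[derive(Default)]
pub struct RoomManager {
    rooms: RwLock<HashMap<String, Arc<RwLock<Room>>>>,
}

impl RoomManager {
    /// Brief outer-map access: clone the Arc, then release the map guard.
    fn room(&self, name: &str) -> Option<Arc<RwLock<Room>>> {
        self.rooms.read().unwrap().get(name).cloned()
    }

    pub fn join(&self, name: &str, id: u64) {
        // Get or insert the Arc; the outer guard is dropped with this block.
        let arc = {
            let mut map = self.rooms.write().unwrap();
            let entry = map.entry(name.to_string()).or_default();
            Arc::clone(&*entry)
        };
        // Mutate under the room-level write lock only.
        arc.write().unwrap().senders.push(ParticipantSender { id });
    }

    /// Hot path: clone every sender except the caller's under a read lock.
    pub fn others(&self, name: &str, me: u64) -> Vec<ParticipantSender> {
        let arc = match self.room(name) {
            Some(arc) => arc,
            None => return Vec::new(),
        };
        let room = arc.read().unwrap();
        room.senders.iter().filter(|s| s.id != me).cloned().collect()
    }

    /// Quality update: room-level write lock, no outer-map write access.
    pub fn observe_quality(&self, name: &str, id: u64, score: f32) {
        if let Some(arc) = self.room(name) {
            arc.write().unwrap().qualities.insert(id, score);
        }
    }

    /// Hypothetical accessor, included only so the update can be observed.
    pub fn quality(&self, name: &str, id: u64) -> Option<f32> {
        let arc = self.room(name)?;
        let room = arc.read().unwrap();
        room.qualities.get(&id).copied()
    }
}
```

In this shape the outer map is only ever held long enough to clone or insert an Arc, so the cost of cloning senders is paid under the lock of a single room.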

Why these choices

The hot path is others(), called once per media packet per participant. Before this change, others() held the DashMap shard read lock while cloning all ParticipantSenders. With many participants, this clone is non-trivial and blocks concurrent join() / leave() / observe_quality() on the same shard.

By wrapping each Room in Arc<std::sync::RwLock<Room>>:

  • others() → DashMap get() (brief) → RwLock::read() (while cloning senders)
  • observe_quality() → DashMap get() (brief) → RwLock::write() (while updating qualities)
  • Concurrent others() calls on the same room share the read lock.
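  • observe_quality() takes the room's write lock, which excludes readers and writers of that room for the duration of the update, but it no longer blocks other rooms in the same DashMap shard.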

std::sync::RwLock is safe here because all critical sections are synchronous (no .await inside the lock).
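
The sharing rules of std::sync::RwLock that this reasoning relies on can be checked with a small probe. The sketch below is illustrative only: the Probe type and the three helper functions are hypothetical names, and the probe uses the non-blocking try_read() and try_write() so that it never deadlocks.

```rust
use std::sync::RwLock;

/// What a second caller would observe while the lock is in some state.
#[derive(Debug, PartialEq)]
pub struct Probe {
    pub read_available: bool,
    pub write_available: bool,
}

/// Try both lock modes without blocking. Each attempt is its own statement
/// so that a successful guard is dropped before the next attempt runs.
pub fn probe<T>(lock: &RwLock<T>) -> Probe {
    let read_available = lock.try_read().is_ok();
    let write_available = lock.try_write().is_ok();
    Probe { read_available, write_available }
}

/// Probe while a read guard is held, as others() would hold one.
pub fn probe_while_reading<T>(lock: &RwLock<T>) -> Probe {
    let _guard = lock.read().unwrap();
    probe(lock)
}

/// Probe while a write guard is held, as observe_quality() would hold one.
pub fn probe_while_writing<T>(lock: &RwLock<T>) -> Probe {
    let _guard = lock.write().unwrap();
    probe(lock)
}
```

While a read guard is held, a further read attempt succeeds and a write attempt does not; while a write guard is held, neither succeeds. The gain for quality updates therefore comes from narrowing the lock to a single room, not from letting readers of that room proceed during the update.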

Deviations from the task spec

None. The task offered two options (RwLock<Vec<Participant>> or ArcSwap<Vec<Participant>>); wrapping the whole Room in Arc<RwLock<Room>> is a superset that addresses the same hot path and also moves qualities updates off the DashMap shard lock, so they contend only within their own room.

Verification output

$ cargo test -p wzp-relay
running 86 tests
...(all 86 pass)...

test result: ok. 86 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
$ cargo test -p wzp-relay --test federation
running 29 tests
...(all 29 pass)...

test result: ok. 29 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s
$ cargo test -p wzp-relay --test handshake_integration
running 5 tests
...(all 5 pass)...

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.02s

Test summary

  • Tests added: 0
  • Tests modified: 0
  • wzp-relay test count: 86 (unchanged)
  • Integration tests: 29 (federation) + 5 (handshake_integration), all pass
  • cargo clippy -p wzp-relay --lib: pass (no new warnings)
  • cargo fmt --all -- --check: pass

Risks / follow-ups

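  • std::sync::RwLock becomes poisoned when a thread panics while holding the write lock. Every later read() or write() on that room then returns a PoisonError, which panics at any call site that unwraps it. In practice, the relay runs a single async task per participant and tokio catches panics in those tasks, but a caught panic still leaves the lock poisoned. If poison tolerance is needed, switch to parking_lot::RwLock (no poisoning), which would add a new dependency.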
  • W13 was the last Mutex-based concern in the media hot path. The remaining contention points (ACL std::sync::Mutex, event broadcast channel) are on cold paths.
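
As an alternative to a new dependency, the poisoning risk above can also be handled with the standard library alone by recovering the guard from the PoisonError. The sketch below is a hedged illustration, not code from the relay: poison_with_partial_write() and read_tolerant() are hypothetical helpers, and the Vec<u32> payload stands in for Room.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Poison the lock: a thread panics after a partial update while it still
/// holds the write guard. The panic message is printed to stderr.
pub fn poison_with_partial_write(lock: &Arc<RwLock<Vec<u32>>>) {
    let shared = Arc::clone(lock);
    let result = thread::spawn(move || {
        let mut guard = shared.write().unwrap();
        guard.push(3);
        panic!("simulated writer panic");
    })
    .join();
    // join() reports the panic as Err; the lock is now poisoned.
    assert!(result.is_err());
}

/// Read the data even if the lock is poisoned, accepting that the last
/// writer may have left it partially updated.
pub fn read_tolerant(lock: &RwLock<Vec<u32>>) -> Vec<u32> {
    let guard = lock
        .read()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.to_vec()
}
```

Recovering with into_inner() keeps the room usable but exposes whatever state the panicking writer left behind, so it suits data that is safe to read mid-update. parking_lot::RwLock makes the same trade implicitly by not tracking poison at all.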

Reviewer checklist (filled in by reviewer)

  • Code matches PRD intent
  • Verification output is real
  • No backward-incompat surprises
  • Tests cover the new behavior
  • Approved