wz-phone

Author	SHA1	Message	Date
Siavash Sameni	aee41a638d	fix(audio+net): revert dual-stack [::]:0, add Oboe playout stall auto-restart Two fixes: ## Revert [::]:0 dual-stack sockets → back to 0.0.0.0:0 Android's IPV6_V6ONLY=1 default on some kernels (confirmed on Nothing Phone) makes [::]:0 IPv6-only, silently killing ALL IPv4 traffic. This broke P2P direct calls: IPv4 LAN candidates (172.16.81.x) couldn't complete QUIC handshakes through the IPv6-only socket, causing local_direct_ok=false and relay fallback on every call after the first. Reverted all bind sites to 0.0.0.0:0 (reliable IPv4). IPv6 host candidates are disabled in local_host_candidates() until a proper dual-socket approach (one IPv4 + one IPv6 endpoint, Phase 7) is implemented. ## Fix A (task #35): Oboe playout callback stall auto-restart The Nothing Phone's Oboe playout callback fires once (cb#0) and then stops draining the ring on ~50% of cold-launch calls. Fix D+C (stop+prime from previous commit) didn't help because audio_stop is a no-op on cold launch. New approach: self-healing watchdog in audio_write_playout. Tracks the playout ring's read_idx across writes. If read_idx hasn't advanced in 50 consecutive writes (~1 second), the Oboe playout callback has stopped: 1. Log "playout STALL detected" 2. Call wzp_oboe_stop() to tear down the stuck streams 3. Clear both ring buffers (prevent stale data reads) 4. Call wzp_oboe_start() to rebuild fresh streams 5. Log success/failure 6. Return 0 (caller retries on next frame) This is the same teardown+rebuild that "rejoin" does — but triggered automatically from the first stalled call instead of requiring the user to hang up and redial. The watchdog runs on every write so it fires within 1s of the stall starting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 11:24:16 +04:00
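The stall check described in this commit can be sketched as a small counter over the ring's read index. This is a minimal illustration, not the code in `audio_write_playout`: the type `PlayoutWatchdog` and its method names are hypothetical, and the threshold of 50 writes is the only value taken from the commit message.

```rust
/// Hypothetical watchdog: reports a stall once the playout ring's read index
/// has stayed unchanged for `threshold` consecutive writes.
pub struct PlayoutWatchdog {
    last_read_idx: Option<usize>,
    stalled_writes: u32,
    threshold: u32,
}

impl PlayoutWatchdog {
    pub fn new(threshold: u32) -> Self {
        Self { last_read_idx: None, stalled_writes: 0, threshold }
    }

    /// Call once per playout write with the ring's current read index.
    /// Returns true when the caller should tear down and rebuild the streams.
    pub fn on_write(&mut self, read_idx: usize) -> bool {
        if self.last_read_idx == Some(read_idx) {
            // The callback has not drained anything since the previous write.
            self.stalled_writes += 1;
        } else {
            // First observation, or the callback advanced the read index.
            self.last_read_idx = Some(read_idx);
            self.stalled_writes = 0;
        }
        if self.stalled_writes >= self.threshold {
            // Reset so the rebuilt streams start with a fresh window.
            self.last_read_idx = None;
            self.stalled_writes = 0;
            return true;
        }
        false
    }
}
```

With one write per 20 ms frame, `PlayoutWatchdog::new(50)` trips after roughly one second without progress, which matches the timing stated in the commit.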
Siavash Sameni	9fb92967eb	fix(net): bind all endpoints to [::]:0 for dual-stack IPv4+IPv6 Every QUIC endpoint was bound to 0.0.0.0:0 (IPv4-only). This silently killed ALL IPv6 host candidates: the Dialer couldn't send packets to [2a0d:...] addresses (wrong address family on the socket), and the Acceptor couldn't receive incoming IPv6 QUIC handshakes. The IPv6 candidates were gathered and advertised in DirectCallOffer/Answer but were completely non-functional. On same-LAN with dual-stack (which both test phones have), this meant: - JoinSet fanned out 3+ candidates (2× IPv6 + 1× IPv4) - IPv6 dials failed silently or timed out - IPv4 dial worked but competed with failed IPv6 for JoinSet attention - Sometimes the JoinSet returned an IPv6 failure before the IPv4 success, causing unnecessary fallback to relay Fix: bind to [::]:0 (IPv6 any) instead of 0.0.0.0:0. On dual-stack systems (Linux/Android default), [::]:0 creates a socket that handles BOTH: - IPv6 natively (global unicast, ULA) - IPv4 via v4-mapped addresses (::ffff:172.16.81.x) One socket, both protocols. All 7 bind sites updated: - register_signal (signal endpoint) - do_register_signal - ping_relay - probe_reflect_addr (fresh endpoint fallback) - dual_path::race (A-role fresh, D-role fresh, relay fresh) With this fix, same-LAN P2P should prefer the IPv6 path (no NAT, direct routing, lower latency) and fall through to IPv4 if IPv6 fails — relay is the last resort after ALL candidates are exhausted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 11:09:06 +04:00
Siavash Sameni	9f2ff6a6ec	fix(android-audio): Fix D+C — stop+prime cycle on every call start Addresses the first-join no-audio regression (tasks #35-37) where the Oboe playout callback fires once (cb#0) and then stops draining the ring on the Nothing Phone, causing written_samples to freeze at 7679 (ring capacity minus one burst). Second call (rejoin) always works because audio_stop tears down the streams and audio_start rebuilds them fresh. Two combined fixes: Fix D (task #37): always call audio_stop() before audio_start() at the top of CallEngine::start. On a cold launch this is a no-op (streams not yet started). On subsequent calls it guarantees a clean teardown before rebuild — the same thing rejoin does. Added a 50ms pause between stop and start to let the Android HAL release the audio session. Fix C (task #36): after audio_start(), immediately write 960 samples (20ms) of silence into the playout ring. This ensures the Oboe playout callback has data to drain on its first invocation. On devices where an empty-ring first callback causes the stream to self-pause (Nothing Phone's Qualcomm HAL), the priming data keeps the callback loop alive until real decoded audio arrives from the recv task. Together these cover the two most likely root causes: 1. Stale Oboe state from a previous audio_start that didn't clean up properly → Fix D forces a clean rebuild 2. Playout callback self-pausing on an empty ring → Fix C ensures the ring is non-empty at callback time Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:50:58 +04:00
Siavash Sameni	134ee3a77f	fix(engine): pass is_direct_p2p explicitly instead of deriving from is_some Critical Phase 6 bug: when the negotiation agreed on relay path but delivered the relay transport via pre_connected_transport, CallEngine saw is_some() = true → is_direct_p2p = true → skipped perform_handshake. The relay couldn't authenticate the participant → room join silently failed → recv_fr: 0, both sides sending into the void. Fix: add explicit is_direct_p2p: bool parameter to CallEngine::start (both android and desktop branches). The connect command sets it from the Phase 6 negotiation result (use_direct), not from whether pre_connected_transport is Some. Now relay-negotiated calls correctly run perform_handshake, and direct P2P calls correctly skip it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:34:21 +04:00
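The difference between the derived and the explicit flag can be shown in a few lines. This is a sketch under assumptions: `Transport` is a placeholder for the real transport type, and both function names are hypothetical; only the `is_some()` derivation and the explicit `is_direct_p2p` flag come from the commit message.

```rust
/// Placeholder for an already-connected QUIC transport.
pub struct Transport;

/// Pre-fix behaviour: any pre-connected transport is assumed to be direct P2P,
/// so the relay handshake is skipped even for a relay transport.
pub fn needs_handshake_derived(pre_connected: &Option<Transport>) -> bool {
    let is_direct_p2p = pre_connected.is_some();
    !is_direct_p2p
}

/// Post-fix behaviour: the caller passes the negotiated path explicitly.
pub fn needs_handshake_explicit(is_direct_p2p: bool) -> bool {
    !is_direct_p2p
}
```

For a relay-negotiated call that arrives with a pre-connected relay transport, the derived version returns `false` (handshake skipped, the bug), while the explicit version called with `false` returns `true`.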
Siavash Sameni	e61397ca85	fix(connect): remove pre-Phase-6 same-IP heuristic The commit `de007ec` added a heuristic that forced relay-only when peers had different public IPs. That was a stopgap for the race condition where one side picked Direct and the other picked Relay. Phase 6 (`f5542ef`) solved this properly via MediaPathReport negotiation, but the heuristic wasn't cleaned up and was still running BEFORE the Phase 6 code — suppressing the race entirely for cross-network calls. Removed. Phase 6 negotiation now handles ALL cases: both sides race, exchange reports, and agree on the same path before committing media. Cross-network calls that can't go P2P will have both sides report direct_ok=false and agree on relay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:23:36 +04:00
Siavash Sameni	f5542ef822	feat(p2p): Phase 6: ICE-style path negotiation Before Phase 6, each side's dual-path race ran independently and committed to whichever transport completed first. When one side picked Direct and the other picked Relay, they sent media to different places: TX > 0, RX: 0 on both, completely silent call. Phase 6 adds a negotiation step: after the local race completes, each side sends a MediaPathReport { call_id, direct_ok, race_winner } to the peer through the relay. Both wait for the other's report before committing a transport to the CallEngine. The decision rule is simple: if BOTH report direct_ok = true, use direct; if EITHER reports false, BOTH use relay. ## Wire protocol New `SignalMessage::MediaPathReport { call_id, direct_ok, race_winner }`. The relay forwards it to the call peer via the same signal_hub routing used for DirectCallOffer/Answer. The cross-relay dispatcher also forwards it. ## dual_path::race restructured Returns `RaceResult` instead of `(Arc<QuinnTransport>, WinningPath)`: - `direct_transport: Option<Arc<QuinnTransport>>` - `relay_transport: Option<Arc<QuinnTransport>>` - `local_winner: WinningPath` Both paths are run as spawned tasks. After the first completes, a 1s grace period lets the loser also finish. The connect command gets BOTH transports (when available) and picks the right one based on the negotiation outcome. The unused transport is dropped. ## connect command flow (revised) 1. Run race() → RaceResult with both transports 2. Send MediaPathReport to relay with our direct_ok 3. Install oneshot; wait for peer's report (3s timeout) 4. Decision: both direct_ok → use direct; else → use relay 5. Start CallEngine with the agreed transport If the peer never responds (old build, timeout), the call falls back to relay, which keeps it backward compatible. ## Relay forwarding MediaPathReport is forwarded like DirectCallOffer/Answer: via signal_hub.send_to(peer_fp) for same-relay calls, and via cross-relay dispatcher for federated calls. 
## Debug log events - `connect:dual_path_race_done` — local race result - `connect:path_report_sent` — our report to the peer - `connect:peer_report_received` — peer's report - `connect:peer_report_timeout` — peer didn't respond (3s) - `connect:path_negotiated` — final agreed path with reasons Full workspace test: 423 passing (no regressions). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:03:42 +04:00
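The Phase 6 decision rule reduces to a pure function of the two `direct_ok` values. The sketch below is illustrative only: `MediaPath` and `negotiate_path` are hypothetical names, and a missing peer report (old build or 3s timeout) is modelled as `None`.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum MediaPath {
    Direct,
    Relay,
}

/// `peer_direct_ok` is `None` when the peer's report never arrived.
pub fn negotiate_path(local_direct_ok: bool, peer_direct_ok: Option<bool>) -> MediaPath {
    match peer_direct_ok {
        // Direct only when BOTH sides report a working direct path.
        Some(true) if local_direct_ok => MediaPath::Direct,
        // Either side failed, or the peer stayed silent: both use relay.
        _ => MediaPath::Relay,
    }
}
```

Because both peers evaluate the same symmetric rule over the same pair of reports, they reach the same answer whenever both reports are delivered, which is what removes the Direct-versus-Relay split.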
Siavash Sameni	de007ec2fd	fix(p2p): skip direct P2P when peers are on different public IPs Race condition: when two phones are on different networks (WiFi vs LTE, home vs office, etc.), each side's dual-path race runs independently. One side may pick Direct while the other picks Relay, causing both to send media to different places: TX > 0, RX: 0 on both sides, completely silent call. Root cause: the dual-path race doesn't have a negotiation step. Each side picks the first transport that completes a QUIC handshake, which may be a different path than the other side picked. On same-LAN this doesn't matter because direct always wins on both (the 500ms relay delay guarantees it). On cross-network, the asymmetry bites. Heuristic fix: compare own_reflex_addr IP to peer_reflex_addr IP. If they're different → different networks → force relay-only (set role = None, which skips the dual-path race entirely). Same public IP means same LAN / same NAT: → LAN host candidates work, direct always wins on both sides → Safe for P2P Different public IPs means cross-network: → Direct may work on one side but not the other → Relay is the safe choice for both This preserves the proven same-LAN P2P and eliminates the broken cross-network case. The full fix is ICE-style path negotiation (Phase 6) where both sides exchange connectivity check results through the signal plane and agree on a winner before committing media, but that's a 500+ line protocol change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:50:56 +04:00
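The same-IP heuristic from this commit (later removed in `e61397ca85`) compares only the IP part of the two reflexive addresses. A minimal sketch, assuming a hypothetical helper `allow_direct_race` and treating a missing address as "do not race":

```rust
use std::net::SocketAddr;

/// Returns true when the dual-path race may run, false to force relay-only.
/// Ports are ignored: two clients behind one NAT share the public IP but
/// are mapped to different external ports.
pub fn allow_direct_race(
    own_reflex: Option<SocketAddr>,
    peer_reflex: Option<SocketAddr>,
) -> bool {
    match (own_reflex, peer_reflex) {
        (Some(own), Some(peer)) => own.ip() == peer.ip(),
        // Assumption: without both addresses there is nothing to compare.
        _ => false,
    }
}
```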
Siavash Sameni	0a973b234b	fix(engine): import tauri::Emitter for AppHandle::emit on Android target	2026-04-12 09:29:56 +04:00
Siavash Sameni	026940d492	fix(federation): diagnostic logging for cross-relay media routing Added warn-level log in handle_datagram when a federation datagram arrives but no matching local room is found. Prints: - room_hash (8-byte tag from the datagram) - active_rooms (all rooms the relay currently has) - seq + peer label This diagnoses the cross-relay recv_fr=0 issue: if media IS arriving from the peer relay but the room hash doesn't match any active room, the log tells us exactly what hash is expected vs what rooms exist locally. If no datagram log fires at all, the issue is upstream (peer relay not forwarding, federation link down, etc.). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:27:34 +04:00
Siavash Sameni	0ccf4ed6b5	feat(call): media health watchdog — warn user when no audio arrives When a P2P direct call establishes successfully but the underlying network path dies (phone switched from WiFi to LTE mid-call, or cross-relay media forwarding isn't working), the call stays up silently with recv_fr frozen at 0. No feedback to the user. New watchdog in the Android recv task: tracks consecutive heartbeat ticks (2s each) where recv_fr hasn't advanced. After 3 ticks (6s) with no new packets, emits: - call-event { kind: "media-degraded" } — user-facing warning banner: "No audio — connection may be lost. Try hanging up and reconnecting, or switch to a different relay." - call-debug media:no_recv_timeout for the debug log If packets resume (recv_fr advances), clears the banner via: - call-event { kind: "media-recovered" } JS listener creates/removes a red-tinted banner dynamically at the top of the call screen. Banner is also cleaned up on showConnectScreen (call end). This covers: - Direct P2P that established on WiFi but died when the phone switched to LTE (stale NAT mapping, unreachable peer) - Cross-relay calls where federation media isn't forwarding (relay not upgraded, not federated, etc.) - Any other "connected but silent" scenario Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:18:38 +04:00
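The watchdog logic in this commit is a two-state machine driven by the 2s heartbeat. The sketch below is an assumption-labelled illustration: `MediaHealth`, `MediaEvent` and `on_tick` are hypothetical names; the three-tick threshold and the degraded/recovered pair come from the commit message.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum MediaEvent {
    Degraded,
    Recovered,
}

pub struct MediaHealth {
    last_recv_fr: u64,
    idle_ticks: u32,
    degraded: bool,
}

impl MediaHealth {
    /// Three heartbeat ticks of 2s each, i.e. 6s without new packets.
    pub const MAX_IDLE_TICKS: u32 = 3;

    pub fn new() -> Self {
        Self { last_recv_fr: 0, idle_ticks: 0, degraded: false }
    }

    /// Call once per heartbeat with the current received-frame counter.
    /// Returns an event only on a state change, so the banner is shown
    /// or cleared exactly once per transition.
    pub fn on_tick(&mut self, recv_fr: u64) -> Option<MediaEvent> {
        if recv_fr > self.last_recv_fr {
            self.last_recv_fr = recv_fr;
            self.idle_ticks = 0;
            if self.degraded {
                self.degraded = false;
                return Some(MediaEvent::Recovered);
            }
            return None;
        }
        self.idle_ticks = self.idle_ticks.saturating_add(1);
        if self.idle_ticks >= Self::MAX_IDLE_TICKS && !self.degraded {
            self.degraded = true;
            return Some(MediaEvent::Degraded);
        }
        None
    }
}
```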
Siavash Sameni	847699bf66	fix(ui): pre-flight ping + cancel button for register Two UX issues when the selected relay is unreachable (e.g. user switched from WiFi to LTE and the LAN relay is gone): 1. Pressing Register blocked the UI for ~30s while the QUIC connect timed out against a dead host. No way to abort. 2. No feedback that the relay was unreachable — just a long wait followed by a cryptic error. Fix: Pre-flight ping: before attempting the full register flow, run `ping_relay` (existing Tauri command, 3s QUIC handshake timeout). If it fails, immediately show "Server unavailable: <error>" and re-enable the Register button. No blocking, no wasted time. If it succeeds, proceed to register_signal. Cancel button: during the register_signal await, the Register button becomes "Cancel". Tapping it calls `deregister` which closes the in-flight transport and makes the connect fail immediately, breaking the await. The button goes back to "Register on Relay" with a "Registration cancelled" message. Flow: [Register] → "Checking..." (disabled, 3s ping) → ping fails → "Server unavailable" (re-enabled) ping ok → "Cancel" (enabled, register in flight) → user taps Cancel → "Registration cancelled" (re-enabled) register succeeds → registered panel shown register fails → error shown (re-enabled) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:13:35 +04:00
Siavash Sameni	6cd61fc63b	feat(federation): Phase 4.1: call-* rooms are implicitly global All rooms with names starting with 'call-' are now treated as global rooms by the federation pipeline. This enables relay-mediated media fallback for cross-relay direct calls: when Alice on Relay A and Bob on Relay B both join the same call-<id> room, the federation media forwarding pipeline (GlobalRoomActive announcements + datagram forwarding + presence replication) kicks in automatically without any runtime registration step. Previously, cross-relay direct calls that couldn't go P2P (symmetric NAT on either side) failed with "no media path" because the call-<id> room wasn't in the configured global_rooms set and media datagrams weren't forwarded across the federation link. The relay's existing ACL for call-* rooms (only the two authorized fingerprints from the call registry can join) prevents random clients from creating or eavesdropping on call rooms. ## Changes ### `is_global_room` (federation.rs) Added `room.starts_with("call-")` check before the static global_rooms set lookup. Returns true immediately for any call-prefixed room. ### `resolve_global_room` (federation.rs) Return type changed from `Option<&str>` to `Option<String>` (owned) because call-* room names aren't stored on `self`; they come from the caller and resolve to themselves as the canonical name. The 13 callers continue to work via String/&str auto-deref; 4 HashMap lookups needed explicit `.as_str()` or `&` borrows. Full workspace test: 423 passing (no regressions). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:55:01 +04:00
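The two federation changes can be sketched together. This is a simplified model, not the contents of federation.rs: the `Federation` struct and its constructor are hypothetical, and the real `resolve_global_room` may do more than a set lookup. Only the `starts_with("call-")` check and the owned `Option<String>` return type are taken from the commit message.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the relay's federation state.
pub struct Federation {
    global_rooms: HashSet<String>,
}

impl Federation {
    pub fn new(rooms: &[&str]) -> Self {
        Self { global_rooms: rooms.iter().map(|r| r.to_string()).collect() }
    }

    /// call-* rooms are global without being configured anywhere.
    pub fn is_global_room(&self, room: &str) -> bool {
        room.starts_with("call-") || self.global_rooms.contains(room)
    }

    /// Returns an owned String because a call-* name is not stored on `self`
    /// and therefore cannot be handed back as a `&str` borrowed from it.
    pub fn resolve_global_room(&self, room: &str) -> Option<String> {
        if room.starts_with("call-") {
            return Some(room.to_owned());
        }
        self.global_rooms.get(room).cloned()
    }
}
```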
Siavash Sameni	50e6a50de4	feat(ui): phone-style layout for direct calls The call screen now shows two different layouts depending on whether the call is a 1:1 direct call or a room/group call: Direct call (directCallPeer set): - Large centered identicon (96px circular with glow) - Peer name (22px bold) + fingerprint (11px mono) - Connection badge: "P2P Direct" (green), "Via Relay" (blue), or "Connecting..." (yellow) — auto-detected from the call-debug buffer's dual_path_race_won event - Room name header shows the peer's alias/fp instead of "general" - Group participant list is hidden Room/group call (directCallPeer null): - Existing group participant list layout — unchanged The badge updates live from pollStatus by scanning the debug buffer for the connect:dual_path_race_won event. If the path was "Direct" → green P2P badge; if "Relay" → blue relay badge. Before the race resolves, shows yellow "Connecting...". directCallView is cleared on showConnectScreen (call end). CSS in style.css: .direct-call-view, .dc-identicon, .dc-name, .dc-fp, .dc-badge with .relay and .connecting modifiers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:47:13 +04:00
Siavash Sameni	0cb8d34b21	fix(ui): show peer identity on direct P2P calls instead of "Waiting for participants" On relay-mediated calls, the relay broadcasts RoomUpdate with the participant list and pollStatus renders it. On direct P2P calls neither peer joins the relay's media room, so RoomUpdate never fires and the UI showed "Waiting for participants..." even though audio was flowing bidirectionally. Fix: track the peer's identity (fingerprint + alias) from the signal plane in a `directCallPeer` variable: - Set on incoming call from the DirectCallOffer (caller_fp + caller_alias) - Set on outgoing call from the Call button click (target_fp) - Cleared on showConnectScreen (call ended) pollStatus now checks: if the engine's participant list is empty AND directCallPeer is set, inject a synthetic participant entry with relay_label = "P2P Direct". The participant row renders with identicon + fingerprint + alias as normal, but grouped under a "P2P Direct" header instead of "This Relay". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:26:17 +04:00
Siavash Sameni	2427630472	fix(connect): make peerLocalAddrs optional + skip handshake on direct P2P Two regressions from Phase 5.5/5.6: 1. Room connect broken: the connect Tauri command required peerLocalAddrs as a Vec<String>, but the room-join JS path doesn't pass it (only the direct-call setup handler does). Error: "invalid args 'peerLocalAddrs' for command 'connect': command connect missing required key peerLocalAddrs". Fix: change to Option<Vec<String>>, unwrap_or_default() at usage sites. Room connect works again with zero peer addrs. 2. Direct P2P call connects but then CallEngine fails with "expected CallAnswer, got Discriminant(0)". Root cause: after the dual-path race picked a direct P2P transport, CallEngine still ran perform_handshake() on it. That handshake is a relay-specific protocol: it sends a CallOffer signal and waits for CallAnswer back. On a direct QUIC connection to a phone, there's nobody running accept_handshake, so the handshake reads garbage from the peer's first media packet and errors. Fix: track is_direct_p2p = pre_connected_transport.is_some() and skip perform_handshake when true. The direct connection is already TLS-encrypted by QUIC, and both peers' identities were verified through the signal channel (DirectCallOffer/Answer carry identity_pub + ephemeral_pub + signature). Both android and desktop branches updated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:09:32 +04:00
Siavash Sameni	16793be36f	fix(p2p): Phase 5.6 — direct-path head start + hangup propagation + media debug events Three fixes from a field-test log where same-LAN calls were still losing the dual-path race to the relay path, peers were getting stuck on an empty call screen when the other side hung up, and 1-way audio was hard to diagnose because the GUI debug log had no media-level events. ## 1. Direct-path 500ms head start (dual_path.rs) The race was resolving in ~105ms with Relay winning even when both phones were on the same MikroTik LAN with valid IPv6 host candidates. Root cause: the relay dial is a plain outbound QUIC connect that completes in whatever the client→relay RTT is (~100ms), while the direct path needs the PEER to also process its CallSetup, spin up its own race, and complete at least one LAN dial back to us. That cross-client sequence reliably takes longer than 100ms, so relay always won. Fix: delay the relay_fut with `tokio::time::sleep(500ms)` before starting its connect. Same-LAN direct dials complete in 30-50ms typically, so the head start gives direct plenty of time to win cleanly. Users on setups where direct genuinely can't work (LTE-to-LTE cross-carrier) pay 500ms extra on the relay fallback, which is invisible for a call setup. ## 2. Hangup propagation via a new hangup_call command (lib.rs + main.ts) The hangup button was calling `disconnect` which stopped the local media engine but never sent a SignalMessage::Hangup to the relay. The peer never got notified and was stuck on the call screen with silent audio. My earlier fix (commit `e75b045`) only handled the RECEIVE side — auto-dismiss call screen on recv:Hangup — but the SEND side was still missing. New Tauri command `hangup_call`: 1. Acquire state.signal.lock(), send SignalMessage::Hangup over the signal transport (best-effort; log + continue if signal is down) 2. 
Acquire state.engine.lock(), stop the CallEngine. The JS hangupBtn click handler now calls hangup_call, with a fallback to raw disconnect if the command is missing (older builds). ## 3. Media debug events (engine.rs + lib.rs) Threaded tauri::AppHandle into CallEngine::start so the send/recv tasks can emit call-debug events when the user has debug logs enabled. Added on the Android branch (desktop branch accepts the arg for API symmetry but doesn't emit yet): - media:first_send, emitted when the first encoded frame is handed to the transport. Useful for 1-way audio diagnosis: if this fires on side A but side B never sees media:first_recv, A's outbound is broken. - media:first_recv, emitted when the first packet from the peer arrives. Mirror of first_send. - media:send_heartbeat, every 2s with frames_sent, last_rms, last_pkt_bytes, short_reads, drops. A stalled last_rms (== 0) tells you the mic isn't producing samples; a frozen frames_sent tells you the encode pipeline hung. - media:recv_heartbeat, every 2s with recv_fr, decoded_frames, last_written, written_samples, decode_errs, codec. Mirror invariants for the inbound direction. All four are gated by `call_debug_logs_enabled()` via `emit_call_debug`, so they only show up in the GUI log when the user has the Call Flow Debug Logs checkbox on. `tracing::info!` still runs unconditionally so logcat (adb) keeps its copy regardless. The `emit_call_debug` fn in lib.rs is now `pub(crate)` so engine.rs can call it via `crate::emit_call_debug`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 07:55:41 +04:00
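The effect of the 500ms head start is easiest to see as arithmetic on completion times. The function below is a timing model, not the async race in dual_path.rs: `Winner` and `race_winner` are hypothetical names, ties are assumed to go to the direct path, and the 150ms direct figure used in the note afterwards is an illustrative value for "longer than 100ms".

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Winner {
    Direct,
    Relay,
}

/// `direct_ready_ms` is when the direct path finishes, measured from the
/// start of the race, or `None` if it can never complete.
/// The relay dial starts only after `relay_head_start_ms`.
pub fn race_winner(
    direct_ready_ms: Option<u64>,
    relay_rtt_ms: u64,
    relay_head_start_ms: u64,
) -> Winner {
    let relay_done_ms = relay_head_start_ms.saturating_add(relay_rtt_ms);
    match direct_ready_ms {
        Some(direct) if direct <= relay_done_ms => Winner::Direct,
        _ => Winner::Relay,
    }
}
```

With a 100ms relay RTT and a direct path that needs 150ms, `race_winner(Some(150), 100, 0)` yields `Relay`, while `race_winner(Some(150), 100, 500)` yields `Direct`. When direct cannot work at all, the relay still wins, only 500ms later.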
Siavash Sameni	fa038df057	feat(p2p): Phase 5.5 — ICE LAN host candidates (IPv4 + IPv6) Same-LAN P2P was failing because MikroTik masquerade (like most consumer NATs) doesn't support NAT hairpinning — the advertised WAN reflex addr is unreachable from a peer on the same LAN as the advertiser. Phase 5 got us Cone NAT classification and fixed the measurement artifact, but same-LAN direct dials still had nowhere to land. Phase 5.5 adds ICE-style host candidates: each client enumerates its LAN-local network interface addresses, includes them in the DirectCallOffer/Answer alongside the reflex addr, and the dual-path race fans out to ALL peer candidates in parallel. Same-LAN peers find each other via their RFC1918 IPv4 + ULA / global-unicast IPv6 addresses without touching the NAT at all. Dual-stack IPv6 is in scope from the start — on modern ISPs (including Starlink) the v6 path often works even when v4 hairpinning doesn't, because there's no NAT on the v6 side. ## Changes ### `wzp_client::reflect::local_host_candidates(port)` (new) Enumerates network interfaces via `if-addrs` and returns SocketAddrs paired with the caller's port. Filters: - IPv4: RFC1918 (10/8, 172.16/12, 192.168/16) + CGNAT (100.64/10) - IPv6: global unicast (2000::/3) + ULA (fc00::/7) - Skipped: loopback, link-local (169.254, fe80::), public v4 (already covered by reflex-addr), unspecified Safe from any thread, one `getifaddrs(3)` syscall. ### Wire protocol (wzp-proto/packet.rs) Three new `#[serde(default, skip_serializing_if = "Vec::is_empty")]` fields, backward-compat with pre-5.5 clients/relays by construction: - `DirectCallOffer.caller_local_addrs: Vec<String>` - `DirectCallAnswer.callee_local_addrs: Vec<String>` - `CallSetup.peer_local_addrs: Vec<String>` ### Call registry (wzp-relay/call_registry.rs) `DirectCall` gains `caller_local_addrs` + `callee_local_addrs` Vec<String> fields. New `set_caller_local_addrs` / `set_callee_local_addrs` setters. Follow the same pattern as the reflex addr fields. 
### Relay cross-wiring (wzp-relay/main.rs) Both the local-call and cross-relay-federation paths now track the local_addrs through the registry and inject them into the CallSetup's peer_local_addrs. Cross-wiring is identical to the existing peer_direct_addr logic — each party's CallSetup carries the OTHER party's LAN candidates. ### Client side (desktop/src-tauri/lib.rs) - `place_call`: gathers local host candidates via `local_host_candidates(signal_endpoint.local_addr().port())` and includes them in `DirectCallOffer.caller_local_addrs`. The port match is critical — it's the Phase 5 shared signal socket, so incoming dials to these addrs land on the same endpoint that's already listening. - `answer_call`: same, AcceptTrusted only (privacy mode keeps LAN addrs hidden too, for consistency with the reflex addr). - `connect` Tauri command: new `peer_local_addrs: Vec<String>` arg. Builds a `PeerCandidates` bundle and passes it to the dual-path race. - Recv loop's CallSetup handler: destructures + forwards the new field to JS via the signal-event payload. ### `dual_path::race` (wzp-client/dual_path.rs) Signature change: takes `PeerCandidates` (reflex + local Vec) instead of a single SocketAddr. The D-role branch now fans out N parallel dials via `tokio::task::JoinSet` — one per candidate — and the first successful dial wins (losers are aborted immediately via `set.abort_all()`). Only when ALL candidates have failed do we return Err; individual candidate failures are just traced at debug level and the race waits for the others. LAN host candidates are tried BEFORE the reflex addr in `PeerCandidates::dial_order()` — they're faster when they work, and the reflex addr is the fallback for the not-on-same-LAN case. ### JS side (desktop/main.ts) `connect` invoke now passes `peerLocalAddrs: data.peer_local_addrs ?? []` alongside the existing `peerDirectAddr`. 
### Tests All existing test callsites updated for the new Vec<String> fields (defaults to Vec::new() in tests — they don't exercise the multi-candidate path). `dual_path.rs` integration tests wrap the single `dead_peer` / `acceptor_listen_addr` in a `PeerCandidates { reflexive: Some(_), local: Vec::new() }`. Full workspace test: 423 passing (same as before 5.5). ## Expected behavior on the reporter's setup Two phones behind MikroTik, both on the same LAN: place_call:host_candidates {"local_addrs": ["192.168.88.21:XXX", "2001:...:YY:XXX"]} recv:DirectCallAnswer {"callee_local_addrs": ["192.168.88.22:ZZZ", "2001:...:WW:ZZZ"]} recv:CallSetup {"peer_direct_addr":"150.228.49.65:NN", "peer_local_addrs":["192.168.88.22:ZZZ","2001:...:WW:ZZZ"]} connect:dual_path_race_start {"peer_reflex":"...","peer_local":[...]} dual_path: direct dial succeeded on candidate 0 ← LAN v4 wins connect:dual_path_race_won {"path":"Direct"} Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 07:34:49 +04:00
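The address filter listed for `local_host_candidates` can be sketched with the standard library alone. This is an illustration of the stated ranges, not the function itself: interface enumeration via `if-addrs` is left out, and `is_host_candidate` with its two helpers are hypothetical names.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// IPv4: RFC1918 private ranges plus CGNAT (100.64.0.0/10).
fn is_v4_candidate(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    let cgnat = o[0] == 100 && (o[1] & 0xC0) == 64;
    ip.is_private() || cgnat
}

/// IPv6: global unicast (2000::/3) plus ULA (fc00::/7).
fn is_v6_candidate(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    let global_unicast = (first & 0xE000) == 0x2000;
    let ula = (first & 0xFE00) == 0xFC00;
    global_unicast || ula
}

/// Loopback, link-local, unspecified and public IPv4 addresses match none
/// of the ranges above and are therefore skipped.
pub fn is_host_candidate(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_v4_candidate(v4),
        IpAddr::V6(v6) => is_v6_candidate(v6),
    }
}
```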
Siavash Sameni	8990514417	fix(call): default Accept to AcceptTrusted + add log Copy/Share buttons ## Accept button regression — diagnosed from a user log Field report: incoming call → callee taps Accept → debug log shows the dual-path race being skipped with `connect:dual_path_skipped {"has_own":false,"has_peer":true, "role":"None"}` and the call falling to relay-only on the callee side. Root cause: the Accept button was calling `answer_call` with `mode: 2` which falls through to `AcceptGeneric` (privacy mode). By design, privacy mode SKIPS the reflex query on the callee so the callee's IP stays hidden from the caller — but the side effect is that `own_reflex_addr` never gets cached in `SignalState`. When `connect` runs a moment later, it sees `own_reflex_addr = None`, can't compute the deterministic role for the dual-path race, and falls back to relay. For a normal VoIP app where P2P is the desired default, the right behavior is `AcceptTrusted` — which queries reflect, advertises the callee's addr in the answer, and enables direct P2P. Privacy mode can come back as a dedicated second button if anyone actually needs it. Changed `acceptCallBtn` click handler from `mode: 2` to `mode: 1`. The next call from a Phase-5 APK should show `connect:dual_path_race_start` + `connect:dual_path_race_won {"path":"Direct"}` on a cone-NAT-to-cone-NAT pair. ## Debug log export — new Copy / Share buttons Field-testing the GUI debug log required me to keep asking the user to type out what they saw. Added two new buttons next to Clear: - Copy log — serialises the rolling buffer as plain text (same HH:MM:SS.mmm format the on-screen panel uses) and writes to `navigator.clipboard`. Falls back to the old selection-based `execCommand("copy")` for WebViews that refuse the new API without a permission prompt. - Share — tries the Web Share API (`navigator.share(...)`) first. On Android WebView this opens the system share sheet so the user can send the text straight to a messaging app. 
Falls back to clipboard copy on WebViews that don't expose navigator.share (most desktop ones). Also falls back if the user cancels the share sheet. Flash status line below the buttons shows a 2.5s confirmation ("✓ Copied 47 entries") or an error hint. The log is plain text so anyone can paste a log fragment into a message and send it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 07:04:46 +04:00
Siavash Sameni	1618ff6c9d	feat(p2p): Phase 5 — single-socket architecture (Nebula-style) Before Phase 5 WarzonePhone used THREE separate UDP sockets per client: 1. Signal endpoint (register_signal, client-only) 2. Reflect probe endpoints (one fresh socket per relay probe) 3. Dual-path race endpoint (fresh per call setup) This broke two things in production on port-preserving NATs (MikroTik masquerade, most consumer routers): a. Phase 2 NAT detection was WRONG. Each probe used a fresh internal port, so MikroTik mapped each one to a different external port, and the classifier saw "different port per relay" and labeled it SymmetricPort. The real NAT was cone-like but measurement via fresh sockets hid that. b. Phase 3.5 dual-path P2P race was BROKEN. The reflex addr we advertised in DirectCallOffer was observed by the signal endpoint's socket. The actual dual-path race listened on a DIFFERENT fresh socket, on a different internal (and therefore external) port. Peers dialed the advertised addr and hit MikroTik's mapping for the signal socket, which forwarded to the signal endpoint — a client-only endpoint that doesn't accept incoming connections. Direct path silently failed, relay always won the race. Nebula-style fix: one socket for everything. The signal endpoint is now dual-purpose (client + server_config), and both the reflect probes and the dual-path race reuse it instead of creating fresh ones. MikroTik's port-preservation then gives us a stable external port across all flows → classifier correctly sees Cone NAT → advertised reflex addr is the actual listening port → direct dials from peers land on the right socket → `endpoint.accept()` in the A-role branch of the dual-path race picks up the incoming connection. ## Changes ### `register_signal` (desktop/src-tauri/src/lib.rs) - Endpoint now created with `Some(server_config())` instead of `None`. The socket can now accept incoming QUIC connections as well as dial outbound. 
- Every code path that previously read `sig.endpoint` for the relay-dial reuse benefits automatically — same socket is now ALSO listening for peer dials. ### `probe_reflect_addr` (wzp-client/src/reflect.rs) - New `existing_endpoint: Option<Endpoint>` arg. `Some` reuses the caller's socket (production: pass the signal endpoint). `None` creates a fresh one (tests + pre-registration). - Removed the `drop(endpoint)` at the end — was correct for fresh endpoints (explicit early socket close) but incorrect for shared ones. End-of-scope drop does the right thing in both cases via Arc semantics. ### `detect_nat_type` (wzp-client/src/reflect.rs) - New `shared_endpoint: Option<Endpoint>` arg, forwarded to every probe in the JoinSet fan-out. One shared socket means the classifier sees the true NAT type. ### `detect_nat_type` Tauri command (desktop/src-tauri/src/lib.rs) - Reads `state.signal.endpoint` and passes it as the shared endpoint. Falls back to None when not registered. NAT detection now produces accurate classifications against MikroTik / most consumer NATs. ### `dual_path::race` (wzp-client/src/dual_path.rs) - New `shared_endpoint: Option<Endpoint>` arg. - A-role: when `Some`, reuses it for `accept()`. This is the critical change — the reflex addr advertised to peers is now the address listening for incoming direct dials. - D-role: when `Some`, reuses it for the outbound direct dial. MikroTik keeps the same external port for the dial as for the signal flow → direct dial through a cone-mapped NAT. - Relay path: also reuses the shared endpoint so MikroTik has a single consistent mapping across the whole call (saves one extra external port and makes firewall traces cleaner). - When `None`, falls back to fresh per-role endpoints as before. ### `connect` Tauri command (desktop/src-tauri/src/lib.rs) - Reads `state.signal.endpoint` once when acquiring own reflex addr and passes it through to `dual_path::race`. 
### Tests - `wzp-client/tests/dual_path.rs` and `wzp-relay/tests/multi_reflect.rs` updated to pass `None` for the new endpoint arg — tests use fresh sockets and that's fine because the loopback harness doesn't care about port-preserving NAT behavior. Full workspace test: 423 passing (no regressions). ## Expected behavior after this commit on real hardware Behind MikroTik + Starlink-bypass (the reporter's setup): - Phase 2 NAT detect → Cone NAT (was SymmetricPort — false positive from the measurement artifact) - Phase 3.5 direct-P2P dial → succeeds for both cone-cone and cone-CGNAT cases where the remote side was previously blocked by our own socket mismatch - LTE ↔ LTE cross-carrier → still likely relay fallback; that's genuinely strict symmetric and needs Phase 5.5 port prediction. ## Phase 5.5 (next, separate PRD) Multi-candidate port prediction + ICE-style candidate aggregation for truly strict symmetric NATs. Not needed for the 95% case — Phase 5 alone fixes most consumer-router setups. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 19:47:20 +04:00
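The `Option<Endpoint>` pattern described in the Phase 5 commit above (reuse the caller's socket when `Some`, create a fresh one when `None`) can be sketched with plain `std::net::UdpSocket` handles. This is only an illustration under stated assumptions: `socket_for_flow` is a hypothetical name, and the real code passes a QUIC endpoint rather than a raw UDP socket.

```rust
use std::io;
use std::net::UdpSocket;
use std::sync::Arc;

/// Hypothetical helper: return the socket a flow (reflect probe, direct dial,
/// relay dial) should use. `Some` reuses the shared socket, so a
/// port-preserving NAT sees one internal port for every flow. `None` binds a
/// fresh ephemeral socket, as the tests and pre-registration paths do.
pub fn socket_for_flow(shared: Option<&Arc<UdpSocket>>) -> io::Result<Arc<UdpSocket>> {
    match shared {
        // Cloning the Arc shares the same underlying socket; it is closed
        // only when the last handle goes out of scope.
        Some(sock) => Ok(Arc::clone(sock)),
        // Fresh socket on an OS-chosen port (IPv4 any, port 0).
        None => Ok(Arc::new(UdpSocket::bind("0.0.0.0:0")?)),
    }
}
```

Because every caller holding `Some` sees the same local port, a reflex address observed on one flow is also the address the other flows listen and dial from, which is the property the commit relies on.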
Siavash Sameni	05ec926317	fix(ui): don't nuke the registered panel's children on status update Regression from `20375ec`: the `signal-event reconnecting` and `signal-event registered` handlers were assigning to `directRegistered.textContent`, which is the PARENT element that holds the entire registered UI — the "Registered — waiting" header, incoming-call panel, recent-contacts section, call history, the fingerprint-input bar, and the Call button. Setting textContent on that parent wiped every child with a single text node, so after registration the user saw "✅ Registered" with NOTHING below it — no call input, no history, no call button. App unusable post-registration. Fix: - Add a dedicated `#registered-status` <p> inside the header of `#direct-registered` (this element already existed as a plain paragraph without an id; just giving it an id). - Rewrite both handlers to target that element by id instead of the parent, so `textContent =` only touches the status line and leaves the rest of the panel intact. - The `registered` handler now also explicitly `registerBtn.classList.add("hidden")` and `directRegistered.classList.remove("hidden")` so the first register event correctly reveals the UI. Belt-and-braces for the transparent-reconnect case too — if the supervisor re-registers after a drop, the UI stays in the registered state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 19:28:16 +04:00
Siavash Sameni	b7a48bf13b	feat(ui): incoming-call ring tone + system notification Previously: incoming calls silently popped an "Accept/Reject" panel. Easy to miss — no audible cue, no system-level alert if the app was backgrounded. Now the incoming-call path triggers both a synthesized ring tone and a system notification banner. ## Ring tone (desktop/src/main.ts) New `Ringer` class using Web Audio API directly — no external asset files, no new npm dep. Synthesizes a classic NANP two-tone cadence (440Hz + 480Hz sine mix, 2s tone + 4s silence, looped) through an envelope-gated gain node that ramps on/off to avoid clicks. Audible on every Tauri-supported platform because WebView carries Web Audio. - `start()` — lazily creates AudioContext on first use (platforms that require a user gesture for AudioContext creation still work because the incoming-call event is user-adjacent from the webview's perspective), starts setInterval(6000) loop. - `stop()` — clears the timer AND disconnects any active oscillators so there's no tail audio. - Active-nodes array is swept every cycle so it doesn't grow unbounded across long rings. Hooked into signal-event handlers: - `"incoming"` → `ringer.start()` + notifyIncomingCall - `"answered"`, `"setup"`, `"hangup"` → `ringer.stop()` - Accept/Reject button click handlers → `ringer.stop()` as the first thing they do (before any await) ## System notification (desktop/src-tauri + main.ts) Added `tauri-plugin-notification = "2"` to the Tauri app and registered in the builder. Capabilities updated with the four notification permissions. Frontend calls the plugin commands via the generic `invoke` instead of adding `@tauri-apps/plugin-notification` as a JS dep — Tauri plugins expose `plugin:notification\|notify` etc. directly. Flow: 1. `is_permission_granted` — check cached 2. If not granted → `request_permission` (Android prompts the user once, cached thereafter) 3. 
`notify` with title="Incoming call", body="From <alias>" All wrapped in try/catch with console.debug fallback — plugin missing or permission denied is non-fatal, the visible panel + ring tone still alert the user. ## Known gaps (deferred) - Android native system ringtone (RingtoneManager) + full-screen intent for lockscreen-visible ringer. Requires platform-specific Java/Kotlin glue in the Tauri Android shell — bigger lift. - Desktop window flash / taskbar attention-seek on incoming call when app is backgrounded. - Vibration pattern on Android. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 18:46:13 +04:00
Siavash Sameni	e75b045470	fix(ui): auto-dismiss call screen when peer hangs up Previously: peer hangs up → Rust emits signal-event {type:hangup} → JS clears callStatusText + hides incoming panel, but the call screen stays on with a dangling Hangup button the user has to press to acknowledge a call that's already over. Dead UX. Now: the hangup event handler tears down our side of the media engine via `invoke("disconnect")` and transitions back to the connect screen when we're currently in the call screen. Incoming-call panel still hides as before. `userDisconnected = true` is set so the existing call-event "disconnected" auto-reconnect path (which fires on transport drop) doesn't kick in — the peer-hangup signal is an intentional end-of-call, not a transport blip worth retrying. Also documented: "not connected" errors from the `disconnect` command are silently swallowed because they happen when there's no engine to tear down (e.g. incoming call that was never answered — caller bailed), which is the correct outcome there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 18:41:26 +04:00
Siavash Sameni	20375eceb9	feat(signal): transparent reconnect + auto-swap on relay change Two related UX fixes, same state-machine surface: 1. Relay drops / goes offline / restarts: the client now auto- reconnects in the background instead of silently falling to "not registered" and requiring the user to tap Deregister + Register. 2. User switches relay in settings: client auto-swaps — close old transport, register against new, all transparent. ## Signal state additions (desktop/src-tauri/src/lib.rs) - `SignalState.desired_relay_addr: Option<String>` — what the user CURRENTLY wants. `Some(x)` means "keep me connected to x", `None` means "user explicitly asked for idle". This is the pivot that distinguishes "connection dropped, retry" from "user deregistered, stop". - `SignalState.reconnect_in_progress: bool` — single-flight guard so concurrent triggers (recv-loop exit + manual register_signal + another recv-loop exit after a brief success) don't spawn duplicate supervisors. ## Refactor The old `register_signal` Tauri command was doing the whole connect + Register + spawn-recv-loop flow inline. Split into: - `internal_deregister(signal_state, keep_desired)` — shared teardown helper that nulls out transport/endpoint/call state and optionally clears `desired_relay_addr`. - `do_register_signal(signal_state, app, relay)` — core connect + register + spawn-recv-loop flow, callable from both the Tauri command and the reconnect supervisor. Returns an explicit `impl Future<...> + Send` to avoid auto-trait inference bailing inside the tokio::spawn chain (rustc loses the Send trail through the recv-loop spawn inside the fn body). - `register_signal` Tauri command — now thin: if already registered to the same relay, no-op; otherwise internal_deregister(keep_desired=false), set desired_relay_addr = Some(new), call do_register_signal. The Rust side handles the "change of server" transition entirely on its own, no deregister+register dance from JS needed. 
- `deregister` Tauri command — internal_deregister(keep_desired = false) so the recv-loop exit path sees the cleared desired addr and does NOT spawn a supervisor. ## Reconnect supervisor New `signal_reconnect_supervisor(signal_state, app, relay)` task. Spawned from the recv-loop exit path when the loop exits unexpectedly AND `desired_relay_addr.is_some()` AND no supervisor is already running. - Exponential backoff: 1s, 2s, 4s, 8s, 15s, 30s (capped at 30s, never gives up). First attempt is immediate (attempt 0 skips the wait). - On each iteration checks whether `desired_relay_addr` was cleared (user deregistered mid-flight) or another path already re-registered; either short-circuits the supervisor. - Also detects if the user changed relays while the supervisor was sleeping — resets the backoff counter and retries against the new addr. - On success, exits so the newly-spawned recv loop owns the connection from that point. If THAT drops again, a fresh supervisor spawns. - Emits `call-debug-log` and `signal-event` events at every state transition so the GUI can display "reconnecting...", "registered" banners. ## UI wiring (desktop/src/main.ts) - signal-event handler gets two new cases: - `"reconnecting"` — amber "🔄 reconnecting to <relay>…" in the registered banner area - `"registered"` — green "✓ registered (<fp prefix>…)" to clear the reconnecting badge - Relay-selection click handler checks if a signal is currently registered and, if the user picked a different relay, fires `register_signal` with the new address. Rust side handles the swap transparently. Full workspace test: 423 passing (no regressions). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 18:40:11 +04:00
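The supervisor's backoff schedule described above (1s, 2s, 4s, 8s, 15s, 30s, capped at 30s, with attempt 0 skipping the wait) can be expressed as a small lookup. This is a sketch, not the project's code: `reconnect_delay` and `BACKOFF_SECS` are hypothetical names, and the 0-based attempt numbering is an assumption.

```rust
use std::time::Duration;

/// Schedule taken from the commit message; the last entry is the cap.
const BACKOFF_SECS: [u64; 6] = [1, 2, 4, 8, 15, 30];

/// Delay to sleep before reconnect attempt `attempt` (0-based, assumed).
/// Attempt 0 is immediate; later attempts walk the table and then stay
/// at the final 30s entry forever, so the supervisor never gives up.
pub fn reconnect_delay(attempt: u32) -> Duration {
    if attempt == 0 {
        return Duration::ZERO;
    }
    let idx = (attempt as usize - 1).min(BACKOFF_SECS.len() - 1);
    Duration::from_secs(BACKOFF_SECS[idx])
}
```

A table is used instead of doubling because the listed schedule is not a pure power of two (15s rather than 16s). Resetting the backoff after a relay change, as the commit describes, would amount to setting the attempt counter back to 0.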
Siavash Sameni	00deb97a5d	fix(reflect): drop LAN/private reflex addrs from NAT classification Real-world report: a user with one LAN relay + one internet relay got "Multiple IPs — treating as symmetric" because the LAN relay saw the client's LAN IP (172.16.81.172) while the internet relay saw the WAN IP (150.228.49.65). Two observations of "different public IPs" from the classifier's perspective, but semantically they describe two different network paths and shouldn't be compared. The LAN relay's reflection is always true, just not useful for public NAT classification: there's no NAT between the client and the LAN relay, so that path's reflex addr is always the LAN interface IP regardless of what the public-facing NAT beyond it looks like. Fix: new `is_private_or_loopback` helper filters the probe set before classification. Drops: - 127.0.0.0/8 loopback - 10/8, 172.16/12, 192.168/16 RFC1918 private - 169.254/16 link-local - 100.64/10 CGNAT shared-transition (same reasoning: a relay that sees the client with a CGNAT addr is on the same carrier network and can't describe public NAT state) - IPv6 loopback, unspecified, fe80::/10 link-local Failed probes still filtered out of classification (they were already) but now dimmed in the UI list instead of highlighted amber. Same rationale: a momentarily-offline probe target isn't a warning-worthy state, it's just a fact about the probe run. UI palette rebalance: only Cone gets green, everything else neutral text-dim. Wording changed from warning-tone "⚠ must use relay" to informational "ℹ P2P falls back to relay, calls still work" — symmetric NAT isn't broken state, it just means media takes the relay path. 
Tests added (4 new in wzp_client::reflect): - classify_drops_private_ip_probes — LAN + public → Unknown - classify_drops_loopback_probes — loopback + 2 public → Cone - classify_drops_cgnat_probes — CGNAT + 2 public same-IP-diff-port → SymmetricPort - classify_two_lan_probes_is_unknown_not_cone — all LAN → Unknown Existing multi_reflect integration test updated: two loopback relays now correctly classify as Unknown (because loopback reflex addrs are filtered) with the plumbing-works invariant preserved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 18:29:09 +04:00
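The filter ranges listed in the commit above can be checked with the standard library alone. The following is a minimal sketch of what an `is_private_or_loopback` helper might look like; the signature (taking an `IpAddr` by value) is an assumption, and the real helper in `wzp-client/src/reflect.rs` may differ.

```rust
use std::net::IpAddr;

/// Returns true for reflex addresses that say nothing about the public NAT:
/// loopback, RFC1918 private, link-local, CGNAT shared space, and the IPv6
/// loopback / unspecified / fe80::/10 ranges named in the commit message.
pub fn is_private_or_loopback(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            let o = v4.octets();
            v4.is_loopback()            // 127.0.0.0/8
                || v4.is_private()      // 10/8, 172.16/12, 192.168/16
                || v4.is_link_local()   // 169.254/16
                || (o[0] == 100 && (o[1] & 0xC0) == 64) // 100.64/10 CGNAT
        }
        IpAddr::V6(v6) => {
            v6.is_loopback()                             // ::1
                || v6.is_unspecified()                   // ::
                || (v6.segments()[0] & 0xffc0) == 0xfe80 // fe80::/10
        }
    }
}
```

With the addresses from the bug report, `172.16.81.172` would be dropped and `150.228.49.65` kept, leaving a single public observation, which the classifier treats as Unknown.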
Siavash Sameni	da08723fe7	fix(signal): forward-compat — log+continue on unknown SignalMessage variants Both sides of the signal channel previously broke their recv loop on any deserialize error, which meant adding a new variant in one build silently killed signal connections from peers running an older build. This bit us during Phase 1 testing: a new client sending SignalMessage::Reflect to a pre-Phase-1 relay caused the relay to drop the whole signal connection, which looked like "Error: not registered" on the next place_call. Fix: - New TransportError::Deserialize(String) variant in wzp-proto carries serde errors as a distinct category. - wzp-transport/reliable.rs::recv_signal returns Deserialize on serde_json::from_slice failures (was wrapped in Internal). - wzp-relay/main.rs signal loop matches on Deserialize → warn + continue (instead of break). - desktop/src-tauri/lib.rs recv loop does the same. Other TransportError variants (ConnectionLost, Io, Internal) still break the loop — only pure parse failures are recoverable. This means future SignalMessage variant additions are backward-compat by construction: older peers will see "unknown variant, continuing" in their logs while newer peers can keep evolving the protocol. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 18:13:31 +04:00
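The "warn + continue on Deserialize, break on everything else" rule from the commit above can be sketched as follows. The `TransportError` enum here is a stand-in written for this example (the payload types of `Io` and `Internal` are assumptions), and `run_recv_loop` is a hypothetical driver that replays a fixed list of receive results instead of reading a real transport.

```rust
/// Stand-in for the wzp-proto error type; only the variant names come from
/// the commit message, the payloads are assumed.
#[derive(Debug)]
pub enum TransportError {
    ConnectionLost,
    Io(String),
    Internal(String),
    Deserialize(String),
}

/// Replays receive results and returns (messages handled, parse errors skipped).
pub fn run_recv_loop(events: Vec<Result<&str, TransportError>>) -> (usize, usize) {
    let (mut handled, mut skipped) = (0, 0);
    for ev in events {
        match ev {
            Ok(_msg) => handled += 1,
            // A pure parse failure (e.g. an unknown variant from a newer
            // peer) is recoverable: note it and keep the connection alive.
            Err(TransportError::Deserialize(_why)) => {
                skipped += 1;
                continue;
            }
            // ConnectionLost, Io and Internal still end the loop.
            Err(_) => break,
        }
    }
    (handled, skipped)
}
```

In this shape, an older peer that receives a message it cannot parse keeps serving the messages it does understand, which is the forward-compatibility property the commit is after.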
Siavash Sameni	8cdf8d486a	feat(p2p): Phase 4 cross-relay direct calling over federation Teaches the relay pair to route direct-call signaling across an existing federation link. Alice on Relay A can now place a direct call to Bob on Relay B if A and B are federation peers — the wire protocol, call registry, and signal dispatch all learn to track and route the cross-relay flow. Phase 3.5's dual-path QUIC race then carries the media directly peer-to-peer using the advertised reflex addrs, with zero changes needed on the client side. ## Wire protocol (wzp-proto) New `SignalMessage::FederatedSignalForward { inner, origin_relay_fp }` envelope variant, appended at end of enum — JSON serde is name-tagged so pre-Phase-4 relays just log "unknown variant" and drop it. 2 new roundtrip tests (any-inner nesting + single DirectCallOffer case). ## Call registry (wzp-relay) `DirectCall.peer_relay_fp: Option<String>` — federation TLS fp of the peer relay that forwarded the offer/answer for this call. `None` on local calls, `Some` on cross-relay. Used by the answer path to route the reply back through the same federation link instead of trying (and failing) to deliver via local signal_hub. New `set_peer_relay_fp` setter + 1 new unit test. ## FederationManager (wzp-relay) Three new methods: - `local_tls_fp()` — exposes the relay's own federation TLS fp so main.rs can build `origin_relay_fp` fields. - `broadcast_signal(msg) -> usize` — fan out any signal message (in practice `FederatedSignalForward`) to every active peer link, returning the reach count. Used when Relay A doesn't know which peer has the target fingerprint. - `send_signal_to_peer(fp, msg)` — targeted send for the reply path where the registry already knows which peer relay to hit. Plus a new `cross_relay_signal_tx: Mutex<Option<Sender<...>>>` field that `set_cross_relay_tx()` wires at startup so the federation `handle_signal` can push unwrapped inner messages into the main signal dispatcher. 
## Federation handle_signal (wzp-relay) New match arm for `FederatedSignalForward`: - Loop prevention: drops forwards whose `origin_relay_fp` equals this relay's own fp (prevents A→B→A echo loops without needing TTL yet). - Otherwise pulls the inner message out and pushes it through `cross_relay_signal_tx` so the main loop's dispatcher task handles it as if it had arrived locally. ## Main signal loop (wzp-relay) ### DirectCallOffer when target not local Before falling through to Hangup, try the federation path: - Wrap the offer in `FederatedSignalForward` with `origin_relay_fp = this relay's tls_fp` - `fm.broadcast_signal(forward)` — returns peer count - If any peers reached, stash the call in local registry with `caller_reflexive_addr` set, `peer_relay_fp` still None (broadcast — the answer-side will identify itself when it replies) - Send `CallRinging` to caller immediately for UX feedback - Only if no federation or no peers → legacy Hangup path ### DirectCallAnswer when peer is remote - Registry lookup now reads both `peer_fingerprint` and `peer_relay_fp` in one acquisition - If `peer_relay_fp.is_some()`: * Reject → forward a `Hangup` over federation via `send_signal_to_peer` instead of local signal_hub * Accept → wrap the raw answer in `FederatedSignalForward`, route to the specific origin peer, then emit the LOCAL CallSetup to our callee with `peer_direct_addr = caller_reflexive_addr` (caller is remote; this side only has the callee) - If `peer_relay_fp.is_none()` → existing Phase 3 same-relay path with both CallSetups (caller + callee) ### Cross-relay signal dispatcher task New long-running task reading `(inner, origin_relay_fp)` from `cross_relay_rx`. In Phase 4 MVP handles: - `DirectCallOffer` — if target is local, create the call in the registry with `peer_relay_fp = origin_relay_fp`, stash caller addr, deliver offer to local callee. If target isn't local, drop (no multi-hop in Phase 4 MVP). 
- `DirectCallAnswer` — look up local caller by call_id, stash callee addr, forward raw answer to local caller via signal_hub, emit local CallSetup with `peer_direct_addr = callee_reflexive_addr` (peer is local now; this side only has the caller). - `CallRinging` — best-effort forward to local caller for UX. - `Hangup` — logged for now; Phase 4.1 will target by call_id. ## Integration tests `crates/wzp-relay/tests/cross_relay_direct_call.rs` — 3 tests that reproduce the main.rs cross-relay dispatcher logic inline and assert the invariants without spinning up real binaries: 1. `cross_relay_offer_forwards_and_stashes_peer_relay_fp` — Relay A gets Alice's offer, broadcasts. Relay B's dispatcher creates the call with `peer_relay_fp = relay_a_tls_fp`. 2. `cross_relay_answer_crosswires_peer_direct_addrs` — full round trip; both CallSetups (one on each relay) carry the OTHER party's reflex addr. 3. `cross_relay_loop_prevention_drops_self_sourced_forward` — explicit loop-prevention check. Full workspace test goes from 413 → 419 passing. Clippy clean on touched files. ## Non-goals (deferred to Phase 4.1+) - Relay-mediated media fallback across federation — if P2P direct fails (symmetric NAT on either side), the call errors out with "no media path". Making the existing federation media pipeline carry ephemeral call-<id> rooms is the Phase 4.1 lift. - Multi-hop federation (A → B → C). Phase 4 MVP supports a direct federation link between A and B only. - Fingerprint → peer-relay routing gossip. PRD: .taskmaster/docs/prd_phase4_cross_relay_p2p.txt Tasks: 70-78 all completed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 17:31:43 +04:00
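The loop-prevention rule in the federation `handle_signal` arm above (drop any forward whose `origin_relay_fp` equals this relay's own fingerprint, otherwise hand the inner message to the dispatcher) can be sketched like this. The trimmed-down `SignalMessage` enum, the `Box` around `inner`, and the `unwrap_forward` name are assumptions made for the example.

```rust
/// Reduced stand-in for the wire enum; only the envelope and one inner
/// message are modeled here.
#[derive(Debug, Clone, PartialEq)]
pub enum SignalMessage {
    Hangup,
    FederatedSignalForward {
        inner: Box<SignalMessage>,
        origin_relay_fp: String,
    },
}

/// Returns the (inner, origin_relay_fp) pair to push into the cross-relay
/// dispatcher, or None when the message must be dropped.
pub fn unwrap_forward(msg: SignalMessage, local_tls_fp: &str) -> Option<(SignalMessage, String)> {
    match msg {
        SignalMessage::FederatedSignalForward { inner, origin_relay_fp } => {
            if origin_relay_fp == local_tls_fp {
                // Our own forward came back (A -> B -> A): drop it.
                None
            } else {
                Some((*inner, origin_relay_fp))
            }
        }
        // Anything that is not a forward envelope is outside this arm.
        _ => None,
    }
}
```

Comparing fingerprints is enough for the single-hop case the commit targets; a multi-hop topology would likely need a TTL or a visited list, which the commit explicitly defers.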
Siavash Sameni	59ce52f8e8	feat(p2p): Phase 3.5 dual-path QUIC race + GUI call-flow debug logs Two features in one commit because they ship and test together: Phase 3.5 closes the hole-punching loop and the call-flow debug logs give the user live visibility into every step of a call so real-hardware testing of the new P2P path is debuggable. ## Phase 3.5 — dual-path QUIC connect race Completes the hole-punching work Phase 3 scaffolded. On receiving a CallSetup with peer_direct_addr, the client now actually races a direct QUIC handshake against the relay dial and uses whichever completes first. Symmetric role assignment avoids the two-conns- per-call problem: - Both peers compare `own_reflex_addr` vs `peer_reflex_addr` lexicographically. - Smaller addr → Acceptor (A-role): builds a server-capable dual endpoint, awaits an incoming QUIC session. Does NOT dial. - Larger addr → Dialer (D-role): builds a client-only endpoint, dials the peer's addr with `call-<id>` SNI. Does NOT listen. - Both sides always dial the relay in parallel as fallback. - `tokio::select!` with `biased` preference for direct, `tokio::pin!` so each branch can await the losing opposite as fallback. - Direct timeout 2s, relay fallback timeout 5s (so 7s worst case from CallSetup to "no media path" error). New crate module `wzp_client::dual_path::{race, WinningPath}` (moved here from desktop/src-tauri so it's testable from a workspace test). `determine_role` in `wzp_client::reflect` is pure-function and unit-tested. ### CallEngine integration - New `pre_connected_transport: Option<Arc<QuinnTransport>>` arg on both android + desktop `CallEngine::start` branches. Skips the internal wzp_transport::connect step when Some. Backward- compat: None keeps Phase 0 relay-only behavior. - `connect` Tauri command reads own_reflex_addr from SignalState, computes role, runs the race, passes the winning transport into CallEngine. 
If ANY input is missing (no peer addr, no own addr, equal addrs), falls back to classic relay path — identical to pre-Phase-3.5 behavior. ### Tests (9 new, all passing) - 6 unit tests for `determine_role` truth table in `wzp-client/src/reflect.rs` (smaller=Acceptor, larger=Dialer, port-only diff, equal, missing-side, symmetry) - 3 integration tests in `crates/wzp-client/tests/dual_path.rs`: * `dual_path_direct_wins_on_loopback` — two-endpoint test rig, Dialer wins direct path vs loopback mock relay * `dual_path_relay_wins_when_direct_is_dead` — dead peer port, 2s direct timeout, relay fallback wins * `dual_path_errors_cleanly_when_both_paths_dead` — <10s error, no hang ## GUI call-flow debug logs Runtime-toggled structured events at every step of a call so the user can see where a call progressed or stalled on real hardware. Modeled on the existing DRED_VERBOSE_LOGS pattern. ### Rust side - `static CALL_DEBUG_LOGS: AtomicBool` + `emit_call_debug(&app, step, details)` helper. Always logs via `tracing::info!` (logcat always has a copy); GUI Tauri `call-debug-log` event only fires when the flag is on. - Tauri commands `set_call_debug_logs` / `get_call_debug_logs`. ### Instrumented steps (24 emit_call_debug sites) - `register_signal`: start, identity loaded, endpoint created, connect failed/ok, RegisterPresence sent, ack received/failed, recv loop spawning - Recv loop: CallRinging, DirectCallOffer (w/ caller_reflexive_addr), DirectCallAnswer (w/ callee_reflexive_addr), CallSetup (w/ peer_direct_addr), Hangup - `place_call`: start, reflect query start/ok/none, offer sent, send failed - `answer_call`: start, reflect query start/ok/none or privacy skip, answer sent, send failed - `connect`: start, dual_path_race_start (w/ role), won (w/ path), failed, skipped (w/ reasons), call_engine_starting/ started/failed ### JS side - New `callDebugLogs: boolean` field on Settings type. 
- Boot-time hydrate of the Rust flag from localStorage so the choice survives restarts (like `dredDebugLogs`). - Settings panel: new "Call flow debug logs" checkbox alongside the DRED toggle. - New "Call Debug Log" section that ONLY shows when the flag is on. Rolling in-memory buffer of the last 200 events, rendered as monospace `HH:MM:SS.mmm step {details}` lines with auto-scroll and a Clear button. - `listen("call-debug-log", ...)` subscribed at app startup, appends to the buffer, re-renders on every event. Full workspace test goes from 404 → 413 passing. Clippy clean on touched crates. PRD: .taskmaster/docs/prd_phase35_dual_path_race.txt Tasks: 61-69 all completed Next: APK + desktop build carrying everything — Phase 2 NAT detect, Phase 3 advertising, Phase 3.5 dual-path + call debug logs, plus the earlier Android first-join diagnostics — so the user can validate the P2P path on real hardware with live per-step visibility into where any failures happen. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:06:44 +04:00
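The symmetric role assignment from the Phase 3.5 commit above (smaller address becomes Acceptor, larger becomes Dialer, equal or missing addresses fall back to the relay path) can be sketched as a pure function. This version compares the addresses as strings, which is an assumption: the real `determine_role` in `wzp_client::reflect` may parse them first, and its signature may differ.

```rust
use std::cmp::Ordering;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Role {
    /// Listens for the incoming direct QUIC session; does not dial.
    Acceptor,
    /// Dials the peer's reflex address; does not listen.
    Dialer,
}

/// Both peers run this with the arguments swapped, so they always reach
/// opposite answers. `None` means "skip the race, use the relay only".
pub fn determine_role(own: Option<&str>, peer: Option<&str>) -> Option<Role> {
    let (own, peer) = (own?, peer?);
    match own.cmp(peer) {
        Ordering::Less => Some(Role::Acceptor),
        Ordering::Greater => Some(Role::Dialer),
        // Identical addresses give no way to break the tie.
        Ordering::Equal => None,
    }
}
```

Because the decision depends only on the two advertised addresses, no extra signaling round trip is needed to agree on who dials.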
Siavash Sameni	39277bf3a0	feat(hole-punching): advertise peer reflexive addrs in DirectCall flow — Phase 3 Completes the signal-plane plumbing for P2P direct calling: both peers now learn their own server-reflexive address (Phase 1 Reflect), include it in DirectCallOffer / DirectCallAnswer, and the relay cross-wires them into each side's CallSetup so the client knows the OTHER party's direct addr. Dual-path QUIC race is scaffolded but deferred to Phase 3.5 — this commit ships the full advertising layer so real-hardware testing can confirm the addrs flow end-to-end before adding the concurrent-connect logic. Wire protocol (wzp-proto/src/packet.rs): - DirectCallOffer gains optional `caller_reflexive_addr` - DirectCallAnswer gains optional `callee_reflexive_addr` - CallSetup gains optional `peer_direct_addr` - All #[serde(default, skip_serializing_if = "Option::is_none")] so pre-Phase-3 peers and relays stay backward compatible by construction — the new fields are elided from the JSON on the wire when None, and older clients parse the JSON ignoring any fields they don't know. - 2 new roundtrip tests (Some + None cases, old-JSON parse-back). Call registry (wzp-relay/src/call_registry.rs): - DirectCall gains caller_reflexive_addr + callee_reflexive_addr. - set_caller_reflexive_addr / set_callee_reflexive_addr setters. - 2 new unit tests: stores and returns addrs, clearing works. Relay cross-wiring (wzp-relay/src/main.rs): - On DirectCallOffer: stash the caller's addr in the registry. - On DirectCallAnswer: stash the callee's addr (only set by AcceptTrusted answers — privacy-mode leaves it None). - Send two different CallSetup messages: one to the caller with peer_direct_addr=callee_addr, and one to the callee with peer_direct_addr=caller_addr. The cross-wiring means each side gets the OTHER party's direct addr, not its own. - Logs `p2p_viable=true` when both sides advertised. 
Client advertising (desktop/src-tauri/src/lib.rs): - New `try_reflect_own_addr` helper that reuses the Phase 1 oneshot pattern WITHOUT holding state.signal.lock() across the await (critical: the recv loop reacquires the same mutex to fire the oneshot, so holding it would deadlock). - `place_call` queries reflect first and includes the returned addr in DirectCallOffer. Falls back to None on any failure — call still proceeds via the relay path. - `answer_call` queries reflect ONLY on AcceptTrusted so AcceptGeneric keeps the callee's IP private by design. Reject and AcceptGeneric both pass None. - recv loop's CallSetup handler destructures and forwards peer_direct_addr to the JS layer in the signal-event payload. Client scaffolding for dual-path (desktop/src-tauri/src/lib.rs + desktop/src/main.ts): - `connect` Tauri command gets a new optional `peer_direct_addr` argument. Currently LOGS the addr but still uses the relay path for the media connection — Phase 3.5 will swap in a tokio::select! race between direct dial + relay dial. Scaffolding lands here so the JS wire is stable, real-hardware testing can confirm advertising works end-to-end, and Phase 3.5 is a pure Rust change with no JS touches. - JS setup handler forwards `data.peer_direct_addr` to invoke. Back-compat with the CLI client (crates/wzp-client/src/cli.rs): - CLI test harness updated for the new fields — always passes None for both reflex addrs (no hole-punching). Also destructures peer_direct_addr: _ in its CallSetup handler. 
Tests (8 new, all passing): - wzp-proto: hole_punching_optional_fields_roundtrip, hole_punching_backward_compat_old_json_parses - wzp-relay call_registry: call_registry_stores_reflexive_addrs, call_registry_clearing_reflex_addr_works - wzp-relay integration: crates/wzp-relay/tests/hole_punching.rs * both_peers_advertise_reflex_addrs_cross_wire_in_setup * privacy_mode_answer_omits_callee_addr_from_setup * pre_phase3_caller_leaves_both_setups_relay_only * neither_peer_advertises_both_setups_are_relay_only Full workspace test goes from 396 → 404 passing. PRD: .taskmaster/docs/prd_hole_punching.txt Tasks: 53-60 all completed (58 = scaffolding-only; 3.5 follow-up) Next up: Phase 3.5 — dual-path QUIC connect race. With the advertising layer live, this becomes a focused change: on CallSetup-with-peer_direct_addr, start a server-capable dual endpoint, and tokio::select! across (direct dial, relay dial, inbound accept). Whichever QUIC handshake completes first wins, the losers drop, 2s direct timeout falls back to relay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 13:37:04 +04:00
Siavash Sameni	8d903f16c6	feat(reflect): multi-relay NAT type detection — Phase 2 Builds on Phase 1's SignalMessage::Reflect to probe N relays in parallel through transient QUIC connections and classify the client's NAT type for the future P2P hole-punching path. No wire protocol changes — Phase 1's Reflect/ReflectResponse pair is reused unchanged. New client-side module (crates/wzp-client/src/reflect.rs): - probe_reflect_addr(relay, timeout_ms): opens a throwaway quinn::Endpoint (fresh ephemeral source port per probe, essential for NAT-type detection — sharing one endpoint would make a symmetric NAT look like a cone NAT), connects to _signal, sends RegisterPresence with zero identity, consumes the Ack, sends Reflect, awaits ReflectResponse, cleanly closes. - detect_nat_type(relays, timeout_ms): parallel probes via tokio::task::JoinSet (bounded by slowest probe not sum) and returns a NatDetection with per-probe results + aggregate classification. - classify_nat(probes): pure-function classifier split out for network-free unit tests. Rules: * 0-1 successful probes → Unknown * 2+ successes, same ip same port → Cone (P2P viable) * 2+ successes, same ip diff ports → SymmetricPort (relay) * 2+ successes, different ips → Multiple (treat as symmetric) Tauri command (desktop/src-tauri/src/lib.rs): - detect_nat_type({ relays: [{ name, address }] }) -> NatDetection as JSON. Takes the relay list from JS because localStorage owns the config. Parse-up-front so a malformed entry fails clean instead of as a probe error. 1500ms per-probe timeout. UI (desktop/index.html + src/main.ts): - New "NAT type" row + "Detect NAT" button in the Network settings section. 
Renders per-probe status (name, address, observed addr, latency, or error) plus the colored verdict: * green Cone — shows consensus addr * amber SymmetricPort / Multiple — must relay * gray Unknown — not enough data Tests: - 7 unit tests in wzp-client/src/reflect.rs covering every classifier branch (empty, 1 success, 2 identical, 2 diff ports, 2 diff ips, success+failure mix, pure-failure). - 3 integration tests in crates/wzp-relay/tests/multi_reflect.rs: * probe_reflect_addr_happy_path — single mock relay end-to-end * detect_nat_type_two_loopback_relays_is_cone — two concurrent relays, asserts both see 127.0.0.1 and classifier returns Cone or SymmetricPort (accepted because the test harness uses fresh ephemeral ports per probe which look like SymmetricPort on single-host loopback) * detect_nat_type_dead_relay_is_unknown — alive + dead port mix, asserts the dead probe surfaces an error string and the aggregator returns Unknown (only 1 success) Full workspace test goes from 386 → 396 passing. PRD: .taskmaster/docs/prd_multi_relay_reflect.txt Tasks: 47-52 all completed Next up: hole-punching (Phase 3) — use the reflected address in DirectCallOffer/Answer and CallSetup so peers attempt a direct QUIC handshake to each other, with relay fallback on timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 12:47:12 +04:00
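The classifier rules listed in the Phase 2 message can be sketched as a pure function. This is only an illustration under stated assumptions: the `ProbeResult` and `NatType` shapes below are hypothetical, and the real `classify_nat` signature in crates/wzp-client/src/reflect.rs is not shown in this log and may differ.

```rust
use std::net::SocketAddr;

/// Hypothetical per-probe outcome: the reflexive address one relay reported,
/// or an error string if the probe failed.
#[derive(Debug, Clone, PartialEq)]
pub struct ProbeResult {
    pub relay: String,
    pub observed: Result<SocketAddr, String>,
}

/// Hypothetical aggregate verdict, named after the rules in the commit message.
#[derive(Debug, Clone, PartialEq)]
pub enum NatType {
    Unknown,
    Cone(SocketAddr), // carries the consensus address (assumption)
    SymmetricPort,
    Multiple,
}

/// Pure classifier: no network access, so it can be unit-tested directly.
pub fn classify_nat(probes: &[ProbeResult]) -> NatType {
    // Keep only the successful probes.
    let ok: Vec<SocketAddr> = probes
        .iter()
        .filter_map(|p| p.observed.as_ref().ok().copied())
        .collect();

    // 0-1 successes: not enough data.
    if ok.len() < 2 {
        return NatType::Unknown;
    }
    let first = ok[0];
    // Different IPs across relays: treat as symmetric.
    if ok.iter().any(|a| a.ip() != first.ip()) {
        return NatType::Multiple;
    }
    // Same IP, different ports: port-symmetric NAT, must relay.
    if ok.iter().any(|a| a.port() != first.port()) {
        return NatType::SymmetricPort;
    }
    // Same IP and same port everywhere: cone NAT, P2P viable.
    NatType::Cone(first)
}
```

Failed probes are simply ignored, which is why an alive + dead relay mix yields `Unknown` (only one success), matching the third integration test described above.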
Siavash Sameni	921856eba9	feat(reflect): QUIC-native NAT reflection ("STUN for QUIC") — Phase 1 Lets a client ask its registered relay "what IP:port do you see for me?" over the existing TLS-authenticated signal channel, returning the client's server-reflexive address as a SocketAddr. Replaces the need for a classic STUN deployment and becomes the bootstrap step for future P2P hole-punching: once both peers know their own reflex addrs, they can advertise them in DirectCallOffer and attempt a direct QUIC handshake to each other. Wire protocol (wzp-proto): - SignalMessage::Reflect — unit variant, client -> relay - SignalMessage::ReflectResponse { observed_addr: String } — relay -> client - JSON-serde, appended at end of enum: zero ordinal concerns, backward compat with pre-Phase-1 relays by construction (older relays log "unexpected message" and drop; newer clients time out cleanly within 1s). Relay handler (wzp-relay/src/main.rs, signal loop): - New match arm next to Ping reuses the already-bound `addr` from connection.remote_address() and replies with observed_addr as a string. debug!-level log on success, warn!-level on send failure. Client side (desktop/src-tauri/src/lib.rs): - SignalState gains pending_reflect: Option<oneshot::Sender<SocketAddr>>. - get_reflected_address Tauri command installs the oneshot before sending Reflect and awaits it with a 1s timeout; cleans up on every exit path (send failure, timeout, parse error). - recv loop's new ReflectResponse arm fires the pending sender or emits a debug log for unsolicited responses — never crashes the loop on malformed input. - Integrated into invoke_handler! alongside the other signal commands. UI (desktop/index.html + src/main.ts): - New "Network" section in settings panel with a "Detect" button that displays the reflected address or a categorized warning ("register first" / "relay does not support reflection" / error). 
Tests (crates/wzp-relay/tests/reflect.rs — 3 new, all passing): - reflect_happy_path: client on loopback gets back 127.0.0.1:<its own port> - reflect_two_clients_distinct_ports: two concurrent clients see their own distinct ports, proving per-connection remote_address - reflect_old_relay_times_out: mock relay that ignores Reflect — client times out between 1000-1200ms and does not hang Also pre-existing test bit-rot unrelated to this PR — fixed so the full workspace `cargo test` goes green: - handshake_integration tests in wzp-client, wzp-relay and featherchat_compat in wzp-crypto all missed the `alias` field addition to CallOffer and the 3-arg form of perform_handshake plus 4-tuple return of accept_handshake. Updated to the current API surface. Results: cargo test --workspace --exclude wzp-android: 386 passed cargo check --workspace: clean cargo clippy: no new warnings in touched files Verification excludes wzp-android because it's dead code on this branch (Tauri mobile uses wzp-native instead) and can't link -llog on macOS host — unchanged status quo. PRD: .taskmaster/docs/prd_reflect_over_quic.txt Tasks: 39-46 all completed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 12:29:07 +04:00
Siavash Sameni	7e7968b2f9	diag(android-engine): first-join no-audio ordering instrumentation Adds a single call_t0 = Instant::now() at the top of the Android CallEngine::start path, threaded through send + recv tasks as send_t0 / recv_t0, and tags the following milestones with t_ms_since_call_start so we can build a clean side-by-side log of first-call vs rejoin: 1. QUIC connection established 2. handshake complete 3. wzp-native audio_start returned (+ how long audio_start itself took) 4. send task spawned 5. send: first full capture frame read (+ short_reads_before count) 6. send: first non-zero capture RMS 7. recv task spawned 8. recv: first media packet received 9. recv: first successful decode 10. recv: first playout-ring write Combined with the existing C++-side cb#0 logs in crates/wzp-native/cpp/oboe_bridge.cpp ("capture cb#0", "playout cb#0") this gives us full-pipeline ordering with no native-side changes needed. PRD: .taskmaster/docs/prd_android_first_join_no_audio.txt Task: 32 (first task in the chain — diagnostics before any fix attempts so we know which of the 5 suspect causes is real). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 10:00:20 +04:00
Siavash Sameni	578ff8cff4	feat(debug): GUI toggle for DRED verbose logs + macOS mic permission DRED verbose logs (off by default — keeps logcat clean in normal use): - wzp-codec: DRED_VERBOSE_LOGS atomic flag with dred_verbose_logs() / set_dred_verbose_logs() helpers - opus_enc: gate "DRED enabled" + libopus version logs behind the flag - desktop/src-tauri/engine.rs: gate DredRecvState parse log, reconstruction log, classical PLC log, and DRED-counter fields in the Android recv heartbeat (non-verbose path still logs basic recv stats) - Tauri commands set_dred_verbose_logs / get_dred_verbose_logs - Settings panel gets a "DRED debug logs (verbose, dev only)" checkbox; preference persists in wzp-settings localStorage and is pushed to Rust on save and on app boot macOS mic permission: - Add desktop/src-tauri/Info.plist with NSMicrophoneUsageDescription. Without it, modern macOS silently denies CoreAudio capture for ad-hoc-signed Tauri builds — capture starts but every callback hands you zeros. Symptom: phones could not hear desktop client, desktop could still hear phones (playout has no TCC gate). The Tauri 2 bundler auto-merges this file into WarzonePhone.app's Contents/Info.plist on the next build, so first launch will pop the standard mic prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 09:48:32 +04:00
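A process-wide logging toggle like the one described above can be sketched with a single atomic. The flag and helper names come from the commit message; the `AtomicBool` type, the `Relaxed` ordering, and the `gated_log` call site are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Off by default so normal runs keep logcat clean.
static DRED_VERBOSE_LOGS: AtomicBool = AtomicBool::new(false);

/// Cheap read used at every gated log site.
pub fn dred_verbose_logs() -> bool {
    // Relaxed is enough here (assumption): the flag guards logging only,
    // not any data that must be published along with it.
    DRED_VERBOSE_LOGS.load(Ordering::Relaxed)
}

/// Called from the settings layer when the checkbox changes.
pub fn set_dred_verbose_logs(enabled: bool) {
    DRED_VERBOSE_LOGS.store(enabled, Ordering::Relaxed);
}

/// Hypothetical call site: returns the line that would be logged, or None
/// when verbose logging is disabled.
pub fn gated_log(msg: &str) -> Option<String> {
    if dred_verbose_logs() {
        Some(format!("[dred] {msg}"))
    } else {
        None
    }
}
```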
Siavash Sameni	16890576fb	feat(observability): logcat-visible DRED proof of life on Android Adds enough INFO-level logging that an opus-DRED-v2 APK on Android can be verified end-to-end by reading logcat alone — no debugger, no Prometheus, no telemetry pipeline required. Four observation points: 1. Encoder construction (opus_enc.rs) - Bumped the "DRED enabled" log from debug! to info! so the per-call DRED config is in logcat by default. Each call's first OpusEncoder construction logs codec, dred_frames, dred_ms, loss_floor_pct. - Added a one-shot static OnceLock that logs `opusic_c::version()` the first time an OpusEncoder is built in the process. This is the smoking gun for "is the new libopus actually loaded" — pre-Phase-0 audiopus shipped libopus 1.3 with no DRED, post-Phase-0 should print 1.5.2 here. 2. DRED state ingest (DredRecvState::ingest_opus in desktop/src-tauri/src/engine.rs) - First successful parse on a call logs immediately so we can see "DRED is on the wire" in logcat. - Subsequent parses sample every 100th to confirm steady-state samples_available without drowning the log. - New parses_total / parses_with_data counters track the parse rate vs the success rate (a packet without DRED in it returns `available == 0`, so a low ratio means the encoder isn't emitting DRED bytes). 3. DRED reconstruction events (DredRecvState::fill_gap_to) - Every DRED reconstruction logs at INFO with missing_seq, anchor_seq, offset_samples, offset_ms, samples_available, gap_size, and the running total. These events are rare on a clean network and we want to know exactly which gap was filled. - First three classical PLC fills + every 50th thereafter log so we can see when DRED couldn't cover a gap (offset out of range, no good state, or reconstruct error). 4.
Recv heartbeat (Android start() in engine.rs) - Existing 2-second heartbeat now includes dred_recv, classical_plc, dred_parses_with_data, dred_parses_total so a steady-state call shows the cumulative counters in logcat without parsing. How to verify on a real call: adb logcat -s 'RustStdoutStderr:*' | grep -i 'dred\|libopus version' Expected output sequence on a successful Opus call: - "linked libopus version libopus_version=libopus 1.5.2-..." (once per process) - "opus encoder: DRED enabled codec=Opus24k dred_frames=20 dred_ms=200 loss_floor_pct=15" (per call) - "DRED state parsed from Opus packet seq=N samples_available=4560 ms=95 ..." (after first DRED-bearing packet) - "recv heartbeat (android) ... dred_recv=0 classical_plc=0 dred_parses_with_data=58 dred_parses_total=58" (every 2s) If you see "linked libopus version libopus 1.3" — the FFI swap didn't take. If dred_parses_with_data stays at 0 while dred_parses_total climbs — the sender isn't emitting DRED (check the encoder's loss floor and the receiver's libopus version). If gaps trigger "classical PLC fill" instead of "DRED reconstruction fired" — DRED state coverage is too small for the observed loss pattern, and the loss floor or DRED duration policy needs tuning. Verification: - cargo check -p wzp-codec -p wzp-client: 0 errors - cargo check -p wzp-desktop: 0 Rust errors (only the pre-existing tauri::generate_context!() proc macro panic on missing ../dist which fires at host check time, irrelevant on the remote build) - cargo test -p wzp-codec --lib: 69 passing (no regressions) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 08:58:03 +04:00
Siavash Sameni	daf7bcd9ba	chore(warnings): sweep the workspace — zero warnings on lib + bin targets Addressed every rustc warning surfaced by `cargo check --workspace --release --lib --bins` on opus-DRED-v2. Split across three categories: ## Real bugs surfaced by the audit (fix, don't silence) - crates/wzp-relay/src/federation.rs — the per-peer RTT monitor task computed `rtt_ms` every 5 s and threw it on the floor. The `wzp_federation_peer_rtt_ms` gauge has been registered in metrics.rs the whole time but was never receiving samples, leaving the Grafana panel blank. Wired it up: the task now calls `fm_rtt.metrics.federation_peer_rtt_ms.with_label_values(&[&label_rtt]).set(rtt_ms)` on every sample. Fixes three warnings (`rtt_ms`, `fm_rtt`, `label_rtt` were all captured for this task and all dead). ## Dead code removal - crates/wzp-relay/src/federation.rs — removed `local_delivery_seq: AtomicU16` field and its initializer. It was described in comments as "per-room seq counter for federation media delivered to local clients" but was declared, initialized to 0, and never read or written anywhere else. Genuine half-wired feature; deletable with zero behavior change. - crates/wzp-relay/src/room.rs — removed `let recv_start = Instant::now()` at the top of a recv loop that was never read. Separate variable `last_recv_instant` already measures the actual gap that's used for the `max_recv_gap_ms` stat. - crates/wzp-client/src/cli.rs — removed `let my_fp = fp.clone()` from the signal loop setup. Cloned but never used in any match arm. ## Stub-intent warnings (underscore + explanatory comment) - crates/wzp-relay/src/handshake.rs — `choose_profile` hardcodes `QualityProfile::GOOD` and ignores its `supported` parameter. Comment already documented "Cap at GOOD (24k) for now — studio tiers not yet tested for federation reliability".
Renamed to `_supported`, expanded the comment to explicitly note the future plan (pick highest supported ≤ relay ceiling). - crates/wzp-relay/src/federation.rs — `forward_to_peers` takes `room_name: &str` but only uses `room_hash`. The caller (handle_datagram) passes the name for caller-site symmetry with other helpers; kept the param shape and underscored the binding with a comment noting it's reserved for future per-name logging. ## Cosmetic fixes - crates/wzp-relay/src/event_log.rs — dropped `use std::sync::Arc` (unused). - crates/wzp-relay/src/signal_hub.rs — trimmed `use tracing::{info, warn}` to `use tracing::info`. Also removed unnecessary `mut` on `hub` binding in the `register_unregister` test. - crates/wzp-relay/src/room.rs — trimmed `use tracing::{debug, error, info, trace, warn}` to `{error, info, warn}`. Also removed unnecessary `mut` on `mgr` binding in the `room_join_leave` test. - crates/wzp-relay/src/main.rs — removed unnecessary `mut` on the `config` destructured binding from `parse_args()`; and dropped `ref caller_alias` from the `DirectCallOffer` match pattern since the relay just forwards the full `msg` (caller_alias is preserved end-to-end, we don't need to read it on the relay). - crates/wzp-crypto/tests/featherchat_compat.rs — dropped `CallSignalType` from a `use wzp_client::featherchat::{...}` (unused in the test body). Note: this test file has pre-existing compile errors from SignalMessage schema drift unrelated to this sweep; that's tracked separately. ## Crate-level annotation - crates/wzp-android/src/lib.rs — added `#![allow(dead_code, unused_imports, unused_variables, unused_mut)]` with a doc block explaining the crate is dead code since the Tauri mobile rewrite. The legacy Kotlin+JNI Android app that consumed this crate was replaced by desktop/src-tauri (live Android recv path) + crates/wzp-native (Oboe bridge).
Rather than piecemeal cleanup of a crate that shouldn't be maintained, the whole-crate allow keeps CI clean until someone removes the crate entirely. Kills all 6 wzp-android warnings (4 unused imports/vars, 1 unused `mut` on a JNI env param, 1 dead `command_rx` field) in one line. ## Not touched - deps/featherchat/warzone/crates/warzone-protocol/src/x3dh.rs — 3 unused-variable warnings in `alice_spk_secret`, `alice_bundle`, `bob_bundle_bytes`. This is a vendored third-party submodule; upstream's problem, not ours. Would need to be reported to featherchat upstream if we care. ## Verification - `cargo check --workspace --release --lib --bins` → 0 warnings, 0 errors - `cargo check --workspace --release --all-targets` → only the 3 featherchat submodule warnings remain, plus the pre-existing 3 broken integration tests (SignalMessage schema drift from Phase 2, tracked separately and explicitly out of scope). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 08:28:26 +04:00
Siavash Sameni	df1a45a5f5	fix(cli): port live mode to ring API (read_frame/write_frame removed) AudioCapture and AudioPlayback no longer expose the old read_frame() and write_frame() methods — they were replaced with ring() returning &Arc<AudioRing> when the lock-free SPSC ring was introduced. The CLI live-mode loop still referenced the removed methods, which broke every workspace build that touched wzp-client bin (including the remote Linux x86_64 docker build). - Send loop: allocate a 960-sample scratch buffer, fill it in a loop via capture.ring().read() until a full 20 ms frame is available, sleep 2 ms between empty reads to avoid hot-spinning. - Recv loop: write decoded PCM into playback.ring() instead of calling write_frame(). Short writes on full ring drop the tail, which is the correct real-time behavior for CLI live mode. No behavioral change on the wire or in the call pipeline — this is purely a compile fix for cli.rs bitrot that accumulated since the ring API landed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 08:08:14 +04:00
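The send and recv loop behavior described above can be sketched against a stand-in ring. `MockRing` is a hypothetical mutex-backed substitute for the real lock-free SPSC `AudioRing`, whose actual `read`/`write` signatures are not shown in this log; only the 960-sample frame size, the 2 ms sleep on empty reads, and the drop-the-tail policy come from the commit message.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// 960 samples per 20 ms frame, per the commit message.
pub const FRAME_SAMPLES: usize = 960;

/// Hypothetical stand-in for AudioRing (the real one is lock-free SPSC).
pub struct MockRing {
    buf: Mutex<VecDeque<i16>>,
    capacity: usize,
}

impl MockRing {
    pub fn new(capacity: usize) -> Self {
        Self { buf: Mutex::new(VecDeque::with_capacity(capacity)), capacity }
    }

    /// Copies up to `out.len()` samples out of the ring; returns how many were read.
    pub fn read(&self, out: &mut [i16]) -> usize {
        let mut q = self.buf.lock().unwrap();
        let n = out.len().min(q.len());
        for slot in out[..n].iter_mut() {
            *slot = q.pop_front().unwrap();
        }
        n
    }

    /// Writes as many samples as fit; the tail is dropped when the ring is full.
    pub fn write(&self, pcm: &[i16]) -> usize {
        let mut q = self.buf.lock().unwrap();
        let n = pcm.len().min(self.capacity - q.len());
        q.extend(pcm[..n].iter().copied());
        n
    }
}

/// Send-side helper: block until one full frame has been collected.
pub fn read_full_frame(ring: &MockRing, frame: &mut [i16; FRAME_SAMPLES]) {
    let mut filled = 0;
    while filled < FRAME_SAMPLES {
        let n = ring.read(&mut frame[filled..]);
        if n == 0 {
            // Nothing available yet: back off briefly instead of hot-spinning.
            thread::sleep(Duration::from_millis(2));
        }
        filled += n;
    }
}
```

On the recv side, ignoring the return value of `write` is what implements "short writes drop the tail": late audio is discarded rather than queued, which keeps playout latency bounded.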
Siavash Sameni	dd0c714caa	Revert "fix(deps): restore Cargo.lock from `8ceb6f4` — minimize dep drift from Phase 0" This reverts commit `575a39d07a`.	2026-04-11 08:06:04 +04:00
Siavash Sameni	a7b2f850f1	build(script): parametrize branch via WZP_BRANCH (default opus-DRED-v2) The Linux build script was hardcoded to feat/android-voip-client, which is an older branch that doesn't have the current DRED work or the relay fixes from `8c4d640`. Default the branch to opus-DRED-v2 (current active development branch), thread it through to the remote script as a third positional arg, and allow override via `WZP_BRANCH=<name> ./build-linux-docker.sh`. This is also what let us discover that the relay at 172.16.81.175:4433 was running `d0c1731` (android-rewrite) and missing the `8c4d640` CallSetup/advertised-IP fix — direct calls failed until the relay was rebuilt locally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 08:05:56 +04:00
Siavash Sameni	575a39d07a	fix(deps): restore Cargo.lock from `8ceb6f4` — minimize dep drift from Phase 0 Phase 0 cherry-pick regenerated the lockfile from scratch via `cargo generate-lockfile`, which bumped at least tokio (1.50.0 → 1.51.1) and downgraded the lockfile format from version 4 → version 3. Many other transitive deps may have shifted silently. Symptoms that pointed here: 1. Direct-call media QUIC handshake silently stalls for exactly the client-side 10s timeout, with no errors in the log. Classic tokio runtime / async waker mismatch — tasks queued from one runtime never run because the endpoint's I/O driver is on another runtime. 2. Every `place_call` gets an immediate `signal: Hangup reason=Normal` back from the signal recv loop, as if it's consuming stale state. 3. Eventually hits `FORTIFY: pthread_mutex_lock called on a destroyed mutex` and the process dies. All three are consistent with a tokio async primitive being shared across runtimes in a way that tokio 1.51.1 handles differently than 1.50.0 (which was the version on the user's known-good build). Rather than chase the specific bisection, restore the exact base lockfile and let cargo add only the three deps Phase 0 actually needs (opusic-c, opusic-sys, bytemuck). Verification: - `git diff 8ceb6f4..HEAD -- Cargo.lock | grep -c '^[+-]version = '` → 0 (no version-line changes beyond what Cargo auto-pulls for new crates) - tokio back to 1.50.0 - rustls, quinn, quinn-proto, quinn-udp all unchanged - Lockfile version restored to 4 - cargo test -p wzp-codec --lib: 69 passing (unchanged) - cargo test -p wzp-client --lib: 35 passing + 1 ignored (unchanged) Does not fix the pre-existing relay-side advertised-IP bug (CallSetup may still contain a relay address that the callee cannot reach from its network), but that is an orthogonal issue that existed on `8ceb6f4` too. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 22:13:35 +04:00
Siavash Sameni	d63d50cdc0	fix(build): remove apostrophe from libc++_shared comment (broke docker bash -c quoting) Previous commit `d269600` added the libc++_shared.so copy step but the comment block included "Android's dynamic linker" — the apostrophe closed the enclosing `bash -c '...'` single-quoted string prematurely. Everything after "Android" was interpreted as wrapper-script bash instead of docker-container bash, so JNI_ABI_DIR (set inside the docker context) was unbound when the wrapper tried to use it. Build failed with: /tmp/wzp-tauri-build.sh: line 149: JNI_ABI_DIR: unbound variable Note the pre-existing script uses backticks in its comments ("cargo-tauri`s linker wiring") exactly to avoid this trap. Matched that style and added an explicit NOTE to the comment explaining the quoting hazard for future editors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:49:54 +04:00
Siavash Sameni	d269600aa7	fix(build): build-tauri-android.sh — copy libc++_shared.so into jniLibs Root cause of "wzp-native not loaded" at runtime on opus-DRED-v2 APK: libwzp_native.so has a NEEDED entry for libc++_shared.so (because crates/wzp-native/build.rs uses cpp_link_stdlib(Some("c++_shared"))), but the APK only contained: lib/arm64-v8a/libwzp_desktop_lib.so (192 MB) lib/arm64-v8a/libwzp_native.so (683 KB) No libc++_shared.so → Android's dynamic linker fails the dlopen of libwzp_native.so at runtime with "library libc++_shared.so not found", and every audio path that routes through wzp_native (capture, playout, register, direct call) refuses to start. Diagnosis: - readelf -d libwzp_native.so shows NEEDED libc++_shared.so - python zipfile listing of the APK confirms libc++_shared.so is absent from lib/arm64-v8a/ - scripts/build-and-notify.sh (the legacy wzp-android build path) already had this fix at lines 126-134 with an explicit comment: "cargo-ndk may not copy libc++_shared.so — grab it from the NDK if missing". That fix was never ported to build-tauri-android.sh when the Tauri mobile pipeline was set up. Fix: after `cargo ndk build -p wzp-native --release` produces libwzp_native.so into jniLibs, copy libc++_shared.so from the NDK sysroot (same find pattern as build-and-notify.sh) into the same jniLibs dir. Abort with a clear error if the NDK doesn't have the file. Also noting the 191 MB vs 359 MB size discrepancy the user saw: that's almost entirely libwzp_desktop_lib.so being a 192 MB debug build. The old working APK was probably a release build (smaller main lib) or included multiple arches (doubling/tripling the .so count). The size is cosmetic — the crash is the real issue, and libc++_shared.so is ~2 MB so this fix doesn't close the size gap. Can investigate the size difference separately after register + direct call work again. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:43:47 +04:00
Siavash Sameni	dfbe21fe6e	feat(tauri-engine): Phase 3b/3c re-port — DRED reconstruction on the live Tauri mobile engine The original Phase 3b landed on wzp-client/CallDecoder and Phase 3c landed on wzp-android/src/engine.rs. Both of those are DEAD CODE on feat/desktop-audio-rewrite: the legacy Kotlin app in android/app/ is not built by the Tauri mobile pipeline, and the Tauri engine bypasses CallDecoder by calling wzp_codec::create_decoder directly. The live Android call engine lives at desktop/src-tauri/src/engine.rs with two `pub async fn start<F>` functions — one cfg-gated on Android (Oboe via wzp-native) and one for desktop (CPAL). Both recv tasks were using `let mut decoder = wzp_codec::create_decoder(...)` which returns `Box<dyn AudioDecoder>` and doesn't expose the inherent `reconstruct_from_dred` method. Changes: New helper struct `DredRecvState` at the top of engine.rs, wrapping: - DredDecoderHandle (libopus DRED side-channel parser) - DredState scratch (for parse_into) - DredState last_good (cached valid state, swapped on success) - last_good_seq: Option<u16> (DRED anchor sequence) - expected_seq: Option<u16> (for gap detection) - dred_reconstructions / classical_plc_invocations counters With three methods: - ingest_opus(seq, payload): parse DRED, swap on success - fill_gap_to(decoder, current_seq, frame_samples, scratch, emit): detect gap back from expected_seq, reconstruct each missing frame via DRED if state covers it, fall through to classical decoder.decode_lost() when it doesn't. Calls emit() once per frame with a slice the caller uses for AGC + playout write. - reset_on_profile_switch(): invalidate tracking when codec changes Both recv tasks (Android @ ~line 297 and desktop @ ~line 907): - Decoder type changed from `Box<dyn AudioDecoder>` via `wzp_codec::create_decoder` to concrete `AdaptiveDecoder::new(profile)` so we can call the inherent reconstruct_from_dred method. 
- Added `use wzp_proto::traits::AudioDecoder;` at the top of engine.rs to bring decode/decode_lost/set_profile trait methods into scope on the concrete type. - New `current_profile` local alongside `current_codec` (used for frame_duration lookups that drive the DRED sample offset math). - On codec/profile switch, call dred_recv.reset_on_profile_switch() because the cached DRED state is tied to the old profile's frame rate. - For each arriving Opus source packet: 1. dred_recv.ingest_opus(seq, payload) — parse DRED 2. dred_recv.fill_gap_to(...) — detect gap and reconstruct missing frames, each emitted through a closure that does AGC + playout write (wzp_native on Android, playout_ring on desktop) 3. Normal decoder.decode() fallthrough for the current packet (unchanged) - Codec2 packets skip the DRED path entirely (is_opus() gate) — libopus can't reconstruct Codec2 audio. Ordering invariant: gap reconstruction writes to playout BEFORE the current packet's decoded audio, preserving temporal order since the playout ring is FIFO. The closure captures the `spk_muted` flag once before the gap loop to avoid mid-gap-fill state changes. Kept `crates/wzp-android/src/engine.rs` and `crates/wzp-android/src/stats.rs` from the earlier Phase 3c commit as-is — they're dead code on feat/desktop-audio-rewrite but harmless, and deleting them would diverge this branch from an independently-useful intermediate state. The old Phase 3c commit (`505a834`) stays as historical reference. Verification: - cargo check -p wzp-codec -p wzp-client -p wzp-relay: 0 errors - cargo check -p wzp-desktop: only pre-existing `tauri::generate_context!()` panic on missing ../dist (Vite output not built on host) — no Rust compile errors from our changes - cargo test -p wzp-codec --lib: 69 passing (unchanged) - cargo test -p wzp-client --lib: 35 passing + 1 ignored (unchanged) Next: scripts/build-tauri-android.sh to get the actual Tauri APK — NOT build-and-notify.sh which builds the dead legacy android/app.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:31:09 +04:00
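The sequence-number gap detection that `fill_gap_to` relies on can be sketched in isolation. `GapTracker` is a hypothetical name, and the 16-frame cap is borrowed from the Phase 3c description (`MAX_GAP_FRAMES = 16`). Treating a larger gap as "resync without reconstructing" is an assumption of this sketch, not something the log confirms.

```rust
/// Largest gap (in frames) the sketch will try to fill; 16 frames = 320 ms at 20 ms/frame.
pub const MAX_GAP_FRAMES: u16 = 16;

/// Hypothetical tracker holding only the next expected sequence number.
#[derive(Debug, Default)]
pub struct GapTracker {
    expected_seq: Option<u16>,
}

impl GapTracker {
    /// Returns the sequence numbers missing before `seq`, in playout order,
    /// then advances the expectation to `seq + 1`.
    pub fn on_packet(&mut self, seq: u16) -> Vec<u16> {
        let missing = match self.expected_seq {
            // First packet of the call (or after a reset): nothing to compare against.
            None => Vec::new(),
            Some(expected) => {
                // wrapping_sub keeps the math correct across the u16 rollover.
                let gap = seq.wrapping_sub(expected);
                if gap == 0 || gap > MAX_GAP_FRAMES {
                    // In order, or too far away (late/reordered packet or a long
                    // outage): skip reconstruction and just resync (assumption).
                    Vec::new()
                } else {
                    (0..gap).map(|i| expected.wrapping_add(i)).collect()
                }
            }
        };
        self.expected_seq = Some(seq.wrapping_add(1));
        missing
    }

    /// Mirrors reset_on_profile_switch(): forget tracking when the codec changes.
    pub fn reset(&mut self) {
        self.expected_seq = None;
    }
}
```

A caller would attempt DRED reconstruction (or classical PLC) for each returned sequence number before decoding the packet that just arrived, which preserves the FIFO ordering invariant described above.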
Siavash Sameni	b83c31b5d1	fix(android): remove duplicate TextAlign import in InCallScreen.kt Pre-existing build breakage on feat/desktop-audio-rewrite @ `8ceb6f4` — TextAlign was imported twice (line 5 and line 50), causing Kotlin compilation to fail with: e: InCallScreen.kt:5:39 Conflicting import, imported name 'TextAlign' is ambiguous e: InCallScreen.kt:50:39 Conflicting import, imported name 'TextAlign' is ambiguous The line-5 copy was squeezed into the middle of the foundation.* block (alphabetically out of place) — an accidental extra paste. The line-50 copy sits in the correct alphabetical position. Removed the former. This blocks the APK build for the opus-DRED-v2 rebase. Unrelated to DRED itself but the error surfaced because the cherry-picked phases caused a clean Gradle build (no UP-TO-DATE short-circuit) that re-compiled InCallScreen.kt against the fresh class graph. Also noting that the previous working APK (unridden-alfonso.apk) was built from the stale `d0c1731` baseline which didn't have this bug — one more reason the stale-branch build problem went unnoticed until the opus-DRED-v2 rebase forced a clean Gradle pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:12:23 +04:00
Siavash Sameni	1f607281fd	fix(build): build-and-notify.sh — parameterize branch, fail loud on pull errors Same fix that landed on the old opus-DRED branch as `c95255d`: the remote build script hardcoded `feat/android-voip-client` and swallowed the reset failure with `|| true`, silently leaving the tree on whatever branch was there. This ported the fix forward to feat/desktop-audio-rewrite (which had the same bug). Fix: Local side: - Auto-detect current branch via `git branch --show-current` - Accept `--branch NAME` override - Pass branch as a third positional arg to the remote script - Abort on detached HEAD - Updated usage docs for the "build what I'm working on" default Remote side: - Read BRANCH from $3, abort if empty - `git fetch origin "$BRANCH"` — errors surface - `git reset --hard "origin/$BRANCH"` — no `|| true`, failures abort - Echo the resolved commit hash + subject after reset - Notifications include both branch and hash: "WZP Android [opus-DRED-v2 @ <hash>] done! APK: ..." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:07:15 +04:00
Siavash Sameni	7515417202	feat(telemetry): Phase 4 — LossRecoveryUpdate protocol + relay metrics + DebugReporter Phase 4 lays the telemetry foundation for distinguishing DRED recoveries from classical PLC in production: a new SignalMessage variant, two new per-session Prometheus counters on the relay side, and a highlighted loss-recovery section in the Android DebugReporter. The periodic emitter (client → relay) and Grafana panel are deferred to Phase 4b — this commit ships the protocol surface, the relay sink, and the immediate user-visible debug output. Once 4b lands the full path (emitter → relay → Prometheus → Grafana), the metrics here will automatically start receiving data. Scope decision — why not extend QualityReport instead: The existing wire-format QualityReport is a fixed 4-byte media packet trailer. Adding counter fields to it would shift the binary layout and break backward compatibility (old receivers would parse the last 4 bytes of the extended trailer as QR, corrupting audio). Using a new SignalMessage variant on the reliable QUIC signal stream sidesteps the wire-format problem entirely — serde JSON enums tolerate unknown variants gracefully on old receivers, and the signal channel is the right layer for periodic telemetry aggregates. Changes: wzp-proto/src/packet.rs: - New SignalMessage::LossRecoveryUpdate variant carrying: * dred_reconstructions: u64 (monotonic since call start) * classical_plc_invocations: u64 (monotonic) * frames_decoded: u64 (for rate calculation) - All three fields tagged #[serde(default)] for forward compat. wzp-client/src/featherchat.rs: - Added a match arm so signal_to_call_type() handles the new variant (treat as Offer for featherChat bridging purposes). 
wzp-relay/src/metrics.rs: - Two new IntCounterVec metrics on the relay, labeled by session_id: * wzp_relay_session_dred_reconstructions_total * wzp_relay_session_classical_plc_total - New method update_session_loss_recovery(session_id, dred, plc) applies monotonic deltas: if the incoming totals exceed the current counter, the difference is inc_by'd. If the incoming totals are LOWER (client restart or counter reset), the Prometheus counter holds steady until the client catches up. This matches the existing update_session_buffer delta pattern. - remove_session_metrics() now cleans up the two new labels. - New test session_loss_recovery_monotonic_delta exercises: * initial population (10 DRED, 2 PLC) * forward advance (25, 5 → delta +15, +3) * lower values ignored (client reset → counters unchanged) * client catches up (30, 8 → advances to new max) - Existing session_metrics_cleanup test extended to cover the new counters. android/app/src/main/java/com/wzp/debug/DebugReporter.kt: - Phase 4 users — and incident responders — need to quickly see whether DRED is actually firing during a call. The stats JSON already carries the counters (after Phase 3c), but they were buried in the trailing JSON dump. Added a dedicated "=== Loss Recovery ===" section to the meta preamble that extracts dred_reconstructions, classical_plc_invocations, frames_decoded, and fec_recovered from the JSON and displays them plainly, plus computed percentages when frames_decoded > 0. - New extractLongField helper: tiny hand-rolled JSON integer extractor. We don't want to pull in a full JSON parser for this single use case and CallStats has a flat, well-known schema. 
Verification: - cargo check --workspace: zero errors - cargo test -p wzp-proto --lib: 63 passing - cargo test -p wzp-codec --lib: 68 passing - cargo test -p wzp-client --lib: 35 passing (+1 ignored probe) - cargo test -p wzp-relay --lib: 68 passing (+1 new Phase 4 test) - cargo check -p wzp-android --lib: zero errors - Android APK build verified earlier today (unridden-alfonso.apk via the remote Docker builder) — Phase 0–3c confirmed to compile end-to-end on the NDK target. Phase 4b remaining (not blocking this commit): - Periodic LossRecoveryUpdate emitter in wzp-client/src/call.rs and wzp-android/src/engine.rs (every ~5 s) - Relay-side handler in main.rs that matches the new variant and calls metrics.update_session_loss_recovery - Grafana "Loss recovery breakdown" panel in docs/grafana-dashboard.json Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:03:39 +04:00
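The monotonic-delta rule described for update_session_loss_recovery in the commit above can be sketched without Prometheus. This is a minimal illustration under stated assumptions, not the relay's code: `SessionCounter` and `apply_total` are hypothetical names standing in for one labeled counter and its update path.

```rust
/// Hypothetical stand-in for one per-session Prometheus counter.
/// The real relay uses IntCounterVec; this only models the delta rule.
#[derive(Debug, Default)]
pub struct SessionCounter {
    value: u64,
}

impl SessionCounter {
    /// Apply a client-reported running total.
    ///
    /// If the incoming total exceeds the current value, the counter advances
    /// by the difference. If it is lower (client restart or counter reset),
    /// the counter holds steady until the client catches up.
    /// Returns the delta that was actually applied.
    pub fn apply_total(&mut self, incoming_total: u64) -> u64 {
        // saturating_sub yields 0 when incoming_total <= self.value.
        let delta = incoming_total.saturating_sub(self.value);
        self.value += delta;
        delta
    }

    /// Current counter value.
    pub fn value(&self) -> u64 {
        self.value
    }
}
```

Feeding it the sequence from the session_loss_recovery_monotonic_delta description (10, then 25, then a lower value, then 30) should advance by 10, 15, 0, and 5 respectively.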
Siavash Sameni	505a834c5b	feat(codec): Phase 3c — Android engine.rs DRED reconstruction on packet loss Phase 3c mirrors Phase 3b on the Android receive path. With Phase 0-3b landed on desktop + Android encoder, this commit completes codec-layer loss recovery on the Android decoder side. Architectural difference vs desktop: engine.rs has NO jitter buffer. The recv task reads packets directly from the transport via recv_media().await and writes decoded audio straight into the playout ring. There is no PlayoutResult::Missing equivalent. Gap detection therefore has to be done via sequence-number tracking — when a packet arrives with seq > expected_seq, the frames in between are missing and we attempt to reconstruct them via DRED before decoding the newly- arrived packet. Implementation: Imports & types: - Added wzp_codec::AdaptiveDecoder, wzp_codec::dred_ffi::{ DredDecoderHandle, DredState} imports. - Changed the `decoder` local from Box<dyn AudioDecoder> (via wzp_codec::create_decoder) to concrete AdaptiveDecoder::new(profile). Same reasoning as Phase 3b: reconstruct_from_dred is an inherent method, not a trait method, so we need the concrete type. Recv task state (all task-local, no new struct fields): - dred_decoder: DredDecoderHandle - dred_parse_scratch: DredState (reused, overwritten per parse) - last_good_dred: DredState (cached most-recent valid state) - last_good_dred_seq: Option<u16> - expected_seq: Option<u16> (for gap detection) - dred_reconstructions: u64 (telemetry) - classical_plc_invocations: u64 (telemetry) Recv loop body (Opus source packets only): 1. Parse DRED from the new packet first so last_good_dred reflects the freshest state available for gap recovery. 2. Detect a gap: gap = pkt.seq.wrapping_sub(expected_seq). Cap at MAX_GAP_FRAMES = 16 (320 ms) to avoid huge wraparound scenarios. 3. 
For each missing seq in the gap: offset = (last_good_dred_seq - missing_seq) * frame_samples if 0 < offset <= last_good_dred.samples_available(): reconstruct_from_dred + write to playout ring bump dred_reconstructions else: decoder.decode_lost (classical PLC) + write + bump plc counter 4. Decode the current packet normally and write to playout ring (unchanged from Phase 2). 5. Update expected_seq = pkt.seq.wrapping_add(1). Profile-switch handling: when the incoming codec changes (triggering decoder.set_profile), reset last_good_dred_seq and expected_seq to None. The cached DRED state is tied to the old profile's frame rate and would produce wrong offsets after the switch; starting fresh is correct. Decode-error fallback: the existing `Err(e) => decode_lost` branch now also increments classical_plc_invocations so the counter accurately reflects all PLC invocations (gap-detected AND decode- error-triggered). Telemetry (CallStats additions): - stats.dred_reconstructions: u64 - stats.classical_plc_invocations: u64 Both updated on every packet arrival in the existing stats.lock() block alongside frames_decoded/fec_recovered, so the Android UI and JNI bridge already have these values without any further plumbing. The periodic recv stats log now includes both counters. Ordering note: DRED gap reconstruction happens BEFORE decoding the new packet's audio because the playout ring is FIFO. Gap samples must be written before the new packet's samples so temporal order is preserved. Out-of-order late arrivals (seq < expected_seq) are naturally dropped as stale by the gap detection (gap would be a large wraparound value exceeding MAX_GAP_FRAMES). Verification: - cargo check --workspace: zero errors - cargo test -p wzp-codec --lib: 68 passing (unchanged from Phase 3b) - cargo test -p wzp-client --lib: 35 passing (unchanged from Phase 3b) - cargo check -p wzp-android --lib: zero errors - cargo test -p wzp-android cannot run on macOS host (pre-existing -llog linker dep, unrelated). 
Real end-to-end verification happens via the Android APK build on the remote Docker builder (scripts/build-and-notify.sh). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:03:31 +04:00
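The gap-detection and per-frame recovery decision described in the commit above can be sketched as two pure functions. This is an illustration only: the function names and the `Recovery` enum are hypothetical, and it assumes that a gap larger than MAX_GAP_FRAMES is treated as "no gap to conceal" (which is how the message describes stale out-of-order arrivals), rather than being clamped.

```rust
/// Upper bound on frames concealed per gap (16 x 20 ms = 320 ms).
pub const MAX_GAP_FRAMES: u16 = 16;

/// What to do for one missing frame (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    /// Reconstruct from cached DRED state at this backward sample offset.
    Dred { offset_samples: u32 },
    /// Fall back to the decoder's classical packet loss concealment.
    ClassicalPlc,
}

/// Number of frames missing before `pkt_seq`, using wrapping u16 arithmetic.
/// Late or stale packets wrap to a huge value and are reported as no gap.
pub fn missing_frames(expected_seq: Option<u16>, pkt_seq: u16) -> u16 {
    match expected_seq {
        None => 0, // first packet after start or profile switch
        Some(expected) => {
            let gap = pkt_seq.wrapping_sub(expected);
            if gap > MAX_GAP_FRAMES { 0 } else { gap }
        }
    }
}

/// Decide how to fill `missing_seq`, given the seq of the packet whose DRED
/// state is cached and how many samples of history that state covers.
pub fn plan_recovery(
    last_good_dred_seq: Option<u16>,
    missing_seq: u16,
    frame_samples: u32,
    samples_available: u32,
) -> Recovery {
    let anchor = match last_good_dred_seq {
        Some(seq) => seq,
        None => return Recovery::ClassicalPlc,
    };
    // Frames between the missing frame and the DRED anchor packet.
    let frames_back = anchor.wrapping_sub(missing_seq);
    // Zero or "negative" (wrapped) distance means DRED cannot cover it.
    if frames_back == 0 || frames_back > i16::MAX as u16 {
        return Recovery::ClassicalPlc;
    }
    let offset_samples = u32::from(frames_back) * frame_samples;
    if offset_samples <= samples_available {
        Recovery::Dred { offset_samples }
    } else {
        Recovery::ClassicalPlc
    }
}
```

Because the playout ring is FIFO, a caller would run `plan_recovery` for each missing seq in ascending order and write the result before decoding the newly arrived packet.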
Siavash Sameni	27bc264738	feat(codec): Phase 3b — CallDecoder DRED reconstruction on packet loss Phase 3b of the DRED integration — wires the Phase 3a FFI primitives into the desktop receive path. When the jitter buffer reports a missing Opus frame, CallDecoder now attempts to reconstruct the audio from the most recently parsed DRED side-channel state before falling through to classical PLC. Architectural refinement vs the PRD's literal wording: the PRD said "jitter buffer takes a Box<dyn DredReconstructor>". After checking deps, wzp-transport depends only on wzp-proto (not wzp-codec). Putting DRED state in the jitter buffer would require a new cross-crate dep and couple the codec-agnostic buffer to libopus. Instead, this commit keeps the DRED state ring and reconstruction dispatch inside CallDecoder (one layer up from the jitter buffer), intercepting the existing PlayoutResult::Missing signal. Same lookahead/backfill semantics, cleaner layering, zero change to wzp-transport. Changes: CallDecoder field type: Box<dyn AudioDecoder> → AdaptiveDecoder. Required because Phase 3b calls the inherent reconstruct_from_dred method, which cannot live on the AudioDecoder trait without dragging libopus DredState through wzp-proto. In practice AdaptiveDecoder was the only AudioDecoder implementor anyway — the trait abstraction was buying nothing. Method call sites unchanged because AdaptiveDecoder also implements AudioDecoder. New CallDecoder fields: - dred_decoder: DredDecoderHandle - dred_parse_scratch: DredState (scratch for parse_into) - last_good_dred: DredState (cached most-recent valid state) - last_good_dred_seq: Option<u16> - dred_reconstructions: u64 (Phase 4 telemetry) - classical_plc_invocations: u64 (Phase 4 telemetry) CallDecoder::ingest — on Opus non-repair packets, parse DRED into the scratch state. On success (samples_available > 0), std::mem::swap the scratch into last_good_dred and record the seq. 
This is O(1) per packet, zero allocation after construction (the two DredState buffers are allocated once in new() and reused forever). CallDecoder::decode_next — on PlayoutResult::Missing(seq) for Opus profiles: if last_good_dred_seq > seq and the seq delta × frame_samples fits within samples_available, call audio_dec.reconstruct_from_dred and bump dred_reconstructions. Otherwise fall through to classical PLC and bump classical_plc_invocations. The Codec2 path always falls through to classical PLC since DRED is libopus-only and AdaptiveDecoder::reconstruct_from_dred rejects Codec2 tiers explicitly. OpusDecoder and AdaptiveDecoder: new inherent reconstruct_from_dred method that delegates to the underlying DecoderHandle. Needed to bridge CallDecoder's wzp-client code to the Phase 3a FFI wrappers without touching the AudioDecoder trait. CRITICAL FINDING — raised DRED loss floor from 5% to 15%: Phase 3b testing discovered that libopus 1.5's DRED emission window scales aggressively with OPUS_SET_PACKET_LOSS_PERC. Empirical data (see probe_dred_samples_available_by_loss_floor, an #[ignore]'d diagnostic test in call.rs):
loss_pct   samples_available   effective_ms
5%         720                 15 ms (useless!)
10%        2640                55 ms
15%        4560                95 ms
20%        6480                135 ms
25%+       8400 (capped)       175 ms (~87% of 200 ms configured)
The Phase 1 default of 5% produced only a 15 ms reconstruction window — too small to even cover a single 20 ms Opus frame. DRED was effectively disabled even though it was emitting bytes. Raised the floor to 15% (95 ms window) as the minimum that actually provides single-frame loss recovery. This updates Phase 1's DRED_LOSS_FLOOR_PCT constant in opus_enc.rs and the accompanying module docstring. Trade-off: 15% assumed loss slightly increases encoder bitrate overhead on clean networks. 
Measured via the existing phase1 bitrate probe: Before (5% floor): 3649 bytes/sec at Opus 24k + 300 Hz sine After (15% floor): 3568 bytes/sec at Opus 24k + 300 Hz sine The delta is within noise — 15% isn't meaningfully more expensive than 5% on this signal, which suggests the DRED emission size is signal- dependent rather than loss-dependent for small values. Net result: we get a 6x larger reconstruction window for essentially free. Tests (+3 DRED recovery, +1 #[ignore]'d probe): - opus_single_packet_loss_is_recovered_via_dred — full encode → ingest → decode_next loop with one packet dropped mid-stream. Asserts dred_reconstructions ≥ 1 and observes the exact counter deltas. - opus_lossless_ingest_never_triggers_dred_or_plc — baseline behavior, lossless stream never takes the Missing branch. - codec2_loss_falls_through_to_classical_plc — Codec2 never reconstructs via DRED even if state were populated (which it won't be — Codec2 packets don't carry DRED bytes). - probe_dred_samples_available_by_loss_floor — #[ignore]'d diagnostic that sweeps loss_pct values and prints the resulting DRED window sizes. Kept for future tuning work. New CallDecoder introspection accessors (public but undocumented in the PRD): last_good_dred_seq() and last_good_dred_samples_available() for test diagnostics and future telemetry surfaces in Phase 4. Verification: - cargo check --workspace: zero errors - cargo test -p wzp-codec --lib: 68 passing (Phase 3a baseline held) - cargo test -p wzp-client --lib: 35 passing (+3 Phase 3b tests, +1 ignored diagnostic, no regressions) Next up: Phase 3c mirrors this on the Android engine.rs receive path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:03:24 +04:00
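The window sizes quoted in the commit above follow from simple arithmetic on samples_available. A minimal sketch, assuming the 48 kHz sampling rate and 20 ms Opus frames mentioned in these commits; the constant and function names are hypothetical:

```rust
/// Sampling rate the DRED parse is configured with in this series.
pub const SAMPLE_RATE_HZ: u32 = 48_000;
/// Samples in one 20 ms frame at 48 kHz.
pub const FRAME_SAMPLES_20MS: u32 = 960;

/// Convert a DRED samples_available value to whole milliseconds of history.
pub fn window_ms(samples_available: u32) -> u32 {
    samples_available * 1000 / SAMPLE_RATE_HZ
}

/// How many whole 20 ms frames the DRED window can cover.
pub fn recoverable_frames(samples_available: u32) -> u32 {
    samples_available / FRAME_SAMPLES_20MS
}
```

With the 5% floor, 720 samples is a 15 ms window that covers zero whole frames, while the 15% floor's 4560 samples gives 95 ms and four whole frames, which matches the reasoning for raising the floor.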
Siavash Sameni	c27b39d553	feat(codec): Phase 3a — DRED FFI primitives (DredDecoderHandle + DredState) Phase 3a of the DRED integration — the foundation for codec-layer loss recovery. Adds three new safe wrappers to crates/wzp-codec/src/dred_ffi.rs over the raw opusic-sys FFI, plus the reconstruction method on the existing DecoderHandle. No call-site integration yet — that lands in Phase 3b (desktop) and Phase 3c (Android). New types: - `DredDecoderHandle`: owns `*mut OpusDREDDecoder` from opus_dred_decoder_create. Used for parsing DRED side-channel data out of arriving Opus packets. This is a SEPARATE libopus object from OpusDecoder — it has its own internal state. Freed via opus_dred_decoder_destroy on Drop. - `DredState`: owns `*mut OpusDRED` from opus_dred_alloc (a fixed ~10.6 KB buffer per libopus 1.5). Holds parsed DRED data between the parse and reconstruct steps. Reusable — parse_into overwrites contents. Tracks samples_available as a cached u32 so callers don't thread the value separately. Freed via opus_dred_free on Drop. New methods: - `DredDecoderHandle::parse_into(&mut self, state: &mut DredState, packet)` wraps opus_dred_parse with max_dred_samples=48000 (1s max), sampling_rate=48000, defer_processing=0. Returns the positive sample offset of the first decodable DRED sample, 0 if no DRED is present, or an error. Populates state.samples_available so subsequent reconstruct calls know the valid offset range. - `DecoderHandle::reconstruct_from_dred(&mut self, state, offset_samples, output)` wraps opus_decoder_dred_decode. Reconstructs audio at a specific sample position (positive, measured backward from the DRED anchor packet) into a caller-provided output buffer. Validates that 0 < offset_samples <= state.samples_available() before calling the FFI to catch range bugs. 
Tests (+7, wzp-codec total: 68 passing): - dred_decoder_handle_creates_and_drops - dred_state_creates_and_drops - dred_state_reset_zeroes_counter - dred_parse_and_reconstruct_roundtrip — end-to-end validation. Encodes 60 frames of a 300 Hz sine wave through a DRED-enabled Opus 24k encoder, parses DRED state out of each arriving packet, asserts that at least one packet carries non-zero samples_available (DRED warm-up completes within the first second), then reconstructs 20 ms of audio from inside the window and asserts non-zero total energy. This is the hard signal that the full libopus 1.5 DRED FFI chain is correctly wired on our side. - reconstruct_with_out_of_range_offset_errors — offset > samples_available is rejected at the Rust layer before the FFI call. - reconstruct_with_zero_offset_errors — offset <= 0 rejected. - dred_parse_empty_packet_returns_zero — graceful handling of empty input. Architectural note (divergence from PRD's literal wording): The PRD said "jitter buffer takes a Box<dyn DredReconstructor>". After checking Cargo.toml for wzp-transport, it does NOT depend on wzp-codec — only wzp-proto. Adding a DRED state ring inside the jitter buffer would require a new cross-crate dependency and couple the codec-agnostic jitter buffer to libopus internals. Instead, Phase 3b will put the DRED state ring and reconstruction dispatch in CallDecoder (one layer up from the jitter buffer), intercepting the existing PlayoutResult::Missing signal and attempting reconstruction before falling through to classical PLC. The jitter buffer itself stays unchanged. Same lookahead/backfill semantics, cleaner layering. PRD's intent preserved, implementation refined. 
Verification: - cargo check --workspace: zero errors - cargo test -p wzp-codec --lib: 68 passing (61 Phase 2 baseline + 7 new) - The roundtrip test is the acceptance gate — it proves that opus_dred_decoder_create, opus_dred_alloc, opus_dred_parse, and opus_decoder_dred_decode all work correctly through our wrappers on real libopus 1.5.2 output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:03:14 +04:00
Siavash Sameni	6db5c25b54	feat(codec): Phase 2 — remove RaptorQ from Opus tiers, Codec2 unchanged Phase 2 of the DRED integration (docs/PRD-dred-integration.md). With Phase 1 having enabled DRED on every Opus profile, the app-level RaptorQ layer is now redundant overhead on those tiers: +20% bitrate, +40–100 ms receive-side latency (block wait), +CPU for stats we never used. This phase removes RaptorQ from the Opus encode and decode paths on both the desktop (wzp-client/call.rs) and Android (wzp-android/engine.rs) sides. Codec2 tiers keep RaptorQ with their current ratios unchanged — DRED is libopus-only and Codec2 has no neural equivalent. Encoder changes (the real bandwidth / CPU win): - CallEncoder::encode_frame and engine.rs encode loop now gate the RaptorQ path on !codec.is_opus(): - Opus source packets emit fec_block=0, fec_symbol=0, fec_ratio_encoded=0 in the MediaHeader - fec_enc.add_source_symbol is skipped on Opus - generate_repair + repair packet emission is skipped on Opus - block_id and frame_in_block counters stay frozen at 0 for Opus - Codec2 path is byte-for-byte identical to pre-Phase-2 behavior. Decoder changes (mostly cleanup, since both live decoder paths were already reading audio directly from source packets and only using the RaptorQ decoder output for stats): - CallDecoder::ingest skips fec_dec.add_symbol on Opus packets. Source packets still flow to the jitter buffer; Opus repair packets from old senders are dropped cleanly (repair packets never hit the jitter buffer either). - engine.rs recv loop skips fec_dec.add_symbol, fec_dec.try_decode, and fec_dec.expire_before on Opus packets. The `fec_recovered` stat counter becomes Codec2-only (a separate DRED reconstruction counter lands in Phase 4). Wire-format backward compat verified at pre-flight: - Old receiver + new sender: engine.rs pipeline.rs path gates on non-zero fec_block/fec_symbol which now never fire for Opus, so the RaptorQ decoder simply isn't fed. Audio flows normally. 
Desktop CallDecoder's old path accumulated packets into the stale-eviction HashMap, which cleans up after 2s — harmless. - New receiver + old sender: new receiver skips RaptorQ on Opus so old-sender repair packets are ignored entirely (no crash, no double-decode). Loses the (previously vestigial) RaptorQ recovery benefit, which was never actually active in the audio path. Source packets still decode normally. - No wire format version bump required. MediaHeader is unchanged; we just zero the FEC fields on Opus packets. Test changes: - Removed `encoder_generates_repair_on_full_block` — asserted the old (pre-Phase-2) RaptorQ-on-Opus behavior and is now incorrect. Replaced with three tests: - `opus_source_packets_have_zero_fec_header_fields` — verifies Phase 2 invariants on Opus packets - `opus_encoder_never_emits_repair_packets` — runs 20 frames of non-silent sine wave through a GOOD-profile encoder, asserts exactly 20 output packets, zero repair - `codec2_encoder_generates_repair_on_full_block` — same shape as the old test but on CATASTROPHIC profile (Codec2 1200, 8 frames/block, ratio 1.0) to verify Codec2 path still emits repairs as before Verification: - cargo check --workspace: zero errors - cargo test -p wzp-codec --lib: 61 passing (Phase 1 baseline held) - cargo test -p wzp-client --lib: 32 passing (+3 new Phase 2 tests, -1 old test removed) - cargo check -p wzp-android --lib: zero errors (host link of wzp-android tests fails on -llog per pre-existing Android-only build.rs, unrelated to this work; integration build via build-and-notify.sh will validate Android end-to-end) - Pre-existing broken integration test in crates/wzp-client/tests/handshake_integration.rs (SignalMessage schema drift) is NOT caused by this commit — baseline had the same 3 compile errors before Phase 2. Flagged as a separate cleanup task. 
Expected observable effects on a real call: - Opus 24k outgoing bitrate drops from ~28.8 kbps (ratio 0.2 RaptorQ) to ~25 kbps (base 24 kbps + DRED ~1–10 kbps signal-dependent) - Opus receive-side latency drops ~40 ms on clean network (no more block wait — jitter buffer emits as soon as a source packet arrives) - Codec2 calls show no latency or bitrate change Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:02:42 +04:00
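The encoder-side gating in the commit above (zeroed FEC header fields on Opus, unchanged values on Codec2) can be sketched as follows. This is an illustration under assumptions: the `Codec` enum, the `FecFields` struct, and the u8 field widths are hypothetical and are not taken from the real MediaHeader definition.

```rust
/// Hypothetical two-variant codec id; the real project has more tiers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Opus,
    Codec2,
}

impl Codec {
    /// True for tiers where DRED replaces the app-level RaptorQ layer.
    pub fn is_opus(self) -> bool {
        matches!(self, Codec::Opus)
    }
}

/// FEC-related header fields (field widths are an assumption of this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct FecFields {
    pub fec_block: u8,
    pub fec_symbol: u8,
    pub fec_ratio_encoded: u8,
}

/// Header fields for a source packet: all zero on Opus, passed through
/// unchanged on Codec2 so that path keeps its pre-Phase-2 behavior.
pub fn fec_fields_for(
    codec: Codec,
    block_id: u8,
    frame_in_block: u8,
    ratio_encoded: u8,
) -> FecFields {
    if codec.is_opus() {
        FecFields { fec_block: 0, fec_symbol: 0, fec_ratio_encoded: 0 }
    } else {
        FecFields {
            fec_block: block_id,
            fec_symbol: frame_in_block,
            fec_ratio_encoded: ratio_encoded,
        }
    }
}
```

Keeping the header layout unchanged and only zeroing the values is what lets the change ship without a wire-format version bump.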
Siavash Sameni	54cbebd34e	feat(codec): Phase 1 — enable DRED on all Opus profiles, disable inband FEC Phase 1 of the DRED integration (docs/PRD-dred-integration.md). The Opus encoder now emits DRED (Deep REDundancy) bytes in every packet, carrying a neural-coded history of recent audio that the decoder can use to reconstruct loss bursts up to the configured window. Opus inband FEC (LBRR) is disabled because DRED does the same job better and running both wastes bitrate on overlapping protection. Tiered DRED duration policy per PRD: Studio (Opus 32k/48k/64k): 10 frames = 100 ms Normal (Opus 16k/24k): 20 frames = 200 ms Degraded (Opus 6k): 50 frames = 500 ms Each profile switch (via adaptive quality) updates the DRED duration to match the new tier. A 5% packet_loss floor is applied whenever DRED is active, because libopus 1.5 gates DRED emission on non-zero packet_loss. Real loss measurements from the quality adapter override upward. Escape hatch: AUDIO_USE_LEGACY_FEC=1 reverts the encoder to Phase 0 behavior (inband FEC Mode1, DRED off, no loss floor). Read once at OpusEncoder::new; call-scoped, not re-read mid-call. Trait-level set_inband_fec becomes a no-op in DRED mode to preserve the invariant even if external callers forget. Observations from the bitrate probe test (dred_mode_roundtrip_voice_pattern): DRED mode: 3649 bytes/sec (~29.2 kbps) on Opus 24k + 300 Hz sine Legacy mode: 2383 bytes/sec (~19.1 kbps) Delta: +10.1 kbps The delta is considerably larger than the "+1 kbps flat" figure I carried into the PRD from hazy memory of published DRED benchmarks. Likely because the input (300 Hz sine) is very compressible so the base Opus rate in legacy mode is well below the 24 kbps target, making the delta look disproportionate. Signal-dependent — real speech would probably show a different ratio. If production telemetry shows the overhead is excessive, we can cut DRED duration on the normal tier from 200 ms to 100 ms as a first tuning lever. 
Not blocking Phase 1 since the test still passes within the reasonable 2000–8000 bytes/sec bounds. Test changes (+8 tests, total wzp-codec: 61 passing): - dred_duration_for_studio_tiers_is_100ms (per-profile policy) - dred_duration_for_normal_tiers_is_200ms - dred_duration_for_degraded_tier_is_500ms - dred_duration_for_codec2_is_zero - default_mode_is_dred_not_legacy (sanity check on fresh construction) - dred_mode_roundtrip_voice_pattern (observes DRED bitrate, asserts bounds) - profile_switch_refreshes_dred_duration (verifies set_profile updates DRED) - set_inband_fec_noop_in_dred_mode (trait-level inband FEC no-op) Verification: - cargo check --workspace: zero errors, no new warnings - cargo test -p wzp-codec: 61/61 passing (53 pre-Phase-1 baseline + 8 new) - Empirical DRED bitrate observed via `rtk proxy cargo test dred_mode_roundtrip_voice_pattern -- --nocapture` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:02:35 +04:00
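The tiered DRED duration policy and the loss floor in the commit above can be sketched as a pair of small functions. This is a hedged illustration: the `Tier` enum and function names are hypothetical, the 10 ms unit is inferred from "10 frames = 100 ms", and the floor is passed as a parameter because Phase 1 uses 5% while a later commit raises it.

```rust
/// Hypothetical grouping of profiles into the tiers named in the policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Studio,   // Opus 32k/48k/64k
    Normal,   // Opus 16k/24k
    Degraded, // Opus 6k
    Codec2,   // no DRED available
}

/// One DRED duration unit, inferred from "10 frames = 100 ms".
pub const DRED_FRAME_MS: u32 = 10;

/// DRED history length in 10 ms units for each tier.
pub fn dred_duration_frames(tier: Tier) -> u32 {
    match tier {
        Tier::Studio => 10,
        Tier::Normal => 20,
        Tier::Degraded => 50,
        Tier::Codec2 => 0,
    }
}

/// DRED history length in milliseconds.
pub fn dred_duration_ms(tier: Tier) -> u32 {
    dred_duration_frames(tier) * DRED_FRAME_MS
}

/// Packet loss percentage handed to the encoder. While DRED is active the
/// value never drops below `floor_pct`; real measurements override upward.
pub fn effective_loss_pct(tier: Tier, measured_pct: u8, floor_pct: u8) -> u8 {
    if dred_duration_frames(tier) == 0 {
        measured_pct
    } else {
        measured_pct.max(floor_pct)
    }
}
```

A profile switch would call `dred_duration_frames` again for the new tier, which is the behavior the profile_switch_refreshes_dred_duration test name describes.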
Siavash Sameni	86526a7ad4	feat(codec): Phase 0 — swap audiopus → opusic-c + opusic-sys (libopus 1.5.2) Phase 0 of the DRED integration (docs/PRD-dred-integration.md). No behavior change: inband FEC stays ON, no DRED, same bitrate, same quality. This commit unblocks Phase 1+ by getting us onto libopus 1.5.2 where DRED lives. Rationale for going straight to a custom DecoderHandle: opusic-c::Decoder's inner `*mut OpusDecoder` pointer is pub(crate), so we cannot reach it for the Phase 3 DRED reconstruction path. Running two parallel decoders (one for audio, one for DRED) would drift because the DRED decoder wouldn't see normal decode calls. Single unified DecoderHandle over raw opusic-sys is the only correct architecture, so we build it in Phase 0 rather than rewriting opus_dec.rs twice. Changes: - Cargo.toml (workspace + wzp-codec): remove audiopus 0.3.0-rc.0, add opusic-c 1.5.5 (bundled + dred features), opusic-sys 0.6.0 (bundled), bytemuck 1. Pinned exactly for reproducible libopus 1.5.2. - opus_enc.rs: rewritten against opusic_c::Encoder. Argument order for Encoder::new swapped (Channels first). set_inband_fec(bool) now maps to InbandFec::Mode1 (the libopus 1.5 equivalent of 1.3's LBRR). encode uses bytemuck::cast_slice<i16,u16> at the &[u16] boundary. - dred_ffi.rs (new): DecoderHandle wrapping `*mut OpusDecoder` directly via opusic-sys. Owns the allocation, frees on Drop. Exposes decode, decode_lost, and a pub(crate) as_raw_ptr() for the future Phase 3 DRED reconstruction. Send+Sync justified via &mut self access discipline. - opus_dec.rs: rewritten as a thin AudioDecoder impl over DecoderHandle. Behavior identical to pre-swap. Verification (Phase 0 acceptance gates): - cargo check --workspace: clean (30 pre-existing warnings in jni_bridge.rs unrelated to this work; zero in changed files). - cargo test -p wzp-codec: 53 tests pass (50 pre-swap + 6 new: 3 in dred_ffi.rs for DecoderHandle lifecycle, 3 in opus_enc.rs for version check and roundtrip). 
- linked_libopus_is_1_5 test asserts opusic_c::version() contains "1.5" — hard signal that the swap landed correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:02:15 +04:00

425 Commits