Commit Graph

219 Commits

Author SHA1 Message Date
Siavash Sameni
9ae9441de4 fix(audio): check capture ring available before read (fixes Opus6k choppy)
Some checks failed
Mirror to GitHub / mirror (push) Failing after 32s
Build Release Binaries / build-amd64 (push) Failing after 3m58s
Partial reads from the capture ring consumed samples that were then
discarded when the send loop retried from buf[0]. For 20ms codecs this
was invisible (single Oboe burst fills 960 samples in one read), but
40ms codecs (Opus6k, 1920 samples) needed 2 bursts — the first partial
read consumed 960 real samples and threw them away.

Result: Opus6k produced ~11 frames/s instead of 25 (~44% of expected).

Fix: expose wzp_native_audio_capture_available() and check it before
reading, matching the desktop capture_ring.available() pattern. Partial
reads no longer occur because we only read when enough samples exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:46:15 +04:00
Siavash Sameni
d424515542 feat: 5-tier quality classification, QualityDirective handling, debug tap stats
Some checks failed
Mirror to GitHub / mirror (push) Failing after 31s
Build Release Binaries / build-amd64 (push) Failing after 3m49s
- Extend Tier enum from 3 to 6 levels: Studio64k/48k/32k + Good +
  Degraded + Catastrophic with asymmetric hysteresis (down:3, up:5,
  studio:10)
- Handle QualityDirective signals in both desktop and Android engines
  — relay-coordinated codec switching now works end-to-end
- Add periodic TAP STATS to debug tap: packets in/out, fan-out avg,
  seq gaps, codecs seen (every 5s)
- Mark task #2 done (ParticipantInfo in federation signals already
  implemented)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:23:48 +04:00
Siavash Sameni
ea5fc17c34 fix(relay): debug tap signal logging, dual_path test regression, PRD updates
Some checks failed
Build Release Binaries / build-amd64 (push) Failing after 3m39s
Mirror to GitHub / mirror (push) Failing after 28s
- Add log_signal() and log_event() to DebugTap for RoomUpdate,
  QualityDirective, join/leave lifecycle events (task #11)
- Fix dual_path.rs Phase 7 regression: add missing ipv6_endpoint arg
  to 3 race() call sites
- Update PRDs to reflect actual implementation status: mark adaptive
  quality, coordinated codec, P2P, network awareness, protocol analyzer
- Update PROGRESS.md with QualityDirective gap and dual_path regression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:54:52 +04:00
Siavash Sameni
d249b32ee5 test+docs: add tests for QualityDirective, ParticipantQuality; update docs
- QualityDirective signal roundtrip tests (with/without reason)
- ParticipantQuality unit tests (initial tier, degradation, weakest-link)
- Updated PROGRESS.md with desktop adaptive quality, relay coordinated
  switching, Oboe state polling entries
- Updated ARCHITECTURE.md SFU fan-out rules with QualityDirective
- Updated PRD-coordinated-codec.md with implementation status
- 312 tests passing across all modified crates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:56:46 +04:00
Siavash Sameni
22045bc5e6 feat: adaptive quality in desktop, relay quality directive, Oboe state polling
- Wire AdaptiveQualityController into desktop engine send/recv tasks
  (mirrors Android pattern: AtomicU8 pending_profile, auto-mode check)
- Wire same into Android engine send task (was only in recv before)
- QualityDirective SignalMessage variant for relay-initiated codec switch
- ParticipantQuality tracking in relay RoomManager (per-participant
  AdaptiveQualityController, weakest-link tier computation)
- Relay broadcasts QualityDirective to all participants when room-wide
  tier degrades (coordinated codec switching)
- Oboe stream state polling: poll getState() for up to 2s after
  requestStart() to ensure both streams reach Started before proceeding
  (fixes intermittent silent calls on cold start, Nothing Phone A059)

Tasks: #7, #25, #26, #31, #35

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:54:04 +04:00
Siavash Sameni
766c9df442 feat(dred): continuous DRED tuning, PMTUD, extended Opus6k window
- DredTuner: maps live network metrics (loss/RTT/jitter) to continuous
  DRED duration every ~500ms instead of discrete tier-locked values.
  Includes jitter-spike detection for pre-emptive Starlink-style boost.
- Opus6k DRED extended from 500ms to 1040ms (max libopus 1.5 supports)
- PMTUD: quinn MtuDiscoveryConfig with upper_bound=1452, 300s interval
- TrunkedForwarder respects discovered MTU (was hard-coded 1200)
- QuinnPathSnapshot exposes quinn internal stats + discovered MTU
- AudioEncoder trait: set_expected_loss() + set_dred_duration() methods
- PathMonitor: sliding-window jitter variance for spike detection
- Integrated into both Android and desktop send tasks in engine.rs
- 14 new tests (10 tuner unit + 4 encoder integration)
- Updated ARCHITECTURE.md, PROGRESS.md, PRD-dred-integration, PRD-mtu

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:38:37 +04:00
Siavash Sameni
a37c8b30fe fix(native): add missing bt_active field to stall detector config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 17:25:11 +04:00
Siavash Sameni
137fe5f084 fix(bluetooth): BT SCO mode skips 48kHz + VoiceCommunication on capture
Root cause: Oboe capture at 48kHz with InputPreset::VoiceCommunication
cannot open against a BT SCO device (only supports 8/16kHz). The stream
silently falls back to builtin mic, delivering zeros.

Fix: add bt_active flag to WzpOboeConfig. When set, capture skips
setSampleRate and setInputPreset, letting the system route to BT SCO
at its native rate. Oboe's SampleRateConversionQuality::Best resamples
to 48kHz for our ring buffers. Playout uses Usage::Media in BT mode.

New API: wzp_native_audio_start_bt() for BT mode, called from
set_bluetooth_sco(on=true). Normal audio_start() restores the
standard config when switching back to earpiece/speaker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 17:23:19 +04:00
Siavash Sameni
5dfb5b3581 fix(bluetooth): use Shared mode for Oboe + delay restart for BT route
Two fixes for BT audio silence:

1. Switch Oboe streams from Exclusive to Shared sharing mode. Exclusive
   mode bypasses Oboe's internal resampler, so opening a 48kHz stream
   against a BT SCO device (8/16kHz only) fails at the AudioPolicy
   level. Shared mode lets Oboe's resampler bridge the gap.

2. Add 500ms post-SCO delay before Oboe restart. The audio policy needs
   time to apply the bt-sco route after setCommunicationDevice returns.
   Without the delay, Oboe opens against the old device (handset).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 17:14:06 +04:00
Siavash Sameni
fd0ccf8e99 fix(bluetooth): enable Oboe sample rate conversion for BT SCO (8/16kHz)
BT SCO devices only support 8kHz or 16kHz but our Oboe streams request
48kHz. Without resampling, AudioPolicyManager rejects the input stream
("getInputProfile could not find profile for... sampling rate 48000").

Fix: add setSampleRateConversionQuality(Best) to both capture and
playout stream builders. Oboe resamples internally so our ring buffers
stay at 48kHz regardless of the hardware sample rate.

Also removes the broken setBluetoothScoOn/isBluetoothScoOn calls from
stop_bluetooth_sco — just call stopBluetoothSco() unconditionally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 17:08:48 +04:00
Siavash Sameni
a798634b3d fix(signal): add call_id to Hangup — prevents stale hangup killing new calls
Root cause: Hangup had no call_id field. The relay forwarded hangups to
ALL active calls for a user. When user A hung up call 1 and user B
immediately placed call 2, the relay's processing of A's hangup would
also kill call 2 (race window ~1-2s).

Fix: add optional call_id to Hangup (backwards-compatible via serde
skip_serializing_if). When present, the relay only ends the named call.
Old clients send call_id=None and get the legacy broadcast behavior.

Also: clear pending_path_report in Hangup recv handler and
internal_deregister to prevent stale oneshot channels from blocking
subsequent call setups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 16:39:21 +04:00
Siavash Sameni
4c1ad841e1 feat(android): Bluetooth audio routing + network change detection + per-arch APK builds
Bluetooth: wire existing AudioRouteManager SCO support through both app
variants. Replace binary speaker toggle with 3-way route cycling
(Earpiece → Speaker → Bluetooth). Tauri side adds JNI bridge functions
(start/stop/query SCO, device availability) and Oboe stream restart.

Network awareness: integrate Android ConnectivityManager to detect
WiFi/cellular transitions and feed them to AdaptiveQualityController
via lock-free AtomicU8 signaling. Enables proactive quality downgrade
and FEC boost on network handoffs.

Build: add --arch flag to build-tauri-android.sh supporting arm64,
armv7, or all (separate per-arch APKs for smaller tester binaries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 16:07:41 +04:00
Siavash Sameni
29cd23fe39 fix(p2p): connection cleanup — 4 fixes for stale/dead connections
PRD 4: Disable IPv6 direct dial/accept temporarily. IPv6 QUIC
handshakes succeed but connections die immediately on datagram
send ("connection lost"). IPv4 candidates work reliably. IPv6
candidates still gathered but filtered at dial time.

PRD 1: Close losing transport after Phase 6 negotiation. The
non-selected transport now gets an explicit QUIC close frame
instead of silently dropping after 30s idle timeout. Prevents
phantom connections from polluting future accept() calls.

PRD 2: Harden accept loop with max 3 stale retries. Stale
connections are explicitly closed (conn.close) and counted.
After 3 stale connections, the accept loop aborts instead of
spinning until the race timeout.

PRD 3: Resource cleanup — close old IPv6 endpoint before
creating a new one in place_call/answer_call. Add Drop impl
to CallEngine so tasks are signalled to stop on ungraceful
shutdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 15:11:50 +04:00
Siavash Sameni
4d66d3769d fix(relay): set peer_relay_fp on originating relay when answer arrives
The originating relay (where the caller is) never set peer_relay_fp
because the call was created locally. When the callee's answer
arrived via federation, the cross-relay dispatcher handled it but
didn't mark the call as cross-relay. This meant the caller's
MediaPathReport was delivered via local hub.send_to() to a peer
fingerprint that isn't connected locally — silently dropped.

Fix: in the cross-relay answer dispatcher, call
reg.set_peer_relay_fp(call_id, Some(origin_relay_fp)) so the
originating relay knows to forward MediaPathReport via federation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 14:49:34 +04:00
Siavash Sameni
002df15c5e fix(cli): add .. rest pattern for RegisterPresenceAck error arm
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 14:32:57 +04:00
Siavash Sameni
1eb82d77b8 feat(relay+client): relay reports build version in Ack
Add relay_build field to RegisterPresenceAck so the client logs
which relay version it connected to. Shows in the debug log as
register_signal:ack_received {"relay_build":"f843a93"}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 14:27:58 +04:00
Siavash Sameni
f843a934fe fix(relay): forward MediaPathReport across federation
MediaPathReport was only delivered via local signal_hub, so calls
between peers on different relays always hit peer_report_timeout
and fell back to relay — even when direct P2P worked perfectly.

Fix: check peer_relay_fp in call_registry (same pattern as
DirectCallAnswer). If the peer is on a remote relay, wrap in
FederatedSignalForward and send via federation link. Also fix
the cross-relay dispatcher to deliver to BOTH caller and callee
(not just caller), since the report can come from either side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 14:14:30 +04:00
Siavash Sameni
1904b19d05 fix(direct): validate A-role accepted connection, skip stale ones
The Acceptor's accept() on the shared signal endpoint can dequeue
a stale QUIC connection from a previous call that the Dialer has
already dropped. This results in "connection lost" errors when
media datagrams are sent — 100% drops on both sides.

Fix: after accepting a connection, check close_reason(). If the
connection is already closed, log a warning and re-accept. Also
verify max_datagram_size() is available before returning.

Additionally: emit transport details (remote addr, max_datagram,
close_reason) in the call_engine_starting debug event so stale
connection issues are visible in the user-facing debug log.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 13:50:21 +04:00
Siavash Sameni
40955bd11c debug(media): add connection diagnostics for direct P2P drops
When direct P2P calls show 100% datagram drops, we need to know
WHY send_media() fails. This commit adds:

- Remote address + stable_id logging on A-role accept and D-role
  dial success (dual_path.rs) — tells us which candidate won
- Remote address + max_datagram_size on engine transport init —
  verifies datagrams are negotiated
- last_send_err in send heartbeat — captures the actual error
  from send_datagram() failures
- QuinnTransport::remote_address() helper

Also fixes UI badge: was looking for wrong event name
("dual_path_race_won" → "path_negotiated").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 13:29:58 +04:00
Siavash Sameni
0b62d3e22f fix(cli): add missing build_version fields to Offer/Answer
CLI binary was missing the new caller_build_version and
callee_build_version fields, causing E0063 compile errors on
Linux relay/client builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 13:09:26 +04:00
Siavash Sameni
bd6733b2e5 feat(signal): advertise build version in Offer/Answer
Add caller_build_version / callee_build_version (git short hash)
to DirectCallOffer and DirectCallAnswer so peers can identify each
other's build in debug logs. Also log own build at register time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 12:43:55 +04:00
Siavash Sameni
7d1b8f1fdc fix(android): add missing CallSetup pattern fields (.. rest)
The CallSetup enum gained peer_direct_addr and peer_local_addrs
in Phase 5.5 but the wzp-android signal recv match arm was never
updated, breaking cargo ndk builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 12:09:44 +04:00
Siavash Sameni
c2d298beb5 feat(net): Phase 7 — dual-socket IPv4+IPv6 ICE
Adds a dedicated IPv6 QUIC endpoint (IPV6_V6ONLY=1 via socket2)
alongside the existing IPv4 signal endpoint for proper dual-stack
P2P connectivity. Previous [::]:0 dual-stack attempt broke IPv4
on Android; this uses separate sockets per address family like
WebRTC/libwebrtc.

- create_ipv6_endpoint(): socket2-based IPv6-only UDP socket,
  tries same port as IPv4 signal EP, falls back to ephemeral
- local_host_candidates(v4_port, v6_port): now gathers IPv6
  global-unicast (2000::/3) and unique-local (fc00::/7) addrs
- dual_path::race(): A-role accepts on both v4+v6 via select!,
  D-role routes each candidate to matching-AF endpoint
- Graceful fallback: if IPv6 unavailable, .ok() → None → pure
  IPv4 behavior identical to pre-Phase-7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 11:54:13 +04:00
Siavash Sameni
aee41a638d fix(audio+net): revert dual-stack [::]:0, add Oboe playout stall auto-restart
Two fixes:

## Revert [::]:0 dual-stack sockets → back to 0.0.0.0:0

Android's IPV6_V6ONLY=1 default on some kernels (confirmed on
Nothing Phone) makes [::]:0 IPv6-only, silently killing ALL
IPv4 traffic. This broke P2P direct calls: IPv4 LAN candidates
(172.16.81.x) couldn't complete QUIC handshakes through the
IPv6-only socket, causing local_direct_ok=false and relay
fallback on every call after the first.

Reverted all bind sites to 0.0.0.0:0 (reliable IPv4). IPv6 host
candidates are disabled in local_host_candidates() until a
proper dual-socket approach (one IPv4 + one IPv6 endpoint,
Phase 7) is implemented.

## Fix A (task #35): Oboe playout callback stall auto-restart

The Nothing Phone's Oboe playout callback fires once (cb#0) and
then stops draining the ring on ~50% of cold-launch calls. Fix
D+C (stop+prime from previous commit) didn't help because
audio_stop is a no-op on cold launch.

New approach: self-healing watchdog in audio_write_playout.
Tracks the playout ring's read_idx across writes. If read_idx
hasn't advanced in 50 consecutive writes (~1 second), the Oboe
playout callback has stopped:

1. Log "playout STALL detected"
2. Call wzp_oboe_stop() to tear down the stuck streams
3. Clear both ring buffers (prevent stale data reads)
4. Call wzp_oboe_start() to rebuild fresh streams
5. Log success/failure
6. Return 0 (caller retries on next frame)

This is the same teardown+rebuild that "rejoin" does — but
triggered automatically from the first stalled call instead of
requiring the user to hang up and redial. The watchdog runs
on every write so it fires within 1s of the stall starting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 11:24:16 +04:00
Siavash Sameni
9fb92967eb fix(net): bind all endpoints to [::]:0 for dual-stack IPv4+IPv6
Every QUIC endpoint was bound to 0.0.0.0:0 (IPv4-only). This
silently killed ALL IPv6 host candidates: the Dialer couldn't
send packets to [2a0d:...] addresses (wrong address family on
the socket), and the Acceptor couldn't receive incoming IPv6
QUIC handshakes. The IPv6 candidates were gathered and advertised
in DirectCallOffer/Answer but were completely non-functional.

On same-LAN with dual-stack (which both test phones have), this
meant:
- JoinSet fanned out 3+ candidates (2× IPv6 + 1× IPv4)
- IPv6 dials failed silently or timed out
- IPv4 dial worked but competed with failed IPv6 for JoinSet
  attention
- Sometimes the JoinSet returned an IPv6 failure before the
  IPv4 success, causing unnecessary fallback to relay

Fix: bind to [::]:0 (IPv6 any) instead of 0.0.0.0:0. On
dual-stack systems (Linux/Android default), [::]:0 creates a
socket that handles BOTH:
- IPv6 natively (global unicast, ULA)
- IPv4 via v4-mapped addresses (::ffff:172.16.81.x)

One socket, both protocols. All 7 bind sites updated:
- register_signal (signal endpoint)
- do_register_signal
- ping_relay
- probe_reflect_addr (fresh endpoint fallback)
- dual_path::race (A-role fresh, D-role fresh, relay fresh)

With this fix, same-LAN P2P should prefer the IPv6 path (no
NAT, direct routing, lower latency) and fall through to IPv4
if IPv6 fails — relay is the last resort after ALL candidates
are exhausted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 11:09:06 +04:00
Siavash Sameni
f5542ef822 feat(p2p): Phase 6 — ICE-style path negotiation
Before Phase 6, each side's dual-path race ran independently and
committed to whichever transport completed first. When one side
picked Direct and the other picked Relay, they sent media to
different places — TX > 0 RX: 0 on both, completely silent call.

Phase 6 adds a negotiation step: after the local race completes,
each side sends a MediaPathReport { call_id, direct_ok, winner }
to the peer through the relay. Both wait for the other's report
before committing a transport to the CallEngine. The decision
rule is simple: if BOTH report direct_ok = true, use direct; if
EITHER reports false, BOTH use relay.

## Wire protocol

New `SignalMessage::MediaPathReport { call_id, direct_ok,
race_winner }`. The relay forwards it to the call peer via the
same signal_hub routing used for DirectCallOffer/Answer. The
cross-relay dispatcher also forwards it.

## dual_path::race restructured

Returns `RaceResult` instead of `(Arc<QuinnTransport>, WinningPath)`:
- `direct_transport: Option<Arc<QuinnTransport>>`
- `relay_transport: Option<Arc<QuinnTransport>>`
- `local_winner: WinningPath`

Both paths are run as spawned tasks. After the first completes,
a 1s grace period lets the loser also finish. The connect
command gets BOTH transports (when available) and picks the
right one based on the negotiation outcome. The unused transport
is dropped.

## connect command flow (revised)

1. Run race() → RaceResult with both transports
2. Send MediaPathReport to relay with our direct_ok
3. Install oneshot; wait for peer's report (3s timeout)
4. Decision: both direct_ok → use direct; else → use relay
5. Start CallEngine with the agreed transport

If the peer never responds (old build, timeout), falls back to
relay — backward compatible.

## Relay forwarding

MediaPathReport is forwarded like DirectCallOffer/Answer: via
signal_hub.send_to(peer_fp) for same-relay calls, and via
cross-relay dispatcher for federated calls.

## Debug log events

- `connect:dual_path_race_done` — local race result
- `connect:path_report_sent` — our report to the peer
- `connect:peer_report_received` — peer's report
- `connect:peer_report_timeout` — peer didn't respond (3s)
- `connect:path_negotiated` — final agreed path with reasons

Full workspace test: 423 passing (no regressions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 10:03:42 +04:00
Siavash Sameni
026940d492 fix(federation): diagnostic logging for cross-relay media routing
Added warn-level log in handle_datagram when a federation
datagram arrives but no matching local room is found. Prints:
- room_hash (8-byte tag from the datagram)
- active_rooms (all rooms the relay currently has)
- seq + peer label

This diagnoses the cross-relay recv_fr=0 issue: if media IS
arriving from the peer relay but the room hash doesn't match any
active room, the log tells us exactly what hash is expected vs
what rooms exist locally. If no datagram log fires at all, the
issue is upstream (peer relay not forwarding, federation link
down, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:27:34 +04:00
Siavash Sameni
6cd61fc63b feat(federation): Phase 4.1 — call-* rooms are implicitly global
All rooms with names starting with 'call-' are now treated as
global rooms by the federation pipeline. This enables relay-
mediated media fallback for cross-relay direct calls: when Alice
on Relay A and Bob on Relay B both join the same call-<id> room,
the federation media forwarding pipeline (GlobalRoomActive
announcements + datagram forwarding + presence replication)
kicks in automatically without any runtime registration step.

Previously, cross-relay direct calls that couldn't go P2P
(symmetric NAT on either side) failed with "no media path"
because the call-<id> room wasn't in the configured global_rooms
set and media datagrams weren't forwarded across the federation
link.

The relay's existing ACL for call-* rooms (only the two
authorized fingerprints from the call registry can join)
prevents random clients from creating or eavesdropping on
call rooms.

## Changes

### `is_global_room` (federation.rs)
Added `room.starts_with("call-")` check before the static
global_rooms set lookup. Returns true immediately for any
call-prefixed room.

### `resolve_global_room` (federation.rs)
Return type changed from `Option<&str>` to `Option<String>`
(owned) because call-* room names aren't stored on `self` —
they come from the caller and resolve to themselves as the
canonical name. The 13 callers continue to work via String/&str
auto-deref; 4 HashMap lookups needed explicit `.as_str()` or
`&` borrows.

Full workspace test: 423 passing (no regressions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 08:55:01 +04:00
Siavash Sameni
16793be36f fix(p2p): Phase 5.6 — direct-path head start + hangup propagation + media debug events
Three fixes from a field-test log where same-LAN calls were
still losing the dual-path race to the relay path, peers were
getting stuck on an empty call screen when the other side
hung up, and 1-way audio was hard to diagnose because the
GUI debug log had no media-level events.

## 1. Direct-path 500ms head start (dual_path.rs)

The race was resolving in ~105ms with Relay winning even when
both phones were on the same MikroTik LAN with valid IPv6 host
candidates. Root cause: the relay dial is a plain outbound QUIC
connect that completes in whatever the client→relay RTT is
(~100ms), while the direct path needs the PEER to also process
its CallSetup, spin up its own race, and complete at least one
LAN dial back to us. That cross-client sequence reliably takes
longer than 100ms, so relay always won.

Fix: delay the relay_fut with `tokio::time::sleep(500ms)` before
starting its connect. Same-LAN direct dials complete in 30-50ms
typically, so the head start gives direct plenty of time to win
cleanly. Users on setups where direct genuinely can't work
(LTE-to-LTE cross-carrier) pay 500ms extra on the relay fallback,
which is invisible for a call setup.

## 2. Hangup propagation via a new hangup_call command (lib.rs + main.ts)

The hangup button was calling `disconnect` which stopped the
local media engine but never sent a SignalMessage::Hangup to
the relay. The peer never got notified and was stuck on the
call screen with silent audio. My earlier fix (commit e75b045)
only handled the RECEIVE side — auto-dismiss call screen on
recv:Hangup — but the SEND side was still missing.

New Tauri command `hangup_call`:
  1. Acquire state.signal.lock(), send SignalMessage::Hangup
     over the signal transport (best-effort; log + continue if
     signal is down)
  2. Acquire state.engine.lock(), stop the CallEngine

JS hangupBtn click handler now calls hangup_call with a fallback
to raw disconnect if the command is missing (older builds).

## 3. Media debug events (engine.rs + lib.rs)

Threaded tauri::AppHandle into CallEngine::start so the send/
recv tasks can emit call-debug events when the user has debug
logs enabled. Added on the Android branch (desktop branch
accepts the arg for API symmetry but doesn't emit yet):

  - media:first_send — emitted when the first encoded frame is
    handed to the transport. Useful for 1-way audio diagnosis:
    if this fires on side A but side B never sees media:first_recv,
    A's outbound is broken.
  - media:first_recv — emitted when the first packet from the
    peer arrives. Mirror of first_send.
  - media:send_heartbeat — every 2s with frames_sent, last_rms,
    last_pkt_bytes, short_reads, drops. A stalled last_rms
    (== 0) tells you the mic isn't producing samples; a frozen
    frames_sent tells you the encode pipeline hung.
  - media:recv_heartbeat — every 2s with recv_fr, decoded_frames,
    last_written, written_samples, decode_errs, codec. Mirror
    invariants for the inbound direction.

All four are gated by `call_debug_logs_enabled()` via
`emit_call_debug`, so they only show up in the GUI log when the
user has the Call Flow Debug Logs checkbox on. Tracing::info!
still runs unconditionally so logcat (adb) keeps its copy
regardless.

The `emit_call_debug` fn in lib.rs is now `pub(crate)` so
engine.rs can call it via `crate::emit_call_debug`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 07:55:41 +04:00
Siavash Sameni
fa038df057 feat(p2p): Phase 5.5 — ICE LAN host candidates (IPv4 + IPv6)
Same-LAN P2P was failing because MikroTik masquerade (like most
consumer NATs) doesn't support NAT hairpinning — the advertised
WAN reflex addr is unreachable from a peer on the same LAN as
the advertiser. Phase 5 got us Cone NAT classification and fixed
the measurement artifact, but same-LAN direct dials still had
nowhere to land.

Phase 5.5 adds ICE-style host candidates: each client enumerates
its LAN-local network interface addresses, includes them in the
DirectCallOffer/Answer alongside the reflex addr, and the
dual-path race fans out to ALL peer candidates in parallel.
Same-LAN peers find each other via their RFC1918 IPv4 + ULA /
global-unicast IPv6 addresses without touching the NAT at all.

Dual-stack IPv6 is in scope from the start — on modern ISPs
(including Starlink) the v6 path often works even when v4
hairpinning doesn't, because there's no NAT on the v6 side.

## Changes

### `wzp_client::reflect::local_host_candidates(port)` (new)

Enumerates network interfaces via `if-addrs` and returns
SocketAddrs paired with the caller's port. Filters:

- IPv4: RFC1918 (10/8, 172.16/12, 192.168/16) + CGNAT (100.64/10)
- IPv6: global unicast (2000::/3) + ULA (fc00::/7)
- Skipped: loopback, link-local (169.254, fe80::), public v4
  (already covered by reflex-addr), unspecified

Safe from any thread, one `getifaddrs(3)` syscall.

### Wire protocol (wzp-proto/packet.rs)

Three new `#[serde(default, skip_serializing_if = "Vec::is_empty")]`
fields, backward-compat with pre-5.5 clients/relays by
construction:

- `DirectCallOffer.caller_local_addrs: Vec<String>`
- `DirectCallAnswer.callee_local_addrs: Vec<String>`
- `CallSetup.peer_local_addrs: Vec<String>`

### Call registry (wzp-relay/call_registry.rs)

`DirectCall` gains `caller_local_addrs` + `callee_local_addrs`
Vec<String> fields. New `set_caller_local_addrs` /
`set_callee_local_addrs` setters. Follow the same pattern as
the reflex addr fields.

### Relay cross-wiring (wzp-relay/main.rs)

Both the local-call and cross-relay-federation paths now track
the local_addrs through the registry and inject them into the
CallSetup's peer_local_addrs. Cross-wiring is identical to the
existing peer_direct_addr logic — each party's CallSetup
carries the OTHER party's LAN candidates.

### Client side (desktop/src-tauri/lib.rs)

- `place_call`: gathers local host candidates via
  `local_host_candidates(signal_endpoint.local_addr().port())`
  and includes them in `DirectCallOffer.caller_local_addrs`.
  The port match is critical — it's the Phase 5 shared signal
  socket, so incoming dials to these addrs land on the same
  endpoint that's already listening.
- `answer_call`: same, AcceptTrusted only (privacy mode keeps
  LAN addrs hidden too, for consistency with the reflex addr).
- `connect` Tauri command: new `peer_local_addrs: Vec<String>`
  arg. Builds a `PeerCandidates` bundle and passes it to the
  dual-path race.
- Recv loop's CallSetup handler: destructures + forwards the
  new field to JS via the signal-event payload.

### `dual_path::race` (wzp-client/dual_path.rs)

Signature change: takes `PeerCandidates` (reflex + local Vec)
instead of a single SocketAddr. The D-role branch now fans out
N parallel dials via `tokio::task::JoinSet` — one per candidate
— and the first successful dial wins (losers are aborted
immediately via `set.abort_all()`). Only when ALL candidates
have failed do we return Err; individual candidate failures are
just traced at debug level and the race waits for the others.

LAN host candidates are tried BEFORE the reflex addr in
`PeerCandidates::dial_order()` — they're faster when they work,
and the reflex addr is the fallback for the not-on-same-LAN
case.

### JS side (desktop/main.ts)

`connect` invoke now passes `peerLocalAddrs: data.peer_local_addrs ?? []`
alongside the existing `peerDirectAddr`.

### Tests

All existing test callsites updated for the new Vec<String>
fields (defaults to Vec::new() in tests — they don't exercise
the multi-candidate path). `dual_path.rs` integration tests
wrap the single `dead_peer` / `acceptor_listen_addr` in a
`PeerCandidates { reflexive: Some(_), local: Vec::new() }`.

Full workspace test: 423 passing (same as before 5.5).

## Expected behavior on the reporter's setup

Two phones behind MikroTik, both on the same LAN:

  place_call:host_candidates {"local_addrs": ["192.168.88.21:XXX", "2001:...:YY:XXX"]}
  recv:DirectCallAnswer {"callee_local_addrs": ["192.168.88.22:ZZZ", "2001:...:WW:ZZZ"]}
  recv:CallSetup {"peer_direct_addr":"150.228.49.65:NN",
                  "peer_local_addrs":["192.168.88.22:ZZZ","2001:...:WW:ZZZ"]}
  connect:dual_path_race_start {"peer_reflex":"...","peer_local":[...]}
  dual_path: direct dial succeeded on candidate 0   ← LAN v4 wins
  connect:dual_path_race_won {"path":"Direct"}

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 07:34:49 +04:00
Siavash Sameni
1618ff6c9d feat(p2p): Phase 5 — single-socket architecture (Nebula-style)
Before Phase 5 WarzonePhone used THREE separate UDP sockets per
client:

  1. Signal endpoint         (register_signal, client-only)
  2. Reflect probe endpoints (one fresh socket per relay probe)
  3. Dual-path race endpoint (fresh per call setup)

This broke two things in production on port-preserving NATs
(MikroTik masquerade, most consumer routers):

  a. Phase 2 NAT detection was WRONG. Each probe used a fresh
     internal port, so MikroTik mapped each one to a different
     external port, and the classifier saw "different port per
     relay" and labeled it SymmetricPort. The real NAT was
     cone-like but measurement via fresh sockets hid that.

  b. Phase 3.5 dual-path P2P race was BROKEN. The reflex addr
     we advertised in DirectCallOffer was observed by the signal
     endpoint's socket. The actual dual-path race listened on a
     DIFFERENT fresh socket, on a different internal (and
     therefore external) port. Peers dialed the advertised addr
     and hit MikroTik's mapping for the signal socket, which
     forwarded to the signal endpoint — a client-only endpoint
     that doesn't accept incoming connections. Direct path
     silently failed, relay always won the race.

Nebula-style fix: one socket for everything. The signal endpoint
is now dual-purpose (client + server_config), and both the
reflect probes and the dual-path race reuse it instead of
creating fresh ones. MikroTik's port-preservation then gives us
a stable external port across all flows → classifier correctly
sees Cone NAT → advertised reflex addr is the actual listening
port → direct dials from peers land on the right socket →
`endpoint.accept()` in the A-role branch of the dual-path race
picks up the incoming connection.

## Changes

### `register_signal` (desktop/src-tauri/src/lib.rs)
- Endpoint now created with `Some(server_config())` instead of
  `None`. The socket can now accept incoming QUIC connections as
  well as dial outbound.
- Every code path that previously read `sig.endpoint` for the
  relay-dial reuse benefits automatically — same socket is now
  ALSO listening for peer dials.

### `probe_reflect_addr` (wzp-client/src/reflect.rs)
- New `existing_endpoint: Option<Endpoint>` arg. `Some` reuses
  the caller's socket (production: pass the signal endpoint).
  `None` creates a fresh one (tests + pre-registration).
- Removed the `drop(endpoint)` at the end — was correct for
  fresh endpoints (explicit early socket close) but incorrect
  for shared ones. End-of-scope drop does the right thing in
  both cases via Arc semantics.

### `detect_nat_type` (wzp-client/src/reflect.rs)
- New `shared_endpoint: Option<Endpoint>` arg, forwarded to
  every probe in the JoinSet fan-out. One shared socket means
  the classifier sees the true NAT type.

### `detect_nat_type` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` and passes it as the shared
  endpoint. Falls back to None when not registered. NAT detection
  now produces accurate classifications against MikroTik / most
  consumer NATs.

### `dual_path::race` (wzp-client/src/dual_path.rs)
- New `shared_endpoint: Option<Endpoint>` arg.
- A-role: when `Some`, reuses it for `accept()`. This is the
  critical change — the reflex addr advertised to peers is now
  the address listening for incoming direct dials.
- D-role: when `Some`, reuses it for the outbound direct dial.
  MikroTik keeps the same external port for the dial as for
  the signal flow → direct dial through a cone-mapped NAT.
- Relay path: also reuses the shared endpoint so MikroTik has
  a single consistent mapping across the whole call (saves one
  extra external port and makes firewall traces cleaner).
- When `None`, falls back to fresh per-role endpoints as before.

### `connect` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` once when acquiring own reflex
  addr and passes it through to `dual_path::race`.

### Tests
- `wzp-client/tests/dual_path.rs` and
  `wzp-relay/tests/multi_reflect.rs` updated to pass `None` for
  the new endpoint arg — tests use fresh sockets and that's
  fine because the loopback harness doesn't care about
  port-preserving NAT behavior.

Full workspace test: 423 passing (no regressions).

## Expected behavior after this commit on real hardware

Behind MikroTik + Starlink-bypass (the reporter's setup):
- Phase 2 NAT detect → **Cone NAT** (was SymmetricPort — false
  positive from the measurement artifact)
- Phase 3.5 direct-P2P dial → succeeds for both cone-cone and
  cone-CGNAT cases where the remote side was previously blocked
  by our own socket mismatch
- LTE ↔ LTE cross-carrier → still likely relay fallback; that's
  genuinely strict symmetric and needs Phase 5.5 port prediction.

## Phase 5.5 (next, separate PRD)

Multi-candidate port prediction + ICE-style candidate aggregation
for truly strict symmetric NATs. Not needed for the 95% case —
Phase 5 alone fixes most consumer-router setups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 19:47:20 +04:00
Siavash Sameni
00deb97a5d fix(reflect): drop LAN/private reflex addrs from NAT classification
Real-world report: a user with one LAN relay + one internet relay
got "Multiple IPs — treating as symmetric" because the LAN relay
saw the client's LAN IP (172.16.81.172) while the internet relay
saw the WAN IP (150.228.49.65). Two observations of "different
public IPs" from the classifier's perspective, but semantically
they describe two different network paths and shouldn't be
compared.

The LAN relay's reflection is always true, just not useful for
public NAT classification: there's no NAT between the client and
the LAN relay, so that path's reflex addr is always the LAN
interface IP regardless of what the public-facing NAT beyond it
looks like.

Fix: new `is_private_or_loopback` helper filters the probe set
before classification. Drops:
 - 127.0.0.0/8 loopback
 - 10/8, 172.16/12, 192.168/16 RFC1918 private
 - 169.254/16 link-local
 - 100.64/10 CGNAT shared-transition (same reasoning: a relay
   that sees the client with a CGNAT addr is on the same carrier
   network and can't describe public NAT state)
 - IPv6 loopback, unspecified, fe80::/10 link-local

Failed probes still filtered out of classification (they were
already) but now dimmed in the UI list instead of highlighted
amber. Same rationale: a momentarily-offline probe target isn't
a warning-worthy state, it's just a fact about the probe run.

UI palette rebalance: only Cone gets green, everything else
neutral text-dim. Wording changed from warning-tone
"⚠ must use relay" to informational "ℹ P2P falls back to relay,
calls still work" — symmetric NAT isn't broken state, it just
means media takes the relay path.

Tests added (4 new in wzp_client::reflect):
- classify_drops_private_ip_probes — LAN + public → Unknown
- classify_drops_loopback_probes — loopback + 2 public → Cone
- classify_drops_cgnat_probes — CGNAT + 2 public same-IP-
  diff-port → SymmetricPort
- classify_two_lan_probes_is_unknown_not_cone — all LAN → Unknown

Existing multi_reflect integration test updated: two loopback
relays now correctly classify as Unknown (because loopback reflex
addrs are filtered) with the plumbing-works invariant preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 18:29:09 +04:00
Siavash Sameni
da08723fe7 fix(signal): forward-compat — log+continue on unknown SignalMessage variants
Both sides of the signal channel previously broke their recv loop
on any deserialize error, which meant adding a new variant in one
build silently killed signal connections from peers running an
older build. This bit us during Phase 1 testing: a new client
sending SignalMessage::Reflect to a pre-Phase-1 relay caused the
relay to drop the whole signal connection, which looked like
"Error: not registered" on the next place_call.

Fix:
- New TransportError::Deserialize(String) variant in wzp-proto
  carries serde errors as a distinct category.
- wzp-transport/reliable.rs::recv_signal returns Deserialize on
  serde_json::from_slice failures (was wrapped in Internal).
- wzp-relay/main.rs signal loop matches on Deserialize → warn +
  continue (instead of break).
- desktop/src-tauri/lib.rs recv loop does the same.

Other TransportError variants (ConnectionLost, Io, Internal) still
break the loop — only pure parse failures are recoverable.

This means future SignalMessage variant additions are backward-
compat by construction: older peers will see "unknown variant,
continuing" in their logs while newer peers can keep evolving the
protocol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 18:13:31 +04:00
Siavash Sameni
8cdf8d486a feat(p2p): Phase 4 cross-relay direct calling over federation
Teaches the relay pair to route direct-call signaling across an
existing federation link. Alice on Relay A can now place a direct
call to Bob on Relay B if A and B are federation peers — the
wire protocol, call registry, and signal dispatch all learn to
track and route the cross-relay flow.

Phase 3.5's dual-path QUIC race then carries the media directly
peer-to-peer using the advertised reflex addrs, with zero
changes needed on the client side.

## Wire protocol (wzp-proto)

New `SignalMessage::FederatedSignalForward { inner, origin_relay_fp }`
envelope variant, appended at end of enum — JSON serde is
name-tagged so pre-Phase-4 relays just log "unknown variant" and
drop it. 2 new roundtrip tests (any-inner nesting + single
DirectCallOffer case).

## Call registry (wzp-relay)

`DirectCall.peer_relay_fp: Option<String>` — federation TLS fp
of the peer relay that forwarded the offer/answer for this call.
`None` on local calls, `Some` on cross-relay. Used by the answer
path to route the reply back through the same federation link
instead of trying (and failing) to deliver via local signal_hub.
New `set_peer_relay_fp` setter + 1 new unit test.

## FederationManager (wzp-relay)

Three new methods:
- `local_tls_fp()` — exposes the relay's own federation TLS fp
  so main.rs can build `origin_relay_fp` fields.
- `broadcast_signal(msg) -> usize` — fan out any signal message
  (in practice `FederatedSignalForward`) to every active peer
  link, returning the reach count. Used when Relay A doesn't
  know which peer has the target fingerprint.
- `send_signal_to_peer(fp, msg)` — targeted send for the reply
  path where the registry already knows which peer relay to
  hit.

Plus a new `cross_relay_signal_tx: Mutex<Option<Sender<...>>>`
field that `set_cross_relay_tx()` wires at startup so the
federation `handle_signal` can push unwrapped inner messages
into the main signal dispatcher.

## Federation handle_signal (wzp-relay)

New match arm for `FederatedSignalForward`:
- Loop prevention: drops forwards whose `origin_relay_fp` equals
  this relay's own fp (prevents A→B→A echo loops without needing
  TTL yet).
- Otherwise pulls the inner message out and pushes it through
  `cross_relay_signal_tx` so the main loop's dispatcher task
  handles it as if it had arrived locally.

## Main signal loop (wzp-relay)

### DirectCallOffer when target not local
Before falling through to Hangup, try the federation path:
- Wrap the offer in `FederatedSignalForward` with
  `origin_relay_fp = this relay's tls_fp`
- `fm.broadcast_signal(forward)` — returns peer count
- If any peers reached, stash the call in local registry with
  `caller_reflexive_addr` set, `peer_relay_fp` still None
  (broadcast — the answer-side will identify itself when it
  replies)
- Send `CallRinging` to caller immediately for UX feedback
- Only if no federation or no peers → legacy Hangup path

### DirectCallAnswer when peer is remote
- Registry lookup now reads both `peer_fingerprint` and
  `peer_relay_fp` in one acquisition
- If `peer_relay_fp.is_some()`:
  * Reject → forward a `Hangup` over federation via
    `send_signal_to_peer` instead of local signal_hub
  * Accept → wrap the raw answer in `FederatedSignalForward`,
    route to the specific origin peer, then emit the LOCAL
    CallSetup to our callee with `peer_direct_addr =
    caller_reflexive_addr` (caller is remote; this side only
    has the callee)
- If `peer_relay_fp.is_none()` → existing Phase 3 same-relay
  path with both CallSetups (caller + callee)

### Cross-relay signal dispatcher task
New long-running task reading `(inner, origin_relay_fp)` from
`cross_relay_rx`. In Phase 4 MVP handles:
- `DirectCallOffer` — if target is local, create the call in
  the registry with `peer_relay_fp = origin_relay_fp`, stash
  caller addr, deliver offer to local callee. If target isn't
  local, drop (no multi-hop in Phase 4 MVP).
- `DirectCallAnswer` — look up local caller by call_id, stash
  callee addr, forward raw answer to local caller via
  signal_hub, emit local CallSetup with `peer_direct_addr =
  callee_reflexive_addr` (peer is local now; this side only
  has the caller).
- `CallRinging` — best-effort forward to local caller for UX.
- `Hangup` — logged for now; Phase 4.1 will target by call_id.

## Integration tests

`crates/wzp-relay/tests/cross_relay_direct_call.rs` — 3 tests
that reproduce the main.rs cross-relay dispatcher logic inline
and assert the invariants without spinning up real binaries:

1. `cross_relay_offer_forwards_and_stashes_peer_relay_fp` —
   Relay A gets Alice's offer, broadcasts. Relay B's dispatcher
   creates the call with `peer_relay_fp = relay_a_tls_fp`.
2. `cross_relay_answer_crosswires_peer_direct_addrs` — full
   round trip; both CallSetups (one on each relay) carry the
   OTHER party's reflex addr.
3. `cross_relay_loop_prevention_drops_self_sourced_forward` —
   explicit loop-prevention check.

Full workspace test goes from 413 → 419 passing. Clippy clean
on touched files.

## Non-goals (deferred to Phase 4.1+)

- Relay-mediated media fallback across federation — if P2P
  direct fails (symmetric NAT on either side), the call errors
  out with "no media path". Making the existing federation
  media pipeline carry ephemeral call-<id> rooms is the Phase
  4.1 lift.
- Multi-hop federation (A → B → C). Phase 4 MVP supports a
  direct federation link between A and B only.
- Fingerprint → peer-relay routing gossip.

PRD: .taskmaster/docs/prd_phase4_cross_relay_p2p.txt
Tasks: 70-78 all completed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 17:31:43 +04:00
Siavash Sameni
59ce52f8e8 feat(p2p): Phase 3.5 dual-path QUIC race + GUI call-flow debug logs
Two features in one commit because they ship and test together:
Phase 3.5 closes the hole-punching loop and the call-flow debug
logs give the user live visibility into every step of a call so
real-hardware testing of the new P2P path is debuggable.

## Phase 3.5 — dual-path QUIC connect race

Completes the hole-punching work Phase 3 scaffolded. On receiving
a CallSetup with peer_direct_addr, the client now actually races a
direct QUIC handshake against the relay dial and uses whichever
completes first. Symmetric role assignment avoids the two-conns-
per-call problem:

- Both peers compare `own_reflex_addr` vs `peer_reflex_addr`
  lexicographically.
- Smaller addr → **Acceptor** (A-role): builds a server-capable
  dual endpoint, awaits an incoming QUIC session. Does NOT dial.
- Larger addr → **Dialer** (D-role): builds a client-only
  endpoint, dials the peer's addr with `call-<id>` SNI. Does NOT
  listen.
- Both sides always dial the relay in parallel as fallback.
- `tokio::select!` with `biased` preference for direct, `tokio::pin!`
  so each branch can await the losing opposite as fallback.
- Direct timeout 2s, relay fallback timeout 5s (so 7s worst case
  from CallSetup to "no media path" error).

New crate module `wzp_client::dual_path::{race, WinningPath}`
(moved here from desktop/src-tauri so it's testable from a
workspace test). `determine_role` in `wzp_client::reflect` is
pure-function and unit-tested.

### CallEngine integration
- New `pre_connected_transport: Option<Arc<QuinnTransport>>` arg
  on both android + desktop `CallEngine::start` branches. Skips
  the internal wzp_transport::connect step when Some. Backward-
  compat: None keeps Phase 0 relay-only behavior.
- `connect` Tauri command reads own_reflex_addr from SignalState,
  computes role, runs the race, passes the winning transport
  into CallEngine. If ANY input is missing (no peer addr, no own
  addr, equal addrs), falls back to classic relay path —
  identical to pre-Phase-3.5 behavior.

### Tests (9 new, all passing)
- 6 unit tests for `determine_role` truth table in
  `wzp-client/src/reflect.rs` (smaller=Acceptor, larger=Dialer,
  port-only diff, equal, missing-side, symmetry)
- 3 integration tests in `crates/wzp-client/tests/dual_path.rs`:
    * `dual_path_direct_wins_on_loopback` — two-endpoint test
      rig, Dialer wins direct path vs loopback mock relay
    * `dual_path_relay_wins_when_direct_is_dead` — dead peer
      port, 2s direct timeout, relay fallback wins
    * `dual_path_errors_cleanly_when_both_paths_dead` — <10s
      error, no hang

## GUI call-flow debug logs

Runtime-toggled structured events at every step of a call so the
user can see where a call progressed or stalled on real hardware.
Modeled on the existing DRED_VERBOSE_LOGS pattern.

### Rust side
- `static CALL_DEBUG_LOGS: AtomicBool` + `emit_call_debug(&app,
  step, details)` helper. Always logs via `tracing::info!`
  (logcat always has a copy); GUI Tauri `call-debug-log` event
  only fires when the flag is on.
- Tauri commands `set_call_debug_logs` / `get_call_debug_logs`.

### Instrumented steps (24 emit_call_debug sites)
- `register_signal`: start, identity loaded, endpoint created,
  connect failed/ok, RegisterPresence sent, ack received/failed,
  recv loop spawning
- Recv loop: CallRinging, DirectCallOffer (w/ caller_reflexive_addr),
  DirectCallAnswer (w/ callee_reflexive_addr), CallSetup (w/
  peer_direct_addr), Hangup
- `place_call`: start, reflect query start/ok/none, offer sent,
  send failed
- `answer_call`: start, reflect query start/ok/none or privacy
  skip, answer sent, send failed
- `connect`: start, dual_path_race_start (w/ role), won (w/
  path), failed, skipped (w/ reasons), call_engine_starting/
  started/failed

### JS side
- New `callDebugLogs: boolean` field on Settings type.
- Boot-time hydrate of the Rust flag from localStorage so the
  choice survives restarts (like `dredDebugLogs`).
- Settings panel: new "Call flow debug logs" checkbox alongside
  the DRED toggle.
- New "Call Debug Log" section that ONLY shows when the flag is
  on. Rolling in-memory buffer of the last 200 events, rendered
  as monospace `HH:MM:SS.mmm step {details}` lines with auto-
  scroll and a Clear button.
- `listen("call-debug-log", ...)` subscribed at app startup,
  appends to the buffer, re-renders on every event.

Full workspace test goes from 404 → 413 passing. Clippy clean
on touched crates.

PRD: .taskmaster/docs/prd_phase35_dual_path_race.txt
Tasks: 61-69 all completed

Next: APK + desktop build carrying everything — Phase 2 NAT
detect, Phase 3 advertising, Phase 3.5 dual-path + call debug
logs, plus the earlier Android first-join diagnostics — so the
user can validate the P2P path on real hardware with live
per-step visibility into where any failures happen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 14:06:44 +04:00
Siavash Sameni
39277bf3a0 feat(hole-punching): advertise peer reflexive addrs in DirectCall flow — Phase 3
Completes the signal-plane plumbing for P2P direct calling: both
peers now learn their own server-reflexive address (Phase 1
Reflect), include it in DirectCallOffer / DirectCallAnswer, and
the relay cross-wires them into each side's CallSetup so the
client knows the OTHER party's direct addr. Dual-path QUIC race
is scaffolded but deferred to Phase 3.5 — this commit ships the
full advertising layer so real-hardware testing can confirm the
addrs flow end-to-end before adding the concurrent-connect logic.

Wire protocol (wzp-proto/src/packet.rs):
- DirectCallOffer gains optional `caller_reflexive_addr`
- DirectCallAnswer gains optional `callee_reflexive_addr`
- CallSetup gains optional `peer_direct_addr`
- All #[serde(default, skip_serializing_if = "Option::is_none")] so
  pre-Phase-3 peers and relays stay backward compatible by
  construction — the new fields are elided from the JSON on the
  wire when None, and older clients parse the JSON ignoring any
  fields they don't know.
- 2 new roundtrip tests (Some + None cases, old-JSON parse-back).

Call registry (wzp-relay/src/call_registry.rs):
- DirectCall gains caller_reflexive_addr + callee_reflexive_addr.
- set_caller_reflexive_addr / set_callee_reflexive_addr setters.
- 2 new unit tests: stores and returns addrs, clearing works.

Relay cross-wiring (wzp-relay/src/main.rs):
- On DirectCallOffer: stash the caller's addr in the registry.
- On DirectCallAnswer: stash the callee's addr (only set by
  AcceptTrusted answers — privacy-mode leaves it None).
- Send two different CallSetup messages: one to the caller with
  peer_direct_addr=callee_addr, and one to the callee with
  peer_direct_addr=caller_addr. The cross-wiring means each side
  gets the OTHER party's direct addr, not its own.
- Logs `p2p_viable=true` when both sides advertised.

Client advertising (desktop/src-tauri/src/lib.rs):
- New `try_reflect_own_addr` helper that reuses the Phase 1
  oneshot pattern WITHOUT holding state.signal.lock() across the
  await (critical: the recv loop reacquires the same mutex to
  fire the oneshot, so holding it would deadlock).
- `place_call` queries reflect first and includes the returned
  addr in DirectCallOffer. Falls back to None on any failure —
  call still proceeds via the relay path.
- `answer_call` queries reflect ONLY on AcceptTrusted so
  AcceptGeneric keeps the callee's IP private by design. Reject
  and AcceptGeneric both pass None.
- recv loop's CallSetup handler destructures and forwards
  peer_direct_addr to the JS layer in the signal-event payload.

Client scaffolding for dual-path (desktop/src-tauri/src/lib.rs +
desktop/src/main.ts):
- `connect` Tauri command gets a new optional `peer_direct_addr`
  argument. Currently LOGS the addr but still uses the relay
  path for the media connection — Phase 3.5 will swap in a
  tokio::select! race between direct dial + relay dial. Scaffolding
  lands here so the JS wire is stable, real-hardware testing can
  confirm advertising works end-to-end, and Phase 3.5 is a pure
  Rust change with no JS touches.
- JS setup handler forwards `data.peer_direct_addr` to invoke.

Back-compat with the CLI client (crates/wzp-client/src/cli.rs):
- CLI test harness updated for the new fields — always passes
  None for both reflex addrs (no hole-punching). Also destructures
  peer_direct_addr: _ in its CallSetup handler.

Tests (8 new, all passing):
- wzp-proto: hole_punching_optional_fields_roundtrip,
  hole_punching_backward_compat_old_json_parses
- wzp-relay call_registry: call_registry_stores_reflexive_addrs,
  call_registry_clearing_reflex_addr_works
- wzp-relay integration: crates/wzp-relay/tests/hole_punching.rs
    * both_peers_advertise_reflex_addrs_cross_wire_in_setup
    * privacy_mode_answer_omits_callee_addr_from_setup
    * pre_phase3_caller_leaves_both_setups_relay_only
    * neither_peer_advertises_both_setups_are_relay_only

Full workspace test goes from 396 → 404 passing.

PRD: .taskmaster/docs/prd_hole_punching.txt
Tasks: 53-60 all completed (58 = scaffolding-only; 3.5 follow-up)

Next up: **Phase 3.5 — dual-path QUIC connect race**. With the
advertising layer live, this becomes a focused change: on
CallSetup-with-peer_direct_addr, start a server-capable dual
endpoint, and tokio::select! across (direct dial, relay dial,
inbound accept). Whichever QUIC handshake completes first wins,
the losers drop, 2s direct timeout falls back to relay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 13:37:04 +04:00
Siavash Sameni
8d903f16c6 feat(reflect): multi-relay NAT type detection — Phase 2
Builds on Phase 1's SignalMessage::Reflect to probe N relays in
parallel through transient QUIC connections and classify the
client's NAT type for the future P2P hole-punching path. No wire
protocol changes — Phase 1's Reflect/ReflectResponse pair is
reused unchanged.

New client-side module (crates/wzp-client/src/reflect.rs):
- probe_reflect_addr(relay, timeout_ms): opens a throwaway
  quinn::Endpoint (fresh ephemeral source port per probe,
  essential for NAT-type detection — sharing one endpoint would
  make a symmetric NAT look like a cone NAT), connects to _signal,
  sends RegisterPresence with zero identity, consumes the Ack,
  sends Reflect, awaits ReflectResponse, cleanly closes.
- detect_nat_type(relays, timeout_ms): parallel probes via
  tokio::task::JoinSet (bounded by slowest probe not sum) and
  returns a NatDetection with per-probe results + aggregate
  classification.
- classify_nat(probes): pure-function classifier split out for
  network-free unit tests. Rules:
    * 0-1 successful probes              → Unknown
    * 2+ successes, same ip same port    → Cone (P2P viable)
    * 2+ successes, same ip diff ports   → SymmetricPort (relay)
    * 2+ successes, different ips        → Multiple (treat as
                                             symmetric)

Tauri command (desktop/src-tauri/src/lib.rs):
- detect_nat_type({ relays: [{ name, address }] }) -> NatDetection
  as JSON. Takes the relay list from JS because localStorage
  owns the config. Parse-up-front so a malformed entry fails
  clean instead of as a probe error. 1500ms per-probe timeout.

UI (desktop/index.html + src/main.ts):
- New "NAT type" row + "Detect NAT" button in the Network
  settings section. Renders per-probe status (name, address,
  observed addr, latency, or error) plus the colored verdict:
    * green  Cone — shows consensus addr
    * amber  SymmetricPort / Multiple — must relay
    * gray   Unknown — not enough data

Tests:
- 7 unit tests in wzp-client/src/reflect.rs covering every
  classifier branch (empty, 1 success, 2 identical, 2 diff ports,
  2 diff ips, success+failure mix, pure-failure).
- 3 integration tests in crates/wzp-relay/tests/multi_reflect.rs:
    * probe_reflect_addr_happy_path — single mock relay end-to-end
    * detect_nat_type_two_loopback_relays_is_cone — two concurrent
      relays, asserts both see 127.0.0.1 and classifier returns
      Cone or SymmetricPort (accepted because the test harness
      uses fresh ephemeral ports per probe which look like
      SymmetricPort on single-host loopback)
    * detect_nat_type_dead_relay_is_unknown — alive + dead port
      mix, asserts the dead probe surfaces an error string and
      the aggregator returns Unknown (only 1 success)

Full workspace test goes from 386 → 396 passing.

PRD: .taskmaster/docs/prd_multi_relay_reflect.txt
Tasks: 47-52 all completed

Next up: hole-punching (Phase 3) — use the reflected address in
DirectCallOffer/Answer and CallSetup so peers attempt a direct
QUIC handshake to each other, with relay fallback on timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 12:47:12 +04:00
Siavash Sameni
921856eba9 feat(reflect): QUIC-native NAT reflection ("STUN for QUIC") — Phase 1
Lets a client ask its registered relay "what IP:port do you see for
me?" over the existing TLS-authenticated signal channel, returning
the client's server-reflexive address as a SocketAddr. Replaces the
need for a classic STUN deployment and becomes the bootstrap step
for future P2P hole-punching: once both peers know their own reflex
addrs, they can advertise them in DirectCallOffer and attempt a
direct QUIC handshake to each other.

Wire protocol (wzp-proto):
- SignalMessage::Reflect — unit variant, client -> relay
- SignalMessage::ReflectResponse { observed_addr: String } — relay -> client
- JSON-serde, appended at end of enum: zero ordinal concerns,
  backward compat with pre-Phase-1 relays by construction (older
  relays log "unexpected message" and drop; newer clients time out
  cleanly within 1s).

Relay handler (wzp-relay/src/main.rs, signal loop):
- New match arm next to Ping reuses the already-bound `addr` from
  connection.remote_address() and replies with observed_addr as a
  string. debug!-level log on success, warn!-level on send failure.

Client side (desktop/src-tauri/src/lib.rs):
- SignalState gains pending_reflect: Option<oneshot::Sender<SocketAddr>>.
- get_reflected_address Tauri command installs the oneshot before
  sending Reflect and awaits it with a 1s timeout; cleans up on
  every exit path (send failure, timeout, parse error).
- recv loop's new ReflectResponse arm fires the pending sender or
  emits a debug log for unsolicited responses — never crashes the
  loop on malformed input.
- Integrated into invoke_handler! alongside the other signal
  commands.

UI (desktop/index.html + src/main.ts):
- New "Network" section in settings panel with a "Detect" button
  that displays the reflected address or a categorized warning
  ("register first" / "relay does not support reflection" / error).

Tests (crates/wzp-relay/tests/reflect.rs — 3 new, all passing):
- reflect_happy_path: client on loopback gets back 127.0.0.1:<its own port>
- reflect_two_clients_distinct_ports: two concurrent clients see
  their own distinct ports, proving per-connection remote_address
- reflect_old_relay_times_out: mock relay that ignores Reflect —
  client times out between 1000-1200ms and does not hang

Also pre-existing test bit-rot unrelated to this PR — fixed so the
full workspace `cargo test` goes green:
- handshake_integration tests in wzp-client, wzp-relay and
  featherchat_compat in wzp-crypto all missed the `alias` field
  addition to CallOffer and the 3-arg form of perform_handshake
  plus 4-tuple return of accept_handshake. Updated to the current
  API surface.

Results:
  cargo test --workspace --exclude wzp-android: 386 passed
  cargo check --workspace: clean
  cargo clippy: no new warnings in touched files

Verification excludes wzp-android because it's dead code on this
branch (Tauri mobile uses wzp-native instead) and can't link -llog
on macOS host — unchanged status quo.

PRD: .taskmaster/docs/prd_reflect_over_quic.txt
Tasks: 39-46 all completed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 12:29:07 +04:00
Siavash Sameni
578ff8cff4 feat(debug): GUI toggle for DRED verbose logs + macOS mic permission
DRED verbose logs (off by default — keeps logcat clean in normal use):
- wzp-codec: DRED_VERBOSE_LOGS atomic flag with dred_verbose_logs() /
  set_dred_verbose_logs() helpers
- opus_enc: gate "DRED enabled" + libopus version logs behind the flag
- desktop/src-tauri/engine.rs: gate DredRecvState parse log,
  reconstruction log, classical PLC log, and DRED-counter fields in
  the Android recv heartbeat (non-verbose path still logs basic recv
  stats)
- Tauri commands set_dred_verbose_logs / get_dred_verbose_logs
- Settings panel gets a "DRED debug logs (verbose, dev only)"
  checkbox; preference persists in wzp-settings localStorage and is
  pushed to Rust on save and on app boot

macOS mic permission:
- Add desktop/src-tauri/Info.plist with NSMicrophoneUsageDescription.
  Without it, modern macOS silently denies CoreAudio capture for
  ad-hoc-signed Tauri builds — capture starts but every callback
  hands you zeros. Symptom: phones could not hear desktop client,
  desktop could still hear phones (playout has no TCC gate). The
  Tauri 2 bundler auto-merges this file into WarzonePhone.app's
  Contents/Info.plist on the next build, so first launch will pop
  the standard mic prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 09:48:32 +04:00
Siavash Sameni
16890576fb feat(observability): logcat-visible DRED proof of life on Android
Adds enough INFO-level logging that an opus-DRED-v2 APK on Android can
be verified end-to-end by reading logcat alone — no debugger, no
Prometheus, no telemetry pipeline required. Three observation points:

1. Encoder construction (opus_enc.rs)
   - Bumped the "DRED enabled" log from debug! to info! so the
     per-call DRED config is in logcat by default. Each call's first
     OpusEncoder construction logs codec, dred_frames, dred_ms,
     loss_floor_pct.
   - Added a one-shot static OnceLock that logs `opusic_c::version()`
     the first time an OpusEncoder is built in the process. This is
     the smoking gun for "is the new libopus actually loaded" — pre-
     Phase-0 audiopus shipped libopus 1.3 with no DRED, post-Phase-0
     should print 1.5.2 here.

2. DRED state ingest (DredRecvState::ingest_opus in
   desktop/src-tauri/src/engine.rs)
   - First successful parse on a call logs immediately so we can see
     "DRED is on the wire" in logcat.
   - Subsequent parses sample every 100th to confirm steady-state
     samples_available without drowning the log.
   - New parses_total / parses_with_data counters track the parse
     rate vs the success rate (a packet without DRED in it returns
     `available == 0`, so a low ratio means the encoder isn't
     emitting DRED bytes).

3. DRED reconstruction events (DredRecvState::fill_gap_to)
   - Every DRED reconstruction logs at INFO with missing_seq,
     anchor_seq, offset_samples, offset_ms, samples_available,
     gap_size, and the running total. These events are rare on a
     clean network and we want to know exactly which gap was filled.
   - First three classical PLC fills + every 50th thereafter log so
     we can see when DRED couldn't cover a gap (offset out of range,
     no good state, or reconstruct error).

4. Recv heartbeat (Android start() in engine.rs)
   - Existing 2-second heartbeat now includes dred_recv,
     classical_plc, dred_parses_with_data, dred_parses_total
     so a steady-state call shows the cumulative counters in
     logcat without parsing.

How to verify on a real call:

  adb logcat -s 'RustStdoutStderr:*' | grep -i 'dred\|libopus version'

Expected output sequence on a successful Opus call:
  - "linked libopus version libopus_version=libopus 1.5.2-..."  (once per process)
  - "opus encoder: DRED enabled codec=Opus24k dred_frames=20 dred_ms=200 loss_floor_pct=15"  (per call)
  - "DRED state parsed from Opus packet seq=N samples_available=4560 ms=95 ..."  (after first DRED-bearing packet)
  - "recv heartbeat (android) ... dred_recv=0 classical_plc=0 dred_parses_with_data=58 dred_parses_total=58"  (every 2s)

If you see "linked libopus version libopus 1.3" — the FFI swap didn't
take. If dred_parses_with_data stays at 0 while dred_parses_total
climbs — the sender isn't emitting DRED (check the encoder's loss
floor and the receiver's libopus version). If gaps trigger
"classical PLC fill" instead of "DRED reconstruction fired" —
DRED state coverage is too small for the observed loss pattern,
and the loss floor or DRED duration policy needs tuning.

Verification:
- cargo check -p wzp-codec -p wzp-client: 0 errors
- cargo check -p wzp-desktop: 0 Rust errors (only the pre-existing
  tauri::generate_context!() proc macro panic on missing ../dist
  which fires at host check time, irrelevant on the remote build)
- cargo test -p wzp-codec --lib: 69 passing (no regressions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 08:58:03 +04:00
Siavash Sameni
daf7bcd9ba chore(warnings): sweep the workspace — zero warnings on lib + bin targets
Addressed every rustc warning surfaced by \`cargo check --workspace
--release --lib --bins\` on opus-DRED-v2. Split across three
categories:

## Real bugs surfaced by the audit (fix, don't silence)

- **crates/wzp-relay/src/federation.rs** — the per-peer RTT monitor
  task computed \`rtt_ms\` every 5 s and threw it on the floor. The
  \`wzp_federation_peer_rtt_ms\` gauge has been registered in
  metrics.rs the whole time but was never receiving samples, leaving
  the Grafana panel blank. Wired it up: the task now calls
  \`fm_rtt.metrics.federation_peer_rtt_ms.with_label_values(&[&label_rtt]).set(rtt_ms)\`
  on every sample. Fixes three warnings (\`rtt_ms\`, \`fm_rtt\`,
  \`label_rtt\` were all captured for this task and all dead).

## Dead code removal

- **crates/wzp-relay/src/federation.rs** — removed \`local_delivery_seq:
  AtomicU16\` field and its initializer. It was described in comments
  as "per-room seq counter for federation media delivered to local
  clients" but was declared, initialized to 0, and never read or
  written anywhere else. Genuine half-wired feature; deletable with
  zero behavior change.
- **crates/wzp-relay/src/room.rs** — removed \`let recv_start =
  Instant::now()\` at the top of a recv loop that was never read.
  Separate variable \`last_recv_instant\` already measures the actual
  gap that's used for the \`max_recv_gap_ms\` stat.
- **crates/wzp-client/src/cli.rs** — removed \`let my_fp = fp.clone()\`
  from the signal loop setup. Cloned but never used in any match arm.

## Stub-intent warnings (underscore + explanatory comment)

- **crates/wzp-relay/src/handshake.rs** — \`choose_profile\` hardcodes
  \`QualityProfile::GOOD\` and ignores its \`supported\` parameter.
  Comment already documented "Cap at GOOD (24k) for now — studio
  tiers not yet tested for federation reliability". Renamed to
  \`_supported\`, expanded the comment to explicitly note the future
  plan (pick highest supported ≤ relay ceiling).
- **crates/wzp-relay/src/federation.rs** — \`forward_to_peers\` takes
  \`room_name: &str\` but only uses \`room_hash\`. The caller
  (handle_datagram) passes the name for caller-site symmetry with
  other helpers; kept the param shape and underscored the binding
  with a comment noting it's reserved for future per-name logging.

## Cosmetic fixes

- **crates/wzp-relay/src/event_log.rs** — dropped \`use std::sync::Arc\`
  (unused).
- **crates/wzp-relay/src/signal_hub.rs** — trimmed \`use tracing::{info,
  warn}\` to \`use tracing::info\`. Also removed unnecessary \`mut\` on
  \`hub\` binding in the \`register_unregister\` test.
- **crates/wzp-relay/src/room.rs** — trimmed \`use tracing::{debug,
  error, info, trace, warn}\` to \`{error, info, warn}\`. Also removed
  unnecessary \`mut\` on \`mgr\` binding in the \`room_join_leave\` test.
- **crates/wzp-relay/src/main.rs** — removed unnecessary \`mut\` on the
  \`config\` destructured binding from \`parse_args()\`; and dropped
  \`ref caller_alias\` from the \`DirectCallOffer\` match pattern since
  the relay just forwards the full \`msg\` (caller_alias is preserved
  end-to-end, we don't need to read it on the relay).
- **crates/wzp-crypto/tests/featherchat_compat.rs** — dropped
  \`CallSignalType\` from a \`use wzp_client::featherchat::{...}\`
  (unused in the test body). Note: this test file has pre-existing
  compile errors from SignalMessage schema drift unrelated to this
  sweep; that's tracked separately.

## Crate-level annotation

- **crates/wzp-android/src/lib.rs** — added
  \`#![allow(dead_code, unused_imports, unused_variables, unused_mut)]\`
  with a doc block explaining the crate is dead code since the Tauri
  mobile rewrite. The legacy Kotlin+JNI Android app that consumed
  this crate was replaced by desktop/src-tauri (live Android recv
  path) + crates/wzp-native (Oboe bridge). Rather than piecemeal
  cleanup of a crate that shouldn't be maintained, the whole-crate
  allow keeps CI clean until someone removes the crate entirely. Kills
  all 6 wzp-android warnings (4 unused imports/vars, 1 unused \`mut\`
  on a JNI env param, 1 dead \`command_rx\` field) in one line.

## Not touched

- **deps/featherchat/warzone/crates/warzone-protocol/src/x3dh.rs** —
  3 unused-variable warnings in \`alice_spk_secret\`, \`alice_bundle\`,
  \`bob_bundle_bytes\`. This is a vendored third-party submodule;
  upstream's problem, not ours. Would need to be reported to
  featherchat upstream if we care.

## Verification

- \`cargo check --workspace --release --lib --bins\` → 0 warnings, 0 errors
- \`cargo check --workspace --release --all-targets\` → only the 3
  featherchat submodule warnings remain, plus the pre-existing 3
  broken integration tests (SignalMessage schema drift from Phase 2,
  tracked separately and explicitly out of scope).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 08:28:26 +04:00
Siavash Sameni
df1a45a5f5 fix(cli): port live mode to ring API (read_frame/write_frame removed)
AudioCapture and AudioPlayback no longer expose the old read_frame()
and write_frame() methods — they were replaced with ring() returning
&Arc<AudioRing> when the lock-free SPSC ring was introduced. The CLI
live-mode loop still referenced the removed methods, which broke every
workspace build that touched wzp-client bin (including the remote
Linux x86_64 docker build).

- Send loop: allocate a 960-sample scratch buffer, fill it in a loop
  via capture.ring().read() until a full 20 ms frame is available,
  sleep 2 ms between empty reads to avoid hot-spinning.
- Recv loop: write decoded PCM into playback.ring() instead of
  calling write_frame(). Short writes on full ring drop the tail,
  which is the correct real-time behavior for CLI live mode.

No behavioral change on the wire or in the call pipeline — this is
purely a compile fix for cli.rs bitrot that accumulated since the
ring API landed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 08:08:14 +04:00
Siavash Sameni
7515417202 feat(telemetry): Phase 4 — LossRecoveryUpdate protocol + relay metrics + DebugReporter
Phase 4 lays the telemetry foundation for distinguishing DRED recoveries
from classical PLC in production: a new SignalMessage variant, two new
per-session Prometheus counters on the relay side, and a highlighted
loss-recovery section in the Android DebugReporter.

The periodic emitter (client → relay) and Grafana panel are deferred to
Phase 4b — this commit ships the protocol surface, the relay sink, and
the immediate user-visible debug output. Once 4b lands the full path
(emitter → relay → Prometheus → Grafana), the metrics here will
automatically start receiving data.

Scope decision — why not extend QualityReport instead:
The existing wire-format QualityReport is a fixed 4-byte media packet
trailer. Adding counter fields to it would shift the binary layout and
break backward compatibility (old receivers would parse the last 4
bytes of the extended trailer as QR, corrupting audio). Using a
new SignalMessage variant on the reliable QUIC signal stream sidesteps
the wire-format problem entirely — serde JSON enums tolerate unknown
variants gracefully on old receivers, and the signal channel is the
right layer for periodic telemetry aggregates.

Changes:

  wzp-proto/src/packet.rs:
    - New SignalMessage::LossRecoveryUpdate variant carrying:
        * dred_reconstructions: u64 (monotonic since call start)
        * classical_plc_invocations: u64 (monotonic)
        * frames_decoded: u64 (for rate calculation)
    - All three fields tagged #[serde(default)] for forward compat.

  wzp-client/src/featherchat.rs:
    - Added a match arm so signal_to_call_type() handles the new
      variant (treat as Offer for featherChat bridging purposes).

  wzp-relay/src/metrics.rs:
    - Two new IntCounterVec metrics on the relay, labeled by session_id:
        * wzp_relay_session_dred_reconstructions_total
        * wzp_relay_session_classical_plc_total
    - New method update_session_loss_recovery(session_id, dred, plc)
      applies monotonic deltas: if the incoming totals exceed the
      current counter, the difference is inc_by'd. If the incoming
      totals are LOWER (client restart or counter reset), the
      Prometheus counter holds steady until the client catches up.
      This matches the existing update_session_buffer delta pattern.
    - remove_session_metrics() now cleans up the two new labels.
    - New test session_loss_recovery_monotonic_delta exercises:
        * initial population (10 DRED, 2 PLC)
        * forward advance (25, 5 → delta +15, +3)
        * lower values ignored (client reset → counters unchanged)
        * client catches up (30, 8 → advances to new max)
    - Existing session_metrics_cleanup test extended to cover the
      new counters.

  android/app/src/main/java/com/wzp/debug/DebugReporter.kt:
    - Phase 4 users — and incident responders — need to quickly see
      whether DRED is actually firing during a call. The stats JSON
      already carries the counters (after Phase 3c), but they were
      buried in the trailing JSON dump. Added a dedicated
      "=== Loss Recovery ===" section to the meta preamble that
      extracts dred_reconstructions, classical_plc_invocations,
      frames_decoded, and fec_recovered from the JSON and displays
      them plainly, plus computed percentages when frames_decoded > 0.
    - New extractLongField helper: tiny hand-rolled JSON integer
      extractor. We don't want to pull in a full JSON parser for this
      single use case and CallStats has a flat, well-known schema.

Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-proto --lib: 63 passing
- cargo test -p wzp-codec --lib: 68 passing
- cargo test -p wzp-client --lib: 35 passing (+1 ignored probe)
- cargo test -p wzp-relay --lib: 68 passing (+1 new Phase 4 test)
- cargo check -p wzp-android --lib: zero errors
- Android APK build verified earlier today (unridden-alfonso.apk
  via the remote Docker builder) — Phase 0–3c confirmed to compile
  end-to-end on the NDK target.

Phase 4b remaining (not blocking this commit):
- Periodic LossRecoveryUpdate emitter in wzp-client/src/call.rs and
  wzp-android/src/engine.rs (every ~5 s)
- Relay-side handler in main.rs that matches the new variant and
  calls metrics.update_session_loss_recovery
- Grafana "Loss recovery breakdown" panel in docs/grafana-dashboard.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:03:39 +04:00
Siavash Sameni
505a834c5b feat(codec): Phase 3c — Android engine.rs DRED reconstruction on packet loss
Phase 3c mirrors Phase 3b on the Android receive path. With Phase 0-3b
landed on desktop + Android encoder, this commit completes codec-layer
loss recovery on the Android decoder side.

Architectural difference vs desktop: engine.rs has NO jitter buffer.
The recv task reads packets directly from the transport via
recv_media().await and writes decoded audio straight into the playout
ring. There is no PlayoutResult::Missing equivalent. Gap detection
therefore has to be done via sequence-number tracking — when a packet
arrives with seq > expected_seq, the frames in between are missing and
we attempt to reconstruct them via DRED before decoding the newly-
arrived packet.

Implementation:

  Imports & types:
    - Added wzp_codec::AdaptiveDecoder, wzp_codec::dred_ffi::{
      DredDecoderHandle, DredState} imports.
    - Changed the `decoder` local from Box<dyn AudioDecoder> (via
      wzp_codec::create_decoder) to concrete AdaptiveDecoder::new(profile).
      Same reasoning as Phase 3b: reconstruct_from_dred is an inherent
      method, not a trait method, so we need the concrete type.

  Recv task state (all task-local, no new struct fields):
    - dred_decoder: DredDecoderHandle
    - dred_parse_scratch: DredState (reused, overwritten per parse)
    - last_good_dred: DredState (cached most-recent valid state)
    - last_good_dred_seq: Option<u16>
    - expected_seq: Option<u16> (for gap detection)
    - dred_reconstructions: u64 (telemetry)
    - classical_plc_invocations: u64 (telemetry)

  Recv loop body (Opus source packets only):
    1. Parse DRED from the new packet first so last_good_dred reflects
       the freshest state available for gap recovery.
    2. Detect a gap: gap = pkt.seq.wrapping_sub(expected_seq). Cap at
       MAX_GAP_FRAMES = 16 (320 ms) to avoid huge wraparound scenarios.
    3. For each missing seq in the gap:
         offset = (last_good_dred_seq - missing_seq) * frame_samples
         if 0 < offset <= last_good_dred.samples_available():
             reconstruct_from_dred + write to playout ring
             bump dred_reconstructions
         else:
             decoder.decode_lost (classical PLC) + write + bump plc counter
    4. Decode the current packet normally and write to playout ring
       (unchanged from Phase 2).
    5. Update expected_seq = pkt.seq.wrapping_add(1).

  Profile-switch handling: when the incoming codec changes (triggering
  decoder.set_profile), reset last_good_dred_seq and expected_seq to
  None. The cached DRED state is tied to the old profile's frame rate
  and would produce wrong offsets after the switch; starting fresh is
  correct.

  Decode-error fallback: the existing `Err(e) => decode_lost` branch
  now also increments classical_plc_invocations so the counter
  accurately reflects all PLC invocations (gap-detected AND decode-
  error-triggered).

Telemetry (CallStats additions):
  - stats.dred_reconstructions: u64
  - stats.classical_plc_invocations: u64
  Both updated on every packet arrival in the existing stats.lock()
  block alongside frames_decoded/fec_recovered, so the Android UI and
  JNI bridge already have these values without any further plumbing.
  The periodic recv stats log now includes both counters.

Ordering note: DRED gap reconstruction happens BEFORE decoding the new
packet's audio because the playout ring is FIFO. Gap samples must be
written before the new packet's samples so temporal order is preserved.
Out-of-order late arrivals (seq < expected_seq) are naturally dropped
as stale by the gap detection (gap would be a large wraparound value
exceeding MAX_GAP_FRAMES).

Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 68 passing (unchanged from Phase 3b)
- cargo test -p wzp-client --lib: 35 passing (unchanged from Phase 3b)
- cargo check -p wzp-android --lib: zero errors
- cargo test -p wzp-android cannot run on macOS host (pre-existing
  -llog linker dep, unrelated). Real end-to-end verification happens
  via the Android APK build on the remote Docker builder
  (scripts/build-and-notify.sh).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:03:31 +04:00
Siavash Sameni
27bc264738 feat(codec): Phase 3b — CallDecoder DRED reconstruction on packet loss
Phase 3b of the DRED integration — wires the Phase 3a FFI primitives
into the desktop receive path. When the jitter buffer reports a missing
Opus frame, CallDecoder now attempts to reconstruct the audio from the
most recently parsed DRED side-channel state before falling through to
classical PLC.

Architectural refinement vs the PRD's literal wording: the PRD said
"jitter buffer takes a Box<dyn DredReconstructor>". After checking deps,
wzp-transport depends only on wzp-proto (not wzp-codec). Putting DRED
state in the jitter buffer would require a new cross-crate dep and
couple the codec-agnostic buffer to libopus. Instead, this commit keeps
the DRED state ring and reconstruction dispatch inside CallDecoder (one
layer up from the jitter buffer), intercepting the existing
PlayoutResult::Missing signal. Same lookahead/backfill semantics,
cleaner layering, zero change to wzp-transport.

Changes:

  CallDecoder field type: Box<dyn AudioDecoder> → AdaptiveDecoder.
  Required because Phase 3b calls the inherent reconstruct_from_dred
  method, which cannot live on the AudioDecoder trait without dragging
  libopus DredState through wzp-proto. In practice AdaptiveDecoder was
  the only AudioDecoder implementor anyway — the trait abstraction was
  buying nothing. Method call sites unchanged because AdaptiveDecoder
  also implements AudioDecoder.

  New CallDecoder fields:
    - dred_decoder: DredDecoderHandle
    - dred_parse_scratch: DredState  (scratch for parse_into)
    - last_good_dred: DredState      (cached most-recent valid state)
    - last_good_dred_seq: Option<u16>
    - dred_reconstructions: u64      (Phase 4 telemetry)
    - classical_plc_invocations: u64 (Phase 4 telemetry)

  CallDecoder::ingest — on Opus non-repair packets, parse DRED into the
  scratch state. On success (samples_available > 0), std::mem::swap the
  scratch into last_good_dred and record the seq. This is O(1) per
  packet, zero allocation after construction (the two DredState buffers
  are allocated once in new() and reused forever).

  CallDecoder::decode_next — on PlayoutResult::Missing(seq) for Opus
  profiles: if last_good_dred_seq > seq and the seq delta × frame_samples
  fits within samples_available, call audio_dec.reconstruct_from_dred
  and bump dred_reconstructions. Otherwise fall through to classical
  PLC and bump classical_plc_invocations. The Codec2 path always falls
  through to classical PLC since DRED is libopus-only and
  AdaptiveDecoder::reconstruct_from_dred rejects Codec2 tiers
  explicitly.

  OpusDecoder and AdaptiveDecoder: new inherent reconstruct_from_dred
  method that delegates to the underlying DecoderHandle. Needed to
  bridge CallDecoder's wzp-client code to the Phase 3a FFI wrappers
  without touching the AudioDecoder trait.

CRITICAL FINDING — raised DRED loss floor from 5% to 15%:

Phase 3b testing discovered that libopus 1.5's DRED emission window
scales aggressively with OPUS_SET_PACKET_LOSS_PERC. Empirical data
(see probe_dred_samples_available_by_loss_floor, an #[ignore]'d
diagnostic test in call.rs):

  loss_pct   samples_available   effective_ms
    5%        720                  15 ms  (useless!)
   10%        2640                 55 ms
   15%        4560                 95 ms
   20%        6480                135 ms
   25%+       8400 (capped)       175 ms  (~87% of 200 ms configured)

The Phase 1 default of 5% produced only a 15 ms reconstruction window
— too small to even cover a single 20 ms Opus frame. DRED was
effectively disabled even though it was emitting bytes. Raised the
floor to 15% (95 ms window) as the minimum that actually provides
single-frame loss recovery. This updates Phase 1's DRED_LOSS_FLOOR_PCT
constant in opus_enc.rs and the accompanying module docstring.

Trade-off: 15% assumed loss slightly increases encoder bitrate overhead
on clean networks. Measured via the existing phase1 bitrate probe:

  Before (5% floor):  3649 bytes/sec at Opus 24k + 300 Hz sine
  After  (15% floor): 3568 bytes/sec at Opus 24k + 300 Hz sine

The delta is within noise — 15% isn't meaningfully more expensive than
5% on this signal, which suggests the DRED emission size is signal-
dependent rather than loss-dependent for small values. Net result: we
get a 6x larger reconstruction window for essentially free.

Tests (+3 DRED recovery, +1 #[ignore]'d probe):
- opus_single_packet_loss_is_recovered_via_dred — full encode → ingest
  → decode_next loop with one packet dropped mid-stream. Asserts
  dred_reconstructions ≥ 1 and observes the exact counter deltas.
- opus_lossless_ingest_never_triggers_dred_or_plc — baseline behavior,
  lossless stream never takes the Missing branch.
- codec2_loss_falls_through_to_classical_plc — Codec2 never
  reconstructs via DRED even if state were populated (which it won't
  be — Codec2 packets don't carry DRED bytes).
- probe_dred_samples_available_by_loss_floor — #[ignore]'d diagnostic
  that sweeps loss_pct values and prints the resulting DRED window
  sizes. Kept for future tuning work.

New CallDecoder introspection accessors (public but undocumented in
the PRD): last_good_dred_seq() and last_good_dred_samples_available()
for test diagnostics and future telemetry surfaces in Phase 4.

Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 68 passing (Phase 3a baseline held)
- cargo test -p wzp-client --lib: 35 passing (+3 Phase 3b tests,
  +1 ignored diagnostic, no regressions)

Next up: Phase 3c mirrors this on the Android engine.rs receive path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:03:24 +04:00
Siavash Sameni
c27b39d553 feat(codec): Phase 3a — DRED FFI primitives (DredDecoderHandle + DredState)
Phase 3a of the DRED integration — the foundation for codec-layer loss
recovery. Adds three new safe wrappers to crates/wzp-codec/src/dred_ffi.rs
over the raw opusic-sys FFI, plus the reconstruction method on the existing
DecoderHandle. No call-site integration yet — that lands in Phase 3b (desktop)
and Phase 3c (Android).

New types:
- `DredDecoderHandle`: owns *mut OpusDREDDecoder from opus_dred_decoder_create.
  Used for parsing DRED side-channel data out of arriving Opus packets.
  This is a SEPARATE libopus object from OpusDecoder — it has its own
  internal state. Freed via opus_dred_decoder_destroy on Drop.
- `DredState`: owns *mut OpusDRED from opus_dred_alloc (a fixed ~10.6 KB
  buffer per libopus 1.5). Holds parsed DRED data between the parse and
  reconstruct steps. Reusable — parse_into overwrites contents. Tracks
  samples_available as a cached u32 so callers don't thread the value
  separately. Freed via opus_dred_free on Drop.

New methods:
- `DredDecoderHandle::parse_into(&mut self, state: &mut DredState, packet)`
  wraps opus_dred_parse with max_dred_samples=48000 (1s max), sampling_rate
  =48000, defer_processing=0. Returns the positive sample offset of the
  first decodable DRED sample, 0 if no DRED is present, or an error.
  Populates state.samples_available so subsequent reconstruct calls know
  the valid offset range.
- `DecoderHandle::reconstruct_from_dred(&mut self, state, offset_samples,
  output)` wraps opus_decoder_dred_decode. Reconstructs audio at a specific
  sample position (positive, measured backward from the DRED anchor packet)
  into a caller-provided output buffer. Validates that 0 < offset_samples
  <= state.samples_available() before calling the FFI to catch range bugs.

Tests (+7, wzp-codec total: 68 passing):
- dred_decoder_handle_creates_and_drops
- dred_state_creates_and_drops
- dred_state_reset_zeroes_counter
- dred_parse_and_reconstruct_roundtrip — end-to-end validation. Encodes
  60 frames of a 300 Hz sine wave through a DRED-enabled Opus 24k encoder,
  parses DRED state out of each arriving packet, asserts that at least one
  packet carries non-zero samples_available (DRED warm-up completes within
  the first second), then reconstructs 20 ms of audio from inside the
  window and asserts non-zero total energy. This is the hard signal that
  the full libopus 1.5 DRED FFI chain is correctly wired on our side.
- reconstruct_with_out_of_range_offset_errors — offset > samples_available
  is rejected at the Rust layer before the FFI call.
- reconstruct_with_zero_offset_errors — offset <= 0 rejected.
- dred_parse_empty_packet_returns_zero — graceful handling of empty input.

Architectural note (divergence from PRD's literal wording):
The PRD said "jitter buffer takes a Box<dyn DredReconstructor>". After
checking Cargo.toml for wzp-transport, it does NOT depend on wzp-codec —
only wzp-proto. Adding a DRED state ring inside the jitter buffer would
require a new cross-crate dependency and couple the codec-agnostic jitter
buffer to libopus internals. Instead, Phase 3b will put the DRED state
ring and reconstruction dispatch in CallDecoder (one layer up from the
jitter buffer), intercepting the existing PlayoutResult::Missing signal
and attempting reconstruction before falling through to classical PLC.
The jitter buffer itself stays unchanged. Same lookahead/backfill
semantics, cleaner layering. PRD's intent preserved, implementation
refined.

Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 68 passing (61 Phase 2 baseline + 7 new)
- The roundtrip test is the acceptance gate — it proves that
  opus_dred_decoder_create, opus_dred_alloc, opus_dred_parse, and
  opus_decoder_dred_decode all work correctly through our wrappers on
  real libopus 1.5.2 output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:03:14 +04:00
Siavash Sameni
6db5c25b54 feat(codec): Phase 2 — remove RaptorQ from Opus tiers, Codec2 unchanged
Phase 2 of the DRED integration (docs/PRD-dred-integration.md). With
Phase 1 having enabled DRED on every Opus profile, the app-level RaptorQ
layer is now redundant overhead on those tiers: +20% bitrate, +40–100 ms
receive-side latency (block wait), +CPU for stats we never used. This
phase removes RaptorQ from the Opus encode and decode paths on both the
desktop (wzp-client/call.rs) and Android (wzp-android/engine.rs) sides.
Codec2 tiers keep RaptorQ with their current ratios unchanged — DRED is
libopus-only and Codec2 has no neural equivalent.

Encoder changes (the real bandwidth / CPU win):
- CallEncoder::encode_frame and engine.rs encode loop now gate the
  RaptorQ path on !codec.is_opus():
    - Opus source packets emit fec_block=0, fec_symbol=0,
      fec_ratio_encoded=0 in the MediaHeader
    - fec_enc.add_source_symbol is skipped on Opus
    - generate_repair + repair packet emission is skipped on Opus
    - block_id and frame_in_block counters stay frozen at 0 for Opus
- Codec2 path is byte-for-byte identical to pre-Phase-2 behavior.

Decoder changes (mostly cleanup, since both live decoder paths were
already reading audio directly from source packets and only using the
RaptorQ decoder output for stats):
- CallDecoder::ingest skips fec_dec.add_symbol on Opus packets. Source
  packets still flow to the jitter buffer; Opus repair packets from old
  senders are dropped cleanly (repair packets never hit the jitter
  buffer either).
- engine.rs recv loop skips fec_dec.add_symbol, fec_dec.try_decode, and
  fec_dec.expire_before on Opus packets. The `fec_recovered` stat
  counter becomes Codec2-only (a separate DRED reconstruction counter
  lands in Phase 4).

Wire-format backward compat verified at pre-flight:
- Old receiver + new sender: engine.rs pipeline.rs path gates on
  non-zero fec_block/fec_symbol which now never fire for Opus, so the
  RaptorQ decoder simply isn't fed. Audio flows normally. Desktop
  CallDecoder's old path accumulated packets into the stale-eviction
  HashMap, which cleans up after 2s — harmless.
- New receiver + old sender: new receiver skips RaptorQ on Opus so
  old-sender repair packets are ignored entirely (no crash, no double-
  decode). Loses the (previously vestigial) RaptorQ recovery benefit,
  which was never actually active in the audio path. Source packets
  still decode normally.
- No wire format version bump required. MediaHeader is unchanged; we
  just zero the FEC fields on Opus packets.

Test changes:
- Removed `encoder_generates_repair_on_full_block` — asserted the old
  (pre-Phase-2) RaptorQ-on-Opus behavior and is now incorrect. Replaced
  with two symmetric tests:
    - `opus_source_packets_have_zero_fec_header_fields` — verifies
      Phase 2 invariants on Opus packets
    - `opus_encoder_never_emits_repair_packets` — runs 20 frames of
      non-silent sine wave through a GOOD-profile encoder, asserts
      exactly 20 output packets, zero repair
    - `codec2_encoder_generates_repair_on_full_block` — same shape as
      the old test but on CATASTROPHIC profile (Codec2 1200, 8
      frames/block, ratio 1.0) to verify Codec2 path still emits
      repairs as before

Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 61 passing (Phase 1 baseline held)
- cargo test -p wzp-client --lib: 32 passing (+3 new Phase 2 tests,
  -1 old test removed)
- cargo check -p wzp-android --lib: zero errors (host link of
  wzp-android tests fails on -llog per pre-existing Android-only
  build.rs, unrelated to this work; integration build via
  build-and-notify.sh will validate Android end-to-end)
- Pre-existing broken integration test in
  crates/wzp-client/tests/handshake_integration.rs (SignalMessage
  schema drift) is NOT caused by this commit — baseline had the same
  3 compile errors before Phase 2. Flagged as a separate cleanup task.

Expected observable effects on a real call:
- Opus 24k outgoing bitrate drops from ~28.8 kbps (ratio 0.2 RaptorQ)
  to ~25 kbps (base 24 kbps + DRED ~1–10 kbps signal-dependent)
- Opus receive-side latency drops ~40 ms on clean network (no more
  block wait — jitter buffer emits as soon as a source packet arrives)
- Codec2 calls show no latency or bitrate change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:02:42 +04:00
Siavash Sameni
54cbebd34e feat(codec): Phase 1 — enable DRED on all Opus profiles, disable inband FEC
Phase 1 of the DRED integration (docs/PRD-dred-integration.md). The Opus
encoder now emits DRED (Deep REDundancy) bytes in every packet, carrying
a neural-coded history of recent audio that the decoder can use to
reconstruct loss bursts up to the configured window. Opus inband FEC
(LBRR) is disabled because DRED does the same job better and running both
wastes bitrate on overlapping protection.

Tiered DRED duration policy per PRD:
  Studio  (Opus 32k/48k/64k): 10 frames = 100 ms
  Normal  (Opus 16k/24k):     20 frames = 200 ms
  Degraded (Opus 6k):         50 frames = 500 ms

Each profile switch (via adaptive quality) updates the DRED duration to
match the new tier. A 5% packet_loss floor is applied whenever DRED is
active, because libopus 1.5 gates DRED emission on non-zero packet_loss.
Real loss measurements from the quality adapter override upward.

Escape hatch: AUDIO_USE_LEGACY_FEC=1 reverts the encoder to Phase 0
behavior (inband FEC Mode1, DRED off, no loss floor). Read once at
OpusEncoder::new; call-scoped, not re-read mid-call. Trait-level
set_inband_fec becomes a no-op in DRED mode to preserve the invariant
even if external callers forget.

Observations from the bitrate probe test (dred_mode_roundtrip_voice_pattern):
  DRED mode:   3649 bytes/sec (~29.2 kbps) on Opus 24k + 300 Hz sine
  Legacy mode: 2383 bytes/sec (~19.1 kbps)
  Delta:       +10.1 kbps

The delta is considerably larger than the "+1 kbps flat" figure I carried
into the PRD from hazy memory of published DRED benchmarks. Likely because
the input (300 Hz sine) is very compressible so the base Opus rate in
legacy mode is well below the 24 kbps target, making the delta look
disproportionate. Signal-dependent — real speech would probably show a
different ratio. If production telemetry shows the overhead is excessive,
we can cut DRED duration on the normal tier from 200 ms to 100 ms as a
first tuning lever. Not blocking Phase 1 since the test still passes
within the reasonable 2000–8000 bytes/sec bounds.

Test changes (+8 tests, total wzp-codec: 61 passing):
- dred_duration_for_studio_tiers_is_100ms  (per-profile policy)
- dred_duration_for_normal_tiers_is_200ms
- dred_duration_for_degraded_tier_is_500ms
- dred_duration_for_codec2_is_zero
- default_mode_is_dred_not_legacy  (sanity check on fresh construction)
- dred_mode_roundtrip_voice_pattern  (observes DRED bitrate, asserts bounds)
- profile_switch_refreshes_dred_duration  (verifies set_profile updates DRED)
- set_inband_fec_noop_in_dred_mode  (trait-level inband FEC no-op)

Verification:
- cargo check --workspace: zero errors, no new warnings
- cargo test -p wzp-codec: 61/61 passing (53 pre-Phase-1 baseline + 8 new)
- Empirical DRED bitrate observed via `rtk proxy cargo test
  dred_mode_roundtrip_voice_pattern -- --nocapture`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:02:35 +04:00
Siavash Sameni
86526a7ad4 feat(codec): Phase 0 — swap audiopus → opusic-c + opusic-sys (libopus 1.5.2)
Phase 0 of the DRED integration (docs/PRD-dred-integration.md). No behavior
change: inband FEC stays ON, no DRED, same bitrate, same quality. This
commit unblocks Phase 1+ by getting us onto libopus 1.5.2 where DRED lives.

Rationale for going straight to a custom DecoderHandle: opusic-c::Decoder's
inner *mut OpusDecoder pointer is pub(crate), so we cannot reach it for the
Phase 3 DRED reconstruction path. Running two parallel decoders (one for
audio, one for DRED) would drift because the DRED decoder wouldn't see
normal decode calls. Single unified DecoderHandle over raw opusic-sys is
the only correct architecture, so we build it in Phase 0 rather than
rewriting opus_dec.rs twice.

Changes:
- Cargo.toml (workspace + wzp-codec): remove audiopus 0.3.0-rc.0, add
  opusic-c 1.5.5 (bundled + dred features), opusic-sys 0.6.0 (bundled),
  bytemuck 1. Pinned exactly for reproducible libopus 1.5.2.
- opus_enc.rs: rewritten against opusic_c::Encoder. Argument order for
  Encoder::new swapped (Channels first). set_inband_fec(bool) now maps
  to InbandFec::Mode1 (the libopus 1.5 equivalent of 1.3's LBRR). encode
  uses bytemuck::cast_slice<i16,u16> at the &[u16] boundary.
- dred_ffi.rs (new): DecoderHandle wrapping *mut OpusDecoder directly via
  opusic-sys. Owns the allocation, frees on Drop. Exposes decode,
  decode_lost, and a pub(crate) as_raw_ptr() for the future Phase 3 DRED
  reconstruction. Send+Sync justified via &mut self access discipline.
- opus_dec.rs: rewritten as a thin AudioDecoder impl over DecoderHandle.
  Behavior identical to pre-swap.

Verification (Phase 0 acceptance gates):
- cargo check --workspace: clean (30 pre-existing warnings in jni_bridge.rs
  unrelated to this work; zero in changed files).
- cargo test -p wzp-codec: 53 tests pass (50 pre-swap + 6 new: 3 in
  dred_ffi.rs for DecoderHandle lifecycle, 3 in opus_enc.rs for version
  check and roundtrip).
- linked_libopus_is_1_5 test asserts opusic_c::version() contains "1.5" —
  hard signal that the swap landed correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:02:15 +04:00
Siavash Sameni
07873ea598 fix(linux-aec): fall back to 0.3 crate + apt lib (2.x bundled is broken)
Some checks failed
Build Release Binaries / build-amd64 (push) Failing after 4m6s
Mirror to GitHub / mirror (push) Failing after 45s
Switch the webrtc-audio-processing dep from the 2.x git source (bundled
mode) back to crates.io 0.3, and link against Debian's apt package
libwebrtc-audio-processing-dev (0.3-1+b1 on Bookworm). The 2.x path
fails because both the crates.io tarball and the upstream git main
branch of webrtc-audio-processing-sys 2.0.3 have a build.rs bug where
\`meson setup --reconfigure\` is passed unconditionally, panicking on
first-run empty build dirs with "Directory does not contain a valid
build tree". The 0.x line sidesteps bundled mode entirely by linking
the apt-provided library.

Trade-off: we get AEC2 (the older generation) instead of AEC3, but
it's the same algorithm family and is what PulseAudio's
module-echo-cancel and PipeWire's filter-chain use on current
Debian-family distros. Fine for shipping — we can revisit AEC3 once
the 2.x bundled build is fixed upstream.

API changes:
- 0.3's Processor::process_capture_frame and process_render_frame
  take &mut self, so wrap the module-level processor in a Mutex.
  Capture and playback threads each lock briefly (sub-ms per 10 ms
  frame); contention is minimal.
- Import NUM_SAMPLES_PER_FRAME from the crate directly instead of
  hardcoding 480, so the code tracks whatever sample rate the
  upstream C++ lib exposes (currently 48 kHz hardcoded -> 480).
- Helper fns drain_frames_through_apm / tee_render_samples / etc.
  take &Mutex<Processor> instead of &Processor.
- Use explicit EchoCancellationSuppressionLevel and
  NoiseSuppressionLevel imports rather than fully-qualified paths.

Dockerfile:
- Drop meson / ninja-build / python3 (only needed for bundled build).
- Add libwebrtc-audio-processing-dev for the system link path.
- Keep clang (may be needed by the bindgen step in some versions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:06:56 +04:00