Phase 3c mirrors Phase 3b on the Android receive path. With Phase 0-3b
landed on desktop + Android encoder, this commit completes codec-layer
loss recovery on the Android decoder side.
Architectural difference vs desktop: engine.rs has NO jitter buffer.
The recv task reads packets directly from the transport via
recv_media().await and writes decoded audio straight into the playout
ring. There is no PlayoutResult::Missing equivalent. Gap detection
therefore has to be done via sequence-number tracking — when a packet
arrives with seq > expected_seq, the frames in between are missing and
we attempt to reconstruct them via DRED before decoding the newly-
arrived packet.
Implementation:
Imports & types:
- Added wzp_codec::AdaptiveDecoder, wzp_codec::dred_ffi::{
DredDecoderHandle, DredState} imports.
- Changed the `decoder` local from Box<dyn AudioDecoder> (via
wzp_codec::create_decoder) to concrete AdaptiveDecoder::new(profile).
Same reasoning as Phase 3b: reconstruct_from_dred is an inherent
method, not a trait method, so we need the concrete type.
Recv task state (all task-local, no new struct fields):
- dred_decoder: DredDecoderHandle
- dred_parse_scratch: DredState (reused, overwritten per parse)
- last_good_dred: DredState (cached most-recent valid state)
- last_good_dred_seq: Option<u16>
- expected_seq: Option<u16> (for gap detection)
- dred_reconstructions: u64 (telemetry)
- classical_plc_invocations: u64 (telemetry)
Recv loop body (Opus source packets only):
1. Parse DRED from the new packet first so last_good_dred reflects
the freshest state available for gap recovery.
2. Detect a gap: gap = pkt.seq.wrapping_sub(expected_seq). Cap at
MAX_GAP_FRAMES = 16 (320 ms) to avoid huge wraparound scenarios.
3. For each missing seq in the gap:
offset = (last_good_dred_seq - missing_seq) * frame_samples
if 0 < offset <= last_good_dred.samples_available():
reconstruct_from_dred + write to playout ring
bump dred_reconstructions
else:
decoder.decode_lost (classical PLC) + write + bump plc counter
4. Decode the current packet normally and write to playout ring
(unchanged from Phase 2).
5. Update expected_seq = pkt.seq.wrapping_add(1).
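The seq arithmetic in steps 2, 3, and 5 can be sketched standalone. This is a logic-only sketch: `MAX_GAP_FRAMES` and `frame_samples` follow the text above, while the actual DRED reconstruction and playout-ring writes are elided.

```rust
const MAX_GAP_FRAMES: u16 = 16; // 320 ms at 20 ms frames

/// Seqs missing between `expected` and the arriving `pkt_seq`, or empty for
/// in-order, first-packet, or stale (late) arrivals.
fn missing_seqs(expected: Option<u16>, pkt_seq: u16) -> Vec<u16> {
    let Some(expected) = expected else { return Vec::new() };
    let gap = pkt_seq.wrapping_sub(expected);
    // gap == 0: in-order. gap > MAX_GAP_FRAMES also covers late arrivals
    // (seq < expected wraps to a huge value), which are dropped as stale.
    if gap == 0 || gap > MAX_GAP_FRAMES {
        return Vec::new();
    }
    (0..gap).map(|i| expected.wrapping_add(i)).collect()
}

/// Step 3's offset check: Some(offset) if `missing_seq` falls inside the
/// cached DRED window anchored at `last_good_seq`, else None (use PLC).
fn dred_offset(last_good_seq: u16, missing_seq: u16, frame_samples: u32,
               samples_available: u32) -> Option<u32> {
    let delta = u32::from(last_good_seq.wrapping_sub(missing_seq));
    let offset = delta * frame_samples;
    (offset > 0 && offset <= samples_available).then_some(offset)
}

fn main() {
    // Packet 13 arrives while 10 was expected: seqs 10..=12 are missing.
    assert_eq!(missing_seqs(Some(10), 13), vec![10, 11, 12]);
    // Late arrival (9 while 10 expected) wraps to gap 65535 and is dropped.
    assert!(missing_seqs(Some(10), 9).is_empty());
    println!("gap detection ok");
}
```

Note that a missing seq newer than the anchor wraps to a huge delta and fails the window check, so it falls through to classical PLC automatically.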
Profile-switch handling: when the incoming codec changes (triggering
decoder.set_profile), reset last_good_dred_seq and expected_seq to
None. The cached DRED state is tied to the old profile's frame rate
and would produce wrong offsets after the switch; starting fresh is
correct.
Decode-error fallback: the existing `Err(e) => decode_lost` branch
now also increments classical_plc_invocations so the counter
accurately reflects all PLC invocations (gap-detected AND decode-
error-triggered).
Telemetry (CallStats additions):
- stats.dred_reconstructions: u64
- stats.classical_plc_invocations: u64
Both updated on every packet arrival in the existing stats.lock()
block alongside frames_decoded/fec_recovered, so the Android UI and
JNI bridge already have these values without any further plumbing.
The periodic recv stats log now includes both counters.
Ordering note: DRED gap reconstruction happens BEFORE decoding the new
packet's audio because the playout ring is FIFO. Gap samples must be
written before the new packet's samples so temporal order is preserved.
Out-of-order late arrivals (seq < expected_seq) are naturally dropped
as stale by the gap detection (gap would be a large wraparound value
exceeding MAX_GAP_FRAMES).
Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 68 passing (unchanged from Phase 3b)
- cargo test -p wzp-client --lib: 35 passing (unchanged from Phase 3b)
- cargo check -p wzp-android --lib: zero errors
- cargo test -p wzp-android cannot run on macOS host (pre-existing
-llog linker dep, unrelated). Real end-to-end verification happens
via the Android APK build on the remote Docker builder
(scripts/build-and-notify.sh).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3b of the DRED integration — wires the Phase 3a FFI primitives
into the desktop receive path. When the jitter buffer reports a missing
Opus frame, CallDecoder now attempts to reconstruct the audio from the
most recently parsed DRED side-channel state before falling through to
classical PLC.
Architectural refinement vs the PRD's literal wording: the PRD said
"jitter buffer takes a Box<dyn DredReconstructor>". After checking deps,
wzp-transport depends only on wzp-proto (not wzp-codec). Putting DRED
state in the jitter buffer would require a new cross-crate dep and
couple the codec-agnostic buffer to libopus. Instead, this commit keeps
the DRED state ring and reconstruction dispatch inside CallDecoder (one
layer up from the jitter buffer), intercepting the existing
PlayoutResult::Missing signal. Same lookahead/backfill semantics,
cleaner layering, zero change to wzp-transport.
Changes:
CallDecoder field type: Box<dyn AudioDecoder> → AdaptiveDecoder.
Required because Phase 3b calls the inherent reconstruct_from_dred
method, which cannot live on the AudioDecoder trait without dragging
libopus DredState through wzp-proto. In practice AdaptiveDecoder was
the only AudioDecoder implementor anyway — the trait abstraction was
buying nothing. Method call sites unchanged because AdaptiveDecoder
also implements AudioDecoder.
New CallDecoder fields:
- dred_decoder: DredDecoderHandle
- dred_parse_scratch: DredState (scratch for parse_into)
- last_good_dred: DredState (cached most-recent valid state)
- last_good_dred_seq: Option<u16>
- dred_reconstructions: u64 (Phase 4 telemetry)
- classical_plc_invocations: u64 (Phase 4 telemetry)
CallDecoder::ingest — on Opus non-repair packets, parse DRED into the
scratch state. On success (samples_available > 0), std::mem::swap the
scratch into last_good_dred and record the seq. This is O(1) per
packet, zero allocation after construction (the two DredState buffers
are allocated once in new() and reused forever).
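A minimal sketch of that zero-allocation parse/cache pattern. `DredState` here is a plain stand-in struct, not the real FFI wrapper; the point is the swap-don't-clone discipline between the scratch and cached states.

```rust
struct DredState {
    samples_available: u32,
    _buf: Vec<i16>, // stands in for the fixed libopus-owned buffer
}

struct DredCache {
    scratch: DredState,
    last_good: DredState,
    last_good_seq: Option<u16>,
}

impl DredCache {
    fn new() -> Self {
        // Both states are allocated once here and reused forever.
        let mk = || DredState { samples_available: 0, _buf: vec![0; 1024] };
        Self { scratch: mk(), last_good: mk(), last_good_seq: None }
    }

    /// Called after the parser wrote the freshest state into `scratch`.
    fn commit_parse(&mut self, seq: u16) {
        if self.scratch.samples_available > 0 {
            // O(1): swap the buffers instead of copying ~10.6 KB.
            std::mem::swap(&mut self.scratch, &mut self.last_good);
            self.last_good_seq = Some(seq);
        }
    }
}

fn main() {
    let mut cache = DredCache::new();
    cache.scratch.samples_available = 4560; // as if parse_into succeeded
    cache.commit_parse(42);
    assert_eq!(cache.last_good_seq, Some(42));
    println!("cached DRED state anchored at seq {:?}", cache.last_good_seq);
}
```

A failed parse (samples_available == 0) leaves both the cache and its seq anchor untouched, so stale-but-valid state survives packets with no DRED payload.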
CallDecoder::decode_next — on PlayoutResult::Missing(seq) for Opus
profiles: if last_good_dred_seq > seq and the seq delta × frame_samples
fits within samples_available, call audio_dec.reconstruct_from_dred
and bump dred_reconstructions. Otherwise fall through to classical
PLC and bump classical_plc_invocations. The Codec2 path always falls
through to classical PLC since DRED is libopus-only and
AdaptiveDecoder::reconstruct_from_dred rejects Codec2 tiers
explicitly.
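The dispatch on the Missing branch reduces to a small pure function. This is a logic-only sketch (no real decode calls); the counter names follow the telemetry fields above, and the Codec2 rejection matches the behavior described.

```rust
#[derive(Debug, PartialEq)]
enum Recovery { Dred, ClassicalPlc }

#[derive(Default)]
struct Counters { dred_reconstructions: u64, classical_plc_invocations: u64 }

fn recover_missing(
    is_opus: bool,
    missing_seq: u16,
    last_good_seq: Option<u16>,
    frame_samples: u32,
    samples_available: u32,
    c: &mut Counters,
) -> Recovery {
    if is_opus {
        if let Some(anchor) = last_good_seq {
            let offset = u32::from(anchor.wrapping_sub(missing_seq)) * frame_samples;
            if offset > 0 && offset <= samples_available {
                c.dred_reconstructions += 1;
                return Recovery::Dred;
            }
        }
    }
    // Codec2 (DRED is libopus-only), no cached state, or out-of-window delta.
    c.classical_plc_invocations += 1;
    Recovery::ClassicalPlc
}

fn main() {
    let mut c = Counters::default();
    // Opus, anchor at seq 12, 95 ms window (4560 samples), 20 ms frames.
    assert_eq!(recover_missing(true, 11, Some(12), 960, 4560, &mut c), Recovery::Dred);
    // Codec2 always falls through to classical PLC.
    assert_eq!(recover_missing(false, 11, Some(12), 960, 4560, &mut c), Recovery::ClassicalPlc);
    println!("dred={} plc={}", c.dred_reconstructions, c.classical_plc_invocations);
}
```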
OpusDecoder and AdaptiveDecoder: new inherent reconstruct_from_dred
method that delegates to the underlying DecoderHandle. Needed to
bridge CallDecoder's wzp-client code to the Phase 3a FFI wrappers
without touching the AudioDecoder trait.
CRITICAL FINDING — raised DRED loss floor from 5% to 15%:
Phase 3b testing discovered that libopus 1.5's DRED emission window
scales aggressively with OPUS_SET_PACKET_LOSS_PERC. Empirical data
(see probe_dred_samples_available_by_loss_floor, an #[ignore]'d
diagnostic test in call.rs):
      loss_pct    samples_available    effective_ms
        5%               720            15 ms  (useless!)
       10%              2640            55 ms
       15%              4560            95 ms
       20%              6480           135 ms
       25%+             8400 (capped)  175 ms  (~87% of 200 ms configured)
The Phase 1 default of 5% produced only a 15 ms reconstruction window
— too small to even cover a single 20 ms Opus frame. DRED was
effectively disabled even though it was emitting bytes. Raised the
floor to 15% (95 ms window) as the minimum that actually provides
single-frame loss recovery. This updates Phase 1's DRED_LOSS_FLOOR_PCT
constant in opus_enc.rs and the accompanying module docstring.
Trade-off: 15% assumed loss slightly increases encoder bitrate overhead
on clean networks. Measured via the existing phase1 bitrate probe:
Before (5% floor): 3649 bytes/sec at Opus 24k + 300 Hz sine
After (15% floor): 3568 bytes/sec at Opus 24k + 300 Hz sine
The delta is within noise — 15% isn't meaningfully more expensive than
5% on this signal, which suggests the DRED emission size is signal-
dependent rather than loss-dependent for small values. Net result: we
get a 6x larger reconstruction window for essentially free.
Tests (+3 DRED recovery, +1 #[ignore]'d probe):
- opus_single_packet_loss_is_recovered_via_dred — full encode → ingest
→ decode_next loop with one packet dropped mid-stream. Asserts
dred_reconstructions ≥ 1 and observes the exact counter deltas.
- opus_lossless_ingest_never_triggers_dred_or_plc — baseline behavior,
lossless stream never takes the Missing branch.
- codec2_loss_falls_through_to_classical_plc — Codec2 never
reconstructs via DRED even if state were populated (which it won't
be — Codec2 packets don't carry DRED bytes).
- probe_dred_samples_available_by_loss_floor — #[ignore]'d diagnostic
that sweeps loss_pct values and prints the resulting DRED window
sizes. Kept for future tuning work.
New CallDecoder introspection accessors (public but undocumented in
the PRD): last_good_dred_seq() and last_good_dred_samples_available()
for test diagnostics and future telemetry surfaces in Phase 4.
Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 68 passing (Phase 3a baseline held)
- cargo test -p wzp-client --lib: 35 passing (+3 Phase 3b tests,
+1 ignored diagnostic, no regressions)
Next up: Phase 3c mirrors this on the Android engine.rs receive path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3a of the DRED integration — the foundation for codec-layer loss
recovery. Adds three new safe wrappers to crates/wzp-codec/src/dred_ffi.rs
over the raw opusic-sys FFI, plus the reconstruction method on the existing
DecoderHandle. No call-site integration yet — that lands in Phase 3b (desktop)
and Phase 3c (Android).
New types:
- `DredDecoderHandle`: owns *mut OpusDREDDecoder from opus_dred_decoder_create.
Used for parsing DRED side-channel data out of arriving Opus packets.
This is a SEPARATE libopus object from OpusDecoder — it has its own
internal state. Freed via opus_dred_decoder_destroy on Drop.
- `DredState`: owns *mut OpusDRED from opus_dred_alloc (a fixed ~10.6 KB
buffer per libopus 1.5). Holds parsed DRED data between the parse and
reconstruct steps. Reusable — parse_into overwrites contents. Tracks
samples_available as a cached u32 so callers don't thread the value
separately. Freed via opus_dred_free on Drop.
New methods:
- `DredDecoderHandle::parse_into(&mut self, state: &mut DredState, packet)`
wraps opus_dred_parse with max_dred_samples=48000 (1 s max),
sampling_rate=48000, defer_processing=0. Returns the positive sample
offset of the
first decodable DRED sample, 0 if no DRED is present, or an error.
Populates state.samples_available so subsequent reconstruct calls know
the valid offset range.
- `DecoderHandle::reconstruct_from_dred(&mut self, state, offset_samples,
output)` wraps opus_decoder_dred_decode. Reconstructs audio at a specific
sample position (positive, measured backward from the DRED anchor packet)
into a caller-provided output buffer. Validates that 0 < offset_samples
<= state.samples_available() before calling the FFI to catch range bugs.
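The pre-FFI range check can be written as a hypothetical standalone function (in the real code it lives inside `DecoderHandle::reconstruct_from_dred`):

```rust
fn validate_dred_offset(offset_samples: i32, samples_available: u32) -> Result<(), String> {
    if offset_samples <= 0 {
        return Err(format!("DRED offset must be positive, got {offset_samples}"));
    }
    if offset_samples as u32 > samples_available {
        return Err(format!(
            "DRED offset {offset_samples} exceeds window of {samples_available} samples"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_dred_offset(960, 4560).is_ok());
    assert!(validate_dred_offset(0, 4560).is_err());    // zero offset rejected
    assert!(validate_dred_offset(9600, 4560).is_err()); // out of range rejected
    println!("range checks ok");
}
```

Catching these at the Rust layer means a range bug surfaces as an Err with context rather than undefined behavior inside libopus.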
Tests (+7, wzp-codec total: 68 passing):
- dred_decoder_handle_creates_and_drops
- dred_state_creates_and_drops
- dred_state_reset_zeroes_counter
- dred_parse_and_reconstruct_roundtrip — end-to-end validation. Encodes
60 frames of a 300 Hz sine wave through a DRED-enabled Opus 24k encoder,
parses DRED state out of each arriving packet, asserts that at least one
packet carries non-zero samples_available (DRED warm-up completes within
the first second), then reconstructs 20 ms of audio from inside the
window and asserts non-zero total energy. This is the hard signal that
the full libopus 1.5 DRED FFI chain is correctly wired on our side.
- reconstruct_with_out_of_range_offset_errors — offset > samples_available
is rejected at the Rust layer before the FFI call.
- reconstruct_with_zero_offset_errors — offset <= 0 rejected.
- dred_parse_empty_packet_returns_zero — graceful handling of empty input.
Architectural note (divergence from PRD's literal wording):
The PRD said "jitter buffer takes a Box<dyn DredReconstructor>". After
checking Cargo.toml for wzp-transport, it does NOT depend on wzp-codec —
only wzp-proto. Adding a DRED state ring inside the jitter buffer would
require a new cross-crate dependency and couple the codec-agnostic jitter
buffer to libopus internals. Instead, Phase 3b will put the DRED state
ring and reconstruction dispatch in CallDecoder (one layer up from the
jitter buffer), intercepting the existing PlayoutResult::Missing signal
and attempting reconstruction before falling through to classical PLC.
The jitter buffer itself stays unchanged. Same lookahead/backfill
semantics, cleaner layering. PRD's intent preserved, implementation
refined.
Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 68 passing (61 Phase 2 baseline + 7 new)
- The roundtrip test is the acceptance gate — it proves that
opus_dred_decoder_create, opus_dred_alloc, opus_dred_parse, and
opus_decoder_dred_decode all work correctly through our wrappers on
real libopus 1.5.2 output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 2 of the DRED integration (docs/PRD-dred-integration.md). With
Phase 1 having enabled DRED on every Opus profile, the app-level RaptorQ
layer is now redundant overhead on those tiers: +20% bitrate, +40–100 ms
receive-side latency (block wait), +CPU for stats we never used. This
phase removes RaptorQ from the Opus encode and decode paths on both the
desktop (wzp-client/call.rs) and Android (wzp-android/engine.rs) sides.
Codec2 tiers keep RaptorQ with their current ratios unchanged — DRED is
libopus-only and Codec2 has no neural equivalent.
Encoder changes (the real bandwidth / CPU win):
- CallEncoder::encode_frame and engine.rs encode loop now gate the
RaptorQ path on !codec.is_opus():
- Opus source packets emit fec_block=0, fec_symbol=0,
fec_ratio_encoded=0 in the MediaHeader
- fec_enc.add_source_symbol is skipped on Opus
- generate_repair + repair packet emission is skipped on Opus
- block_id and frame_in_block counters stay frozen at 0 for Opus
- Codec2 path is byte-for-byte identical to pre-Phase-2 behavior.
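The header-field gating reduces to a sketch like the following. `MediaHeader` here is a simplified stand-in for the real wire struct (field types approximated); the invariant shown is the one the new tests assert.

```rust
#[derive(Default, Debug, PartialEq)]
struct MediaHeader {
    fec_block: u32,
    fec_symbol: u16,
    fec_ratio_encoded: u8,
}

/// Opus tiers rely on DRED, so RaptorQ fields stay frozen at zero;
/// Codec2 tiers keep their block/symbol counters exactly as before.
fn fec_header(is_opus: bool, block_id: u32, symbol: u16, ratio_encoded: u8) -> MediaHeader {
    if is_opus {
        MediaHeader::default()
    } else {
        MediaHeader { fec_block: block_id, fec_symbol: symbol, fec_ratio_encoded: ratio_encoded }
    }
}

fn main() {
    let opus = fec_header(true, 3, 5, 51);
    assert_eq!(opus, MediaHeader::default()); // all-zero on Opus
    let c2 = fec_header(false, 3, 5, 255);
    assert_eq!(c2.fec_block, 3);
    println!("header gating ok");
}
```

Because old receivers gate the RaptorQ decoder on non-zero fec_block/fec_symbol, the all-zero Opus header is what makes this change wire-compatible without a version bump.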
Decoder changes (mostly cleanup, since both live decoder paths were
already reading audio directly from source packets and only using the
RaptorQ decoder output for stats):
- CallDecoder::ingest skips fec_dec.add_symbol on Opus packets. Source
packets still flow to the jitter buffer; Opus repair packets from old
senders are dropped cleanly (repair packets never hit the jitter
buffer either).
- engine.rs recv loop skips fec_dec.add_symbol, fec_dec.try_decode, and
fec_dec.expire_before on Opus packets. The `fec_recovered` stat
counter becomes Codec2-only (a separate DRED reconstruction counter
lands in Phase 4).
Wire-format backward compat verified at pre-flight:
- Old receiver + new sender: the engine.rs/pipeline.rs path gates on
non-zero fec_block/fec_symbol, which now never fire for Opus, so the
RaptorQ decoder simply isn't fed. Audio flows normally. Desktop
CallDecoder's old path accumulated packets into the stale-eviction
HashMap, which cleans up after 2s — harmless.
- New receiver + old sender: new receiver skips RaptorQ on Opus so
old-sender repair packets are ignored entirely (no crash, no double-
decode). Loses the (previously vestigial) RaptorQ recovery benefit,
which was never actually active in the audio path. Source packets
still decode normally.
- No wire format version bump required. MediaHeader is unchanged; we
just zero the FEC fields on Opus packets.
Test changes:
- Removed `encoder_generates_repair_on_full_block` — asserted the old
(pre-Phase-2) RaptorQ-on-Opus behavior and is now incorrect. Replaced
with two symmetric tests:
- `opus_source_packets_have_zero_fec_header_fields` — verifies
Phase 2 invariants on Opus packets
- `opus_encoder_never_emits_repair_packets` — runs 20 frames of
non-silent sine wave through a GOOD-profile encoder, asserts
exactly 20 output packets, zero repair
- `codec2_encoder_generates_repair_on_full_block` — same shape as
the old test but on CATASTROPHIC profile (Codec2 1200, 8
frames/block, ratio 1.0) to verify Codec2 path still emits
repairs as before
Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-codec --lib: 61 passing (Phase 1 baseline held)
- cargo test -p wzp-client --lib: 32 passing (+3 new Phase 2 tests,
-1 old test removed)
- cargo check -p wzp-android --lib: zero errors (host link of
wzp-android tests fails on -llog per pre-existing Android-only
build.rs, unrelated to this work; integration build via
build-and-notify.sh will validate Android end-to-end)
- Pre-existing broken integration test in
crates/wzp-client/tests/handshake_integration.rs (SignalMessage
schema drift) is NOT caused by this commit — baseline had the same
3 compile errors before Phase 2. Flagged as a separate cleanup task.
Expected observable effects on a real call:
- Opus 24k outgoing bitrate drops from ~28.8 kbps (ratio 0.2 RaptorQ)
to ~25 kbps (base 24 kbps + DRED ~1–10 kbps signal-dependent)
- Opus receive-side latency drops ~40 ms on clean network (no more
block wait — jitter buffer emits as soon as a source packet arrives)
- Codec2 calls show no latency or bitrate change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 of the DRED integration (docs/PRD-dred-integration.md). The Opus
encoder now emits DRED (Deep REDundancy) bytes in every packet, carrying
a neural-coded history of recent audio that the decoder can use to
reconstruct loss bursts up to the configured window. Opus inband FEC
(LBRR) is disabled because DRED does the same job better and running both
wastes bitrate on overlapping protection.
Tiered DRED duration policy per PRD:
Studio (Opus 32k/48k/64k): 10 frames = 100 ms
Normal (Opus 16k/24k): 20 frames = 200 ms
Degraded (Opus 6k): 50 frames = 500 ms
Each profile switch (via adaptive quality) updates the DRED duration to
match the new tier. A 5% packet_loss floor is applied whenever DRED is
active, because libopus 1.5 gates DRED emission on non-zero packet_loss.
Real loss measurements from the quality adapter override upward.
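The policy above can be sketched as a pair of pure functions. Profiles are modeled here by Opus bitrate in kbps (the real code matches on the profile enum); durations are in DRED frames, 10 ms each per the tier table.

```rust
/// DRED duration in frames for an Opus profile, per the tier policy:
/// Studio 10 (100 ms), Normal 20 (200 ms), Degraded 50 (500 ms).
fn dred_frames_for_opus_kbps(kbps: u32) -> u32 {
    match kbps {
        32 | 48 | 64 => 10, // Studio tiers
        16 | 24 => 20,      // Normal tiers
        6 => 50,            // Degraded tier
        _ => 20,            // assumption: unlisted Opus rates treated as Normal
    }
}

/// Loss percentage handed to OPUS_SET_PACKET_LOSS_PERC while DRED is
/// active: floored at 5%, real measurements override upward.
fn effective_loss_pct(measured: u32) -> u32 {
    measured.max(5)
}

fn main() {
    assert_eq!(dred_frames_for_opus_kbps(48), 10);
    assert_eq!(dred_frames_for_opus_kbps(24), 20);
    assert_eq!(dred_frames_for_opus_kbps(6), 50);
    assert_eq!(effective_loss_pct(0), 5);  // floor applies on clean networks
    assert_eq!(effective_loss_pct(12), 12); // real loss overrides upward
    println!("tier policy ok");
}
```

(Phase 3b later raises this floor to 15% after measuring the DRED window; the 5% shown matches this commit.)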
Escape hatch: AUDIO_USE_LEGACY_FEC=1 reverts the encoder to Phase 0
behavior (inband FEC Mode1, DRED off, no loss floor). Read once at
OpusEncoder::new; call-scoped, not re-read mid-call. Trait-level
set_inband_fec becomes a no-op in DRED mode to preserve the invariant
even if external callers forget.
Observations from the bitrate probe test (dred_mode_roundtrip_voice_pattern):
DRED mode: 3649 bytes/sec (~29.2 kbps) on Opus 24k + 300 Hz sine
Legacy mode: 2383 bytes/sec (~19.1 kbps)
Delta: +10.1 kbps
The delta is considerably larger than the "+1 kbps flat" figure I carried
into the PRD from hazy memory of published DRED benchmarks. Likely because
the input (300 Hz sine) is very compressible so the base Opus rate in
legacy mode is well below the 24 kbps target, making the delta look
disproportionate. Signal-dependent — real speech would probably show a
different ratio. If production telemetry shows the overhead is excessive,
we can cut DRED duration on the normal tier from 200 ms to 100 ms as a
first tuning lever. Not blocking Phase 1 since the test still passes
within the reasonable 2000–8000 bytes/sec bounds.
Test changes (+8 tests, total wzp-codec: 61 passing):
- dred_duration_for_studio_tiers_is_100ms (per-profile policy)
- dred_duration_for_normal_tiers_is_200ms
- dred_duration_for_degraded_tier_is_500ms
- dred_duration_for_codec2_is_zero
- default_mode_is_dred_not_legacy (sanity check on fresh construction)
- dred_mode_roundtrip_voice_pattern (observes DRED bitrate, asserts bounds)
- profile_switch_refreshes_dred_duration (verifies set_profile updates DRED)
- set_inband_fec_noop_in_dred_mode (trait-level inband FEC no-op)
Verification:
- cargo check --workspace: zero errors, no new warnings
- cargo test -p wzp-codec: 61/61 passing (53 pre-Phase-1 baseline + 8 new)
- Empirical DRED bitrate observed via `rtk proxy cargo test
dred_mode_roundtrip_voice_pattern -- --nocapture`
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 0 of the DRED integration (docs/PRD-dred-integration.md). No behavior
change: inband FEC stays ON, no DRED, same bitrate, same quality. This
commit unblocks Phase 1+ by getting us onto libopus 1.5.2 where DRED lives.
Rationale for going straight to a custom DecoderHandle: opusic_c::Decoder's
inner *mut OpusDecoder pointer is pub(crate), so we cannot reach it for the
Phase 3 DRED reconstruction path. Running two parallel decoders (one for
audio, one for DRED) would drift because the DRED decoder wouldn't see
normal decode calls. Single unified DecoderHandle over raw opusic-sys is
the only correct architecture, so we build it in Phase 0 rather than
rewriting opus_dec.rs twice.
Changes:
- Cargo.toml (workspace + wzp-codec): remove audiopus 0.3.0-rc.0, add
opusic-c 1.5.5 (bundled + dred features), opusic-sys 0.6.0 (bundled),
bytemuck 1. Pinned exactly for reproducible libopus 1.5.2.
- opus_enc.rs: rewritten against opusic_c::Encoder. Argument order for
Encoder::new swapped (Channels first). set_inband_fec(bool) now maps
to InbandFec::Mode1 (the libopus 1.5 equivalent of 1.3's LBRR). encode
uses bytemuck::cast_slice<i16,u16> at the &[u16] boundary.
- dred_ffi.rs (new): DecoderHandle wrapping *mut OpusDecoder directly via
opusic-sys. Owns the allocation, frees on Drop. Exposes decode,
decode_lost, and a pub(crate) as_raw_ptr() for the future Phase 3 DRED
reconstruction. Send+Sync justified via &mut self access discipline.
- opus_dec.rs: rewritten as a thin AudioDecoder impl over DecoderHandle.
Behavior identical to pre-swap.
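The ownership pattern DecoderHandle uses is sketched below with a Box standing in for the libopus allocation (opus_decoder_create / opus_decoder_destroy in the real code): own the raw pointer, free it exactly once in Drop.

```rust
struct FakeOpusDecoder { channels: u8 } // stand-in for the opaque C struct

struct DecoderHandle { ptr: *mut FakeOpusDecoder }

impl DecoderHandle {
    fn new(channels: u8) -> Self {
        // Real code: opus_decoder_create(48_000, channels, &mut err)
        Self { ptr: Box::into_raw(Box::new(FakeOpusDecoder { channels })) }
    }
    fn channels(&self) -> u8 {
        // Safe: ptr is valid for the lifetime of self and freed only in Drop.
        unsafe { (*self.ptr).channels }
    }
}

impl Drop for DecoderHandle {
    fn drop(&mut self) {
        // Real code: opus_decoder_destroy(self.ptr)
        unsafe { drop(Box::from_raw(self.ptr)) };
    }
}

// As in the real handle, Send is justified by &mut self access discipline:
// the pointer is never aliased across threads.
unsafe impl Send for DecoderHandle {}

fn main() {
    let h = DecoderHandle::new(1);
    assert_eq!(h.channels(), 1);
    drop(h); // frees the underlying allocation exactly once
    println!("handle lifecycle ok");
}
```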
Verification (Phase 0 acceptance gates):
- cargo check --workspace: clean (30 pre-existing warnings in jni_bridge.rs
unrelated to this work; zero in changed files).
- cargo test -p wzp-codec: 53 tests pass (50 pre-swap + 6 new: 3 in
dred_ffi.rs for DecoderHandle lifecycle, 3 in opus_enc.rs for version
check and roundtrip).
- linked_libopus_is_1_5 test asserts opusic_c::version() contains "1.5" —
hard signal that the swap landed correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SignalManager (NEW):
- Dedicated Rust struct with its own QUIC connection to _signal
- Separate JNI handle (nativeSignalConnect/GetState/PlaceCall/etc)
- Kotlin wrapper polls state every 500ms via getState() JSON
- Lives independently of WzpEngine — survives across calls
- connect() blocks briefly on an 8 MB-stack thread, then the recv loop runs on a dedicated thread
WzpEngine (CLEANED):
- Back to pure media-only role (audio, codec, FEC, jitter)
- Removed start_signaling/place_call/answer_call methods
- Removed signal_transport/signal_fingerprint from EngineState
CallViewModel:
- Two separate managers: signalManager (persistent) + engine (per-call)
- Two separate polling loops: signalPollJob + statsJob
- Auto-connect to media room when signal polling detects "setup" state
- hangupDirectCall() ends media but keeps signal alive
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Don't set callState for signal-only states (prevents auto-join room)
- Store signal transport + fingerprint in EngineState after registration
- place_call/answer_call send directly via signal transport (not command channel)
- Spawn small threads for async signal sends (non-blocking)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Spawn signaling on dedicated thread with 4MB stack instead of using
Android's IO dispatcher thread (insufficient stack for tokio + QUIC)
- Add "direct-call-v1" version marker to home screen subtitle
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When relay listens on 0.0.0.0, derive the actual IP from the client's
connection address for the CallSetup message.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Derive a 4-digit code from the shared DH secret via HKDF with label
"warzone-sas-code". Both peers compute the same code; a MITM relay
produces a different one. Users compare verbally during the call.
- CryptoSession::sas_code() -> Option<u32> on the trait
- ChaChaSession stores and returns the SAS
- HKDF derivation in WarzoneKeyExchange::derive_session()
- Tests: both peers match, MITM produces different code
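A sketch of the final reduction step only: folding the HKDF output (derived from the shared DH secret with label "warzone-sas-code" in the real code) into a 4-digit code. The HKDF expansion itself is elided, and the modulo-10000 fold is an assumption for illustration.

```rust
/// `okm` stands for the first 4 bytes of HKDF output keying material.
fn sas_code_from_okm(okm: [u8; 4]) -> u32 {
    u32::from_be_bytes(okm) % 10_000
}

fn main() {
    // Both peers expand the same shared secret, so they fold to the same
    // code; a MITM with a different secret almost surely yields another one.
    let alice = sas_code_from_okm([0xde, 0xad, 0xbe, 0xef]);
    let bob = sas_code_from_okm([0xde, 0xad, 0xbe, 0xef]);
    assert_eq!(alice, bob);
    assert!(alice < 10_000);
    println!("SAS: {:04}", alice);
}
```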
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Call rooms (call-*) restricted to the two authorized participants only
- Room capacity enforced at 2 for call rooms
- Unauthorized clients get immediate connection close
- Unified fingerprint format: SHA-256(Ed25519 pub)[:16] as xxxx:xxxx:...
Used consistently in signal registration, handshake, and ACL checks
Tested: Alice+Bob authorized, attacker rejected with "not authorized"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New feature: call someone directly by fingerprint through the relay.
- Client connects with SNI "_signal" for persistent signaling
- RegisterPresence/RegisterPresenceAck for relay registration
- DirectCallOffer routed to target by fingerprint
- DirectCallAnswer with AcceptGeneric/AcceptTrusted/Reject modes
- Relay creates private room (call-{id}), sends CallSetup to both
- Both clients connect to private room for media (existing SFU path)
- Hangup forwarding + cleanup on disconnect
- Desktop CLI: --signal + --call <fingerprint> for testing
- CallRegistry tracks call state (Pending/Ringing/Active/Ended)
- SignalHub manages persistent signaling connections
Tested: Alice calls Bob by fingerprint, relay routes offer, Bob
auto-accepts, both join private room, media flows bidirectionally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Time-based dedup (2s TTL) replaces fixed-window dedup — consecutive
senders with same seq numbers no longer collide
- Raw byte forwarding for federation local delivery (no re-serialization)
- Jitter buffer resets on large backward seq jumps (>100)
- recv_media skips malformed datagrams instead of returning connection-closed
- SIGTERM handler for clean QUIC shutdown on wzp-client
- JSONL event log infrastructure (--event-log flag) for protocol analysis
- FEC disabled on GOOD profile for federation debugging (fec_ratio=0.0)
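The time-based dedup can be sketched as a TTL map: a key seen within the TTL is a duplicate, older entries expire, so a new sender reusing the same seq numbers after a pause is no longer dropped. Key shape and the linear sweep are simplifications for clarity.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct TtlDedup {
    ttl: Duration,
    seen: HashMap<u64, Instant>,
}

impl TtlDedup {
    fn new(ttl: Duration) -> Self {
        Self { ttl, seen: HashMap::new() }
    }

    /// True if this key was already seen within the TTL.
    fn is_duplicate(&mut self, key: u64, now: Instant) -> bool {
        // Expire stale entries first so old seq numbers can recur.
        self.seen.retain(|_, t| now.duration_since(*t) < self.ttl);
        match self.seen.get(&key) {
            Some(_) => true,
            None => {
                self.seen.insert(key, now);
                false
            }
        }
    }
}

fn main() {
    let mut d = TtlDedup::new(Duration::from_secs(2));
    let t0 = Instant::now();
    assert!(!d.is_duplicate(7, t0)); // first sighting passes
    assert!(d.is_duplicate(7, t0));  // immediate repeat dropped
    assert!(!d.is_duplicate(7, t0 + Duration::from_secs(3))); // TTL expired
    println!("ttl dedup ok");
}
```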
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Federation media from different senders had conflicting seq numbers,
FEC block IDs, and Opus decoder state. The relay now assigns fresh
monotonic seq/fec_block/fec_symbol to all federation-delivered packets,
ensuring clients see a clean continuous stream regardless of sender changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When propagating GlobalRoomActive to other peers, use tagged participants
(with relay_label set to the originating relay) instead of the raw
untagged participants. This shows "Relay C" instead of "Relay B" when
C's participants are forwarded through hub B to A.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a new sender reuses the same block_id values as a previous sender,
the FEC decoder was silently dropping all data because blocks were marked
as "already decoded". Now blocks older than 2 seconds are automatically
reset when new data arrives for them.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dedup key now includes source peer fingerprint hash, preventing
packets from different senders with same room+seq from being dropped
as duplicates (was silently killing all multi-hop audio)
- Build scripts default to --pull (use --no-pull to skip)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Deduplicate remote participants by fingerprint in all merge sites
(canonical == raw room name caused double-lookup, doubling every remote participant)
- GlobalRoomInactive now propagates updated participant list to other peers
(hub relay B was not informing A when C's participants left)
- Add 15-second stale presence sweeper that purges remote participants
from peers that stop sending data (safety net for QUIC timeout delays)
- Add @Synchronized to WzpEngine.getStats/stopCall/destroy to prevent
TOCTOU race between stats polling coroutine and engine teardown (SIGSEGV)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Android default room changed from 'android' to 'general'
- Relay choose_profile capped at GOOD (Opus 24k) — studio tiers
(32k/48k/64k) cause high packet loss on federation paths due to
larger datagrams exceeding path MTU. Will re-enable after MTU
discovery is implemented.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a new federation link is established, announce not only LOCAL
global rooms but also rooms from OTHER peers (remote_participants).
This fixes multi-hop: when R2 connects to R3, R2 tells R3 about
R1's rooms that R2 learned about earlier.
Previously, only local rooms were announced on link setup. If R1
had a client but R2 had no clients, R2 wouldn't tell R3 about R1.
Also added diagnostic logging for room announcements on link setup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes for 3-relay chain (R1→R2→R3):
1. Room lookup in handle_datagram: hub relay (R2) has no local
participants, so active_rooms() was empty and datagrams were
silently dropped. Now also checks global_rooms config directly,
allowing hub relays to forward without local clients.
2. Multi-hop forwarding: removed active_rooms filter — forward to
ALL connected peers except source. The receiving peer decides
whether to deliver or forward further.
3. Android relay_label: native RoomMember now includes relay_label
from RoomUpdate signal. Kotlin UI reads it for relay grouping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a remote relay's room goes inactive (all participants left),
the receiving relay now:
1. Clears remote_participants for that peer+room
2. Broadcasts updated RoomUpdate to local clients with the remote
participant removed
3. Updates federation_active_rooms metric
Previously, remote participants lingered in the participant list
after disconnect, causing ghost entries and stale media forwarding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Connects to a relay over QUIC with SNI "version", reads build hash
from a unidirectional stream, prints "<relay> <git-hash>" and exits.
Usage: wzp-client --version-check 172.16.81.175:4434
Output: 172.16.81.175:4434 8dbda3e
Relay side: detects "version" SNI, opens uni stream, writes
BUILD_GIT_HASH, waits 100ms for client to read, closes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wzp-relay --version prints "wzp-relay <short-git-hash>".
Build hash also logged on startup: version=abc1234.
Enables verifying deployed relay matches expected build.
Also fixed federation-test.sh: use kill -INT (not SIGTERM) so
clients save recordings before exit. Added save delay.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoomParticipant.relay_label identifies which relay a participant is
connected to. Local participants have None, federated participants
get tagged with the peer relay's label when storing remote_participants.
This enables clients to group participants by relay in the UI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. CLI client now sends raw room names (no hash), matching Android
JNI and Desktop Tauri. All three clients are now consistent.
2. When a client joins a global room, the relay merges federated
remote participants into the initial RoomUpdate. Previously,
clients that joined after the GlobalRoomActive signal only saw
local participants. Now they see everyone immediately.
3. Added get_remote_participants() to FederationManager for querying
cached remote participants from all peer links.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wires up the existing RelayMetrics federation fields:
- wzp_federation_peer_status{peer} — 1=connected, 0=disconnected
- wzp_federation_packets_forwarded_total{peer,direction} — in/out counts
- wzp_federation_active_rooms — number of active federated rooms
These are critical for monitoring federation health and will feed into
the adaptive codec selection system (PRD-coordinated-codec.md).
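For reference, the Prometheus text exposition these three metrics produce looks roughly like the output of this sketch (the relay registers them through its metrics library; this hand-rolled renderer is only to show the label shape, and the tuple layout is an assumption):

```rust
// Render the three federation metrics in Prometheus text format.
// peers: (label, connected, forwarded_in, forwarded_out)
pub fn render_federation_metrics(
    peers: &[(&str, bool, u64, u64)],
    active_rooms: usize,
) -> String {
    let mut out = String::new();
    for (peer, connected, fwd_in, fwd_out) in peers {
        out.push_str(&format!(
            "wzp_federation_peer_status{{peer=\"{peer}\"}} {}\n",
            if *connected { 1 } else { 0 }
        ));
        out.push_str(&format!(
            "wzp_federation_packets_forwarded_total{{peer=\"{peer}\",direction=\"in\"}} {fwd_in}\n"
        ));
        out.push_str(&format!(
            "wzp_federation_packets_forwarded_total{{peer=\"{peer}\",direction=\"out\"}} {fwd_out}\n"
        ));
    }
    out.push_str(&format!("wzp_federation_active_rooms {active_rooms}\n"));
    out
}
```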
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GlobalRoomActive signal now carries participant list from the
announcing relay. When received, the relay:
1. Stores remote participants per peer link
2. Broadcasts merged RoomUpdate to local clients (local + all remote)
This means clients on different relays can now SEE each other in the
participant list. Also fixes the build: removed references to
non-existent metric fields that had been added by the linter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Prometheus metrics for federation links (per-peer RTT, packet
counters, active rooms gauge, dedup/rate-limit drop counters).
Add dedup filter (4096-entry ring buffer) to drop duplicate packets
arriving via multiple federation paths. Add per-room token bucket
rate limiter (500 pps) to prevent amplification.
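The two mechanisms above can be sketched like this (structures simplified; the real packet ID and timing plumbing are assumptions):

```rust
use std::collections::VecDeque;

// Fixed-size dedup ring: remembers the last `cap` packet IDs and drops
// packets already seen (duplicates arriving via multiple federation paths).
pub struct DedupFilter {
    seen: VecDeque<u64>,
    cap: usize,
}

impl DedupFilter {
    pub fn new(cap: usize) -> Self {
        Self { seen: VecDeque::with_capacity(cap), cap }
    }
    // Returns true if the packet is fresh (and records it), false if duplicate.
    pub fn check(&mut self, id: u64) -> bool {
        if self.seen.contains(&id) {
            return false;
        }
        if self.seen.len() == self.cap {
            self.seen.pop_front();
        }
        self.seen.push_back(id);
        true
    }
}

// Per-room token bucket: refills at `rate` tokens/sec up to `burst`,
// costs one token per packet; an empty bucket means the packet is dropped.
pub struct TokenBucket {
    tokens: f64,
    rate: f64,
    burst: f64,
}

impl TokenBucket {
    pub fn new(rate: f64, burst: f64) -> Self {
        Self { tokens: burst, rate, burst }
    }
    // `elapsed_secs` is the time since the previous call.
    pub fn allow(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + self.rate * elapsed_secs).min(self.burst);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

The linear `contains` scan is for brevity only; at 4096 entries the real filter presumably pairs the ring with a hash set for O(1) lookup.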
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Different clients send different room names:
- Android: raw "general" as SNI
- Desktop: hash_room_name("general") = "f09ae11d..." as SNI
Federation datagrams are tagged with an 8-byte room hash. Previously,
each relay computed the hash from the client-provided room name,
causing mismatches between relays with different client types.
Fix: resolve_global_room() maps any room name (raw or hashed) to the
canonical [[global_rooms]] name. global_room_hash() always uses the
canonical name for federation hashing. handle_datagram uses both raw
and canonical hash matching to find the local room.
Also: run_participant now receives the pre-computed federation_room_hash
so the egress uses the canonical hash, not the client-specific name.
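The resolution logic can be sketched as below. The hash closure stands in for the real hash_room_name (its actual digest is not shown here); the point is only the mapping: both the raw name and its hashed form resolve to the same canonical [[global_rooms]] entry, and federation hashing always starts from the canonical name:

```rust
use std::collections::HashMap;

pub struct GlobalRooms {
    // maps raw name AND hashed name -> canonical name
    lookup: HashMap<String, String>,
}

impl GlobalRooms {
    pub fn new(names: &[&str], hash: impl Fn(&str) -> String) -> Self {
        let mut lookup = HashMap::new();
        for &name in names {
            lookup.insert(name.to_string(), name.to_string());
            lookup.insert(hash(name), name.to_string());
        }
        Self { lookup }
    }

    // Any client-provided room name (raw or hashed) -> canonical name.
    pub fn resolve_global_room(&self, client_room: &str) -> Option<&str> {
        self.lookup.get(client_room).map(|s| s.as_str())
    }

    // Federation always hashes the canonical name, so relays agree on
    // the room tag regardless of which client type joined.
    pub fn global_room_hash(
        &self,
        client_room: &str,
        hash: impl Fn(&str) -> String,
    ) -> Option<String> {
        self.resolve_global_room(client_room).map(hash)
    }
}
```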
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire AdaptiveQualityController into Android engine for auto codec
switching based on network quality reports. Add color-coded TX/RX
codec badges to the in-call screen showing active codecs and Auto mode.
- Recv task: ingest QualityReports, feed to controller, signal profile
changes via AtomicU8 to send task
- Send task: check for pending profile switch at frame boundaries,
update encoder/FEC/frame size
- Track peer codec from incoming packet headers
- Kotlin UI: codec badges (blue=studio, green=good, amber=degraded,
red=catastrophic) with Auto tag
- Add .taskmaster to .gitignore
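The recv→send AtomicU8 handshake can be sketched like this (sentinel value and function names are hypothetical):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Sentinel meaning "no pending profile switch".
pub const NO_PENDING: u8 = u8::MAX;

// Recv task: request a switch after the controller picks a new profile.
pub fn signal_profile(pending: &AtomicU8, profile_id: u8) {
    pending.store(profile_id, Ordering::Release);
}

// Send task: called once per frame boundary; returns Some(profile) when
// a switch was requested since the last frame, swapping the sentinel back.
pub fn take_pending_profile(pending: &AtomicU8) -> Option<u8> {
    match pending.swap(NO_PENDING, Ordering::AcqRel) {
        NO_PENDING => None,
        id => Some(id),
    }
}
```

The swap makes the check-and-clear a single atomic step, so a switch requested between two frames is applied exactly once.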
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bug: when a local client joins a global room and sends media, the
egress task checked peer_links.active_rooms to decide where to forward.
But active_rooms tracks what PEERS announced (their rooms), not what
WE announced. So our own GlobalRoomActive signal went out but our
peer_links had empty active_rooms — media was dropped.
Fix: for locally-originated media, send to ALL connected federation
peers unconditionally. The receiving relay decides whether to deliver
to local participants (if it has the room) or forward further. This
is correct because federation peers are explicitly configured — if
they're connected, they should receive global room media.
Multi-hop forwarding (handle_datagram) still filters by active_rooms
to prevent loops — only forwards to peers that announced the room.
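The two forwarding policies can be sketched side by side (Peer is a stand-in for a federation peer link; names hypothetical):

```rust
pub struct Peer {
    pub label: String,
    pub active_rooms: Vec<[u8; 8]>, // room hashes this peer announced
}

// Locally-originated media: send to ALL connected peers unconditionally;
// the receiving relay decides whether to deliver or forward further.
pub fn egress_targets(peers: &[Peer]) -> Vec<&str> {
    peers.iter().map(|p| p.label.as_str()).collect()
}

// Multi-hop forwarding: only to peers that announced the room, and never
// back to the relay the datagram came from (loop prevention).
pub fn multihop_targets<'a>(
    peers: &'a [Peer],
    room_hash: &[u8; 8],
    source_peer: &str,
) -> Vec<&'a str> {
    peers
        .iter()
        .filter(|p| p.label != source_peer && p.active_rooms.contains(room_hash))
        .map(|p| p.label.as_str())
        .collect()
}
```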
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When --config points to a non-existent file, the relay now generates
a personalized example config that includes:
- listen_addr matching the --listen flag (not hardcoded 0.0.0.0:4433)
- Pre-filled [[peers]] section with this relay's detected IP, port,
and TLS fingerprint — ready to copy/paste into other relay configs
This makes federation setup much easier: start each relay, let it
generate a config with its own peering info commented out, then
uncomment and copy the blocks between the relays' configs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables running multiple relays on the same machine:
wzp-relay -c ~/.wzp1/config.toml -i ~/.wzp1/relay-identity --listen :4433
wzp-relay -c ~/.wzp2/config.toml -i ~/.wzp2/relay-identity --listen :4434
wzp-relay -c ~/.wzp3/config.toml -i ~/.wzp3/relay-identity --listen :4435
Config auto-creation: if the config file doesn't exist, writes an
example config with all fields documented and commented. The relay
starts with defaults but the file is ready to edit.
Identity auto-generation: if the identity file doesn't exist, generates
a new random seed (OsRng via wzp_crypto::Seed::generate) and saves it.
Subsequent starts load the same identity.
Short flags: -c for --config, -i for --identity.
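The load-or-create pattern used for both files reduces to something like this sketch (the real code writes a documented example config and a generated identity seed; this only shows the control flow):

```rust
use std::fs;
use std::path::Path;

// If the file is missing, write a default and use it; otherwise load
// what is there, so subsequent starts see the same config/identity.
pub fn load_or_create(path: &Path, default_contents: &str) -> std::io::Result<String> {
    if !path.exists() {
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(path, default_contents)?;
    }
    fs::read_to_string(path)
}
```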
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added logging to trace federation media flow:
- media_task logs first + every 250th received datagram (count, len)
- handle_datagram multi-hop forward logs errors (was silently dropped)
- forward_to_peers logs when no peer matches
2-relay (A→B): WORKING — full audio received, 300 packets forwarded
3-relay (A→B→C): B receives datagrams from A but only one reaches C —
the remaining packets are never received, likely a QUIC read_datagram
issue when handle_datagram holds locks during processing. Needs further
investigation into async lock contention or datagram buffering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2-relay test: 5.0s audio, RMS 4748, PASS. Full pipeline verified:
- Room correctly identified as global (hash matching works)
- Federation egress channel created and connected
- GlobalRoomActive signals exchanged between peers
- 300 packets (250 source + 50 FEC) forwarded via tagged datagrams
- Client B on relay B received full 5-second tone from client A on relay A
Added debug logging: is_global check, egress channel creation, per-peer
forwarding with active_rooms diagnostic when no match found. Also logs
egress packet count (first + every 250th).
Multi-hop propagation: GlobalRoomActive signals forwarded to other peers
so A→B→C chain knows about rooms across the full mesh.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major rewrite of relay federation replacing virtual participants with
a clean router model:
1. Global rooms: [[global_rooms]] in TOML config declares rooms that
are bridged across federation. Each relay is a router + local SFU.
2. Room events: RoomManager emits LocalJoin/LocalLeave via broadcast
channel when rooms transition between empty and non-empty.
3. GlobalRoomActive/Inactive signals: relays announce when they have
local participants in global rooms. Peers track active state and
forward media accordingly. Announcements propagate for multi-hop.
4. Media forwarding: separated from SFU loop. Local participant sends
via mpsc channel → egress task → forward_to_peers() → room-hash
tagged datagrams to active peer links. Inbound datagrams delivered
to local participants + forwarded to other active peers (multi-hop).
5. Loop prevention: don't forward back to source relay.
6. Room name hashing: is_global_room() checks both plain name and
hash (clients hash room names for SNI privacy).
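The edge-triggered room events from point 2 can be sketched as follows — events fire only on the empty↔non-empty transitions, not on every join/leave (counter struct and names are hypothetical; the real RoomManager sends these over a broadcast channel):

```rust
#[derive(Debug, PartialEq)]
pub enum RoomEvent {
    LocalJoin(String),  // first local participant joined
    LocalLeave(String), // last local participant left
}

pub struct RoomCounter {
    count: usize,
}

impl RoomCounter {
    pub fn new() -> Self {
        Self { count: 0 }
    }
    // Returns an event only on the empty -> non-empty transition.
    pub fn join(&mut self, room: &str) -> Option<RoomEvent> {
        self.count += 1;
        (self.count == 1).then(|| RoomEvent::LocalJoin(room.to_string()))
    }
    // Returns an event only on the non-empty -> empty transition.
    pub fn leave(&mut self, room: &str) -> Option<RoomEvent> {
        if self.count == 0 {
            return None;
        }
        self.count -= 1;
        (self.count == 0).then(|| RoomEvent::LocalLeave(room.to_string()))
    }
}
```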
Removed: ParticipantSender::Federation, federated_participants, virtual
participant join/leave, periodic room polling. Rooms now only contain
local participants.
Signaling tested: 3-relay chain (A→B←C) correctly propagates
GlobalRoomActive through B to both A and C. Media forwarding plumbing
in place but needs final debugging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>