When a remote relay's room goes inactive (all participants left),
the receiving relay now:
1. Clears remote_participants for that peer+room
2. Broadcasts updated RoomUpdate to local clients with the remote
participant removed
3. Updates federation_active_rooms metric
Previously, remote participants lingered in the participant list
after disconnect, causing ghost entries and stale media forwarding.
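The cleanup above can be sketched as follows. This is a minimal illustration, not the relay's actual code: `RemoteCache`, its key layout (peer label plus room name), and the method names are assumptions; only `remote_participants` is a name taken from this message.

```rust
use std::collections::HashMap;

/// Hypothetical cache of remote participants, keyed by (peer label, room name).
#[derive(Default)]
struct RemoteCache {
    remote_participants: HashMap<(String, String), Vec<String>>,
}

impl RemoteCache {
    /// Called when `peer` reports that `room` went inactive. Drops that
    /// peer's entries and returns the list to broadcast to local clients.
    fn on_room_inactive(&mut self, peer: &str, room: &str, local: &[String]) -> Vec<String> {
        self.remote_participants
            .remove(&(peer.to_string(), room.to_string()));
        self.merged(room, local)
    }

    /// Local participants plus every cached remote participant for `room`.
    fn merged(&self, room: &str, local: &[String]) -> Vec<String> {
        let mut all: Vec<String> = local.to_vec();
        for ((_, r), names) in &self.remote_participants {
            if r == room {
                all.extend(names.iter().cloned());
            }
        }
        all.sort(); // deterministic order for display
        all
    }
}
```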
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Connects to a relay over QUIC with SNI "version", reads build hash
from a unidirectional stream, prints "<relay> <git-hash>" and exits.
Usage: wzp-client --version-check 172.16.81.175:4434
Output: 172.16.81.175:4434 8dbda3e
Relay side: detects "version" SNI, opens uni stream, writes
BUILD_GIT_HASH, waits 100ms for client to read, closes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wzp-relay --version prints "wzp-relay <short-git-hash>".
Build hash also logged on startup: version=abc1234.
This makes it possible to verify that a deployed relay matches the expected build.
Also fixed federation-test.sh: use kill -INT (not SIGTERM) so
clients save recordings before exit. Added save delay.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoomParticipant.relay_label identifies which relay a participant is
connected to. Local participants have None, federated participants
get tagged with the peer relay's label when storing remote_participants.
This enables clients to group participants by relay in the UI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoomParticipant now has optional relay_label field. Desktop client
groups participants by relay: "This Relay" (green dot) for local,
peer label (blue dot) for federated. Shows all relays in the chain
including intermediate ones.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. CLI client now sends raw room names (no hash), matching Android
JNI and Desktop Tauri. All three clients are now consistent.
2. When a client joins a global room, the relay merges federated
remote participants into the initial RoomUpdate. Previously,
clients that joined after the GlobalRoomActive signal only saw
local participants. Now they see everyone immediately.
3. Added get_remote_participants() to FederationManager for querying
cached remote participants from all peer links.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wires up the existing RelayMetrics federation fields:
- wzp_federation_peer_status{peer} — 1=connected, 0=disconnected
- wzp_federation_packets_forwarded_total{peer,direction} — in/out counts
- wzp_federation_active_rooms — number of active federated rooms
These are critical for monitoring federation health and will feed into
the adaptive codec selection system (PRD-coordinated-codec.md).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GlobalRoomActive signal now carries participant list from the
announcing relay. When received, the relay:
1. Stores remote participants per peer link
2. Broadcasts merged RoomUpdate to local clients (local + all remote)
This means clients on different relays can now SEE each other in the
participant list. Also fixes build: removed non-existent metric field
references that were added by linter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Prometheus metrics for federation links (per-peer RTT, packet
counters, active rooms gauge, dedup/rate-limit drop counters).
Add dedup filter (4096-entry ring buffer) to drop duplicate packets
arriving via multiple federation paths. Add per-room token bucket
rate limiter (500 pps) to prevent amplification.
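A rough sketch of the two filters, under stated assumptions: the packet key is an opaque u64, the bucket's burst size equals one second of traffic, and time is passed in as seconds so the example stays deterministic. None of these details are confirmed by this message; only the 4096-entry ring and the 500 pps rate are.

```rust
use std::collections::HashSet;

/// Remembers the last `capacity` packet keys; older keys are evicted in FIFO order.
struct DedupFilter {
    ring: Vec<Option<u64>>,
    seen: HashSet<u64>,
    next: usize,
}

impl DedupFilter {
    fn new(capacity: usize) -> Self {
        Self { ring: vec![None; capacity], seen: HashSet::with_capacity(capacity), next: 0 }
    }

    /// Returns true if `key` is new, false if it is a recent duplicate.
    fn check_and_insert(&mut self, key: u64) -> bool {
        if self.seen.contains(&key) {
            return false;
        }
        // Overwrite the oldest slot and forget the key that lived there.
        if let Some(old) = self.ring[self.next].replace(key) {
            self.seen.remove(&old);
        }
        self.seen.insert(key);
        self.next = (self.next + 1) % self.ring.len();
        true
    }
}

/// Token bucket: refills at `rate_pps` tokens per second up to `burst`.
struct TokenBucket {
    rate_pps: f64,
    burst: f64,
    tokens: f64,
    last_secs: f64,
}

impl TokenBucket {
    fn new(rate_pps: f64, burst: f64) -> Self {
        Self { rate_pps, burst, tokens: burst, last_secs: 0.0 }
    }

    /// Returns true if a packet arriving at `now_secs` may be forwarded.
    fn allow(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_secs).max(0.0);
        self.last_secs = now_secs;
        self.tokens = (self.tokens + elapsed * self.rate_pps).min(self.burst);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In this sketch the relay would create `DedupFilter::new(4096)` per federation link and `TokenBucket::new(500.0, 500.0)` per room.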
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Different clients send different room names:
- Android: raw "general" as SNI
- Desktop: hash_room_name("general") = "f09ae11d..." as SNI
Federation datagrams are tagged with an 8-byte room hash. Previously,
each relay computed the hash from the client-provided room name,
causing mismatches between relays with different client types.
Fix: resolve_global_room() maps any room name (raw or hashed) to the
canonical [[global_rooms]] name. global_room_hash() always uses the
canonical name for federation hashing. handle_datagram uses both raw
and canonical hash matching to find the local room.
Also: run_participant now receives the pre-computed federation_room_hash
so the egress uses the canonical hash, not the client-specific name.
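The name resolution can be illustrated like this. The hash below is a stand-in (FNV-1a rendered as hex): the real hash_room_name() algorithm is not shown in this message, and the signature of resolve_global_room() is likewise an assumption.

```rust
/// Stand-in for the project's hash_room_name(); FNV-1a is used here only
/// so the example is self-contained. The real algorithm may differ.
fn hash_room_name(name: &str) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in name.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{:016x}", h)
}

/// Maps a client-provided room name (raw or hashed) to the canonical
/// name from the configured global rooms, if any matches.
fn resolve_global_room<'a>(global_rooms: &'a [String], client_name: &str) -> Option<&'a str> {
    global_rooms
        .iter()
        .map(String::as_str)
        .find(|canonical| *canonical == client_name || hash_room_name(canonical) == client_name)
}
```

Because both relays resolve to the same canonical name before hashing, the 8-byte federation tag no longer depends on which client type joined first.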
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire AdaptiveQualityController into Android engine for auto codec
switching based on network quality reports. Add color-coded TX/RX
codec badges to the in-call screen showing active codecs and Auto mode.
- Recv task: ingest QualityReports, feed to controller, signal profile
changes via AtomicU8 to send task
- Send task: check for pending profile switch at frame boundaries,
update encoder/FEC/frame size
- Track peer codec from incoming packet headers
- Kotlin UI: codec badges (blue=studio, green=good, amber=degraded,
red=catastrophic) with Auto tag
- Add .taskmaster to .gitignore
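The recv-to-send handoff described above might look like the following sketch. `ProfileSwitch`, the sentinel value, and the idea of encoding a profile as a small index are assumptions; the message only states that an AtomicU8 carries the signal.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Sentinel meaning "no switch requested" (assumed; any unused index works).
const NO_PENDING: u8 = u8::MAX;

/// Shared between the recv task (producer) and the send task (consumer).
struct ProfileSwitch {
    pending: AtomicU8,
}

impl ProfileSwitch {
    fn new() -> Self {
        Self { pending: AtomicU8::new(NO_PENDING) }
    }

    /// Recv task: request a switch to the given profile index.
    /// A later request simply overwrites an earlier unconsumed one.
    fn request(&self, profile_index: u8) {
        self.pending.store(profile_index, Ordering::Release);
    }

    /// Send task: call at a frame boundary; returns the requested profile
    /// once and clears the request.
    fn take(&self) -> Option<u8> {
        match self.pending.swap(NO_PENDING, Ordering::AcqRel) {
            NO_PENDING => None,
            idx => Some(idx),
        }
    }
}
```

Checking only at frame boundaries means the encoder, FEC, and frame size always change together, never mid-frame.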
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bug: when a local client joins a global room and sends media, the
egress task checked peer_links.active_rooms to decide where to forward.
But active_rooms tracks what PEERS announced (their rooms), not what
WE announced. So our own GlobalRoomActive signal went out but our
peer_links had empty active_rooms — media was dropped.
Fix: for locally-originated media, send to ALL connected federation
peers unconditionally. The receiving relay decides whether to deliver
to local participants (if it has the room) or forward further. This
is correct because federation peers are explicitly configured — if
they're connected, they should receive global room media.
Multi-hop forwarding (handle_datagram) still filters by active_rooms
to prevent loops — only forwards to peers that announced the room.
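The two forwarding rules can be contrasted in a small sketch. `PeerLink` and both function names are hypothetical; only active_rooms and the 8-byte room hash come from the commit history.

```rust
use std::collections::HashSet;

/// Hypothetical view of one federation link.
struct PeerLink {
    label: String,
    connected: bool,
    /// Room hashes this peer has announced as active.
    active_rooms: HashSet<[u8; 8]>,
}

/// Locally originated media: every connected peer, unconditionally.
fn targets_for_local(peers: &[PeerLink]) -> Vec<&str> {
    peers
        .iter()
        .filter(|p| p.connected)
        .map(|p| p.label.as_str())
        .collect()
}

/// Multi-hop forwarding: only peers that announced the room, and never
/// the relay the datagram came from.
fn targets_for_forward<'a>(peers: &'a [PeerLink], room: &[u8; 8], source: &str) -> Vec<&'a str> {
    peers
        .iter()
        .filter(|p| p.connected && p.label != source && p.active_rooms.contains(room))
        .map(|p| p.label.as_str())
        .collect()
}
```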
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When --config points to a non-existent file, the relay now generates
a personalized example config that includes:
- listen_addr matching the --listen flag (not hardcoded 0.0.0.0:4433)
- Pre-filled [[peers]] section with this relay's detected IP, port,
and TLS fingerprint — ready to copy/paste into other relay configs
This makes setting up federation much easier: start each relay and it
generates a config with its own peering info commented out. Then
uncomment that block and copy it into the other relays' configs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables running multiple relays on the same machine:
wzp-relay -c ~/.wzp1/config.toml -i ~/.wzp1/relay-identity --listen :4433
wzp-relay -c ~/.wzp2/config.toml -i ~/.wzp2/relay-identity --listen :4434
wzp-relay -c ~/.wzp3/config.toml -i ~/.wzp3/relay-identity --listen :4435
Config auto-creation: if the config file doesn't exist, writes an
example config with all fields documented and commented. The relay
starts with defaults but the file is ready to edit.
Identity auto-generation: if the identity file doesn't exist, generates
a new random seed (OsRng via wzp_crypto::Seed::generate) and saves it.
Subsequent starts load the same identity.
Short flags: -c for --config, -i for --identity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added logging to trace federation media flow:
- media_task logs first + every 250th received datagram (count, len)
- handle_datagram multi-hop forward logs errors (was silently dropped)
- forward_to_peers logs when no peer matches
2-relay (A→B): WORKING — full audio received, 300 packets forwarded
3-relay (A→B→C): B receives datagrams from A, but only 1 arrives; the
remaining packets are never received. The likely cause is a QUIC
read_datagram issue while handle_datagram holds locks during processing.
Needs further investigation into async lock contention or datagram buffering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2-relay test: 5.0s audio, RMS 4748, PASS. Full pipeline verified:
- Room correctly identified as global (hash matching works)
- Federation egress channel created and connected
- GlobalRoomActive signals exchanged between peers
- 300 packets (250 source + 50 FEC) forwarded via tagged datagrams
- Client B on relay B received full 5-second tone from client A on relay A
Added debug logging: is_global check, egress channel creation, per-peer
forwarding with active_rooms diagnostic when no match found. Also logs
egress packet count (first + every 250th).
Multi-hop propagation: GlobalRoomActive signals forwarded to other peers
so A→B→C chain knows about rooms across the full mesh.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major rewrite of relay federation replacing virtual participants with
a clean router model:
1. Global rooms: [[global_rooms]] in TOML config declares rooms that
are bridged across federation. Each relay is a router + local SFU.
2. Room events: RoomManager emits LocalJoin/LocalLeave via broadcast
channel when rooms transition between empty and non-empty.
3. GlobalRoomActive/Inactive signals: relays announce when they have
local participants in global rooms. Peers track active state and
forward media accordingly. Announcements propagate for multi-hop.
4. Media forwarding: separated from SFU loop. Local participant sends
via mpsc channel → egress task → forward_to_peers() → room-hash
tagged datagrams to active peer links. Inbound datagrams delivered
to local participants + forwarded to other active peers (multi-hop).
5. Loop prevention: don't forward back to source relay.
6. Room name hashing: is_global_room() checks both plain name and
hash (clients hash room names for SNI privacy).
Removed: ParticipantSender::Federation, federated_participants, virtual
participant join/leave, periodic room polling. Rooms now only contain
local participants.
Signaling tested: 3-relay chain (A→B←C) correctly propagates
GlobalRoomActive through B to both A and C. Media forwarding plumbing
in place but needs final debugging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added [[trusted]] config: relay B can accept inbound federation
from relay A by fingerprint alone, without knowing A's address.
A connects to B with [[peers]], B trusts A with [[trusted]].
- FederationHello signal: outbound connections send their TLS
fingerprint as first signal. The accepting relay verifies it
against [[peers]] (by IP) or [[trusted]] (by fingerprint).
- Tested 3-relay chain: A→B←C. Both A and C connect to B, B trusts
both. B correctly accepts both inbound connections. Room
announcements flow A→B and C→B.
- Remaining: B needs to announce rooms back to A and C on the same
connection so media can flow A→B→C. Currently A has no virtual
participant for B, so media doesn't reach B's SFU for forwarding.
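A sketch of the accept-side check, assuming the [[peers]] entries have already been reduced to IP addresses and the [[trusted]] entries to fingerprint strings. The enum and function names are illustrative only.

```rust
use std::net::IpAddr;

#[derive(Debug, PartialEq)]
enum InboundMatch {
    /// Source IP matches a configured [[peers]] entry.
    Peer,
    /// FederationHello fingerprint matches a [[trusted]] entry.
    Trusted,
    /// Neither matched; the connection should be rejected.
    Unknown,
}

fn classify_inbound(
    peer_ips: &[IpAddr],
    trusted_fingerprints: &[&str],
    source_ip: IpAddr,
    hello_fingerprint: &str,
) -> InboundMatch {
    if peer_ips.contains(&source_ip) {
        InboundMatch::Peer
    } else if trusted_fingerprints.contains(&hello_fingerprint) {
        InboundMatch::Trusted
    } else {
        InboundMatch::Unknown
    }
}
```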
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds --debug-tap <room> flag (or debug_tap in TOML config) that logs
every media packet's header metadata passing through a room. Use '*'
for all rooms.
Output (via tracing target "debug_tap"):
TAP room=... dir=in addr=... seq=31 codec=Opus24k ts=520
fec_block=5 fec_sym=1 repair=false len=65 fan_out=1
Shows: direction, source address, sequence number, codec ID, timestamp,
FEC block/symbol, repair flag, payload size, and fan-out count.
No decryption needed — headers are not encrypted.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added debug logging to federation signal path. Fixed the announce/recv
flow: outbound link's announce_task sends FederationRoomJoin, peer's
inbound signal_task receives it and creates virtual participant.
Tested: two relays on localhost with mutual TOML config, client A
sends tone via relay A, client B records via relay B — audio received
through federation (0.1s/RMS 7291/PASS).
Room announcement delay is ~1s (poll interval). The full pipeline:
client join → room created → announce_task detects → sends signal →
peer receives → creates virtual participant → SFU loop forwards
media via room-hash-tagged datagrams → peer demuxes → local delivery.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Inbound federation connections now matched by source IP against
configured peer URLs (QUIC clients don't present TLS certs, so
fingerprint matching fails for inbound direction).
- Added periodic room announcement task (1s poll) that sends
FederationRoomJoin to peers when new rooms appear with local
participants. Handles rooms created after federation link is up.
- Added find_peer_by_addr() to FederationManager.
Federation link topology: each relay pair has 2 connections (outbound
from each side). Outbound sends signals, peer's inbound receives them.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 of relay federation:
1. Signal messages: FederationRoomJoin/Leave/ParticipantUpdate added
to SignalMessage enum for relay-to-relay room coordination.
2. Room changes: ParticipantOrigin (Local/Federated) tracking, loop
prevention (federated media only forwards to local participants),
ParticipantSender::Federation with 8-byte room-hash prefixed
datagrams, merged participant lists (local + remote), new methods:
join_federated(), update_federated_participants(), local_senders(),
active_rooms(), local_participants().
3. FederationManager: connects to configured peers via QUIC with SNI
"_federation", reconnects with exponential backoff (5s-300s),
exchanges FederationRoomJoin signals, runs recv loops for both
signals and media datagrams, creates virtual participants in rooms.
4. Accept-side: _federation SNI handling in main.rs, unknown peer
gets helpful "add to relay.toml" log message, recognized peers
handed off to FederationManager.
TODO: TLS fingerprint verification — currently outbound connections
use client_config() which doesn't present a cert, so inbound
verification fails. Need mutual TLS or URL-based peer matching.
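The reconnect schedule from item 3 could be computed as below. The doubling factor is an assumption: the message gives only the bounds (5s and 300s) and says the backoff is exponential.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// 5s, 10s, 20s, ... capped at 300s.
fn reconnect_delay(attempt: u32) -> Duration {
    const BASE_SECS: u64 = 5;
    const MAX_SECS: u64 = 300;
    // checked_shl returns None once the shift would exceed the bit width,
    // so very large attempt counts saturate instead of overflowing.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_secs(BASE_SECS.saturating_mul(factor).min(MAX_SECS))
}
```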
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The relay now supports loading configuration from a TOML file via
--config <path>. CLI flags override TOML values. All fields have
serde defaults so a minimal config only needs what you want to change.
Example relay.toml:
listen_addr = "0.0.0.0:4433"
[[peers]]
url = "193.180.213.68:4433"
fingerprint = "1a:39:38:..."
label = "Pangolin EU"
Federation hint on startup now shows TOML format with TLS fingerprint
(not Ed25519 identity fingerprint), since TLS fingerprint is what
peers actually verify. Configured peers are logged on startup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The relay's TLS certificate is now derived from the persisted
Ed25519 seed via HKDF, so the same seed produces the same cert
and the same TLS fingerprint across restarts. This fixes the
"Server Key Changed" warnings on every relay restart.
Implementation: HKDF-SHA256(seed, "wzp-tls-ed25519") → Ed25519
signing key → PKCS8 DER → rcgen KeyPair → self-signed cert.
Also adds tls_fingerprint() helper (SHA-256 of DER cert, hex with
colons) and prints it on startup. This is the prerequisite for
relay federation (peers verify each other by TLS fingerprint).
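The display format (hex with colons) can be sketched on its own. This helper is hypothetical and takes the already-computed SHA-256 digest as input; the hashing step is omitted because it needs an external crate.

```rust
/// Formats digest bytes as lowercase hex pairs separated by colons,
/// e.g. [0x1a, 0x39, 0x38] -> "1a:39:38".
fn format_fingerprint(digest: &[u8]) -> String {
    digest
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(":")
}
```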
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On startup, the relay detects its outbound IP (via UDP socket trick)
and prints a ready-to-copy YAML snippet for other relays to federate:
  federation: to peer with this relay, add to peers config:
    - url: "193.180.213.68:4433"
      fingerprint: "a5d6:e3c6:..."
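The "UDP socket trick" usually means connecting an unbound UDP socket to some routable address and reading back the local address the OS picked. The sketch below assumes that is what the relay does; the function name and the probe target parameter are hypothetical.

```rust
use std::net::{IpAddr, UdpSocket};

/// Asks the OS which local address it would use to reach `probe_target`.
/// Connecting a UDP socket only selects a route; no packet is sent.
fn detect_outbound_ip(probe_target: &str) -> std::io::Result<IpAddr> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.connect(probe_target)?;
    Ok(socket.local_addr()?.ip())
}
```

Calling it with a public address returns the address of the default-route interface, which on a host behind NAT is a private address rather than the public one.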
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Auto codec: new "Auto" position on quality slider (JNI index 7).
When selected, the engine uses the relay's chosen_profile from
CallAnswer instead of the local preference. Slider now has 8
positions: Studio 64k → Auto → Codec2 1.2k.
2. Force ping: added refresh button (↻) in Manage Relays dialog
header. Calls pingAllServers() to re-check all relays on demand.
3. Delete relay fix: the X button was inside a Surface(onClick=...)
which swallowed the touch event. Replaced with a separate Surface
that properly intercepts the click.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same fix as Android — the CallOffer now includes STUDIO_64K/48K/32K
so the relay can negotiate studio quality levels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CallOffer only advertised GOOD/DEGRADED/CATASTROPHIC. When a
client uses a studio profile, the relay's choose_profile couldn't
pick it. Now advertises all 6 profiles (studio 64k/48k/32k + good +
degraded + catastrophic) in both Android engine and shared handshake.
Also: the relay MUST be rebuilt with the new CodecId variants,
otherwise it will fail to deserialize CallOffer messages containing
studio QualityProfiles in supported_profiles.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Wire protocol: add Opus 32k/48k/64k (CodecId 6/7/8) + STUDIO
profiles with is_opus() helper. Opus enc/dec accept all Opus variants.
2. JNI bridge: expand profile_from_int to 7 levels (0-6) mapping to
GOOD, DEGRADED, CATASTROPHIC, Codec2_3200, STUDIO_32K/48K/64K.
3. Settings UI: replace radio buttons with Material3 Slider — 7 stops
from Studio 64k (green) to Codec2 1.2k (dark red), matching desktop.
4. Key-change warning: AlertDialog on connect when server fingerprint
has changed. Shows old vs new fingerprint, Accept New Key or Cancel.
Accepting saves the new fingerprint and proceeds with the call.
5. Engine recv: handle studio codec IDs in auto-switch path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds three new codec IDs (Opus32k=6, Opus48k=7, Opus64k=8) and
corresponding STUDIO_32K, STUDIO_48K, STUDIO_64K quality profiles.
All use 20ms frames with minimal FEC (10%) for maximum quality on
good networks.
Updated across: wire protocol (codec_id.rs), encoder/decoder
(opus_enc/dec.rs), adaptive codec switch (call.rs), CLI
(--profile studio-64k), desktop engine + UI slider (8 quality
levels from Studio 64k green to Codec2 1.2k red).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The send loop was hardcoded to 960 samples (20ms/Opus24k), causing
DEGRADED (Opus 6k, 40ms) and CATASTROPHIC (Codec2 1200, 40ms) to
fail — the encoder needed 1920 samples but only got 960.
Changes:
- capture_buf, ring read threshold, and timestamp increment are now
computed from profile.frame_duration_ms (960 for 20ms, 1920 for 40ms)
- decode_buf sized to MAX_FRAME_SAMPLES (1920) to handle any incoming codec
- recv codec switch now uses correct QualityProfile per codec (was
inheriting original profile's frame_duration_ms, breaking cross-codec)
- added ComfortNoise guard on recv path
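The frame-size arithmetic behind this fix, as a sketch. The 48 kHz sample rate is inferred from "960 samples for 20ms" rather than stated here, and the function name is hypothetical.

```rust
/// Assumed capture rate: 960 samples per 20 ms implies 48 kHz.
const SAMPLE_RATE_HZ: u32 = 48_000;

/// Largest frame any supported profile produces (40 ms).
const MAX_FRAME_SAMPLES: usize = 1920;

/// Samples per frame for a profile's frame duration.
fn frame_samples(frame_duration_ms: u32) -> usize {
    (SAMPLE_RATE_HZ * frame_duration_ms / 1000) as usize
}
```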
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CallDecoder now inspects each incoming packet's codec_id and
automatically switches the audio decoder if it differs from the
current profile. This enables cross-codec interop where one client
sends Opus and the other sends Codec2 — previously the receiver
would try to decode with the wrong codec, producing garbled audio.
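A simplified sketch of the per-packet check. The real CallDecoder holds actual decoder state; here the enum is reduced to two codec families and the struct only tracks which one is active, so the names and fields are illustrative.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Codec {
    Opus,
    Codec2,
}

struct DecoderState {
    current: Codec,
    switches: u32,
}

impl DecoderState {
    /// Inspects the codec of an incoming packet and switches if it differs
    /// from the active one. Returns true when a switch happened.
    fn on_packet(&mut self, packet_codec: Codec) -> bool {
        if packet_codec == self.current {
            return false;
        }
        // Real code would rebuild the audio decoder for the new codec here.
        self.current = packet_codec;
        self.switches += 1;
        true
    }
}
```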
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables debugging Codec2 by allowing forced codec selection from CLI.
Supports: good, degraded, catastrophic, codec2-3200, codec2-1200.
Frame size, timing, and jitter buffer are all adjusted dynamically
based on the selected profile.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Relay identity:
- Stored in ~/.wzp/relay-identity (hex-encoded 32-byte seed)
- Generated on first run, reused on restart
- Fingerprint stays consistent across relay restarts
Linux build script (scripts/build-linux-notify.sh):
- Fire and forget: Hetzner VM → build all binaries → upload to rustypaste → ntfy notify → destroy VM
- Builds: wzp-relay, wzp-client, wzp-client-audio, wzp-web, wzp-bench
- Packages as tar.gz, uploads to rustypaste
- --keep flag to preserve VM
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Relay recognizes SNI "ping" and returns immediately — no handshake,
no stream accept, no timeout error logs. Client closes after QUIC
connect for RTT measurement.
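SNI-based dispatch on the accept side might be structured as below. The "ping", "version", and "_federation" values appear in the commit history; the enum, the function name, and treating every other SNI as a room name are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum ConnectionKind {
    /// RTT probe: return immediately, no handshake.
    Ping,
    /// Build-hash query.
    VersionCheck,
    /// Relay-to-relay federation link.
    Federation,
    /// Anything else is treated as a room name (raw or hashed).
    Room(String),
}

fn classify_sni(sni: &str) -> ConnectionKind {
    match sni {
        "ping" => ConnectionKind::Ping,
        "version" => ConnectionKind::VersionCheck,
        "_federation" => ConnectionKind::Federation,
        room => ConnectionKind::Room(room.to_string()),
    }
}
```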
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ping was a static JNI method that loaded the .so before nativeInit,
crashing jemalloc. Now ping is an instance method on WzpEngine:
- Engine is created once (nativeInit), reused for both ping and call
- pingRelay() uses same tokio runtime pattern as startCall()
- Auto-pings all servers on app launch (after engine init)
- No process restart needed
- TOFU fingerprints saved on first successful ping
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same fix as Android: Box::new([0i16; 16384]) allocates 32KB on the
stack before moving to heap. Use vec![].into_boxed_slice() for
direct heap allocation.
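The allocation pattern, shown in isolation. The function name is hypothetical; the buffer size matches the 16384 samples named above.

```rust
/// Allocates the ring storage directly on the heap.
/// `Box::new([0i16; 16384])` may first build the 32 KB array on the stack
/// (notably in unoptimized builds) before moving it into the box.
fn new_ring_storage() -> Box<[i16]> {
    vec![0i16; 16384].into_boxed_slice()
}
```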
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The jni crate emits VERBOSE logs for every JNI method lookup (~10 lines
per call, 100+ calls/sec on audio threads). This floods logcat, consumes
CPU, and triggers system kills. Filter to only show INFO+ for our crates
and WARN+ for everything else.
Also fix build script: clean full Rust target to ensure libc++_shared.so
is always copied by cargo-ndk.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds nativeWriteAudioDirect / nativeReadAudioDirect JNI functions
that accept a DirectByteBuffer instead of ShortArray. The buffer's
native memory is accessed directly by Rust via pointer — no
GetShortArrayRegion / SetShortArrayRegion, no GC-managed array
copies on the audio hot path.
This fixes SIGBUS crashes on Android 16 where ART's concurrent
mark-compact GC crashes when flipping thread roots during JNI
array operations on MAX_PRIORITY audio threads.
Old ShortArray methods kept for backward compatibility.
AudioPipeline switched to use Direct variants.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AudioRing: use vec![].into_boxed_slice() instead of Box::new([]) to
avoid 32KB stack allocation that crashes scudo on Android
- JNI bridge: wrap tracing_subscriber init in catch_unwind to survive
sharded_slab allocation failures on some devices
- Engine: per-step encode profiling (avg_agc_us, avg_opus_us, avg_fec_us,
avg_send_us) logged every 5 seconds in send stats
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds average microsecond timings for each encode step:
- avg_agc_us: AGC processing
- avg_opus_us: Opus encoding
- avg_fec_us: FEC encode + repair generation
- avg_send_us: QUIC send_media
- avg_total_us: sum of above
Logged every 5 seconds in send stats. Resets each interval.
Use to identify which step is bottlenecking the encode loop
on devices where fps drops below 50.
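One way to keep such per-interval averages is a small accumulator per step. `StepTimer` is a hypothetical name; the engine's actual bookkeeping is not shown in this message.

```rust
use std::time::Duration;

/// Accumulates elapsed time for one encode step (e.g. AGC or Opus).
#[derive(Default)]
struct StepTimer {
    total_us: u64,
    count: u64,
}

impl StepTimer {
    /// Record one measurement, typically `Instant::now().elapsed()` around the step.
    fn record(&mut self, elapsed: Duration) {
        self.total_us += elapsed.as_micros() as u64;
        self.count += 1;
    }

    /// Returns the average in microseconds and resets for the next interval.
    fn take_avg_us(&mut self) -> u64 {
        let avg = if self.count == 0 { 0 } else { self.total_us / self.count };
        *self = Self::default();
        avg
    }
}
```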
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same fix as Android (4af7c5f): writer never touches read_pos,
reader self-corrects when lapped. Power-of-2 capacity (16384),
bitmask indexing, overflow/underrun counters.
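A compact sketch of that ring design, using atomics for the sample slots so it needs no unsafe code. Field and method names are assumptions, and a production ring would also re-check write_pos after copying to detect being lapped mid-read.

```rust
use std::sync::atomic::{AtomicI16, AtomicU64, AtomicUsize, Ordering};

struct AudioRing {
    buf: Box<[AtomicI16]>,
    mask: usize,            // capacity - 1; capacity is a power of two
    write_pos: AtomicUsize, // monotonic, stored only by the writer
    read_pos: AtomicUsize,  // monotonic, stored only by the reader
    overflows: AtomicU64,
    underruns: AtomicU64,
}

impl AudioRing {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Self {
            buf: (0..capacity).map(|_| AtomicI16::new(0)).collect(),
            mask: capacity - 1,
            write_pos: AtomicUsize::new(0),
            read_pos: AtomicUsize::new(0),
            overflows: AtomicU64::new(0),
            underruns: AtomicU64::new(0),
        }
    }

    /// Writer: always succeeds, overwriting the oldest samples if needed.
    fn write(&self, samples: &[i16]) {
        let w = self.write_pos.load(Ordering::Relaxed);
        for (i, &s) in samples.iter().enumerate() {
            self.buf[w.wrapping_add(i) & self.mask].store(s, Ordering::Relaxed);
        }
        self.write_pos.store(w.wrapping_add(samples.len()), Ordering::Release);
    }

    /// Reader: returns the number of samples copied into `out`.
    fn read(&self, out: &mut [i16]) -> usize {
        let w = self.write_pos.load(Ordering::Acquire);
        let mut r = self.read_pos.load(Ordering::Relaxed);
        let capacity = self.mask + 1;
        if w.wrapping_sub(r) > capacity {
            // Lapped by the writer: skip ahead to the oldest valid sample.
            r = w.wrapping_sub(capacity);
            self.overflows.fetch_add(1, Ordering::Relaxed);
        }
        let n = w.wrapping_sub(r).min(out.len());
        if n < out.len() {
            self.underruns.fetch_add(1, Ordering::Relaxed);
        }
        for (i, slot) in out.iter_mut().take(n).enumerate() {
            *slot = self.buf[r.wrapping_add(i) & self.mask].load(Ordering::Relaxed);
        }
        self.read_pos.store(r.wrapping_add(n), Ordering::Relaxed);
        n
    }
}
```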
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rust tracing subscriber was never initialized — all info!/warn!/error!
calls in the engine went to /dev/null. This meant our send/recv health
logging was invisible and we couldn't confirm the congestion fix was
active.
Now initializes tracing-android layer on first nativeInit(), routing
all Rust logs to logcat under tag "wzp_android". Also expanded logcat
filter in DebugReporter to capture engine-level log lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>