feat(p2p): Phase 3.5 dual-path QUIC race + GUI call-flow debug logs

Two features in one commit because they ship and test together:
Phase 3.5 closes the hole-punching loop, and the call-flow debug
logs give the user live visibility into every step of a call, so
real-hardware testing of the new P2P path is debuggable.

## Phase 3.5 — dual-path QUIC connect race

Completes the hole-punching work Phase 3 scaffolded. On receiving
a CallSetup with peer_direct_addr, the client now actually races a
direct QUIC handshake against the relay dial and uses whichever
completes first. Symmetric role assignment avoids the two-conns-
per-call problem:

- Both peers compare `own_reflex_addr` vs `peer_reflex_addr`
  lexicographically.
- Smaller addr → **Acceptor** (A-role): builds a server-capable
  dual endpoint, awaits an incoming QUIC session. Does NOT dial.
- Larger addr → **Dialer** (D-role): builds a client-only
  endpoint, dials the peer's addr with `call-<id>` SNI. Does NOT
  listen.
- Both sides always dial the relay in parallel as fallback.
- `tokio::select!` with `biased` to prefer the direct path;
  `tokio::pin!` so each branch can still await the opposite
  (losing) future as a fallback.
- Direct timeout 2s, relay fallback timeout 5s (so 7s worst case
  from CallSetup to "no media path" error).

New crate module `wzp_client::dual_path::{race, WinningPath}`
(moved here from desktop/src-tauri so it's testable from a
workspace test). `determine_role` in `wzp_client::reflect` is a
pure function and unit-tested.
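
The role rule is small enough to sketch. This is a hypothetical
reconstruction from the truth table the tests below describe (the
real `determine_role` lives in `wzp-client/src/reflect.rs` and is
not part of this diff; the `Option`-based signature is an
assumption):

```rust
use std::net::SocketAddr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Acceptor, // lexicographically smaller reflex addr: listens, never dials direct
    Dialer,   // lexicographically larger reflex addr: dials, never listens
}

/// Pure, symmetric role assignment: both peers run this with the
/// arguments swapped and reach complementary answers. Returns None
/// when either addr is missing or they compare equal, which the
/// caller treats as "fall back to the relay-only path".
pub fn determine_role(own: Option<SocketAddr>, peer: Option<SocketAddr>) -> Option<Role> {
    let (own, peer) = (own?, peer?);
    // String comparison makes "lexicographically" unambiguous for
    // the sketch; the real implementation may compare differently.
    match own.to_string().cmp(&peer.to_string()) {
        std::cmp::Ordering::Less => Some(Role::Acceptor),
        std::cmp::Ordering::Greater => Some(Role::Dialer),
        std::cmp::Ordering::Equal => None,
    }
}
```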

### CallEngine integration
- New `pre_connected_transport: Option<Arc<QuinnTransport>>` arg
  on both android + desktop `CallEngine::start` branches. Skips
  the internal wzp_transport::connect step when Some. Backward-
  compat: None keeps Phase 0 relay-only behavior.
- `connect` Tauri command reads own_reflex_addr from SignalState,
  computes role, runs the race, passes the winning transport
  into CallEngine. If ANY input is missing (no peer addr, no own
  addr, equal addrs), falls back to classic relay path —
  identical to pre-Phase-3.5 behavior.
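
The "if ANY input is missing" guard reduces to a pure predicate
over the two addresses. A minimal sketch (the name `can_race` is
hypothetical; the actual `connect` command may inline this check):

```rust
use std::net::SocketAddr;

/// True only when the dual-path race has everything it needs:
/// both reflex addrs present and distinct. Every other shape (no
/// peer addr, no own addr, equal addrs) degrades to the classic
/// relay path, identical to pre-Phase-3.5 behavior.
fn can_race(own_reflex: Option<SocketAddr>, peer_direct: Option<SocketAddr>) -> bool {
    matches!((own_reflex, peer_direct), (Some(a), Some(b)) if a != b)
}
```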

### Tests (9 new, all passing)
- 6 unit tests for `determine_role` truth table in
  `wzp-client/src/reflect.rs` (smaller=Acceptor, larger=Dialer,
  port-only diff, equal, missing-side, symmetry)
- 3 integration tests in `crates/wzp-client/tests/dual_path.rs`:
    * `dual_path_direct_wins_on_loopback` — two-endpoint test
      rig, Dialer wins direct path vs loopback mock relay
    * `dual_path_relay_wins_when_direct_is_dead` — dead peer
      port, 2s direct timeout, relay fallback wins
    * `dual_path_errors_cleanly_when_both_paths_dead` — <10s
      error, no hang

## GUI call-flow debug logs

Runtime-toggled structured events at every step of a call so the
user can see where a call progressed or stalled on real hardware.
Modeled on the existing DRED_VERBOSE_LOGS pattern.

### Rust side
- `static CALL_DEBUG_LOGS: AtomicBool` + `emit_call_debug(&app,
  step, details)` helper. Always logs via `tracing::info!`
  (logcat always has a copy); GUI Tauri `call-debug-log` event
  only fires when the flag is on.
- Tauri commands `set_call_debug_logs` / `get_call_debug_logs`.
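
A minimal sketch of the gating pattern, with the Tauri event send
replaced by an injected closure so the sketch stays self-contained
(the closure parameter is an assumption; the real helper takes the
Tauri app handle):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static CALL_DEBUG_LOGS: AtomicBool = AtomicBool::new(false);

/// Always produce the tracing copy; forward to the GUI only when
/// the runtime flag is on. In the real helper `emit` would be a
/// Tauri `call-debug-log` event emission.
fn emit_call_debug(step: &str, details: &str, emit: impl Fn(&str, &str)) {
    // tracing::info!(%step, %details, "call-debug"); // logcat always has a copy
    if CALL_DEBUG_LOGS.load(Ordering::Relaxed) {
        emit(step, details);
    }
}
```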

### Instrumented steps (24 emit_call_debug sites)
- `register_signal`: start, identity loaded, endpoint created,
  connect failed/ok, RegisterPresence sent, ack received/failed,
  recv loop spawning
- Recv loop: CallRinging, DirectCallOffer (w/ caller_reflexive_addr),
  DirectCallAnswer (w/ callee_reflexive_addr), CallSetup (w/
  peer_direct_addr), Hangup
- `place_call`: start, reflect query start/ok/none, offer sent,
  send failed
- `answer_call`: start, reflect query start/ok/none or privacy
  skip, answer sent, send failed
- `connect`: start, dual_path_race_start (w/ role), won (w/
  path), failed, skipped (w/ reasons), call_engine_starting/
  started/failed

### JS side
- New `callDebugLogs: boolean` field on Settings type.
- Boot-time hydrate of the Rust flag from localStorage so the
  choice survives restarts (like `dredDebugLogs`).
- Settings panel: new "Call flow debug logs" checkbox alongside
  the DRED toggle.
- New "Call Debug Log" section that ONLY shows when the flag is
  on. Rolling in-memory buffer of the last 200 events, rendered
  as monospace `HH:MM:SS.mmm step {details}` lines with auto-
  scroll and a Clear button.
- `listen("call-debug-log", ...)` subscribed at app startup,
  appends to the buffer, re-renders on every event.

Full workspace test suite goes from 404 → 413 passing. Clippy
clean on touched crates.

PRD: .taskmaster/docs/prd_phase35_dual_path_race.txt
Tasks: 61-69 all completed

Next: APK + desktop build carrying everything — Phase 2 NAT
detect, Phase 3 advertising, Phase 3.5 dual-path + call debug
logs, plus the earlier Android first-join diagnostics — so the
user can validate the P2P path on real hardware with live
per-step visibility into where any failures happen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Siavash Sameni
2026-04-11 14:06:44 +04:00
parent 39277bf3a0
commit 59ce52f8e8
8 changed files with 920 additions and 73 deletions


@@ -0,0 +1,195 @@
//! Phase 3.5 — dual-path QUIC connect race for P2P hole-punching.
//!
//! When both peers advertised reflex addrs in the
//! DirectCallOffer/Answer flow, the relay cross-wires them into
//! `CallSetup.peer_direct_addr`. This module races a direct QUIC
//! handshake against the existing relay dial and returns whichever
//! completes first — with automatic drop of the loser via
//! `tokio::select!`.
//!
//! Role determination is deterministic and symmetric
//! (`wzp_client::reflect::determine_role`): whichever peer has the
//! lexicographically smaller reflex addr becomes the **Acceptor**
//! (listens on a server-capable endpoint), the other becomes the
//! **Dialer** (dials the peer's addr). Because the rule is
//! identical on both sides, the Acceptor's inbound QUIC session
//! and the Dialer's outbound are the SAME connection — no
//! negotiation needed, no two-conns-per-call confusion.
//!
//! Timeout policy:
//! - Direct path: 2s from the start of `race`. Cone-NAT hole-punch
//!   typically completes in < 500ms on a LAN; 2s gives us tolerance
//!   for a single QUIC Initial retry on unreliable networks.
//! - Relay fallback: 5s once the direct attempt fails or times out,
//!   so ~7s worst case from `race` entry to a hard error.
//! - Overall: `tokio::select!` returns as soon as either succeeds.

use std::net::SocketAddr;
use std::sync::Arc;
use std::time::Duration;

use crate::reflect::Role;
use wzp_transport::QuinnTransport;

/// Which path won the race. Used by the `connect` command for
/// logging + (in the future) metrics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WinningPath {
    Direct,
    Relay,
}

/// Attempt a direct QUIC connection to the peer in parallel with
/// the relay dial and return the winning `QuinnTransport`.
///
/// `role` selects the direction of the direct attempt:
/// - `Role::Acceptor` creates a server-capable endpoint and waits
///   for the peer to dial in.
/// - `Role::Dialer` creates a client-only endpoint and dials
///   `peer_direct_addr`.
///
/// The relay path is always attempted in parallel as a fallback so
/// the race ALWAYS produces a working transport unless both paths
/// genuinely fail (network partition). Returns
/// `Err(anyhow::anyhow!(...))` if both paths fail within the
/// timeout.
#[allow(clippy::too_many_arguments)]
pub async fn race(
    role: Role,
    peer_direct_addr: SocketAddr,
    relay_addr: SocketAddr,
    room_sni: String,
    call_sni: String,
) -> anyhow::Result<(Arc<QuinnTransport>, WinningPath)> {
    // Rustls provider must be installed before any quinn endpoint
    // is created. Install attempt is idempotent.
    let _ = rustls::crypto::ring::default_provider().install_default();

    // Build the direct-path endpoint + future based on role.
    // Each future returns an already-wrapped `QuinnTransport` so we
    // don't need a direct `quinn::Connection` type in scope here
    // (this crate doesn't depend on quinn directly).
    let direct_ep: wzp_transport::Endpoint;
    let direct_fut: std::pin::Pin<
        Box<dyn std::future::Future<Output = anyhow::Result<QuinnTransport>> + Send>,
    >;
    match role {
        Role::Acceptor => {
            let (sc, _cert_der) = wzp_transport::server_config();
            let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
            let ep = wzp_transport::create_endpoint(bind, Some(sc))?;
            tracing::info!(
                local_addr = ?ep.local_addr().ok(),
                "dual_path: A-role endpoint up, awaiting peer dial"
            );
            let ep_for_fut = ep.clone();
            direct_fut = Box::pin(async move {
                // `wzp_transport::accept` wraps the same
                // `endpoint.accept().await?.await?` dance we want
                // and maps errors into TransportError for us.
                let conn = wzp_transport::accept(&ep_for_fut)
                    .await
                    .map_err(|e| anyhow::anyhow!("direct accept: {e}"))?;
                Ok(QuinnTransport::new(conn))
            });
            direct_ep = ep;
        }
        Role::Dialer => {
            let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
            let ep = wzp_transport::create_endpoint(bind, None)?;
            tracing::info!(
                local_addr = ?ep.local_addr().ok(),
                %peer_direct_addr,
                "dual_path: D-role endpoint up, dialing peer"
            );
            let ep_for_fut = ep.clone();
            let client_cfg = wzp_transport::client_config();
            let sni = call_sni.clone();
            direct_fut = Box::pin(async move {
                let conn =
                    wzp_transport::connect(&ep_for_fut, peer_direct_addr, &sni, client_cfg)
                        .await
                        .map_err(|e| anyhow::anyhow!("direct dial: {e}"))?;
                Ok(QuinnTransport::new(conn))
            });
            direct_ep = ep;
        }
    }
    // Relay path: classic dial to the relay's media room.
    let relay_bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
    let relay_ep = wzp_transport::create_endpoint(relay_bind, None)?;
    let relay_ep_for_fut = relay_ep.clone();
    let relay_client_cfg = wzp_transport::client_config();
    let relay_sni = room_sni.clone();
    let relay_fut = async move {
        let conn =
            wzp_transport::connect(&relay_ep_for_fut, relay_addr, &relay_sni, relay_client_cfg)
                .await
                .map_err(|e| anyhow::anyhow!("relay dial: {e}"))?;
        Ok::<_, anyhow::Error>(QuinnTransport::new(conn))
    };

    // Race the two with a shared 2s ceiling on the direct attempt.
    // Pin both so we can poll them from multiple branches of the
    // select without moving the futures — the "direct failed, wait
    // for relay" and "relay failed, wait for direct" fallback paths
    // below need to await the OPPOSITE future after the winning
    // branch fires. Without pinning, tokio::select! moves the
    // future out and we can't touch it again.
    tracing::info!(?role, %peer_direct_addr, %relay_addr, "dual_path: racing direct vs relay");
    let direct_timed = tokio::time::timeout(Duration::from_secs(2), direct_fut);
    tokio::pin!(direct_timed, relay_fut);
    let result = tokio::select! {
        biased; // prefer direct win if both arrive in the same tick
        direct_result = &mut direct_timed => {
            match direct_result {
                Ok(Ok(transport)) => {
                    tracing::info!(%peer_direct_addr, "dual_path: direct WON");
                    Ok((Arc::new(transport), WinningPath::Direct))
                }
                Ok(Err(e)) => {
                    // Direct failed — fall back to waiting for relay.
                    tracing::warn!(error = %e, "dual_path: direct failed, awaiting relay");
                    match tokio::time::timeout(Duration::from_secs(5), &mut relay_fut).await {
                        Ok(Ok(transport)) => Ok((Arc::new(transport), WinningPath::Relay)),
                        Ok(Err(e2)) => Err(anyhow::anyhow!("both paths failed: direct={e}, relay={e2}")),
                        Err(_) => Err(anyhow::anyhow!("both paths failed: direct={e}, relay=timeout(5s)")),
                    }
                }
                Err(_elapsed) => {
                    tracing::warn!("dual_path: direct timed out (2s), awaiting relay");
                    match tokio::time::timeout(Duration::from_secs(5), &mut relay_fut).await {
                        Ok(Ok(transport)) => Ok((Arc::new(transport), WinningPath::Relay)),
                        Ok(Err(e2)) => Err(anyhow::anyhow!("direct timeout + relay failed: {e2}")),
                        Err(_) => Err(anyhow::anyhow!("direct timeout + relay timeout")),
                    }
                }
            }
        }
        relay_result = &mut relay_fut => {
            match relay_result {
                Ok(transport) => {
                    tracing::info!("dual_path: relay WON (direct still pending)");
                    Ok((Arc::new(transport), WinningPath::Relay))
                }
                Err(e) => {
                    tracing::warn!(error = %e, "dual_path: relay failed, awaiting direct remainder");
                    match tokio::time::timeout(Duration::from_millis(1500), &mut direct_timed).await {
                        Ok(Ok(Ok(transport))) => Ok((Arc::new(transport), WinningPath::Direct)),
                        _ => Err(anyhow::anyhow!("relay failed + direct unavailable: {e}")),
                    }
                }
            }
        }
    };

    // Drop both endpoints once the winner is stored in result. The
    // winning transport owns its own connection so dropping the
    // endpoint won't kill it.
    drop(direct_ep);
    drop(relay_ep);
    result
}