Files
wz-phone/crates/wzp-client/src/dual_path.rs
Siavash Sameni fa038df057 feat(p2p): Phase 5.5 — ICE LAN host candidates (IPv4 + IPv6)
Same-LAN P2P was failing because MikroTik masquerade (like most
consumer NATs) doesn't support NAT hairpinning — the advertised
WAN reflex addr is unreachable from a peer on the same LAN as
the advertiser. Phase 5 got us Cone NAT classification and fixed
the measurement artifact, but same-LAN direct dials still had
nowhere to land.

Phase 5.5 adds ICE-style host candidates: each client enumerates
its LAN-local network interface addresses, includes them in the
DirectCallOffer/Answer alongside the reflex addr, and the
dual-path race fans out to ALL peer candidates in parallel.
Same-LAN peers find each other via their RFC1918 IPv4 + ULA /
global-unicast IPv6 addresses without touching the NAT at all.

Dual-stack IPv6 is in scope from the start — on modern ISPs
(including Starlink) the v6 path often works even when v4
hairpinning doesn't, because there's no NAT on the v6 side.

## Changes

### `wzp_client::reflect::local_host_candidates(port)` (new)

Enumerates network interfaces via `if-addrs` and returns
SocketAddrs paired with the caller's port. Filters:

- IPv4: RFC1918 (10/8, 172.16/12, 192.168/16) + CGNAT (100.64/10)
- IPv6: global unicast (2000::/3) + ULA (fc00::/7)
- Skipped: loopback, link-local (169.254, fe80::), public v4
  (already covered by reflex-addr), unspecified

Safe from any thread, one `getifaddrs(3)` syscall.

### Wire protocol (wzp-proto/packet.rs)

Three new `#[serde(default, skip_serializing_if = "Vec::is_empty")]`
fields, backward-compat with pre-5.5 clients/relays by
construction:

- `DirectCallOffer.caller_local_addrs: Vec<String>`
- `DirectCallAnswer.callee_local_addrs: Vec<String>`
- `CallSetup.peer_local_addrs: Vec<String>`

### Call registry (wzp-relay/call_registry.rs)

`DirectCall` gains `caller_local_addrs` + `callee_local_addrs`
Vec<String> fields. New `set_caller_local_addrs` /
`set_callee_local_addrs` setters. Follow the same pattern as
the reflex addr fields.

### Relay cross-wiring (wzp-relay/main.rs)

Both the local-call and cross-relay-federation paths now track
the local_addrs through the registry and inject them into the
CallSetup's peer_local_addrs. Cross-wiring is identical to the
existing peer_direct_addr logic — each party's CallSetup
carries the OTHER party's LAN candidates.

### Client side (desktop/src-tauri/lib.rs)

- `place_call`: gathers local host candidates via
  `local_host_candidates(signal_endpoint.local_addr().port())`
  and includes them in `DirectCallOffer.caller_local_addrs`.
  The port match is critical — it's the Phase 5 shared signal
  socket, so incoming dials to these addrs land on the same
  endpoint that's already listening.
- `answer_call`: same, AcceptTrusted only (privacy mode keeps
  LAN addrs hidden too, for consistency with the reflex addr).
- `connect` Tauri command: new `peer_local_addrs: Vec<String>`
  arg. Builds a `PeerCandidates` bundle and passes it to the
  dual-path race.
- Recv loop's CallSetup handler: destructures + forwards the
  new field to JS via the signal-event payload.

### `dual_path::race` (wzp-client/dual_path.rs)

Signature change: takes `PeerCandidates` (reflex + local Vec)
instead of a single SocketAddr. The D-role branch now fans out
N parallel dials via `tokio::task::JoinSet` — one per candidate
— and the first successful dial wins (losers are aborted
immediately via `set.abort_all()`). Only when ALL candidates
have failed do we return Err; individual candidate failures are
just traced at debug level and the race waits for the others.

LAN host candidates are tried BEFORE the reflex addr in
`PeerCandidates::dial_order()` — they're faster when they work,
and the reflex addr is the fallback for the not-on-same-LAN
case.

### JS side (desktop/main.ts)

`connect` invoke now passes `peerLocalAddrs: data.peer_local_addrs ?? []`
alongside the existing `peerDirectAddr`.

### Tests

All existing test callsites updated for the new Vec<String>
fields (defaults to Vec::new() in tests — they don't exercise
the multi-candidate path). `dual_path.rs` integration tests
wrap the single `dead_peer` / `acceptor_listen_addr` in a
`PeerCandidates { reflexive: Some(_), local: Vec::new() }`.

Full workspace test: 423 passing (same as before 5.5).

## Expected behavior on the reporter's setup

Two phones behind MikroTik, both on the same LAN:

  place_call:host_candidates {"local_addrs": ["192.168.88.21:XXX", "2001:...:YY:XXX"]}
  recv:DirectCallAnswer {"callee_local_addrs": ["192.168.88.22:ZZZ", "2001:...:WW:ZZZ"]}
  recv:CallSetup {"peer_direct_addr":"150.228.49.65:NN",
                  "peer_local_addrs":["192.168.88.22:ZZZ","2001:...:WW:ZZZ"]}
  connect:dual_path_race_start {"peer_reflex":"...","peer_local":[...]}
  dual_path: direct dial succeeded on candidate 0   ← LAN v4 wins
  connect:dual_path_race_won {"path":"Direct"}

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 07:34:49 +04:00

375 lines
16 KiB
Rust

//! Phase 3.5 — dual-path QUIC connect race for P2P hole-punching.
//!
//! When both peers advertised reflex addrs in the
//! DirectCallOffer/Answer flow, the relay cross-wires them into
//! `CallSetup.peer_direct_addr`. This module races a direct QUIC
//! handshake against the existing relay dial and returns whichever
//! completes first — with automatic drop of the loser via
//! `tokio::select!`.
//!
//! Role determination is deterministic and symmetric
//! (`wzp_client::reflect::determine_role`): whichever peer has the
//! lexicographically smaller reflex addr becomes the **Acceptor**
//! (listens on a server-capable endpoint), the other becomes the
//! **Dialer** (dials the peer's addr). Because the rule is
//! identical on both sides, the Acceptor's inbound QUIC session
//! and the Dialer's outbound are the SAME connection — no
//! negotiation needed, no two-conns-per-call confusion.
//!
//! Timeout policy:
//! - Direct path: 2s from the start of `race`. Cone-NAT hole-punch
//! typically completes in < 500ms on a LAN; 2s gives us tolerance
//! for a single QUIC Initial retry on unreliable networks.
//! - Relay path: 10s (existing behavior elsewhere in the codebase).
//! - Overall: `tokio::select!` returns as soon as either succeeds.
use std::net::SocketAddr;
use std::sync::Arc;
use std::time::Duration;
use crate::reflect::Role;
use wzp_transport::QuinnTransport;
/// Which path won the race. Used by the `connect` command for
/// logging + (in the future) metrics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WinningPath {
Direct,
Relay,
}
/// Attempt a direct QUIC connection to the peer in parallel with
/// the relay dial and return the winning `QuinnTransport`.
///
/// `role` selects the direction of the direct attempt:
/// - `Role::Acceptor` creates a server-capable endpoint and waits
/// for the peer to dial in.
/// - `Role::Dialer` creates a client-only endpoint and dials
/// `peer_direct_addr`.
///
/// The relay path is always attempted in parallel as a fallback so
/// the race ALWAYS produces a working transport unless both paths
/// genuinely fail (network partition). Returns
/// `Err(anyhow::anyhow!(...))` if both paths fail within the
/// timeout.
/// Phase 5.5 candidate bundle — full ICE-ish candidate list for
/// the peer. The race tries them all in parallel alongside the
/// relay path. At minimum this should contain the peer's
/// server-reflexive address; `local_addrs` carries LAN host
/// candidates gathered from their physical interfaces.
///
/// Empty is valid: the D-role has nothing to dial and the race
/// reduces to "relay only" + (if A-role) accepting on the
/// shared endpoint.
#[derive(Debug, Clone, Default)]
pub struct PeerCandidates {
/// Peer's server-reflexive address (Phase 3). `None` if the
/// peer didn't advertise one.
pub reflexive: Option<SocketAddr>,
/// Peer's LAN host addresses (Phase 5.5). Tried first on
/// same-LAN pairs — direct dials to these bypass the NAT
/// entirely.
pub local: Vec<SocketAddr>,
}
impl PeerCandidates {
/// Flatten into the list of addrs the D-role should dial.
/// Order: LAN host candidates first (fastest when they
/// work), then reflexive (covers the non-LAN case).
pub fn dial_order(&self) -> Vec<SocketAddr> {
let mut out = Vec::with_capacity(self.local.len() + 1);
out.extend(self.local.iter().copied());
if let Some(a) = self.reflexive {
// Only add if it's not already in the list (some
// edge cases on same-LAN could have the same addr
// in both).
if !out.contains(&a) {
out.push(a);
}
}
out
}
/// Is there anything for the D-role to dial? If not, the
/// race reduces to relay-only.
pub fn is_empty(&self) -> bool {
self.reflexive.is_none() && self.local.is_empty()
}
}
#[allow(clippy::too_many_arguments)]
pub async fn race(
role: Role,
peer_candidates: PeerCandidates,
relay_addr: SocketAddr,
room_sni: String,
call_sni: String,
// Phase 5: when `Some`, reuse this endpoint for BOTH the
// direct-path branch AND the relay dial. Pass the signal
// endpoint. The endpoint MUST be server-capable (created
// with a server config) for the A-role accept branch to
// work.
//
// When `None`, falls back to fresh endpoints per role.
// Used by tests.
shared_endpoint: Option<wzp_transport::Endpoint>,
) -> anyhow::Result<(Arc<QuinnTransport>, WinningPath)> {
// Rustls provider must be installed before any quinn endpoint
// is created. Install attempt is idempotent.
let _ = rustls::crypto::ring::default_provider().install_default();
// Build the direct-path endpoint + future based on role.
//
// A-role: one accept future on the shared endpoint. The
// first incoming QUIC connection wins — we don't care
// which peer candidate the dialer used to reach us.
//
// D-role: N parallel dial futures, one per peer candidate
// (all LAN host addrs + the reflex addr), consolidated
// into a single direct_fut via FuturesUnordered-style
// "first OK wins" semantics. The first successful dial
// becomes the direct path; the losers are dropped (quinn
// will abort the in-flight handshakes via the dropped
// Connecting futures).
//
// Either way, direct_fut resolves to a single QuinnTransport
// (or an error) and is raced against the relay_fut by the
// outer tokio::select!.
let direct_ep: wzp_transport::Endpoint;
let direct_fut: std::pin::Pin<
Box<dyn std::future::Future<Output = anyhow::Result<QuinnTransport>> + Send>,
>;
match role {
Role::Acceptor => {
let ep = match shared_endpoint.clone() {
Some(ep) => {
tracing::info!(
local_addr = ?ep.local_addr().ok(),
"dual_path: A-role reusing shared endpoint for accept"
);
ep
}
None => {
let (sc, _cert_der) = wzp_transport::server_config();
let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
let fresh = wzp_transport::create_endpoint(bind, Some(sc))?;
tracing::info!(
local_addr = ?fresh.local_addr().ok(),
"dual_path: A-role fresh endpoint up, awaiting peer dial"
);
fresh
}
};
let ep_for_fut = ep.clone();
direct_fut = Box::pin(async move {
// `wzp_transport::accept` wraps the same
// `endpoint.accept().await?.await?` dance we want.
// If `ep_for_fut` is the shared signal endpoint,
// this pulls the NEXT incoming connection —
// normally that's the peer's direct-P2P dial.
// Signal recv is done via the signal CONNECTION
// (accept_bi), not the endpoint, so no conflict.
let conn = wzp_transport::accept(&ep_for_fut)
.await
.map_err(|e| anyhow::anyhow!("direct accept: {e}"))?;
Ok(QuinnTransport::new(conn))
});
direct_ep = ep;
}
Role::Dialer => {
let ep = match shared_endpoint.clone() {
Some(ep) => {
tracing::info!(
local_addr = ?ep.local_addr().ok(),
candidates = ?peer_candidates.dial_order(),
"dual_path: D-role reusing shared endpoint to dial peer candidates"
);
ep
}
None => {
let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
let fresh = wzp_transport::create_endpoint(bind, None)?;
tracing::info!(
local_addr = ?fresh.local_addr().ok(),
candidates = ?peer_candidates.dial_order(),
"dual_path: D-role fresh endpoint up, dialing peer candidates"
);
fresh
}
};
let ep_for_fut = ep.clone();
let dial_order = peer_candidates.dial_order();
let sni = call_sni.clone();
direct_fut = Box::pin(async move {
if dial_order.is_empty() {
// No candidates — the race reduces to
// relay-only. Surface a stable error so the
// outer select falls through to relay_fut
// without a spurious "direct failed" warning.
// Use a pending future that never resolves so
// the select's "other side wins" branch is
// the natural outcome.
std::future::pending::<anyhow::Result<QuinnTransport>>().await
} else {
// Fan out N parallel dials via JoinSet. First
// `Ok` wins; `Err` from a single candidate is
// not fatal — we wait for the others. Only
// when ALL have failed do we return Err.
let mut set = tokio::task::JoinSet::new();
for (idx, candidate) in dial_order.iter().enumerate() {
let ep = ep_for_fut.clone();
let client_cfg = wzp_transport::client_config();
let sni = sni.clone();
let candidate = *candidate;
set.spawn(async move {
let result = wzp_transport::connect(
&ep,
candidate,
&sni,
client_cfg,
)
.await;
(idx, candidate, result)
});
}
let mut last_err: Option<String> = None;
while let Some(join_res) = set.join_next().await {
let (idx, candidate, dial_res) = match join_res {
Ok(t) => t,
Err(e) => {
last_err = Some(format!("join {e}"));
continue;
}
};
match dial_res {
Ok(conn) => {
tracing::info!(
%candidate,
candidate_idx = idx,
"dual_path: direct dial succeeded on candidate"
);
// Abort the remaining in-flight
// dials so they don't complete
// and leak QUIC sessions.
set.abort_all();
return Ok(QuinnTransport::new(conn));
}
Err(e) => {
tracing::debug!(
%candidate,
candidate_idx = idx,
error = %e,
"dual_path: direct dial failed, trying others"
);
last_err = Some(format!("candidate {candidate}: {e}"));
}
}
}
Err(anyhow::anyhow!(
"all {} direct candidates failed; last: {}",
dial_order.len(),
last_err.unwrap_or_else(|| "n/a".into())
))
}
});
direct_ep = ep;
}
}
// Relay path: classic dial to the relay's media room. Phase 5:
// reuse the shared endpoint here too so MikroTik-style NATs
// keep a stable external port across all flows from this
// client. Falls back to a fresh endpoint when not shared.
let relay_ep = match shared_endpoint.clone() {
Some(ep) => ep,
None => {
let relay_bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
wzp_transport::create_endpoint(relay_bind, None)?
}
};
let relay_ep_for_fut = relay_ep.clone();
let relay_client_cfg = wzp_transport::client_config();
let relay_sni = room_sni.clone();
let relay_fut = async move {
let conn =
wzp_transport::connect(&relay_ep_for_fut, relay_addr, &relay_sni, relay_client_cfg)
.await
.map_err(|e| anyhow::anyhow!("relay dial: {e}"))?;
Ok::<_, anyhow::Error>(QuinnTransport::new(conn))
};
// Race the two with a shared 2s ceiling on the direct attempt.
// Pin both so we can poll them from multiple branches of the
// select without moving the futures — the "direct failed, wait
// for relay" and "relay failed, wait for direct" fallback paths
// below need to await the OPPOSITE future after the winning
// branch fires. Without pinning, tokio::select! moves the
// future out and we can't touch it again.
tracing::info!(
?role,
candidates = ?peer_candidates.dial_order(),
%relay_addr,
"dual_path: racing direct vs relay"
);
let direct_timed = tokio::time::timeout(Duration::from_secs(2), direct_fut);
tokio::pin!(direct_timed, relay_fut);
let result = tokio::select! {
biased; // prefer direct win if both arrive in the same tick
direct_result = &mut direct_timed => {
match direct_result {
Ok(Ok(transport)) => {
tracing::info!("dual_path: direct WON");
Ok((Arc::new(transport), WinningPath::Direct))
}
Ok(Err(e)) => {
// Direct failed — fall back to waiting for relay.
tracing::warn!(error = %e, "dual_path: direct failed, awaiting relay");
match tokio::time::timeout(Duration::from_secs(5), &mut relay_fut).await {
Ok(Ok(transport)) => Ok((Arc::new(transport), WinningPath::Relay)),
Ok(Err(e2)) => Err(anyhow::anyhow!("both paths failed: direct={e}, relay={e2}")),
Err(_) => Err(anyhow::anyhow!("both paths failed: direct={e}, relay=timeout(5s)")),
}
}
Err(_elapsed) => {
tracing::warn!("dual_path: direct timed out (2s), awaiting relay");
match tokio::time::timeout(Duration::from_secs(5), &mut relay_fut).await {
Ok(Ok(transport)) => Ok((Arc::new(transport), WinningPath::Relay)),
Ok(Err(e2)) => Err(anyhow::anyhow!("direct timeout + relay failed: {e2}")),
Err(_) => Err(anyhow::anyhow!("direct timeout + relay timeout")),
}
}
}
}
relay_result = &mut relay_fut => {
match relay_result {
Ok(transport) => {
tracing::info!("dual_path: relay WON (direct still pending)");
Ok((Arc::new(transport), WinningPath::Relay))
}
Err(e) => {
tracing::warn!(error = %e, "dual_path: relay failed, awaiting direct remainder");
match tokio::time::timeout(Duration::from_millis(1500), &mut direct_timed).await {
Ok(Ok(Ok(transport))) => Ok((Arc::new(transport), WinningPath::Direct)),
_ => Err(anyhow::anyhow!("relay failed + direct unavailable: {e}")),
}
}
}
}
};
// Let both endpoint clones drop at end-of-scope. With the
// Phase 5 shared-endpoint path, these clones are Arc<Endpoint>
// clones of the signal endpoint — dropping them just decrements
// the ref count, the socket stays alive for the signal loop +
// any further direct-P2P attempts. With the fresh-endpoint
// fallback, the drops are the last refs so the sockets close
// promptly. Either way the winning transport already owns its
// own quinn::Connection reference which is independent of the
// Endpoint lifetime.
let _ = (direct_ep, relay_ep);
result
}