Files

Build Release Binaries / build-amd64 (push) Has been cancelled

Details

docs: incident reports for send-task crash and playout ring desync

Two root-caused bugs documented with full evidence:

1. Send task fatal exit on QUIC congestion (FIXED in 2092245)
   - send_media() Err(Blocked) caused break → killed entire call
   - Now drops packet and continues

2. Playout ring buffer cursor desync (ROOT-CAUSED, fix pending)
   - AudioRing::write() mutates read_pos from producer thread on overflow
   - Violates SPSC contract → reader/writer fight over read_pos
   - Causes 12-16s bidirectional silence ~25-30s into call
   - Both clients affected simultaneously

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-06 08:52:14 +00:00

4.7 KiB

Raw Blame History

Incident Report: Send Task Fatal Exit on QUIC Congestion

Date: 2026-04-06 Severity: High — causes complete audio loss mid-call Status: Fixed in Android client, pending fix in desktop client and web client

Summary

A QUIC congestion event causes send_datagram() to return Err(Blocked). The send task treats this as a fatal error and exits, which kills the entire call via tokio::select!. Audio becomes one-way (recv still works briefly) then dies completely.

Root Cause

In the engine's send loop (run_call function), transport.send_media() errors were handled with break:

// BEFORE (broken)
if let Err(e) = transport.send_media(&source_pkt).await {
    error!("send error: {e}");
    break;  // <-- kills send task, which kills everything
}

Quinn's send_datagram() is synchronous and returns Err(SendDatagramError::Blocked) when the QUIC congestion window is full. This is a transient condition — the window opens again once ACKs arrive. But the break kills the send task, and since all tasks run under tokio::select!, the recv task, stats task, and signal task all die too.

Why it manifests as "intermittent disconnections"

Mobile networks have brief congestion spikes (cell tower handoff, WiFi interference)
A single spike fills the QUIC congestion window
One Blocked error → send task exits → select! cancels recv → complete silence
The QUIC connection stays open (no error logged), so stats polling continues showing stale data
From the user's perspective: audio drops for 5-20 seconds then "maybe comes back" (it doesn't — they're hearing cached playout ring drain)

Evidence from debug reports

Relay logs confirmed the relay was healthy:

max_forward_ms=0 — relay forwards instantly
send_errors=0 — no relay-side failures
The relay saw large recv gap warnings on participant 1 (Nothing A059): 722ms → 814ms → 1778ms → 3500ms → 6091ms — the client progressively stopped sending

Client stats confirmed:

frames_encoded kept incrementing (Opus encoder running)
frames_decoded froze at a fixed value (recv task died)
fec_recovered froze simultaneously
RTT, loss, jitter all frozen (stats task died)

Fix Applied

Android client (`crates/wzp-android/src/engine.rs`)

// AFTER (fixed)
if let Err(e) = transport.send_media(&source_pkt).await {
    send_errors += 1;
    frames_dropped += 1;
    if send_errors <= 3 || last_send_error_log.elapsed().as_secs() >= 1 {
        warn!(seq = s, send_errors, frames_dropped,
              "send_media error (dropping packet): {e}");
        last_send_error_log = Instant::now();
    }
    continue;  // <-- drop packet, keep going
}

Same pattern applied to FEC repair packet sends.

Recv task also hardened: transient errors (non-closed/reset) are now logged and survived rather than causing exit.

Added periodic health logging to both tasks (5-second intervals):

Send: frames_sent, frames_dropped, send_errors, ring_avail
Recv: frames_decoded, fec_recovered, recv_errors, max_recv_gap_ms, playout_avail

Relay (`crates/wzp-relay/src/room.rs`)

Added debug logging to both plain and trunked forwarding loops:

Per-recv gap tracking (warns on >200ms gaps)
Room manager lock contention tracking (warns on >10ms)
Forward latency tracking (warns on >50ms)
Send error counting with peer identification
5-second periodic stats with all above metrics

Affected Clients — FIX REQUIRED

Desktop client (`crates/wzp-client/src/cli.rs`)

Lines 345-348:

if let Err(e) = transport.send_media(pkt).await {
    error!("send error: {e}");
    break;  // <-- SAME BUG
}

Lines 431-434:

if let Err(e) = send_transport.send_media(pkt).await {
    error!("send error: {e}");
    return;  // <-- SAME BUG
}

Both need the same continue-on-error pattern.

Web client (`crates/wzp-web/src/main.rs`)

Needs audit — WebSocket transport may have different error semantics but same pattern should be checked.

Testing

After fix, a congestion event will:

Log warnings with packet counts: send_media error (dropping packet): Blocked
Drop affected packets (brief audio glitch — ~20-100ms)
Resume normal sending once congestion window opens
FEC on the receiver side will recover most dropped packets
Call continues uninterrupted

Timeline

10:37 — First crash observed (LinearProgressIndicator compose bug masked investigation)
10:58 — Debug reports collected, decoded stall pattern identified
11:16 — Relay debug logging deployed, confirmed relay is clean
11:17 — Second debug reports collected, send gaps correlated with relay recv gaps
11:30 — Root cause identified: break on send_media error in send task
11:45 — Fix applied and deployed

4.7 KiB Raw Blame History