docs: incident reports for send-task crash and playout ring desync
Some checks failed
Build Release Binaries / build-amd64 (push) Has been cancelled
Some checks failed
Build Release Binaries / build-amd64 (push) Has been cancelled
Two root-caused bugs documented with full evidence:
1. Send task fatal exit on QUIC congestion (FIXED in 2092245)
- send_media() Err(Blocked) caused break → killed entire call
- Now drops packet and continues
2. Playout ring buffer cursor desync (ROOT-CAUSED, fix pending)
- AudioRing::write() mutates read_pos from producer thread on overflow
- Violates SPSC contract → reader/writer fight over read_pos
- Causes 12-16s bidirectional silence ~25-30s into call
- Both clients affected simultaneously
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
123
debug/INCIDENT-2026-04-06-send-task-crash.md
Normal file
123
debug/INCIDENT-2026-04-06-send-task-crash.md
Normal file
@@ -0,0 +1,123 @@
|
||||
# Incident Report: Send Task Fatal Exit on QUIC Congestion
|
||||
|
||||
**Date:** 2026-04-06
|
||||
**Severity:** High — causes complete audio loss mid-call
|
||||
**Status:** Fixed in Android client, **pending fix in desktop client and web client**
|
||||
|
||||
## Summary
|
||||
|
||||
A QUIC congestion event causes `send_datagram()` to return `Err(Blocked)`. The send task treats this as a fatal error and exits, which kills the entire call via `tokio::select!`. Audio becomes one-way (recv still works briefly) then dies completely.
|
||||
|
||||
## Root Cause
|
||||
|
||||
In the engine's send loop (`run_call` function), `transport.send_media()` errors were handled with `break`:
|
||||
|
||||
```rust
|
||||
// BEFORE (broken)
|
||||
if let Err(e) = transport.send_media(&source_pkt).await {
|
||||
error!("send error: {e}");
|
||||
break; // <-- kills send task, which kills everything
|
||||
}
|
||||
```
|
||||
|
||||
Quinn's `send_datagram()` is synchronous and returns `Err(SendDatagramError::Blocked)` when the QUIC congestion window is full. This is a **transient condition** — the window opens again once ACKs arrive. But the `break` kills the send task, and since all tasks run under `tokio::select!`, the recv task, stats task, and signal task all die too.
|
||||
|
||||
### Why it manifests as "intermittent disconnections"
|
||||
|
||||
- Mobile networks have brief congestion spikes (cell tower handoff, WiFi interference)
|
||||
- A single spike fills the QUIC congestion window
|
||||
- One `Blocked` error → send task exits → `select!` cancels recv → complete silence
|
||||
- The QUIC connection stays open (no error logged), so stats polling continues showing stale data
|
||||
- From the user's perspective: audio drops for 5-20 seconds then "maybe comes back" (it doesn't — they're hearing cached playout ring drain)
|
||||
|
||||
### Evidence from debug reports
|
||||
|
||||
**Relay logs** confirmed the relay was healthy:
|
||||
- `max_forward_ms=0` — relay forwards instantly
|
||||
- `send_errors=0` — no relay-side failures
|
||||
- The relay saw `large recv gap` warnings on participant 1 (Nothing A059): 722ms → 814ms → 1778ms → 3500ms → 6091ms — the client progressively stopped sending
|
||||
|
||||
**Client stats** confirmed:
|
||||
- `frames_encoded` kept incrementing (Opus encoder running)
|
||||
- `frames_decoded` froze at a fixed value (recv task died)
|
||||
- `fec_recovered` froze simultaneously
|
||||
- RTT, loss, jitter all frozen (stats task died)
|
||||
|
||||
## Fix Applied
|
||||
|
||||
### Android client (`crates/wzp-android/src/engine.rs`)
|
||||
|
||||
```rust
|
||||
// AFTER (fixed)
|
||||
if let Err(e) = transport.send_media(&source_pkt).await {
|
||||
send_errors += 1;
|
||||
frames_dropped += 1;
|
||||
if send_errors <= 3 || last_send_error_log.elapsed().as_secs() >= 1 {
|
||||
warn!(seq = s, send_errors, frames_dropped,
|
||||
"send_media error (dropping packet): {e}");
|
||||
last_send_error_log = Instant::now();
|
||||
}
|
||||
continue; // <-- drop packet, keep going
|
||||
}
|
||||
```
|
||||
|
||||
Same pattern applied to FEC repair packet sends.
|
||||
|
||||
Recv task also hardened: transient errors (non-closed/reset) are now logged and survived rather than causing exit.
|
||||
|
||||
Added periodic health logging to both tasks (5-second intervals):
|
||||
- Send: `frames_sent`, `frames_dropped`, `send_errors`, `ring_avail`
|
||||
- Recv: `frames_decoded`, `fec_recovered`, `recv_errors`, `max_recv_gap_ms`, `playout_avail`
|
||||
|
||||
### Relay (`crates/wzp-relay/src/room.rs`)
|
||||
|
||||
Added debug logging to both plain and trunked forwarding loops:
|
||||
- Per-recv gap tracking (warns on >200ms gaps)
|
||||
- Room manager lock contention tracking (warns on >10ms)
|
||||
- Forward latency tracking (warns on >50ms)
|
||||
- Send error counting with peer identification
|
||||
- 5-second periodic stats with all above metrics
|
||||
|
||||
## Affected Clients — FIX REQUIRED
|
||||
|
||||
### Desktop client (`crates/wzp-client/src/cli.rs`)
|
||||
|
||||
**Lines 345-348:**
|
||||
```rust
|
||||
if let Err(e) = transport.send_media(pkt).await {
|
||||
error!("send error: {e}");
|
||||
break; // <-- SAME BUG
|
||||
}
|
||||
```
|
||||
|
||||
**Lines 431-434:**
|
||||
```rust
|
||||
if let Err(e) = send_transport.send_media(pkt).await {
|
||||
error!("send error: {e}");
|
||||
return; // <-- SAME BUG
|
||||
}
|
||||
```
|
||||
|
||||
Both need the same continue-on-error pattern.
|
||||
|
||||
### Web client (`crates/wzp-web/src/main.rs`)
|
||||
|
||||
Needs audit — WebSocket transport may have different error semantics but same pattern should be checked.
|
||||
|
||||
## Testing
|
||||
|
||||
After fix, a congestion event will:
|
||||
1. Log warnings with packet counts: `send_media error (dropping packet): Blocked`
|
||||
2. Drop affected packets (brief audio glitch — ~20-100ms)
|
||||
3. Resume normal sending once congestion window opens
|
||||
4. FEC on the receiver side will recover most dropped packets
|
||||
5. Call continues uninterrupted
|
||||
|
||||
## Timeline
|
||||
|
||||
- 10:37 — First crash observed (LinearProgressIndicator compose bug masked investigation)
|
||||
- 10:58 — Debug reports collected, decoded stall pattern identified
|
||||
- 11:16 — Relay debug logging deployed, confirmed relay is clean
|
||||
- 11:17 — Second debug reports collected, send gaps correlated with relay recv gaps
|
||||
- 11:30 — Root cause identified: `break` on `send_media` error in send task
|
||||
- 11:45 — Fix applied and deployed
|
||||
Reference in New Issue
Block a user