Incident Report: Playout Ring Buffer Cursor Desync — Bidirectional Audio Loss

Date: 2026-04-06
Severity: Critical (causes 10-16 seconds of complete bidirectional silence mid-call)
Status: Root-caused, fix pending
Affects: All clients using AudioRing (Android, potentially desktop)

Summary

Both participants in a call experience simultaneous, prolonged audio silence (10-16 seconds) despite the QUIC transport, relay, and Rust codec pipeline all functioning normally. The root cause is a cursor desynchronization in the lock-free SPSC ring buffer (AudioRing) that transfers decoded PCM from the Rust recv task to the Kotlin AudioTrack playout thread.

How We Know It's the Ring Buffer

Evidence that eliminates other components

| Component | Evidence it's healthy | Source |
|---|---|---|
| QUIC send path | frames_dropped=0, send_errors=0 on both clients | Engine send stats log |
| QUIC recv path | max_recv_gap_ms=82, recv_errors=0; no gaps >82ms | Engine recv stats log |
| Relay forwarding | max_forward_ms=0, send_errors=0 in the previous relay-instrumented test | Relay debug logging |
| Opus codec | frames_decoded=2442 over 51.9s ≈ 47 frames/s, consistent with 20ms frames | Final stats JSON |
| FEC | fec_recovered=4870; FEC working normally | Final stats JSON |
| Audio capture | Pixel 6 capture has 0% silence; the Nothing phone has gaps, but those are expected mic pauses | capture_rms.csv |

Evidence pointing to the ring buffer

  1. Both clients go silent at the exact same wall-clock moment (26.66s into the call), which rules out per-device issues. The only shared network component is the relay, and it was proven healthy in prior instrumented tests; the remaining common factor is that both clients run the identical AudioRing code.

  2. playout_avail=8640 at stats dump time — the playout ring reports 8640 samples available (180ms, nearly full at the 9600 capacity). The recv task believes it has successfully written data into the ring. But the AudioTrack playout thread is reading silence (RMS=0 for 12+ seconds).

  3. Recv task continued receiving packets with no gaps: max_recv_gap_ms=82 across the entire call. The decoded audio was written to the ring continuously.

  4. Silence starts and ends cleanly — the transition from audio → silence happens within a single 20ms frame (frame 1332: rms=101, frame 1333: rms=0). This is not network degradation (which shows gradual quality loss). It's a discrete state change — the reader suddenly stops seeing data.

  5. Recovery is also discrete — at ~38.8s (Sharp Hawk) and ~42.7s (Pixel 6), audio snaps back with high-energy frames (rms=3296+). Not a gradual reconnection.

The Ring Buffer Code

File: crates/wzp-android/src/audio_ring.rs

const RING_CAPACITY: usize = 960 * 10; // 9600 samples = 200ms at 48kHz

pub struct AudioRing {
    buf: Box<[i16; RING_CAPACITY]>,
    write_pos: AtomicUsize,  // monotonic sample count, wraps at usize::MAX; producer-owned
    read_pos: AtomicUsize,   // monotonic sample count, wraps at usize::MAX; consumer-owned under the SPSC contract
}

available() — how many samples can be read

pub fn available(&self) -> usize {
    let w = self.write_pos.load(Ordering::Acquire);
    let r = self.read_pos.load(Ordering::Acquire);
    w.wrapping_sub(r)  // relies on usize wrapping arithmetic
}
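
To make the cursor scheme concrete: both cursors are monotonic sample counts, not buffer indices, so the subtraction stays correct even after the physical buffer has wrapped many times. A minimal illustration (values are made up for the example):

fn main() {
    const RING_CAPACITY: usize = 9600;
    let w: usize = 1_200_000;  // ~25s of audio written; buffer indices have wrapped ~125 times
    let r: usize = 1_195_200;  // reader 4800 samples (100ms) behind
    assert_eq!(w.wrapping_sub(r), 4800);  // available() is still correct
    assert_eq!(w % RING_CAPACITY, 0);     // physical write index
    assert_eq!(r % RING_CAPACITY, 4800);  // physical read index
}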

write() — producer (Rust recv task thread, inside tokio block_on)

pub fn write(&self, samples: &[i16]) -> usize {
    let w = self.write_pos.load(Ordering::Relaxed);
    let count = samples.len().min(RING_CAPACITY);
    // ... write samples at (w + i) % RING_CAPACITY ...
    self.write_pos.store(w.wrapping_add(count), Ordering::Release);

    // If we overwrote unread data, advance read_pos
    if self.available() > RING_CAPACITY {
        let new_read = self.write_pos.load(Ordering::Relaxed).wrapping_sub(RING_CAPACITY);
        self.read_pos.store(new_read, Ordering::Release);
    }
    count // samples written (the signature returns usize)
}

read() — consumer (Kotlin AudioTrack JVM thread, via JNI)

pub fn read(&self, out: &mut [i16]) -> usize {
    let avail = self.available();
    let count = out.len().min(avail);
    let r = self.read_pos.load(Ordering::Relaxed);
    // ... read samples at (r + i) % RING_CAPACITY ...
    self.read_pos.store(r.wrapping_add(count), Ordering::Release);
    count
}

Suspected Failure Modes

1. Writer advances read_pos while reader is mid-read (data race)

The write() method at lines 68-72 modifies read_pos from the writer thread when it detects overflow. But the read() method on the consumer thread also modifies read_pos. This violates the SPSC contract — read_pos is supposed to be owned by the consumer.

Scenario:

  1. Reader loads read_pos = R (line 82)
  2. Writer detects overflow, stores read_pos = R' (line 71) where R' > R
  3. Reader finishes reading, stores read_pos = R + count (line 88) — overwrites the writer's R' with a stale, smaller value

After step 3, the ring's read_pos has gone backwards. Now available() returns write_pos.wrapping_sub(old_read_pos) which is larger than RING_CAPACITY. Every subsequent write() call hits the overflow branch and keeps advancing read_pos, but the reader keeps overwriting it back. The ring is in a corrupted state where the reader and writer are fighting over read_pos.
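
A sequential reenactment of this interleaving makes the corruption concrete. This is a sketch with made-up cursor values, not the production code:

fn main() {
    const CAP: usize = 9600;
    let w: usize = 1_200_960;       // writer just finished a 960-sample frame
    let r: usize = w - CAP - 1920;  // reader has fallen two frames past capacity
    // Step 1: reader loads read_pos = R
    let reader_snapshot = r;
    // Step 2: writer sees overflow and stores read_pos = write_pos - CAP
    let mut read_pos = w.wrapping_sub(CAP);
    assert_eq!(w.wrapping_sub(read_pos), CAP);  // writer believes it repaired the ring
    // Step 3: reader stores R + count, clobbering the writer's store
    let count = 960;
    read_pos = reader_snapshot.wrapping_add(count);
    // read_pos has gone backwards: available() now exceeds capacity, so every
    // subsequent write() re-enters the overflow branch and the fight continues.
    assert!(w.wrapping_sub(read_pos) > CAP);
}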

2. wrapping_sub returns astronomically large values

available() uses w.wrapping_sub(r). On a 64-bit platform, if due to the race above r > w, wrapping_sub returns usize::MAX - (r - w) + 1 — an enormous number. The read() method caps this with out.len().min(avail) so it reads out.len() samples. But those samples are from indices calculated as (r + i) % RING_CAPACITY which wraps correctly. The samples read would be whatever was in the buffer at those positions — potentially stale/old data, or zeros from initialization.

However, the playout RMS CSV shows clean zeros (RMS=0), not garbage. This suggests the ring is returning zeroed buffer contents: read_pos has jumped far ahead of write_pos and is reading positions whose current contents are zero. Note that the overflow-advance logic only moves the cursor and never zeroes memory, so the exact source of the zeros is not fully explained; the zero-initialized buffer is the most plausible candidate.
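
For concreteness, the wrapping arithmetic when the race leaves read_pos ahead of write_pos (arbitrary small values):

fn main() {
    let w: usize = 100;
    let r: usize = 160;  // reader cursor ended up ahead of the writer
    // On a 64-bit platform, wrapping_sub yields 2^64 - 60:
    assert_eq!(w.wrapping_sub(r), usize::MAX - 59);
}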

3. Why the silence lasts 12-16 seconds

After the desync, each write() call (every 20ms when a packet is decoded) enters the overflow branch and resets read_pos. But the reader immediately overwrites it back in its next read() call. This tug-of-war continues until one of two things happens:

  • The cursors happen to realign through wrapping arithmetic
  • A timing coincidence where the writer's store to read_pos happens to "win" the race

The 12-16 second duration is non-deterministic and depends on exact thread scheduling and cursor values.

Reproduction Pattern

The bug manifests after roughly 25-30 seconds of a call. This timing is suspicious:

  • At 48kHz mono, 20ms frames = 50 frames/sec
  • Each decoded frame writes 960 samples to the ring
  • After 25 seconds: write_pos ≈ 25 * 50 * 960 = 1,200,000
  • The ring capacity is 9600, so write_pos has wrapped around RING_CAPACITY about 125 times

Note that the usize cursors themselves are nowhere near wrapping at this scale; what matters is that the overflow branch (and with it the read_pos race) cannot fire until the ring first fills completely. Roughly 25-30 seconds of scheduling jitter or clock drift between the network-driven writer and the AudioTrack-driven reader is plausibly how long it takes the 200ms ring to fill for the first time, opening the race window.
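
A quick check of the arithmetic in the list above (48kHz, 20ms frames):

fn main() {
    let frames_per_sec = 48_000 / 960;              // 50 frames/s
    let write_pos_25s = 25 * frames_per_sec * 960;  // samples written after 25s
    assert_eq!(write_pos_25s, 1_200_000);
    assert_eq!(write_pos_25s / 9600, 125);          // times the buffer index has wrapped
}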

Data Files

All data from two independent test sessions (3 calls total) is in /workspace/wzp/debug/:

| File | Contents |
|---|---|
| wzp_debug_20260406_120546.zip | Sharp Hawk (Nothing A059), 51.9s call |
| wzp_debug_20260406_120549.zip | Bright Viper (Pixel 6), 51.9s call |
| wzp_debug_20260406_111733.zip | Sharp Hawk, earlier 72.0s call, same pattern |
| wzp_debug_20260406_111735.zip | Bright Viper, earlier 72.0s call, same pattern |
| wzp_debug_20260406_105858.zip | First session (pre-logging fix), 39.8s call |
| wzp_debug_20260406_105900.zip | First session, 39.7s call |

Key fields in each zip

  • meta.txt — device, duration, final stats JSON
  • playout_rms.csv — per-frame (20ms) RMS of AudioTrack output; silence = RMS 0
  • capture_rms.csv — per-frame RMS of AudioRecord input
  • logcat.txt — Android logcat filtered to WZP + audio tags

How to reproduce the analysis

import csv

# Flag every 20ms playout frame with zero RMS, skipping the first 2s of warm-up.
with open("playout_rms.csv") as f:
    for row in csv.DictReader(f):
        if int(row['rms']) == 0 and int(row['time_ms']) > 2000:
            print(f"SILENCE at {row['time_ms']}ms")

Affected Code

  • crates/wzp-android/src/audio_ring.rs — the AudioRing struct, specifically the write() method's overflow handling that mutates read_pos from the producer thread
  • Any client using AudioRing for playout (currently only Android; desktop uses cpal directly)

Constraints for the Fix

  1. Must remain lock-free — AudioTrack thread runs at Thread.MAX_PRIORITY and cannot block
  2. Must handle overflow gracefully — if the reader falls behind, old audio should be dropped, not cause a desync
  3. The writer (Rust recv task) and reader (Kotlin AudioTrack via JNI) run on different threads with different scheduling priorities
  4. Ring capacity is 200ms which is tight — any fix must not increase latency significantly
  5. The write_pos and read_pos are the only synchronization mechanism (no mutex, no condvar)
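
One direction that appears to satisfy all five constraints, sketched below: make read_pos exclusively reader-owned and move overflow handling to the consumer, which skips ahead itself when it detects the writer has lapped it; write() then only ever touches write_pos. A frame overwritten mid-read is still possible during overflow, but that degrades to a one-frame glitch rather than a cursor desync. This is a sketch under those assumptions, not the final fix:

// Consumer side: reader-side catch-up keeps read_pos single-writer.
pub fn read(&self, out: &mut [i16]) -> usize {
    let w = self.write_pos.load(Ordering::Acquire);
    let mut r = self.read_pos.load(Ordering::Relaxed);
    // If the writer has lapped us, drop the oldest audio ourselves and
    // resume one full ring behind the writer. Only this thread ever
    // stores to read_pos, so the store below can never be clobbered.
    if w.wrapping_sub(r) > RING_CAPACITY {
        r = w.wrapping_sub(RING_CAPACITY);
    }
    let count = out.len().min(w.wrapping_sub(r));
    // ... copy samples at (r + i) % RING_CAPACITY ...
    self.read_pos.store(r.wrapping_add(count), Ordering::Release);
    count
}

// Producer side: write() loses its overflow branch entirely and never
// reads or writes read_pos; it just copies samples and advances write_pos.

This preserves constraint 2 (old audio is dropped, by the reader) while keeping both paths lock-free and wait-free.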