T1.5: Migrate emit/parse sites to v2 wire format

Siavash Sameni
2026-05-11 12:36:45 +04:00
parent 9680b6ff34
commit c93d302656
120 changed files with 5953 additions and 2888 deletions


@@ -0,0 +1,116 @@
# PRD: Transport Feedback & Bandwidth Estimator
> **Status:** proposed
> **Resolves:** Audit W6 (no BWE), W14 (no receiver→sender feedback channel).
> **Depends on:** PRD #1 (wire format v2 — for u32 seq).
## Problem
`AdaptiveQualityController` decides tier transitions from loss% and RTT only. Quinn exposes congestion-window and bytes-in-flight, but we don't consume them. There is no receiver→sender feedback channel beyond the inline 4-byte `QualityReport`.
Consequences:
- On stable links with spare capacity, we never upgrade past the declared profile (audio stuck at Opus 24 k when 64 k is available).
- Oscillation between adjacent tiers on the boundary.
- **No bandwidth-aware adaptation = no usable video.** Video without BWE either oscillates wildly or never uses available capacity.
## Goals
- Continuous bandwidth estimate per session, surfaced to adaptation controllers.
- Receiver→sender feedback at ~50 ms cadence carrying ack/nack/remb.
- Audio benefits immediately (smarter upgrades, fewer oscillations).
- Video uses BWE as its primary input (PRD #7).
## Non-goals
- Replacing Quinn's congestion controller — we ride on top.
- Cross-stream BWE (each session estimates independently for v1).
## Design
### `SignalMessage::TransportFeedback`
New signal variant, sent on the existing signal stream every 50 ms or every N media packets, whichever comes first:
```rust
pub struct TransportFeedback {
    pub version: u8,           // PRD #4 W12: always present
    pub stream_id: u8,         // 0 for session-wide; >0 for per-stream
    pub acked_seqs: Vec<u32>,  // recent seqs received OK (RLE-compressed)
    pub nacked_seqs: Vec<u32>, // recent seqs missing (RLE-compressed)
    pub remb_bps: u32,         // receiver's estimated max bandwidth
    pub recv_time_us: u64,     // arrival-time for sender-side jitter calc
}
```
RLE compression keeps the wire size bounded (typical payload ~50 B).
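The PRD does not pin down the RLE encoding. As a minimal sketch, assuming runs are stored as a `(start, len)` pair (the `SeqRun` type and both function names are hypothetical, and sequence-number wraparound is not handled), the compression could look like this:

```rust
/// One run of consecutive sequence numbers: start, start+1, ..., start+len-1.
/// Hypothetical wire unit: 4 bytes for `start` plus 2 bytes for `len`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SeqRun {
    pub start: u32,
    pub len: u16,
}

/// Collapse an ascending, de-duplicated list of sequence numbers into runs.
/// Wraparound at u32::MAX is deliberately not handled in this sketch.
pub fn rle_encode(seqs: &[u32]) -> Vec<SeqRun> {
    let mut runs: Vec<SeqRun> = Vec::new();
    for &seq in seqs {
        if let Some(run) = runs.last_mut() {
            // Extend the current run when `seq` is exactly the next number.
            let next = run.start.checked_add(u32::from(run.len));
            if run.len < u16::MAX && next == Some(seq) {
                run.len += 1;
                continue;
            }
        }
        // Otherwise start a new run of length 1.
        runs.push(SeqRun { start: seq, len: 1 });
    }
    runs
}

/// Expand runs back into individual sequence numbers.
pub fn rle_decode(runs: &[SeqRun]) -> Vec<u32> {
    runs.iter()
        .flat_map(|r| (0..u32::from(r.len)).map(move |i| r.start.wrapping_add(i)))
        .collect()
}
```

Under this assumed layout, a clean 50 ms window collapses to a single 6-byte run, and each loss gap adds one more run, which is how the payload stays small on mostly healthy links.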
### `BandwidthEstimator` (in `wzp-proto`)
```rust
pub struct BandwidthEstimator {
    cwnd_bps: AtomicU64,        // from Quinn path stats
    bytes_in_flight: AtomicU64, // from Quinn path stats
    peer_remb_bps: AtomicU64,   // from TransportFeedback
    smoothed_bps: AtomicU64,    // EWMA output
}
impl BandwidthEstimator {
    // Update methods are shown as signatures only.
    pub fn update_from_quinn(&self, stats: &QuinnPathStats);
    pub fn update_from_peer(&self, fb: &TransportFeedback);
    pub fn target_send_bps(&self) -> u64 {
        // 0.9 × min(cwnd_bps, peer_remb_bps), EWMA-smoothed.
        // Assumes the update_* methods keep `smoothed_bps` current.
        self.smoothed_bps.load(std::sync::atomic::Ordering::Relaxed)
    }
}
```
Three signals fused:
1. **Quinn cwnd.** Conservative ceiling — sending faster than cwnd just drops or queues.
2. **Peer REMB.** Receiver's perspective on what they can actually consume (after their own jitter buffer, decode budget, etc.).
3. **EWMA smoothing.** Half-life ~2 s; avoids oscillation.
Target = 90 % of `min(cwnd, remb)`, leaving headroom for probing upward.
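The fusion rule above can be sketched as a single-threaded update step. This is an illustration under stated assumptions, not the proposed implementation: `BweSketch`, `update`, and `cwnd_to_bps` are hypothetical names, the estimator is fed at irregular intervals (hence the `dt_s` argument), and Quinn is assumed to report the congestion window in bytes, so it is converted to a rate by dividing by the RTT.

```rust
use std::time::Duration;

/// Convert a congestion window in bytes into an approximate rate in bits/s
/// (one window per round trip). Returns 0 when the RTT is unknown (zero).
pub fn cwnd_to_bps(cwnd_bytes: u64, rtt: Duration) -> u64 {
    let rtt_us = rtt.as_micros();
    if rtt_us == 0 {
        return 0;
    }
    let bps = u128::from(cwnd_bytes) * 8 * 1_000_000 / rtt_us;
    bps.min(u128::from(u64::MAX)) as u64
}

/// Hypothetical, non-atomic version of the estimator's fusion step.
pub struct BweSketch {
    smoothed_bps: Option<f64>,
    half_life_s: f64, // must be > 0; the PRD suggests about 2 s
}

impl BweSketch {
    pub fn new(half_life_s: f64) -> Self {
        Self { smoothed_bps: None, half_life_s }
    }

    /// Feed one sample taken `dt_s` seconds after the previous one and
    /// return the new smoothed target in bits/s.
    pub fn update(&mut self, cwnd_bps: u64, peer_remb_bps: u64, dt_s: f64) -> u64 {
        // Raw target: 90 % of the tighter of the two limits.
        let raw = 0.9 * cwnd_bps.min(peer_remb_bps) as f64;
        let next = match self.smoothed_bps {
            // The first sample seeds the average.
            None => raw,
            Some(prev) => {
                // The weight of the old value halves every `half_life_s`.
                let keep = 0.5_f64.powf(dt_s.max(0.0) / self.half_life_s);
                keep * prev + (1.0 - keep) * raw
            }
        };
        self.smoothed_bps = Some(next);
        next.round() as u64
    }
}
```

Expressing the smoothing as a half-life rather than a fixed alpha keeps the behaviour the same whether samples arrive every 50 ms or only once a second.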
### Adaptation controller integration
`AdaptiveQualityController::tick()` already consumes loss/RTT/jitter. Add BWE input:
```rust
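// Require 30 % headroom over the current tier ceiling. Integer math, assuming
// current_tier_ceiling_bps() returns u64 (u64 * 1.3 does not type-check).
let ceiling_bps = self.current_tier_ceiling_bps();
if self.bwe.target_send_bps() > ceiling_bps * 13 / 10
    && consecutive_upgrade_reports >= UPGRADE_THRESHOLD
{
    self.upgrade_one_tier();
}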
```
Upgrade gated on BWE *headroom*, not just clean reports. Eliminates the "always at Opus 24 k on a fiber link" pathology.
### Probing
To detect unused capacity, the sender occasionally adds 5-10 % padding/FEC during otherwise-clean windows. If `cwnd` doesn't drop and `remb` doesn't fall, the headroom is real and the controller can upgrade. If the signals degrade, it backs off. Cheap and standard.
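A minimal sketch of the probe decision, assuming the sender snapshots `cwnd` and `remb` immediately before and after a probe window. The type and function names and the `tolerance_pct` noise allowance are hypothetical; the PRD only states that neither signal should fall.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    HeadroomConfirmed,
    BackOff,
}

/// Snapshot of the two signals the probe watches.
#[derive(Debug, Clone, Copy)]
pub struct PathSample {
    pub cwnd_bps: u64,
    pub remb_bps: u64,
}

/// Extra padding/FEC rate for a probe, e.g. 5 to 10 % of the media rate.
pub fn padding_bps(media_bps: u64, padding_pct: u8) -> u64 {
    media_bps.saturating_mul(u64::from(padding_pct)) / 100
}

/// True when `after` is no more than `tolerance_pct` percent below `before`.
fn held(before: u64, after: u64, tolerance_pct: u8) -> bool {
    let keep = 100u128.saturating_sub(u128::from(tolerance_pct));
    u128::from(after) * 100 >= u128::from(before) * keep
}

/// Headroom is confirmed only if both signals held up during the probe.
pub fn evaluate_probe(before: PathSample, after: PathSample, tolerance_pct: u8) -> ProbeOutcome {
    if held(before.cwnd_bps, after.cwnd_bps, tolerance_pct)
        && held(before.remb_bps, after.remb_bps, tolerance_pct)
    {
        ProbeOutcome::HeadroomConfirmed
    } else {
        ProbeOutcome::BackOff
    }
}
```

A small tolerance avoids treating ordinary measurement noise as a degradation; setting it to 0 gives the strict reading of the rule.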
## Implementation outline
1. New `wzp-proto::bwe::BandwidthEstimator`.
2. `wzp-transport` exposes `QuinnPathStats { cwnd_bps, bytes_in_flight, rtt_ms }`; already partially there via `QuinnPathSnapshot`.
3. `SignalMessage::TransportFeedback` variant + serde.
4. Receiver-side: track recent seqs in a ring buffer; emit feedback every 50 ms.
5. Sender-side: BWE consumes own Quinn stats + incoming feedback.
6. `AdaptiveQualityController::set_bwe(&BandwidthEstimator)`.
7. Prometheus: `wzp_session_bwe_bps`, `wzp_session_remb_bps`, `wzp_session_cwnd_bps`.
8. Probing logic behind a flag for first deployment.
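Step 4 can be sketched as a small receiver-side tracker. This is one possible shape, assuming timestamps are passed in as microseconds and that gaps between the lowest and highest sequence number seen in a window count as missing. `FeedbackTracker` and its methods are hypothetical, and reordering across windows and sequence wraparound are ignored.

```rust
use std::collections::VecDeque;

/// Bounded buffer of recently received sequence numbers plus emit timing.
pub struct FeedbackTracker {
    received: VecDeque<u32>,
    capacity: usize,
    last_emit_us: u64,
    interval_us: u64,   // 50_000 for the 50 ms cadence
    max_packets: usize, // the "every N media packets" trigger
}

impl FeedbackTracker {
    pub fn new(capacity: usize, interval_us: u64, max_packets: usize) -> Self {
        Self {
            received: VecDeque::with_capacity(capacity),
            capacity,
            last_emit_us: 0,
            interval_us,
            max_packets,
        }
    }

    /// Record one received media packet, evicting the oldest entry when full.
    pub fn on_packet(&mut self, seq: u32) {
        if self.received.len() == self.capacity {
            self.received.pop_front();
        }
        self.received.push_back(seq);
    }

    /// Feedback is due after the interval or after N packets, whichever comes first.
    pub fn is_due(&self, now_us: u64) -> bool {
        now_us.saturating_sub(self.last_emit_us) >= self.interval_us
            || self.received.len() >= self.max_packets
    }

    /// Drain the window into (acked, nacked) and restart the timer.
    pub fn take_report(&mut self, now_us: u64) -> (Vec<u32>, Vec<u32>) {
        let mut acked: Vec<u32> = self.received.drain(..).collect();
        acked.sort_unstable();
        acked.dedup();
        // Every number skipped between two neighbouring acks is reported missing.
        let mut nacked = Vec::new();
        for pair in acked.windows(2) {
            nacked.extend(pair[0] + 1..pair[1]);
        }
        self.last_emit_us = now_us;
        (acked, nacked)
    }
}
```

Because a report is emitted at most once per interval unless the packet trigger fires, choosing N no smaller than the number of packets expected in 50 ms keeps the cadence near the 20 Hz cap listed under Risks.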
## Acceptance criteria
- On a shaped 5 Mbps link with Opus 24 k, controller upgrades to Opus 64 k within 30 s.
- On a shaped 50 kbps link, controller stays at Opus 6 k and does not oscillate.
- Feedback wire size < 100 B per 50 ms (20 reports/s, so < 2 kB/s = < 16 kbps overhead).
- Probing finds headroom on a 10 Mbps link in < 60 s.
## Risks
- **Probing-induced loss on already-saturated links.** Mitigation: probe only when smoothed loss < 1 % over 10 s.
- **Feedback storm under heavy loss.** Mitigation: feedback rate capped at 20 Hz independent of media rate.
- **Quinn cwnd lies on QUIC-over-some-VPNs.** Mitigation: REMB serves as cross-check; take min of the two.
## Effort
~4 engineer-days (Wave 2 tasks T2.1-T2.3).