feat(telemetry): Phase 4 — LossRecoveryUpdate protocol + relay metrics + DebugReporter
Phase 4 lays the telemetry foundation for distinguishing DRED recoveries
from classical PLC in production: a new SignalMessage variant, two new
per-session Prometheus counters on the relay side, and a highlighted
loss-recovery section in the Android DebugReporter.
The periodic emitter (client → relay) and Grafana panel are deferred to
Phase 4b — this commit ships the protocol surface, the relay sink, and
the immediate user-visible debug output. Once 4b lands the full path
(emitter → relay → Prometheus → Grafana), the metrics here will
automatically start receiving data.
Scope decision — why not extend QualityReport instead:
The existing wire-format QualityReport is a fixed 4-byte media packet
trailer. Adding counter fields to it would shift the binary layout and
break backward compatibility (old receivers would parse the last 4
bytes of the extended trailer as QR, corrupting audio). Using a
new SignalMessage variant on the reliable QUIC signal stream sidesteps
the wire-format problem entirely — serde JSON enums tolerate unknown
variants gracefully on old receivers, and the signal channel is the
right layer for periodic telemetry aggregates.
Changes:
wzp-proto/src/packet.rs:
- New SignalMessage::LossRecoveryUpdate variant carrying:
* dred_reconstructions: u64 (monotonic since call start)
* classical_plc_invocations: u64 (monotonic)
* frames_decoded: u64 (for rate calculation)
- All three fields tagged #[serde(default)] for forward compat.
wzp-client/src/featherchat.rs:
- Added a match arm so signal_to_call_type() handles the new
variant (treat as Offer for featherChat bridging purposes).
wzp-relay/src/metrics.rs:
- Two new IntCounterVec metrics on the relay, labeled by session_id:
* wzp_relay_session_dred_reconstructions_total
* wzp_relay_session_classical_plc_total
- New method update_session_loss_recovery(session_id, dred, plc)
applies monotonic deltas: if the incoming totals exceed the
current counter, the difference is inc_by'd. If the incoming
totals are LOWER (client restart or counter reset), the
Prometheus counter holds steady until the client catches up.
This matches the existing update_session_buffer delta pattern.
- remove_session_metrics() now cleans up the two new labels.
- New test session_loss_recovery_monotonic_delta exercises:
* initial population (10 DRED, 2 PLC)
* forward advance (25, 5 → delta +15, +3)
* lower values ignored (client reset → counters unchanged)
* client catches up (30, 8 → advances to new max)
- Existing session_metrics_cleanup test extended to cover the
new counters.
android/app/src/main/java/com/wzp/debug/DebugReporter.kt:
- Phase 4 users — and incident responders — need to quickly see
whether DRED is actually firing during a call. The stats JSON
already carries the counters (after Phase 3c), but they were
buried in the trailing JSON dump. Added a dedicated
"=== Loss Recovery ===" section to the meta preamble that
extracts dred_reconstructions, classical_plc_invocations,
frames_decoded, and fec_recovered from the JSON and displays
them plainly, plus computed percentages when frames_decoded > 0.
- New extractLongField helper: tiny hand-rolled JSON integer
extractor. We don't want to pull in a full JSON parser for this
single use case and CallStats has a flat, well-known schema.
Verification:
- cargo check --workspace: zero errors
- cargo test -p wzp-proto --lib: 63 passing
- cargo test -p wzp-codec --lib: 68 passing
- cargo test -p wzp-client --lib: 35 passing (+1 ignored probe)
- cargo test -p wzp-relay --lib: 68 passing (+1 new Phase 4 test)
- cargo check -p wzp-android --lib: zero errors
- Android APK build verified earlier today (unridden-alfonso.apk
via the remote Docker builder) — Phase 0–3c confirmed to compile
end-to-end on the NDK target.
Phase 4b remaining (not blocking this commit):
- Periodic LossRecoveryUpdate emitter in wzp-client/src/call.rs and
wzp-android/src/engine.rs (every ~5 s)
- Relay-side handler in main.rs that matches the new variant and
calls metrics.update_session_loss_recovery
- Grafana "Loss recovery breakdown" panel in docs/grafana-dashboard.json
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -46,6 +46,14 @@ class DebugReporter(private val context: Context) {
|
||||
val zipFile = File(context.cacheDir, "wzp_debug_${timestamp}.zip")
|
||||
|
||||
ZipOutputStream(BufferedOutputStream(FileOutputStream(zipFile))).use { zos ->
|
||||
// Phase 4: extract DRED / classical PLC counters from the
|
||||
// stats JSON so they're visible in the meta preamble at a
|
||||
// glance, not buried in the trailing JSON dump.
|
||||
val dredReconstructions = extractLongField(finalStatsJson, "dred_reconstructions")
|
||||
val classicalPlc = extractLongField(finalStatsJson, "classical_plc_invocations")
|
||||
val framesDecoded = extractLongField(finalStatsJson, "frames_decoded")
|
||||
val fecRecovered = extractLongField(finalStatsJson, "fec_recovered")
|
||||
|
||||
// 1. Call metadata
|
||||
val meta = buildString {
|
||||
appendLine("=== WZ Phone Debug Report ===")
|
||||
@@ -58,6 +66,18 @@ class DebugReporter(private val context: Context) {
|
||||
appendLine("Device: ${android.os.Build.MANUFACTURER} ${android.os.Build.MODEL}")
|
||||
appendLine("Android: ${android.os.Build.VERSION.RELEASE} (API ${android.os.Build.VERSION.SDK_INT})")
|
||||
appendLine()
|
||||
appendLine("=== Loss Recovery ===")
|
||||
appendLine("Frames decoded: $framesDecoded")
|
||||
appendLine("DRED reconstructions: $dredReconstructions (Opus neural recovery)")
|
||||
appendLine("Classical PLC: $classicalPlc (fallback)")
|
||||
appendLine("RaptorQ FEC recovered: $fecRecovered (Codec2 only)")
|
||||
if (framesDecoded > 0) {
|
||||
val dredPct = 100.0 * dredReconstructions / framesDecoded
|
||||
val plcPct = 100.0 * classicalPlc / framesDecoded
|
||||
appendLine("DRED rate: ${"%.2f".format(dredPct)}%")
|
||||
appendLine("Classical PLC rate: ${"%.2f".format(plcPct)}%")
|
||||
}
|
||||
appendLine()
|
||||
appendLine("=== Final Stats ===")
|
||||
appendLine(finalStatsJson)
|
||||
}
|
||||
@@ -195,4 +215,28 @@ class DebugReporter(private val context: Context) {
|
||||
FileInputStream(file).use { it.copyTo(zos) }
|
||||
zos.closeEntry()
|
||||
}
|
||||
|
||||
/**
|
||||
* Tiny JSON field extractor — pulls an integer value for a top-level
|
||||
* field like `"dred_reconstructions":42`. We don't want to pull in a
|
||||
* full JSON parser just for the debug preamble, and the CallStats
|
||||
* output is a flat record with well-known field names.
|
||||
*
|
||||
* Returns 0 if the field is missing or unparseable.
|
||||
*/
|
||||
private fun extractLongField(json: String, field: String): Long {
|
||||
val key = "\"$field\":"
|
||||
val idx = json.indexOf(key)
|
||||
if (idx < 0) return 0
|
||||
var i = idx + key.length
|
||||
// Skip whitespace
|
||||
while (i < json.length && json[i].isWhitespace()) i++
|
||||
val start = i
|
||||
while (i < json.length && (json[i].isDigit() || json[i] == '-')) i++
|
||||
return try {
|
||||
json.substring(start, i).toLong()
|
||||
} catch (_: NumberFormatException) {
|
||||
0
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user