---
tags: [prd, wzp]
type: prd
---

# Design Exploration: Federated Reputation Gossip (T6.3)

> **Status:** Design exploration — no approach selected.
> **Blocked on:** Reviewer design call (needs operator-trust model decision).
> **Scope:** How should WZP relays share abuse verdicts across a federation mesh so that abusers cannot hop between relays?

## Background

WZP relays are E2E-encrypted SFUs. They cannot inspect payload content, but they **can** observe metadata: packet rates, payload sizes, timestamps, keyframe patterns, and BWE responsiveness. Tiers A–F of the conformance pipeline observe these signals and produce a `Verdict ∈ {Legitimate, Suspect, Abusive}`.

Tier G (`ResponsePolicy`) escalates:
- `Abusive` → typed `Hangup` + 1 h fingerprint cool-down
- Repeat `Abusive` within 24 h → relay-local `Block` for 24 h

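The Tier G escalation can be sketched as a small in-memory state machine. This is a minimal illustration, not the actual `ResponsePolicy` code: `ResponsePolicySketch`, `Action`, `on_abusive`, and the constants are hypothetical names, and the caller is assumed to pass the current time as Unix seconds.

```rust
use std::collections::HashMap;

const COOL_DOWN_SECS: u64 = 3_600; // 1 h fingerprint cool-down
const REPEAT_WINDOW_SECS: u64 = 86_400; // "repeat" means within 24 h
const BLOCK_SECS: u64 = 86_400; // relay-local block lasts 24 h

/// Hypothetical action type; the real Tier G types may differ.
#[derive(Debug, PartialEq)]
enum Action {
    /// Typed Hangup plus a cool-down ending at `cool_down_until`.
    Hangup { cool_down_until: u64 },
    /// Relay-local block ending at `blocked_until`.
    Block { blocked_until: u64 },
}

#[derive(Default)]
struct ResponsePolicySketch {
    /// Time of the most recent `Abusive` verdict per fingerprint (in-memory only).
    last_abusive: HashMap<String, u64>,
}

impl ResponsePolicySketch {
    /// Called whenever Tiers A–F produce an `Abusive` verdict for `fingerprint`.
    fn on_abusive(&mut self, fingerprint: &str, now: u64) -> Action {
        // `insert` returns the previous timestamp, if any.
        let previous = self.last_abusive.insert(fingerprint.to_string(), now);
        match previous {
            Some(prev) if now.saturating_sub(prev) < REPEAT_WINDOW_SECS => Action::Block {
                blocked_until: now + BLOCK_SECS,
            },
            _ => Action::Hangup {
                cool_down_until: now + COOL_DOWN_SECS,
            },
        }
    }
}
```

Because `last_abusive` lives in one relay's memory, a second relay starts from an empty map, which is exactly the gap described next.
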
**The gap:** Block state is in-memory only and per-relay. A blocked fingerprint on relay A can reconnect to relay B (same room, same mesh) and resume abuse immediately. T6.3 explores closing this gap.

**What is being gossiped?** A *reputation event*: "fingerprint `F` produced violation `V` with verdict `Abusive` at time `T` on relay `R`."

---

## Assumptions

1. Relays trust each other at the *connection level* (TLS fingerprints in `PeerConfig` / `TrustedConfig`) but are **not** guaranteed to share the same abuse-detection thresholds or calibration.
2. The federation mesh is small (tens of relays, not thousands).
3. False positives happen — a legitimate user on a long lecture call can trigger `Suspect` or even `Abusive` on an aggressively tuned relay.
4. A compromised relay is a realistic threat model (stolen credentials, operator coercion, buggy calibration).
5. Relays are operated by different entities — there is no single administrative root of trust.

---

## Approach 1: Push Gossip

### Summary
When a relay issues a `Block` action (repeat abusive), it immediately broadcasts a `ReputationEvent` to all peer relays via the existing federation QUIC channels. Peers incorporate the event into their local block lists.

### Wire format

```rust
// New SignalMessage variant
ReputationEvent {
    version: u8,
    /// Fingerprint being reported (the offending party, not the reporter).
    fingerprint: String,
    /// Which violation code triggered the block.
    violation: ViolationCode,
    /// When the block was issued (Unix epoch seconds, u64).
    issued_at: u64,
    /// TTL in seconds (default 86400 = 24 h).
    ttl_secs: u32,
    /// Relay that issued the block (TLS fingerprint hex).
    origin_relay_fp: String,
    /// Ed25519 signature over (fingerprint || violation || issued_at || ttl_secs || origin_relay_fp).
    /// The signing key is the relay's long-term identity key (reused from client handshake identity).
    signature: [u8; 64],
}
```

**What is signed?** The canonical serialization of the payload fields (excluding the signature itself). Because `issued_at` and `ttl_secs` are covered by the signature, a receiver can reject a replayed event once it has expired, and the event is bound to a specific origin relay.

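One way to make "canonical serialization" concrete is a fixed field order with big-endian integers and length-prefixed strings, so that two different events can never produce the same byte string. The sketch below is an assumption, not a settled encoding: `ReputationEventFields`, `put_str`, and `signing_bytes` are hypothetical names, `ViolationCode` is assumed to fit in one byte, and including `version` in the signed bytes is a choice the field list above does not specify. The Ed25519 signing call itself is omitted.

```rust
/// Hypothetical mirror of the signed fields of `ReputationEvent`.
struct ReputationEventFields<'a> {
    version: u8,
    fingerprint: &'a str,
    /// `ViolationCode` is assumed to fit in one byte for this sketch.
    violation: u8,
    issued_at: u64,
    ttl_secs: u32,
    origin_relay_fp: &'a str,
}

/// Appends a string as a 4-byte big-endian length followed by its UTF-8 bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_be_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Builds the byte string that would be passed to the Ed25519 signer/verifier.
fn signing_bytes(ev: &ReputationEventFields<'_>) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(ev.version);
    put_str(&mut out, ev.fingerprint);
    out.push(ev.violation);
    out.extend_from_slice(&ev.issued_at.to_be_bytes());
    out.extend_from_slice(&ev.ttl_secs.to_be_bytes());
    put_str(&mut out, ev.origin_relay_fp);
    out
}
```

The length prefixes matter because plain concatenation of variable-length strings is ambiguous: without them, the boundary between one field and the next is not recoverable from the bytes alone.
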
**Key distribution:** Each relay's Ed25519 public key is published in a well-known endpoint (e.g., `/.well-known/wzp-relay.pub`) or embedded in the `FederationHello` handshake. Verification happens on receipt.

### Sybil resistance

- **Signing requirement:** Only relays with a known Ed25519 pubkey can produce valid events. A rogue relay must first be in the `TrustedConfig` to even connect.
- **Origin attribution:** Every event is traceable to a specific relay. If relay R starts flooding false positives, peers can blacklist R's pubkey or reduce R's event weight to zero.
- **No aggregate thresholding:** This design trusts every signed event equally. A single malicious relay can block any fingerprint across the mesh.

**Mitigation option (not implemented):** Require *k-of-n* independent relay reports before applying a cross-relay block. This introduces complexity (tracking per-fingerprint report counts, decaying counters, handling churn).

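The core of the k-of-n mitigation is small; the complexity noted above comes from decay and churn, which this sketch leaves out. `ReportQuorum` and `record` are hypothetical names, and reports are assumed to have passed signature verification before they reach this code.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tracker: a fingerprint is blocked cross-relay only once `k`
/// distinct origin relays have reported it.
struct ReportQuorum {
    k: usize,
    /// fingerprint -> set of origin relays that reported it
    reporters: HashMap<String, HashSet<String>>,
}

impl ReportQuorum {
    fn new(k: usize) -> Self {
        Self { k, reporters: HashMap::new() }
    }

    /// Records one verified report and returns true once at least `k`
    /// distinct relays have reported `fingerprint`.
    fn record(&mut self, fingerprint: &str, origin_relay_fp: &str) -> bool {
        let set = self.reporters.entry(fingerprint.to_string()).or_default();
        // Inserting into a set means repeated reports from one relay count once.
        set.insert(origin_relay_fp.to_string());
        set.len() >= self.k
    }
}
```

With `k = 2`, a single compromised relay can no longer block a fingerprint on its own, at the cost of slower reaction when only one relay has seen the abuse.
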
### Convergence model

- **Eventual consistency:** Events propagate via multi-hop flood (same mechanism as `GlobalRoomActive`).
- **Bounded staleness:** Events carry TTL. Stale events (> TTL) are ignored.
- **No ordering guarantee:** Two relays may issue conflicting events (relay A blocks F, relay B clears F). Last-writer-wins based on `issued_at`.

### Storage

- **In-memory only:** `HashMap<(fingerprint, origin_relay), ReputationEntry>` with TTL-based eviction.
- **No persistence:** Restarting a relay loses all gossiped state. Re-blocking requires the origin relay to re-gossip or the abuser to re-offend.
- **Memory bound:** ~100 bytes per entry × 10k entries = ~1 MB. Trivial.

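A minimal sketch of the in-memory map with last-writer-wins and TTL eviction follows. The map shape comes from the Storage list above; the `ReputationEntry` fields, `ReputationStore`, and its method names are assumptions for illustration, and time is passed in as Unix seconds.

```rust
use std::collections::HashMap;

/// Hypothetical entry layout; only the fields needed for TTL handling.
#[derive(Clone, Debug, PartialEq)]
struct ReputationEntry {
    issued_at: u64,
    ttl_secs: u32,
}

impl ReputationEntry {
    fn expires_at(&self) -> u64 {
        self.issued_at.saturating_add(u64::from(self.ttl_secs))
    }
}

#[derive(Default)]
struct ReputationStore {
    /// Keyed by (fingerprint, origin_relay).
    entries: HashMap<(String, String), ReputationEntry>,
}

impl ReputationStore {
    /// Last-writer-wins: store `entry` only if it is newer than what we hold.
    /// Returns true if the entry was stored.
    fn upsert(&mut self, fingerprint: &str, origin: &str, entry: ReputationEntry) -> bool {
        let key = (fingerprint.to_string(), origin.to_string());
        let newer = self
            .entries
            .get(&key)
            .map_or(true, |held| entry.issued_at > held.issued_at);
        if newer {
            self.entries.insert(key, entry);
        }
        newer
    }

    /// Drops every entry whose TTL has elapsed at `now`.
    fn evict_expired(&mut self, now: u64) {
        self.entries.retain(|_, e| e.expires_at() > now);
    }

    /// True if any origin relay holds an unexpired block for `fingerprint`.
    /// Linear scan; a per-fingerprint index would be needed for large maps.
    fn is_blocked(&self, fingerprint: &str, now: u64) -> bool {
        self.entries
            .iter()
            .any(|((fp, _), e)| fp == fingerprint && e.expires_at() > now)
    }
}
```

Keying by `(fingerprint, origin_relay)` keeps reports from different relays separate, which also preserves the origin attribution needed to discount a misbehaving relay later.
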
### Partition tolerance

- **Partitioned relay A** blocks fingerprint F locally but cannot push to peers. When partition heals, A floods the backlog. Peers apply events with `issued_at` within TTL; expired backlog is ignored.
- **Partitioned relay B** never hears about F's block. F can abuse B freely during the partition. This is acceptable for a best-effort gossip system.

### Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Compromised relay floods false blocks | Any fingerprint can be blocked mesh-wide | Manual pubkey blacklist; no automatic mitigation in this design |
| Buggy relay false-positives a popular fingerprint | Legitimate users blocked everywhere | Operator contact + pubkey downgrade |
| Network partition | Split-brain block lists | Acceptable; partition healing replays backlog |
| Clock skew | Events from future/past rejected or mis-ordered | Use `issued_at` with ±5 min tolerance; NTP assumed |
| Replay attack | Old event re-broadcast after TTL | Signature binds `issued_at`; verify TTL at receipt |

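The clock-skew and replay rows reduce to one freshness check at receipt, run after the signature has been verified. The sketch below is one possible reading of the table: `check_freshness` and `Rejection` are hypothetical names, and applying the 5-minute tolerance only to future-dated events (not to expiry) is an assumption.

```rust
/// ±5 min tolerance from the failure-mode table.
const MAX_CLOCK_SKEW_SECS: u64 = 300;

#[derive(Debug, PartialEq)]
enum Rejection {
    /// `issued_at` is further in the future than the allowed skew.
    FromFuture,
    /// `issued_at + ttl_secs` has already passed.
    Expired,
}

/// Decides whether a signature-verified event is still worth applying.
/// `now` is the receiver's clock in Unix seconds.
fn check_freshness(issued_at: u64, ttl_secs: u32, now: u64) -> Result<(), Rejection> {
    if issued_at > now.saturating_add(MAX_CLOCK_SKEW_SECS) {
        return Err(Rejection::FromFuture);
    }
    let expires_at = issued_at.saturating_add(u64::from(ttl_secs));
    if now >= expires_at {
        return Err(Rejection::Expired);
    }
    Ok(())
}
```

Since `issued_at` and `ttl_secs` are part of the signed payload, a replayed event cannot be given a fresh timestamp without invalidating the signature, so this check bounds how long a captured event stays usable.
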
### Complexity

- **Low-medium:** Reuses existing federation broadcast infrastructure. Adds one `SignalMessage` variant, Ed25519 signing/verification, and an in-memory TTL map.

---

## Approach 2: Pull Gossip (Reputation Oracle)

### Summary
One relay in the mesh is designated the **reputation oracle** (or a small quorum of 3). Other relays query the oracle periodically (e.g., every 60 s) for the current block list. The oracle maintains the authoritative reputation state.

### Wire format

```rust
// Pull request
ReputationQuery {
    version: u8,
    /// Last checkpoint the requester has seen (opaque cursor).
    since_cursor: Option<String>,
}

// Pull response
ReputationSnapshot {
    version: u8,
    /// Opaque cursor for delta pagination.
    cursor: String,
    /// List of active blocks at the oracle.
    blocks: Vec<ReputationBlock>,
    /// Oracle's Ed25519 signature over the serialized snapshot.
    signature: [u8; 64],
}

struct ReputationBlock {
    fingerprint: String,
    violation: ViolationCode,
    issued_at: u64,
    ttl_secs: u32,
    /// Which relay originally reported this (for audit).
    reported_by: String,
}
```

**What is signed?** The entire `ReputationSnapshot` serialized canonically. The oracle is the sole signer.

**Oracle selection:** Config-based. Each relay's config names its oracle(s):
```toml
[reputation]
oracle = "https://relay-oracle.example.com"
oracle_pubkey = "AA:BB:CC:..."
```

### Sybil resistance

- **Centralized trust:** The oracle is a single point of trust. If the oracle is honest, no rogue relay can poison the mesh.
- **Oracle compromise:** A compromised oracle can block or unblock any fingerprint across all querying relays. This is a **catastrophic** failure mode.
- **Quorum variant:** 3 oracles with 2-of-3 signing. Queries return the intersection of block lists. More resilient but adds latency and complexity.

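The quorum variant leaves open how a querying relay combines the three block lists. One hedged reading is a threshold merge: keep a fingerprint only if at least `threshold` oracles list it, where `threshold = 2` matches 2-of-3 and `threshold = 3` is the strict intersection. `quorum_blocks` is a hypothetical helper, and each snapshot is assumed to have passed signature verification already.

```rust
use std::collections::{BTreeSet, HashMap};

/// Returns the fingerprints that appear in at least `threshold` snapshots.
/// Each inner Vec is the fingerprint list from one verified oracle snapshot.
fn quorum_blocks(snapshots: &[Vec<String>], threshold: usize) -> BTreeSet<String> {
    let mut votes: HashMap<&str, usize> = HashMap::new();
    for snapshot in snapshots {
        // Deduplicate within one snapshot so a single oracle cannot vote twice.
        let unique: BTreeSet<&str> = snapshot.iter().map(String::as_str).collect();
        for fp in unique {
            *votes.entry(fp).or_insert(0) += 1;
        }
    }
    votes
        .into_iter()
        .filter(|(_, count)| *count >= threshold)
        .map(|(fp, _)| fp.to_string())
        .collect()
}
```

A lower threshold blocks faster but tolerates fewer compromised oracles; the strict intersection means any single oracle can veto a block, including by being offline.
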
### Convergence model

- **Bounded staleness:** Worst-case = query interval (60 s) + network RTT.
- **Strong consistency within staleness bound:** All querying relays see the same oracle state (modulo query timing skew).
- **No multi-hop gossip:** Direct query/response only.

### Storage

- **Oracle side:** In-memory + optional persistence (SQLite or flat file). Oracle restarts reload from disk.
- **Querying relays:** In-memory cache of the last snapshot. No local state between restarts.
- **Memory bound:** Same as Approach 1 (~1 MB for 10k entries).

### Partition tolerance

- **Partitioned querying relay:** Falls back to local-only blocking (current behavior). When partition heals, next query re-syncs.
- **Partitioned oracle:** All querying relays lose cross-relay reputation and fall back to local-only. Oracle restart without persistence = same.
- **No split-brain:** Either you have the oracle snapshot or you don't. No conflicting states.

### Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Oracle compromise | Global block/unblock of any fingerprint | 2-of-3 quorum; operator out-of-band alert |
| Oracle downtime | All querying relays lose cross-relay reputation | Fallback to local-only; no amplification |
| Rogue relay reports to oracle | Oracle may incorporate false positives | Oracle applies local thresholding (e.g., require 2 independent reports) |
| Query amplification | N relays polling every 60 s add up to N/60 queries per second | Oracle caches; responses are cheap |
| Man-in-the-middle on query | Attacker injects fake snapshot | TLS + signature verification on response |

### Complexity

- **Medium:** Adds an HTTP/QUIC query endpoint, snapshot pagination, oracle election/config, and a new failure mode (oracle as SPOF).
- **Operational burden:** Someone must run the oracle. Small federations may not want this.

---

## Approach 3: No Gossip — Explicit Ban-List Distribution

### Summary
Relays do **not** gossip reputation automatically. Instead, a human admin (or an automated audit pipeline) curates an authoritative ban list, signs it, and pushes it to all relays. Relays apply the list verbatim.

### Wire format

```rust
// Signed ban list (distributed out-of-band: HTTPS, SCP, S3, etc.)
BanList {
    version: u8,
    /// Issued at (Unix epoch seconds).
    issued_at: u64,
    /// Expires at (Unix epoch seconds). After this, the list is ignored.
    expires_at: u64,
    /// Entries.
    entries: Vec<BanEntry>,
    /// Admin Ed25519 signature over canonical serialization.
    signature: [u8; 64],
}

struct BanEntry {
    fingerprint: String,
    /// Human-readable reason (not machine-parsed).
    reason: String,
    /// Optional: which relay originally reported.
    source_relay: Option<String>,
}
```

**What is signed?** The entire `BanList`. The admin (not a relay) is the signer.

**Distribution:** Out-of-band from the federation mesh. Could be:
- Admin `scp`s JSON to each relay's config directory
- Relays poll an HTTPS URL every 5 min
- Shared object storage (S3, GCS)

**Key distribution:** Admin pubkey is baked into each relay's config at provisioning time:
```toml
[ban_list]
admin_pubkey = "AA:BB:CC:..."
url = "https://ops.example.com/banlist.json"
refresh_secs = 300
```

### Sybil resistance

- **Strong:** Only the admin can produce a valid ban list. No relay can poison another relay.
- **Admin compromise:** Catastrophic — attacker controls the global ban list. But this is a standard operational threat model (same as DNS root, certificate transparency log, etc.).
- **No relay-to-relay trust required:** Relays don't need to trust each other's calibration or behaviour.

### Convergence model

- **Poll-based bounded staleness:** Worst-case = `refresh_secs` (default 300 s = 5 min).
- **Strong consistency:** All relays that successfully fetch the list see identical state.
- **No event propagation:** No flood, no multi-hop, no deduplication needed.

### Storage

- **On-disk cache:** Each relay stores the latest fetched ban list to survive restart.
- **In-memory lookup:** `HashSet<fingerprint>` for O(1) block checks.
- **Memory bound:** Same as other approaches.

### Partition tolerance

- **Partitioned relay:** Continues using its last cached ban list until `expires_at`. After expiry, falls back to local-only blocking.
- **No split-brain:** Either you have the signed list or you don't.

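The cached-list behaviour can be sketched as a `HashSet` lookup gated on `expires_at`. `CachedBanList` and its methods are hypothetical names; the list is assumed to have passed the admin signature check before being cached. Refusing to replace a cached list with one that has an older `issued_at` is an extra assumption, added here so that a stale but validly signed list cannot be rolled back in.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of a verified `BanList`.
struct CachedBanList {
    issued_at: u64,
    expires_at: u64,
    fingerprints: HashSet<String>,
}

impl CachedBanList {
    /// After `expires_at` the list is ignored and the relay is local-only again.
    fn is_valid(&self, now: u64) -> bool {
        now < self.expires_at
    }

    /// O(1) block check against the cached list.
    fn is_banned(&self, fingerprint: &str, now: u64) -> bool {
        self.is_valid(now) && self.fingerprints.contains(fingerprint)
    }

    /// Assumption: only accept a candidate that was issued after the cached one.
    fn should_replace_with(&self, candidate: &CachedBanList) -> bool {
        candidate.issued_at > self.issued_at
    }
}
```

A partitioned relay keeps answering `is_banned` from the cache until expiry, which matches the fallback described in the list above.
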
### Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Admin key compromise | Global block/unblock of any fingerprint | Key rotation + out-of-band alert |
| Distribution channel down | Relays use stale cached list until expiry | Short expiry + monitoring |
| Admin false-positives a popular fingerprint | Global block of legitimate users | Human review process; short expiry allows quick recovery |
| Relay never fetches list | Local-only blocking only | Monitoring alert on relay ops dashboard |
| List too large | Fetch latency, memory bloat | Pagination; but 10k entries = ~500 KB JSON, trivial |

### Complexity

- **Low:** No changes to federation mesh at all. Adds an HTTP polling loop, a file watcher, or an S3 sync. The signing/verification is trivial (one Ed25519 verify on fetch).
- **Operational burden:** Requires a human or automated pipeline to maintain the ban list. This is acceptable for small federations (the WZP target audience) but does not scale to open meshes.

---

## Comparative Summary

| Dimension | Approach 1: Push Gossip | Approach 2: Pull Oracle | Approach 3: Ban-List Distribution |
|---|---|---|---|
| **Trust model** | Every relay trusts every other relay's verdict equally | Trusts a designated oracle (or 2-of-3 quorum) | Trusts a single admin key |
| **Sybil resistance** | Weak — one rogue relay can poison the mesh | Medium-strong — oracle is gatekeeper | Strong — only admin can sign |
| **Convergence** | Eventual; multi-hop flood | Bounded (query interval); direct pull | Bounded (poll interval); out-of-band |
| **Partition tolerance** | Acceptable (backlog replay on heal) | Acceptable (fallback to local) | Good (cached list + expiry) |
| **False-positive blast radius** | Mesh-wide from one relay | Mesh-wide from oracle | Mesh-wide from admin |
| **Operational burden** | Low — fully automatic | Medium — must run oracle | Medium — must curate list |
| **Federation code changes** | Medium — broadcast loop, dedup, signatures | Medium — query endpoint, snapshot pagination | Low — out-of-band, no mesh changes |
| **Scaling** | Poor — flood doesn't scale past ~50 relays | Good — O(N) queries, oracle is O(1) | Good — O(N) fetches, no mesh load |
| **Audit trail** | Good — every event attributed to origin relay | Good — oracle logs all reports | Good — list is a snapshot |
| **Rollback / correction** | Hard — events spread everywhere; need counter-events | Easy — oracle updates snapshot | Easy — admin publishes new list |

## Open Questions (Blockers for Implementation)

1. **Trust model:** Do we trust all peer relays equally (Approach 1), trust a designated oracle (Approach 2), or trust a human admin (Approach 3)? This is a product/operations decision, not a technical one.
2. **Key infrastructure:** The federation layer currently has **no message-level signing**. All three approaches require adding Ed25519 signing/verification to relay-relay messages (or out-of-band fetches). The `wzp-crypto` crate already has Ed25519 identity support (used in client handshake) — it can be reused.
3. **Fingerprint scope:** Is the block scoped to a fingerprint (Ed25519 pubkey), an IP address, or both? Current `ResponsePolicy` uses a string fingerprint. Gossip should probably scope to the cryptographic identity, not IP, to prevent NAT rebind evasion.
4. **Privacy leakage:** Gossiping "fingerprint F is abusive" leaks that F was on a call. Is this acceptable? WZP is not anonymous by design (participants know each other's fingerprints), but operators learning about calls they are not in is a privacy concern.
5. **TTL vs. persistent bans:** Should a gossiped block expire automatically (TTL), or should it require manual review for extension? Automatic expiry limits false-positive damage but allows persistent abusers to cycle.
6. **Rate limiting on gossip:** A compromised relay could flood the mesh with `ReputationEvent` messages. The existing per-room rate limiter (500 pps) doesn't apply to control messages. A per-peer gossip rate limit is needed regardless of approach.

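For question 6, a per-peer token bucket is one common shape for such a limit. The sketch below is illustrative only: `GossipLimiter`, `Bucket`, and the burst/refill parameters are hypothetical, peers are keyed by their TLS fingerprint string, and time is passed in as whole Unix seconds.

```rust
use std::collections::HashMap;

struct Bucket {
    tokens: u64,
    last_refill: u64,
}

/// Hypothetical per-peer limiter for gossip control messages.
struct GossipLimiter {
    /// Maximum number of messages a peer can send back-to-back.
    burst: u64,
    /// Tokens added per elapsed second.
    refill_per_sec: u64,
    buckets: HashMap<String, Bucket>,
}

impl GossipLimiter {
    fn new(burst: u64, refill_per_sec: u64) -> Self {
        Self { burst, refill_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if a message from `peer_fp` may be processed at `now`.
    fn allow(&mut self, peer_fp: &str, now: u64) -> bool {
        let burst = self.burst;
        let rate = self.refill_per_sec;
        // A peer seen for the first time starts with a full bucket.
        let bucket = self
            .buckets
            .entry(peer_fp.to_string())
            .or_insert(Bucket { tokens: burst, last_refill: now });
        // Refill for the elapsed time, capped at the burst size.
        let elapsed = now.saturating_sub(bucket.last_refill);
        bucket.tokens = bucket
            .tokens
            .saturating_add(elapsed.saturating_mul(rate))
            .min(burst);
        bucket.last_refill = now;
        if bucket.tokens == 0 {
            return false;
        }
        bucket.tokens -= 1;
        true
    }
}
```

Because each peer has its own bucket, one flooding relay exhausts only its own allowance and does not delay events from well-behaved peers.
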
## Recommendation

**Do not implement any approach yet.** The blockers above (especially #1 trust model and #4 privacy) need a reviewer/operator design call. T6.3 should remain **Blocked** until then.

If forced to pick a default for a small, closed federation (the current WZP target audience), **Approach 3 (Ban-List Distribution)** has the lowest complexity, the strongest Sybil resistance, and the easiest rollback. It sacrifices automation, but small federations typically have an ops person who can review blocks anyway. Approach 1 (Push Gossip) is more elegant for larger meshes but requires solving the rogue-relay problem first (e.g., k-of-n thresholding).