wz-phone/vault/PRDs/PRD-relay-federation-gossip.md

---
tags: [prd, wzp]
type: prd
---

# Design Exploration: Federated Reputation Gossip (T6.3)

> **Status:** Design exploration — no approach selected.
> **Blocked on:** Reviewer design call (needs operator-trust model decision).
> **Scope:** How should WZP relays share abuse verdicts across a federation mesh so that abusers cannot hop between relays?

## Background

WZP relays are E2E-encrypted SFUs. They cannot inspect payload content, but they **can** observe metadata: packet rates, payload sizes, timestamps, keyframe patterns, and BWE responsiveness. Tiers A–F of the conformance pipeline observe these signals and produce a `Verdict ∈ {Legitimate, Suspect, Abusive}`.

Tier G (`ResponsePolicy`) escalates:
- `Abusive` → typed `Hangup` + 1 h fingerprint cool-down
- Repeat `Abusive` within 24 h → relay-local `Block` for 24 h
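
The two escalation rules above can be sketched as a small state machine. This is a minimal illustration, not the real `ResponsePolicy`: the `EscalationSketch` type, the `Action` enum, and the use of epoch seconds passed in by the caller are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical action type mirroring the two Tier G outcomes described above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Action {
    /// Typed hangup plus a fingerprint cool-down lasting until `until` (epoch secs).
    HangupWithCooldown { until: u64 },
    /// Relay-local block lasting until `until` (epoch secs).
    Block { until: u64 },
}

const COOLDOWN_SECS: u64 = 3_600; // 1 h cool-down
const REPEAT_WINDOW_SECS: u64 = 86_400; // a repeat within 24 h escalates
const BLOCK_SECS: u64 = 86_400; // block lasts 24 h

/// Hypothetical in-memory tracker; state is lost on restart, like the real block state.
#[derive(Default)]
pub struct EscalationSketch {
    /// Time of the most recent `Abusive` verdict per fingerprint.
    last_abusive: HashMap<String, u64>,
}

impl EscalationSketch {
    /// Record an `Abusive` verdict at time `now` and return the resulting action.
    pub fn on_abusive(&mut self, fingerprint: &str, now: u64) -> Action {
        let previous = self.last_abusive.insert(fingerprint.to_owned(), now);
        match previous {
            Some(prev) if now.saturating_sub(prev) < REPEAT_WINDOW_SECS => {
                Action::Block { until: now + BLOCK_SECS }
            }
            _ => Action::HangupWithCooldown { until: now + COOLDOWN_SECS },
        }
    }
}
```

Because the map lives only in this relay's memory, a second relay running the same logic starts with an empty history for the same fingerprint, which is the gap described next.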

**The gap:** Block state is in-memory only and per-relay. A blocked fingerprint on relay A can reconnect to relay B (same room, same mesh) and resume abuse immediately. T6.3 explores closing this gap.

**What is being gossiped?** A *reputation event*: "fingerprint `F` produced violation `V` with verdict `Abusive` at time `T` on relay `R`."

---

## Assumptions

1. Relays trust each other at the *connection level* (TLS fingerprints in `PeerConfig` / `TrustedConfig`) but are **not** guaranteed to share the same abuse-detection thresholds or calibration.
2. The federation mesh is small (tens of relays, not thousands).
3. False positives happen — a legitimate user on a long lecture call can trigger `Suspect` or even `Abusive` on an aggressively-tuned relay.
4. A compromised or misbehaving relay is a realistic threat (stolen credentials, operator coercion, buggy calibration).
5. Relays are operated by different entities — there is no single administrative root of trust.

---

## Approach 1: Push Gossip

### Summary
When a relay issues a `Block` action (repeat abusive), it immediately broadcasts a `ReputationEvent` to all peer relays via the existing federation QUIC channels. Peers incorporate the event into their local block lists.

### Wire format

```rust
// New SignalMessage variant
ReputationEvent {
    version: u8,
    /// Fingerprint being reported (the abusive party, not the reporter).
    fingerprint: String,
    /// Which violation code triggered the block.
    violation: ViolationCode,
    /// When the block was issued (Unix epoch seconds, u64).
    issued_at: u64,
    /// TTL in seconds (default 86400 = 24 h).
    ttl_secs: u32,
    /// Relay that issued the block (TLS fingerprint hex).
    origin_relay_fp: String,
    /// Ed25519 signature over (version || fingerprint || violation || issued_at || ttl_secs || origin_relay_fp).
    /// The signing key is the relay's long-term identity key (reused from client handshake identity).
    signature: [u8; 64],
}
```

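**What is signed?** The canonical serialization of the payload fields (excluding the signature itself). Because `issued_at` and `ttl_secs` are covered by the signature, a receiver can reject a re-broadcast event once it has expired; because `origin_relay_fp` is covered, the event is bound to a specific origin relay. A replay inside the TTL window is still possible, so receivers should also de-duplicate.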
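
One way to make the serialization canonical is to length-prefix every variable-length field and encode integers big-endian, so that no two distinct payloads produce the same byte string. The sketch below is an assumption-laden illustration: the `ReputationPayload` struct, the `u16` stand-in for `ViolationCode`, the domain tag, and the inclusion of `version` in the signed bytes are all hypothetical choices, not something the current wire format specifies.

```rust
/// Payload fields of `ReputationEvent`, minus the signature (hypothetical struct).
pub struct ReputationPayload {
    pub version: u8,
    pub fingerprint: String,
    /// Assumption: `ViolationCode` can be represented as a `u16` on the wire.
    pub violation: u16,
    pub issued_at: u64,
    pub ttl_secs: u32,
    pub origin_relay_fp: String,
}

/// Append a string as a 4-byte big-endian length followed by its UTF-8 bytes.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u32).to_be_bytes());
    buf.extend_from_slice(s.as_bytes());
}

/// Build the byte string that would be passed to Ed25519 sign / verify.
pub fn signing_bytes(p: &ReputationPayload) -> Vec<u8> {
    let mut buf = Vec::new();
    // Hypothetical domain-separation tag so the signature cannot be
    // confused with a signature over some other message type.
    buf.extend_from_slice(b"wzp-reputation-event");
    buf.push(p.version);
    put_str(&mut buf, &p.fingerprint);
    buf.extend_from_slice(&p.violation.to_be_bytes());
    buf.extend_from_slice(&p.issued_at.to_be_bytes());
    buf.extend_from_slice(&p.ttl_secs.to_be_bytes());
    put_str(&mut buf, &p.origin_relay_fp);
    buf
}
```

The signer and every verifier would need to call the same function on the same field values; the actual Ed25519 call is left out here because it depends on how `wzp-crypto` exposes signing.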

**Key distribution:** Each relay's Ed25519 public key is published in a well-known endpoint (e.g., `/.well-known/wzp-relay.pub`) or embedded in the `FederationHello` handshake. Verification happens on receipt.

### Sybil resistance

- **Signing requirement:** Only relays with a known Ed25519 pubkey can produce valid events. A rogue relay must first be in the `TrustedConfig` to even connect.
- **Origin attribution:** Every event is traceable to a specific relay. If relay R starts flooding false positives, peers can blacklist R's pubkey or reduce R's event weight to zero.
- **No aggregate thresholding:** This design trusts every signed event equally. A single malicious relay can block any fingerprint across the mesh.

**Mitigation option (not implemented):** Require *k-of-n* independent relay reports before applying a cross-relay block. This introduces complexity (tracking per-fingerprint report counts, decaying counters, handling churn).
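
To give a sense of the bookkeeping a k-of-n rule would involve, here is a minimal sketch. It counts distinct origin relays with an unexpired report for a fingerprint and treats the fingerprint as blocked once that count reaches `k`. The `QuorumTracker` type and its method names are hypothetical, and the sketch assumes signatures were already verified before `record` is called.

```rust
use std::collections::HashMap;

/// Hypothetical tracker for the k-of-n mitigation.
pub struct QuorumTracker {
    /// Number of distinct relays that must agree before a cross-relay block applies.
    k: usize,
    /// fingerprint -> (origin relay -> expiry time of that relay's latest report).
    reports: HashMap<String, HashMap<String, u64>>,
}

impl QuorumTracker {
    pub fn new(k: usize) -> Self {
        Self { k, reports: HashMap::new() }
    }

    /// Record a verified report and return whether the fingerprint is now blocked.
    /// Repeated reports from the same relay only extend that relay's expiry;
    /// they never count as additional votes.
    pub fn record(
        &mut self,
        fingerprint: &str,
        origin_relay: &str,
        issued_at: u64,
        ttl_secs: u32,
        now: u64,
    ) -> bool {
        let expires = issued_at.saturating_add(u64::from(ttl_secs));
        let per_relay = self.reports.entry(fingerprint.to_owned()).or_default();
        let slot = per_relay.entry(origin_relay.to_owned()).or_insert(0);
        *slot = (*slot).max(expires);
        self.is_blocked(fingerprint, now)
    }

    /// True if at least `k` distinct relays hold an unexpired report.
    pub fn is_blocked(&self, fingerprint: &str, now: u64) -> bool {
        self.reports.get(fingerprint).map_or(false, |m| {
            m.values().filter(|&&exp| exp > now).count() >= self.k
        })
    }
}
```

With `k = 1` this degenerates to the base design, where a single relay's event is enough. The sketch never prunes expired reports, which a real implementation would need to do to keep the memory bound.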

### Convergence model

- **Eventual consistency:** Events propagate via multi-hop flood (same mechanism as `GlobalRoomActive`).
- **Bounded staleness:** Events carry TTL. Stale events (> TTL) are ignored.
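- **No ordering guarantee:** Two relays may issue conflicting events (relay A blocks F, relay B clears F). Last-writer-wins based on `issued_at`. Note that the wire format above defines no "clear" event, so supporting this case would require an additional variant or field.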

### Storage

- **In-memory only:** `HashMap<(fingerprint, origin_relay), ReputationEntry>` with TTL-based eviction.
- **No persistence:** Restarting a relay loses all gossiped state. Re-blocking requires the origin relay to re-gossip or the abuser to re-offend.
- **Memory bound:** ~100 bytes per entry × 10k entries = ~1 MB. Trivial.
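
A minimal sketch of the in-memory store described above, keyed by `(fingerprint, origin_relay)` with TTL-based eviction. The `ReputationStore` type, the shape of `ReputationEntry`, and the `u16` violation code are assumptions for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical entry shape; the real `ReputationEntry` is not yet defined.
#[derive(Debug, Clone)]
pub struct ReputationEntry {
    /// Assumption: violation code stored as a plain integer.
    pub violation: u16,
    /// `issued_at + ttl_secs`, in Unix epoch seconds.
    pub expires_at: u64,
}

#[derive(Default)]
pub struct ReputationStore {
    entries: HashMap<(String, String), ReputationEntry>,
}

impl ReputationStore {
    /// Insert or overwrite the entry for this (fingerprint, origin relay) pair.
    pub fn upsert(&mut self, fingerprint: &str, origin_relay: &str, entry: ReputationEntry) {
        self.entries
            .insert((fingerprint.to_owned(), origin_relay.to_owned()), entry);
    }

    /// Drop every expired entry and return how many were removed.
    pub fn evict_expired(&mut self, now: u64) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, e| e.expires_at > now);
        before - self.entries.len()
    }

    /// True if any origin relay holds an unexpired entry for the fingerprint.
    /// This is a linear scan; a real store would likely add a per-fingerprint index.
    pub fn is_blocked(&self, fingerprint: &str, now: u64) -> bool {
        self.entries
            .iter()
            .any(|((fp, _), e)| fp.as_str() == fingerprint && e.expires_at > now)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Keying by origin relay means reports from different relays never overwrite each other, which also makes it straightforward to drop every entry from a relay whose pubkey has been blacklisted.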

### Partition tolerance

- **Partitioned relay A** blocks fingerprint F locally but cannot push to peers. When the partition heals, A floods the backlog. Peers apply events that have not yet expired (`issued_at + ttl_secs` is still in the future); expired backlog is ignored.
- **Partitioned relay B** never hears about F's block. F can abuse B freely during the partition. This is acceptable for a best-effort gossip system.

### Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Compromised relay floods false blocks | Any fingerprint can be blocked mesh-wide | Manual pubkey blacklist; no automatic mitigation in this design |
| Buggy relay false-positives a popular fingerprint | Legitimate users blocked everywhere | Operator contact + pubkey downgrade |
| Network partition | Split-brain block lists | Acceptable; partition healing replays backlog |
| Clock skew | Events from future/past rejected or mis-ordered | Use `issued_at` with ±5 min tolerance; NTP assumed |
| Replay attack | Old event re-broadcast after TTL | Signature binds `issued_at`; verify TTL at receipt |
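
The clock-skew and replay rows both come down to a freshness check at receipt. A minimal sketch, assuming the ±5 min tolerance is applied only to events that claim to come from the future and that expiry is computed as `issued_at + ttl_secs`; the `Freshness` enum and `check_freshness` function are hypothetical names.

```rust
/// Tolerance for events whose `issued_at` is ahead of the local clock (5 min).
pub const MAX_FUTURE_SKEW_SECS: u64 = 300;

#[derive(Debug, PartialEq, Eq)]
pub enum Freshness {
    /// Event is usable until `expires_at` (epoch secs).
    Fresh { expires_at: u64 },
    /// `issued_at` is further in the future than the skew tolerance allows.
    FromFuture,
    /// `issued_at + ttl_secs` has already passed.
    Expired,
}

/// Decide whether a signature-verified event should be applied at time `now`.
pub fn check_freshness(issued_at: u64, ttl_secs: u32, now: u64) -> Freshness {
    if issued_at > now.saturating_add(MAX_FUTURE_SKEW_SECS) {
        return Freshness::FromFuture;
    }
    let expires_at = issued_at.saturating_add(u64::from(ttl_secs));
    if expires_at <= now {
        return Freshness::Expired;
    }
    Freshness::Fresh { expires_at }
}
```

This check only bounds how long a captured event stays replayable; it does not stop a replay inside the TTL window, which is harmless only if applying the same event twice has no additional effect.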

### Complexity

- **Low-medium:** Reuses existing federation broadcast infrastructure. Adds one `SignalMessage` variant, Ed25519 signing/verification, and an in-memory TTL map.

---

## Approach 2: Pull Gossip (Reputation Oracle)

### Summary
One relay in the mesh is designated the **reputation oracle** (or a small quorum of 3). Other relays query the oracle periodically (e.g., every 60 s) for the current block list. The oracle maintains the authoritative reputation state.

### Wire format

```rust
// Pull request
ReputationQuery {
    version: u8,
    /// Last checkpoint the requester has seen (opaque cursor).
    since_cursor: Option<String>,
}

// Pull response
ReputationSnapshot {
    version: u8,
    /// Opaque cursor for delta pagination.
    cursor: String,
    /// List of active blocks at the oracle.
    blocks: Vec<ReputationBlock>,
    /// Oracle's Ed25519 signature over the serialized snapshot.
    signature: [u8; 64],
}

struct ReputationBlock {
    fingerprint: String,
    violation: ViolationCode,
    issued_at: u64,
    ttl_secs: u32,
    /// Which relay originally reported this (for audit).
    reported_by: String,
}
```

**What is signed?** The entire `ReputationSnapshot` serialized canonically. The oracle is the sole signer.

**Oracle selection:** Config-based. Each relay's config names its oracle(s):
```toml
[reputation]
oracle = "https://relay-oracle.example.com"
oracle_pubkey = "AA:BB:CC:..."
```

### Sybil resistance

- **Centralized trust:** The oracle is a single point of trust. If the oracle is honest, no rogue relay can poison the mesh.
- **Oracle compromise:** A compromised oracle can block or unblock any fingerprint across all querying relays. This is a **catastrophic** failure mode.
- **Quorum variant:** 3 oracles with 2-of-3 agreement. A querying relay applies a block only when at least two oracle snapshots contain it (a strict three-way intersection would amount to 3-of-3). More resilient, but adds latency and complexity.
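
One reading of the 2-of-3 variant is that a fingerprint is blocked only when at least two oracle snapshots list it. The sketch below implements that reading as a simple vote count over fingerprint lists; the `quorum_blocks` function is hypothetical and assumes each snapshot's signature was verified beforehand.

```rust
use std::collections::{BTreeSet, HashMap};

/// Return the fingerprints that appear in at least `threshold` snapshots.
/// Each inner `Vec` is the list of blocked fingerprints from one oracle.
pub fn quorum_blocks(snapshots: &[Vec<String>], threshold: usize) -> BTreeSet<String> {
    let mut votes: HashMap<&str, usize> = HashMap::new();
    for snapshot in snapshots {
        // De-duplicate within a snapshot so one oracle gets one vote per fingerprint.
        let unique: BTreeSet<&str> = snapshot.iter().map(String::as_str).collect();
        for fp in unique {
            *votes.entry(fp).or_insert(0) += 1;
        }
    }
    votes
        .into_iter()
        .filter(|&(_, n)| n >= threshold)
        .map(|(fp, _)| fp.to_owned())
        .collect()
}
```

Calling it with `threshold = 3` gives the strict intersection, and with `threshold = 1` the union, so the same helper covers the stricter and looser interpretations.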

### Convergence model

- **Bounded staleness:** Worst-case = query interval (60 s) + network RTT.
- **Strong consistency within staleness bound:** All querying relays see the same oracle state (modulo query timing skew).
- **No multi-hop gossip:** Direct query/response only.

### Storage

- **Oracle side:** In-memory + optional persistence (SQLite or flat file). Oracle restarts reload from disk.
- **Querying relays:** In-memory cache of the last snapshot. No local state between restarts.
- **Memory bound:** Same as Approach 1 (~1 MB for 10k entries).

### Partition tolerance

- **Partitioned querying relay:** Falls back to local-only blocking (current behavior). When partition heals, next query re-syncs.
- **Partitioned oracle:** All querying relays lose cross-relay reputation and fall back to local-only. An oracle restart without persistence has the same effect.
- **No split-brain:** Either you have the oracle snapshot or you don't. No conflicting states.

### Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Oracle compromise | Global block/unblock of any fingerprint | 2-of-3 quorum; operator out-of-band alert |
| Oracle downtime | All querying relays lose cross-relay reputation | Fallback to local-only; no amplification |
| Rogue relay reports to oracle | Oracle may incorporate false positives | Oracle applies local thresholding (e.g., require 2 independent reports) |
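| Query amplification | N relays each polling every 60 s gives N/60 queries per second (well under 1 qps for tens of relays); load grows linearly with mesh size | Oracle caches; responses are cheap |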
| Man-in-the-middle on query | Attacker injects fake snapshot | TLS + signature verification on response |

### Complexity

- **Medium:** Adds an HTTP/QUIC query endpoint, snapshot pagination, oracle selection via config, and a new failure mode (oracle as SPOF).
- **Operational burden:** Someone must run the oracle. Small federations may not want this.

---

## Approach 3: No Gossip — Explicit Ban-List Distribution

### Summary
Relays do **not** gossip reputation automatically. Instead, a human admin (or an automated audit pipeline) curates an authoritative ban list, signs it, and pushes it to all relays. Relays apply the list verbatim.

### Wire format

```rust
// Signed ban list (distributed out-of-band: HTTPS, SCP, S3, etc.)
BanList {
    version: u8,
    /// Issued at (Unix epoch seconds).
    issued_at: u64,
    /// Expires at (Unix epoch seconds). After this, the list is ignored.
    expires_at: u64,
    /// Entries.
    entries: Vec<BanEntry>,
    /// Admin Ed25519 signature over canonical serialization.
    signature: [u8; 64],
}

struct BanEntry {
    fingerprint: String,
    /// Human-readable reason (not machine-parsed).
    reason: String,
    /// Optional: which relay originally reported.
    source_relay: Option<String>,
}
```

**What is signed?** The entire `BanList`. The admin (not a relay) is the signer.

**Distribution:** Out-of-band from the federation mesh. Could be:
- Admin `scp`s JSON to each relay's config directory
- Relays poll an HTTPS URL every 5 min
- Shared object storage (S3, GCS)

**Key distribution:** Admin pubkey is baked into each relay's config at provisioning time:
```toml
[ban_list]
admin_pubkey = "AA:BB:CC:..."
url = "https://ops.example.com/banlist.json"
refresh_secs = 300
```

### Sybil resistance

- **Strong:** Only the admin can produce a valid ban list. No relay can poison another relay.
- **Admin compromise:** Catastrophic — attacker controls the global ban list. But this is a standard operational threat model (same as DNS root, certificate transparency log, etc.).
- **No relay-to-relay trust required:** Relays don't need to trust each other's calibration or behaviour.

### Convergence model

- **Poll-based bounded staleness:** Worst-case = `refresh_secs` (default 300 s = 5 min).
- **Strong consistency:** All relays that successfully fetch the list see identical state.
- **No event propagation:** No flood, no multi-hop, no deduplication needed.

### Storage

- **On-disk cache:** Each relay stores the latest fetched ban list to survive restart.
- **In-memory lookup:** `HashSet<fingerprint>` for O(1) block checks.
- **Memory bound:** Same as other approaches.

### Partition tolerance

- **Partitioned relay:** Continues using its last cached ban list until `expires_at`. After expiry, falls back to local-only blocking.
- **No split-brain:** Either you have the signed list or you don't.
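
The cached-list behaviour can be sketched as follows: a relay consults the ban list only while it is unexpired, and always consults its own local blocks. The `CachedBanList` type and the free function `is_blocked` are hypothetical, and the sketch assumes the admin signature was verified before the list was cached.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory view of a verified `BanList`.
pub struct CachedBanList {
    pub issued_at: u64,
    pub expires_at: u64,
    fingerprints: HashSet<String>,
}

impl CachedBanList {
    /// Build the lookup set from the fingerprints of the list's entries.
    pub fn new(issued_at: u64, expires_at: u64, entries: impl IntoIterator<Item = String>) -> Self {
        Self {
            issued_at,
            expires_at,
            fingerprints: entries.into_iter().collect(),
        }
    }

    /// The list is ignored once `expires_at` has been reached.
    pub fn is_live(&self, now: u64) -> bool {
        now < self.expires_at
    }

    /// O(1) membership check, valid only while the list is live.
    pub fn bans(&self, fingerprint: &str, now: u64) -> bool {
        self.is_live(now) && self.fingerprints.contains(fingerprint)
    }
}

/// Combined check: local blocks always apply; the ban list applies while live.
pub fn is_blocked(
    cached: Option<&CachedBanList>,
    local_blocks: &HashSet<String>,
    fingerprint: &str,
    now: u64,
) -> bool {
    local_blocks.contains(fingerprint) || cached.map_or(false, |l| l.bans(fingerprint, now))
}
```

Passing `None` models a relay that has never fetched a list, which behaves exactly like a relay whose cached list has expired.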

### Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Admin key compromise | Global block/unblock of any fingerprint | Key rotation + out-of-band alert |
| Distribution channel down | Relays use stale cached list until expiry | Short expiry + monitoring |
| Admin false-positives a popular fingerprint | Global block of legitimate users | Human review process; short expiry allows quick recovery |
| Relay never fetches list | Local-only blocking only | Monitoring alert on relay ops dashboard |
| List too large | Fetch latency, memory bloat | Pagination; but 10k entries = ~500 KB JSON, trivial |

### Complexity

- **Low:** No changes to federation mesh at all. Adds an HTTP polling loop, a file watcher, or an S3 sync. The signing/verification is trivial (one Ed25519 verify on fetch).
- **Operational burden:** Requires a human or automated pipeline to maintain the ban list. This is acceptable for small federations (the WZP target audience) but does not scale to open meshes.

---

## Comparative Summary

| Dimension | Approach 1: Push Gossip | Approach 2: Pull Oracle | Approach 3: Ban-List Distribution |
|---|---|---|---|
| **Trust model** | Every relay trusts every other relay's verdict equally | Trusts a designated oracle (or 2-of-3 quorum) | Trusts a single admin key |
| **Sybil resistance** | Weak — one rogue relay can poison the mesh | Medium-strong — oracle is gatekeeper | Strong — only admin can sign |
| **Convergence** | Eventual; multi-hop flood | Bounded (query interval); direct pull | Bounded (poll interval); out-of-band |
| **Partition tolerance** | Acceptable (backlog replay on heal) | Acceptable (fallback to local) | Good (cached list + expiry) |
| **False-positive blast radius** | Mesh-wide from one relay | Mesh-wide from oracle | Mesh-wide from admin |
| **Operational burden** | Low — fully automatic | Medium — must run oracle | Medium — must curate list |
| **Federation code changes** | Medium — broadcast loop, dedup, signatures | Medium — query endpoint, snapshot pagination | Low — out-of-band, no mesh changes |
| **Scaling** | Poor — flood doesn't scale past ~50 relays | Good — O(N) queries, oracle is O(1) | Good — O(N) fetches, no mesh load |
| **Audit trail** | Good — every event attributed to origin relay | Good — oracle logs all reports | Good — list is a snapshot |
| **Rollback / correction** | Hard — events spread everywhere; need counter-events | Easy — oracle updates snapshot | Easy — admin publishes new list |

## Open Questions (Blockers for Implementation)

1. **Trust model:** Do we trust all peer relays equally (Approach 1), trust a designated oracle (Approach 2), or trust a human admin (Approach 3)? This is a product/operations decision, not a technical one.
2. **Key infrastructure:** The federation layer currently has **no message-level signing**. All three approaches require adding Ed25519 signing/verification to relay-relay messages (or out-of-band fetches). The `wzp-crypto` crate already has Ed25519 identity support (used in client handshake) — it can be reused.
3. **Fingerprint scope:** Is the block scoped to a fingerprint (Ed25519 pubkey), an IP address, or both? Current `ResponsePolicy` uses a string fingerprint. Gossip should probably scope to the cryptographic identity, not IP, to prevent NAT rebind evasion.
4. **Privacy leakage:** Gossiping "fingerprint F is abusive" leaks that F was on a call. Is this acceptable? WZP is not anonymous by design (participants know each other's fingerprints), but operators learning about calls they are not in is a privacy concern.
5. **TTL vs. persistent bans:** Should a gossiped block expire automatically (TTL), or should it require manual review for extension? Automatic expiry limits false-positive damage but allows persistent abusers to cycle.
6. **Rate limiting on gossip:** A compromised relay could flood the mesh with `ReputationEvent` messages. The existing per-room rate limiter (500 pps) doesn't apply to control messages. A per-peer rate limit on reputation messages is needed for any approach that accepts relay-originated reports (Approaches 1 and 2, including reports sent to the oracle).
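
For question 6, a per-peer token bucket is one common shape for such a limit. The sketch below is illustrative only: the `GossipLimiter` and `TokenBucket` types are hypothetical, time is passed in as seconds by the caller, and the burst size and refill rate would need to be chosen by the operators.

```rust
use std::collections::HashMap;

/// Classic token bucket: holds up to `capacity` tokens, refilled continuously.
pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill: f64,
}

impl TokenBucket {
    /// Start with a full bucket at time `now` (seconds).
    pub fn new(capacity: f64, refill_per_sec: f64, now: f64) -> Self {
        Self { capacity, refill_per_sec, tokens: capacity, last_refill: now }
    }

    /// Refill based on elapsed time, then try to spend one token.
    pub fn try_take(&mut self, now: f64) -> bool {
        let elapsed = (now - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// One bucket per peer relay, keyed by a peer identifier such as its TLS fingerprint.
pub struct GossipLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, TokenBucket>,
}

impl GossipLimiter {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if a reputation message from `peer` should be processed.
    pub fn allow(&mut self, peer: &str, now: f64) -> bool {
        let (cap, rate) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(peer.to_owned())
            .or_insert_with(|| TokenBucket::new(cap, rate, now))
            .try_take(now)
    }
}
```

Because buckets are per peer, a flooding relay exhausts only its own allowance and does not delay events from well-behaved peers.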

## Recommendation

**Do not implement any approach yet.** The blockers above (especially #1 trust model and #4 privacy) need a reviewer/operator design call. T6.3 should remain **Blocked** until then.

If forced to pick a default for a small, closed federation (the current WZP target audience), **Approach 3 (Ban-List Distribution)** has the lowest complexity, the strongest Sybil resistance, and the easiest rollback. It sacrifices automation, but small federations typically have an ops person who can review blocks anyway. It does, however, require operators to agree on a shared admin key, which sits in tension with Assumption 5 (no single administrative root of trust). Approach 1 (Push Gossip) is more elegant for larger meshes but requires solving the rogue-relay problem first (e.g., k-of-n thresholding).