wz-phone/docs/PRD/PRD-relay-federation-gossip.md
Siavash Sameni 1e729e4b1d T6.3: Design exploration for federated reputation gossip
Add docs/PRD/PRD-relay-federation-gossip.md comparing 3 approaches:
1. Push gossip — relay broadcasts RepeatAbusive verdicts to peers
2. Pull oracle — peers query a reputation oracle periodically
3. Ban-list distribution — admin signs and pushes authoritative list

For each: wire format, Sybil resistance, convergence, storage,
partition tolerance, failure modes. Open questions block implementation
(trust model, privacy leakage, key infrastructure). Move T6.3 to Blocked
pending reviewer design call.
2026-05-12 19:13:31 +04:00


Design Exploration: Federated Reputation Gossip (T6.3)

Status: Design exploration; no approach selected.
Blocked on: Reviewer design call (needs operator-trust model decision).
Scope: How should WZP relays share abuse verdicts across a federation mesh so that abusers cannot hop between relays?

Background

WZP relays are E2E-encrypted SFUs. They cannot inspect payload content, but they can observe metadata: packet rates, payload sizes, timestamps, keyframe patterns, and BWE responsiveness. Tiers A through F of the conformance pipeline observe these signals and produce a Verdict ∈ {Legitimate, Suspect, Abusive}.

Tier G (ResponsePolicy) escalates:

  • Abusive → typed Hangup + 1 h fingerprint cool-down
  • Repeat Abusive within 24 h → relay-local Block for 24 h
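The two-step escalation above can be sketched as a small state machine. This is a minimal illustration, not the actual ResponsePolicy code: the `Action` and `Escalation` names, the use of Unix seconds, and the inclusive 24 h comparison are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical action type; not the real ResponsePolicy API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Action {
    /// Typed Hangup plus a fingerprint cool-down ending at `cooldown_until`.
    Hangup { cooldown_until: u64 },
    /// Relay-local Block ending at `block_until`.
    Block { block_until: u64 },
}

const COOLDOWN_SECS: u64 = 3_600; // 1 h cool-down after a first Abusive verdict
const REPEAT_WINDOW_SECS: u64 = 86_400; // a second verdict within 24 h is a repeat
const BLOCK_SECS: u64 = 86_400; // relay-local Block lasts 24 h

/// Remembers the time of the last Abusive verdict per fingerprint (in memory only).
#[derive(Default)]
pub struct Escalation {
    last_abusive: HashMap<String, u64>,
}

impl Escalation {
    /// Record an Abusive verdict at `now` (Unix seconds) and return the action to take.
    pub fn on_abusive(&mut self, fingerprint: &str, now: u64) -> Action {
        let previous = self.last_abusive.insert(fingerprint.to_string(), now);
        match previous {
            Some(t) if now.saturating_sub(t) <= REPEAT_WINDOW_SECS => Action::Block {
                block_until: now + BLOCK_SECS,
            },
            _ => Action::Hangup {
                cooldown_until: now + COOLDOWN_SECS,
            },
        }
    }
}
```

Because the map lives only in process memory, a restart or a move to another relay resets the history, which is the gap described next.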

The gap: Block state is in-memory only and per-relay. A blocked fingerprint on relay A can reconnect to relay B (same room, same mesh) and resume abuse immediately. T6.3 explores closing this gap.

What is being gossiped? A reputation event: "fingerprint F produced violation V with verdict Abusive at time T on relay R."


Assumptions

  1. Relays trust each other at the connection level (TLS fingerprints in PeerConfig / TrustedConfig) but are not guaranteed to share the same abuse-detection thresholds or calibration.
  2. The federation mesh is small (tens of relays, not thousands).
  3. False positives happen — a legitimate user on a long lecture call can trigger Suspect or even Abusive on an aggressively-tuned relay.
  4. A compromised relay is a realistic threat model (stolen credentials, operator coercion, buggy calibration).
  5. Relays are operated by different entities — there is no single administrative root of trust.

Approach 1: Push Gossip

Summary

When a relay issues a Block action (repeat abusive), it immediately broadcasts a ReputationEvent to all peer relays via the existing federation QUIC channels. Peers incorporate the event into their local block lists.

Wire format

// New SignalMessage variant
ReputationEvent {
    version: u8,
    /// Fingerprint being reported (the alleged abuser, not the reporter).
    fingerprint: String,
    /// Which violation code triggered the block.
    violation: ViolationCode,
    /// When the block was issued (Unix epoch seconds, u64).
    issued_at: u64,
    /// TTL in seconds (default 86400 = 24 h).
    ttl_secs: u32,
    /// Relay that issued the block (TLS fingerprint hex).
    origin_relay_fp: String,
    /// Ed25519 signature over (fingerprint || violation || issued_at || ttl_secs || origin_relay_fp).
    /// The signing key is the relay's long-term identity key (reused from client handshake identity).
    signature: [u8; 64],
}

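What is signed? The canonical serialization of the payload fields (excluding the signature itself). The signature binds the event to a specific origin relay and makes issued_at and ttl_secs tamper-evident, so a receiver can reject a replayed event once issued_at + ttl_secs has passed. It does not by itself stop re-broadcast within the TTL, so receivers should treat a repeated event as idempotent.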
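One way to make the serialization canonical is to length-prefix every variable-length field, since plain concatenation of two strings is ambiguous ("ab" + "c" equals "a" + "bc"). The sketch below is an assumption-laden illustration: the domain tag, the inclusion of `version` in the signed bytes, big-endian integers, and representing ViolationCode as a `u16` are all hypothetical choices, not part of the design above. The Ed25519 signing step itself is left out.

```rust
/// Hypothetical domain-separation tag so these bytes cannot be mistaken
/// for any other signed structure.
const DOMAIN_TAG: &[u8] = b"wzp-reputation-event-v1";

/// Append a variable-length field as a 4-byte big-endian length plus the bytes.
fn put_bytes(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u32).to_be_bytes());
    out.extend_from_slice(field);
}

/// Build the byte string that the origin relay would sign.
/// `violation_code` stands in for ViolationCode (assumed to fit in a u16).
pub fn signing_input(
    version: u8,
    fingerprint: &str,
    violation_code: u16,
    issued_at: u64,
    ttl_secs: u32,
    origin_relay_fp: &str,
) -> Vec<u8> {
    let mut out = Vec::new();
    put_bytes(&mut out, DOMAIN_TAG);
    out.push(version);
    put_bytes(&mut out, fingerprint.as_bytes());
    out.extend_from_slice(&violation_code.to_be_bytes());
    out.extend_from_slice(&issued_at.to_be_bytes());
    out.extend_from_slice(&ttl_secs.to_be_bytes());
    put_bytes(&mut out, origin_relay_fp.as_bytes());
    out
}
```

Signer and verifier must both derive these bytes from the decoded fields rather than from the bytes received on the wire, so that two encodings of the same event cannot carry different signatures.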

Key distribution: Each relay's Ed25519 public key is published at a well-known endpoint (e.g., /.well-known/wzp-relay.pub) or embedded in the FederationHello handshake. Signatures are verified on receipt, before an event is applied or forwarded.

Sybil resistance

  • Signing requirement: Only relays with a known Ed25519 pubkey can produce valid events. A rogue relay must first be in the TrustedConfig to even connect.
  • Origin attribution: Every event is traceable to a specific relay. If relay R starts flooding false positives, peers can blacklist R's pubkey or reduce R's event weight to zero.
  • No aggregate thresholding: This design trusts every signed event equally. A single malicious relay can block any fingerprint across the mesh.

Mitigation option (not implemented): Require k-of-n independent relay reports before applying a cross-relay block. This introduces complexity (tracking per-fingerprint report counts, decaying counters, handling churn).
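To give a sense of the bookkeeping the k-of-n option implies, here is a minimal sketch of a per-fingerprint report tracker. `ReportTracker` is a hypothetical helper: it assumes signatures were already verified, counts at most one live report per origin relay, and lets reports decay by their own TTL. Relay churn is not handled.

```rust
use std::collections::HashMap;

/// Tracks which relays currently report each fingerprint.
pub struct ReportTracker {
    /// Number of distinct relays required before a cross-relay block applies.
    k: usize,
    /// fingerprint -> (origin relay -> expiry of that relay's latest report)
    reports: HashMap<String, HashMap<String, u64>>,
}

impl ReportTracker {
    pub fn new(k: usize) -> Self {
        Self { k, reports: HashMap::new() }
    }

    /// Record a verified report and return true if the fingerprint now has
    /// at least `k` unexpired reports from distinct relays.
    pub fn record(
        &mut self,
        fingerprint: &str,
        origin_relay: &str,
        issued_at: u64,
        ttl_secs: u32,
        now: u64,
    ) -> bool {
        let expires = issued_at.saturating_add(u64::from(ttl_secs));
        let per_relay = self.reports.entry(fingerprint.to_string()).or_default();
        // A relay that reports twice only extends its own entry; it never counts twice.
        let slot = per_relay.entry(origin_relay.to_string()).or_insert(0);
        *slot = (*slot).max(expires);
        // Drop reports whose TTL has elapsed.
        per_relay.retain(|_, exp| *exp > now);
        let live = per_relay.len();
        if live == 0 {
            self.reports.remove(fingerprint);
        }
        live >= self.k
    }
}
```

With k = 2, a single rogue relay can no longer block a fingerprint alone, but two colluding or identically miscalibrated relays still can.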

Convergence model

  • Eventual consistency: Events propagate via multi-hop flood (same mechanism as GlobalRoomActive).
  • Bounded staleness: Events carry TTL. Stale events (> TTL) are ignored.
  • No ordering guarantee: Two relays may issue conflicting events (relay A blocks F, relay B clears F). Last-writer-wins based on issued_at.
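A multi-hop flood needs a stop condition so that an event is forwarded once per relay rather than forever. How GlobalRoomActive deduplicates is not described here, so the following is only a sketch under an assumed event identity of (origin relay, fingerprint, issued_at); `SeenEvents` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Remembers which reputation events this relay has already processed.
#[derive(Default)]
pub struct SeenEvents {
    /// (origin relay, fingerprint, issued_at) -> expiry time in Unix seconds
    seen: HashMap<(String, String, u64), u64>,
}

impl SeenEvents {
    /// Returns true only the first time an event identity is seen.
    /// The caller applies and forwards the event only in that case.
    pub fn first_sighting(
        &mut self,
        origin_relay: &str,
        fingerprint: &str,
        issued_at: u64,
        ttl_secs: u32,
    ) -> bool {
        let id = (origin_relay.to_string(), fingerprint.to_string(), issued_at);
        let expires = issued_at.saturating_add(u64::from(ttl_secs));
        self.seen.insert(id, expires).is_none()
    }

    /// Forget identities whose events have expired. This is safe because
    /// events past their TTL are ignored on receipt anyway.
    pub fn prune(&mut self, now: u64) {
        self.seen.retain(|_, expires| *expires > now);
    }

    pub fn len(&self) -> usize {
        self.seen.len()
    }
}
```

Pruning by expiry keeps the seen-set bounded by the number of live events rather than by the relay's uptime.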

Storage

  • In-memory only: HashMap<(fingerprint, origin_relay), ReputationEntry> with TTL-based eviction.
  • No persistence: Restarting a relay loses all gossiped state. Re-blocking requires the origin relay to re-gossip or the abuser to re-offend.
  • Memory bound: ~100 bytes per entry × 10k entries = ~1 MB. Trivial.
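A minimal sketch of the in-memory map described above, keyed by (fingerprint, origin_relay) with TTL-based eviction. The `ReputationEntry` fields and the method names are assumptions for illustration; for the same key, the entry with the later issued_at wins, matching the last-writer-wins rule.

```rust
use std::collections::HashMap;

/// Hypothetical entry; a real one would likely also carry the violation code.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReputationEntry {
    pub issued_at: u64,
    pub ttl_secs: u32,
}

impl ReputationEntry {
    fn expires_at(&self) -> u64 {
        self.issued_at.saturating_add(u64::from(self.ttl_secs))
    }
}

#[derive(Default)]
pub struct ReputationStore {
    /// (fingerprint, origin_relay) -> entry
    entries: HashMap<(String, String), ReputationEntry>,
}

impl ReputationStore {
    /// Insert an entry, keeping the one with the later `issued_at` per key.
    pub fn upsert(&mut self, fingerprint: &str, origin_relay: &str, entry: ReputationEntry) {
        let key = (fingerprint.to_string(), origin_relay.to_string());
        let is_newer = self
            .entries
            .get(&key)
            .map_or(true, |old| entry.issued_at > old.issued_at);
        if is_newer {
            self.entries.insert(key, entry);
        }
    }

    /// Remove expired entries and return how many were dropped.
    pub fn evict_expired(&mut self, now: u64) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, e| e.expires_at() > now);
        before - self.entries.len()
    }

    /// True if any relay has an unexpired block for this fingerprint.
    /// Linear scan for brevity; a per-fingerprint index would avoid it.
    pub fn is_blocked(&self, fingerprint: &str, now: u64) -> bool {
        self.entries
            .iter()
            .any(|((fp, _), e)| fp.as_str() == fingerprint && e.expires_at() > now)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Calling `evict_expired` from a periodic timer keeps the map within the size estimate above even if lookups are rare.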

Partition tolerance

  • Partitioned relay A blocks fingerprint F locally but cannot push to peers. When partition heals, A floods the backlog. Peers apply events with issued_at within TTL; expired backlog is ignored.
  • Partitioned relay B never hears about F's block. F can abuse B freely during the partition. This is acceptable for a best-effort gossip system.

Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Compromised relay floods false blocks | All fingerprints blocked mesh-wide | Manual pubkey blacklist; no automatic mitigation in this design |
| Buggy relay false-positives a popular fingerprint | Legitimate users blocked everywhere | Operator contact + pubkey downgrade |
| Network partition | Split-brain block lists | Acceptable; partition healing replays backlog |
| Clock skew | Events from future/past rejected or mis-ordered | Use issued_at with ±5 min tolerance; NTP assumed |
| Replay attack | Old event re-broadcast after TTL | Signature binds issued_at; verify TTL at receipt |
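The clock-skew and replay rows both come down to a freshness check on receipt, run after signature verification. A minimal sketch follows; it assumes the 5 min tolerance is applied only to future-dated events and that expiry is enforced strictly, which is one possible reading of "±5 min". `check_freshness` and `Reject` are hypothetical names.

```rust
/// Maximum amount an event may be dated ahead of the local clock (5 min).
pub const MAX_SKEW_SECS: u64 = 300;

#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    /// `issued_at` is further in the future than the skew tolerance allows.
    FromFuture,
    /// `issued_at + ttl_secs` has already passed.
    Expired,
}

/// Decide whether a signature-verified event is still fresh at local time `now`.
pub fn check_freshness(issued_at: u64, ttl_secs: u32, now: u64) -> Result<(), Reject> {
    if issued_at > now.saturating_add(MAX_SKEW_SECS) {
        return Err(Reject::FromFuture);
    }
    let expires_at = issued_at.saturating_add(u64::from(ttl_secs));
    if now >= expires_at {
        return Err(Reject::Expired);
    }
    Ok(())
}
```

Rejecting future-dated events matters for last-writer-wins ordering: without the bound, a relay with a fast clock would always win conflicts.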

Complexity

  • Low-medium: Reuses existing federation broadcast infrastructure. Adds one SignalMessage variant, Ed25519 signing/verification, and an in-memory TTL map.

Approach 2: Pull Gossip (Reputation Oracle)

Summary

One relay in the mesh is designated the reputation oracle (or a small quorum of 3). Other relays query the oracle periodically (e.g., every 60 s) for the current block list. The oracle maintains the authoritative reputation state.

Wire format

// Pull request
ReputationQuery {
    version: u8,
    /// Last checkpoint the requester has seen (opaque cursor).
    since_cursor: Option<String>,
}

// Pull response
ReputationSnapshot {
    version: u8,
    /// Opaque cursor for delta pagination.
    cursor: String,
    /// List of active blocks at the oracle.
    blocks: Vec<ReputationBlock>,
    /// Oracle's Ed25519 signature over the serialized snapshot.
    signature: [u8; 64],
}

struct ReputationBlock {
    fingerprint: String,
    violation: ViolationCode,
    issued_at: u64,
    ttl_secs: u32,
    /// Which relay originally reported this (for audit).
    reported_by: String,
}

What is signed? The entire ReputationSnapshot serialized canonically. The oracle is the sole signer.

Oracle selection: Config-based. Each relay's config names its oracle(s):

[reputation]
oracle = "https://relay-oracle.example.com"
oracle_pubkey = "AA:BB:CC:..."

Sybil resistance

  • Centralized trust: The oracle is a single point of trust. If the oracle is honest, no rogue relay can poison the mesh.
  • Oracle compromise: A compromised oracle can block or unblock any fingerprint across all querying relays. This is a catastrophic failure mode.
  • Quorum variant: 3 oracles with 2-of-3 signing. A querying relay applies only the blocks on which at least two oracles agree (the intersection of their block lists). More resilient but adds latency and complexity.
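The quorum variant can be sketched as a threshold merge over per-oracle snapshots. This assumes each snapshot's signature was already verified and reads "2-of-3" as "listed by at least two oracles", which is an interpretation rather than something the design fixes; `quorum_blocks` is a hypothetical helper.

```rust
use std::collections::{BTreeSet, HashMap};

/// Return the fingerprints that appear in at least `threshold` snapshots.
/// Each inner Vec is the block list from one oracle.
pub fn quorum_blocks(snapshots: &[Vec<String>], threshold: usize) -> BTreeSet<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for snapshot in snapshots {
        // Deduplicate within a snapshot so one oracle cannot vote twice.
        let unique: BTreeSet<&str> = snapshot.iter().map(String::as_str).collect();
        for fp in unique {
            *counts.entry(fp).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n >= threshold)
        .map(|(fp, _)| fp.to_string())
        .collect()
}
```

Setting `threshold` equal to the number of oracles gives the strict intersection, which is safer against a compromised oracle adding blocks but lets any single oracle veto one.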

Convergence model

  • Bounded staleness: Worst-case = query interval (60 s) + network RTT.
  • Strong consistency within staleness bound: All querying relays see the same oracle state (modulo query timing skew).
  • No multi-hop gossip: Direct query/response only.

Storage

  • Oracle side: In-memory + optional persistence (SQLite or flat file). Oracle restarts reload from disk.
  • Querying relays: In-memory cache of the last snapshot. No local state between restarts.
  • Memory bound: Same as Approach 1 (~1 MB for 10k entries).

Partition tolerance

  • Partitioned querying relay: Falls back to local-only blocking (current behavior). When partition heals, next query re-syncs.
  • Partitioned oracle: All querying relays lose cross-relay reputation and fall back to local-only. Oracle restart without persistence = same.
  • No split-brain: Either you have the oracle snapshot or you don't. No conflicting states.

Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Oracle compromise | Global block/unblock of any fingerprint | 2-of-3 quorum; operator out-of-band alert |
| Oracle downtime | All querying relays lose cross-relay reputation | Fallback to local-only; no amplification |
| Rogue relay reports to oracle | Oracle may incorporate false positives | Oracle applies local thresholding (e.g., require 2 independent reports) |
| Query amplification | N relays polling every 60 s is N/60 queries per second at the oracle | Oracle caches; responses are cheap |
| Man-in-the-middle on query | Attacker injects fake snapshot | TLS + signature verification on response |

Complexity

  • Medium: Adds an HTTP/QUIC query endpoint, snapshot pagination, oracle election/config, and a new failure mode (oracle as SPOF).
  • Operational burden: Someone must run the oracle. Small federations may not want this.

Approach 3: No Gossip — Explicit Ban-List Distribution

Summary

Relays do not gossip reputation automatically. Instead, a human admin (or an automated audit pipeline) curates an authoritative ban list, signs it, and pushes it to all relays. Relays apply the list verbatim.

Wire format

// Signed ban list (distributed out-of-band: HTTPS, SCP, S3, etc.)
BanList {
    version: u8,
    /// Issued at (Unix epoch seconds).
    issued_at: u64,
    /// Expires at (Unix epoch seconds). After this, the list is ignored.
    expires_at: u64,
    /// Entries.
    entries: Vec<BanEntry>,
    /// Admin Ed25519 signature over canonical serialization.
    signature: [u8; 64],
}

struct BanEntry {
    fingerprint: String,
    /// Human-readable reason (not machine-parsed).
    reason: String,
    /// Optional: which relay originally reported.
    source_relay: Option<String>,
}

What is signed? The entire BanList. The admin (not a relay) is the signer.

Distribution: Out-of-band from the federation mesh. Could be:

  • Admin scps JSON to each relay's config directory
  • Relays poll an HTTPS URL every 5 min
  • Shared object storage (S3, GCS)

Key distribution: Admin pubkey is baked into each relay's config at provisioning time:

[ban_list]
admin_pubkey = "AA:BB:CC:..."
url = "https://ops.example.com/banlist.json"
refresh_secs = 300

Sybil resistance

  • Strong: Only the admin can produce a valid ban list. No relay can poison another relay.
  • Admin compromise: Catastrophic — attacker controls the global ban list. But this is a standard operational threat model (same as DNS root, certificate transparency log, etc.).
  • No relay-to-relay trust required: Relays don't need to trust each other's calibration or behaviour.

Convergence model

  • Poll-based bounded staleness: Worst-case = refresh_secs (default 300 s = 5 min).
  • Strong consistency: All relays that successfully fetch the list see identical state.
  • No event propagation: No flood, no multi-hop, no deduplication needed.

Storage

  • On-disk cache: Each relay stores the latest fetched ban list to survive restart.
  • In-memory lookup: HashSet<fingerprint> for O(1) block checks.
  • Memory bound: Same as other approaches.

Partition tolerance

  • Partitioned relay: Continues using its last cached ban list until expires_at. After expiry, falls back to local-only blocking.
  • No split-brain: Either you have the signed list or you don't.
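The cache-then-fall-back behaviour can be sketched as follows. `CachedBanList` and `BlockPolicy` are hypothetical types; the sketch assumes the list's signature was verified before `install` is called, and it adds one rule that the design does not state: a fetched list with an older issued_at than the cached one is refused, so a stale copy cannot roll the relay back. Local blocks are shown without their own TTL for brevity.

```rust
use std::collections::HashSet;

/// A signature-verified ban list held in memory (and mirrored on disk).
pub struct CachedBanList {
    pub issued_at: u64,
    pub expires_at: u64,
    pub fingerprints: HashSet<String>,
}

#[derive(Default)]
pub struct BlockPolicy {
    banlist: Option<CachedBanList>,
    /// Relay-local blocks (current behaviour), kept independently of the list.
    local_blocks: HashSet<String>,
}

impl BlockPolicy {
    /// Install a fetched list. Returns false if it is already expired or
    /// not newer than the cached one (assumed rollback protection).
    pub fn install(&mut self, list: CachedBanList, now: u64) -> bool {
        if list.expires_at <= now {
            return false;
        }
        if let Some(current) = &self.banlist {
            if list.issued_at <= current.issued_at {
                return false;
            }
        }
        self.banlist = Some(list);
        true
    }

    pub fn block_locally(&mut self, fingerprint: &str) {
        self.local_blocks.insert(fingerprint.to_string());
    }

    /// Local blocks always apply; the ban list applies only until `expires_at`.
    pub fn is_blocked(&self, fingerprint: &str, now: u64) -> bool {
        if self.local_blocks.contains(fingerprint) {
            return true;
        }
        match &self.banlist {
            Some(list) if now < list.expires_at => list.fingerprints.contains(fingerprint),
            _ => false,
        }
    }
}
```

Checking expiry at lookup time rather than at fetch time means a partitioned relay stops enforcing the list at expires_at without any timer.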

Failure modes

| Scenario | Impact | Mitigation |
|---|---|---|
| Admin key compromise | Global block/unblock of any fingerprint | Key rotation + out-of-band alert |
| Distribution channel down | Relays use stale cached list until expiry | Short expiry + monitoring |
| Admin false-positives a popular fingerprint | Global block of legitimate users | Human review process; short expiry allows quick recovery |
| Relay never fetches list | Local-only blocking only | Monitoring alert on relay ops dashboard |
| List too large | Fetch latency, memory bloat | Pagination; but 10k entries = ~500 KB JSON, trivial |

Complexity

  • Low: No changes to federation mesh at all. Adds an HTTP polling loop, a file watcher, or an S3 sync. The signing/verification is trivial (one Ed25519 verify on fetch).
  • Operational burden: Requires a human or automated pipeline to maintain the ban list. This is acceptable for small federations (the WZP target audience) but does not scale to open meshes.

Comparative Summary

| Dimension | Approach 1: Push Gossip | Approach 2: Pull Oracle | Approach 3: Ban-List Distribution |
|---|---|---|---|
| Trust model | Every relay trusts every other relay's verdict equally | Trusts a designated oracle (or 2-of-3 quorum) | Trusts a single admin key |
| Sybil resistance | Weak: one rogue relay can poison the mesh | Medium-strong: oracle is gatekeeper | Strong: only admin can sign |
| Convergence | Eventual; multi-hop flood | Bounded (query interval); direct pull | Bounded (poll interval); out-of-band |
| Partition tolerance | Acceptable (backlog replay on heal) | Acceptable (fallback to local) | Good (cached list + expiry) |
| False-positive blast radius | Mesh-wide from one relay | Mesh-wide from oracle | Mesh-wide from admin |
| Operational burden | Low: fully automatic | Medium: must run oracle | Medium: must curate list |
| Federation code changes | Medium: broadcast loop, dedup, signatures | Medium: query endpoint, snapshot pagination | Low: out-of-band, no mesh changes |
| Scaling | Poor: flood doesn't scale past ~50 relays | Good: O(N) queries, oracle is O(1) | Good: O(N) fetches, no mesh load |
| Audit trail | Good: every event attributed to origin relay | Good: oracle logs all reports | Good: list is a snapshot |
| Rollback / correction | Hard: events spread everywhere; need counter-events | Easy: oracle updates snapshot | Easy: admin publishes new list |

Open Questions (Blockers for Implementation)

  1. Trust model: Do we trust all peer relays equally (Approach 1), trust a designated oracle (Approach 2), or trust a human admin (Approach 3)? This is a product/operations decision, not a technical one.
  2. Key infrastructure: The federation layer currently has no message-level signing. All three approaches require adding Ed25519 signing/verification to relay-relay messages (or out-of-band fetches). The wzp-crypto crate already has Ed25519 identity support (used in client handshake) — it can be reused.
  3. Fingerprint scope: Is the block scoped to a fingerprint (Ed25519 pubkey), an IP address, or both? Current ResponsePolicy uses a string fingerprint. Gossip should probably scope to the cryptographic identity, not IP, to prevent NAT rebind evasion.
  4. Privacy leakage: Gossiping "fingerprint F is abusive" leaks that F was on a call. Is this acceptable? WZP is not anonymous by design (participants know each other's fingerprints), but operators learning about calls they are not in is a privacy concern.
  5. TTL vs. persistent bans: Should a gossiped block expire automatically (TTL), or should it require manual review for extension? Automatic expiry limits false-positive damage but allows persistent abusers to cycle.
  6. Rate limiting on gossip: A compromised relay could flood the mesh with ReputationEvent messages. The existing per-room rate limiter (500 pps) doesn't apply to control messages. A per-peer gossip rate limit is needed regardless of approach.
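For open question 6, a per-peer limit could be as simple as a token bucket keyed by peer identity. The sketch below is an illustration only: `GossipLimiter`, its whole-second granularity, and the burst and refill parameters are assumptions, and no particular limit is proposed.

```rust
use std::collections::HashMap;

/// Per-peer token bucket for reputation control messages.
pub struct GossipLimiter {
    /// Maximum tokens a peer can accumulate.
    burst: u32,
    /// Tokens added per elapsed second.
    refill_per_sec: u32,
    /// peer id -> (tokens left, time of last refill in Unix seconds)
    buckets: HashMap<String, (u32, u64)>,
}

impl GossipLimiter {
    pub fn new(burst: u32, refill_per_sec: u32) -> Self {
        Self { burst, refill_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if a message from `peer` at time `now` is within the limit.
    pub fn allow(&mut self, peer: &str, now: u64) -> bool {
        let burst = self.burst;
        let rate = self.refill_per_sec;
        let (tokens, last) = self
            .buckets
            .entry(peer.to_string())
            .or_insert((burst, now));
        // Refill for the time elapsed since the last call, capped at `burst`.
        let elapsed = now.saturating_sub(*last);
        let refill = elapsed.saturating_mul(u64::from(rate)).min(u64::from(burst));
        *tokens = (u64::from(*tokens) + refill).min(u64::from(burst)) as u32;
        // Never move the refill clock backwards if the wall clock jumps back.
        *last = (*last).max(now);
        if *tokens == 0 {
            return false;
        }
        *tokens -= 1;
        true
    }
}
```

Keying the bucket by the peer's connection identity, not by the origin_relay_fp inside the event, limits the relay that is actually sending, including when it forwards events from others.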

Recommendation

Do not implement any approach yet. The blockers above (especially #1 trust model and #4 privacy) need a reviewer/operator design call. T6.3 should remain Blocked until then.

If forced to pick a default for a small, closed federation (the current WZP target audience), Approach 3 (Ban-List Distribution) has the lowest complexity, the strongest Sybil resistance, and the easiest rollback. It sacrifices automation but small federations typically have an ops person who can review blocks anyway. Approach 1 (Push Gossip) is more elegant for larger meshes but requires solving the rogue-relay problem first (e.g., k-of-n thresholding).