---
title: Task 5.9 QA, Rollout, Analytics, and Launch Runbooks
tags: [taskmaster, telegram, qa, rollout, analytics, runbook]
created: 2026-05-24
status: draft
---
# Task 5.9 QA, Rollout, Analytics, and Launch Runbooks
Source: `/.taskmaster/docs/prd-telegram-native-app-bot-wallet.md`
## 1) QA scope for launch readiness
### 1.1 Client matrix (required)
- Telegram iOS
- Telegram Android
- Telegram Desktop
- Telegram Web
- Light / dark themes
- Compact / fullscreen modes
- Normal and slow network
- Blocked bot scenario
- Expired / stale session scenario
- Payment cancellation and abort
- Unlinked user and re-link path
### 1.2 Functional QA checklist
1. Identity and linking
2. Request listing/detail in both bot and Mini App
3. Offer review flow
4. Payment initiation and cancel path
5. Delivery evidence upload
6. Dispute open/respond and status progression
7. Notification quiet/error state
8. Error and blocked-bot behavior
9. Support escalation handoff
### 1.3 Security/abuse QA
- forged/invalid `initData` rejection
- callback replayed twice: one success, one no-op
- deep-link tampering
- wallet proof mismatch
- callback processing under invalid provider secrets
- admin override behavior and audit event capture
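
The forged/invalid `initData` case can be exercised against a validator like the sketch below. It follows Telegram's documented Mini App check (HMAC-SHA256 over the sorted fields, keyed from the bot token with the literal `WebAppData`). The function name `is_valid_init_data` is hypothetical, and the sketch does not check `auth_date` freshness, which the stale-session scenario would additionally need.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl


def is_valid_init_data(init_data: str, bot_token: str) -> bool:
    """Return True only if the initData hash matches the given bot token."""
    pairs = dict(parse_qsl(init_data, keep_blank_values=True))
    received_hash = pairs.pop("hash", None)
    if not received_hash:
        return False  # a missing hash is treated as forged
    # Data-check string: remaining fields sorted by key, one "key=value" per line.
    data_check_string = "\n".join(f"{k}={v}" for k, v in sorted(pairs.items()))
    # Secret key: HMAC-SHA256 of the bot token, keyed with the literal "WebAppData".
    secret_key = hmac.new(b"WebAppData", bot_token.encode(), hashlib.sha256).digest()
    expected = hmac.new(secret_key, data_check_string.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    return hmac.compare_digest(expected.encode(), received_hash.encode())
```

A QA case would pass a correctly signed string, then the same string with a different token or an altered field, and expect rejection for the latter two.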
## 2) Environments and rollout
### 2.1 Environment separation
- `telegram-dev-bot` and `telegram-prod-bot` tokens and webhook endpoints must be distinct.
- No shared webhook secret between environments.
- QA and production payment fixtures remain isolated.
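
The separation rules above could be enforced with a small pre-deploy check. This is a minimal sketch: `BotEnv` and `shared_settings` are hypothetical names, and how the values are loaded (env vars, secret store) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotEnv:
    name: str
    bot_token: str
    webhook_url: str
    webhook_secret: str


def shared_settings(a: BotEnv, b: BotEnv) -> list[str]:
    """Return the settings that two environments wrongly share (empty if none)."""
    fields = ("bot_token", "webhook_url", "webhook_secret")
    return [f for f in fields if getattr(a, f) == getattr(b, f)]
```

A non-empty result for the dev/prod pair would be a reason to fail the deployment.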
### 2.2 Feature flag sequence
1. **Development flag off**: no surface exposed
2. **Internal allowlist**: selected users only (buyer/seller/admin)
3. **Beta cohort**: controlled percentage and fixed org list
4. **Production enablement**: after runbook and KPI thresholds pass
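
One way to express the four stages as a single gate is sketched below. All names (`Stage`, `surface_enabled`, `bucket`) are hypothetical. The sketch assumes the beta cohort admits allowlisted users, users in the fixed org list, and a stable hash-based percentage of everyone else, which is one reading of "controlled percentage and fixed org list".

```python
import hashlib
from enum import Enum


class Stage(Enum):
    OFF = 1
    INTERNAL_ALLOWLIST = 2
    BETA_COHORT = 3
    PRODUCTION = 4


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def surface_enabled(stage: Stage, user_id: str, org_id: str,
                    allowlist: set, beta_orgs: set, beta_percent: int) -> bool:
    """Decide whether the Telegram surface is exposed to this user."""
    if stage is Stage.OFF:
        return False
    if stage is Stage.INTERNAL_ALLOWLIST:
        return user_id in allowlist
    if stage is Stage.BETA_COHORT:
        # Assumption: allowlist and org list always pass; others pass by bucket.
        return (user_id in allowlist or org_id in beta_orgs
                or bucket(user_id) < beta_percent)
    return True  # Stage.PRODUCTION
```

Hashing the user id keeps each user in the same bucket across requests, so raising `beta_percent` only ever adds users to the cohort.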
### 2.3 Deployment safety
- If the new surface increases payment mismatches or callback failures, immediately turn off `TELEGRAM_SURFACE_ENABLED` and keep providers in read-only mode.
- Use existing rollback flow from incident operations and deployment runbooks.
## 3) Analytics and launch KPIs
Track these metrics daily for 14 days after stage advancement:
- activation rate (`activatedTelegramUsers / startedTelegramUsers`)
- link completion rate (`linkedUsers / startedLink`)
- request creation from Telegram (`telegramRequestsCreated`)
- offer response completion (`offerResponses / offersOpened`)
- payments started / completed / failed (`telegramPaymentStart`, `telegramPaymentComplete`, `telegramPaymentFail`)
- dispute activity (`disputesOpened`, `disputesResolvedInTelegram`)
- release approvals from Telegram context (`telegramReleaseApprovals`)
- notification opt-outs (`notificationsOptOutRate`)
- callback duplicate ratio (`callbackReplay / callbackTotal`)
- context resume latency (average, min, and p95)
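
The ratio metrics above can be derived from daily counters roughly as follows. This is a sketch: the counter names come from the list, while the output keys, `paymentCompletionRate` as a ratio, and the nearest-rank percentile method are assumptions.

```python
def safe_ratio(numerator: float, denominator: float):
    """Return numerator / denominator, or None when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def daily_kpis(c: dict) -> dict:
    """Compute the ratio KPIs from one day's raw counters."""
    return {
        "activationRate": safe_ratio(c["activatedTelegramUsers"], c["startedTelegramUsers"]),
        "linkCompletionRate": safe_ratio(c["linkedUsers"], c["startedLink"]),
        "offerResponseCompletion": safe_ratio(c["offerResponses"], c["offersOpened"]),
        # Assumption: completion rate is completed payments over started payments.
        "paymentCompletionRate": safe_ratio(c["telegramPaymentComplete"], c["telegramPaymentStart"]),
        "callbackDuplicateRatio": safe_ratio(c["callbackReplay"], c["callbackTotal"]),
    }
```

Returning `None` for a zero denominator keeps a quiet day from showing up as a 0% rate on the dashboard.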
### Reporting destinations
- Sentry for exception and failure spikes
- application logs for workflow events
- existing monitoring dashboards for rate/latency anomalies
## 4) Launch runbooks
All runbooks are mandatory for Stage-1 rollout and post-launch incidents.
### 4.1 Bot outage
1. Validate webhook endpoint response health.
2. Switch status to notification-only mode where possible.
3. Confirm bot token and webhook URL.
4. Re-route urgent flows to web fallback.
5. Restore Telegram webhook + replay backlog after recovery.
### 4.2 Telegram API outage
1. Confirm external Telegram API status.
2. Temporarily disable deep-link / in-app actions that require Telegram callbacks.
3. Notify users of delayed updates.
4. Keep pending payment states in read-only mode until callback channel is restored.
### 4.3 Payment provider outage
1. Identify affected provider via provider mode and provider health flags.
2. Switch to read-only or alternative provider mode where configured.
3. Run reconciliation before re-enabling full writes.
4. Track the age of stale pending payments and trigger the support contact workflow.
### 4.4 Stuck payment
1. Check payment reconciliation queue and provider status.
2. Verify callback proof and on-chain confirmation.
3. Manually reconcile if allowed by protocol and policy.
4. Escalate if stale > 24h in funded or processing state.
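
The escalation rule in step 4 could be checked as in this sketch. The state names `funded` and `processing` come from the runbook. The function name and the idea of tracking when a payment entered its current state are assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)
WATCHED_STATES = {"funded", "processing"}


def needs_escalation(state: str, state_entered_at: datetime, now: datetime) -> bool:
    """True when a payment has sat in a watched state for more than 24h."""
    return state in WATCHED_STATES and now - state_entered_at > STALE_AFTER
```

For example, a payment that entered `funded` 25 hours before `now` would be escalated, while one at exactly 24 hours would not.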
### 4.5 Duplicate callback
1. Validate idempotency path executed correctly.
2. Confirm callback dedupe key retention window.
3. Compare event fingerprint for payload divergence.
4. Mark one path as duplicate no-op and keep audit trail.
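
The steps above map onto a dedupe path like the following sketch. `CallbackDeduper`, `fingerprint`, and the outcome strings are hypothetical, and an in-memory dict stands in for whatever store holds dedupe keys in the real system.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Stable hash of a payload, independent of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


class CallbackDeduper:
    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._seen = {}      # dedupe key -> (fingerprint, first_seen_at)
        self.audit_log = []  # (dedupe key, outcome), recorded for every delivery

    def handle(self, key: str, payload: dict, now: float) -> str:
        entry = self._seen.get(key)
        if entry is not None and now - entry[1] > self.retention_seconds:
            entry = None  # key aged out of the retention window
        fp = fingerprint(payload)
        if entry is None:
            self._seen[key] = (fp, now)
            outcome = "processed"
        elif entry[0] == fp:
            outcome = "duplicate_noop"
        else:
            outcome = "payload_divergence"  # same key, different content
        self.audit_log.append((key, outcome))
        return outcome
```

A replay that arrives after the retention window is processed again in this sketch, which is why step 2 asks for the window to be confirmed.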
### 4.6 Suspicious wallet proof
1. Block automated release/refund for the request.
2. Flag payment and mark for manual ops review.
3. Verify recipient, amount, and tx hash against chain/provider data.
4. Resume only after explicit approval.
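
Step 3 amounts to a field-by-field comparison between the claimed proof and what the chain or provider reports. A minimal sketch, with hypothetical names and assuming exact string matches for recipient and transaction hash:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TransferRecord:
    recipient: str
    amount: Decimal
    tx_hash: str


def proof_mismatches(claimed: TransferRecord, observed: TransferRecord) -> list[str]:
    """Return the fields where the claimed proof disagrees with observed data."""
    fields = ("recipient", "amount", "tx_hash")
    return [f for f in fields if getattr(claimed, f) != getattr(observed, f)]
```

Any non-empty result would keep the payment blocked for manual ops review. `Decimal` is used so that amounts such as `1.50` and `1.5` compare as equal without float rounding.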
### 4.7 Compromised bot token
1. Rotate bot token immediately.
2. Disable bot endpoints and clear webhook secret for 1 hour.
3. Validate callback signatures with new secret.
4. Resume in staged rollout mode with monitoring for 24h.
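
For step 3, Telegram sends the `secret_token` configured via `setWebhook` in the `X-Telegram-Bot-Api-Secret-Token` header of each webhook request. A sketch of the check after rotation, assuming that header is the secret the runbook refers to (the function name is hypothetical):

```python
import hmac

SECRET_HEADER = "X-Telegram-Bot-Api-Secret-Token"


def has_valid_webhook_secret(headers: dict, expected_secret: str) -> bool:
    """Accept a webhook request only if it carries the current secret."""
    if not expected_secret:
        return False  # never accept requests while no secret is configured
    # Header names are case-insensitive, so match on the lowercased name.
    received = next(
        (v for k, v in headers.items() if k.lower() == SECRET_HEADER.lower()), ""
    )
    return hmac.compare_digest(received.encode(), expected_secret.encode())
```

After rotation, a request carrying the old secret should be rejected, which makes a useful smoke test before resuming the staged rollout.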
## 5) Stage exit criteria
- All required QA scenarios pass on iOS/Android/desktop/web.
- No critical webhook/payload mismatch regressions in the 24h observation window.
- No stuck payments left unresolved for more than 24h after manual triage.
- Incident owners can execute all seven runbooks.
- Rollout metrics show non-degrading trend for the first two days.
## 6) Known rollout gaps
1. Fine-grained feature toggles for Telegram in existing observability dashboards are pending.
2. Admin analytics for Telegram-originated releases are schema-dependent and need implementation wiring.
3. Deep-link recovery behavior after prolonged Telegram link expiry still needs UX polishing.