Private Beta Support Runbook¶
Use this runbook for managed BYOK private beta support. It is intentionally designed to work without asking users to expose secrets.
Support Intake¶
Collect:
- org id or org slug,
- user email or channel identity,
- surface: Cloud Web, Desktop cloud workspace, local Desktop workspace, Gateway, BYOK, billing/quota, or admin dashboard,
- session id, run id, workflow id, artifact id, or channel delivery id when available,
- timestamp and timezone,
- expected behavior and observed behavior,
- whether the issue blocks all work or one surface,
- redacted diagnostics bundle or launch-readiness report if available.
Do not request:
- provider API keys,
- OAuth refresh/access tokens,
- Desktop cloud tokens,
- Gateway service tokens,
- channel signing secrets,
- cookie secrets,
- database URLs,
- object-store signed URLs,
- full local filesystem paths unless the user intentionally shares a redacted path for a local-only issue.
Triage Matrix¶
| Symptom | First checks | Escalation |
|---|---|---|
| Cannot sign in | IdP status, org membership, signup mode, cookie settings, clock skew, audit event. | Cloud auth owner. |
| BYOK validation fails | Provider status, key fingerprint, worker role logs, quota/provider error code, redaction. | Worker/BYOK owner. |
| Desktop does not sync | Token status, workspace URL, SSE reconnects, local vs cloud workspace id, cache status. | Desktop sync owner. |
| Gateway does not reply | Gateway health/readiness, provider webhook signature, channel binding, delivery backlog, dead letters. | Gateway owner. |
| Approvals/questions stuck | Session projection pending state, command queue age, gateway interaction token, actor RBAC. | Projection/runtime owner. |
| Artifacts missing | Projection metadata, object-store read/write, signed URL redaction, restore status. | Object-store owner. |
| Quota/billing blocked | Subscription state, entitlement overlay, quota counters, expected 402/429 response. | SaaS operations owner. |
| Admin panel inconsistent | API bootstrap, role membership, audit rows, cache-control, browser console redaction. | Cloud Web owner. |
Diagnostics Workflow¶
- Ask the user for a redacted diagnostics bundle or reproduce from operator logs using org/session ids.
- Verify the bundle contains no raw BYOK, API token, OAuth token, channel secret, cookie, database URL, or signed object-store URL.
- Correlate request id, org id, session id, run id, worker id, scheduler id, and gateway delivery id.
- Check Cloud metrics first: command age, projection lag, worker heartbeats, quota rejections, BYOK failures, and HTTP error rate.
- Check Gateway metrics next: stream reconnects, aggregate and provider-labeled delivery retries/dead letters, provider status, and session-stream count.
- Check object-store and Postgres health for artifact, checkpoint, or replay issues.
- Record outcome, mitigation, owner, and follow-up test.
Support Bundle Contract¶
Support bundles should follow the public contract in Managed BYOK SaaS Boundary. They may include app/build metadata, surface, workspace kind, redacted org/session/workflow/run/ artifact/request/delivery ids, status and reason code, sanitized error summaries, timing/retry/reconnect/queue/projection/quota summaries, BYOK status metadata, billing entitlement verdicts, and Gateway delivery counts.
They must not include raw BYOK keys, OAuth tokens, Desktop or Gateway tokens, API tokens, cookie/internal/operator tokens, MCP secrets, channel bot tokens, signing secrets, database URLs, signed object-store URLs, full chat transcript bodies unless explicitly attached by the user, local file contents, unredacted home-directory paths, or customer identifiers in public evidence.
Onboarding Status And Reason Codes¶
Support, Cloud Web, Desktop, Gateway, and private operations should use the same machine-readable status and reason-code taxonomy. The canonical list is in deploy/private-beta/managed-byok-readiness-contract.template.json; common examples are:
auth.invite_requiredauth.membership_missingauth.token_expiredauth.token_revokedbyok.key_missingbyok.validation_failedbilling.subscription_inactivequota.worker_limit_exceededgateway.identity_not_allowedsupport.diagnostics_redaction_required
Do not replace these with provider-specific or customer-specific text in public artifacts. User-facing messages may be friendlier, but retain the code in logs, diagnostics, audit exports, and support records.
BYOK Issue Handling¶
For user/provider key issues:
- Confirm the provider id is enabled in the org profile and plan.
- Confirm the BYOK status endpoint shows configured or active metadata only.
- Run the bounded provider validation path from the worker role.
- Check provider-side errors without logging plaintext key material.
- If compromise is suspected, stop new worker claims for the org, ask the user to revoke the provider key with the provider, rotate cloud envelope/KMS material if needed, and follow
docs/runbooks/managed-byok-saas.md.
Incident Checklists¶
Suspected Key Exposure¶
- Stop new worker claims for the affected org or provider profile.
- Revoke or disable the affected BYOK secret metadata record.
- Ask the customer to revoke the provider key with the upstream provider.
- Rotate envelope/KMS material if ciphertext handling is in doubt.
- Review logs, diagnostics, audit exports, Desktop cache artifacts, Gateway logs, and launch reports for plaintext exposure.
- Record the incident timeline, affected orgs, mitigation, and follow-up tests in the private incident tracker.
Token Compromise¶
- Revoke the suspected Desktop, Gateway, API, or operator token.
- Verify the token cannot authenticate on the next request or SSE reconnect.
- Rotate related service credentials when scope is uncertain.
- Review audit events for unexpected org, BYOK, gateway, billing, or admin actions.
- Issue replacement scoped tokens only after the actor identity is verified.
Channel Identity Misbinding¶
- Pause the affected channel binding or headless agent.
- Verify inbound actor identity separately from gateway service-token auth.
- Check
cloud_channel_identities, session bindings, interaction tokens, and delivery cursors for the affected provider/thread. - Rebind the channel only after owner/admin confirmation.
- Audit whether approvals/questions were resolved by the wrong actor and record remediation.
Gateway Issue Handling¶
For channel issues:
- Confirm the gateway service token is valid and scoped.
- Confirm inbound actor identity resolves separately from gateway service auth.
- Confirm webhook signatures or shared-secret/HMAC headers are present.
- Confirm the channel binding maps to the expected org, headless agent, and cloud profile.
- Check
/diagnostics.deliveryOperatorfor enabled controls and scopedchannelBindingIds. - Check the durable delivery cursor and dead-letter queue.
- Retry dead-lettered deliveries only after reviewing the event body and provider error, and use the gateway token that last claimed the delivery unless a channel admin is performing broader Cloud-side recovery.
Desktop Sync Issue Handling¶
For Desktop cloud workspace issues:
- Confirm the user is in the cloud workspace, not the local workspace.
- Confirm the configured cloud URL is HTTPS and matches the managed org.
- Confirm token refresh/revocation status.
- Confirm session list and full projection hydrate before cursor resume.
- Confirm local workspace remains usable if cloud is offline.
- If cache corruption is suspected, ask the user to export diagnostics first, then clear the cloud cache for that workspace only.
Customer Offboarding¶
Use offboarding when a private-beta partner leaves, a trial ends, or an incident requires org shutdown:
- Disable new execution by pausing entitlement/profile access.
- Revoke Desktop, Gateway, API, and operator-issued tokens.
- Disable BYOK secret metadata and confirm no worker can reveal plaintext.
- Remove or pause channel bindings and delivery streams.
- Export customer-owned data according to the agreed retention policy.
- Remove or quarantine object-store prefixes after export and retention review.
- Preserve audit logs for the agreed retention period.
- Confirm support can no longer access the org except through approved audit retention workflows.
Escalation Evidence¶
Every escalation should include:
- issue title and severity,
- org id,
- affected surface,
- affected session/channel/workflow/artifact ids,
- request ids,
- first bad timestamp,
- last known good timestamp,
- redacted logs or diagnostics path,
- metrics screenshot or report,
- whether there is a rollback or mitigation.
Never attach raw secrets to issues, support tickets, chat, or release evidence.