Cloud Managed Operations Runbook¶
This runbook is for operators running hosted or managed Open Cowork Cloud plus the headless gateway. It assumes split cloud roles, managed Postgres, object storage, provider secret management, and a separate gateway deployment.
Readiness Checks¶
Before routing traffic to a new deployment:
- Confirm the cloud web role returns 200 from GET /healthz.
- Confirm authenticated operators can read GET /api/runtime/status.
- Confirm GET /api/workers/heartbeats shows at least one fresh worker and one fresh scheduler heartbeat when those roles are enabled.
- Confirm the gateway returns 200 from GET /health and GET /ready.
- Confirm gateway /metrics includes provider count, aggregate delivery counters, provider-labeled counters, and error counters when metrics are enabled.
- Confirm object-store writes work by creating a small artifact or running a checkpoint-enabled smoke session.
- Confirm BYOK status reads return metadata only and no plaintext keys.
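The HTTP portion of these checks can be scripted. The following is a minimal sketch, not the project's tooling: the `Fetch` callable, the `heartbeats` response key, and the `role` and `lastSeenAt` field names are assumptions made for illustration, and it assumes both the worker and scheduler roles are enabled.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Tuple

# Hypothetical fetcher: takes a path and returns (HTTP status, parsed JSON body).
Fetch = Callable[[str], Tuple[int, dict]]


def fresh_roles(heartbeats: list, now: datetime, max_age: timedelta) -> set:
    """Return the roles that have at least one heartbeat newer than max_age."""
    roles = set()
    for hb in heartbeats:
        seen = datetime.fromisoformat(hb["lastSeenAt"])  # assumed field name
        if now - seen <= max_age:
            roles.add(hb["role"])  # assumed field name
    return roles


def readiness_failures(cloud: Fetch, gateway: Fetch, now: datetime,
                       max_age: timedelta = timedelta(seconds=60)) -> list:
    """Collect human-readable failures; an empty list means the checks passed."""
    failures = []
    if cloud("/healthz")[0] != 200:
        failures.append("cloud /healthz")
    status, body = cloud("/api/workers/heartbeats")
    if status != 200:
        failures.append("cloud /api/workers/heartbeats")
    else:
        found = fresh_roles(body.get("heartbeats", []), now, max_age)
        missing = {"worker", "scheduler"} - found
        failures.extend(f"no fresh {role} heartbeat" for role in sorted(missing))
    for path in ("/health", "/ready"):
        if gateway(path)[0] != 200:
            failures.append(f"gateway {path}")
    return failures
```

Injecting the fetchers keeps the sketch independent of any HTTP client and of how operator authentication is supplied. The 60-second freshness threshold is a placeholder; use the deployment's own alert threshold.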
Rollback¶
Rollback is image-based. Schema migrations must remain additive and idempotent, so rollback does not require destructive database migration.
- Pause new rollout traffic at the load balancer or ingress.
- Scale new workers to zero first so they stop claiming new sessions.
- Keep at least one scheduler active unless scheduled workflow execution is intentionally paused.
- Roll back cloud web, worker, and scheduler images to the previous known-good tag.
- Roll back gateway images independently if channel delivery or webhook handling regressed.
- Verify GET /healthz, GET /api/workers/heartbeats, gateway /ready, and one cloud session prompt.
- Resume traffic and monitor error rate, command latency, projection lag, and gateway delivery retries.
If a release introduced a bad additive column or index, keep the column in place and ship a forward fix. Do not drop columns during incident rollback.
For worker-only regressions, prefer rolling workers back first while keeping web reads available. Keep the scheduler active only if due workflow claims are healthy and worker capacity exists.
Worker Drains¶
Workers own OpenCode execution while a lease is active. To drain safely:
- Mark the worker or worker pool draining through the admin API.
- Stop autoscaler scale-up for the pool so no replacement workers start claiming work during the drain.
- Allow active leases to finish or checkpoint, then confirm worker currentLoad=0 and activeWorkIds=[].
- Confirm no stale owner writes are accepted by checking projection version, lease-token error logs, and open_cowork_cloud_worker_stale_owner_rejections_total.
- Confirm checkpoint writes are enabled before moving active sessions across nodes.
No database transaction should remain open while OpenCode is running.
The worker process also waits for an active command loop to finish during shutdown until OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS elapses. Use that as a safety net only; drain before terminating pods or hosts.
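The drain confirmation can be automated as a polling loop. This is a sketch under assumptions: `read_heartbeat` is a hypothetical callable that returns the worker's heartbeat as a dict containing the `currentLoad` and `activeWorkIds` fields named above; how that dict is fetched is left to the operator's tooling.

```python
import time
from typing import Callable


def is_drained(heartbeat: dict) -> bool:
    """True when the worker reports no load and no active work ids."""
    return heartbeat.get("currentLoad") == 0 and heartbeat.get("activeWorkIds") == []


def wait_for_drain(read_heartbeat: Callable[[], dict], timeout_s: float,
                   poll_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the worker is drained or timeout_s elapses.

    Returns False on timeout so the caller can stop the rollout instead of
    terminating a worker that still owns a lease.
    """
    deadline = clock() + timeout_s
    while True:
        if is_drained(read_heartbeat()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A `False` result should be treated as a reason to investigate, not as permission to rely on the shutdown grace period.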
Worker Registration¶
Use this when bootstrapping a new worker pool.
- Create or update the worker pool with mode self_hosted or saas_operated.
- Set maxWorkers, maxConcurrentWork, region, and capability metadata.
- Register a worker in pending state.
- Issue a scoped expiring worker credential and store the one-time plaintext only in the platform secret manager.
- Start the worker with a stable OPEN_COWORK_CLOUD_WORKER_ID, shared Postgres control-plane URL, shared object store, checkpoints enabled, JSON logs, and metrics.
- Verify heartbeat metadata is redacted and includes version, capabilities, current load, and region/deployment label.
- Activate the worker and run a bounded smoke prompt.
Do not start customer-hosted workers against a separate managed control plane in v1.
Worker Credential Rotation¶
- Issue a replacement credential with the same minimal scopes.
- Store the new credential in the secret manager.
- Restart or roll the affected worker after drain.
- Verify the worker heartbeats with the new credential.
- Revoke the old credential and confirm old-token heartbeat rejection.
- Check audit rows for issued, rotated, last-used, and revoked events.
Never paste worker credentials into issue comments, chat, logs, diagnostics, or release reports.
Pause, Drain, Resume, And Retire¶
- Pause: use when a pool should stop claiming and renewing work temporarily. Existing work should be recovered by another active worker or allowed to expire according to policy.
- Drain: use before rollouts and planned host termination. Draining workers renew current leases but should not claim new work.
- Resume: use after rollout or dependency recovery. Confirm queue age and claim latency before resuming every pool.
- Retire: use after a worker has drained and will not return. Retired workers are terminal and should not receive new credentials.
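The four operations above imply a small worker lifecycle. The table below is an illustrative reading of this runbook, not the admin API's actual state model: the state names `active`, `paused`, and `retired` and every transition listed are assumptions inferred from the text.

```python
# Assumed lifecycle table inferred from the runbook text; the real admin API
# may name states differently or allow other transitions.
ALLOWED_TRANSITIONS = {
    "pending": {"active", "revoked"},
    "active": {"paused", "draining", "revoked"},
    "paused": {"active", "draining", "revoked"},
    "draining": {"active", "retired", "revoked"},
    "retired": set(),  # terminal: a retired worker does not return
    "revoked": set(),  # terminal after an emergency revoke
}


def can_transition(current: str, target: str) -> bool:
    """True when the assumed table allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def may_claim_new_work(state: str) -> bool:
    """Only active workers claim new work."""
    return state == "active"


def may_renew_leases(state: str) -> bool:
    """Draining workers keep renewing current leases; paused workers do not."""
    return state in {"active", "draining"}
```

The two predicates encode the difference between pause and drain: both stop new claims, but only drain lets in-flight leases continue to be renewed.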
Rolling Worker Update¶
- Confirm release evidence: image digest/checksum/signature, compatibility matrix, SBOM/notices, and config schema validation.
- Drain one pool or deployment group.
- Roll workers with maxUnavailable=0, maxSurge=1, and termination grace greater than or equal to OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS.
- Watch worker heartbeat age, queue age, claim latency, command latency, checkpoint failures, BYOK reveal failures, stale-owner rejections, and dead letters.
- Run one session prompt, one workflow run, and one checkpoint/artifact smoke.
- Resume the pool, then continue to the next pool.
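The rollout settings can be checked before the roll starts. This sketch assumes the platform expresses termination grace in seconds while OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS is in milliseconds, which is the usual source of a mismatch; the function name and messages are illustrative.

```python
def rollout_problems(max_unavailable: int, max_surge: int,
                     termination_grace_s: int, shutdown_grace_ms: int) -> list:
    """Return the settings that violate the rolling-update guidance."""
    problems = []
    if max_unavailable != 0:
        problems.append("maxUnavailable must be 0")
    if max_surge != 1:
        problems.append("maxSurge must be 1")
    # Compare in one unit: the platform value is assumed to be in seconds.
    if termination_grace_s * 1000 < shutdown_grace_ms:
        problems.append(
            f"termination grace {termination_grace_s}s is shorter than "
            f"shutdown grace {shutdown_grace_ms}ms"
        )
    return problems
```

For example, a 20-second termination grace with a 30000 ms shutdown grace is flagged, because the platform would kill the worker before its command loop has had its full grace period.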
Emergency Revoke¶
Use this for suspected worker credential, image, host, runtime, BYOK, or object-store compromise.
- Revoke the worker credential immediately.
- Mark the worker revoked.
- Stop the host/pod/deployment.
- Preserve redacted heartbeat, audit, metric, and diagnostic evidence.
- Allow leases to expire or be reaped; do not hand-edit durable command or workflow records.
- Start a known-good replacement worker and verify stale-owner writes from the revoked worker are rejected.
- Rotate any potentially exposed object-store, channel, provider, or BYOK access path according to the suspected blast radius.
Stuck Queue¶
Use this when command queue depth or oldest queued age exceeds SLO.
- Check quota denials and billing/entitlement denials first; blocked work may be intentional.
- Check active worker count, heartbeat age, current load, and worker pool status.
- Check claim latency and lease denials.
- Check BYOK reveal failures, object-store failures, provider quota, and runtime errors before scaling.
- If a command is retrying repeatedly, use dead-letter/abort controls rather than direct database edits.
- Scale workers only when Postgres connections, object-store throughput, and provider/model quota have headroom.
Stale Lease Spike¶
Use this when stale-owner rejections or expired lease reaping spikes.
- Identify whether the spike followed a rollout, node eviction, object-store outage, BYOK reveal outage, or provider outage.
- Confirm workers are using the expected version and checkpoint schema.
- Check OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS and platform termination grace.
- Pause autoscaling until the root cause is understood.
- Verify replacement workers restore from checkpoints and do not duplicate output.
Worker Crash Loop¶
- Stop automatic scale-up for the pool.
- Check last heartbeat error code and redacted summary.
- Check startup config: control-plane URL, secret refs, object store, profile, BYOK provider policy, and runtime cache paths.
- Run the worker image locally or in staging with the same non-secret config shape.
- Roll back if the crash follows a release. Revoke the credential if the host or image may be compromised.
Gateway Backlog¶
Gateway delivery lag is operationally separate from cloud execution lag.
- Check gateway /ready for provider startup state.
- Check /metrics for open_cowork_gateway_deliveries_received_total, open_cowork_gateway_errors_total, and provider-labeled retry/dead-letter counters by provider_id and provider_kind.
- Inspect gateway /diagnostics.deliveryOperator and confirm listing, retry, dead-letter, and channelBindingIds match the affected provider shard.
- Inspect pending cloud_channel_deliveries rows by status, next_attempt_at, channel_binding_id, and last_claimed_by.
- For channel-provider outages, keep cloud sessions running and let deliveries retry with backoff.
- For bad provider credentials, rotate the channel secret and restart only the affected gateway deployment.
If gateway lag is caused by worker backlog, do not scale Gateway first. Fix worker queue age, claim latency, BYOK reveal failures, provider quota, or object-store checkpoint errors, then let the Gateway delivery feed drain from durable cursors.
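When comparing providers during a backlog, it can help to total a counter per `provider_id` from a scraped /metrics body. The sketch below parses the Prometheus text exposition format with regular expressions. It assumes the named counter carries a `provider_id` label, which this runbook states only for the provider-labeled counters; samples without that label are grouped under a placeholder key.

```python
import re
from collections import defaultdict

# metric_name{labels} value [timestamp]
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def counters_by_provider(metrics_text: str, metric: str) -> dict:
    """Sum one metric's samples per provider_id label value."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = SAMPLE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL.findall(match.group(2) or ""))
        totals[labels.get("provider_id", "<aggregate>")] += float(match.group(3))
    return dict(totals)
```

Taking two scrapes a few minutes apart and subtracting the totals shows which provider shard is accumulating errors, which is more useful than the absolute counter values.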
Web Unavailable Or Erroring¶
Use this when GET /healthz fails, Cloud Web returns elevated 5xx responses, or users cannot load the Cloud Web Workbench.
- Check ingress/load-balancer health and TLS certificate status.
- Check open_cowork_cloud_http_requests_total by status and role.
- Check structured logs by request_id for the failing route.
- Verify Postgres connectivity from the web role.
- Verify cookie/OIDC configuration if only authenticated routes fail.
- Scale web replicas up only after the dependency error is understood.
- If a new image caused the failure, roll back web first; keep workers running only if command processing remains healthy.
Worker Backlog¶
Use this when prompt latency rises, commands remain pending, or projection lag grows. The queue-depth signal is a bounded estimate from worker scans; use it with oldest queued age and claim latency rather than as an exact backlog count.
- Check open_cowork_cloud_command_queue_depth_estimate, open_cowork_cloud_runnable_session_claim_duration_ms, and open_cowork_cloud_worker_loop_duration_ms.
- Check worker heartbeats and active sessions in GET /api/workers/heartbeats.
- Check lease signals: open_cowork_cloud_worker_lease_claims_total, open_cowork_cloud_worker_lease_renewals_total, and open_cowork_cloud_worker_expired_leases_reaped_total. If open_cowork_cloud_worker_expired_lease_reaper_drain_cap_hits_total increases, expired-lease recovery is exhausting its bounded drain cap and may need worker capacity or a stuck-owner investigation.
- Check stale-owner signals: open_cowork_cloud_worker_stale_owner_rejections_total should remain near zero outside crash/failover drills.
- Check BYOK reveal failures and provider errors before scaling workers.
- Scale workers horizontally only when Postgres connection pool and provider quota have headroom.
- If one session is poisoning the queue, use session abort/retry controls rather than direct database edits.
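The ordering of these checks can be written down as a scale-up guard. This is a sketch, not an autoscaler policy: the `BacklogSignals` fields, the thresholds, and the 0.8 headroom cutoff are all hypothetical and should be replaced with the deployment's SLO values.

```python
from dataclasses import dataclass


@dataclass
class BacklogSignals:
    oldest_queued_age_s: float         # oldest queued command age
    claim_p95_ms: float                # claim latency, e.g. p95
    byok_reveal_failures: int          # recent increase, not lifetime total
    provider_errors: int               # recent increase, not lifetime total
    pg_pool_utilization: float         # 0.0 to 1.0
    provider_quota_utilization: float  # 0.0 to 1.0


def scale_decision(s: BacklogSignals, max_age_s: float = 120.0,
                   max_claim_ms: float = 500.0, headroom_cutoff: float = 0.8) -> str:
    """Apply the checks in runbook order and return the first verdict that fits."""
    backlog = s.oldest_queued_age_s > max_age_s or s.claim_p95_ms > max_claim_ms
    if not backlog:
        return "hold"
    if s.byok_reveal_failures or s.provider_errors:
        return "investigate"  # more workers would only retry failing calls
    if (s.pg_pool_utilization >= headroom_cutoff
            or s.provider_quota_utilization >= headroom_cutoff):
        return "blocked"      # no Postgres or provider headroom
    return "scale"
```

The queue-depth estimate is deliberately absent from the decision: because it is a bounded estimate, the sketch keys on oldest queued age and claim latency instead.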
Scheduler Stalled¶
Use this when scheduled workflows do not start or heartbeat age exceeds the alert threshold.
- Check scheduler heartbeat freshness.
- Check open_cowork_cloud_scheduler_claims_total and open_cowork_cloud_scheduler_failures_total.
- Check open_cowork_cloud_scheduler_expired_claims_reaped_total; any sustained increase means workflow start claims are expiring before session attachment. If open_cowork_cloud_scheduler_expired_claim_reaper_drain_cap_hits_total increases, the scheduler is exhausting its bounded recovery drain cap and may need more scheduler capacity or investigation of stalled workflow-start claims.
- Confirm exactly one scheduler deployment group is active for the environment.
- Confirm database time and application time are not drifting.
- Restart scheduler only after checking logs for claim transaction failures.
- Verify one due workflow claim after restart and confirm no double-fire.
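"Sustained increase" for a counter such as open_cowork_cloud_scheduler_expired_claims_reaped_total can be made concrete by requiring several consecutive rising scrapes. The function below is one possible definition; the choice of three consecutive increases is an assumption, not a documented alert rule.

```python
from typing import Sequence


def sustained_increase(samples: Sequence[float], min_consecutive: int = 3) -> bool:
    """True when the counter rose on min_consecutive scrapes in a row.

    A flat interval or a counter reset (value going down after a restart)
    breaks the run, so a single burst does not count as sustained.
    """
    run = 0
    for previous, current in zip(samples, samples[1:]):
        if current > previous:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0
    return False
```

With scrapes of `[5, 6, 8, 9]` the function reports a sustained increase, while `[0, 0, 1, 1, 2]` does not, because the flat interval in the middle breaks the run.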
Postgres Connection Exhaustion¶
Use this when web, worker, scheduler, or Gateway routes fail with database pool or timeout errors.
- Check managed database connection count, wait events, slow queries, and CPU.
- Temporarily scale workers down before web if user reads must remain available.
- Check queue depth and scheduler claims; high worker concurrency may be exhausting the pool.
- Confirm migrations are not running repeatedly.
- Add pool capacity or a connection pooler only after bounding worker replicas.
- Do not increase all role replicas at the same time.
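The reason to bound worker replicas first is simple arithmetic: each replica holds its own pool, so total connections grow with replicas times pool size. A small budget calculation makes that visible. All numbers and role pool sizes in the example are made up; substitute the managed database's real connection limit and each role's configured pool size.

```python
def connection_budget(max_connections: int, reserved: int, roles: dict) -> dict:
    """Compute worst-case connection use per role and the remaining headroom.

    roles maps a role name to (replicas, pool_size_per_replica).
    reserved covers migrations, admin sessions, and monitoring.
    """
    per_role = {name: replicas * pool for name, (replicas, pool) in roles.items()}
    total = sum(per_role.values())
    available = max_connections - reserved
    return {
        "per_role": per_role,
        "total": total,
        "available": available,
        "headroom": available - total,  # negative means the pool can be exhausted
    }


# Example with illustrative numbers only.
budget = connection_budget(
    max_connections=200,
    reserved=20,
    roles={"web": (3, 10), "worker": (8, 15), "scheduler": (1, 5), "gateway": (2, 5)},
)
```

In this example the roles can open 165 of the 180 usable connections, leaving a headroom of 15. Adding two more workers at 15 connections each would push the headroom to -15, which is the failure mode this section describes.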
Object-Store Errors¶
Use this when artifacts, uploads, exports, or checkpoint restore/save fails.
- Watch open_cowork_cloud_object_store_operations_total{status="error"} (by operation = get/put/head/delete and cloud_object_store_kind) and the open_cowork_cloud_object_store_operation_duration_ms latency. These cover every durable read/write, including the object-store I/O behind checkpoint save/restore.
- Check object-store service health and credentials/workload identity.
- Verify bucket/container/prefix exists and has versioning enabled.
- Check checkpoint restore logs before allowing workers to resume failed sessions.
- For transient object-store failures, keep web reads available and pause worker scale-up.
- For permission failures, rotate or repair object-store credentials and run one artifact read/write smoke.
KMS Or Secret Adapter Errors¶
Use this when BYOK metadata exists but runtime reveal, cookie secret, OIDC secret, channel credential, or envelope decryption fails.
- Check secret manager/KMS availability and IAM on the runtime service account.
- Confirm OPEN_COWORK_CLOUD_SECRET_KEY_REF, OPEN_COWORK_CLOUD_COOKIE_SECRET_REF, OIDC refs, and gateway secret refs point to current versions.
- Do not copy plaintext secrets into environment variables as a workaround in managed deployments.
- If a KMS key was rotated, verify old ciphertext can still be revealed before disabling old key material.
- Run BYOK metadata and worker validation smoke after repair.
OIDC Outage¶
Use this when sign-in, token refresh, or browser callback handling fails.
- Check IdP status and OIDC discovery document.
- Check OPEN_COWORK_CLOUD_PUBLIC_URL, callback path, client id, and client secret reference.
- Check auth failure rate and backoff state.
- Keep existing authenticated sessions unless cookie secret rotation is part of the incident.
- Do not switch public deployments to auth.mode=none.
- If emergency admin access is needed, use a scoped API token through private networking and audit the action.
Gateway Provider Outage¶
Use this when Telegram, Slack, email, webhook, or another channel provider fails while cloud sessions still execute.
- Check gateway /ready provider status, open_cowork_gateway_provider_state, and provider-labeled retry/dead-letter counters.
- Keep cloud workers running; failed channel delivery should retry or dead-letter without blocking execution.
- Rotate only the affected channel credential if provider auth failed.
- Use /deliveries?status=failed&channelBindingId=<binding> from the affected gateway. Retry/dead-letter controls are valid only for deliveries last claimed by that gateway token unless an org channel admin performs broader Cloud-side recovery.
- If /diagnostics.deliveryOperator.disabledReason is non-null, fix the missing Cloud client capability, admin token, or provider binding before replaying deliveries.
- Notify users that desktop and web remain authoritative while chat delivery is degraded.
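The delivery listing query and the replay precondition can be expressed as two small helpers. This is a sketch: the gateway base URL is a placeholder, the diagnostics payload is assumed to be a JSON object with a `deliveryOperator` member, and authentication is out of scope.

```python
from typing import Optional
from urllib.parse import urlencode


def failed_deliveries_url(gateway_base_url: str, channel_binding_id: str) -> str:
    """Build the failed-delivery listing URL for one channel binding."""
    query = urlencode({"status": "failed", "channelBindingId": channel_binding_id})
    return f"{gateway_base_url.rstrip('/')}/deliveries?{query}"


def replay_blocker(diagnostics: dict) -> Optional[str]:
    """Return why replay should not proceed, or None when it may."""
    if "deliveryOperator" not in diagnostics:
        # Treat an absent section as a blocker instead of assuming health.
        return "deliveryOperator missing from diagnostics"
    return diagnostics["deliveryOperator"].get("disabledReason")
```

Using `urlencode` keeps binding ids with spaces or reserved characters from corrupting the query string. Checking `replay_blocker` before issuing retries avoids replaying against an operator that cannot complete them.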
Webhook Abuse¶
Use this when webhook auth failures, replay rejections, or rate-limit denials spike.
- Confirm public webhook routes require HMAC/shared-secret signatures.
- Check replay and auth-failure counters by source.
- Rotate the affected webhook secret if a signing secret may be exposed.
- Tighten rate limits or temporarily disable the affected channel binding.
- Preserve audit and redacted diagnostics for incident review.
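For reference when auditing a webhook route, a typical HMAC check with a replay window looks like the sketch below. It is a generic pattern, not the gateway's actual scheme: the signed payload layout (`timestamp.body`), the SHA-256 hex encoding, and the 300-second tolerance are assumptions.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, body: bytes, timestamp: str,
                   signature_hex: str, now: float, tolerance_s: int = 300) -> bool:
    """Verify an HMAC-SHA256 signature over 'timestamp.body' within a time window."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale or future-dated requests to limit replay.
    if abs(now - sent_at) > tolerance_s:
        return False
    expected = hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

A time window alone still allows a replay inside the tolerance. Deployments that need strict replay rejection usually also record a delivery id or nonce and refuse repeats.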
BYOK Provider Key Failure¶
Use this when model calls fail because a user provider key is missing, expired, revoked, or rejected by provider policy.
- Confirm read APIs expose metadata only: provider, last4/fingerprint, status, and health.
- Check open_cowork_cloud_byok_reveal_failures_total without logging plaintext.
- Mark the provider credential invalid/expired through BYOK metadata.
- Ask the org owner/admin to rotate the provider key.
- Resume worker execution only after a bounded validation succeeds.
Secret Rotation¶
Rotate secrets without moving them through logs, chat, issue comments, or renderer state.
- Cloud envelope key: rotate through the platform secret manager and verify BYOK reveal tests before deleting old key material.
- Cookie secret: rotate during a maintenance window because existing browser sessions may be invalidated.
- Gateway service token: issue a new scoped token in the dashboard, update the gateway secret, restart the gateway, then revoke the old token.
- Channel credentials: rotate in the channel provider first, update the gateway secret, then verify provider readiness.
- Object-store keys: prefer workload identity or short-lived credentials; if a static key is used, update the secret and verify artifact read/write.
Tenant Offboarding¶
- Disable new session, workflow, Gateway, and worker claims for the org.
- Drain active workers and wait for active work to finish or checkpoint.
- Revoke org API tokens, worker credentials, gateway tokens, channel credentials, and BYOK provider refs.
- Export or delete artifacts according to the org retention policy.
- Preserve audit records required by policy while redacting credentials and user content from support bundles.
- Confirm no worker heartbeat, queued command, workflow run, or gateway delivery remains active for the org.
Suspected Key Exposure¶
Use this when a BYOK key, worker credential, gateway token, object-store key, cookie secret, OIDC secret, webhook secret, or billing secret may be exposed.
- Stop the affected ingress or worker pool if active misuse is possible.
- Revoke or rotate the exposed secret at the source of truth.
- Revoke dependent sessions/tokens where required, including worker credentials and gateway service tokens.
- Search redacted logs and diagnostics for the secret fingerprint or last4; do not paste the secret itself into tools.
- Re-run BYOK metadata, worker heartbeat, object-store read/write, webhook signature, and gateway readiness smoke tests.
- Record a private incident report with ids, timestamps, and fingerprints only. Do not commit incident evidence to the public repo.
Diagnostics¶
Diagnostics must be redacted before leaving the deployment boundary.
Allowed to include:
- service name, version, role, profile, and image tag,
- health/readiness JSON,
- worker heartbeat age and scheduler heartbeat age,
- gateway provider ids, provider kinds, and started flags,
- counters and non-secret policy verdicts,
- sanitized log excerpts.
Must be redacted:
- API tokens, BYOK keys, provider credentials, OAuth tokens, cookies, authorization headers, and webhook secrets,
- Postgres URLs with credentials,
- object-store signed URLs, bucket-private URLs, SAS tokens, and pre-signed query strings,
- local host paths and workspace paths,
- user email addresses.
Gateway /diagnostics is suitable for support use only because it returns redacted gateway configuration and counters. Put it behind private networking, VPN, or operator auth in managed deployments; do not expose it as a public webhook surface.
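A last-pass text scrubber can catch some of the "must be redacted" items before a log excerpt leaves the deployment boundary. The patterns below are illustrative and deliberately over-redact. They cover Postgres URL credentials, Authorization header values, URL query strings, and email addresses only; host paths, workspace paths, and raw tokens need their own rules, and credentials containing `/` or `@` are not handled.

```python
import re

# (pattern, replacement) pairs applied in order. All patterns are assumptions
# about what the logs look like, not the product's redaction rules.
_RULES = [
    # postgres://user:password@host -> postgres://[REDACTED]@host
    (re.compile(r"(postgres(?:ql)?://)[^\s/@]+@"), r"\1[REDACTED]@"),
    # Authorization: <anything to end of line>
    (re.compile(r"(?i)(authorization:[ \t]*)[^\r\n]+"), r"\1[REDACTED]"),
    # Drop every URL query string, which covers pre-signed and SAS parameters.
    (re.compile(r"(https?://[^\s?]+)\?\S+"), r"\1?[REDACTED]"),
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
]


def redact(text: str) -> str:
    """Apply each redaction rule to the text and return the scrubbed copy."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

Pattern-based scrubbing is a safety net, not a guarantee. Structured logging that never records these values in the first place remains the primary control.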
Restore Check¶
After restoring from backup:
- Restore Postgres first.
- Restore object-store artifacts and checkpoint prefixes for the same point in time.
- Start web with workers scaled to zero.
- Verify session list and projections load from durable state.
- Start one worker, run a smoke prompt, and verify checkpoint writes.
- Start scheduler, then gateway.
- Verify channel deliveries resume from durable cursors without duplicates.
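The restore order matters because each step depends on the one before it, so a scripted restore should stop at the first failed check instead of continuing. The runner below is a generic sketch; the step names mirror the list above and each check callable is a hypothetical placeholder for the operator's own verification.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_gated(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run checks in order and stop at the first failure.

    Returns (completed step names, name of the failed step or None).
    """
    completed = []
    for name, check in steps:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder checks; replace each lambda with a real verification.
restore_steps: List[Step] = [
    ("restore postgres", lambda: True),
    ("restore object store to same point in time", lambda: True),
    ("start web with workers at zero", lambda: True),
    ("session list and projections load", lambda: True),
    ("one worker smoke prompt and checkpoint write", lambda: True),
    ("start scheduler then gateway", lambda: True),
    ("deliveries resume without duplicates", lambda: True),
]
```

If `run_gated` reports a failed step, later roles stay down, which prevents a scheduler or gateway from acting on state that has not been verified.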