Managed Worker Service Plane
The managed worker service plane is the execution-capacity layer for cloud work that should continue without a user's desktop staying online. It composes the existing Open Cowork Cloud control plane and the OpenCode runtime. It is not a second agent runtime, session store, scheduler, tool system, MCP host, or approval engine.
Decision
V1 supports control-plane-owned worker pools:
- managed SaaS workers operated by the Open Cowork Cloud operator for hosted BYOK deployments
- fully self-hosted internal worker pools operated by the same organization that owns the Cloud control plane
V1 does not support customer-hosted workers connecting to a separate managed SaaS control plane. That mode needs a separate trust, liability, networking, update, and data-residency review because the worker would hold scoped execution credentials while running outside the control-plane operator's boundary.
This choice keeps the first implementation concrete: the same operator owns the Cloud API, Postgres control plane, object store, secret adapter, worker images, network policy, and incident response.
Ownership Boundary
OpenCode owns execution:
- sessions and child sessions
- runtime event streaming
- tool and MCP execution semantics
- approvals and questions
- native skills and native provider auth behavior
- provider/model calls through runtime config
Open Cowork owns the service-plane composition:
- worker identity, credentials, lifecycle, and heartbeats
- work eligibility, claims, leases, fencing, and recovery
- tenancy, policy, quotas, entitlements, and audit
- object-store artifact and checkpoint metadata
- BYOK secret reveal policy and runtime-config injection
- Web, Desktop cloud workspace, and Gateway projections
- deployment, update, restore, and operator runbooks
Only worker/runtime adapter code may import OpenCode runtime surfaces. Browser, Gateway, Desktop renderer, route modules, and control-plane store modules remain product/control-plane code.
Trust Boundary
flowchart LR
Browser["Browser user<br/>cookie auth"]
Desktop["Desktop cloud client<br/>bearer auth"]
Gateway["Gateway service client<br/>service token + actor identity"]
Web["Cloud web/API role<br/>auth, policy, command writes"]
Scheduler["Scheduler role<br/>due-run claims"]
Worker["Managed worker role<br/>lease owner"]
Runtime["OpenCode runtime process<br/>execution owner"]
Store["Postgres control plane<br/>commands, leases, events, projections"]
Objects["Object store<br/>artifacts, checkpoints, snapshots"]
Secrets["Secret adapter/KMS<br/>BYOK, worker credentials"]
Operator["Operator/admin<br/>worker lifecycle"]
Browser --> Web
Desktop --> Web
Gateway --> Web
Operator --> Web
Web --> Store
Scheduler --> Store
Worker --> Store
Worker --> Secrets
Worker --> Objects
Worker --> Runtime
Runtime --> Worker
Web --> Objects
classDef client fill:#dbeafe,stroke:#2563eb,color:#1e3a8a
classDef control fill:#fef3c7,stroke:#d97706,color:#78350f
classDef execution fill:#dcfce7,stroke:#16a34a,color:#14532d
classDef state fill:#ede9fe,stroke:#7c3aed,color:#3b0764
class Browser,Desktop,Gateway client
class Web,Scheduler,Operator control
class Worker,Runtime execution
class Store,Objects,Secrets state
Clients never talk to OpenCode directly for cloud work. Clients write commands or decisions to the Cloud API. Workers claim eligible work from durable state, run OpenCode, and publish fenced events/projections/checkpoints back to the control plane.
Work Classes
V1 work classes are cloud-only:
| Work class | Required inputs | Durable owner |
|---|---|---|
| Cloud session command | tenant, session, command id, profile, provider/model, project source or restored workspace | session command record |
| Manual workflow run | tenant, workflow, run id, agent/profile, trigger actor | workflow run record |
| Scheduled workflow run | tenant, workflow, schedule trigger, due timestamp, scheduler claim id | workflow run record |
| Webhook workflow run | tenant, workflow, webhook replay claim, signed request metadata | workflow run record |
| Gateway prompt | tenant, channel binding, resolved actor, session binding, prompt command | session command record |
| Artifact/checkpoint write | tenant, session/run, claim token, object metadata, checksum/size | artifact/checkpoint metadata |
The worker may restore an approved Git source, uploaded snapshot, or managed workspace checkpoint into an app-managed sandbox before OpenCode starts.
Excluded by default:
- local desktop-only threads
- arbitrary host-path project directories
- unsandboxed local file access
- local stdio MCP commands
- machine runtime config
- provider credentials outside approved BYOK/runtime-config paths
- direct Gateway-owned execution
- peer-to-peer desktop sync
Gateway Edge Capacity
Standalone Team Gateway can optionally connect to Cloud through the Cloud Gateway Registration contract. This does not change the V1 managed-worker decision: edge execution is allowed only when the Cloud/Gateway trust model is self_hosted_same_operator or saas_operator_managed. Customer-hosted Gateway edge workers connected to a separate managed SaaS control plane remain customer_hosted_managed_saas_deferred.
The registration kind decides the boundary:
| Registration kind | Managed-worker relationship |
|---|---|
| external_workspace | Not a worker. Cloud may store redacted Gateway workspace metadata, health, capabilities, cursors, and audit summaries. Gateway remains source of truth for Gateway-owned sessions. |
| edge_worker | Worker-like capacity. Gateway claims only eligible Cloud-owned work and writes Cloud-owned output with managed-worker lease-token fencing. |
| external_workspace_edge_worker | Both lanes. Gateway-owned work stays Gateway-owned; Cloud-owned work uses the managed-worker claim/fencing path. |
Edge Gateway credentials are distinct from Cloud Channel Gateway service tokens and from human/admin credentials. They are scoped to registration heartbeat, capability advertisement, optional metadata sync, and, when enabled, edge work claim/renew/fenced-output operations. They cannot call BYOK read/reveal APIs, billing APIs, tenant admin APIs, Desktop APIs, or operator APIs.
Cloud must never merge Gateway Postgres with Cloud Postgres. Cloud-owned edge work uses Cloud command, lease, event, projection, artifact, checkpoint, usage, and audit records. Gateway-owned external-workspace work uses Gateway records, with only explicitly allowed redacted metadata syncing to Cloud.
Worker Lifecycle
Worker records use these states:
| State | Meaning | Allowed next states |
|---|---|---|
| pending | Registration exists but the worker is not trusted to claim work. | active, revoked |
| active | Worker can heartbeat, claim eligible work, renew leases, and write fenced output. | draining, paused, unhealthy, retired, revoked |
| draining | Worker keeps renewing active leases but cannot claim new work. | active, retired, revoked, unhealthy |
| paused | Worker cannot claim or renew work until resumed by an admin/operator. | active, retired, revoked |
| unhealthy | Control plane has detected stale heartbeat or repeated failures. | active, draining, retired, revoked |
| retired | Worker exited intentionally after drain. It cannot claim work again. | terminal |
| revoked | Credential or worker was emergency-blocked. It cannot heartbeat, claim, renew, or write. | terminal |
Lifecycle transitions emit audit events and are role-checked. Tenant admins see tenant-scoped summaries. Operators see cross-pool health only through operator auth or private networking.
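The state table above can be read as a small transition map. The following is a minimal sketch, assuming an in-memory representation; `ALLOWED_TRANSITIONS`, `can_transition`, and `transition` are hypothetical names for illustration, not part of the Open Cowork code base, and role checks and audit emission are left out.

```python
# Hypothetical encoding of the worker lifecycle table; names are illustrative.
ALLOWED_TRANSITIONS = {
    "pending": {"active", "revoked"},
    "active": {"draining", "paused", "unhealthy", "retired", "revoked"},
    "draining": {"active", "retired", "revoked", "unhealthy"},
    "paused": {"active", "retired", "revoked"},
    "unhealthy": {"active", "draining", "retired", "revoked"},
    "retired": set(),  # terminal
    "revoked": set(),  # terminal
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the table permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the table forbids the move."""
    if not can_transition(current, target):
        raise ValueError(f"illegal worker transition: {current} -> {target}")
    return target
```

Keeping terminal states as empty sets means a retired or revoked worker fails every transition check without a special case.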
Enrollment And Credentials
Worker enrollment is explicit:
- An operator or tenant admin creates a worker pool.
- A worker registration record is created in pending state.
- The control plane issues a one-time credential. Only the token hash is stored after issuance.
- The worker starts with that credential and calls heartbeat.
- An authorized admin/operator activates the worker or policy auto-activates it for self-host mode.
Worker credentials are:
- scoped to worker id, pool id, tenant id where tenant-scoped, and allowed operations
- expiring, rotatable, and revocable
- stored hash-only in the control plane
- never returned after initial issuance
- never accepted for broad tenant admin APIs
The worker principal contains workerId, poolId, tenant scope, credential id, scopes, expiry, and status. It does not inherit user/admin authority.
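One way to picture hash-only credential storage and a scoped principal is sketched below. This is an assumption-laden illustration: the SHA-256 choice, the scope strings, and the names `issue_credential`, `verify_credential`, `WorkerPrincipal`, and `is_allowed` are hypothetical and are not taken from the Open Cowork implementation.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Optional


def issue_credential() -> tuple[str, str]:
    """Return (raw_token, token_hash); only the hash should be persisted."""
    raw_token = secrets.token_urlsafe(32)
    token_hash = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()
    return raw_token, token_hash


def verify_credential(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare in constant time."""
    presented_hash = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)


@dataclass(frozen=True)
class WorkerPrincipal:
    # Hypothetical shape mirroring the fields listed in the text.
    worker_id: str
    pool_id: str
    tenant_id: Optional[str]  # None for pools that are not tenant-scoped
    credential_id: str
    scopes: frozenset
    expires_at: float
    status: str


def is_allowed(principal: WorkerPrincipal, scope: str, now: float) -> bool:
    """Allow only unexpired, non-revoked credentials that carry the exact scope."""
    if principal.status == "revoked" or now >= principal.expires_at:
        return False
    return scope in principal.scopes
```

Because the principal carries an explicit scope set and nothing else, there is no code path through which it could pick up user or admin authority.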
Phase 1 Control Plane Surface
Phase 1 implements worker identity and lifecycle only. It intentionally does not implement work claiming or execution routing.
Admin-managed endpoints:
- GET /api/admin/worker-pools
- POST /api/admin/worker-pools
- POST /api/admin/worker-pools/{poolId}/update
- GET /api/admin/workers
- POST /api/admin/workers
- GET /api/admin/workers/{workerId}
- POST /api/admin/workers/{workerId}/activate
- POST /api/admin/workers/{workerId}/pause
- POST /api/admin/workers/{workerId}/resume
- POST /api/admin/workers/{workerId}/drain
- POST /api/admin/workers/{workerId}/retire
- POST /api/admin/workers/{workerId}/revoke
- GET /api/admin/workers/{workerId}/credentials
- POST /api/admin/workers/{workerId}/credentials
- POST /api/admin/workers/{workerId}/credentials/{credentialId}/rotate
- POST /api/admin/workers/{workerId}/credentials/{credentialId}/revoke
- GET /api/admin/workers/{workerId}/heartbeats
Worker self endpoint:
POST /api/workers/{workerId}/heartbeat
Worker credentials authenticate only the worker self endpoint. They cannot call tenant admin APIs, Desktop APIs, Gateway APIs, BYOK APIs, billing APIs, or work-claim APIs. Raw worker credential values are returned once at issuance and are never returned by list/detail APIs.
Lease And Fencing Contract
Every claimable work unit must carry or reference:
- tenant_id
- work_type
- work_id
- session_id
- workflow_id
- workflow_run_id
- status
- priority
- available_at
- leased_by
- lease_expires_at
- lease_token
- checkpoint_version
- last_heartbeat_at
- claimed_at
- completed_at
- failed_at
- idempotency key or command/run sequence
Claiming work is a single transaction:
- Select eligible work for the worker's tenant, pool, capabilities, profile, provider, quota, and entitlement.
- Verify worker status and credential validity.
- Assign leased_by, lease_token, and lease_expires_at.
- Increment attempt and checkpoint metadata.
- Return the claimed payload.
No database transaction may remain open while OpenCode runs.
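The claim steps can be sketched as one short transaction that closes before any runtime work begins. The example uses SQLite from the Python standard library purely so it runs anywhere; the real control plane is Postgres, and the table layout, the `attempt` column, and the `claim_next` helper are assumptions made for illustration. Capability, quota, and entitlement filters are omitted.

```python
import sqlite3
import uuid

# Hypothetical, reduced schema; the real records carry more fields.
SCHEMA = """
CREATE TABLE workers (worker_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE work (
    work_id TEXT PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    status TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0,
    available_at REAL NOT NULL,
    leased_by TEXT,
    lease_token TEXT,
    lease_expires_at REAL,
    claimed_at REAL,
    attempt INTEGER NOT NULL DEFAULT 0
);
"""


def claim_next(conn, worker_id, tenant_id, now, lease_seconds=60.0):
    """Claim one eligible work unit, or return None. No runtime work happens here."""
    lease_token = uuid.uuid4().hex
    with conn:  # commit on success, roll back on exception
        worker = conn.execute(
            "SELECT status FROM workers WHERE worker_id = ?", (worker_id,)
        ).fetchone()
        if worker is None or worker[0] != "active":
            return None
        row = conn.execute(
            "SELECT work_id FROM work "
            "WHERE tenant_id = ? AND status = 'queued' AND available_at <= ? "
            "ORDER BY priority DESC, available_at ASC LIMIT 1",
            (tenant_id, now),
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the update a no-op if another claimer won the race.
        updated = conn.execute(
            "UPDATE work SET status = 'leased', leased_by = ?, lease_token = ?, "
            "lease_expires_at = ?, claimed_at = ?, attempt = attempt + 1 "
            "WHERE work_id = ? AND status = 'queued'",
            (worker_id, lease_token, now + lease_seconds, now, row[0]),
        )
        if updated.rowcount != 1:
            return None
    # The transaction is closed before the caller starts OpenCode.
    return {"work_id": row[0], "lease_token": lease_token}
```

On Postgres the select step is commonly written with `FOR UPDATE SKIP LOCKED` so concurrent claimers skip rows that are already being claimed; whether the project does so is not stated here.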
Every worker-produced write includes the active lease_token. The control plane rejects stale-owner writes for:
- events
- projections
- session command status
- workflow run status
- workflow finalization
- checkpoint metadata
- object-store artifact metadata
- gateway/channel delivery records derived from worker output
- execution usage/metering records
Fencing is mandatory even when there is one worker replica. It is the mechanism that makes failover safe once the topology scales.
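A minimal in-memory sketch of lease-token fencing follows. It assumes a single store that tracks the current owner per work unit; `FencedStore`, `StaleOwnerError`, `grant`, and `append_event` are hypothetical names, and a production control plane would enforce the same check inside the database write.

```python
from dataclasses import dataclass, field


class StaleOwnerError(Exception):
    """Raised when a write carries a lease token that no longer owns the work."""


@dataclass
class FencedStore:
    # work_id -> (lease_token, lease_expires_at)
    leases: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    stale_rejections: list = field(default_factory=list)

    def grant(self, work_id: str, lease_token: str, expires_at: float) -> None:
        """Record the current owner; a reclaim simply overwrites the old token."""
        self.leases[work_id] = (lease_token, expires_at)

    def append_event(self, work_id: str, lease_token: str, payload: dict, now: float) -> None:
        """Accept the write only from the current, unexpired lease owner."""
        current = self.leases.get(work_id)
        if current is None or current[0] != lease_token or now >= current[1]:
            # Keep evidence of the rejected write, then refuse it.
            self.stale_rejections.append({"work_id": work_id, "at": now})
            raise StaleOwnerError(work_id)
        self.events.append((work_id, payload))
```

For example, after `grant("cmd-1", "tok-b", ...)` replaces an earlier `tok-a` lease, any later `append_event` call that still presents `tok-a` raises `StaleOwnerError` and leaves a rejection record behind.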
Checkpoint And Artifact Ownership
Workers may write object payloads only through scoped object-store adapters. Object metadata is durable control-plane state and must be written with the active lease token.
Rules:
- object keys are generated by the control plane or scoped helper, not by raw runtime paths
- object metadata includes tenant, session/run id, claim token, size, checksum, content type, and retention class
- artifact bodies are downloadable only through authorized API routes or signed URLs with bounded TTL
- checkpoint restores validate manifest checksums before runtime use
- a worker crash after object upload but before metadata write leaves an orphan that cleanup can delete; it does not expose the object to clients
- a stale worker cannot overwrite checkpoint metadata after lease loss
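The key and metadata rules above could look roughly like the following sketch. The key layout, the allowed character set, and the helper names `build_object_key`, `build_metadata`, and `matches` are assumptions chosen for illustration, not the project's actual key scheme.

```python
import hashlib
import re

_SAFE_PART = re.compile(r"[A-Za-z0-9_-]{1,64}")
_KINDS = ("artifacts", "checkpoints")


def build_object_key(tenant_id: str, owner_id: str, kind: str, object_id: str) -> str:
    """Build a tenant-scoped key from validated ids, never from runtime paths."""
    if kind not in _KINDS:
        raise ValueError(f"unknown object kind: {kind!r}")
    for part in (tenant_id, owner_id, object_id):
        # Rejects separators, dots, and empty parts, so traversal cannot be expressed.
        if not _SAFE_PART.fullmatch(part):
            raise ValueError(f"unsafe key component: {part!r}")
    return f"tenants/{tenant_id}/{owner_id}/{kind}/{object_id}"


def build_metadata(tenant_id, owner_id, key, body: bytes, lease_token,
                   content_type, retention_class) -> dict:
    """Describe an uploaded object; the control plane stores this with the lease token."""
    return {
        "tenant_id": tenant_id,
        "owner_id": owner_id,  # session or run id
        "key": key,
        "claim_token": lease_token,
        "size": len(body),
        "checksum": hashlib.sha256(body).hexdigest(),
        "content_type": content_type,
        "retention_class": retention_class,
    }


def matches(metadata: dict, body: bytes) -> bool:
    """Validate size and checksum before a checkpoint is handed to the runtime."""
    return (metadata["size"] == len(body)
            and metadata["checksum"] == hashlib.sha256(body).hexdigest())
```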
Heartbeats And Liveness
Workers heartbeat with:
- worker id and pool id
- version and runtime compatibility
- capabilities
- region or deployment label
- current load
- active work ids
- last error code and redacted summary
- monotonic heartbeat sequence where available
Heartbeat acceptance requires an active, unexpired, non-revoked worker credential. Heartbeats do not grant admin powers and cannot mutate pool policy.
Liveness policy:
- active workers renew leases before lease_expires_at
- missed heartbeat moves workers to unhealthy
- expired leases become recoverable work
- draining workers renew current leases but do not claim new work
- paused or revoked workers cannot renew leases
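The liveness policy can be sketched as three small checks. This is an illustration under assumptions: the helper names are hypothetical, timestamps are plain floats, and treating unhealthy workers as unable to renew is a choice made here, since the policy above only names paused and revoked.

```python
# Statuses allowed to renew leases; excluding "unhealthy" is an assumption.
RENEW_ALLOWED = {"active", "draining"}


def can_renew(worker_status: str) -> bool:
    """Active and draining workers may renew; paused and revoked may not."""
    return worker_status in RENEW_ALLOWED


def stale_workers(last_heartbeat_at: dict, now: float, max_age: float) -> list:
    """Worker ids whose last heartbeat is older than max_age seconds."""
    return sorted(w for w, ts in last_heartbeat_at.items() if now - ts > max_age)


def reap_expired(work_items: list, now: float) -> list:
    """Return expired leased work to the queue and report the reclaimed ids."""
    reclaimed = []
    for item in work_items:
        if item["status"] == "leased" and item["lease_expires_at"] <= now:
            item.update(status="queued", leased_by=None,
                        lease_token=None, lease_expires_at=None)
            reclaimed.append(item["work_id"])
    return reclaimed
```

Clearing `lease_token` during reaping is what lets fencing reject any late write from the previous owner.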
Recovery Rules
| Failure | Required behavior |
|---|---|
| Worker crash before execution | Lease expires; reaper makes work eligible for retry. |
| Worker crash during OpenCode execution | Lease expires; replacement restores checkpoint or restarts from durable command state. |
| Worker crash after runtime output before projection write | Replacement rebuilds projection from durable events; missing event output is retried only through idempotent command semantics. |
| Worker crash after object upload before metadata write | Object remains invisible until metadata write; orphan cleanup can remove it. |
| Worker loses lease then writes output | Control plane rejects the write by lease token and records stale-owner evidence. |
| Scheduler double-fire | Claim plus workflow-run creation happens atomically; one scheduler wins. |
| Gateway prompt with no capacity | Command remains queued or returns capacity_unavailable according to profile policy. |
| BYOK becomes invalid mid-run | Worker fails the command with a redacted provider/credential state; no plaintext is surfaced. |
Recovery is driven by durable state. Workers do not infer completion from local process memory.
Secret Access
Secrets are least-privilege:
- BYOK plaintext reveal is worker-role-only and tenant/provider/session bounded
- provider keys enter OpenCode through runtime config provider options, never ambient process.env
- object-store credentials are scoped to tenant/session/run prefixes where the provider supports it
- worker credentials cannot call tenant admin APIs
- Gateway service tokens cannot reveal BYOK secrets
- Browser and Desktop cloud clients receive only secret metadata, status, and policy verdicts
Diagnostics, logs, audit records, usage records, launch reports, and support bundles redact tokens, provider keys, signed URLs, local paths, headers, cookies, and raw attachment payloads where policy requires.
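A redaction pass over diagnostic payloads might look like the sketch below. The key list, the bearer-token pattern, and the `redact` helper are illustrative assumptions; a real redactor would be driven by policy and cover more shapes (signed URLs, local paths, attachment payloads).

```python
import re

# Hypothetical deny-list of field names whose values are always masked.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "api_key", "token", "signed_url"}
BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")
MASK = "[REDACTED]"


def redact(value):
    """Return a copy of value with sensitive fields and bearer tokens masked."""
    if isinstance(value, dict):
        return {
            k: MASK if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return BEARER.sub("Bearer " + MASK, value)
    return value
```

The function returns a new structure instead of mutating its input, so the unredacted payload never needs to be passed on to a logger or bundle writer.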
Capacity And Quota Model
The service plane enforces limits before expensive work starts:
- max concurrent managed sessions per org
- max concurrent workflow runs per org
- max workers per pool
- max queue depth per org/pool
- prompt and workflow run rate limits
- worker-minute usage
- provider/model quota gates
- billing entitlement gates for hosted mode
Self-host mode can run with the billing adapter disabled or stubbed. Quotas are still useful for abuse containment and runaway-cost prevention.
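An admission check that runs before work is queued could be sketched as follows. Only three of the listed limits are modelled, and the `OrgLimits`, `OrgUsage`, and `admit` names, along with the reason codes, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgLimits:
    max_concurrent_sessions: int
    max_concurrent_workflow_runs: int
    max_queue_depth: int


@dataclass(frozen=True)
class OrgUsage:
    active_sessions: int
    active_workflow_runs: int
    queue_depth: int


def admit(kind: str, limits: OrgLimits, usage: OrgUsage) -> tuple[bool, str]:
    """Return (admitted, reason_code) before any expensive work starts."""
    if usage.queue_depth >= limits.max_queue_depth:
        return False, "queue_depth_exceeded"
    if kind == "session" and usage.active_sessions >= limits.max_concurrent_sessions:
        return False, "session_concurrency_exceeded"
    if kind == "workflow_run" and usage.active_workflow_runs >= limits.max_concurrent_workflow_runs:
        return False, "workflow_concurrency_exceeded"
    return True, "admitted"
```

Returning a reason code rather than a bare boolean gives the audit trail something to record for quota and capacity denials.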
Audit Events
Audit records are emitted for:
- worker pool created, updated, paused, resumed, retired
- worker registered, activated, paused, resumed, draining, retired, revoked, unhealthy
- worker credential issued, rotated, expired, revoked
- heartbeat rejected
- work claimed, renewed, completed, failed, retried, dead-lettered
- stale-owner write rejected
- BYOK reveal attempted, succeeded, failed
- checkpoint saved, restored, rejected
- artifact metadata written, rejected, cleaned up
- quota, entitlement, and capacity denials
- emergency operator actions
Audit payloads include ids and reason codes, not raw prompt text or secrets.
Operations Skeleton
Public runbooks must cover:
- worker pool creation
- worker registration and activation
- credential rotation
- pause, drain, resume, retire
- emergency revoke
- rolling update and rollback
- stuck queue
- stale lease spike
- worker crash loop
- BYOK reveal failures
- checkpoint/artifact restore failure
- scheduler double-fire investigation
- Gateway delivery lag caused by worker backlog
- tenant offboarding
- suspected key exposure
Public docs define the evidence shape only. Real project ids, account ids, domains, customer names, prices, support rosters, and incident evidence belong in downstream/private operations repositories.
Compatibility Policy
Workers declare:
- Open Cowork version
- OpenCode runtime version
- service-plane protocol version
- runtime capability flags
- supported checkpoint schema version
- supported event/projection contract version
The control plane can reject, pause, or drain incompatible workers. Rolling updates drain workers first; revoke is reserved for emergency response.
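A compatibility gate over the declared versions might be sketched like this. Which mismatch maps to reject versus drain is not specified in this document, so the mapping below, the field names, and the `evaluate_worker` helper are all assumptions.

```python
def evaluate_worker(declared: dict, policy: dict) -> str:
    """Return 'accept', 'drain', or 'reject' for a worker's declared versions."""
    # Assumed rule: a worker that cannot speak the protocol is refused outright.
    if declared["protocol_version"] not in policy["protocol_versions"]:
        return "reject"
    missing = set(policy["required_capabilities"]) - set(declared["capabilities"])
    if missing:
        return "reject"
    # Assumed rule: contract drift lets current leases finish but blocks new claims.
    if declared["checkpoint_schema"] not in policy["checkpoint_schemas"]:
        return "drain"
    if declared["event_contract"] not in policy["event_contracts"]:
        return "drain"
    return "accept"
```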
Phase 5 Operations Contract
Production worker operations are template-based. Public repo artifacts define the safe shape; real provider values and customer evidence live in downstream private operations repositories.
Supported deployment modes:
- self_hosted: the same organization operates Cloud, workers, scheduler, object storage, Postgres, and Gateway. Billing can be none or stub.
- saas_operated: the managed Open Cowork operator runs Cloud and workers for BYOK customers. Public templates define the evidence format; private repos hold real project ids, domains, customer data, prices, and go/no-go reports.
- customer_hosted: deferred from v1. Do not connect customer-hosted workers to a separate managed SaaS control plane until a separate trust, update, networking, liability, and data-residency review is complete.
Required deployer artifacts:
- config template for Cloud web, worker, scheduler, object store, secret adapter, auth, observability, quotas, and billing mode
- secret inventory with refs only, not plaintext
- network requirements for private Postgres/object store/KMS and public Web ingress
- worker pool sizing guidance and queue/claim-based scaling policy
- rolling update, drain, rollback, and emergency revoke workflow
- SLO/alert template and dashboard mapping
- backup/restore drill with Postgres and object-store consistency checks
- release evidence template with image digest/checksum/signature and compatibility decision
The concrete public templates live under deploy/managed-workers/.
Deployment Modes
| Mode | Required proof | Failure behavior |
|---|---|---|
| Self-hosted internal pool | pnpm deploy:validate, pnpm ops:validate, one worker smoke, one restore rehearsal | keep reads available, scale worker to zero, recover through durable claims |
| SaaS-operated pool | release evidence, SLO evidence, restore drill, support/on-call path, BYOK redaction proof | pause/drain affected pool, preserve tenant reads/exports, revoke compromised workers |
| Customer-hosted pool | unsupported in v1 | fail closed in config/docs until trust review is done |
Rolling Updates
Worker updates must preserve durable ownership:
- Set the worker or pool to draining.
- Wait for current load to reach zero or for approved active commands to checkpoint.
- Confirm queue age, claim latency, BYOK reveal errors, object-store errors, stale-owner rejections, and dead letters are within SLO.
- Roll the worker image by immutable digest or release tag with maxUnavailable=0, maxSurge=1, and a termination grace at least as long as OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS.
- Confirm new heartbeats report the expected Open Cowork version, OpenCode runtime version, service-plane protocol, event/projection contract, and checkpoint schema.
- Resume the pool and run a bounded session prompt, workflow, checkpoint, and Gateway-originated prompt smoke where applicable.
During process shutdown, the worker role waits for the active command loop to finish, up to the configured shutdown grace. Operators should still drain before termination; the grace window is a last line of defense, not the primary rollout mechanism.
Rollback And Emergency Revoke
Rollback is image-based. Additive schema changes stay in place and are fixed forward. Roll back workers first when OpenCode execution, BYOK injection, checkpointing, or provider/model execution regresses. Roll back web or scheduler only when their own route, projection, auth, or claim behavior is the failing surface.
Emergency revoke is for suspected worker token, host, image, runtime, BYOK, or object-store compromise:
- Revoke the worker credential.
- Mark the worker revoked.
- Stop the worker host or deployment.
- Let existing leases expire or be reaped; do not hand-edit command/session records.
- Start a replacement worker from a known-good image and verify stale-owner writes are rejected.
- Preserve redacted audit events, heartbeat rejections, metrics, and diagnostics for incident review.
SLO And Alert Template
Operators should define concrete targets per environment. Public examples should remain generic:
| Signal | Suggested private-beta starting point | Alert trigger |
|---|---|---|
| Worker heartbeat age | p95 under 60s | no active worker heartbeat for 2 minutes |
| Queue age | p95 under 5 minutes | oldest command over 10 minutes |
| Claim latency | p95 under 5 seconds | p95 over 30 seconds |
| Command latency | p95 under 10 minutes | p95 over 30 minutes |
| Workflow latency | p95 under schedule interval + 10 minutes | due workflows do not start |
| Projection lag | latest projection within 25 events | lag keeps growing |
| Checkpoint failure rate | under 1 percent | sustained failures |
| Object-store error rate | under 1 percent | any sustained write failure |
| BYOK reveal failure rate | under 1 percent | sustained failures by org/provider |
| Stale lease reclaim count | near zero outside drills | spike after release |
| Dead-letter count | zero | any new dead letter |
| Quota denial count | expected under load tests | unexpected spike |
| Auth failure count | low and bounded | spike by IP/org/token |
| Gateway delivery lag | p95 under 5 minutes | lag caused by worker backlog |
Metrics, alert rules, and dashboard starter assets live in deploy/observability/; managed-worker-slo-template.json defines the public-safe SLO evidence shape. Run pnpm ops:validate when these artifacts change.
Backup And Restore
The restore order is fixed:
- Scale workers, scheduler, and Gateway to zero.
- Restore Postgres control-plane records first.
- Restore object-store artifacts/checkpoints to the same point in time.
- Start web only and verify sessions, projections, workflows, BYOK metadata, worker records, audit rows, metrics, and diagnostics.
- Start one worker and run a bounded smoke prompt with checkpoint save.
- Start scheduler and verify a due workflow claim without double-fire.
- Start Gateway and verify delivery cursors resume without duplicates.
Restore reports must prove checkpoint/artifact metadata matches restored blobs, session projection replay/repair works, workflow run status is consistent, and BYOK secret refs remain valid without exporting plaintext.
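The metadata-versus-blob part of a restore report can be sketched as a simple reconciliation. This assumes metadata rows carry `key`, `size`, and a SHA-256 `checksum`, and that restored blobs are available as a key-to-bytes mapping; `check_restore` and the problem codes are hypothetical.

```python
import hashlib


def check_restore(metadata_rows: list, blobs: dict) -> tuple[list, list]:
    """Compare restored metadata with restored blobs.

    Returns (problems, orphans): problems lists (key, code) for metadata whose
    blob is missing or does not match; orphans lists blob keys with no metadata.
    """
    problems = []
    for row in metadata_rows:
        body = blobs.get(row["key"])
        if body is None:
            problems.append((row["key"], "missing_blob"))
        elif len(body) != row["size"]:
            problems.append((row["key"], "size_mismatch"))
        elif hashlib.sha256(body).hexdigest() != row["checksum"]:
            problems.append((row["key"], "checksum_mismatch"))
    known = {row["key"] for row in metadata_rows}
    orphans = sorted(key for key in blobs if key not in known)
    return problems, orphans
```

An empty `problems` list is the evidence a restore report needs for this check; orphans are expected after crashes and are candidates for cleanup, not failures.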
Threat Model
| Threat | Boundary affected | Mitigation | Evidence required | Residual risk |
|---|---|---|---|---|
| Worker token compromise | worker to control plane | scoped expiring credentials, token hash storage, revoke, audit, no admin API scope | credential revoke tests, heartbeat rejection tests, audit assertions | active work may continue until revoke/lease expiry |
| Stale worker writes after lease expiry | worker to store/object metadata | lease-token fencing on every write, stale-owner audit | Postgres concurrency tests, stale projection/checkpoint write tests | object payload orphan cleanup may be needed |
| BYOK plaintext leakage | worker to secret adapter/runtime | worker-role-only reveal, AAD-bound ciphertext, provider options not env, redaction tests | BYOK boundary tests, diagnostics/log scans | compromised worker during active reveal can misuse key until revoked |
| Object-store prefix escape | worker to object store | scoped key builder, traversal rejection, prefix-limited credentials where available | object key policy tests, restore tamper tests | object stores without scoped credentials rely on app policy |
| Checkpoint corruption | object store to runtime restore | manifest checksums, schema version, tenant/session binding, restore validation | restore tamper tests, backup/restore drill | old valid checkpoint may contain user bug or bad project state |
| Tenant crossover | control plane and object store | tenant keys on every record, tenant-scoped queries, object prefix isolation | tenant isolation API/store tests | operator-level accounts remain high-trust |
| Gateway/channel impersonation | gateway to Cloud API | service token authenticates gateway only; actor identity resolves separately; approval RBAC uses actor membership | gateway identity and interaction tests | compromised channel account can act as that user until revoked |
| Malicious webhook-triggered work | public webhook to scheduler/workflow | mandatory HMAC/timestamp/replay claim, workflow policy, quota gates | webhook replay/rate tests, workflow policy tests | signed secret compromise enables authorized trigger until rotated |
| Scheduler double-fire | scheduler to workflow store | atomic due-run claim plus run creation, idempotency keys | scheduler concurrency tests | clock skew can delay work but must not duplicate it |
| Operator endpoint exposure | operator API to global state | separate operator auth/private networking, no tenant-user access | operator auth tests, deployment validators | misconfigured private networking remains operator responsibility |
| Diagnostic/log leakage | all surfaces | centralized redaction, no raw prompt/secret in usage/audit, bundle redaction | redaction tests, support bundle tests | novel provider field names may need redaction updates |
| Worker image/version compromise | release to worker host | signed/checksummed images, compatibility gate, revoke/drain rollback | release validator, version rejection tests | running compromised image can act within scoped worker privileges |
| Customer-hosted worker trust ambiguity | managed control plane to external worker | deferred from v1, separate design required before enablement | explicit unsupported config tests/docs | customers needing this mode wait for later phase |
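For the webhook row in the table above, the HMAC, timestamp, and replay checks could be combined roughly as follows. The signed message format, the 300-second skew window, the in-memory replay set, and the `sign` and `verify_webhook` helpers are assumptions for illustration; the document does not define the actual wire format.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sign '<timestamp>.<body>' with HMAC-SHA256 (assumed message format)."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: int, body: bytes, signature: str,
                   now: int, seen: set, max_skew: int = 300) -> str:
    """Return 'ok' or a rejection code; seen stands in for a durable replay claim."""
    if abs(now - timestamp) > max_skew:
        return "stale_timestamp"
    if not hmac.compare_digest(sign(secret, timestamp, body), signature):
        return "bad_signature"
    replay_key = (timestamp, signature)
    if replay_key in seen:
        return "replay"
    seen.add(replay_key)
    return "ok"
```

In a real deployment the replay claim would be a durable, atomic insert in the control plane, not a process-local set.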
Phase Readiness
Phase 1 implemented worker identity/lifecycle. It does not by itself make work claiming safe.
Phase 2 implements claims/fencing/recovery and is the correctness gate for workflow execution, gateway execution, quotas, and operations. Real Postgres concurrency tests must stay green for every change to leases, claims, queues, workflow due-run claims, or stale-owner rejection.
Phase 5 is complete only when public deployer templates, runbooks, SLO/alert templates, restore drill templates, release evidence templates, and validators prove the worker path can be deployed, drained, rolled forward, rolled back, emergency-revoked, and restored without committing private managed-SaaS values to the public repository.