Managed Worker Service Plane

The managed worker service plane is the execution-capacity layer for cloud work that should continue without a user's desktop staying online. It composes the existing Open Cowork Cloud control plane and the OpenCode runtime. It is not a second agent runtime, session store, scheduler, tool system, MCP host, or approval engine.

Decision

V1 supports control-plane-owned worker pools:

managed SaaS workers operated by the Open Cowork Cloud operator for hosted BYOK deployments
fully self-hosted internal worker pools operated by the same organization that owns the Cloud control plane

V1 does not support customer-hosted workers connecting to a separate managed SaaS control plane. That mode needs a separate trust, liability, networking, update, and data-residency review because the worker would hold scoped execution credentials while running outside the control-plane operator's boundary.

This choice keeps the first implementation concrete: the same operator owns the Cloud API, Postgres control plane, object store, secret adapter, worker images, network policy, and incident response.

Ownership Boundary

OpenCode owns execution:

sessions and child sessions
runtime event streaming
tool and MCP execution semantics
approvals and questions
native skills and native provider auth behavior
provider/model calls through runtime config

Open Cowork owns the service-plane composition:

worker identity, credentials, lifecycle, and heartbeats
work eligibility, claims, leases, fencing, and recovery
tenancy, policy, quotas, entitlements, and audit
object-store artifact and checkpoint metadata
BYOK secret reveal policy and runtime-config injection
Web, Desktop cloud workspace, and Gateway projections
deployment, update, restore, and operator runbooks

Only worker/runtime adapter code may import OpenCode runtime surfaces. Browser, Gateway, Desktop renderer, route modules, and control-plane store modules remain product/control-plane code.

Trust Boundary

flowchart LR
    Browser["Browser user<br/>cookie auth"]
    Desktop["Desktop cloud client<br/>bearer auth"]
    Gateway["Gateway service client<br/>service token + actor identity"]

    Web["Cloud web/API role<br/>auth, policy, command writes"]
    Scheduler["Scheduler role<br/>due-run claims"]
    Worker["Managed worker role<br/>lease owner"]
    Runtime["OpenCode runtime process<br/>execution owner"]

    Store["Postgres control plane<br/>commands, leases, events, projections"]
    Objects["Object store<br/>artifacts, checkpoints, snapshots"]
    Secrets["Secret adapter/KMS<br/>BYOK, worker credentials"]
    Operator["Operator/admin<br/>worker lifecycle"]

    Browser --> Web
    Desktop --> Web
    Gateway --> Web
    Operator --> Web
    Web --> Store
    Scheduler --> Store
    Worker --> Store
    Worker --> Secrets
    Worker --> Objects
    Worker --> Runtime
    Runtime --> Worker
    Web --> Objects

    classDef client fill:#dbeafe,stroke:#2563eb,color:#1e3a8a
    classDef control fill:#fef3c7,stroke:#d97706,color:#78350f
    classDef execution fill:#dcfce7,stroke:#16a34a,color:#14532d
    classDef state fill:#ede9fe,stroke:#7c3aed,color:#3b0764

    class Browser,Desktop,Gateway client
    class Web,Scheduler,Operator control
    class Worker,Runtime execution
    class Store,Objects,Secrets state

Clients never talk to OpenCode directly for cloud work. Clients write commands or decisions to the Cloud API. Workers claim eligible work from durable state, run OpenCode, and publish fenced events/projections/checkpoints back to the control plane.

Work Classes

V1 work classes are cloud-only:

Work class	Required inputs	Durable owner
Cloud session command	tenant, session, command id, profile, provider/model, project source or restored workspace	session command record
Manual workflow run	tenant, workflow, run id, agent/profile, trigger actor	workflow run record
Scheduled workflow run	tenant, workflow, schedule trigger, due timestamp, scheduler claim id	workflow run record
Webhook workflow run	tenant, workflow, webhook replay claim, signed request metadata	workflow run record
Gateway prompt	tenant, channel binding, resolved actor, session binding, prompt command	session command record
Artifact/checkpoint write	tenant, session/run, claim token, object metadata, checksum/size	artifact/checkpoint metadata

The worker may restore an approved Git source, uploaded snapshot, or managed workspace checkpoint into an app-managed sandbox before OpenCode starts.

Excluded by default:

local desktop-only threads
arbitrary host-path project directories
unsandboxed local file access
local stdio MCP commands
machine runtime config
provider credentials outside approved BYOK/runtime-config paths
direct Gateway-owned execution
peer-to-peer desktop sync

Gateway Edge Capacity

Standalone Team Gateway can optionally connect to Cloud through the Cloud Gateway Registration contract. This does not change the V1 managed-worker decision: edge execution is allowed only when the Cloud/Gateway trust model is self_hosted_same_operator or saas_operator_managed. Customer-hosted Gateway edge workers connected to a separate managed SaaS control plane remain customer_hosted_managed_saas_deferred.

The registration kind decides the boundary:

Registration kind	Managed-worker relationship
`external_workspace`	Not a worker. Cloud may store redacted Gateway workspace metadata, health, capabilities, cursors, and audit summaries. Gateway remains source of truth for Gateway-owned sessions.
`edge_worker`	Worker-like capacity. Gateway claims only eligible Cloud-owned work and writes Cloud-owned output with managed-worker lease-token fencing.
`external_workspace_edge_worker`	Both lanes. Gateway-owned work stays Gateway-owned; Cloud-owned work uses the managed-worker claim/fencing path.

Edge Gateway credentials are distinct from Cloud Channel Gateway service tokens and from human/admin credentials. They are scoped to registration heartbeat, capability advertisement, optional metadata sync, and, when enabled, edge work claim/renew/fenced-output operations. They cannot call BYOK read/reveal APIs, billing APIs, tenant admin APIs, Desktop APIs, or operator APIs.

Cloud must never merge Gateway Postgres with Cloud Postgres. Cloud-owned edge work uses Cloud command, lease, event, projection, artifact, checkpoint, usage, and audit records. Gateway-owned external-workspace work uses Gateway records, with only explicitly allowed redacted metadata syncing to Cloud.
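
The registration-kind and trust-model rules above can be read as a single predicate. The following is a minimal sketch under that reading; the function name `mayClaimCloudWork` is hypothetical and not part of the documented contract, while the string values are the identifiers used in this section.

```typescript
// Registration kinds and trust models as named in this section.
type RegistrationKind =
  | "external_workspace"
  | "edge_worker"
  | "external_workspace_edge_worker";

type TrustModel =
  | "self_hosted_same_operator"
  | "saas_operator_managed"
  | "customer_hosted_managed_saas_deferred";

// Hypothetical helper: a Gateway may claim Cloud-owned work only when its
// registration includes the edge-worker lane AND the trust model is supported.
function mayClaimCloudWork(kind: RegistrationKind, trust: TrustModel): boolean {
  const hasEdgeLane =
    kind === "edge_worker" || kind === "external_workspace_edge_worker";
  const supportedTrust =
    trust === "self_hosted_same_operator" || trust === "saas_operator_managed";
  return hasEdgeLane && supportedTrust;
}
```

Keeping the check in one place makes the deferred customer-hosted mode fail closed regardless of registration kind.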

Worker Lifecycle

Worker records use these states:

State	Meaning	Allowed next states
`pending`	Registration exists but the worker is not trusted to claim work.	`active`, `revoked`
`active`	Worker can heartbeat, claim eligible work, renew leases, and write fenced output.	`draining`, `paused`, `unhealthy`, `retired`, `revoked`
`draining`	Worker keeps renewing active leases but cannot claim new work.	`active`, `retired`, `revoked`, `unhealthy`
`paused`	Worker cannot claim or renew work until resumed by an admin/operator.	`active`, `retired`, `revoked`
`unhealthy`	Control plane has detected stale heartbeat or repeated failures.	`active`, `draining`, `retired`, `revoked`
`retired`	Worker exited intentionally after drain. It cannot claim work again.	terminal
`revoked`	Credential or worker was emergency-blocked. It cannot heartbeat, claim, renew, or write.	terminal

Lifecycle transitions emit audit events and are role-checked. Tenant admins see tenant-scoped summaries. Operators see cross-pool health only through operator auth or private networking.
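
The state table above can be encoded as a transition map so that illegal moves are rejected in one place. This is an illustrative sketch only: the names `ALLOWED_TRANSITIONS`, `canTransition`, and `transition` are assumptions, and audit emission and role checks are left out.

```typescript
// Worker lifecycle states, copied from the table above.
type WorkerState =
  | "pending" | "active" | "draining" | "paused"
  | "unhealthy" | "retired" | "revoked";

// Allowed next states per current state; terminal states map to an empty list.
const ALLOWED_TRANSITIONS: Record<WorkerState, readonly WorkerState[]> = {
  pending: ["active", "revoked"],
  active: ["draining", "paused", "unhealthy", "retired", "revoked"],
  draining: ["active", "retired", "revoked", "unhealthy"],
  paused: ["active", "retired", "revoked"],
  unhealthy: ["active", "draining", "retired", "revoked"],
  retired: [],
  revoked: [],
};

// Hypothetical helper: true when the table permits from -> to.
function canTransition(from: WorkerState, to: WorkerState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Hypothetical helper: applies a transition or throws, so callers cannot skip the check.
function transition(from: WorkerState, to: WorkerState): WorkerState {
  if (!canTransition(from, to)) {
    throw new Error(`illegal worker transition: ${from} -> ${to}`);
  }
  return to;
}
```

For example, `transition("active", "draining")` returns `"draining"`, while `transition("retired", "active")` throws because `retired` is terminal.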

Enrollment And Credentials

Worker enrollment is explicit:

An operator or tenant admin creates a worker pool.
A worker registration record is created in pending state.
The control plane issues a credential and returns its raw value once. Only the token hash is stored after issuance.
The worker starts with that credential and calls heartbeat.
An authorized admin/operator activates the worker, or policy auto-activates it in self-host mode.

Worker credentials are:

scoped to worker id, pool id, tenant id where tenant-scoped, and allowed operations
expiring, rotatable, and revocable
stored hash-only in the control plane
never returned after initial issuance
never accepted for broad tenant admin APIs

The worker principal contains workerId, poolId, tenant scope, credential id, scopes, expiry, and status. It does not inherit user/admin authority.
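
One way to picture the worker principal is as a plain record plus a narrow authorization check. The sketch below is an assumption-heavy illustration: the field types, the scope string format, the reading of `status` as the credential status, and the `authorize` function are all hypothetical; only the field list comes from the paragraph above.

```typescript
// Field names follow the prose above; concrete types and scope strings are assumptions.
interface WorkerPrincipal {
  workerId: string;
  poolId: string;
  tenantId: string | null; // null for pools that are not tenant-scoped
  credentialId: string;
  scopes: readonly string[];
  expiresAt: number; // epoch milliseconds
  status: "active" | "revoked" | "expired"; // assumed to mean credential status
}

// Hypothetical check: the principal speaks only for its own worker, only while
// the credential is active and unexpired, and only for scopes it was issued.
function authorize(
  principal: WorkerPrincipal,
  targetWorkerId: string,
  scope: string,
  now: number,
): boolean {
  if (principal.status !== "active") return false;
  if (now >= principal.expiresAt) return false;
  if (principal.workerId !== targetWorkerId) return false;
  return principal.scopes.indexOf(scope) !== -1;
}
```

Because the check only consults the principal's own scopes, there is no code path through which user or admin authority could be inherited.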

Phase 1 Control Plane Surface

Phase 1 implements worker identity and lifecycle only. It intentionally does not implement work claiming or execution routing.

Admin-managed endpoints:

GET /api/admin/worker-pools
POST /api/admin/worker-pools
POST /api/admin/worker-pools/{poolId}/update
GET /api/admin/workers
POST /api/admin/workers
GET /api/admin/workers/{workerId}
POST /api/admin/workers/{workerId}/activate
POST /api/admin/workers/{workerId}/pause
POST /api/admin/workers/{workerId}/resume
POST /api/admin/workers/{workerId}/drain
POST /api/admin/workers/{workerId}/retire
POST /api/admin/workers/{workerId}/revoke
GET /api/admin/workers/{workerId}/credentials
POST /api/admin/workers/{workerId}/credentials
POST /api/admin/workers/{workerId}/credentials/{credentialId}/rotate
POST /api/admin/workers/{workerId}/credentials/{credentialId}/revoke
GET /api/admin/workers/{workerId}/heartbeats

Worker self endpoint:

POST /api/workers/{workerId}/heartbeat

In Phase 1, worker credentials authenticate only the worker self endpoint. They cannot call tenant admin APIs, Desktop APIs, Gateway APIs, BYOK APIs, billing APIs, or work-claim APIs. Raw worker credential values are returned once at issuance and are never returned by list/detail APIs.

Lease And Fencing Contract

Every claimable work unit must carry or reference:

tenant_id
work_type
work_id
session_id
workflow_id
workflow_run_id
status
priority
available_at
leased_by
lease_expires_at
lease_token
checkpoint_version
last_heartbeat_at
claimed_at
completed_at
failed_at
idempotency key or command/run sequence

Claiming work is a single transaction:

Select eligible work for the worker's tenant, pool, capabilities, profile, provider, quota, and entitlement.
Verify worker status and credential validity.
Assign leased_by, lease_token, and lease_expires_at.
Increment attempt and checkpoint metadata.
Return the claimed payload.

No database transaction may remain open while OpenCode runs.
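
The claim steps above can be sketched against an in-memory queue to show the shape of the operation. This is not the real store code: a production implementation would perform the same steps inside one database transaction, and the names `claimNext`, `ClaimingWorker`, the `attempt` counter, and the counter-based lease token are assumptions made for illustration.

```typescript
interface WorkUnit {
  work_id: string;
  tenant_id: string;
  status: "queued" | "leased";
  available_at: number; // epoch milliseconds
  leased_by: string | null;
  lease_token: string | null;
  lease_expires_at: number | null;
  attempt: number; // assumed counter, not named in the field list above
}

interface ClaimingWorker {
  workerId: string;
  tenantId: string;
  state: string;
  credentialValid: boolean;
}

// Stand-in for token generation; a real lease token should be unguessable.
let leaseCounter = 0;

// Hypothetical claim: verifies the worker, picks the first eligible unit,
// assigns the lease fields, bumps the attempt counter, and returns the payload.
function claimNext(
  queue: WorkUnit[],
  worker: ClaimingWorker,
  now: number,
  leaseMs: number,
): WorkUnit | null {
  if (worker.state !== "active" || !worker.credentialValid) return null;
  for (const unit of queue) {
    const eligible =
      unit.status === "queued" &&
      unit.tenant_id === worker.tenantId &&
      unit.available_at <= now;
    if (!eligible) continue;
    leaseCounter += 1;
    unit.status = "leased";
    unit.leased_by = worker.workerId;
    unit.lease_token = `lease-${leaseCounter}`;
    unit.lease_expires_at = now + leaseMs;
    unit.attempt += 1;
    return unit;
  }
  return null;
}
```

The function returns as soon as the lease is assigned, which mirrors the rule that no transaction stays open while OpenCode runs: execution happens after the claim has committed.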

Every worker-produced write includes the active lease_token. The control plane rejects stale-owner writes for:

events
projections
session command status
workflow run status
workflow finalization
checkpoint metadata
object-store artifact metadata
gateway/channel delivery records derived from worker output
execution usage/metering records

Fencing is mandatory even when there is one worker replica. It is the mechanism that makes failover safe once the topology scales.
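
A fenced write reduces to comparing the presented lease token with the token currently recorded on the work unit. The sketch below is a hypothetical illustration of that comparison; `checkFencedWrite` is not a documented function, and treating an expired but not yet reaped lease as stale is an assumption rather than something this page states.

```typescript
interface LeaseRecord {
  lease_token: string | null;
  lease_expires_at: number | null; // epoch milliseconds
}

type WriteVerdict = "accepted" | "stale_owner";

// Hypothetical fencing check applied before any worker-produced write.
function checkFencedWrite(
  current: LeaseRecord,
  presentedToken: string,
  now: number,
): WriteVerdict {
  // No active lease on record: nobody owns this work right now.
  if (current.lease_token === null || current.lease_expires_at === null) {
    return "stale_owner";
  }
  // Another claim replaced the token: the writer lost ownership.
  if (current.lease_token !== presentedToken) return "stale_owner";
  // Assumption: an expired lease is treated as lost even before it is reaped.
  if (now >= current.lease_expires_at) return "stale_owner";
  return "accepted";
}
```

With a single replica the check always passes in normal operation, but keeping it on every write means a second replica or a restarted worker cannot silently overwrite newer state.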

Checkpoint And Artifact Ownership

Workers may write object payloads only through scoped object-store adapters. Object metadata is durable control-plane state and must be written with the active lease token.

Rules:

object keys are generated by the control plane or scoped helper, not by raw runtime paths
object metadata includes tenant, session/run id, claim token, size, checksum, content type, and retention class
artifact bodies are downloadable only through authorized API routes or signed URLs with bounded TTL
checkpoint restores validate manifest checksums before runtime use
a worker crash after object upload but before metadata write leaves an orphan that cleanup can delete; it does not expose the object to clients
a stale worker cannot overwrite checkpoint metadata after lease loss
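
The first rule, that object keys come from a scoped helper rather than raw runtime paths, might look like the following. The key layout `tenants/<tenant>/sessions/<session>/<name>` and the function `buildObjectKey` are purely illustrative assumptions; the real key scheme is not specified on this page.

```typescript
// Only these characters are allowed in a key segment in this sketch.
const SAFE_SEGMENT = /^[A-Za-z0-9._-]+$/;

// Hypothetical scoped key builder: every segment is validated, so a runtime
// path such as "../other-tenant/file" can never become part of a key.
function buildObjectKey(tenantId: string, sessionId: string, name: string): string {
  for (const segment of [tenantId, sessionId, name]) {
    if (!SAFE_SEGMENT.test(segment) || segment === "." || segment === "..") {
      throw new Error(`unsafe object key segment: ${JSON.stringify(segment)}`);
    }
  }
  return `tenants/${tenantId}/sessions/${sessionId}/${name}`;
}
```

Rejecting separators and dot segments outright is stricter than normalizing them, which keeps the tenant and session prefix of every key predictable.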

Heartbeats And Liveness

Workers heartbeat with:

worker id and pool id
version and runtime compatibility
capabilities
region or deployment label
current load
active work ids
last error code and redacted summary
monotonic heartbeat sequence where available

Heartbeat acceptance requires an active, unexpired, non-revoked worker credential. Heartbeats do not grant admin powers and cannot mutate pool policy.

Liveness policy:

active workers renew leases before lease_expires_at
missed heartbeat moves workers to unhealthy
expired leases become recoverable work
draining workers renew current leases but do not claim new work
paused or revoked workers cannot renew leases
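
The liveness policy can be illustrated with two small helpers: one that decides whether a heartbeat is stale and one that returns expired leases to the queue. Both are hypothetical sketches; the staleness threshold is an input because this page does not fix a value, and clearing the lease fields on reap is an assumed detail.

```typescript
interface LeasedWork {
  work_id: string;
  status: "queued" | "leased";
  leased_by: string | null;
  lease_token: string | null;
  lease_expires_at: number | null; // epoch milliseconds
}

// Hypothetical staleness test; maxAgeMs is an operator-chosen threshold.
function isHeartbeatStale(lastHeartbeatAt: number, now: number, maxAgeMs: number): boolean {
  return now - lastHeartbeatAt > maxAgeMs;
}

// Hypothetical reaper: expired leases become recoverable work. Clearing the
// token means any late write from the previous owner fails the fencing check.
function reapExpiredLeases(units: LeasedWork[], now: number): string[] {
  const reaped: string[] = [];
  for (const unit of units) {
    const expired =
      unit.status === "leased" &&
      unit.lease_expires_at !== null &&
      unit.lease_expires_at <= now;
    if (!expired) continue;
    unit.status = "queued";
    unit.leased_by = null;
    unit.lease_token = null;
    unit.lease_expires_at = null;
    reaped.push(unit.work_id);
  }
  return reaped;
}
```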

Recovery Rules

Failure	Required behavior
Worker crash before execution	Lease expires; reaper makes work eligible for retry.
Worker crash during OpenCode execution	Lease expires; replacement restores checkpoint or restarts from durable command state.
Worker crash after runtime output before projection write	Replacement rebuilds projection from durable events; missing event output is retried only through idempotent command semantics.
Worker crash after object upload before metadata write	Object remains invisible until metadata write; orphan cleanup can remove it.
Worker loses lease then writes output	Control plane rejects the write by lease token and records stale-owner evidence.
Scheduler double-fire	Claim plus workflow-run creation happens atomically; one scheduler wins.
Gateway prompt with no capacity	Command remains queued or returns `capacity_unavailable` according to profile policy.
BYOK becomes invalid mid-run	Worker fails the command with a redacted provider/credential state; no plaintext is surfaced.

Recovery is driven by durable state. Workers do not infer completion from local process memory.

Secret Access

Secrets are least-privilege:

BYOK plaintext reveal is worker-role-only and tenant/provider/session bounded
provider keys enter OpenCode through runtime config provider options, never ambient process.env
object-store credentials are scoped to tenant/session/run prefixes where the provider supports it
worker credentials cannot call tenant admin APIs
Gateway service tokens cannot reveal BYOK secrets
Browser and Desktop cloud clients receive only secret metadata, status, and policy verdicts

Diagnostics, logs, audit records, usage records, launch reports, and support bundles redact tokens, provider keys, signed URLs, local paths, headers, cookies, and raw attachment payloads where policy requires.
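
A centralized redaction pass over structured records could be sketched as a recursive walk that masks values under sensitive-looking keys. The key pattern, the `[REDACTED]` marker, and the `redact` function are assumptions for illustration; the actual redaction rules and field list are defined by policy, not by this example.

```typescript
// Assumed pattern for sensitive field names; real policy may differ.
const SENSITIVE_KEY = /token|secret|api[_-]?key|authorization|cookie|password|signed[_-]?url/i;

// Hypothetical redactor: returns a copy with sensitive values masked.
// The input object is never mutated.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value;
}
```

Key-name matching is deliberately broad in this sketch, so a field such as `lease_token` would also be masked; a real implementation would need an explicit policy for identifiers that are safe to log.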

Capacity And Quota Model

The service plane enforces limits before expensive work starts:

max concurrent managed sessions per org
max concurrent workflow runs per org
max workers per pool
max queue depth per org/pool
prompt and workflow run rate limits
worker-minute usage
provider/model quota gates
billing entitlement gates for hosted mode

Self-host mode can run with the billing adapter disabled or stubbed. Quotas are still useful for abuse containment and runaway-cost prevention.
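
Enforcing limits before expensive work starts amounts to an admission check at enqueue or claim time. The sketch below covers three of the listed limits; the type names, the `admitWork` function, and the reason-code strings are hypothetical.

```typescript
interface OrgLimits {
  maxConcurrentSessions: number;
  maxConcurrentWorkflowRuns: number;
  maxQueueDepth: number;
}

interface OrgUsage {
  concurrentSessions: number;
  concurrentWorkflowRuns: number;
  queueDepth: number;
}

type Admission = { admitted: true } | { admitted: false; reason: string };

// Hypothetical admission gate: denies with a reason code instead of starting work.
function admitWork(
  kind: "session" | "workflow_run",
  limits: OrgLimits,
  usage: OrgUsage,
): Admission {
  if (usage.queueDepth >= limits.maxQueueDepth) {
    return { admitted: false, reason: "queue_depth_exceeded" };
  }
  if (kind === "session" && usage.concurrentSessions >= limits.maxConcurrentSessions) {
    return { admitted: false, reason: "concurrent_sessions_exceeded" };
  }
  if (kind === "workflow_run" && usage.concurrentWorkflowRuns >= limits.maxConcurrentWorkflowRuns) {
    return { admitted: false, reason: "concurrent_workflow_runs_exceeded" };
  }
  return { admitted: true };
}
```

Returning a reason code rather than a boolean lets the denial feed directly into audit records without carrying prompt text.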

Audit Events

Audit records are emitted for:

worker pool created, updated, paused, resumed, retired
worker registered, activated, paused, resumed, draining, retired, revoked, unhealthy
worker credential issued, rotated, expired, revoked
heartbeat rejected
work claimed, renewed, completed, failed, retried, dead-lettered
stale-owner write rejected
BYOK reveal attempted, succeeded, failed
checkpoint saved, restored, rejected
artifact metadata written, rejected, cleaned up
quota, entitlement, and capacity denials
emergency operator actions

Audit payloads include ids and reason codes, not raw prompt text or secrets.
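
One way to keep raw text out of audit payloads is to give the record no free-form field at all and to validate that type and reason are short machine codes. The record shape, the code pattern, and `buildAuditEvent` below are assumptions for illustration, not the documented audit schema.

```typescript
// Assumed audit record shape: ids and codes only, no free-form text field.
interface AuditEvent {
  type: string;
  tenantId: string | null;
  actorId: string;
  targetId: string;
  reasonCode: string;
  occurredAt: string; // ISO 8601 timestamp
}

// Assumed format for machine codes: lowercase, digits, underscore, dot.
const CODE_PATTERN = /^[a-z][a-z0-9_.]{0,63}$/;

// Hypothetical builder: rejects anything that looks like prose or a secret
// being smuggled into the type or reason fields.
function buildAuditEvent(
  type: string,
  tenantId: string | null,
  actorId: string,
  targetId: string,
  reasonCode: string,
  now: Date,
): AuditEvent {
  if (!CODE_PATTERN.test(type) || !CODE_PATTERN.test(reasonCode)) {
    throw new Error("audit type and reason must be short machine codes");
  }
  return { type, tenantId, actorId, targetId, reasonCode, occurredAt: now.toISOString() };
}
```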

Operations Skeleton

Public runbooks must cover:

worker pool creation
worker registration and activation
credential rotation
pause, drain, resume, retire
emergency revoke
rolling update and rollback
stuck queue
stale lease spike
worker crash loop
BYOK reveal failures
checkpoint/artifact restore failure
scheduler double-fire investigation
Gateway delivery lag caused by worker backlog
tenant offboarding
suspected key exposure

Public docs define the evidence shape only. Real project ids, account ids, domains, customer names, prices, support rosters, and incident evidence belong in downstream/private operations repositories.

Compatibility Policy

Workers declare:

Open Cowork version
OpenCode runtime version
service-plane protocol version
runtime capability flags
supported checkpoint schema version
supported event/projection contract version

The control plane can reject, pause, or drain incompatible workers. Rolling updates drain first; revoke is reserved for emergency response.
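
A compatibility gate can be sketched as a comparison between a worker's declared versions and the ranges the control plane accepts. Everything here is hypothetical: integer versions, inclusive ranges, and the `findIncompatibilities` function are simplifying assumptions, and the choice between reject, pause, and drain is left to the caller.

```typescript
interface VersionRange {
  min: number;
  max: number; // inclusive
}

interface CompatibilityPolicy {
  protocol: VersionRange;
  checkpointSchema: VersionRange;
  eventContract: VersionRange;
}

interface WorkerDeclaration {
  protocolVersion: number;
  checkpointSchemaVersion: number;
  eventContractVersion: number;
}

function inRange(version: number, range: VersionRange): boolean {
  return version >= range.min && version <= range.max;
}

// Hypothetical check: returns the names of every declared version that falls
// outside policy. An empty list means the worker is compatible.
function findIncompatibilities(
  decl: WorkerDeclaration,
  policy: CompatibilityPolicy,
): string[] {
  const problems: string[] = [];
  if (!inRange(decl.protocolVersion, policy.protocol)) problems.push("protocol");
  if (!inRange(decl.checkpointSchemaVersion, policy.checkpointSchema)) problems.push("checkpoint_schema");
  if (!inRange(decl.eventContractVersion, policy.eventContract)) problems.push("event_contract");
  return problems;
}
```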

Phase 5 Operations Contract

Production worker operations are template-based. Public repo artifacts define the safe shape; real provider values and customer evidence live in downstream private operations repositories.

Supported deployment modes:

self_hosted: the same organization operates Cloud, workers, scheduler, object storage, Postgres, and Gateway. Billing can be none or stub.
saas_operated: the managed Open Cowork operator runs Cloud and workers for BYOK customers. Public templates define the evidence format; private repos hold real project ids, domains, customer data, prices, and go/no-go reports.
customer_hosted: deferred from V1. Do not connect customer-hosted workers to a separate managed SaaS control plane until a separate trust, update, networking, liability, and data-residency review is complete.

Required deployer artifacts:

config template for Cloud web, worker, scheduler, object store, secret adapter, auth, observability, quotas, and billing mode
secret inventory with refs only, not plaintext
network requirements for private Postgres/object store/KMS and public Web ingress
worker pool sizing guidance and queue/claim-based scaling policy
rolling update, drain, rollback, and emergency revoke workflow
SLO/alert template and dashboard mapping
backup/restore drill with Postgres and object-store consistency checks
release evidence template with image digest/checksum/signature and compatibility decision

The concrete public templates live under deploy/managed-workers/.

Deployment Modes

Mode	Required proof	Failure behavior
Self-hosted internal pool	`pnpm deploy:validate`, `pnpm ops:validate`, one worker smoke, one restore rehearsal	keep reads available, scale worker to zero, recover through durable claims
SaaS-operated pool	release evidence, SLO evidence, restore drill, support/on-call path, BYOK redaction proof	pause/drain affected pool, preserve tenant reads/exports, revoke compromised workers
Customer-hosted pool	unsupported in V1	fail closed in config/docs until trust review is done

Rolling Updates

Worker updates must preserve durable ownership:

Set the worker or pool to draining.
Wait for current load to reach zero or for approved active commands to checkpoint.
Confirm queue age, claim latency, BYOK reveal errors, object-store errors, stale-owner rejections, and dead letters are within SLO.
Roll the worker image by immutable digest or release tag with maxUnavailable=0, maxSurge=1, and a termination grace at least as long as OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS.
Confirm new heartbeats report the expected Open Cowork version, OpenCode runtime version, service-plane protocol, event/projection contract, and checkpoint schema.
Resume the pool and run a bounded session prompt, workflow, checkpoint, and Gateway-originated prompt smoke where applicable.

During process shutdown, the worker role waits for the active command loop to finish, up to the configured shutdown grace. Operators should still drain before termination; the grace window is a last line of defense, not the primary rollout mechanism.
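
The rollout parameters in the update step can be checked mechanically before a release proceeds. The following sketch assumes a simple plan object and a `validateRollout` function, both hypothetical; `shutdownGraceMs` stands for the value configured through `OPEN_COWORK_CLOUD_SHUTDOWN_GRACE_MS`.

```typescript
// Hypothetical description of a planned worker rollout.
interface RolloutPlan {
  maxUnavailable: number;
  maxSurge: number;
  terminationGraceMs: number;
}

// Returns a list of problems; an empty list means the plan matches the
// constraints named in the update procedure.
function validateRollout(plan: RolloutPlan, shutdownGraceMs: number): string[] {
  const problems: string[] = [];
  if (plan.maxUnavailable !== 0) problems.push("maxUnavailable must be 0");
  if (plan.maxSurge !== 1) problems.push("maxSurge must be 1");
  if (plan.terminationGraceMs < shutdownGraceMs) {
    problems.push("termination grace is shorter than the worker shutdown grace");
  }
  return problems;
}
```

A check like this guards against the host killing a worker before its own shutdown grace has elapsed, which would otherwise turn a clean finish into a lease expiry and retry.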

Rollback And Emergency Revoke

Rollback is image-based. Additive schema changes stay in place and are fixed forward. Roll back workers first when OpenCode execution, BYOK injection, checkpointing, or provider/model execution regresses. Roll back web or scheduler only when their own route, projection, auth, or claim behavior is the failing surface.

Emergency revoke is for suspected worker token, host, image, runtime, BYOK, or object-store compromise:

Revoke the worker credential.
Mark the worker revoked.
Stop the worker host or deployment.
Let existing leases expire or be reaped; do not hand-edit command/session records.
Start a replacement worker from a known-good image and verify stale-owner writes are rejected.
Preserve redacted audit events, heartbeat rejections, metrics, and diagnostics for incident review.

SLO And Alert Template

Operators should define concrete targets per environment. Public examples should remain generic:

Signal	Suggested private-beta starting point	Alert trigger
Worker heartbeat age	p95 under 60s	no active worker heartbeat for 2 minutes
Queue age	p95 under 5 minutes	oldest command over 10 minutes
Claim latency	p95 under 5 seconds	p95 over 30 seconds
Command latency	p95 under 10 minutes	p95 over 30 minutes
Workflow latency	p95 under schedule interval + 10 minutes	due workflows do not start
Projection lag	latest projection within 25 events	lag keeps growing
Checkpoint failure rate	under 1 percent	sustained failures
Object-store error rate	under 1 percent	any sustained write failure
BYOK reveal failure rate	under 1 percent	sustained failures by org/provider
Stale lease reclaim count	near zero outside drills	spike after release
Dead-letter count	zero	any new dead letter
Quota denial count	expected under load tests	unexpected spike
Auth failure count	low and bounded	spike by IP/org/token
Gateway delivery lag	p95 under 5 minutes	lag caused by worker backlog

Metrics, alert rules, and dashboard starter assets live in deploy/observability/; managed-worker-slo-template.json defines the public-safe SLO evidence shape. Run pnpm ops:validate when these artifacts change.

Backup And Restore

The restore order is fixed:

Scale workers, scheduler, and Gateway to zero.
Restore Postgres control-plane records first.
Restore object-store artifacts/checkpoints to the same point in time.
Start web only and verify sessions, projections, workflows, BYOK metadata, worker records, audit rows, metrics, and diagnostics.
Start one worker and run a bounded smoke prompt with checkpoint save.
Start scheduler and verify a due workflow claim without double-fire.
Start Gateway and verify delivery cursors resume without duplicates.

Restore reports must prove checkpoint/artifact metadata matches restored blobs, session projection replay/repair works, workflow run status is consistent, and BYOK secret refs remain valid without exporting plaintext.

Threat Model

Threat	Boundary affected	Mitigation	Evidence required	Residual risk
Worker token compromise	worker to control plane	scoped expiring credentials, token hash storage, revoke, audit, no admin API scope	credential revoke tests, heartbeat rejection tests, audit assertions	active work may continue until revoke/lease expiry
Stale worker writes after lease expiry	worker to store/object metadata	lease-token fencing on every write, stale-owner audit	Postgres concurrency tests, stale projection/checkpoint write tests	object payload orphan cleanup may be needed
BYOK plaintext leakage	worker to secret adapter/runtime	worker-role-only reveal, AAD-bound ciphertext, provider options not env, redaction tests	BYOK boundary tests, diagnostics/log scans	compromised worker during active reveal can misuse key until revoked
Object-store prefix escape	worker to object store	scoped key builder, traversal rejection, prefix-limited credentials where available	object key policy tests, restore tamper tests	object stores without scoped credentials rely on app policy
Checkpoint corruption	object store to runtime restore	manifest checksums, schema version, tenant/session binding, restore validation	restore tamper tests, backup/restore drill	old valid checkpoint may contain user bug or bad project state
Tenant crossover	control plane and object store	tenant keys on every record, tenant-scoped queries, object prefix isolation	tenant isolation API/store tests	operator-level accounts remain high-trust
Gateway/channel impersonation	gateway to Cloud API	service token authenticates gateway only; actor identity resolves separately; approval RBAC uses actor membership	gateway identity and interaction tests	compromised channel account can act as that user until revoked
Malicious webhook-triggered work	public webhook to scheduler/workflow	mandatory HMAC/timestamp/replay claim, workflow policy, quota gates	webhook replay/rate tests, workflow policy tests	signing secret compromise enables authorized trigger until rotated
Scheduler double-fire	scheduler to workflow store	atomic due-run claim plus run creation, idempotency keys	scheduler concurrency tests	clock skew can delay work but must not duplicate it
Operator endpoint exposure	operator API to global state	separate operator auth/private networking, no tenant-user access	operator auth tests, deployment validators	misconfigured private networking remains operator responsibility
Diagnostic/log leakage	all surfaces	centralized redaction, no raw prompt/secret in usage/audit, bundle redaction	redaction tests, support bundle tests	novel provider field names may need redaction updates
Worker image/version compromise	release to worker host	signed/checksummed images, compatibility gate, revoke/drain rollback	release validator, version rejection tests	running compromised image can act within scoped worker privileges
Customer-hosted worker trust ambiguity	managed control plane to external worker	deferred from V1, separate design required before enablement	explicit unsupported config tests/docs	customers needing this mode wait for later phase

Phase Readiness

Phase 1 implements worker identity and lifecycle only. On its own it does not make work claiming safe.

Phase 2 implements claims/fencing/recovery and is the correctness gate for workflow execution, gateway execution, quotas, and operations. Real Postgres concurrency tests must stay green for every change to leases, claims, queues, workflow due-run claims, or stale-owner rejection.

Phase 5 is complete only when public deployer templates, runbooks, SLO/alert templates, restore drill templates, release evidence templates, and validators prove the worker path can be deployed, drained, rolled forward, rolled back, emergency-revoked, and restored without committing private managed-SaaS values to the public repository.