Open Cowork: Deep Repository Audit (2026-06-24)
Whole-repo audit across security, scalability/performance, code quality, dead code, dead docs, and production readiness, run as five parallel specialist passes over checkpoint/studio-production-audit. Every finding was confirmed by reading the code; file:line cited throughout.
Verdict
This is a mature, security-conscious, high-quality codebase (~200K LOC; ~43 any-casts, 0 @ts-ignore, 5 stale-free TODOs, strict tsc+eslint+knip gate, dual-backend contract tests, SLSA/SBOM/Cosign release supply chain). No CRITICAL security vulnerabilities. Tenant isolation, API-token verification, cookie/CORS/webhook-signature auth, command/SQL/path-injection defense, Electron hardening, and BYOK envelope encryption are all sound.
The material gaps are operational and performance-related, not correctness/security holes. One item is P0 because it blocks CI today.
P0: Do now (blocks CI / release)
- Dependency-audit gate is RED. pnpm audit --prod exits 1: 1 high + 8 moderate + 3 low. CI (.github/workflows/ci.yml:57-61), release, and monthly-maintenance all run it fail-closed, so every PR/release is blocked. The hono@<4.12.21 override (package.json:122) is stale (advisories now need >=4.12.25); no dompurify/js-yaml overrides exist.
  - hono HIGH (CORS reflect-any-origin) + 4 moderate: NOT reachable. hono enters only transitively via apps/desktop → @modelcontextprotocol/sdk → hono; there is no hono/serveStatic import anywhere and the cloud server is raw node:http. Bump for the gate, not for exposure.
  - dompurify ≤3.4.6 (moderate): REACHABLE. Real use in packages/app/src/components/chat/MarkdownContent.tsx:2 + MermaidChart.tsx:2, sanitizing untrusted LLM/agent HTML before dangerouslySetInnerHTML. The IN_PLACE bypass applies; the addHook variants don't (no addHook). Bump dompurify ≥3.4.11. (Mitigated by sandboxed renderer + script-src 'self' CSP, but DOM-clobbering remains.)
  - js-yaml (moderate, quadratic): low reachability (electron-updater, parses trusted update feed).
  - Fix: bump the hono override to >=4.12.25, add dompurify >=3.4.11 + js-yaml overrides, re-run pnpm audit --prod to confirm exit 0.
P1: High (production blockers / real risk)
Resilience
- Cloud-managed OpenCode subprocess has no crash recovery. createDefaultCloudRuntimeFactory never wires onUnexpectedExit (packages/cloud-server/src/app.ts:380-389; undefined at opencode-runtime-adapter.ts:501); the worker-scoped adapter keeps the dead cache entry (no liveness check) until idle-TTL, so promptSession/subscribeEvents hit a dead server with no respawn. SSE also doesn't re-subscribe on stream drop (opencode-runtime-adapter.ts:389-420). Fix: pass an onUnexpectedExit that evicts/closes the runtime entry; surface the crash to in-flight callers.
- Desktop runtime reconnect loop has no max-attempt cap (apps/desktop/src/main/index.ts:356-382): a persistently crashing OpenCode reconnect-loops forever (60s backoff, no circuit breaker). Fix: terminal error after N failures in a window.
- Gateway HTTP server has no socket timeouts (slowloris/DoS). createServer (apps/gateway/src/daemon.ts:134) sets no requestTimeout/headersTimeout/keepAliveTimeout/maxConnections; the body reader (:559-566) caps bytes but not time. This is an internet-facing webhook endpoint. Fix: set request/headers/keepAlive timeouts + max connections.
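The "terminal error after N failures in a window" fix for the desktop reconnect loop can be sketched as a small sliding-window counter. This is a minimal illustration, not the code in apps/desktop/src/main/index.ts: the FailureWindow class, its constructor parameters, and the thresholds in the usage note are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical helper: counts failures inside a sliding time window and
// reports when the cap is reached, so the caller can stop reconnecting.
class FailureWindow {
  private readonly failures: number[] = [];
  private readonly maxFailures: number;
  private readonly windowMs: number;

  constructor(maxFailures: number, windowMs: number) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
  }

  /** Records one failure; returns true once maxFailures fall inside the window. */
  recordFailure(nowMs: number): boolean {
    this.failures.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop failures that have aged out of the window.
    while (true) {
      const oldest = this.failures[0];
      if (oldest === undefined || oldest > cutoff) break;
      this.failures.shift();
    }
    return this.failures.length >= this.maxFailures;
  }

  /** Clears history, for example after a reconnect that stayed healthy. */
  reset(): void {
    this.failures.length = 0;
  }
}
```

A reconnect loop could call recordFailure(Date.now()) on each unexpected exit and switch to a terminal error state when it returns true, for instance with new FailureWindow(5, 10 * 60_000). Those numbers are placeholders; the right cap depends on how long a healthy runtime normally takes to start.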
Secret hygiene (log-sanitizer)
- Sanitizer misses Google AIza… and Slack xox[baprs]-… keys (packages/shared/src/log-sanitizer.ts:8-49): both are first-class BYOK/gateway providers, so realistic in logs. Fix: add both patterns.
- The redactor is copy-pasted 8× with structural drift (postgres-store-normalizers.ts:27, in-memory-control-plane-store.ts:243, in-memory-domains/channel-deliveries.ts:175 + channel-provider-events.ts:167 + workers.ts:488, and the three postgres-store-domains/{channel-deliveries,channel-provider-events,workers}.ts). The workers.ts pair has already diverged in form. On top of that there are 3 different sanitizers (log-sanitizer.ts, observability.ts:126-146, byok-secret-store.ts) with inconsistent sk-/AWS/high-entropy rules; the unexpected-error path logs raw error.message to telemetry through the weaker one (http-server.ts:1732-1745). Fix: unify on ONE shared sanitizer (complete pattern set), delete the 7+ copies, route observability through it.
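As an illustration of the "one shared sanitizer" direction, the sketch below keeps every pattern in a single table and adds the two missing providers. It is not the contents of log-sanitizer.ts. The redactSecrets name, the replacement format, and the exact character classes and lengths in the regexes are assumptions based on the commonly seen shape of these keys, so they should be checked against real provider formats before use.

```typescript
// Assumed key shapes (not taken from the repository):
//   Google API key: "AIza" followed by 35 URL-safe characters.
//   Slack token:    "xox" + one of b/a/p/r/s + "-" + at least 10 token characters.
const SECRET_PATTERNS: ReadonlyArray<{ name: string; pattern: RegExp }> = [
  { name: "google-api-key", pattern: /AIza[0-9A-Za-z_-]{35}/g },
  { name: "slack-token", pattern: /xox[baprs]-[0-9A-Za-z-]{10,}/g },
];

/** Replaces every match of every known pattern with a labelled placeholder. */
function redactSecrets(input: string): string {
  let output = input;
  for (const { name, pattern } of SECRET_PATTERNS) {
    output = output.replace(pattern, `[REDACTED:${name}]`);
  }
  return output;
}
```

Keeping the patterns in one exported table means the store normalizers and the observability path can import the same list rather than carrying their own copies.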
Performance (criticals)
- Cloud SSE fan-out is O(all connected clients) per event, cross-tenant. InMemoryCloudEventFanoutAdapter.publish (packages/cloud-server/src/session-event-bus.ts:39-44) loops the entire global subscriber Set on every projected event; one busy tenant taxes all. Fix: index subscriptions by routing key (Map<tenant:session, Set> / Map<tenant:user, Set>); the keyed sse-replay.ts hub is the model.
- Synchronous readFileSync + JSON.parse on every streamed runtime event in the Electron main thread (apps/desktop/src/main/desktop-pairing/service.ts:329 → store.ts:262, registered at ipc-handlers.ts:343, fires per message.part.delta). UI jank scales with token throughput. Fix: cache parsed pairings in memory; early-return when no online pairing.
- Data retention: no TTL/pruning on any append-only table. cloud_session_events, cloud_workspace_events, cloud_usage_events, cloud_audit_events, cloud_workflow_runs, cloud_channel_deliveries, and expired cloud_channel_interactions (all postgres-schema.ts) grow forever; the in-memory variants too. This inflates backups (vs the 5-min RPO target) and slows boot-time COUNT(*) backfills. Fix: batched retention job in CloudScheduler per the runbook retention matrix; partition the hot event tables.
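The routing-key index suggested for the SSE fan-out can be sketched as a Map from key to subscriber Set, so that publishing touches only the listeners for one tenant/session pair. This is a simplified stand-in, not InMemoryCloudEventFanoutAdapter or the sse-replay.ts hub: KeyedFanout, sessionKey, and the listener signature are hypothetical, and the key format assumes ids never contain a colon.

```typescript
type Listener<E> = (event: E) => void;

// Hypothetical keyed fan-out: publish cost is proportional to the number of
// subscribers on one routing key, not to all connected clients.
class KeyedFanout<E> {
  private readonly byKey = new Map<string, Set<Listener<E>>>();

  /** Registers a listener and returns an unsubscribe function. */
  subscribe(routingKey: string, listener: Listener<E>): () => void {
    let listeners = this.byKey.get(routingKey);
    if (!listeners) {
      listeners = new Set();
      this.byKey.set(routingKey, listeners);
    }
    listeners.add(listener);
    const bucket = listeners;
    return () => {
      bucket.delete(listener);
      // Remove empty buckets so the index does not grow without bound.
      if (bucket.size === 0 && this.byKey.get(routingKey) === bucket) {
        this.byKey.delete(routingKey);
      }
    };
  }

  /** Delivers to the listeners of one key and returns how many were notified. */
  publish(routingKey: string, event: E): number {
    const listeners = this.byKey.get(routingKey);
    if (!listeners) return 0;
    let delivered = 0;
    for (const listener of Array.from(listeners)) {
      listener(event);
      delivered += 1;
    }
    return delivered;
  }

  get keyCount(): number {
    return this.byKey.size;
  }
}

// Assumes tenant and session ids do not contain ":".
function sessionKey(tenantId: string, sessionId: string): string {
  return `${tenantId}:${sessionId}`;
}
```

A second index keyed by tenant and user could follow the same pattern for user-scoped events.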
Database hot-path queries (unindexed / unbounded)
- findSession: cross-tenant scan + sort, PK can't be used, opencode_session_id unindexed (postgres-control-plane-store.ts:1034-1042). Add indexes on (session_id) and (opencode_session_id).
- listSessions: no LIMIT, grows per-user forever (:1045-1057); listSessionsPage exists, so route callers to it.
- findApiTokenByPlaintext legacy fallback: unbounded cross-org scan + per-row hash on the auth hot path (postgres-store-domains/api-tokens.ts). Derive token_id and drop the scanning fallback.
- listSessionEvents/listWorkspaceEvents: optional LIMIT; projection-rebuild loads an entire session's event log into memory (session-projection-service.ts:147). Enforce a hard server-side max page; paginate rebuild by keyset.
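The last fix above (a hard server-side page cap plus keyset-paginated rebuild) might look roughly like the following. This is a sketch under stated assumptions, not the store's API: MAX_PAGE_SIZE, clampPageSize, forEachEventByKeyset, and the fetchPage callback are hypothetical, and it assumes each event carries a strictly increasing positive sequence number that the underlying query can filter on (for example "sequence > cursor ORDER BY sequence LIMIT n").

```typescript
// Hypothetical ceiling; the real value would come from configuration.
const MAX_PAGE_SIZE = 500;

/** Clamps a caller-supplied limit to [1, MAX_PAGE_SIZE]; invalid input gets the ceiling. */
function clampPageSize(requested: number | undefined): number {
  if (requested === undefined || !Number.isFinite(requested) || requested < 1) {
    return MAX_PAGE_SIZE;
  }
  return Math.min(Math.floor(requested), MAX_PAGE_SIZE);
}

interface SequencedEvent {
  sequence: number;
}

// Returns up to `limit` events with sequence > afterSequence, ordered by sequence.
type FetchPage<T> = (afterSequence: number, limit: number) => Promise<T[]>;

/** Streams events page by page so only one page is held in memory at a time. */
async function forEachEventByKeyset<T extends SequencedEvent>(
  fetchPage: FetchPage<T>,
  onEvent: (event: T) => void,
  pageSize?: number,
): Promise<number> {
  const limit = clampPageSize(pageSize);
  let cursor = 0; // assumes sequences are positive
  let total = 0;
  for (;;) {
    const page = await fetchPage(cursor, limit);
    for (const event of page) {
      onEvent(event);
      cursor = event.sequence;
      total += 1;
    }
    // A short page means the log is exhausted.
    if (page.length < limit) return total;
  }
}
```

Keyset paging avoids the growing OFFSET cost of offset-based pagination and keeps rebuild memory bounded by the page size rather than by the length of the session's event log.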
P2: Medium
Performance / memory
- Renderer: session-view is derived per-patch over ALL messages, and the timeline mints fresh Message objects that defeat the MessageBubble memo (packages/shared/src/session-view-messages.ts:303, chat-view-timeline.ts:64-88); SessionInspector re-serializes all tool payloads and uses an unvirtualized MessageList (SessionInspector.tsx:267-282,467); ChatTimelineItem/ToolTrace are unmemoized. Derive once per batch; keep message identity stable; memoize.
- Unbounded in-memory growth: viewCacheById never pruned (runtime-host/session-engine.ts:92), runtimeToolCache has a TTL but no size cap (runtime-tool-cache.ts:13), desktop session-task-state-store sessionLineage/taskRuns + O(N) scans never swept (:111-120,345-378), gateway per-stream render-state maps never evicted (render/state.ts:20). Add LRU caps / sweep eviction.
- Gateway: unbounded per-stream promise-chain queue (no backpressure) (session-stream-manager.ts:170), streams never torn down on session end (only 30-min TTL / 2000 LRU) (:197-257), unbounded outbound delivery concurrency + no per-provider rate limiting (gateway-runtime.ts:54,72-87), fixed 250ms reconnect (no backoff/jitter → thundering herd) (:65,188). Bound + backpressure + token-bucket.
- Cloud writes: appendProjectedEvent does ~6 sequential round-trips/event (session-projection-service.ts:100-141); reapers hold locks across ≤1000 round-trips in one tx (postgres-control-plane-store.ts:1623-1730, workflows.ts:371-477) with default statement_timeout=0. Batch + smaller tx batches + non-zero statement_timeout.
- MAX_DIRECTORY_CLIENTS=10_000 (runtime-host/runtime-state.ts:5): 10k live SSE connections before LRU eviction; lower to a realistic working set.
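For the "add LRU caps" recommendation, a size-bounded cache can be built on the insertion-order guarantee of Map. The sketch below is generic and hypothetical (LruCache is not an existing class in the repository); it shows only the eviction mechanics and leaves TTL handling to the existing code.

```typescript
// Minimal size-capped LRU cache. Map iterates in insertion order, so the
// first key is always the least recently used one.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new RangeError("maxEntries must be a positive integer");
    }
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

The same structure could back viewCacheById or the per-stream render-state maps, with the cap chosen from the expected working set rather than left unbounded.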
Ops / deployment
- Dockerfile healthcheck probes liveness, not readiness: both hit /healthz (unconditional 200) instead of /readyz (real DB/object-store/secret round-trip) (docker/open-cowork-cloud/Dockerfile:46, gateway:44). Point the Docker healthcheck at /readyz (Helm already does it right).
- Gateway /ready doesn't verify cloud control-plane connectivity (gateway-runtime.ts:101-104): it can report ready while its core dependency is down. Fold cloud reachability into ready().
- Migrations re-execute fully on every boot (no applied-guard on the transactional path; postgres-migrations.ts:31-34): safe (idempotent DDL) but re-runs full-table COUNT(*) backfills, tying boot time to table size. Skip migrations already in cloud_schema_migrations.
- cloud:migrate has no first-class deploy artifact (no Helm Job/initContainer; only embedded migrate-on-boot). Add a pre-upgrade migration Job; document the RUN_MIGRATIONS=false split-role flow.
- cloud:build has no standalone CI gate (only transitively via the Docker build). Add pnpm cloud:build + cloud typecheck to the validate job.
- CodeQL uses security-and-quality, not security-extended (codeql.yml:47).
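The applied-guard for migrations reduces to filtering the declared list against the ids already recorded in cloud_schema_migrations. The sketch below shows only that selection step. The Migration shape and selectPendingMigrations are hypothetical, and how the applied ids are read from the table is left out.

```typescript
// Hypothetical migration descriptor; the real module's shape may differ.
interface Migration {
  id: string;
  up: () => Promise<void> | void;
}

/** Returns the migrations whose id is not yet recorded as applied, in declared order. */
function selectPendingMigrations(
  all: readonly Migration[],
  appliedIds: Iterable<string>,
): Migration[] {
  const applied = new Set(appliedIds);
  return all.filter((migration) => !applied.has(migration.id));
}
```

With this guard, a boot against an up-to-date database would run nothing, so the COUNT(*) backfills execute only once, when their migration is first applied.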
Code quality
- Control-plane policy duplicated between postgres-control-plane-store.ts and in-memory-control-plane-store.ts (lease-reaper tree, idempotency rules, monotonicity, lease-token format), with one real behavioral inconsistency (Postgres compares tenantId/sessionId on idempotency reuse, in-memory omits them: postgres…:1755 vs in-memory…:1319). Extract as pure rule functions.
- Gateway signature/replay primitives diverge across providers: constantTimeStringEqual reimplemented 4× with different null-handling (webhook:872, slack:566, telegram:611, email:654); replay-cache cloned webhook↔slack. A security primitive should not behave differently per provider. Extract into @open-cowork/gateway-channel.
- getAdminPolicyOverview returns org policy/feature/BYOK-policy config to any active member, not admin-gated (services/overview-service.ts:134-141): information exposure (no secrets/cross-tenant). Gate with assertOrgAdmin if intended admin-only.
- Silently-swallowed errors degrade cloud features: workspace-gateway.ts:650 (cloudPolicy catch {} → can't distinguish network failure from policy denial), session-handlers.ts:73 (active-workspace check → drops events), abort handlers swallow into return-null (renderer thinks abort succeeded). Add logging; distinguish error from negative result for abort.
- God objects: WorkspaceGateway 1801 lines / 6 concerns (apps/desktop/src/main/workspace-gateway.ts, two ~95%-identical subscribe methods), projectSessionHistory one 727-line function (runtime-host/session-history-projector.ts:171), dispatchCloudWorkspaceSessionEvent 216-line 13-branch translator (ipc/session-handlers.ts:225), CloudSessionService 15-positional-param constructor (session-service.ts:458). Extract sub-modules / options objects.
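To make "extract as pure rule functions" concrete, the idempotency-reuse check could be written once as a function of plain records and called from both stores. The sketch below is an illustration only: IdempotencyRecord, ReuseDecision, and decideIdempotencyReuse are hypothetical names, and treating a tenant/session mismatch as a "conflict" is an assumption about the intended behavior, not something confirmed by the Postgres implementation.

```typescript
// Hypothetical record shape shared by both store backends.
interface IdempotencyRecord {
  tenantId: string;
  sessionId: string;
  idempotencyKey: string;
}

type ReuseDecision = "new" | "reuse" | "conflict";

/**
 * Pure rule: the same key in the same tenant and session is a replay that can
 * be reused; the same key in a different scope is treated as a conflict
 * (assumed policy); anything else is a new request.
 */
function decideIdempotencyReuse(
  existing: IdempotencyRecord | undefined,
  incoming: IdempotencyRecord,
): ReuseDecision {
  if (!existing || existing.idempotencyKey !== incoming.idempotencyKey) {
    return "new";
  }
  const sameScope =
    existing.tenantId === incoming.tenantId &&
    existing.sessionId === incoming.sessionId;
  return sameScope ? "reuse" : "conflict";
}
```

Because the function has no I/O, the dual-backend contract tests can exercise it directly, and neither store can drift by omitting the tenantId/sessionId comparison.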
Dead code
- Broken import: apps/desktop/tests/smoke-helpers.ts:17 imports from ../src/main/e2e-remote-debugging.ts, which no longer exists (knip's lone unresolved import). Remove/fix.
- Production command-palette leaks a DEV-only view: command-palette-items.ts:272-281 adds "UI Primitives (QA)" unconditionally, but the view is import.meta.env.DEV-gated (App.tsx:53) → selecting it in prod navigates to an empty view. Gate the palette item behind the same flag.
- Unused files (knip-confirmed, no dynamic consumers): apps/desktop/src/main/file-session.ts, permission-inheritance.ts, renderer/components/agents/AgentAttributeBar.tsx (or wire it into agent cards), renderer/loading.ts, renderer/components/ui/utils.ts (zero-importer re-export shim). Remove.
- 15 of 25 StudioPrimitives.tsx components are gallery-only (CoworkerCard, ComposerShell, DeliverableCard, ArtifactCard, ChannelStatusCard/Row, PersonRow, WizardSteps/StepPane, WorkingStyleBars, StudioShell, ConversationLaneCard, KanbanBoard, PermissionEditorRow, TraitSlider): only referenced by the gallery + barrels + tests. Decide: intentional design-system reserve, or prune / stop barrel-exporting.
P3: Low
- webhook_replay_claims trim sorts on unindexed seen_at_ms; telemetry.trackEvent uses sync appendFileSync per event; gateway shutdown drain can leak ackDelivery after timeout; session-engine.removeSession clones the whole map per removal; unknownEventLastLoggedAt map never cleared; MCP status poll fixed at 10s. (See scalability pass L1-L7.)
- Plaintext/weak-key secret adapter allowed outside public_production tier (app.ts:803,810): extend envelope + strong-key assertions (or a loud warning) to private_beta.
- Bridge providers (Discord/WhatsApp/Signal) don't do native platform signature verification (they trust an upstream relay; fail-closed): surface the relay requirement in operator docs.
- Worker loop uses non-locking listRunnableSessions then races per-session lease (correct but wasteful; atomic claimRunnableSessions exists unused).
- Single explicit : any (tool-trace-utils.ts:30).
- deriveKey uses bare SHA-256 (fine for a high-entropy key; HKDF would be cleaner).
- SECURITY.md has no PGP contact.
Dead docs (stale after the cloud-server extraction + Studio pass)
- Stale moved-path references (cloud apps/desktop/src/main/cloud/** → packages/cloud-server/src/; substrate → packages/runtime-host/src/): docs/architecture.md (~14 paths), AGENTS.md (~14 paths + a broken markdown link at :23 to a removed file), docs/downstream.md, docs/downstream-contract.md, docs/oss-packaging-migration.md, docs/security-model.md:158, docs/coordination-model.md:180, docs/claw-like-agents.md (also references removed automation.md/automation.ts). Repoint.
- Superseded-behavior docs (Studio: single Mercury theme + glow removed): docs/design-refinement-and-cloud-react-proposal.md (argues to KEEP all presets + the glow), docs/prototypes/mercury-polish-showcase.html (Nord/Gruvbox/Dracula swatches), docs/design-tokens.md/design-system.md (ambient --glow-*). Archive / add superseded banners.
- Completed design docs to archive: docs/design/cloud-server-extraction.md (Milestone C ✅ COMPLETE), docs/design/studio-production-audit.md (mostly ✅; dead link to handoff/BRIEF.md), and the prior docs/design/production-readiness-audit.md.
- Duplicate/overlapping: docs/production-readiness-audit.md (2026-06-02, in mkdocs.yml nav) vs docs/design/production-readiness-audit.md (2026-06-16, broader, NOT in nav). Disambiguate; promote the newer or mark the older superseded.
- Stale error string: scripts/validate-deployment-configs.mjs:1368 checks the new path but the failure message still names packages/cloud-server/src/http-server.ts.
Verified strong (do not regress)
- Tenant/workspace isolation (every session-scoped read/write funnels through getSessionView(principal,…) + tenant-scoped store; cross-tenant id → 404).
- API-token verification (scrypt + per-token salt + timingSafeEqual + revoked/expiry + active-membership).
- Cookie auth (HMAC-signed, HttpOnly+Secure+SameSite, double-submit CSRF).
- CORS (exact allowlist, *+credentials blocked at boot).
- Webhook signatures (Slack/generic/Telegram/Stripe/workflow: timing-safe, mandatory, replay-protected).
- Injection-clean (all child_process use argv arrays, zero shell:true; all SQL parameterized; path traversal blocked by assertSafeObjectKey).
- Electron hardening (contextIsolation+sandbox, triple IPC channel allowlists + assertTrustedIpcSender, deny-all window-open, nav allowlist).
- BYOK envelope AES-256-GCM with tenant-bound AAD + strong-key validator.
- Deployment-tier guardrails (fail-closed public_production).
- Lease-based concurrency (FOR UPDATE [SKIP LOCKED], monotonic tokens; horizontally scalable, exactly-once scheduler).
- Graceful shutdown + orphan process cleanup (PID ledger, SIGTERM→SIGKILL, Windows taskkill /T).
- Observability (structured JSON logs, correlation IDs, Prometheus/OTLP, real /readyz).
- Release supply chain (signed tags, SHA256SUMS+GPG, SLSA provenance, CycloneDX+SPDX SBOMs, Cosign, Grype gate, SHA-pinned actions).
- Type discipline: 1 explicit : any in src, 0 @ts-ignore, narrow justified eslint-disables, no silently-empty catch blocks.