claw/docs/superpowers/specs/2026-04-03-ws-browser-integration-surface-correction-design.md
木炎 bdf8e12246 feat: align browser callback runtime and export flows
Consolidate the browser task runtime around the callback path, add safer artifact opening for Zhihu exports, and cover the new service/browser flows with focused tests and supporting docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 21:44:53 +08:00


WS Browser Integration Surface Correction Design

Background

The current websocket service path already proved two things:

  1. sg_claw_client -> sg_claw request handling works.
  2. The ws-native backend/auth replacement removed the old pipe/HMAC mismatch that produced the error "invalid hmac seed: session key must not be empty".

However, the smoke test against a real sgBrowser still fails.

Manual probing against the configured real browser websocket endpoint (ws://127.0.0.1:12345) produced a stable pattern:

  • the connection succeeds
  • the server sends one banner text frame such as Welcome! You are client #1
  • after that, business frames receive no status frame and no callback frame
  • this remains true for:
    • valid-looking sgBrowerserOpenPage frames
    • callback-based APIs
    • no-arg/context-light APIs
    • malformed or obviously wrong frames

At the same time, local documentation and archived frontend code point to a different integration model:

  • the websocket API doc describes the websocket service as a transport replacement for page-context JavaScript calls, and requires the current page URL (requesturl) in each message
  • archived frontend/product code uses window.sgFunctionsUI(...) and window.BrowserAction(...)
  • archived architecture docs describe the supported product path as FunctionsUI -> browser host bridge -> BrowserAction/CommandRouter, not an arbitrary external process speaking raw browser websocket frames

This means the current assumption is no longer acceptable as the default architecture hypothesis:

  • Rejected default assumption: sg_claw can directly control the real browser by speaking raw business frames to browserWsUrl as an external client, with no additional browser-host bridge, page context, or bootstrap/session contract.

That assumption may still turn out to be partially true, but the evidence no longer justifies coding against it as the mainline design.

Problem Statement

The project currently has a functioning ws-native transport implementation, but it does not have a validated real integration surface for sgBrowser.

The unresolved question is now architectural rather than syntactic:

Possibility A: raw websocket is valid, but requires hidden bootstrap/preconditions

Examples suggested by the local API document:

  • a real browser page must already exist and requesturl must refer to that page
  • one or more setup calls such as sgSetAuthInfo, sgBrowserLogin, sgOpenAgent, or sgBrowerserActiveTab must happen first
  • callbacks may require a browser-side JS/page context that an external process does not automatically have
  • some APIs may only work against agent/show/hide areas after browser-side initialization

Possibility B: raw websocket is not the supported external control surface

Instead, the real product path may require:

  • FunctionsUI / browser-host IPC
  • host-side security and routing
  • BrowserAction / CommandRouter dispatch
  • page-injected or browser-embedded execution context

If this is true, continuing to invest in raw external websocket business-frame handling as the main integration surface would be architectural drift.

Goal

Replace the current unvalidated ws-native-direct assumption with a decision-backed integration strategy.

The next implementation slice must do exactly one of these two things based on evidence:

  1. Bootstrap path: prove that raw websocket control is real and supported once the missing bootstrap/precondition sequence is performed, then codify that bootstrap sequence and keep WsBrowserBackend as the execution surface.
  2. Bridge path: prove that raw websocket is not the real supported surface for external control, then pivot the runtime design so sgClaw targets the actual browser-host bridge / BrowserAction surface instead of pretending the raw websocket is enough.

Non-goals

This correction slice does not include:

  • broad feature work on the floating chat UI
  • multi-client service redesign
  • browser process lifecycle management
  • speculative protocol expansion
  • generic reconnection/backoff work
  • rewriting the entire compat/runtime stack without evidence
  • landing both bootstrap and bridge implementations in one branch

The purpose of this slice is to choose the correct integration surface first.

Evidence Summary

Evidence that the current raw-ws-direct assumption is weak

  1. Real endpoint accepts connections but stays silent after the welcome/banner frame.
  2. Silence occurs even for malformed frames, where an RPC surface open to arbitrary external clients would normally answer with an error or close the connection. This suggests the endpoint is not acting as such a surface.
  3. The API documentation frames websocket use as a replacement for page-side JS invocation, not as a standalone public automation API.
  4. The documentation repeatedly depends on requesturl, callback function names, target pages, and browser areas (show, hide, agent).
  5. Historical frontend/product code uses window.sgFunctionsUI(...) and window.BrowserAction(...), not raw external websocket business calls.
  6. Historical architecture docs emphasize FunctionsUI, CommandRouter, and browser-host bridge seams.

Evidence that the current ws-native work is still useful

  1. The ws-native auth replacement removed a real bug.
  2. The ws backend now correctly carries forward the last navigated request URL.
  3. WsBrowserBackend and ws_protocol remain valuable as deterministic protocol tooling for fake-server tests and any future bootstrap validation.
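
The request-URL carry-forward mentioned in item 2 can be pictured with a minimal sketch. The type and method names below (RequestUrlTracker, on_navigate, request_url) are hypothetical and are not taken from the ws backend code, which this document does not show.

```rust
/// Hypothetical helper: remembers the last URL the backend navigated to,
/// so later frames can carry it as their `requesturl`.
#[derive(Debug, Default)]
pub struct RequestUrlTracker {
    last: Option<String>,
}

impl RequestUrlTracker {
    /// Record a successful navigation.
    pub fn on_navigate(&mut self, url: &str) {
        self.last = Some(url.to_string());
    }

    /// URL to attach to the next frame: the last navigated URL if there is
    /// one, otherwise the caller-supplied fallback.
    pub fn request_url<'a>(&'a self, fallback: &'a str) -> &'a str {
        self.last.as_deref().unwrap_or(fallback)
    }
}
```

Before any navigation the tracker returns the fallback; after on_navigate it returns the recorded URL until the next navigation replaces it.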

So the conclusion is not “delete ws-native work.”

The conclusion is:

  • do not treat raw external websocket control as validated product architecture yet
  • use the ws-native code only behind a decision gate

Design Decision

Adopt a decision-gated integration strategy.

Decision Gate 1: Validate bootstrap viability first

Before any more production architecture changes, add a focused, deterministic validation harness that can exercise a candidate raw-websocket bootstrap sequence against a live endpoint.

The harness must support:

  • ordered frame scripts
  • exact frame logging
  • exact timeout/silence observation
  • trying candidate setup sequences such as:
    • sgSetAuthInfo
    • sgBrowserLogin
    • sgOpenAgent
    • sgBrowerserActiveTab
    • then a minimal action such as sgBrowerserOpenPage or sgBrowserExcuteJsCodeByArea
  • trying the same action with different requesturl assumptions
  • distinguishing these outcomes:
    • numeric status returned
    • callback returned
    • welcome only, then silence
    • close/reset
    • protocol error

This harness is not product code. It is an evidence tool that prevents blind implementation.
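
The outcome distinctions listed above could be modeled roughly as follows. This is a sketch under stated assumptions, not the harness itself: the type names are hypothetical, the banner is assumed to be recognizable by its "Welcome!" prefix, and a numeric status frame is assumed to be a text frame that parses as an integer. Any other non-banner text is treated as callback traffic.

```rust
use std::time::Duration;

/// One raw observation on the socket, in arrival order.
#[derive(Debug, Clone, PartialEq)]
pub enum Observed {
    Text(String),          // a text frame, stored verbatim
    Closed,                // peer closed or reset the connection
    ProtocolError(String), // peer violated the websocket protocol
    Timeout(Duration),     // nothing arrived within the window
}

/// The outcomes the harness has to tell apart.
#[derive(Debug, Clone, PartialEq)]
pub enum ProbeOutcome {
    NumericStatus(i64),
    Callback(String),
    Silence { banner_seen: bool },
    CloseOrReset,
    ProtocolError(String),
}

/// Assumption: the banner frame starts with "Welcome!".
fn is_banner(text: &str) -> bool {
    text.starts_with("Welcome!")
}

/// Classify a run by its first non-banner observation.
pub fn classify(events: &[Observed]) -> ProbeOutcome {
    let mut banner_seen = false;
    for event in events {
        match event {
            Observed::Text(t) if is_banner(t) => {
                banner_seen = true;
            }
            Observed::Text(t) => {
                // Assumption: a status frame is plain integer text.
                return match t.trim().parse::<i64>() {
                    Ok(code) => ProbeOutcome::NumericStatus(code),
                    Err(_) => ProbeOutcome::Callback(t.clone()),
                };
            }
            Observed::Closed => return ProbeOutcome::CloseOrReset,
            Observed::ProtocolError(e) => return ProbeOutcome::ProtocolError(e.clone()),
            Observed::Timeout(_) => return ProbeOutcome::Silence { banner_seen },
        }
    }
    ProbeOutcome::Silence { banner_seen }
}
```

With this shape, the pattern seen in manual probing (banner, then nothing) classifies as Silence { banner_seen: true }, which keeps it distinct from a run where not even the banner arrived.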

Decision Gate 2: Make bridge pivot the default fallback

If the validation harness cannot demonstrate a reproducible bootstrap sequence that yields real status/callback frames from the live browser endpoint, then raw websocket must be considered non-validated for external control.

At that point, the design must pivot to the bridge path:

  • sgClaw browser control targets the real browser-host integration surface
  • use the bridge already evidenced in docs/code (FunctionsUI, browser host IPC, BrowserAction, CommandRouter)
  • keep raw websocket support, if retained at all, as a diagnostic or highly constrained adapter rather than the primary product path
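
The gate itself can be expressed as a small pure function over harness results. The sketch below is illustrative only: the names (RunRecord, Decision, decide) and the min_runs threshold are assumptions, and "reproducible" is interpreted here as every repeated run of one candidate sequence yielding status or callback traffic.

```rust
/// Coarse outcome of one live run (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Status,
    Callback,
    Silence,
    Closed,
    ProtocolError,
}

/// All repeated runs of one candidate bootstrap sequence.
pub struct RunRecord {
    pub sequence: String,
    pub outcomes: Vec<Outcome>,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    BootstrapPath { sequence: String },
    BridgePath,
}

/// Pick the bootstrap path only if some sequence produced real traffic on
/// every one of at least `min_runs` runs; otherwise fall back to the bridge.
pub fn decide(records: &[RunRecord], min_runs: usize) -> Decision {
    let required = min_runs.max(1); // an empty record never counts as proof
    for record in records {
        let reproducible = record.outcomes.len() >= required
            && record
                .outcomes
                .iter()
                .all(|o| matches!(o, Outcome::Status | Outcome::Callback));
        if reproducible {
            return Decision::BootstrapPath {
                sequence: record.sequence.clone(),
            };
        }
    }
    Decision::BridgePath
}
```

Making the bridge path the value returned when nothing qualifies mirrors the intent of this gate: the fallback needs no positive evidence, while the bootstrap path does.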

Architecture Options

Option A: Bootstrap-validated raw websocket path

Choose this only if the live validation harness produces repeatable evidence.

Resulting architecture

sg_claw_client
  -> sg_claw service
    -> bootstrap sequence executor
      -> WsBrowserBackend
        -> browserWsUrl
          -> sgBrowser

Required conditions

  • a reproducible bootstrap sequence exists
  • the sequence yields status/callback traffic for real business actions
  • the sequence can be encoded as a narrow service-side precondition layer
  • the sequence does not require unowned browser UI/manual setup outside a documented contract

Allowed production changes if Option A wins

  • add explicit bootstrap calls before first browser action
  • persist validated session/context state needed by the real endpoint
  • tighten request_url / target-page handling around the proven contract
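
If Option A wins, "explicit bootstrap calls before first browser action" could take the form of a narrow wrapper like the one sketched below. FrameSink, BootstrapGate, and RecordingSink are hypothetical names; the real WsBrowserBackend interface is not shown in this document. The bootstrap frames are passed in as data, so the wrapper encodes only a sequence that validation has already proven and guesses nothing itself.

```rust
/// Hypothetical minimal sending seam.
pub trait FrameSink {
    fn send(&mut self, frame: &str) -> Result<(), String>;
}

/// Sends a validated bootstrap sequence once, before the first action.
pub struct BootstrapGate<S: FrameSink> {
    sink: S,
    bootstrap: Vec<String>,
    done: bool,
}

impl<S: FrameSink> BootstrapGate<S> {
    pub fn new(sink: S, bootstrap: Vec<String>) -> Self {
        Self { sink, bootstrap, done: false }
    }

    /// Run the bootstrap on first use, then forward the action frame.
    /// If a bootstrap step fails, `done` stays false and the next call retries.
    pub fn send_action(&mut self, frame: &str) -> Result<(), String> {
        if !self.done {
            for step in &self.bootstrap {
                self.sink.send(step)?;
            }
            self.done = true;
        }
        self.sink.send(frame)
    }

    pub fn into_inner(self) -> S {
        self.sink
    }
}

/// In-memory sink for tests: records every frame it is asked to send.
#[derive(Debug, Default)]
pub struct RecordingSink {
    pub sent: Vec<String>,
}

impl FrameSink for RecordingSink {
    fn send(&mut self, frame: &str) -> Result<(), String> {
        self.sent.push(frame.to_string());
        Ok(())
    }
}
```

Keeping the bootstrap in one place like this is one way to avoid scattering setup calls across the service, which the list below rules out.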

Not allowed even if Option A wins

  • guessing bootstrap steps without evidence
  • silently sprinkling many setup calls into random locations
  • broadening the compat/runtime API before the bootstrap contract is known

Option B: Bridge-first integration path

Choose this if live validation does not prove a workable raw websocket bootstrap.

Resulting architecture

sg_claw_client
  -> sg_claw service
    -> bridge adapter
      -> browser host / FunctionsUI / BrowserAction / CommandRouter
        -> sgBrowser page actions

Required conditions

  • local docs/code show a stable supported bridge path
  • raw websocket remains non-validated or only page-context-scoped
  • the bridge surface can be wrapped behind the existing BrowserBackend abstraction or a sibling adapter without weakening pipe behavior
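
As a rough picture of the last condition, the sketch below shows a service that owns exactly one primary backend behind a trait. The trait shape is hypothetical and is not the repository's actual BrowserBackend definition. Both backends are stand-ins: the bridge one only formats what it would dispatch, and the raw websocket one refuses to act as a primary surface.

```rust
/// Hypothetical shape of the backend seam (not the real trait).
pub trait BrowserBackend {
    fn surface(&self) -> &'static str;
    fn execute(&mut self, action: &str, request_url: &str) -> Result<String, String>;
}

/// Stand-in for a backend that targets the browser-host bridge.
pub struct BridgeBackend;

impl BrowserBackend for BridgeBackend {
    fn surface(&self) -> &'static str {
        "bridge"
    }
    fn execute(&mut self, action: &str, request_url: &str) -> Result<String, String> {
        // A real implementation would call into the host bridge here.
        Ok(format!("bridge dispatch: {} for {}", action, request_url))
    }
}

/// Stand-in for the raw websocket path, kept for tests and probes only.
pub struct RawWsProbeBackend;

impl BrowserBackend for RawWsProbeBackend {
    fn surface(&self) -> &'static str {
        "raw-ws-probe"
    }
    fn execute(&mut self, _action: &str, _request_url: &str) -> Result<String, String> {
        Err("raw websocket is not validated for external control".to_string())
    }
}

/// The service holds one primary backend, chosen at construction time.
pub struct Service {
    primary: Box<dyn BrowserBackend>,
}

impl Service {
    pub fn new(primary: Box<dyn BrowserBackend>) -> Self {
        Self { primary }
    }
    pub fn primary_surface(&self) -> &'static str {
        self.primary.surface()
    }
    pub fn run(&mut self, action: &str, request_url: &str) -> Result<String, String> {
        self.primary.execute(action, request_url)
    }
}
```

Because the service takes a single boxed backend, there is no code path on which two surfaces compete to be primary.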

Allowed production changes if Option B wins

  • add a new browser backend implementation that targets the real bridge surface
  • redirect ws service/browser execution away from raw business frames
  • preserve ws-native code only for tests, probes, or intentionally constrained cases

Not allowed even if Option B wins

  • pretending the old raw-ws mainline still works “well enough”
  • leaving the service path ambiguously split between two competing primary surfaces

Scope Guardrails for the Next Implementation Plan

The next implementation plan must obey these guardrails:

  1. One branch, one decision. Do not implement both architecture options at once.
  2. Evidence before code. If bootstrap is unproven, the next coding task is probe/validation tooling, not another speculative service/runtime refactor.
  3. Keep pipe untouched. src/lib.rs, pipe handshake, and the pipe BrowserPipeTool path remain behaviorally unchanged.
  4. Do not delete ws-native code prematurely. It still has value for protocol tests and validation tooling.
  5. Do not broaden success claims. Removing the "invalid hmac seed" error did not make real browser control work.

Testing Strategy

Stage 1: Evidence tooling tests

Add deterministic tests for the live-probe/validation harness so it can:

  • send an ordered frame script
  • record exact received frames
  • report silence/timeout precisely
  • expose transcript output suitable for comparing candidate bootstrap sequences

These tests use a fake websocket server, not sgBrowser.
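
A deterministic Stage 1 test might look like the following sketch. It replaces the fake websocket server with an in-memory stand-in so it needs no network or external crates. The Transport trait, FakeServer, and the one-observation-per-sent-frame model are simplifying assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// One transcript line, in the order it happened.
#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Sent(String),
    Received(String),
    Silence(Duration),
}

/// Hypothetical transport seam shared by the fake and the live socket.
pub trait Transport {
    fn send(&mut self, frame: &str);
    /// Returns None when nothing arrives within `timeout`.
    fn recv(&mut self, timeout: Duration) -> Option<String>;
}

/// Scripted fake: each queued item is the reply to one recv call,
/// where None simulates silence.
pub struct FakeServer {
    replies: VecDeque<Option<String>>,
}

impl FakeServer {
    pub fn new(replies: Vec<Option<&str>>) -> Self {
        let replies = replies
            .into_iter()
            .map(|r| r.map(|s| s.to_string()))
            .collect();
        Self { replies }
    }
}

impl Transport for FakeServer {
    fn send(&mut self, _frame: &str) {}
    fn recv(&mut self, _timeout: Duration) -> Option<String> {
        // An exhausted queue also reads as silence.
        self.replies.pop_front().flatten()
    }
}

/// Send frames in order and record exactly what came back after each one.
pub fn run_script<T: Transport>(t: &mut T, script: &[&str], timeout: Duration) -> Vec<Entry> {
    let mut transcript = Vec::new();
    for &frame in script {
        t.send(frame);
        transcript.push(Entry::Sent(frame.to_string()));
        match t.recv(timeout) {
            Some(reply) => transcript.push(Entry::Received(reply)),
            None => transcript.push(Entry::Silence(timeout)),
        }
    }
    transcript
}
```

Since the transcript is a plain vector of comparable entries, two candidate bootstrap sequences can be compared entry by entry in a test assertion.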

Stage 2: Live validation runs

Use the harness against the real endpoint with a fixed matrix of candidate sequences.

At minimum, compare:

  1. no bootstrap -> minimal action
  2. sgOpenAgent -> minimal action
  3. sgSetAuthInfo -> minimal action
  4. sgBrowserLogin -> minimal action
  5. sgBrowerserActiveTab -> minimal action
  6. combined documented bootstrap candidates -> minimal action
  7. alternate requesturl values representing:
    • about:blank
    • target page URL
    • a currently open page URL if known
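
One way to keep the matrix fixed and reviewable is to generate it from data. The sketch below crosses the six bootstrap variants above with the requesturl values; treating row 7 as a cross product is an interpretation, not something this design mandates. The order of the combined sequence simply follows the order in which the calls are listed under Decision Gate 1 and is not a validated order. Candidate and build_matrix are hypothetical names.

```rust
/// One cell of the validation matrix.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub bootstrap: Vec<&'static str>,
    pub request_url: String,
}

/// Build bootstrap variants x requesturl values.
/// `open_page_url` is the currently open page, if it is known.
pub fn build_matrix(target_url: &str, open_page_url: Option<&str>) -> Vec<Candidate> {
    // Listing order from Decision Gate 1; not a proven bootstrap order.
    let setup_calls = [
        "sgSetAuthInfo",
        "sgBrowserLogin",
        "sgOpenAgent",
        "sgBrowerserActiveTab",
    ];

    // Variant 1: no bootstrap at all.
    let mut sequences: Vec<Vec<&'static str>> = vec![Vec::new()];
    // Variants 2 to 5: each setup call on its own.
    for call in setup_calls.iter() {
        sequences.push(vec![*call]);
    }
    // Variant 6: all documented candidates combined.
    sequences.push(setup_calls.to_vec());

    let mut urls = vec!["about:blank".to_string(), target_url.to_string()];
    if let Some(url) = open_page_url {
        urls.push(url.to_string());
    }

    let mut matrix = Vec::new();
    for sequence in &sequences {
        for url in &urls {
            matrix.push(Candidate {
                bootstrap: sequence.clone(),
                request_url: url.clone(),
            });
        }
    }
    matrix
}
```

Each candidate is followed by the same minimal action when it is run, so the only variables across cells are the bootstrap prefix and the requesturl.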

Stage 3: Architecture-branch acceptance

If Option A wins:

  • add one automated regression that proves the validated bootstrap sequence produces the first real status frame in a controlled integration test
  • then continue with the narrowest production implementation plan

If Option B wins:

  • write a new bridge-integration implementation plan before changing production code
  • base all production tasks on the documented bridge surface

Acceptance Criteria for This Design Correction

This design correction is successful only if future work follows these rules:

  1. The repository has an explicit design document recording that raw ws-native direct control is not currently validated.
  2. The next engineering slice starts with validation or bridge selection, not another speculative runtime refactor.
  3. Any future claim that raw websocket is the supported production path must be backed by a reproducible live bootstrap transcript.
  4. If that evidence does not appear, the project pivots to the bridge path rather than continuing to guess.

Consequences

Positive

  • stops further speculative coding against an unproven surface
  • preserves useful ws-native work without over-committing to it
  • creates a clean decision point for the next implementation branch

Trade-off

  • this does not immediately unblock real browser control
  • it intentionally inserts an evidence phase before more production changes

That trade-off is acceptable because the current failure mode is architectural uncertainty, not a missing two-line fix.