WS Browser Integration Surface Correction Design
Background
The current websocket service path already proved two things:
- sg_claw_client -> sg_claw request handling works.
- The ws-native backend/auth replacement removed the old pipe/HMAC mismatch that produced the error "invalid hmac seed: session key must not be empty".
However, the smoke test against the real sgBrowser still does not pass.
Manual probing against the configured real browser websocket endpoint (ws://127.0.0.1:12345) produced a stable pattern:
- the connection succeeds
- the server sends one banner text frame such as "Welcome! You are client #1"
- after that, business frames receive no status frame and no callback frame
- this remains true for:
  - valid-looking sgBrowerserOpenPage frames
  - callback-based APIs
  - no-arg/context-light APIs
  - malformed or obviously wrong frames
At the same time, local documentation and archived frontend code point to a different integration model:
- the websocket API doc describes the websocket service as a transport replacement for page-context JavaScript calls, and requires the current page URL (requesturl) in each message
- archived frontend/product code uses window.sgFunctionsUI(...) and window.BrowserAction(...)
- archived architecture docs describe the supported product path as FunctionsUI -> browser host bridge -> BrowserAction/CommandRouter, not an arbitrary external process speaking raw browser websocket frames
This means the current assumption is no longer acceptable as the default architecture hypothesis:
- Rejected default assumption: sg_claw can directly control the real browser by speaking raw business frames to browserWsUrl as an external client, with no additional browser-host bridge, page context, or bootstrap/session contract.
That assumption may still turn out to be partially true, but it is no longer justified enough to continue coding against as the mainline design.
Problem Statement
The project currently has a functioning ws-native transport implementation, but it does not have a validated real integration surface for sgBrowser.
The unresolved question is now architectural rather than syntactic:
Possibility A: raw websocket is valid, but requires hidden bootstrap/preconditions
Examples suggested by the local API document:
- a real browser page must already exist and requesturl must refer to that page
- one or more setup calls such as sgSetAuthInfo, sgBrowserLogin, sgOpenAgent, or sgBrowerserActiveTab must happen first
- callbacks may require a browser-side JS/page context that an external process does not automatically have
- some APIs may only work against agent/show/hide areas after browser-side initialization
Possibility B: raw websocket is not the supported external control surface
Instead, the real product path may require:
- FunctionsUI / browser-host IPC
- host-side security and routing
- BrowserAction / CommandRouter dispatch
- page-injected or browser-embedded execution context
If this is true, continuing to invest in raw external websocket business-frame handling as the main integration surface would be architectural drift.
Goal
Replace the current unvalidated ws-native-direct assumption with a decision-backed integration strategy.
The next implementation slice must do exactly one of these two things based on evidence:
- Bootstrap path: prove that raw websocket control is real and supported once the missing bootstrap/precondition sequence is performed, then codify that bootstrap sequence and keep WsBrowserBackend as the execution surface.
- Bridge path: prove that raw websocket is not the real supported surface for external control, then pivot the runtime design so sgClaw targets the actual browser-host bridge / BrowserAction surface instead of pretending the raw websocket is enough.
Non-goals
This correction slice does not include:
- broad feature work on the floating chat UI
- multi-client service redesign
- browser process lifecycle management
- speculative protocol expansion
- generic reconnection/backoff work
- rewriting the entire compat/runtime stack without evidence
- landing both bootstrap and bridge implementations in one branch
The purpose of this slice is to choose the correct integration surface first.
Evidence Summary
Evidence that the current raw-ws-direct assumption is weak
- Real endpoint accepts connections but stays silent after the welcome/banner frame.
- Silence occurs even for malformed frames, which suggests the endpoint is not acting like an openly documented RPC surface for arbitrary external clients.
- The API documentation frames websocket use as a replacement for page-side JS invocation, not as a standalone public automation API.
- The documentation repeatedly depends on requesturl, callback function names, target pages, and browser areas (show, hide, agent).
- Historical frontend/product code uses window.sgFunctionsUI(...) and window.BrowserAction(...), not raw external websocket business calls.
- Historical architecture docs emphasize FunctionsUI, CommandRouter, and browser-host bridge seams.
Evidence that the current ws-native work is still useful
- The ws-native auth replacement removed a real bug.
- The ws backend now correctly carries forward the last navigated request URL.
- WsBrowserBackend and ws_protocol remain valuable as deterministic protocol tooling for fake-server tests and any future bootstrap validation.
So the conclusion is not “delete ws-native work.”
The conclusion is:
- do not treat raw external websocket control as validated product architecture yet
- use the ws-native code only behind a decision gate
Design Decision
Adopt a decision-gated integration strategy.
Decision Gate 1: Validate bootstrap viability first
Before any more production architecture changes, add a focused, deterministic validation harness that can exercise a candidate raw-websocket bootstrap sequence against a live endpoint.
The harness must support:
- ordered frame scripts
- exact frame logging
- exact timeout/silence observation
- trying candidate setup sequences such as:
  - sgSetAuthInfo
  - sgBrowserLogin
  - sgOpenAgent
  - sgBrowerserActiveTab
  - then a minimal action such as sgBrowerserOpenPage or sgBrowserExcuteJsCodeByArea
- trying the same action with different requesturl assumptions
- distinguishing these outcomes:
  - numeric status returned
  - callback returned
  - welcome only, then silence
  - close/reset
  - protocol error
This harness is not product code. It is an evidence tool that prevents blind implementation.
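The harness requirements above can be sketched as follows. This is a minimal illustration, not the project's implementation: Python is used only for brevity, and the names Outcome, Transcript, classify_reply and run_script are hypothetical. The transport is injected as two callables so the same loop can run against a fake server or a live socket. The sketch assumes that recv returns None on timeout, raises ConnectionError on close/reset and ValueError on a protocol error, and that a status frame is a bare integer; none of these conventions are confirmed by the API document.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    STATUS = "numeric status returned"
    CALLBACK = "callback returned"
    SILENCE = "welcome only, then silence"
    CLOSED = "close/reset"
    PROTOCOL_ERROR = "protocol error"


@dataclass
class Transcript:
    banner: Optional[str] = None
    # One (sent_frame, reply_or_None) pair per script step, in order.
    steps: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    outcome: Optional[Outcome] = None


def classify_reply(reply: Optional[str]) -> Outcome:
    """Classify the reply to the final (minimal action) frame.

    Assumption: a status frame is a bare integer and any other text frame
    is a callback. The real frame shapes must come from live transcripts.
    """
    if reply is None:
        return Outcome.SILENCE
    if re.fullmatch(r"-?\d+", reply.strip()):
        return Outcome.STATUS
    return Outcome.CALLBACK


def run_script(send: Callable[[str], None],
               recv: Callable[[float], Optional[str]],
               script: List[str],
               timeout_s: float = 2.0) -> Transcript:
    """Send frames in order, logging every sent frame and every reply."""
    t = Transcript()
    try:
        t.banner = recv(timeout_s)  # the welcome/banner frame, if any
        for frame in script:
            send(frame)
            t.steps.append((frame, None))  # log the send even if recv fails
            t.steps[-1] = (frame, recv(timeout_s))
    except ConnectionError:
        t.outcome = Outcome.CLOSED
    except ValueError:
        t.outcome = Outcome.PROTOCOL_ERROR
    else:
        last_reply = t.steps[-1][1] if t.steps else None
        t.outcome = classify_reply(last_reply)
    return t
```

The outcome is taken from the reply to the last frame only, so a setup call that answers while the minimal action stays silent is still reported as silence.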
Decision Gate 2: Make bridge pivot the default fallback
If the validation harness cannot demonstrate a reproducible bootstrap sequence that yields real status/callback frames from the live browser endpoint, then raw websocket must be considered non-validated for external control.
At that point, the design must pivot to the bridge path:
- sgClaw browser control targets the real browser-host integration surface
- use the bridge already evidenced in docs/code (FunctionsUI, browser host IPC, BrowserAction, CommandRouter)
- keep raw websocket support, if retained at all, as a diagnostic or highly constrained adapter rather than the primary product path
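One way to make this gate mechanical is a small function over recorded probe results. The sketch below is illustrative only: choose_path, the outcome strings, and the min_repeats threshold are assumptions introduced here, and the source does not say how many repeats count as reproducible.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple

# Outcomes that count as real traffic from the live endpoint.
REAL_TRAFFIC = {"status", "callback"}


def choose_path(runs: Iterable[Tuple[str, str]],
                min_repeats: int = 3) -> Tuple[str, Optional[str]]:
    """Pick the integration path from (candidate_name, outcome) pairs.

    A candidate wins only if it was run at least min_repeats times and
    every run produced a status or callback frame. min_repeats is a
    hypothetical threshold for "reproducible".
    """
    by_candidate = defaultdict(list)
    for name, outcome in runs:
        by_candidate[name].append(outcome)
    for name, outcomes in by_candidate.items():
        if len(outcomes) >= min_repeats and all(o in REAL_TRAFFIC for o in outcomes):
            return ("bootstrap", name)
    # No reproducible evidence: the bridge path is the default fallback.
    return ("bridge", None)
```

A single lucky run does not select the bootstrap path; any silence, close, or protocol error in a candidate's runs disqualifies that candidate.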
Architecture Options
Option A: Bootstrap-validated raw websocket path
Choose this only if the live validation harness produces repeatable evidence.
Resulting architecture
sg_claw_client
-> sg_claw service
-> bootstrap sequence executor
-> WsBrowserBackend
-> browserWsUrl
-> sgBrowser
Required conditions
- a reproducible bootstrap sequence exists
- the sequence yields status/callback traffic for real business actions
- the sequence can be encoded as a narrow service-side precondition layer
- the sequence does not require unowned browser UI/manual setup outside a documented contract
Allowed production changes if Option A wins
- add explicit bootstrap calls before first browser action
- persist validated session/context state needed by the real endpoint
- tighten request_url / target-page handling around the proven contract
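If Option A wins, the "explicit bootstrap calls before first browser action" could live in one narrow wrapper rather than being spread across call sites. The following is a sketch under assumptions: BootstrappingBackend, RecordingBackend, and the invoke(action, request_url) signature are hypothetical stand-ins written in Python for illustration, and the bootstrap steps passed in must be the ones proven by live transcripts, not guesses.

```python
from typing import List, Tuple


class RecordingBackend:
    """Stand-in for the real execution surface; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def invoke(self, action: str, request_url: str) -> int:
        self.calls.append((action, request_url))
        return 0  # placeholder status value, not a documented code


class BootstrappingBackend:
    """Runs a validated bootstrap sequence once, before the first action."""

    def __init__(self, inner, bootstrap_steps: List[str]) -> None:
        self._inner = inner
        self._bootstrap_steps = list(bootstrap_steps)
        self._ready = False

    def invoke(self, action: str, request_url: str) -> int:
        if not self._ready:
            for step in self._bootstrap_steps:
                self._inner.invoke(step, request_url)
            # Only marked ready after every step ran without raising,
            # so a failed bootstrap is retried on the next call.
            self._ready = True
        return self._inner.invoke(action, request_url)
```

Keeping the sequence in a single wrapper gives one place to change when the proven contract changes.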
Not allowed even if Option A wins
- guessing bootstrap steps without evidence
- silently sprinkling many setup calls into random locations
- broadening the compat/runtime API before the bootstrap contract is known
Option B: Bridge-first integration path
Choose this if live validation does not prove a workable raw websocket bootstrap.
Resulting architecture
sg_claw_client
-> sg_claw service
-> bridge adapter
-> browser host / FunctionsUI / BrowserAction / CommandRouter
-> sgBrowser page actions
Required conditions
- local docs/code show a stable supported bridge path
- raw websocket remains non-validated or only page-context-scoped
- the bridge surface can be wrapped behind the existing BrowserBackend abstraction or a sibling adapter without weakening pipe behavior
Allowed production changes if Option B wins
- add a new browser backend implementation that targets the real bridge surface
- redirect ws service/browser execution away from raw business frames
- preserve ws-native code only for tests, probes, or intentionally constrained cases
Not allowed even if Option B wins
- pretending the old raw-ws mainline still works “well enough”
- leaving the service path ambiguously split between two competing primary surfaces
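The "sibling adapter" idea and the rule against two competing primary surfaces can be illustrated together. This is a Python sketch for illustration only: the BrowserBackend class here is a hypothetical stand-in that borrows the name of the project's abstraction, and RawWsBackend, BridgeBackend, select_primary and the returned route strings are invented for the example.

```python
from abc import ABC, abstractmethod


class BrowserBackend(ABC):
    """Hypothetical stand-in for the project's backend abstraction."""

    @abstractmethod
    def invoke(self, action: str, request_url: str) -> str:
        """Execute one browser action and describe the route taken."""


class RawWsBackend(BrowserBackend):
    def invoke(self, action: str, request_url: str) -> str:
        return f"raw-ws:{action}@{request_url}"


class BridgeBackend(BrowserBackend):
    def invoke(self, action: str, request_url: str) -> str:
        return f"bridge:{action}@{request_url}"


def select_primary(decision: str) -> BrowserBackend:
    """Return exactly one primary backend for the recorded decision.

    Unknown decisions fail loudly instead of falling back, so the service
    path cannot drift into an ambiguous split between surfaces.
    """
    backends = {"bootstrap": RawWsBackend, "bridge": BridgeBackend}
    if decision not in backends:
        raise ValueError(f"unknown integration decision: {decision!r}")
    return backends[decision]()
```

Callers depend only on the abstract interface, so switching the primary surface is a one-line change at the selection point.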
Scope Guardrails for the Next Implementation Plan
The next implementation plan must obey these guardrails:
- One branch, one decision. Do not implement both architecture options at once.
- Evidence before code. If bootstrap is unproven, the next coding task is probe/validation tooling, not another speculative service/runtime refactor.
- Keep pipe untouched. src/lib.rs, pipe handshake, and the pipe BrowserPipeTool path remain behaviorally unchanged.
- Do not delete ws-native code prematurely. It still has value for protocol tests and validation tooling.
- Do not broaden success claims. Removing "invalid hmac seed" did not make real browser control work.
Testing Strategy
Stage 1: Evidence tooling tests
Add deterministic tests for the live-probe/validation harness so it can:
- send an ordered frame script
- record exact received frames
- report silence/timeout precisely
- expose transcript output suitable for comparing candidate bootstrap sequences
These tests use a fake websocket server, not sgBrowser.
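Transcript output that is "suitable for comparing" could be as simple as a stable line-per-event rendering fed to a text diff. The sketch below uses Python's standard difflib; the event tuple shape ("sent" / "recv" / "timeout", payload) and the function names render and compare are assumptions made for this example.

```python
import difflib
from typing import List, Sequence, Tuple

Event = Tuple[str, object]  # ("sent" | "recv" | "timeout", payload)


def render(events: Sequence[Event]) -> List[str]:
    """Render a transcript as one stable text line per event."""
    lines = []
    for kind, payload in events:
        if kind == "sent":
            lines.append(f">> {payload}")
        elif kind == "recv":
            lines.append(f"<< {payload}")
        elif kind == "timeout":
            lines.append(f"-- silence after {payload}s")
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return lines


def compare(name_a: str, a: Sequence[Event],
            name_b: str, b: Sequence[Event]) -> List[str]:
    """Return a unified diff between two rendered transcripts."""
    return list(difflib.unified_diff(
        render(a), render(b), fromfile=name_a, tofile=name_b, lineterm=""))
```

Two candidate sequences that behave identically produce an empty diff, which makes "nothing changed" easy to assert in a test.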
Stage 2: Live validation runs
Use the harness against the real endpoint with a fixed matrix of candidate sequences.
At minimum, compare:
- no bootstrap -> minimal action
- sgOpenAgent -> minimal action
- sgSetAuthInfo -> minimal action
- sgBrowserLogin -> minimal action
- sgBrowerserActiveTab -> minimal action
- combined documented bootstrap candidates -> minimal action
- alternate requesturl values representing:
  - about:blank
  - target page URL
  - a currently open page URL if known
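The fixed matrix can be generated rather than hand-written, which keeps runs comparable across sessions. The sketch below is illustrative: build_matrix and the dictionary layout are hypothetical, the order of calls in the "combined" candidate is an assumption (the source does not specify one), and sgBrowerserOpenPage is used as the minimal action only because it is one of the two named above.

```python
from itertools import product
from typing import Dict, List, Optional


def build_matrix(target_url: str,
                 open_page_url: Optional[str] = None,
                 minimal_action: str = "sgBrowerserOpenPage") -> List[dict]:
    """Cross every bootstrap candidate with every requesturl assumption."""
    bootstraps: Dict[str, List[str]] = {
        "no bootstrap": [],
        "sgOpenAgent": ["sgOpenAgent"],
        "sgSetAuthInfo": ["sgSetAuthInfo"],
        "sgBrowserLogin": ["sgBrowserLogin"],
        "sgBrowerserActiveTab": ["sgBrowerserActiveTab"],
        # Assumed ordering; adjust once the documented order is confirmed.
        "combined": ["sgSetAuthInfo", "sgBrowserLogin",
                     "sgOpenAgent", "sgBrowerserActiveTab"],
    }
    request_urls: Dict[str, str] = {
        "about:blank": "about:blank",
        "target page": target_url,
    }
    if open_page_url is not None:  # only included when actually known
        request_urls["open page"] = open_page_url

    return [
        {
            "name": f"{b_name} @ {u_name}",
            "requesturl": url,
            "calls": calls + [minimal_action],
        }
        for (b_name, calls), (u_name, url)
        in product(bootstraps.items(), request_urls.items())
    ]
```

With six bootstrap candidates this yields 12 runs, or 18 when a currently open page URL is supplied.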
Stage 3: Architecture-branch acceptance
If Option A wins:
- add one automated regression that proves the validated bootstrap sequence produces the first real status frame in a controlled integration test
- then continue with the narrowest production implementation plan
If Option B wins:
- write a new bridge-integration implementation plan before changing production code
- base all production tasks on the documented bridge surface
Acceptance Criteria for This Design Correction
This design correction is successful only if future work follows these rules:
- The repository has an explicit design document recording that raw ws-native direct control is not currently validated.
- The next engineering slice starts with validation or bridge selection, not another speculative runtime refactor.
- Any future claim that raw websocket is the supported production path must be backed by a reproducible live bootstrap transcript.
- If that evidence does not appear, the project pivots to the bridge path rather than continuing to guess.
Consequences
Positive
- stops further speculative coding against an unproven surface
- preserves useful ws-native work without over-committing to it
- creates a clean decision point for the next implementation branch
Trade-off
- this does not immediately unblock real browser control
- it intentionally inserts an evidence phase before more production changes
That trade-off is acceptable because the current failure mode is architectural uncertainty, not a missing two-line fix.