admin/claw

Files

木炎 956f0c2b68 feat: add generated scene skill platform hardening

2026-04-21 23:19:06 +08:00

9.4 KiB

Raw Blame History

102 Full Sweep Improvement Roadmap Plan

Date: 2026-04-19 Status: Draft Upstream Spec: docs/superpowers/specs/2026-04-19-102-full-sweep-improvement-roadmap-design.md Upstream Dry-Run Result: tests/fixtures/generated_scene/full_sweep_dry_run_2026-04-19.json Upstream Triage Result: tests/fixtures/generated_scene/full_sweep_dry_run_triage_2026-04-19.json

Plan Intent

Turn the 102 scene dry-run and triage findings into a governed improvement roadmap.

This plan is intentionally broad like the earlier 60-to-90 roadmap. It coordinates multiple bounded implementation tracks instead of starting isolated fixes from individual failures.

Baseline

Current measured baseline:

Metric	Count
Real-sample executed pass	5 / 102
Code-backed ledger coverage	23 / 102
Dry-run auto-pass	40 / 102
Dry-run actionable coverage	66 / 102

Current triage baseline:

Bucket	Count	Triage conclusion
Timeout	31	`19 timeout-unvalidated-source`, `8 timeout-large-source`, `4 timeout-known-family-sample`
Misclassified	5	all `route-overprefer-host-bridge`
No-report failure	25	all `readiness-before-report`
Bootstrap target	1	separate `bootstrap_target`

Scope Guardrails

do not add new scene families
do not update scene_execution_board_2026-04-18.json inside this roadmap
do not promote scenes directly from diagnostic or dry-run results
do not reopen completed real-sample passes except as regression checks
do not start G4/G5
do not implement full login recovery
do not implement full host runtime transport
do not implement local document attachment runtime
do not create unbounded micro-plans from a single failure

Workstreams

WS1 Timeout Diagnostics and Scan Budget
WS2 Routing Boundary Correction
WS3 Structured Fail-Closed Reporting
WS4 Follow-Up Sweep and Coverage Delta

Phase 0: Freeze Improvement Baseline

Objective

Freeze the dry-run and triage outputs as the only accepted inputs to this roadmap.

Tasks

freeze full_sweep_dry_run_2026-04-19.json
freeze full_sweep_dry_run_triage_2026-04-19.json
freeze the four headline metrics:
- 5/102 real-sample pass
- 23/102 code-backed ledger coverage
- 40/102 dry-run auto-pass
- 66/102 dry-run actionable coverage
freeze the problem buckets:
- 4 known-family timeouts
- 8 large-source timeouts
- 19 unvalidated-source timeouts
- 5 host-bridge over-preference cases
- 25 readiness-before-report failures
- 1 bootstrap-target failure

Deliverables

baseline statement
frozen blocker inventory
roadmap entry criteria

Acceptance Criteria

no additional scene is added to scope
no implementation starts before the baseline is frozen
dry-run and triage assets are treated as immutable inputs

Phase 1: Known-Family Timeout Diagnostics

Objective

Resolve the highest-priority ambiguity: known-family scenes that timed out in the full sweep.

Tasks

select only records labeled timeout-known-family-sample
capture source scale metrics and previous family context
run bounded diagnostic attempts if needed
classify each record as:
- known-family-rerun-pass
- known-family-source-scale-timeout
- known-family-generator-hotspot
- known-family-contract-blocked-after-long-run
- known-family-timeout-unresolved
publish diagnostic result

Deliverables

known-family timeout diagnostic JSON
known-family timeout diagnostic report

Acceptance Criteria

all 4 known-family timeout records are classified
no scene is promoted from diagnostic success
no generator logic is changed in the diagnostic step

Phase 2: Source-Scale and Scan-Budget Improvement

Objective

Reduce timeout noise caused by oversized source directories and obvious vendor/library files.

Tasks

analyze timeout-large-source and timeout-unvalidated-source
define source scan budget policy
define vendor/library ignore policy
implement only bounded source scanning or timeout reporting changes
verify no canonical or real-sample regression is introduced

Deliverables

source scan budget policy
bounded scan implementation if approved by Phase 1 evidence
timeout reporting regression tests

Acceptance Criteria

large source directories no longer dominate the full sweep by accidental vendor-file scanning
known-family samples are not made worse
archetype semantics are unchanged

Phase 3: Host-Bridge Route Over-Preference Correction

Objective

Correct or formally adjudicate the five cases where host_bridge_workflow over-absorbed G3 or G1-E expected scenes.

Tasks

select the 5 route-overprefer-host-bridge records
compare business-chain evidence against host-bridge evidence
define routing precedence rules for:
- G3 vs G6
- G1-E vs G6
implement bounded routing correction only if evidence supports it
preserve regressions for:
- G3 real-sample pass
- G1-E real-sample pass
- G6 real-sample pass
classify each case as:
- route-corrected-to-g3
- route-corrected-to-g1e
- board-expectation-reclassified
- valid-host-bridge-workflow
- route-conflict-unresolved

Deliverables

route over-preference correction report
routing regression tests
updated dry-run classification for the five fixed records

Acceptance Criteria

all 5 route conflicts are adjudicated
host_bridge_workflow no longer wins solely because host evidence exists
existing G6 pass remains stable
no broad routing rewrite is introduced

Phase 4: Structured Fail-Closed Reporting

Objective

Convert readiness-before-report failures into structured failure reports instead of process-level no-report failures.

Tasks

select the 25 readiness-before-report records
identify where generation exits before report emission
define a minimal failure-report schema for pre-package fail-closed
emit structured failure records with:
- inferred archetype
- failed gate
- blocker reason
- missing contract pieces
- stderr summary if any
keep scenes failing unless their contracts are actually complete

Deliverables

pre-report fail-closed schema
implementation of structured failure report emission
regression covering at least one paginated_enrichment, one local_doc_pipeline, one multi_mode_request, and one single_request_enrichment pre-report failure

Acceptance Criteria

no-report failures are reduced or eliminated as a category
failing scenes still fail closed
failure reasons become machine-readable
auto-pass count is not inflated by looser gates

Phase 5: Bootstrap Target Isolation

Objective

Keep the single bootstrap_target failure isolated and decide whether it belongs to later bootstrap normalization work.

Tasks

preserve 用户停电频次分析监测 as a separate bootstrap failure
inspect whether the failure is caused by missing target URL, domain mismatch, or unsupported bootstrap pattern
produce a bootstrap isolation note
do not implement login or bootstrap auto-recovery

Deliverables

bootstrap target isolation note
decision whether the case enters a later bootstrap-normalization roadmap

Acceptance Criteria

the bootstrap case does not pollute readiness-before-report work
no login recovery implementation is started

Phase 6: Follow-Up Full Sweep and Coverage Delta

Objective

Measure whether the bounded improvements improved generic coverage.

Tasks

rerun the fixed 102 scene full sweep with the same scene set
produce a new dry-run result
compare against the baseline:
- auto-pass delta
- actionable coverage delta
- timeout delta
- misclassification delta
- no-report delta
publish coverage delta report
decide whether to move to execution-board status sync or another bounded improvement cycle

Deliverables

follow-up full sweep JSON
coverage delta report
remaining blocker decision board

Acceptance Criteria

scene set remains exactly 102
baseline and follow-up are comparable
improvements are quantified, not assumed
no execution board status is changed automatically

Milestone Order

The order is fixed:

Phase 0: freeze baseline
Phase 1: known-family timeout diagnostics
Phase 2: source-scale and scan-budget improvement
Phase 3: host-bridge route over-preference correction
Phase 4: structured fail-closed reporting
Phase 5: bootstrap target isolation
Phase 6: follow-up full sweep and coverage delta

Do not start Phase 3 before Phase 1 is completed. Known-family timeout ambiguity affects the interpretation of current coverage.

Do not start Phase 6 before Phases 2-5 have either completed or been explicitly deferred with reasons.

Completion Criteria

This roadmap is complete when:

known-family timeouts are no longer mixed with generic timeout noise
host-bridge over-preference cases are adjudicated
readiness-before-report failures become structured fail-closed records
the bootstrap target case is isolated
a follow-up full sweep quantifies coverage delta
no new family is introduced as a shortcut around current blockers

Out of Plan

new family implementation
G4/G5 implementation
browser host runtime transport
login recovery
attachment/local document runtime
automatic execution board promotion

9.4 KiB Raw Blame History

102 Full Sweep Improvement Roadmap Plan

Plan Intent

Baseline

Scope Guardrails

Workstreams

Phase 0: Freeze Improvement Baseline

Objective

Tasks

Deliverables

Acceptance Criteria

Phase 1: Known-Family Timeout Diagnostics

Objective

Tasks

Deliverables

Acceptance Criteria

Phase 2: Source-Scale and Scan-Budget Improvement

Objective

Tasks

Deliverables

Acceptance Criteria

Phase 3: Host-Bridge Route Over-Preference Correction

Objective

Tasks

Deliverables

Acceptance Criteria

Phase 4: Structured Fail-Closed Reporting

Objective

Tasks

Deliverables

Acceptance Criteria

Phase 5: Bootstrap Target Isolation

Objective

Tasks

Deliverables

Acceptance Criteria

Phase 6: Follow-Up Full Sweep and Coverage Delta

Objective

Tasks

Deliverables

Acceptance Criteria

Milestone Order

Completion Criteria

Out of Plan

9.4 KiB

Raw Blame History