claw/docs/superpowers/specs/2026-04-19-102-full-sweep-improvement-roadmap-design.md

102 Full Sweep Improvement Roadmap Design

Date: 2026-04-19
Status: Draft
Upstream Dry-Run: docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-report.md
Upstream Triage: docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-triage-report.md

Design Intent

Use the full 102-scene dry-run and triage results to define a single improvement roadmap for generic scene -> skill coverage.

This roadmap is the post-triage equivalent of the earlier 60-to-90 roadmap. It is not a single bugfix plan. It is the governing design for turning measured dry-run blockers into bounded implementation tracks.

The design answers:

How do we move from 40/102 dry-run auto-pass and 66/102 actionable coverage toward a higher verified generic conversion rate without drifting into unbounded fixes?

Current Baseline

The current measured state is:

| Metric | Count |
| --- | --- |
| Real-sample executed pass | 5 / 102 |
| Code-backed ledger coverage | 23 / 102 |
| Dry-run auto-pass | 40 / 102 |
| Dry-run actionable coverage | 66 / 102 |

The non-pass triage state is:

| Bucket | Count | Triage conclusion |
| --- | --- | --- |
| Timeout | 31 | 19 timeout-unvalidated-source, 8 timeout-large-source, 4 timeout-known-family-sample |
| Misclassified | 5 | all route-overprefer-host-bridge |
| No-report failure | 25 | all readiness-before-report |
| Bootstrap target | 1 | separate bootstrap_target |
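
As a quick arithmetic check on the figures above, the following sketch computes each baseline metric as a percentage of the 102 scenes and confirms that the triage buckets add up. The counts are copied from this document. The dictionary keys and the `rate` helper are illustrative names, not part of any existing tooling, and the assertions only show that the counts are consistent with each non-pass scene sitting in exactly one bucket.

```python
TOTAL_SCENES = 102

# Counts copied from the baseline and triage figures in this document.
baseline = {
    "real_sample_executed_pass": 5,
    "code_backed_ledger_coverage": 23,
    "dry_run_auto_pass": 40,
    "dry_run_actionable_coverage": 66,
}
non_pass_buckets = {
    "timeout": 31,
    "misclassified": 5,
    "no_report_failure": 25,
    "bootstrap_target": 1,
}
timeout_labels = {
    "timeout-unvalidated-source": 19,
    "timeout-large-source": 8,
    "timeout-known-family-sample": 4,
}


def rate(count: int, total: int = TOTAL_SCENES) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


rates = {name: rate(count) for name, count in baseline.items()}

# The four buckets account for every scene that did not auto-pass (102 - 40 = 62).
non_pass_total = sum(non_pass_buckets.values())
assert non_pass_total == TOTAL_SCENES - baseline["dry_run_auto_pass"]

# The three timeout labels account for the whole timeout bucket (19 + 8 + 4 = 31).
assert sum(timeout_labels.values()) == non_pass_buckets["timeout"]

for name, value in rates.items():
    print(f"{name}: {value}%")
```

The same check can be rerun against the follow-up sweep so that a bucket count that no longer sums to the non-pass total is caught before any delta is reported.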

Problem Statement

The generic generator already auto-passes more scenes than the formal ledger coverage shows, but the result is not trustworthy enough to promote automatically because:

  1. known-family scenes still appear in the timeout bucket
  2. host_bridge_workflow can over-absorb scenes expected to remain G3 or G1-E
  3. many fail-closed cases terminate before a structured generation report exists
  4. timeout and no-report failures hide actionable blocker details

Roadmap Goal

Improve the measurable generic conversion pipeline, not by adding new families first, but by reducing ambiguity in the current failure surface.

The roadmap has four goals:

  1. make known-family timeout results explainable and repeatable
  2. correct or formally adjudicate host-bridge routing over-preference
  3. convert pre-report failures into structured fail-closed results
  4. rerun a bounded 102 sweep to measure coverage delta

Scope Guardrails

  1. do not add new scene families in this roadmap
  2. do not promote scenes directly from diagnostic runs
  3. do not update scene_execution_board_2026-04-18.json until a later explicit status-sync plan
  4. do not use one failure as justification for an unbounded rewrite
  5. do not reopen completed G1-E / G2 / G3 / G6 / G7 real-sample pass records unless they are part of a fixed regression check
  6. do not start G4 / G5
  7. do not implement login recovery, full host runtime, or attachment pipeline work in this roadmap

Workstreams

  1. WS1 Timeout and Source-Scale Diagnostics
  2. WS2 Host-Bridge Routing Boundary Correction
  3. WS3 Structured Fail-Closed Reporting
  4. WS4 Coverage Delta Sweep and Decision Board

Track A: Known-Family Timeout Diagnostics

Intent

Separate known-family timeout behavior from generic unvalidated-source timeout behavior.

Input

The 4 records labeled:

timeout-known-family-sample

Expected Output

Each known-family timeout gets one of:

  1. known-family-rerun-pass
  2. known-family-source-scale-timeout
  3. known-family-generator-hotspot
  4. known-family-contract-blocked-after-long-run
  5. known-family-timeout-unresolved

Design Constraint

A longer rerun success does not promote a scene. It only changes diagnostic classification.
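
One way to make this constraint hard to violate is to keep the diagnostic label and the promotion decision in separate types. The sketch below is a minimal illustration under that assumption: the enum values are the five labels listed for this track, while `TimeoutDiagnosis`, `promotes_scene`, and `parse_outcome` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class KnownFamilyTimeoutOutcome(Enum):
    """The five diagnostic labels listed for Track A."""

    RERUN_PASS = "known-family-rerun-pass"
    SOURCE_SCALE_TIMEOUT = "known-family-source-scale-timeout"
    GENERATOR_HOTSPOT = "known-family-generator-hotspot"
    CONTRACT_BLOCKED_AFTER_LONG_RUN = "known-family-contract-blocked-after-long-run"
    UNRESOLVED = "known-family-timeout-unresolved"


@dataclass(frozen=True)
class TimeoutDiagnosis:
    """Hypothetical record type: one diagnosis per known-family timeout."""

    scene_id: str
    outcome: KnownFamilyTimeoutOutcome

    @property
    def promotes_scene(self) -> bool:
        # Design constraint: a diagnostic rerun only changes classification.
        # Even RERUN_PASS never promotes a scene.
        return False


def parse_outcome(label: str) -> KnownFamilyTimeoutOutcome:
    """Map a label string to the enum; raises ValueError for unknown labels."""
    return KnownFamilyTimeoutOutcome(label)


diagnosis = TimeoutDiagnosis("example-scene", parse_outcome("known-family-rerun-pass"))
print(diagnosis.outcome.value, diagnosis.promotes_scene)
```

Rejecting unknown labels at parse time keeps ad hoc classifications from entering the triage record unnoticed.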

Track B: Timeout Source-Scale Policy

Intent

Create a bounded input filtering and scan-budget policy for large source directories without changing family semantics.

Input

The timeout labels:

  1. timeout-large-source
  2. timeout-unvalidated-source

Expected Output

  1. source file selection policy
  2. large vendor/library ignore list policy
  3. scan-budget decision table
  4. timeout reporting shape

Design Constraint

This track may improve scan boundaries, but it must not change archetype semantics.
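
A source selection policy with an ignore list and a scan budget could look roughly like the sketch below. Everything here is an assumption made for illustration: the `ScanPolicy` and `ScanSelection` types, the glob patterns, and the budget numbers are placeholders, and the real policy and decision table are outputs of this track. The sketch only decides which files are scanned. It does not touch archetype inference.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Tuple


@dataclass(frozen=True)
class ScanPolicy:
    """Hypothetical policy; patterns and budgets are placeholder values."""

    ignore_globs: Tuple[str, ...] = ("node_modules/*", "vendor/*", "*.min.js")
    max_files: int = 200
    max_total_bytes: int = 5_000_000


@dataclass
class ScanSelection:
    selected: List[str]
    ignored: List[str]
    over_budget: List[str]

    @property
    def truncated(self) -> bool:
        # True when the budget excluded at least one otherwise eligible file.
        return bool(self.over_budget)


def select_sources(files: List[Tuple[str, int]], policy: ScanPolicy) -> ScanSelection:
    """Split (relative_path, size_in_bytes) pairs into selected, ignored, over-budget.

    Files are visited in sorted path order so the result is repeatable.
    """
    selected: List[str] = []
    ignored: List[str] = []
    over_budget: List[str] = []
    total_bytes = 0
    for path, size in sorted(files):
        if any(fnmatchcase(path, pattern) for pattern in policy.ignore_globs):
            ignored.append(path)
        elif len(selected) >= policy.max_files or total_bytes + size > policy.max_total_bytes:
            over_budget.append(path)
        else:
            selected.append(path)
            total_bytes += size
    return ScanSelection(selected, ignored, over_budget)


example_files = [
    ("src/main.js", 100),
    ("node_modules/lodash/index.js", 999),
    ("static/app.min.js", 50),
    ("src/big.js", 1000),
    ("src/util.js", 10),
]
selection = select_sources(example_files, ScanPolicy(max_files=10, max_total_bytes=150))
print(selection.selected, selection.ignored, selection.over_budget)
```

Recording the ignored and over-budget paths, rather than silently dropping them, gives the timeout reporting shape something concrete to include.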

Track C: Host-Bridge Route Over-Preference Correction

Intent

Prevent host_bridge_workflow from absorbing scenes that should remain G3 or G1-E when business-chain evidence is stronger.

Input

The 5 records labeled:

route-overprefer-host-bridge

Expected Output

Each misclassification gets one of:

  1. route-corrected-to-g3
  2. route-corrected-to-g1e
  3. board-expectation-reclassified
  4. valid-host-bridge-workflow
  5. route-conflict-unresolved

Design Constraint

This track must preserve the already-passed G6 real sample and must not degrade G3 or G1-E canonical tests.
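
The constraint can be expressed as a fixed regression check that runs before and after any routing change. The sketch below assumes routing results are available as a mapping from scene id to route label. The scene ids, the route labels used as values, and the `route_regressions` helper are placeholders introduced for illustration, not names from the actual test suite.

```python
from typing import Dict, Optional, Tuple

# Hypothetical pinned expectations. Scene ids and labels are placeholders
# standing in for the G6 real sample and the G3 / G1-E canonical tests.
PINNED_ROUTES: Dict[str, str] = {
    "g6-real-sample": "G6",
    "g3-canonical": "G3",
    "g1e-canonical": "G1-E",
}


def route_regressions(
    pinned: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {scene_id: (expected, observed)} for every pinned scene that moved.

    A pinned scene that is missing from `actual` is reported with observed=None.
    """
    regressions: Dict[str, Tuple[str, Optional[str]]] = {}
    for scene_id, expected in pinned.items():
        observed = actual.get(scene_id)
        if observed != expected:
            regressions[scene_id] = (expected, observed)
    return regressions


# Example: a routing change that wrongly pulls the G3 canonical scene
# into host_bridge_workflow is reported as a regression.
after_change = dict(PINNED_ROUTES)
after_change["g3-canonical"] = "host_bridge_workflow"
print(route_regressions(PINNED_ROUTES, after_change))
```

An empty result is the pass condition. Any non-empty result should block the routing change until the pinned scene is restored or formally reclassified.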

Track D: Readiness-Before-Report Structured Fail-Closed

Intent

Convert "generator failed without generation report" outcomes into structured, machine-readable fail-closed results.

Input

The 25 records labeled:

readiness-before-report

Expected Output

Each case produces a generation report or equivalent dry-run failure record with:

  1. inferred archetype
  2. blocker stage
  3. missing contract pieces
  4. failed gate name
  5. actionable reason

Design Constraint

This track should not make failing scenes pass. It should make failures explainable.
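
A minimal sketch of such a failure record follows, assuming the generator can raise a typed error at the gate that stops it. The five report fields come from the list above. `FailClosedRecord`, `GateFailure`, `generate_or_fail_closed`, and all example values are hypothetical names and data introduced for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class FailClosedRecord:
    """Hypothetical dry-run failure record carrying the five required fields."""

    scene_id: str
    inferred_archetype: Optional[str]
    blocker_stage: str
    missing_contract_pieces: Tuple[str, ...]
    failed_gate_name: str
    actionable_reason: str
    status: str = "fail-closed"

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


class GateFailure(Exception):
    """Hypothetical error raised when a readiness gate stops generation."""

    def __init__(self, stage, gate, missing, reason, archetype=None):
        super().__init__(reason)
        self.stage = stage
        self.gate = gate
        self.missing = tuple(missing)
        self.reason = reason
        self.archetype = archetype


def generate_or_fail_closed(scene_id: str, generate: Callable[[str], object]):
    """Run the generator; turn a gate failure into a structured record.

    The scene still fails. Only the shape of the failure changes.
    """
    try:
        return generate(scene_id)
    except GateFailure as exc:
        return FailClosedRecord(
            scene_id=scene_id,
            inferred_archetype=exc.archetype,
            blocker_stage=exc.stage,
            missing_contract_pieces=exc.missing,
            failed_gate_name=exc.gate,
            actionable_reason=exc.reason,
        )


def _failing_generator(scene_id: str):
    # Example values only; real stage and gate names are not defined here.
    raise GateFailure("readiness", "example-gate", ["example-piece"], "example reason")


record = generate_or_fail_closed("example-scene", _failing_generator)
print(record.to_json())
```

Because the wrapper only catches the typed gate error, unexpected exceptions still surface as crashes rather than being reported as orderly fail-closed results.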

Track E: Bootstrap Target Isolation

Intent

Keep the single bootstrap_target failure separate so it does not pollute the no-report or route-correction work.

Input

The 1 bootstrap target failure:

用户停电频次分析监测 (User Power Outage Frequency Analysis and Monitoring)

Expected Output

  1. isolated bootstrap failure note
  2. decision whether it belongs to later bootstrap normalization work

Design Constraint

No bootstrap auto-recovery or login work is included in this roadmap.

Track F: Coverage Delta Sweep

Intent

After bounded improvements, rerun a comparable 102 sweep and compare against the baseline.

Input

  1. baseline dry-run result
  2. updated generator after approved tracks
  3. same 102 scene board

Expected Output

  1. new dry-run result
  2. coverage delta report
  3. category movement table
  4. decision board for remaining blockers

Design Constraint

The rerun must be comparable to the baseline. It cannot silently change the scene set.
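
The comparability rule and the category movement table can be sketched as follows, assuming each sweep result is reduced to a mapping from scene id to a single category label. The `coverage_delta` function and the category labels in the example are illustrative assumptions, not the format of the actual dry-run reports.

```python
from collections import Counter
from typing import Dict, Tuple


def coverage_delta(
    baseline: Dict[str, str], rerun: Dict[str, str]
) -> Tuple[Dict[str, int], Counter]:
    """Compare two sweeps over the same scene set.

    Returns (delta, movement): `delta` maps category -> count change, and
    `movement` counts (baseline_category, rerun_category) pairs per scene.
    Raises ValueError if the scene sets differ, so the set cannot change silently.
    """
    if baseline.keys() != rerun.keys():
        only_baseline = sorted(baseline.keys() - rerun.keys())
        only_rerun = sorted(rerun.keys() - baseline.keys())
        raise ValueError(
            f"scene sets differ: only in baseline={only_baseline}, only in rerun={only_rerun}"
        )
    before = Counter(baseline.values())
    after = Counter(rerun.values())
    categories = sorted(set(before) | set(after))
    delta = {category: after[category] - before[category] for category in categories}
    movement = Counter((baseline[scene], rerun[scene]) for scene in baseline)
    return delta, movement


# Toy example with three scenes and placeholder category labels.
baseline_result = {"a": "auto-pass", "b": "timeout", "c": "no-report"}
rerun_result = {"a": "auto-pass", "b": "auto-pass", "c": "structured-fail-closed"}
delta, movement = coverage_delta(baseline_result, rerun_result)
print(delta)
print(dict(movement))
```

The movement counter is what distinguishes real improvement from reshuffling: a scene that moves from a no-report failure to a structured fail-closed result changes category without changing the auto-pass count.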

Success Criteria

This roadmap succeeds when:

  1. all known-family timeouts are separated from unvalidated timeout noise
  2. all five host-bridge over-preference cases are adjudicated
  3. no-report failures become structured fail-closed outputs
  4. a follow-up full sweep shows measurable improvement or a clearly explained plateau
  5. no new family is introduced to mask existing failure categories

Out of Scope

  1. new G4/G5 implementation
  2. full login recovery
  3. browser host runtime transport implementation
  4. local document attachment pipeline
  5. automatic scene promotion into the execution board
  6. full manual validation of all 102 generated skills