# 102 Full Sweep Improvement Roadmap Design
> Date: 2026-04-19
> Status: Draft
> Upstream Dry-Run: `docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-report.md`
> Upstream Triage: `docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-triage-report.md`
## Design Intent
Use the full `102`-scene dry-run and triage results to define a single improvement roadmap for generic `scene -> skill` coverage.
This roadmap is the post-triage equivalent of the earlier `60-to-90` roadmap. It is not a single bugfix plan. It is the governing design for turning measured dry-run blockers into bounded implementation tracks.
The design answers:
`how do we move from 40/102 dry-run auto-pass and 66/102 actionable coverage toward a higher verified generic conversion rate without drifting into unbounded fixes?`
## Current Baseline
The current measured state is:
| Metric | Count |
| --- | ---: |
| Real-sample executed pass | 5 / 102 |
| Code-backed ledger coverage | 23 / 102 |
| Dry-run auto-pass | 40 / 102 |
| Dry-run actionable coverage | 66 / 102 |
The non-pass triage state is:
| Bucket | Count | Triage conclusion |
| --- | ---: | --- |
| Timeout | 31 | `19 timeout-unvalidated-source`, `8 timeout-large-source`, `4 timeout-known-family-sample` |
| Misclassified | 5 | all `route-overprefer-host-bridge` |
| No-report failure | 25 | all `readiness-before-report` |
| Bootstrap target | 1 | separate `bootstrap_target` |
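
The two tables can be cross-checked with simple arithmetic. The sketch below copies the counts from the tables and assumes the four non-pass buckets, together with the dry-run auto-pass count, partition the `102` scenes. The constant and function names are illustrative only and do not come from the generator.

```python
# Counts copied from the baseline and triage tables above.
TOTAL_SCENES = 102
DRY_RUN_AUTO_PASS = 40

NON_PASS_BUCKETS = {
    "timeout": 31,
    "misclassified": 5,
    "no_report_failure": 25,
    "bootstrap_target": 1,
}

TIMEOUT_LABELS = {
    "timeout-unvalidated-source": 19,
    "timeout-large-source": 8,
    "timeout-known-family-sample": 4,
}


def check_baseline() -> int:
    """Return the non-pass total after checking the tables agree."""
    non_pass = sum(NON_PASS_BUCKETS.values())
    # Assumption: auto-pass plus the four non-pass buckets covers every scene.
    assert DRY_RUN_AUTO_PASS + non_pass == TOTAL_SCENES
    # The three timeout labels must add up to the timeout bucket.
    assert sum(TIMEOUT_LABELS.values()) == NON_PASS_BUCKETS["timeout"]
    return non_pass


non_pass_total = check_baseline()
print(f"non-pass scenes: {non_pass_total} / {TOTAL_SCENES}")
```

Running the same check against a later sweep would catch a silently changed scene count before any delta is reported.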
## Problem Statement
The generic generator already auto-passes more scenes in dry-run (`40 / 102`) than the formal code-backed ledger coverage shows (`23 / 102`), but the result is not trustworthy enough to promote automatically because:
1. known-family scenes still appear in the timeout bucket
2. `host_bridge_workflow` can over-absorb scenes expected to remain `G3` or `G1-E`
3. many fail-closed cases terminate before a structured generation report exists
4. timeout and no-report failures hide actionable blocker details
## Roadmap Goal
Improve the measurable generic conversion pipeline by first reducing ambiguity in the current failure surface, rather than by adding new families.
The roadmap has four goals:
1. make known-family timeout results explainable and repeatable
2. correct or formally adjudicate host-bridge routing over-preference
3. convert pre-report failures into structured fail-closed results
4. rerun a bounded `102` sweep to measure coverage delta
## Scope Guardrails
1. do not add new scene families in this roadmap
2. do not promote scenes directly from diagnostic runs
3. do not update `scene_execution_board_2026-04-18.json` until a later explicit status-sync plan
4. do not use one failure as justification for an unbounded rewrite
5. do not reopen completed `G1-E / G2 / G3 / G6 / G7` real-sample pass records unless they are part of a fixed regression check
6. do not start `G4 / G5`
7. do not implement login recovery, full host runtime, or attachment pipeline work in this roadmap
## Workstreams
1. `WS1` Timeout and Source-Scale Diagnostics
2. `WS2` Host-Bridge Routing Boundary Correction
3. `WS3` Structured Fail-Closed Reporting
4. `WS4` Coverage Delta Sweep and Decision Board
## Track A: Known-Family Timeout Diagnostics
### Intent
Separate known-family timeout behavior from generic unvalidated-source timeout behavior.
### Input
The `4` records labeled:
`timeout-known-family-sample`
### Expected Output
Each known-family timeout gets one of:
1. `known-family-rerun-pass`
2. `known-family-source-scale-timeout`
3. `known-family-generator-hotspot`
4. `known-family-contract-blocked-after-long-run`
5. `known-family-timeout-unresolved`
### Design Constraint
A longer rerun success does not promote a scene. It only changes diagnostic classification.
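
One way to keep this constraint visible in code is to model the diagnostic label and the promotion state as separate fields. The following is a minimal sketch, not the generator's actual data model: `TimeoutDiagnosis`, `record_rerun`, and the `promoted` field are hypothetical names, and only the five label strings come from this document.

```python
from dataclasses import dataclass, replace
from enum import Enum


class KnownFamilyTimeoutOutcome(Enum):
    # Label strings taken from the Expected Output list above.
    RERUN_PASS = "known-family-rerun-pass"
    SOURCE_SCALE_TIMEOUT = "known-family-source-scale-timeout"
    GENERATOR_HOTSPOT = "known-family-generator-hotspot"
    CONTRACT_BLOCKED = "known-family-contract-blocked-after-long-run"
    UNRESOLVED = "known-family-timeout-unresolved"


@dataclass(frozen=True)
class TimeoutDiagnosis:
    """Hypothetical diagnostic record for one known-family timeout."""
    scene_id: str
    outcome: KnownFamilyTimeoutOutcome
    promoted: bool = False  # diagnostics never set this to True


def record_rerun(diagnosis: TimeoutDiagnosis,
                 outcome: KnownFamilyTimeoutOutcome) -> TimeoutDiagnosis:
    """Update only the diagnostic label; promotion state is left untouched."""
    return replace(diagnosis, outcome=outcome)


before = TimeoutDiagnosis("scene-001", KnownFamilyTimeoutOutcome.UNRESOLVED)
after = record_rerun(before, KnownFamilyTimeoutOutcome.RERUN_PASS)
print(after.outcome.value, after.promoted)
```

Because `record_rerun` has no way to change `promoted`, a successful longer rerun can only reclassify the record.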
## Track B: Timeout Source-Scale Policy
### Intent
Create a bounded input filtering and scan-budget policy for large source directories without changing family semantics.
### Input
The timeout labels:
1. `timeout-large-source`
2. `timeout-unvalidated-source`
### Expected Output
1. source file selection policy
2. large vendor/library ignore list policy
3. scan-budget decision table
4. timeout reporting shape
### Design Constraint
This track may tighten scan boundaries, but it must not change archetype semantics.
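
As an illustration of what a bounded selection policy could look like, the sketch below combines an ignore list with a file-count budget. It is written under assumptions: `ScanPolicy`, `ScanResult`, `select_sources`, the directory names, and the budget value are all hypothetical, and the real policy would be defined by this track's decision table.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ScanPolicy:
    """Hypothetical policy: directories to ignore and a file-count budget."""
    ignored_dirs: FrozenSet[str]
    max_files: int


@dataclass
class ScanResult:
    selected: List[str]
    skipped_ignored: int
    budget_exhausted: bool


def select_sources(paths: Iterable[str], policy: ScanPolicy) -> ScanResult:
    """Pick source files in sorted order, honoring the ignore list and budget."""
    selected: List[str] = []
    skipped = 0
    exhausted = False
    for path in sorted(paths):
        parent_dirs = PurePosixPath(path).parts[:-1]
        if any(part in policy.ignored_dirs for part in parent_dirs):
            skipped += 1  # vendor/library content is not scanned
            continue
        if len(selected) >= policy.max_files:
            exhausted = True  # report the cut-off instead of timing out
            continue
        selected.append(path)
    return ScanResult(selected, skipped, exhausted)


policy = ScanPolicy(ignored_dirs=frozenset({"vendor", "node_modules"}), max_files=2)
result = select_sources(
    [
        "src/app.js",
        "src/vendor/jquery.min.js",
        "node_modules/lib/index.js",
        "src/report.js",
        "src/util.js",
    ],
    policy,
)
print(result)
```

The sketch only filters which files are read. It does not touch how a scene is classified, which keeps it within this track's constraint.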
## Track C: Host-Bridge Route Over-Preference Correction
### Intent
Prevent `host_bridge_workflow` from absorbing scenes that should remain `G3` or `G1-E` when business-chain evidence is stronger.
### Input
The `5` records labeled:
`route-overprefer-host-bridge`
### Expected Output
Each misclassification gets one of:
1. `route-corrected-to-g3`
2. `route-corrected-to-g1e`
3. `board-expectation-reclassified`
4. `valid-host-bridge-workflow`
5. `route-conflict-unresolved`
### Design Constraint
This track must preserve the already-passed `G6` real sample and must not degrade `G3` or `G1-E` canonical tests.
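
The adjudication labels can be sketched as a small decision function. This is only an illustration under stated assumptions: the integer evidence scores, the `adjudicate` function, and the tie-handling rule are hypothetical, and the document does not define how evidence strength is measured. `board-expectation-reclassified` is deliberately never returned here, on the assumption that changing a board expectation is a manual decision.

```python
from enum import Enum


class RouteAdjudication(Enum):
    # Label strings taken from the Expected Output list above.
    CORRECTED_TO_G3 = "route-corrected-to-g3"
    CORRECTED_TO_G1E = "route-corrected-to-g1e"
    BOARD_RECLASSIFIED = "board-expectation-reclassified"
    VALID_HOST_BRIDGE = "valid-host-bridge-workflow"
    UNRESOLVED = "route-conflict-unresolved"


# Maps the board's expected archetype to the matching correction label.
_CORRECTIONS = {
    "G3": RouteAdjudication.CORRECTED_TO_G3,
    "G1-E": RouteAdjudication.CORRECTED_TO_G1E,
}


def adjudicate(expected: str, business_chain_score: int,
               host_bridge_score: int) -> RouteAdjudication:
    """Hypothetical rule: the stronger evidence wins, ties stay unresolved."""
    if expected not in _CORRECTIONS:
        return RouteAdjudication.UNRESOLVED
    if business_chain_score > host_bridge_score:
        return _CORRECTIONS[expected]
    if host_bridge_score > business_chain_score:
        return RouteAdjudication.VALID_HOST_BRIDGE
    return RouteAdjudication.UNRESOLVED


print(adjudicate("G3", business_chain_score=3, host_bridge_score=1).value)
```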
## Track D: Readiness-Before-Report Structured Fail-Closed
### Intent
Convert `generator failed without generation report` outcomes into structured, machine-readable fail-closed results.
### Input
The `25` records labeled:
`readiness-before-report`
### Expected Output
Each case produces a generation report or equivalent dry-run failure record with:
1. inferred archetype
2. blocker stage
3. missing contract pieces
4. failed gate name
5. actionable reason
### Design Constraint
This track should not make failing scenes pass. It should make failures explainable.
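
A failure record carrying the five fields listed above might look like the following sketch. The field names mirror the list, but `ReadinessError`, `DryRunFailureRecord`, `run_fail_closed`, and the sample gate and stage values are assumptions made for illustration, not the generator's actual API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional


class ReadinessError(Exception):
    """Hypothetical exception raised when a readiness gate fails."""

    def __init__(self, stage: str, gate: str, missing: List[str],
                 reason: str, archetype: Optional[str] = None) -> None:
        super().__init__(reason)
        self.stage = stage
        self.gate = gate
        self.missing = missing
        self.reason = reason
        self.archetype = archetype


@dataclass(frozen=True)
class DryRunFailureRecord:
    scene_id: str
    inferred_archetype: Optional[str]
    blocker_stage: str
    missing_contract_pieces: List[str]
    failed_gate: str
    actionable_reason: str
    status: str = "fail-closed"


def run_fail_closed(scene_id: str, generate: Callable[[], dict]) -> dict:
    """Run a generator; on a readiness failure, return a structured record."""
    try:
        return {"scene_id": scene_id, "status": "pass", "report": generate()}
    except ReadinessError as err:
        # The scene still fails; the failure is now machine-readable.
        return asdict(DryRunFailureRecord(
            scene_id=scene_id,
            inferred_archetype=err.archetype,
            blocker_stage=err.stage,
            missing_contract_pieces=list(err.missing),
            failed_gate=err.gate,
            actionable_reason=err.reason,
        ))


def _failing_generator() -> dict:
    # Sample values are invented for the example.
    raise ReadinessError(stage="readiness", gate="contract-complete",
                         missing=["output_schema"],
                         reason="output schema not declared", archetype="G3")


record = run_fail_closed("scene-042", _failing_generator)
print(json.dumps(record, indent=2))
```

The wrapper changes what is recorded about a failure, not whether the scene fails, which matches the constraint above.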
## Track E: Bootstrap Target Isolation
### Intent
Keep the single `bootstrap_target` failure separate so it does not pollute the no-report or route-correction work.
### Input
The `1` bootstrap target failure:
`用户停电频次分析监测` (roughly, "user power-outage frequency analysis and monitoring")
### Expected Output
1. isolated bootstrap failure note
2. decision on whether it belongs to later bootstrap normalization work
### Design Constraint
No bootstrap auto-recovery or login work is included in this roadmap.
## Track F: Coverage Delta Sweep
### Intent
After bounded improvements, rerun a comparable `102` sweep and compare against the baseline.
### Input
1. baseline dry-run result
2. updated generator after approved tracks
3. same `102` scene board
### Expected Output
1. new dry-run result
2. coverage delta report
3. category movement table
4. decision board for remaining blockers
### Design Constraint
The rerun must be comparable to the baseline. It cannot silently change the scene set.
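
The comparability rule and the category movement table can be sketched together. The example below assumes each sweep result is a mapping from scene identifier to category. The `coverage_delta` function, the scene identifiers, and the category names in the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def coverage_delta(
    baseline: Dict[str, str], rerun: Dict[str, str]
) -> Tuple[Dict[str, int], Counter]:
    """Compare two sweeps over the same scenes.

    Returns the per-category count change and a counter of
    (old_category, new_category) moves for scenes that changed.
    """
    if baseline.keys() != rerun.keys():
        # A changed scene set would make the delta meaningless.
        raise ValueError("rerun scene set differs from baseline")
    before = Counter(baseline.values())
    after = Counter(rerun.values())
    delta = {
        category: after[category] - before[category]
        for category in sorted(set(before) | set(after))
    }
    movement = Counter(
        (baseline[scene], rerun[scene])
        for scene in baseline
        if baseline[scene] != rerun[scene]
    )
    return delta, movement


baseline = {"s1": "auto-pass", "s2": "timeout", "s3": "no-report"}
rerun = {"s1": "auto-pass", "s2": "auto-pass", "s3": "fail-closed"}
delta, movement = coverage_delta(baseline, rerun)
print(delta)
print(dict(movement))
```

Raising on a mismatched scene set makes a silent change to the board an error rather than a distorted delta.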
## Success Criteria
This roadmap succeeds when:
1. all known-family timeouts are separated from unvalidated timeout noise
2. all five host-bridge over-preference cases are adjudicated
3. no-report failures become structured fail-closed outputs
4. a follow-up full sweep shows measurable improvement or a clearly explained plateau
5. no new family is introduced to mask existing failure categories
## Out of Scope
1. new `G4/G5` implementation
2. full login recovery
3. browser host runtime transport implementation
4. local document attachment pipeline
5. automatic scene promotion into the execution board
6. full manual validation of all `102` generated skills