Files
claw/docs/superpowers/plans/2026-04-19-102-full-sweep-improvement-roadmap-plan.md

306 lines
9.4 KiB
Markdown

# 102 Full Sweep Improvement Roadmap Plan
> Date: 2026-04-19
> Status: Draft
> Upstream Spec: `docs/superpowers/specs/2026-04-19-102-full-sweep-improvement-roadmap-design.md`
> Upstream Dry-Run Result: `tests/fixtures/generated_scene/full_sweep_dry_run_2026-04-19.json`
> Upstream Triage Result: `tests/fixtures/generated_scene/full_sweep_dry_run_triage_2026-04-19.json`
## Plan Intent
Turn the `102` scene dry-run and triage findings into a governed improvement roadmap.
This plan is intentionally broad like the earlier `60-to-90` roadmap. It coordinates multiple bounded implementation tracks instead of starting isolated fixes from individual failures.
## Baseline
Current measured baseline:
| Metric | Count |
| --- | ---: |
| Real-sample executed pass | 5 / 102 |
| Code-backed ledger coverage | 23 / 102 |
| Dry-run auto-pass | 40 / 102 |
| Dry-run actionable coverage | 66 / 102 |
Current triage baseline:
| Bucket | Count | Triage conclusion |
| --- | ---: | --- |
| Timeout | 31 | `19 timeout-unvalidated-source`, `8 timeout-large-source`, `4 timeout-known-family-sample` |
| Misclassified | 5 | all `route-overprefer-host-bridge` |
| No-report failure | 25 | all `readiness-before-report` |
| Bootstrap target | 1 | separate `bootstrap_target` |
## Scope Guardrails
1. do not add new scene families
2. do not update `scene_execution_board_2026-04-18.json` inside this roadmap
3. do not promote scenes directly from diagnostic or dry-run results
4. do not reopen completed real-sample passes except as regression checks
5. do not start `G4/G5`
6. do not implement full login recovery
7. do not implement full host runtime transport
8. do not implement local document attachment runtime
9. do not create unbounded micro-plans from a single failure
## Workstreams
1. `WS1` Timeout Diagnostics and Scan Budget
2. `WS2` Routing Boundary Correction
3. `WS3` Structured Fail-Closed Reporting
4. `WS4` Follow-Up Sweep and Coverage Delta
## Phase 0: Freeze Improvement Baseline
### Objective
Freeze the dry-run and triage outputs as the only accepted inputs to this roadmap.
### Tasks
1. freeze `full_sweep_dry_run_2026-04-19.json`
2. freeze `full_sweep_dry_run_triage_2026-04-19.json`
3. freeze the four headline metrics:
- `5/102` real-sample pass
- `23/102` code-backed ledger coverage
- `40/102` dry-run auto-pass
- `66/102` dry-run actionable coverage
4. freeze the problem buckets:
- `4` known-family timeouts
- `8` large-source timeouts
- `19` unvalidated-source timeouts
- `5` host-bridge over-preference cases
- `25` readiness-before-report failures
- `1` bootstrap-target failure
### Deliverables
1. baseline statement
2. frozen blocker inventory
3. roadmap entry criteria
### Acceptance Criteria
1. no additional scene is added to scope
2. no implementation starts before the baseline is frozen
3. dry-run and triage assets are treated as immutable inputs
## Phase 1: Known-Family Timeout Diagnostics
### Objective
Resolve the highest-priority ambiguity: known-family scenes that timed out in the full sweep.
### Tasks
1. select only records labeled `timeout-known-family-sample`
2. capture source scale metrics and previous family context
3. run bounded diagnostic attempts if needed
4. classify each record as:
- `known-family-rerun-pass`
- `known-family-source-scale-timeout`
- `known-family-generator-hotspot`
- `known-family-contract-blocked-after-long-run`
- `known-family-timeout-unresolved`
5. publish diagnostic result
### Deliverables
1. known-family timeout diagnostic JSON
2. known-family timeout diagnostic report
### Acceptance Criteria
1. all `4` known-family timeout records are classified
2. no scene is promoted from diagnostic success
3. no generator logic is changed in the diagnostic step
## Phase 2: Source-Scale and Scan-Budget Improvement
### Objective
Reduce timeout noise caused by oversized source directories and obvious vendor/library files.
### Tasks
1. analyze `timeout-large-source` and `timeout-unvalidated-source`
2. define source scan budget policy
3. define vendor/library ignore policy
4. implement only bounded source scanning or timeout reporting changes
5. verify no canonical or real-sample regression is introduced
### Deliverables
1. source scan budget policy
2. bounded scan implementation if approved by Phase 1 evidence
3. timeout reporting regression tests
### Acceptance Criteria
1. large source directories no longer dominate the full sweep by accidental vendor-file scanning
2. known-family samples are not made worse
3. archetype semantics are unchanged
## Phase 3: Host-Bridge Route Over-Preference Correction
### Objective
Correct or formally adjudicate the five cases where `host_bridge_workflow` over-absorbed `G3` or `G1-E` expected scenes.
### Tasks
1. select the `5` `route-overprefer-host-bridge` records
2. compare business-chain evidence against host-bridge evidence
3. define routing precedence rules for:
- `G3` vs `G6`
- `G1-E` vs `G6`
4. implement bounded routing correction only if evidence supports it
5. preserve regressions for:
- `G3` real-sample pass
- `G1-E` real-sample pass
- `G6` real-sample pass
6. classify each case as:
- `route-corrected-to-g3`
- `route-corrected-to-g1e`
- `board-expectation-reclassified`
- `valid-host-bridge-workflow`
- `route-conflict-unresolved`
### Deliverables
1. route over-preference correction report
2. routing regression tests
3. updated dry-run classification for the five fixed records
### Acceptance Criteria
1. all `5` route conflicts are adjudicated
2. `host_bridge_workflow` no longer wins solely because host evidence exists
3. existing `G6` pass remains stable
4. no broad routing rewrite is introduced
## Phase 4: Structured Fail-Closed Reporting
### Objective
Convert `readiness-before-report` failures into structured failure reports instead of process-level no-report failures.
### Tasks
1. select the `25` `readiness-before-report` records
2. identify where generation exits before report emission
3. define a minimal failure-report schema for pre-package fail-closed
4. emit structured failure records with:
- inferred archetype
- failed gate
- blocker reason
- missing contract pieces
- stderr summary if any
5. keep scenes failing unless their contracts are actually complete
### Deliverables
1. pre-report fail-closed schema
2. implementation of structured failure report emission
3. regression covering at least one `paginated_enrichment`, one `local_doc_pipeline`, one `multi_mode_request`, and one `single_request_enrichment` pre-report failure
### Acceptance Criteria
1. no-report failures are reduced or eliminated as a category
2. failing scenes still fail closed
3. failure reasons become machine-readable
4. auto-pass count is not inflated by looser gates
## Phase 5: Bootstrap Target Isolation
### Objective
Keep the single `bootstrap_target` failure isolated and decide whether it belongs to later bootstrap normalization work.
### Tasks
1. preserve `用户停电频次分析监测` as a separate bootstrap failure
2. inspect whether the failure is caused by missing target URL, domain mismatch, or unsupported bootstrap pattern
3. produce a bootstrap isolation note
4. do not implement login or bootstrap auto-recovery
### Deliverables
1. bootstrap target isolation note
2. decision whether the case enters a later bootstrap-normalization roadmap
### Acceptance Criteria
1. the bootstrap case does not pollute readiness-before-report work
2. no login recovery implementation is started
## Phase 6: Follow-Up Full Sweep and Coverage Delta
### Objective
Measure whether the bounded improvements improved generic coverage.
### Tasks
1. rerun the fixed `102` scene full sweep with the same scene set
2. produce a new dry-run result
3. compare against the baseline:
- auto-pass delta
- actionable coverage delta
- timeout delta
- misclassification delta
- no-report delta
4. publish coverage delta report
5. decide whether to move to execution-board status sync or another bounded improvement cycle
### Deliverables
1. follow-up full sweep JSON
2. coverage delta report
3. remaining blocker decision board
### Acceptance Criteria
1. scene set remains exactly `102`
2. baseline and follow-up are comparable
3. improvements are quantified, not assumed
4. no execution board status is changed automatically
## Milestone Order
The order is fixed:
1. Phase 0: freeze baseline
2. Phase 1: known-family timeout diagnostics
3. Phase 2: source-scale and scan-budget improvement
4. Phase 3: host-bridge route over-preference correction
5. Phase 4: structured fail-closed reporting
6. Phase 5: bootstrap target isolation
7. Phase 6: follow-up full sweep and coverage delta
Do not start Phase 3 before Phase 1 is completed. Known-family timeout ambiguity affects the interpretation of current coverage.
Do not start Phase 6 before Phases 2-5 have either completed or been explicitly deferred with reasons.
## Completion Criteria
This roadmap is complete when:
1. known-family timeouts are no longer mixed with generic timeout noise
2. host-bridge over-preference cases are adjudicated
3. readiness-before-report failures become structured fail-closed records
4. the bootstrap target case is isolated
5. a follow-up full sweep quantifies coverage delta
6. no new family is introduced as a shortcut around current blockers
## Out of Plan
1. new family implementation
2. `G4/G5` implementation
3. browser host runtime transport
4. login recovery
5. attachment/local document runtime
6. automatic execution board promotion