feat: add generated scene skill platform hardening

2026-04-21 23:19:06 +08:00
parent 118fc77935
commit 956f0c2b68
439 changed files with 61974 additions and 3645 deletions
--- a/docs/superpowers/specs/2026-04-19-102-full-sweep-improvement-roadmap-design.md
+++ b/docs/superpowers/specs/2026-04-19-102-full-sweep-improvement-roadmap-design.md
@@ -0,0 +1,239 @@
+# 102 Full Sweep Improvement Roadmap Design
+
+> Date: 2026-04-19
+> Status: Draft
+> Upstream Dry-Run: `docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-report.md`
+> Upstream Triage: `docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-triage-report.md`
+
+## Design Intent
+
+Use the full `102` scene dry-run and triage results to define a single improvement roadmap for generic `scene -> skill` coverage.
+
+This roadmap is the post-triage equivalent of the earlier `60-to-90` roadmap. It is not a single bugfix plan. It is the governing design for turning measured dry-run blockers into bounded implementation tracks.
+
+The design answers:
+
+`how do we move from 40/102 dry-run auto-pass and 66/102 actionable coverage toward a higher verified generic conversion rate without drifting into unbounded fixes?`
+
+## Current Baseline
+
+The current measured state is:
+
+| Metric | Count |
+| --- | ---: |
+| Real-sample executed pass | 5 / 102 |
+| Code-backed ledger coverage | 23 / 102 |
+| Dry-run auto-pass | 40 / 102 |
+| Dry-run actionable coverage | 66 / 102 |
+
+The non-pass triage state is:
+
+| Bucket | Count | Triage conclusion |
+| --- | ---: | --- |
+| Timeout | 31 | `19 timeout-unvalidated-source`, `8 timeout-large-source`, `4 timeout-known-family-sample` |
+| Misclassified | 5 | all `route-overprefer-host-bridge` |
+| No-report failure | 25 | all `readiness-before-report` |
+| Bootstrap target | 1 | separate `bootstrap_target` |
+
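The two tables above can be cross-checked arithmetically: the four non-pass buckets should account for every scene that did not auto-pass in the dry-run. A minimal Python sketch of that check follows. It uses only the counts quoted in the tables; the variable names are illustrative and are not taken from the repository.

```python
# Counts copied from the baseline and triage tables above.
TOTAL_SCENES = 102
DRY_RUN_AUTO_PASS = 40

# Triage buckets for scenes that did not auto-pass (key names are illustrative).
triage_buckets = {
    "timeout": 31,
    "misclassified": 5,
    "no_report_failure": 25,
    "bootstrap_target": 1,
}

# Sub-labels inside the timeout bucket.
timeout_labels = {
    "timeout-unvalidated-source": 19,
    "timeout-large-source": 8,
    "timeout-known-family-sample": 4,
}

non_pass = TOTAL_SCENES - DRY_RUN_AUTO_PASS

# Every non-pass scene should land in exactly one triage bucket.
assert sum(triage_buckets.values()) == non_pass
# The timeout sub-labels should add up to the timeout bucket.
assert sum(timeout_labels.values()) == triage_buckets["timeout"]

print(f"non-pass scenes: {non_pass}")
print(f"dry-run auto-pass rate: {DRY_RUN_AUTO_PASS / TOTAL_SCENES:.1%}")
```

Both assertions hold for the numbers as written, so the triage buckets partition the 62 non-pass scenes with no remainder.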
+## Problem Statement
+
+The generic generator already auto-passes more scenes in dry-run (40) than the code-backed ledger covers (23), but the result is not trustworthy enough to promote automatically because:
+
+1. known-family scenes still appear in the timeout bucket
+2. `host_bridge_workflow` can over-absorb scenes expected to remain `G3` or `G1-E`
+3. many fail-closed cases terminate before a structured generation report exists
+4. timeout and no-report failures hide actionable blocker details
+
+## Roadmap Goal
+
+Improve the measurable generic conversion pipeline by reducing ambiguity in the current failure surface before adding any new families.
+
+The roadmap has four goals:
+
+1. make known-family timeout results explainable and repeatable
+2. correct or formally adjudicate host-bridge routing over-preference
+3. convert pre-report failures into structured fail-closed results
+4. rerun a bounded `102` sweep to measure coverage delta
+
+## Scope Guardrails
+
+1. do not add new scene families in this roadmap
+2. do not promote scenes directly from diagnostic runs
+3. do not update `scene_execution_board_2026-04-18.json` until a later explicit status-sync plan
+4. do not use one failure as justification for an unbounded rewrite
+5. do not reopen completed `G1-E / G2 / G3 / G6 / G7` real-sample pass records unless they are part of a fixed regression check
+6. do not start `G4 / G5`
+7. do not implement login recovery, full host runtime, or attachment pipeline work in this roadmap
+
+## Workstreams
+
+1. `WS1` Timeout and Source-Scale Diagnostics
+2. `WS2` Host-Bridge Routing Boundary Correction
+3. `WS3` Structured Fail-Closed Reporting
+4. `WS4` Coverage Delta Sweep and Decision Board
+
+## Track A: Known-Family Timeout Diagnostics
+
+### Intent
+
+Separate known-family timeout behavior from generic unvalidated-source timeout behavior.
+
+### Input
+
+The `4` records labeled:
+
+`timeout-known-family-sample`
+
+### Expected Output
+
+Each known-family timeout gets one of:
+
+1. `known-family-rerun-pass`
+2. `known-family-source-scale-timeout`
+3. `known-family-generator-hotspot`
+4. `known-family-contract-blocked-after-long-run`
+5. `known-family-timeout-unresolved`
+
+### Design Constraint
+
+A rerun that passes with a longer time budget does not promote a scene. It only changes the diagnostic classification.
+
+## Track B: Timeout Source-Scale Policy
+
+### Intent
+
+Create a bounded input filtering and scan-budget policy for large source directories without changing family semantics.
+
+### Input
+
+The timeout labels:
+
+1. `timeout-large-source`
+2. `timeout-unvalidated-source`
+
+### Expected Output
+
+1. source file selection policy
+2. large vendor/library ignore list policy
+3. scan-budget decision table
+4. timeout reporting shape
+
+### Design Constraint
+
+This track may improve scan boundaries but must not change archetype semantics.
+
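One way to picture the Track B outputs (file selection policy, ignore list, scan budget) is a single filter that runs before the generator sees any source. The Python sketch below is an assumption-laden illustration: `IGNORED_DIR_NAMES`, `ScanBudget`, `select_sources`, the skip reasons, and every threshold are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical ignore list; the real policy is an expected output of Track B.
IGNORED_DIR_NAMES = frozenset({"node_modules", "vendor", "dist"})


@dataclass(frozen=True)
class ScanBudget:
    # Placeholder limits, chosen only for illustration.
    max_files: int = 200
    max_total_bytes: int = 2_000_000


def select_sources(files, budget=ScanBudget()):
    """Split (relative_path, size_in_bytes) pairs into selected and skipped.

    Returns (selected_paths, skipped), where skipped maps path -> reason.
    Only the scan boundary is decided here; nothing about archetypes.
    """
    selected, skipped = [], {}
    used_bytes = 0
    for path, size in sorted(files):
        parent_dirs = PurePosixPath(path).parts[:-1]
        if any(name in IGNORED_DIR_NAMES for name in parent_dirs):
            skipped[path] = "ignored-directory"
        elif len(selected) >= budget.max_files:
            skipped[path] = "file-budget-exhausted"
        elif used_bytes + size > budget.max_total_bytes:
            skipped[path] = "byte-budget-exhausted"
        else:
            selected.append(path)
            used_bytes += size
    return selected, skipped


# Toy input: one normal file, one vendored file, one oversized file.
files = [
    ("src/app.js", 1_200),
    ("node_modules/lib/index.js", 500_000),
    ("src/big.js", 3_000_000),
]
selected, skipped = select_sources(files)
print(selected)
print(skipped)
```

Recording a reason for every skipped file keeps the policy auditable, and the same reasons could feed the timeout reporting shape listed under Expected Output.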
+## Track C: Host-Bridge Route Over-Preference Correction
+
+### Intent
+
+Prevent `host_bridge_workflow` from absorbing scenes that should remain `G3` or `G1-E` when business-chain evidence is stronger.
+
+### Input
+
+The `5` records labeled:
+
+`route-overprefer-host-bridge`
+
+### Expected Output
+
+Each misclassification gets one of:
+
+1. `route-corrected-to-g3`
+2. `route-corrected-to-g1e`
+3. `board-expectation-reclassified`
+4. `valid-host-bridge-workflow`
+5. `route-conflict-unresolved`
+
+### Design Constraint
+
+This track must preserve the already-passed `G6` real sample and must not degrade `G3` or `G1-E` canonical tests.
+
+## Track D: Readiness-Before-Report Structured Fail-Closed
+
+### Intent
+
+Convert `generator failed without generation report` into structured, machine-readable fail-closed results.
+
+### Input
+
+The `25` records labeled:
+
+`readiness-before-report`
+
+### Expected Output
+
+Each case produces a generation report or equivalent dry-run failure record with:
+
+1. inferred archetype
+2. blocker stage
+3. missing contract pieces
+4. failed gate name
+5. actionable reason
+
+### Design Constraint
+
+This track should not make failing scenes pass. It should make failures explainable.
+
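The five fields listed under Expected Output map naturally onto a small record type. The Python sketch below shows one possible shape for such a dry-run failure record. The class name, field names, `status` value, and all example values are hypothetical; the actual report schema is not given in this spec.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class FailClosedRecord:
    """Hypothetical machine-readable record for a readiness-before-report failure."""

    scene: str
    inferred_archetype: Optional[str]  # None when no archetype could be inferred
    blocker_stage: str
    failed_gate: str
    actionable_reason: str
    missing_contract_pieces: List[str] = field(default_factory=list)
    status: str = "fail-closed"  # the record never reports a pass


def to_report_json(record: FailClosedRecord) -> str:
    # ensure_ascii=False keeps non-ASCII scene names readable in the output.
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


# All values below are invented for illustration.
record = FailClosedRecord(
    scene="example-scene",
    inferred_archetype="G3",
    blocker_stage="readiness",
    failed_gate="example-readiness-gate",
    actionable_reason="output contract is missing a required piece",
    missing_contract_pieces=["example-output-contract"],
)
report = to_report_json(record)
print(report)
```

The record only describes why the scene failed. It carries no pass state, which matches the constraint that this track makes failures explainable rather than making failing scenes pass.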
+## Track E: Bootstrap Target Isolation
+
+### Intent
+
+Keep the single `bootstrap_target` failure separate so it does not pollute the no-report or route-correction work.
+
+### Input
+
+The `1` bootstrap target failure:
+
+`用户停电频次分析监测` (user power-outage frequency analysis and monitoring)
+
+### Expected Output
+
+1. isolated bootstrap failure note
+2. decision on whether it belongs to later bootstrap normalization work
+
+### Design Constraint
+
+No bootstrap auto-recovery or login work is included in this roadmap.
+
+## Track F: Coverage Delta Sweep
+
+### Intent
+
+After bounded improvements, rerun a comparable `102` sweep and compare against the baseline.
+
+### Input
+
+1. baseline dry-run result
+2. updated generator after approved tracks
+3. same `102` scene board
+
+### Expected Output
+
+1. new dry-run result
+2. coverage delta report
+3. category movement table
+4. decision board for remaining blockers
+
+### Design Constraint
+
+The rerun must be comparable to the baseline. It cannot silently change the scene set.
+
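The comparability constraint and the category movement table can be sketched together: refuse to compare sweeps whose scene sets differ, then count how scenes moved between categories. The Python below is a minimal illustration under assumptions. The function name, the `{scene: category}` input shape, and the toy category labels are hypothetical and are not taken from the repository.

```python
from collections import Counter


def coverage_delta(baseline, rerun):
    """Compare two sweeps given as {scene_name: category} mappings.

    Returns (delta, movement): delta maps category -> count change,
    movement counts (old_category, new_category) transitions.
    """
    if set(baseline) != set(rerun):
        changed = sorted(set(baseline) ^ set(rerun))
        raise ValueError(f"scene set changed, sweeps are not comparable: {changed}")

    movement = Counter(
        (baseline[scene], rerun[scene])
        for scene in baseline
        if baseline[scene] != rerun[scene]
    )
    before = Counter(baseline.values())
    after = Counter(rerun.values())
    delta = {cat: after[cat] - before[cat] for cat in sorted(set(before) | set(after))}
    return delta, movement


# Toy three-scene sweep; a real run would cover the same 102 scenes twice.
baseline = {"s1": "auto-pass", "s2": "timeout", "s3": "no-report"}
rerun = {"s1": "auto-pass", "s2": "auto-pass", "s3": "fail-closed"}

delta, movement = coverage_delta(baseline, rerun)
print(delta)
print(dict(movement))
```

Raising on a changed scene set makes a silent change to the board a hard error instead of a skewed delta.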
+## Success Criteria
+
+This roadmap succeeds when:
+
+1. all known-family timeouts are separated from unvalidated timeout noise
+2. all five host-bridge over-preference cases are adjudicated
+3. no-report failures become structured fail-closed outputs
+4. a follow-up full sweep shows measurable improvement or a clearly explained plateau
+5. no new family is introduced to mask existing failure categories
+
+## Out of Scope
+
+1. new `G4/G5` implementation
+2. full login recovery
+3. browser host runtime transport implementation
+4. local document attachment pipeline
+5. automatic scene promotion into the execution board
+6. full manual validation of all `102` generated skills
+