feat: add generated scene skill platform hardening
# 102 Full Sweep Improvement Roadmap Design

> Date: 2026-04-19
> Status: Draft
> Upstream Dry-Run: `docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-report.md`
> Upstream Triage: `docs/superpowers/reports/2026-04-19-102-full-sweep-dry-run-triage-report.md`

## Design Intent

Use the full `102` scene dry-run and triage results to define a single improvement roadmap for generic `scene -> skill` coverage.

This roadmap is the post-triage equivalent of the earlier `60-to-90` roadmap. It is not a single bugfix plan. It is the governing design for turning measured dry-run blockers into bounded implementation tracks.

The design answers:

`how do we move from 40/102 dry-run auto-pass and 66/102 actionable coverage toward a higher verified generic conversion rate without drifting into unbounded fixes?`

## Current Baseline

The current measured state is:

| Metric | Count |
| --- | ---: |
| Real-sample executed pass | 5 / 102 |
| Code-backed ledger coverage | 23 / 102 |
| Dry-run auto-pass | 40 / 102 |
| Dry-run actionable coverage | 66 / 102 |

The non-pass triage state is:

| Bucket | Count | Triage conclusion |
| --- | ---: | --- |
| Timeout | 31 | `19 timeout-unvalidated-source`, `8 timeout-large-source`, `4 timeout-known-family-sample` |
| Misclassified | 5 | all `route-overprefer-host-bridge` |
| No-report failure | 25 | all `readiness-before-report` |
| Bootstrap target | 1 | separate `bootstrap_target` |

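The baseline figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary and function names are assumptions made for this example, and the numbers are copied from the two tables above.

```python
# Illustrative cross-check of the baseline tables; all names are hypothetical.
TOTAL_SCENES = 102

baseline = {
    "real_sample_executed_pass": 5,
    "code_backed_ledger_coverage": 23,
    "dry_run_auto_pass": 40,
    "dry_run_actionable_coverage": 66,
}

non_pass_buckets = {
    "timeout": 31,
    "misclassified": 5,
    "no_report_failure": 25,
    "bootstrap_target": 1,
}

timeout_labels = {
    "timeout-unvalidated-source": 19,
    "timeout-large-source": 8,
    "timeout-known-family-sample": 4,
}


def rate(count: int, total: int = TOTAL_SCENES) -> float:
    """Return a coverage rate as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Auto-pass scenes plus the non-pass buckets should account for every scene.
assert baseline["dry_run_auto_pass"] + sum(non_pass_buckets.values()) == TOTAL_SCENES
# The timeout sub-labels should add up to the timeout bucket.
assert sum(timeout_labels.values()) == non_pass_buckets["timeout"]

for name, count in baseline.items():
    print(f"{name}: {count}/{TOTAL_SCENES} = {rate(count)}%")
```

With these numbers the dry-run auto-pass rate is about 39.2% and actionable coverage is about 64.7%. Both assertions hold, so the four triage buckets and the auto-pass count together cover all `102` scenes.
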
## Problem Statement

The generic generator already auto-passes more scenes than the formal ledger coverage shows, but the result is not trustworthy enough to promote automatically because:

1. known-family scenes still appear in the timeout bucket
2. `host_bridge_workflow` can over-absorb scenes expected to remain `G3` or `G1-E`
3. many fail-closed cases terminate before a structured generation report exists
4. timeout and no-report failures hide actionable blocker details

## Roadmap Goal

Improve the measurable generic conversion pipeline, not by adding new families first, but by reducing ambiguity in the current failure surface.

The roadmap has four goals:

1. make known-family timeout results explainable and repeatable
2. correct or formally adjudicate host-bridge routing over-preference
3. convert pre-report failures into structured fail-closed results
4. rerun a bounded `102` sweep to measure coverage delta

## Scope Guardrails

1. do not add new scene families in this roadmap
2. do not promote scenes directly from diagnostic runs
3. do not update `scene_execution_board_2026-04-18.json` until a later, explicit status-sync plan covers it
4. do not use one failure as justification for an unbounded rewrite
5. do not reopen completed `G1-E / G2 / G3 / G6 / G7` real-sample pass records unless they are part of a fixed regression check
6. do not start `G4 / G5`
7. do not implement login recovery, full host runtime, or attachment pipeline work in this roadmap

## Workstreams

1. `WS1` Timeout and Source-Scale Diagnostics
2. `WS2` Host-Bridge Routing Boundary Correction
3. `WS3` Structured Fail-Closed Reporting
4. `WS4` Coverage Delta Sweep and Decision Board

## Track A: Known-Family Timeout Diagnostics

### Intent

Separate known-family timeout behavior from generic unvalidated-source timeout behavior.

### Input

The `4` records labeled:

`timeout-known-family-sample`

### Expected Output

Each known-family timeout gets one of:

1. `known-family-rerun-pass`
2. `known-family-source-scale-timeout`
3. `known-family-generator-hotspot`
4. `known-family-contract-blocked-after-long-run`
5. `known-family-timeout-unresolved`

### Design Constraint

A rerun that succeeds with a longer timeout does not promote a scene. It only changes the diagnostic classification.

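The five labels and the design constraint can be captured in a small record type. This is a minimal sketch under stated assumptions: `KnownFamilyTimeoutLabel`, `TimeoutDiagnosis`, `scene_id`, and `promoted` are hypothetical names for illustration and are not taken from the generator codebase. Only the label strings come from this document.

```python
from dataclasses import dataclass
from enum import Enum


class KnownFamilyTimeoutLabel(Enum):
    """The five diagnostic outcomes listed under Expected Output."""
    RERUN_PASS = "known-family-rerun-pass"
    SOURCE_SCALE_TIMEOUT = "known-family-source-scale-timeout"
    GENERATOR_HOTSPOT = "known-family-generator-hotspot"
    CONTRACT_BLOCKED = "known-family-contract-blocked-after-long-run"
    UNRESOLVED = "known-family-timeout-unresolved"


@dataclass(frozen=True)
class TimeoutDiagnosis:
    scene_id: str                      # hypothetical scene identifier
    label: KnownFamilyTimeoutLabel
    promoted: bool = False             # hypothetical flag; must stay False

    def __post_init__(self) -> None:
        # Encodes the design constraint: a diagnostic rerun never promotes.
        if self.promoted:
            raise ValueError("diagnostic reruns must not promote a scene")


# Usage: even a passing rerun is recorded as a classification only.
diagnosis = TimeoutDiagnosis("scene-001", KnownFamilyTimeoutLabel.RERUN_PASS)
print(diagnosis.label.value, diagnosis.promoted)
```

Rejecting `promoted=True` at construction time is one way to make the constraint hard to bypass by accident. A review checklist could serve the same purpose.
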
## Track B: Timeout Source-Scale Policy

### Intent

Create a bounded input filtering and scan-budget policy for large source directories without changing family semantics.

### Input

The timeout labels:

1. `timeout-large-source`
2. `timeout-unvalidated-source`

### Expected Output

1. source file selection policy
2. large vendor/library ignore list policy
3. scan-budget decision table
4. timeout reporting shape

### Design Constraint

This track may improve scan boundaries, but it must not change archetype semantics.

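One possible shape for a bounded selection policy is sketched below. Everything in it is an assumption for illustration: the ignore list entries, the budget defaults, and the names `select_sources` and `ScanResult` are hypothetical, since the actual policy is an output of this track. The input is a list of `(path, size_in_bytes)` pairs so that the sketch does not touch the filesystem.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

# Hypothetical ignore list; the real one is to be decided by this track.
IGNORED_DIR_NAMES = frozenset({"node_modules", "vendor", "dist"})


@dataclass
class ScanResult:
    selected: List[str] = field(default_factory=list)
    skipped_ignored: int = 0
    skipped_over_budget: int = 0
    bytes_used: int = 0

    @property
    def budget_exhausted(self) -> bool:
        """True when at least one file was dropped because of the budget."""
        return self.skipped_over_budget > 0


def select_sources(
    files: Iterable[Tuple[str, int]],
    max_files: int = 200,          # hypothetical default
    max_bytes: int = 2_000_000,    # hypothetical default
    ignored: frozenset = IGNORED_DIR_NAMES,
) -> ScanResult:
    """Select files in input order until the count or byte budget is reached."""
    result = ScanResult()
    for path, size in files:
        if ignored.intersection(PurePosixPath(path).parts):
            result.skipped_ignored += 1       # vendor/library directory
        elif len(result.selected) >= max_files or result.bytes_used + size > max_bytes:
            result.skipped_over_budget += 1   # would exceed the scan budget
        else:
            result.selected.append(path)
            result.bytes_used += size
    return result


sample = [
    ("src/app.js", 100),
    ("node_modules/lib/x.js", 5000),
    ("src/big.js", 950),
    ("src/util.js", 50),
]
scan = select_sources(sample, max_bytes=1000)
print(scan.selected, scan.budget_exhausted)
```

The counters in `ScanResult` are meant to feed the timeout reporting shape: a report can then state how much input was ignored or dropped rather than only that a timeout occurred. The function filters inputs only and does not inspect or alter archetype inference.
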
## Track C: Host-Bridge Route Over-Preference Correction

### Intent

Prevent `host_bridge_workflow` from absorbing scenes that should remain `G3` or `G1-E` when business-chain evidence is stronger.

### Input

The `5` records labeled:

`route-overprefer-host-bridge`

### Expected Output

Each misclassification gets one of:

1. `route-corrected-to-g3`
2. `route-corrected-to-g1e`
3. `board-expectation-reclassified`
4. `valid-host-bridge-workflow`
5. `route-conflict-unresolved`

### Design Constraint

This track must preserve the already-passed `G6` real sample and must not degrade `G3` or `G1-E` canonical tests.

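The intent of this track can be read as a tie-break rule between routes. The sketch below is a guess at one such rule, not the generator's actual router: the function `choose_route`, the idea of a numeric evidence score per route, and the choice to keep the host-bridge route on a tie are all assumptions made for illustration.

```python
from typing import Dict

HOST_BRIDGE = "host_bridge_workflow"


def choose_route(evidence: Dict[str, float]) -> str:
    """Return the host-bridge route only when no business-chain route is stronger.

    `evidence` maps a route name (for example "G3" or "G1-E") to a
    hypothetical evidence score; higher means stronger evidence.
    """
    if not evidence:
        raise ValueError("no route evidence available")

    # Every route other than the host bridge counts as a business-chain route here.
    business = {route: score for route, score in evidence.items() if route != HOST_BRIDGE}

    if HOST_BRIDGE not in evidence:
        return max(business, key=business.get)

    if business:
        best = max(business, key=business.get)
        # Strictly stronger business-chain evidence wins over the host bridge.
        if business[best] > evidence[HOST_BRIDGE]:
            return best

    return HOST_BRIDGE


print(choose_route({HOST_BRIDGE: 0.4, "G3": 0.7}))
print(choose_route({HOST_BRIDGE: 0.8, "G1-E": 0.5}))
```

Under this rule the first call selects `G3` and the second keeps `host_bridge_workflow`. Any real correction would still need to be checked against the `G6` real sample and the `G3` and `G1-E` canonical tests named in the design constraint.
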
## Track D: Readiness-Before-Report Structured Fail-Closed

### Intent

Convert `generator failed without generation report` into structured, machine-readable fail-closed results.

### Input

The `25` records labeled:

`readiness-before-report`

### Expected Output

Each case produces a generation report or equivalent dry-run failure record with:

1. inferred archetype
2. blocker stage
3. missing contract pieces
4. failed gate name
5. actionable reason

### Design Constraint

This track should not make failing scenes pass. It should make failures explainable.

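A failure record with the five fields listed above could look like the following. This is a sketch under assumptions: the class name `FailClosedReport`, the extra `scene_id` and `status` fields, the helper `to_failure_record`, and every example value are hypothetical and only illustrate a machine-readable shape.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class FailClosedReport:
    scene_id: str                          # hypothetical identifier
    inferred_archetype: Optional[str]      # None when inference did not get that far
    blocker_stage: str
    failed_gate: str
    actionable_reason: str
    missing_contract_pieces: List[str] = field(default_factory=list)
    status: str = "fail-closed"            # the record never reports a pass


def to_failure_record(report: FailClosedReport) -> str:
    """Serialize a report to JSON, refusing records without an actionable reason."""
    if not report.actionable_reason.strip():
        raise ValueError("a fail-closed report needs an actionable reason")
    # ensure_ascii=False keeps non-ASCII scene names readable in the output.
    return json.dumps(asdict(report), ensure_ascii=False, sort_keys=True)


record = to_failure_record(
    FailClosedReport(
        scene_id="scene-042",                      # example value only
        inferred_archetype=None,
        blocker_stage="readiness",                 # example value only
        failed_gate="contract-completeness",       # example value only
        actionable_reason="no request contract found in source",
        missing_contract_pieces=["request contract"],
    )
)
print(record)
```

The `status` field is fixed to `fail-closed` in this sketch to mirror the design constraint: the record explains a failure and does not turn it into a pass.
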
## Track E: Bootstrap Target Isolation

### Intent

Keep the single `bootstrap_target` failure separate so it does not pollute the no-report or route-correction work.

### Input

The `1` bootstrap target failure:

`用户停电频次分析监测` (roughly, "user power-outage frequency analysis and monitoring")

### Expected Output

1. isolated bootstrap failure note
2. decision on whether it belongs to later bootstrap normalization work

### Design Constraint

No bootstrap auto-recovery or login work is included in this roadmap.

## Track F: Coverage Delta Sweep

### Intent

After bounded improvements, rerun a comparable `102` sweep and compare against the baseline.

### Input

1. baseline dry-run result
2. updated generator after approved tracks
3. same `102` scene board

### Expected Output

1. new dry-run result
2. coverage delta report
3. category movement table
4. decision board for remaining blockers

### Design Constraint

The rerun must be comparable to the baseline. It cannot silently change the scene set.

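The comparability constraint and the two comparison outputs can be sketched together. The code below assumes each sweep is available as a mapping from scene identifier to result category. That representation, the function name `coverage_delta`, and the category strings in the example are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Tuple


def coverage_delta(
    baseline: Dict[str, str], rerun: Dict[str, str]
) -> Tuple[Dict[str, int], Counter]:
    """Compare two sweeps keyed by scene id; values are result categories.

    Returns (per-category count delta, movement counts keyed by (before, after)).
    """
    # Enforce the design constraint: the scene set must be identical.
    if baseline.keys() != rerun.keys():
        changed = sorted(baseline.keys() ^ rerun.keys())
        raise ValueError(f"scene set changed, sweeps are not comparable: {changed}")

    before = Counter(baseline.values())
    after = Counter(rerun.values())
    delta = {cat: after[cat] - before[cat] for cat in sorted(set(before) | set(after))}

    # Category movement table: only scenes whose category changed.
    movement = Counter(
        (baseline[scene], rerun[scene])
        for scene in baseline
        if baseline[scene] != rerun[scene]
    )
    return delta, movement


old = {"s1": "auto-pass", "s2": "timeout", "s3": "no-report"}
new = {"s1": "auto-pass", "s2": "auto-pass", "s3": "structured-fail-closed"}
delta, movement = coverage_delta(old, new)
print(delta)
print(dict(movement))
```

In this toy example one scene moves from `timeout` to `auto-pass` and one from `no-report` to `structured-fail-closed`. Raising on a changed scene set, instead of comparing the intersection, keeps a silently shrunken sweep from looking like an improvement.
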
## Success Criteria

This roadmap succeeds when:

1. all known-family timeouts are separated from unvalidated-source timeout noise
2. all five host-bridge over-preference cases are adjudicated
3. no-report failures become structured fail-closed outputs
4. a follow-up full sweep shows measurable improvement or a clearly explained plateau
5. no new family is introduced to mask existing failure categories

## Out of Scope

1. new `G4/G5` implementation
2. full login recovery
3. browser host runtime transport implementation
4. local document attachment pipeline
5. automatic scene promotion into the execution board
6. full manual validation of all `102` generated skills