木炎 956f0c2b68 feat: add generated scene skill platform hardening

2026-04-21 23:19:06 +08:00

102 Full Sweep Dry-Run Report

Date: 2026-04-19
Plan: docs/superpowers/plans/2026-04-19-102-full-sweep-dry-run-plan.md
Result: tests/fixtures/generated_scene/full_sweep_dry_run_2026-04-19.json
Output Root: examples/full_sweep_dry_run_2026-04-19

Scope

This run measured the current generic scene -> skill coverage across the fixed 102-scene execution board.

It was a measurement-only dry-run:

no analyzer logic was changed
no generator logic was changed
scene_execution_board_2026-04-18.json was not updated
no scene was promoted from this result
failures were recorded, not fixed

Headline Numbers

Metric	Count
Real-sample executed pass	5 / 102
Code-backed ledger coverage	23 / 102
Dry-run auto-pass	40 / 102
Dry-run actionable coverage	66 / 102

Dry-run actionable coverage is `auto-pass` + `fail-closed-known`, here 40 + 26 = 66.

Dry-Run Summary

Dry-run status	Count
`auto-pass`	40
`fail-closed-known`	26
`misclassified`	5
`unsupported-family`	0
`missing-source`	0
`source-unreadable`	31
Total	102
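The relationship between the status counts and the actionable-coverage headline can be checked with a few lines of Python. This is a minimal sketch that hard-codes the counts from the table above; it does not read the result JSON, whose schema is not described in this report, and the names `STATUS_COUNTS`, `BOARD_SIZE`, and `actionable_coverage` are illustrative only.

```python
from collections import Counter

# Status counts copied from the Dry-Run Summary table.
STATUS_COUNTS = Counter({
    "auto-pass": 40,
    "fail-closed-known": 26,
    "misclassified": 5,
    "unsupported-family": 0,
    "missing-source": 0,
    "source-unreadable": 31,
})

BOARD_SIZE = 102  # size of the fixed execution board


def actionable_coverage(counts):
    """Actionable coverage is defined as auto-pass + fail-closed-known."""
    return counts["auto-pass"] + counts["fail-closed-known"]


total = sum(STATUS_COUNTS.values())
actionable = actionable_coverage(STATUS_COUNTS)

# Every scene on the board should land in exactly one status bucket.
assert total == BOARD_SIZE
print(f"total={total} actionable={actionable}/{BOARD_SIZE}")
# prints: total=102 actionable=66/102
```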

Archetype Distribution

Inferred archetype	Count
`host_bridge_workflow`	31
`paginated_enrichment`	8
`multi_mode_request`	3
`multi_endpoint_inventory`	2
`page_state_eval`	2
`none`	56

The `none` bucket includes generator failures and timeout cases that did not produce a `generation-report.json`.

Auto-Pass Shape

The 40 auto-pass scenes are distributed as:

Inferred archetype	Auto-pass count
`host_bridge_workflow`	26
`paginated_enrichment`	8
`multi_mode_request`	3
`multi_endpoint_inventory`	2
`page_state_eval`	1

This means the current generic generator is no longer limited to the 23 code-backed ledger scenes. The conservative ledger coverage is lower because it only counts scenes already mapped into formal baseline or boundary assets.
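The two archetype tables can be cross-checked against each other. The sketch below hard-codes the counts from the Archetype Distribution and Auto-Pass Shape tables and subtracts one from the other to show which inferred archetypes contain scenes that did not auto-pass. The variable names are illustrative, and nothing here reads the actual result file.

```python
from collections import Counter

# Counts copied from the Archetype Distribution table.
inferred = Counter({
    "host_bridge_workflow": 31,
    "paginated_enrichment": 8,
    "multi_mode_request": 3,
    "multi_endpoint_inventory": 2,
    "page_state_eval": 2,
    "none": 56,
})

# Counts copied from the Auto-Pass Shape table.
auto_pass = Counter({
    "host_bridge_workflow": 26,
    "paginated_enrichment": 8,
    "multi_mode_request": 3,
    "multi_endpoint_inventory": 2,
    "page_state_eval": 1,
})

assert sum(inferred.values()) == 102
assert sum(auto_pass.values()) == 40

# Counter subtraction keeps only positive remainders, so archetypes whose
# scenes all auto-passed drop out of the result.
non_pass = inferred - auto_pass
for archetype, count in sorted(non_pass.items()):
    print(archetype, count)
# host_bridge_workflow 5
# none 56
# page_state_eval 1
```

The 56 scenes left under `none` equal the 31 timeouts plus the 25 failures without a generation report, and the 5 remaining `host_bridge_workflow` scenes equal the misclassified count.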

Non-Pass Buckets

Source-Unreadable

31 scenes timed out during this bounded dry-run.

All timeout records use:

generator timeout after 30s

These timeouts should not be read as evidence of an `unsupported-family` result. They are dry-run execution-limit failures and need separate timeout/performance triage before any capability conclusions are drawn.
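For illustration, a bounded run of this kind can be sketched with Python's `subprocess` timeout. This is not the harness used for the dry-run: the function name `run_bounded`, the record layout, and the `finished` placeholder status are assumptions, and the generator command itself is left to the caller because the report does not state it. Only the 30-second bound, the `source-unreadable` status, and the reason string come from the report.

```python
import subprocess
import sys

TIMEOUT_SECONDS = 30  # the bound reported for this dry-run


def run_bounded(command, timeout=TIMEOUT_SECONDS):
    """Run one command under a time limit and return a small result record.

    A timeout is recorded as `source-unreadable`, mirroring the report.
    Any other outcome is returned as a hypothetical `finished` record for
    later classification; that status is not one of the report's buckets.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {
            "status": "source-unreadable",
            "reason": f"generator timeout after {timeout}s",
        }
    return {
        "status": "finished",
        "returncode": completed.returncode,
        "stdout": completed.stdout,
    }


# Stand-in command: a short Python one-liner instead of the real generator.
print(run_bounded([sys.executable, "-c", "print('ok')"])["status"])
```

Keeping the timeout branch separate from the exit-code branch makes it harder to mistake an execution-limit failure for a capability failure.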

Fail-Closed-Known

26 scenes did not auto-pass but were recorded with a known dry-run failure category.

Top reasons:

Reason	Count
`generator failed without generation report`	25
`bootstrap_target`	1

The `generator failed without generation report` bucket is actionable but too broad to drive implementation work. It should be split in a later bounded triage pass before any fixes are attempted.

Misclassified

5 scenes produced a package, but the inferred archetype conflicted with the current board group:

Scene	Current group	Inferred archetype
`95598报修工单日管控`	`G3`	`host_bridge_workflow`
`95598重要服务事项报备统计表`	`G3`	`host_bridge_workflow`
`用电报装信息统计列表`	`G1-E`	`host_bridge_workflow`
`配网支撑月报(95598抢修统计报表)`	`G3`	`host_bridge_workflow`
`高低压新增报装容量月度统计表`	`G1-E`	`host_bridge_workflow`

This is the clearest blocker category from the dry-run: it indicates that the current generic routing can over-prefer `host_bridge_workflow` for scenes that already have board-level family expectations.
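A board-vs-archetype check of this kind could look roughly like the sketch below. The report does not say which archetypes each board group expects, so `expected_by_group` is passed in by the caller and the values used here are explicit placeholders, not real expectations. `SceneResult` and `find_misclassified` are hypothetical names as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneResult:
    scene: str
    board_group: str
    inferred_archetype: str


def find_misclassified(results, expected_by_group):
    """Return results whose inferred archetype conflicts with the board group.

    Groups with no entry in `expected_by_group` are skipped rather than
    flagged, since there is no board-level expectation to conflict with.
    """
    flagged = []
    for result in results:
        expected = expected_by_group.get(result.board_group)
        if expected is not None and result.inferred_archetype not in expected:
            flagged.append(result)
    return flagged


# Two rows taken from the table above, plus one made-up row in a group
# that has no expectation registered.
results = [
    SceneResult("95598报修工单日管控", "G3", "host_bridge_workflow"),
    SceneResult("用电报装信息统计列表", "G1-E", "host_bridge_workflow"),
    SceneResult("example-scene", "GX", "host_bridge_workflow"),
]

# Placeholder expectations: the real mapping is not given in this report.
expected_by_group = {
    "G3": {"<expected-archetype-for-G3>"},
    "G1-E": {"<expected-archetype-for-G1-E>"},
}

for hit in find_misclassified(results, expected_by_group):
    print(hit.board_group, hit.scene, hit.inferred_archetype)
```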

Interpretation

The four coverage numbers answer different questions:

5 / 102 is the strict real-sample pass count.
23 / 102 is the formal code-backed ledger coverage.
40 / 102 is the current generic dry-run auto-pass count.
66 / 102 is the current generic actionable coverage count.

The key result is that the generic generator currently auto-passes more scenes than the formal ledger coverage shows, but the result is not clean enough to promote automatically because:

31 scenes hit bounded dry-run timeouts.
5 scenes show board-vs-archetype mismatch.
26 fail-closed-known scenes need more specific failure extraction before implementation work, 25 of which have no generation report.

Recommended Next Blocker

Do not start implementation from this report directly.

The next bounded step should be a dry-run triage pass, with priority:

split the 31 timeout cases into true timeout, oversized source, and command-level hang
inspect the 5 misclassified cases as the first routing-quality sample
refine the 25 generic no-report failures into concrete failure categories

This report does not update the execution board and does not promote any scene.
