# ZeroClaw Prompt Safety Hardening Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Harden ZeroClaw prompt handling and tool execution so non-skill freeform operations degrade to read-only or business-approved execution, while trusted skill-defined operations retain bounded execution privileges.

**Architecture:** Build a security gate around the existing prompt and tool-entry paths instead of rewriting the full prompt architecture. The gate classifies prompt-injection risk, records operation provenance (`trusted_skill` vs `non_skill`), sanitizes injected workspace/skill content, and enforces execution mode transitions (`clean`, `suspect_readonly`, `suspect_waiting_approval`, `suspect_business_approved`). Trusted skills gain structured business-operation metadata; non-skill operations require business-level approval before any privileged capability is released.

**Tech Stack:** Rust, vendored ZeroClaw (`third_party/zeroclaw`), existing approval/autonomy system, current prompt guard and prompt builder tests, `cargo test`.

### Task 1: Create an Isolated Worktree and Verify a Clean Baseline

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.gitignore`
- Create: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/**`

**Step 1: Verify the worktree directory is safe to use**

Run:

```bash
cd /home/zyl/projects/sgClaw/claw
ls -d .worktrees
git check-ignore -v .worktrees
```

Expected: `.worktrees` exists and is ignored by git.

**Step 2: Create the implementation worktree**

Run:

```bash
cd /home/zyl/projects/sgClaw/claw
git worktree add .worktrees/zeroclaw-prompt-safety-hardening -b zeroclaw-prompt-safety-hardening
```

Expected: a new branch and worktree are created.

**Step 3: Build the baseline in the worktree**

Run:

```bash
cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
```

Expected: existing relevant tests pass before any code changes.

**Step 4: Commit the clean worktree setup if `.gitignore` changed**

Run:

```bash
git add .gitignore
git commit -m "chore: prepare worktree for prompt safety hardening"
```

Expected: commit only if `.gitignore` required an adjustment.

### Task 2: Add the Core Security-Mode Data Model

**Files:**
- Create: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs`

**Step 1: Write the failing policy tests**

Add tests that prove:
- suspicious non-skill input maps to `suspect_readonly`
- trusted skill operations can request bounded privileged execution
- any out-of-scope capability request downgrades the operation

Use concrete enums and assertions, for example:

```rust
assert_eq!(
    ExecutionMode::from_guard_and_provenance(GuardRisk::Suspicious, OperationProvenance::NonSkill),
    ExecutionMode::SuspectReadOnly
);
```

**Step 2: Run the tests to verify RED**

Run:

```bash
cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs operation_policy -- --nocapture
```

Expected: fail because the new types do not exist yet.
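
For orientation, the state mapping that the Step 1 assertion exercises could be sketched roughly as follows. The enum and variant names come from this plan; the mapping table itself and the `request_approval` / `grant_approval` helpers are assumptions made for illustration, not the final policy. In particular, treating every non-clean input as read-only is a deliberately conservative guess.

```rust
// Hypothetical sketch of operation_policy.rs: names follow the plan,
// the mapping rules below are assumed and intentionally conservative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

impl ExecutionMode {
    /// Pure state mapping: no prompt-building logic lives here.
    pub fn from_guard_and_provenance(risk: GuardRisk, _provenance: OperationProvenance) -> Self {
        match risk {
            GuardRisk::Clean => ExecutionMode::Clean,
            // Assumption: any flagged input starts read-only, whatever its provenance.
            GuardRisk::Suspicious | GuardRisk::Dangerous => ExecutionMode::SuspectReadOnly,
        }
    }

    /// Assumed transition: a read-only turn may ask for business approval.
    pub fn request_approval(self) -> Self {
        match self {
            ExecutionMode::SuspectReadOnly => ExecutionMode::SuspectWaitingApproval,
            other => other,
        }
    }

    /// Assumed transition: only a waiting turn can become business-approved.
    pub fn grant_approval(self) -> Self {
        match self {
            ExecutionMode::SuspectWaitingApproval => ExecutionMode::SuspectBusinessApproved,
            other => other,
        }
    }
}
```

Keeping the transitions as total functions that return the unchanged mode for invalid inputs means a stray approval can never upgrade a `Clean` or `SuspectReadOnly` turn by accident.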

**Step 3: Implement the minimal policy model**

Define:
- `GuardRisk` (`Clean`, `Suspicious`, `Dangerous`)
- `OperationProvenance` (`TrustedSkill`, `NonSkill`, `Mixed`)
- `ExecutionMode` (`Clean`, `SuspectReadOnly`, `SuspectWaitingApproval`, `SuspectBusinessApproved`)
- `CapabilityClass` for privileged business actions

Add small helper functions that do only state mapping. Do not pull prompt-building logic into this module.

**Step 4: Re-run the policy tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs operation_policy -- --nocapture
```

Expected: the new policy tests pass.

**Step 5: Commit**

Run:

```bash
git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/operation_policy.rs
git commit -m "feat: add prompt security execution mode model"
```

### Task 3: Add Structured Skill Trust Metadata

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/tools/read_skill.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs`

**Step 1: Write failing skill metadata tests**

Add tests that prove:
- `SKILL.toml` can declare a business operation type, capability list, argument constraints, and `step_budget`
- markdown-only skills default to unprivileged metadata
- malformed privileged metadata is rejected or downgraded safely

Use a manifest shape like:

```toml
[skill]
name = "export-report"
description = "Export the monthly report"

[security]
operation_type = "browser_export_data"
allowed_capabilities = ["browser_read", "browser_export"]
step_budget = 6
approval_mode = "trusted_skill"
```

**Step 2: Run the tests to verify RED**

Run:

```bash
cargo test -p zeroclawlabs skill -- --nocapture
```

Expected: fail because the structured metadata fields are missing.
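
The `[security]` table in the manifest above could map onto a Rust structure along the lines of the sketch below. It leaves out TOML parsing and the `business_description` and `arg_constraints` fields to stay small. The type names `SkillSecurity` and `ApprovalMode`, the `Unprivileged` variant, and the `normalized` helper are all hypothetical; the well-formedness rule is one possible reading of "rejected or downgraded safely".

```rust
/// Hypothetical approval mode; the plan only shows "trusted_skill" in the manifest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ApprovalMode {
    /// Assumed default for markdown-only skills.
    #[default]
    Unprivileged,
    TrustedSkill,
}

/// Hypothetical in-memory form of the `[security]` manifest table.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SkillSecurity {
    pub operation_type: Option<String>,
    pub allowed_capabilities: Vec<String>,
    pub step_budget: u32,
    pub approval_mode: ApprovalMode,
}

impl SkillSecurity {
    /// Downgrade to unprivileged defaults when privileged metadata is incomplete.
    /// Assumed rule: a trusted skill needs a non-empty operation type,
    /// at least one capability, and a positive step budget.
    pub fn normalized(self) -> Self {
        let privileged = self.approval_mode == ApprovalMode::TrustedSkill;
        let has_operation = self
            .operation_type
            .as_deref()
            .map_or(false, |s| !s.trim().is_empty());
        let well_formed =
            has_operation && !self.allowed_capabilities.is_empty() && self.step_budget > 0;
        if privileged && !well_formed {
            SkillSecurity::default()
        } else {
            self
        }
    }
}
```

Because `SkillSecurity::default()` is the unprivileged case, a skill with no manifest and a skill with a broken manifest end up in the same safe state, which keeps existing markdown-only skills compatible.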

**Step 3: Implement minimal structured metadata**

Extend `Skill` with a structured security block, for example:
- `operation_type`
- `business_description`
- `allowed_capabilities`
- `arg_constraints`
- `step_budget`
- `approval_mode`

Default markdown-only skills to unprivileged metadata so existing skills remain compatible.

**Step 4: Make `read_skill` expose the metadata**

Return or prepend enough structured metadata so the runtime can distinguish trusted skill operations from plain prompt text.

**Step 5: Re-run the tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs skill -- --nocapture
```

Expected: skill parsing and `read_skill` tests pass.

**Step 6: Commit**

Run:

```bash
git add third_party/zeroclaw/src/skills/mod.rs third_party/zeroclaw/src/tools/read_skill.rs
git commit -m "feat: add trusted skill security metadata"
```

### Task 4: Sanitize Injected Workspace and Skill Content Before Prompt Assembly

**Files:**
- Create: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_sanitizer.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs`

**Step 1: Write failing sanitizer tests**

Add tests that prove:
- dangerous bootstrap phrases are removed, escaped, or summarized before prompt injection
- control characters are stripped
- overlong files are truncated with an audit-friendly marker
- safe business content remains readable

**Step 2: Run the tests to verify RED**

Run:

```bash
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
```

Expected: fail because injected files are still copied verbatim.

**Step 3: Implement the sanitizer**

Create a small sanitizer that:
- strips control characters
- caps content length
- flags prompt-override phrases
- emits sanitized content plus metadata such as `truncated` and matched rules

Use this sanitizer in:
- `load_openclaw_bootstrap_files`
- any shared path in `agent/prompt.rs` that renders workspace or skill text into the system prompt

**Step 4: Re-run the tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
```

Expected: prompt-building tests pass with the new sanitized behavior.

**Step 5: Commit**

Run:

```bash
git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/prompt_sanitizer.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/src/agent/prompt.rs
git commit -m "feat: sanitize injected workspace prompt content"
```

### Task 5: Wire `PromptGuard` into Main Agent and Gateway Entry Points

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_guard.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/mod.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/ws.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs`

**Step 1: Write failing entry-point tests**

Add tests that prove:
- suspicious input marks the turn as degraded instead of silently continuing
- dangerous input is blocked
- clean input remains unchanged

Prefer tests that assert on a security decision object instead of brittle prompt strings.
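
As an illustration of what "a security decision object" might look like, the sketch below maps a guard risk level to a per-turn decision that tests can compare directly. `TurnDecision`, its variants, and `decide_turn` are hypothetical names introduced here; only the three outcomes (unchanged, degraded, blocked) come from the test list above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

/// Hypothetical per-turn decision that entry-point tests can assert on.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TurnDecision {
    /// Clean input: the turn continues unchanged.
    Proceed,
    /// Suspicious input: the turn continues in a degraded mode.
    Degrade { matched_rules: Vec<String> },
    /// Dangerous input: the turn is rejected before any tool runs.
    Block { matched_rules: Vec<String> },
}

/// Map the guard result to a decision; matched rule names ride along for auditing.
pub fn decide_turn(risk: GuardRisk, matched_rules: Vec<String>) -> TurnDecision {
    match risk {
        GuardRisk::Clean => TurnDecision::Proceed,
        GuardRisk::Suspicious => TurnDecision::Degrade { matched_rules },
        GuardRisk::Dangerous => TurnDecision::Block { matched_rules },
    }
}
```

A test can then state `assert_eq!(decide_turn(GuardRisk::Clean, vec![]), TurnDecision::Proceed)` without inspecting any prompt text.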

**Step 2: Run the tests to verify RED**

Run:

```bash
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture
```

Expected: fail because no entry path consumes the guard result.

**Step 3: Implement guarded entry evaluation**

Before each turn:
- scan the inbound user content
- map the guard result into `GuardRisk`
- create an execution context carrying risk and provenance
- attach audit details for later logging

Keep the existing `PromptGuard` regexes unless a test demands a specific adjustment.

**Step 4: Re-run the tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture
```

Expected: suspicious and blocked paths now behave deterministically.

**Step 5: Commit**

Run:

```bash
git add third_party/zeroclaw/src/security/prompt_guard.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/gateway/mod.rs third_party/zeroclaw/src/gateway/ws.rs
git commit -m "feat: enforce prompt guard at runtime entry points"
```

### Task 6: Add Business-Level Privileged Operation Registry and Approval Tokens

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/approval/mod.rs`
- Create: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs`

**Step 1: Write failing business approval tests**

Add tests that prove:
- only operations in the privileged registry can request approval
- approval tokens bind to `session_id`, `operation_type`, `allowed_capabilities`, `step_budget`, and expiration
- a mismatched or expired approval token is rejected

**Step 2: Run the tests to verify RED**

Run:

```bash
cargo test -p zeroclawlabs business_approval -- --nocapture
```

Expected: fail because the business approval registry does not exist yet.

**Step 3: Implement the registry and token model**

Create:
- a privileged business operation registry
- a single-operation approval token
- helper checks for `can_request_approval` and `matches_execution_request`

Model approval at the business-operation level, not raw tool calls.

**Step 4: Extend the existing approval module**

Teach the approval module to carry business-level fields through the current request/response flow without breaking old call sites.

**Step 5: Re-run the tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs business_approval -- --nocapture
```

Expected: the token validation and registry tests pass.

**Step 6: Commit**

Run:

```bash
git add third_party/zeroclaw/src/approval/mod.rs third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/business_approval.rs
git commit -m "feat: add business-level approval registry"
```

### Task 7: Enforce Execution Modes in Tool Dispatch

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/loop_.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs`

**Step 1: Write failing dispatcher tests**

Add tests that prove:
- `suspect_readonly` allows only safe read capabilities
- `trusted_skill` can execute capabilities listed in its metadata within `step_budget`
- `mixed` or non-skill privileged calls require a matching business approval token

**Step 2: Run the tests to verify RED**

Run:

```bash
cargo test -p zeroclawlabs dispatcher -- --nocapture
```

Expected: fail because the dispatcher does not yet know about execution modes.

**Step 3: Implement capability enforcement**

Before dispatching any tool:
- resolve the operation context
- map the tool call to a capability class
- reject calls outside the current execution mode
- decrement or validate `step_budget` for approved bounded flows

Do not rely on prompt text for enforcement.

**Step 4: Re-run the tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs dispatcher -- --nocapture
```

Expected: dispatch now respects read-only, trusted skill, and business-approved modes.

**Step 5: Commit**

Run:

```bash
git add third_party/zeroclaw/src/agent/dispatcher.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/agent/loop_.rs
git commit -m "feat: enforce execution mode in tool dispatch"
```

### Task 8: Default Skills Prompt Injection to Compact for Safer Runtime Behavior

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs`
- Test: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs`

**Step 1: Write the failing configuration test**

Add a test that asserts the default skill prompt injection mode is `Compact` unless explicitly configured otherwise.

**Step 2: Run the test to verify RED**

Run:

```bash
cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture
```

Expected: fail because defaults still point to `Full`.

**Step 3: Implement the default flip**

Update config defaults and any prompt-builder defaults that currently assume `Full`. Keep explicit user config backward compatible.
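
One way to express the default flip while leaving explicit configuration untouched is sketched below. The type name `SkillsPromptInjectionMode` is inferred from the test filter used in this task and may not match the real schema; `resolve_mode` is a hypothetical helper standing in for however the config layer merges user values with defaults.

```rust
/// Assumed shape of the config enum; only `Full` and `Compact` are named in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SkillsPromptInjectionMode {
    Full,
    /// New default after the flip.
    #[default]
    Compact,
}

/// Hypothetical resolver: an explicit user setting always wins,
/// otherwise the type's default applies.
pub fn resolve_mode(explicit: Option<SkillsPromptInjectionMode>) -> SkillsPromptInjectionMode {
    explicit.unwrap_or_default()
}
```

Putting the default on the type rather than at each call site means the prompt builder and the config schema cannot drift apart on what "unset" means.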

**Step 4: Re-run the test to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture
```

Expected: default configuration now resolves to `Compact`.

**Step 5: Commit**

Run:

```bash
git add third_party/zeroclaw/src/config/schema.rs third_party/zeroclaw/src/agent/prompt.rs third_party/zeroclaw/src/channels/mod.rs
git commit -m "feat: default skills prompt injection to compact"
```

### Task 9: Add Audit Logging and Regression Coverage

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/observability/mod.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs`
- Create: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/tests/prompt_safety_regression.rs`

**Step 1: Write the failing regression tests**

Cover:
- prompt override attack from user content
- malicious `AGENTS.md` bootstrap content
- trusted skill execution within bounds
- non-skill privileged request requiring business approval
- approval token mismatch
- session history restore preserving degraded mode

**Step 2: Run the tests to verify RED**

Run:

```bash
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture
```

Expected: fail because the end-to-end behavior is not wired together yet.

**Step 3: Implement audit logging**

Record:
- input hash
- matched guard rules
- risk level
- provenance
- execution mode transitions
- approval decisions

Avoid logging raw sensitive content.

**Step 4: Re-run the regression tests to verify GREEN**

Run:

```bash
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture
```

Expected: the regression suite passes.
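
The audit fields listed in Step 3 could be carried in a record like the sketch below, which stores a hash of the input and never the input itself. `AuditRecord` and its methods are hypothetical, the string-typed fields are a simplification, and `DefaultHasher` from the standard library is only a stand-in: it is not a cryptographic hash, so a real implementation would presumably pick a stronger digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical audit record: carries a hash of the input, never the raw text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditRecord {
    pub input_hash: u64,
    pub matched_rules: Vec<String>,
    pub risk: String,
    pub provenance: String,
    pub mode_transitions: Vec<(String, String)>,
    pub approval_decision: Option<bool>,
}

/// Stand-in hash (not cryptographic); enough to correlate identical inputs in a sketch.
fn hash_input(input: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    hasher.finish()
}

impl AuditRecord {
    pub fn new(input: &str, risk: &str, provenance: &str, matched_rules: Vec<String>) -> Self {
        AuditRecord {
            input_hash: hash_input(input),
            matched_rules,
            risk: risk.to_string(),
            provenance: provenance.to_string(),
            mode_transitions: Vec::new(),
            approval_decision: None,
        }
    }

    /// Append one execution mode transition, e.g. clean -> suspect_readonly.
    pub fn record_transition(&mut self, from: &str, to: &str) {
        self.mode_transitions.push((from.to_string(), to.to_string()));
    }

    /// Record whether a business approval request was granted.
    pub fn record_approval(&mut self, approved: bool) {
        self.approval_decision = Some(approved);
    }
}
```

Because the constructor takes the raw input by reference and keeps only its hash, the record can be serialized or logged without a separate redaction pass.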

**Step 5: Commit**

Run:

```bash
git add third_party/zeroclaw/src/observability/mod.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/tests/prompt_safety_regression.rs
git commit -m "test: add prompt safety regression coverage"
```

### Task 10: Final Verification and Integration Review

**Files:**
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/L5-提示词分布与安全改造方案.md`
- Modify: `/home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/README.md`

**Step 1: Run targeted verification**

Run:

```bash
cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
cargo test -p zeroclawlabs dispatcher -- --nocapture
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture
```

Expected: all prompt safety and dispatcher tests pass.

**Step 2: Run a broad ZeroClaw package test pass if time permits**

Run:

```bash
cargo test -p zeroclawlabs -- --nocapture
```

Expected: no regressions in the vendored package test suite, or a documented list of unrelated existing failures.

**Step 3: Update the security design docs**

Document:
- execution modes
- trusted skill metadata contract
- business approval flow
- why non-skill privileged actions are gated

**Step 4: Commit the docs**

Run:

```bash
git add docs/L5-提示词分布与安全改造方案.md docs/README.md
git commit -m "docs: record prompt safety hardening design"
```

**Step 5: Prepare merge review notes**

Write a short integration summary covering:
- changed entry points
- backward-compatibility expectations
- any skills that need metadata upgrades
- rollout recommendation for existing integrators