ZeroClaw Prompt Safety Hardening Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Harden ZeroClaw prompt handling and tool execution so non-skill freeform operations degrade to read-only or business-approved execution, while trusted skill-defined operations retain bounded execution privileges.

Architecture: Build a security gate around the existing prompt and tool-entry paths instead of rewriting the full prompt architecture. The gate classifies prompt-injection risk, records operation provenance (trusted_skill, non_skill, or mixed), sanitizes injected workspace/skill content, and enforces execution mode transitions (clean, suspect_readonly, suspect_waiting_approval, suspect_business_approved). Trusted skills gain structured business-operation metadata; non-skill operations require business-level approval before any privileged capability is released.

Tech Stack: Rust, vendored ZeroClaw (third_party/zeroclaw), existing approval/autonomy system, current prompt guard and prompt builder tests, cargo test.

Task 1: Create an Isolated Worktree and Verify a Clean Baseline

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.gitignore
  • Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/**

Step 1: Verify the worktree directory is safe to use

Run:

cd /home/zyl/projects/sgClaw/claw
ls -d .worktrees
git check-ignore -v .worktrees

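Expected: .worktrees exists and git check-ignore prints the .gitignore rule that matches it. If the directory is missing or not ignored, create it and add .worktrees/ to .gitignore before continuing.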

Step 2: Create the implementation worktree

Run:

cd /home/zyl/projects/sgClaw/claw
git worktree add .worktrees/zeroclaw-prompt-safety-hardening -b zeroclaw-prompt-safety-hardening

Expected: a new branch and worktree are created.

Step 3: Build the baseline in the worktree

Run:

cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture

Expected: existing relevant tests pass before any code changes.

Step 4: Commit the clean worktree setup if .gitignore changed

Run:

git add .gitignore
git commit -m "chore: prepare worktree for prompt safety hardening"

Expected: commit only if .gitignore required an adjustment.

Task 2: Add the Core Security-Mode Data Model

Files:

  • Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs

Step 1: Write the failing policy tests

Add tests that prove:

  • suspicious non-skill input maps to suspect_readonly
  • trusted skill operations can request bounded privileged execution
  • any out-of-scope capability request downgrades the operation

Use concrete enums and assertions, for example:

assert_eq!(
    ExecutionMode::from_guard_and_provenance(GuardRisk::Suspicious, OperationProvenance::NonSkill),
    ExecutionMode::SuspectReadOnly
);

Step 2: Run the tests to verify RED

Run:

cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs operation_policy -- --nocapture

Expected: fail because the new types do not exist yet.

Step 3: Implement the minimal policy model

Define:

  • GuardRisk (Clean, Suspicious, Dangerous)
  • OperationProvenance (TrustedSkill, NonSkill, Mixed)
  • ExecutionMode (Clean, SuspectReadOnly, SuspectWaitingApproval, SuspectBusinessApproved)
  • CapabilityClass for privileged business actions

Add small helper functions that do only state mapping. Do not pull prompt-building logic into this module.
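
A minimal sketch of the state-mapping module, for orientation only. The enum and function names come from this plan; the mapping rules for every combination other than the Suspicious + NonSkill case asserted in Step 1 are assumptions and should be settled by the tests, not by this example.

```rust
/// Risk level derived from the prompt guard result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

/// Where the requested operation originated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

/// Execution mode enforced for the current turn.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

impl ExecutionMode {
    /// Pure state mapping; no prompt-building logic lives here.
    pub fn from_guard_and_provenance(
        risk: GuardRisk,
        provenance: OperationProvenance,
    ) -> Self {
        match (risk, provenance) {
            // Clean input never degrades the turn.
            (GuardRisk::Clean, _) => ExecutionMode::Clean,
            // ASSUMPTION: a trusted skill keeps its bounded privileges
            // when the scan is merely suspicious.
            (GuardRisk::Suspicious, OperationProvenance::TrustedSkill) => {
                ExecutionMode::Clean
            }
            // ASSUMPTION: everything else starts read-only; approval
            // transitions happen later in the approval flow.
            _ => ExecutionMode::SuspectReadOnly,
        }
    }
}
```

Keeping the mapping as a single match makes each rule a one-line test case and keeps the module free of I/O.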

Step 4: Re-run the policy tests to verify GREEN

Run:

cargo test -p zeroclawlabs operation_policy -- --nocapture

Expected: the new policy tests pass.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/operation_policy.rs
git commit -m "feat: add prompt security execution mode model"

Task 3: Add Structured Skill Trust Metadata

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/tools/read_skill.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs

Step 1: Write failing skill metadata tests

Add tests that prove:

  • SKILL.toml can declare a business operation type, capability list, argument constraints, and step_budget
  • markdown-only skills default to unprivileged metadata
  • malformed privileged metadata is rejected or downgraded safely

Use a manifest shape like:

[skill]
name = "export-report"
description = "Export the monthly report"

[security]
operation_type = "browser_export_data"
allowed_capabilities = ["browser_read", "browser_export"]
step_budget = 6
approval_mode = "trusted_skill"

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs skill -- --nocapture

Expected: fail because the structured metadata fields are missing.

Step 3: Implement minimal structured metadata

Extend Skill with a structured security block, for example:

  • operation_type
  • business_description
  • allowed_capabilities
  • arg_constraints
  • step_budget
  • approval_mode

Default markdown-only skills to unprivileged metadata so existing skills remain compatible.
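
One possible shape for the security block, shown without any TOML parsing so it stays dependency-free. The field names follow the list above; the type names SkillSecurity and ApprovalMode, the field types, and the validated() downgrade rule are hypothetical and not taken from the existing Skill struct.

```rust
/// Hypothetical approval modes for a skill's security block.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ApprovalMode {
    Unprivileged,
    TrustedSkill,
}

/// Hypothetical structured security metadata attached to a skill.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SkillSecurity {
    pub operation_type: Option<String>,
    pub business_description: Option<String>,
    pub allowed_capabilities: Vec<String>,
    /// (argument name, constraint expression) pairs; format is an assumption.
    pub arg_constraints: Vec<(String, String)>,
    pub step_budget: u32,
    pub approval_mode: ApprovalMode,
}

impl Default for SkillSecurity {
    /// Markdown-only skills get this unprivileged default.
    fn default() -> Self {
        SkillSecurity {
            operation_type: None,
            business_description: None,
            allowed_capabilities: Vec::new(),
            arg_constraints: Vec::new(),
            step_budget: 0,
            approval_mode: ApprovalMode::Unprivileged,
        }
    }
}

impl SkillSecurity {
    /// True only for skills that declared trusted-skill approval.
    pub fn is_privileged(&self) -> bool {
        self.approval_mode == ApprovalMode::TrustedSkill
    }

    /// ASSUMPTION: anything that is not a well-formed trusted declaration
    /// collapses to the unprivileged default instead of failing the load.
    pub fn validated(self) -> Self {
        let has_operation = self
            .operation_type
            .as_deref()
            .map_or(false, |s| !s.trim().is_empty());
        let well_formed = self.approval_mode == ApprovalMode::TrustedSkill
            && has_operation
            && !self.allowed_capabilities.is_empty()
            && self.step_budget > 0;
        if well_formed {
            self
        } else {
            SkillSecurity::default()
        }
    }
}
```

Downgrading rather than rejecting keeps existing skills loadable; if the tests in Step 1 prefer rejection, validated() would return a Result instead.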

Step 4: Make read_skill expose the metadata

Return or prepend enough structured metadata so the runtime can distinguish trusted skill operations from plain prompt text.

Step 5: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs skill -- --nocapture

Expected: skill parsing and read_skill tests pass.

Step 6: Commit

Run:

git add third_party/zeroclaw/src/skills/mod.rs third_party/zeroclaw/src/tools/read_skill.rs
git commit -m "feat: add trusted skill security metadata"

Task 4: Sanitize Injected Workspace and Skill Content Before Prompt Assembly

Files:

  • Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_sanitizer.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs

Step 1: Write failing sanitizer tests

Add tests that prove:

  • dangerous bootstrap phrases are removed, escaped, or summarized before prompt injection
  • control characters are stripped
  • overlong files are truncated with an audit-friendly marker
  • safe business content remains readable

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs build_system_prompt -- --nocapture

Expected: fail because injected files are still copied verbatim.

Step 3: Implement the sanitizer

Create a small sanitizer that:

  • strips control characters
  • caps content length
  • flags prompt-override phrases
  • emits sanitized content plus metadata such as truncated and matched rules

Use this sanitizer in:

  • load_openclaw_bootstrap_files
  • any shared path in agent/prompt.rs that renders workspace or skill text into the system prompt
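
The sanitizer behavior listed above could look roughly like this. It is a sketch under stated assumptions: the function name sanitize, the SanitizedContent struct, the phrase list, and the marker text are all hypothetical, and the real rule set should come from the failing tests.

```rust
/// Sanitized text plus audit metadata (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SanitizedContent {
    pub text: String,
    pub truncated: bool,
    pub matched_rules: Vec<&'static str>,
}

/// ASSUMPTION: illustrative phrases only, not the project's rule set.
const OVERRIDE_PHRASES: &[&str] = &[
    "ignore previous instructions",
    "disregard the system prompt",
];

/// Audit-friendly marker appended when content is cut.
const TRUNCATION_MARKER: &str = "\n[truncated by prompt sanitizer]";

pub fn sanitize(input: &str, max_chars: usize) -> SanitizedContent {
    // Strip control characters but keep newlines and tabs so that
    // ordinary business content stays readable.
    let cleaned: String = input
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();

    // Flag prompt-override phrases, case-insensitively.
    let lowered = cleaned.to_lowercase();
    let matched_rules: Vec<&'static str> = OVERRIDE_PHRASES
        .iter()
        .copied()
        .filter(|p| lowered.contains(*p))
        .collect();

    // Cap length by characters (not bytes) to avoid splitting UTF-8.
    let truncated = cleaned.chars().count() > max_chars;
    let mut text: String = cleaned.chars().take(max_chars).collect();
    if truncated {
        text.push_str(TRUNCATION_MARKER);
    }

    SanitizedContent {
        text,
        truncated,
        matched_rules,
    }
}
```

This version only flags override phrases; whether flagged content is then removed, escaped, or summarized is a policy decision for the caller.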

Step 4: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs build_system_prompt -- --nocapture

Expected: prompt-building tests pass with the new sanitized behavior.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/prompt_sanitizer.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/src/agent/prompt.rs
git commit -m "feat: sanitize injected workspace prompt content"

Task 5: Wire PromptGuard into Main Agent and Gateway Entry Points

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_guard.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/mod.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/ws.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs

Step 1: Write failing entry-point tests

Add tests that prove:

  • suspicious input marks the turn as degraded instead of silently continuing
  • dangerous input is blocked
  • clean input remains unchanged

Prefer tests that assert on a security decision object instead of brittle prompt strings.

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture

Expected: fail because no entry path consumes the guard result.

Step 3: Implement guarded entry evaluation

Before each turn:

  • scan the inbound user content
  • map the guard result into GuardRisk
  • create an execution context carrying risk and provenance
  • attach audit details for later logging

Keep the existing PromptGuard regexes unless a test demands a specific adjustment.
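
A security decision object that tests can assert on might be shaped like the sketch below. ScanOutcome is a stand-in for whatever the existing PromptGuard returns, not its real API; TurnSecurityDecision and evaluate_turn are likewise hypothetical names.

```rust
/// Stand-in for the guard's scan result (NOT the real PromptGuard type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScanOutcome {
    Safe,
    Suspicious(Vec<String>),
    Blocked(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

/// Per-turn execution context carrying risk, provenance, and audit details.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TurnSecurityDecision {
    pub risk: GuardRisk,
    pub provenance: OperationProvenance,
    /// Rule names or block reason, kept for later audit logging.
    pub matched_rules: Vec<String>,
    /// True when the turn must not proceed at all.
    pub blocked: bool,
}

/// Map a scan result into the decision object before the turn runs.
pub fn evaluate_turn(
    outcome: ScanOutcome,
    provenance: OperationProvenance,
) -> TurnSecurityDecision {
    let (risk, matched_rules, blocked) = match outcome {
        ScanOutcome::Safe => (GuardRisk::Clean, Vec::new(), false),
        ScanOutcome::Suspicious(rules) => (GuardRisk::Suspicious, rules, false),
        ScanOutcome::Blocked(reason) => (GuardRisk::Dangerous, vec![reason], true),
    };
    TurnSecurityDecision {
        risk,
        provenance,
        matched_rules,
        blocked,
    }
}
```

Asserting on fields such as risk and blocked keeps the entry-point tests stable even when prompt wording changes.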

Step 4: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture

Expected: suspicious and blocked paths now behave deterministically.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/security/prompt_guard.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/gateway/mod.rs third_party/zeroclaw/src/gateway/ws.rs
git commit -m "feat: enforce prompt guard at runtime entry points"

Task 6: Add Business-Level Privileged Operation Registry and Approval Tokens

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/approval/mod.rs
  • Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs

Step 1: Write failing business approval tests

Add tests that prove:

  • only operations in the privileged registry can request approval
  • approval tokens bind to session_id, operation_type, allowed_capabilities, step_budget, and expiration
  • a mismatched or expired approval token is rejected

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs business_approval -- --nocapture

Expected: fail because the business approval registry does not exist yet.

Step 3: Implement the registry and token model

Create:

  • a privileged business operation registry
  • a single-operation approval token
  • helper checks for can_request_approval and matches_execution_request

Model approval at the business-operation level, not raw tool calls.
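
As a rough illustration of the registry and token checks, assuming Unix-second timestamps passed in by the caller so the logic stays deterministic in tests. The helper names can_request_approval and matches_execution_request come from this plan; the struct names, field types, and the ExecutionRequest shape are hypothetical.

```rust
use std::collections::BTreeSet;

/// Registry of business operations that may request approval.
pub struct PrivilegedOperationRegistry {
    operations: BTreeSet<String>,
}

impl PrivilegedOperationRegistry {
    pub fn new(ops: Vec<&str>) -> Self {
        PrivilegedOperationRegistry {
            operations: ops.into_iter().map(String::from).collect(),
        }
    }

    /// Only registered operation types can ask for approval.
    pub fn can_request_approval(&self, operation_type: &str) -> bool {
        self.operations.contains(operation_type)
    }
}

/// Single-operation approval token bound to one session and operation.
pub struct ApprovalToken {
    pub session_id: String,
    pub operation_type: String,
    pub allowed_capabilities: BTreeSet<String>,
    pub step_budget: u32,
    /// ASSUMPTION: expiration is stored as Unix seconds.
    pub expires_at_unix: u64,
}

/// Hypothetical view of one privileged step about to be executed.
pub struct ExecutionRequest<'a> {
    pub session_id: &'a str,
    pub operation_type: &'a str,
    pub capability: &'a str,
    pub steps_used: u32,
    pub now_unix: u64,
}

impl ApprovalToken {
    /// Every bound field must match; any mismatch or expiry rejects.
    pub fn matches_execution_request(&self, req: &ExecutionRequest<'_>) -> bool {
        self.session_id == req.session_id
            && self.operation_type == req.operation_type
            && self.allowed_capabilities.contains(req.capability)
            && req.steps_used < self.step_budget
            && req.now_unix < self.expires_at_unix
    }
}
```

Passing now_unix in rather than reading the clock inside the check makes the expired-token test a plain value comparison.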

Step 4: Extend the existing approval module

Teach the approval module to carry business-level fields through the current request/response flow without breaking old call sites.

Step 5: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs business_approval -- --nocapture

Expected: the token validation and registry tests pass.

Step 6: Commit

Run:

git add third_party/zeroclaw/src/approval/mod.rs third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/business_approval.rs
git commit -m "feat: add business-level approval registry"

Task 7: Enforce Execution Modes in Tool Dispatch

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/loop_.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs

Step 1: Write failing dispatcher tests

Add tests that prove:

  • suspect_readonly allows only safe read capabilities
  • trusted_skill can execute capabilities listed in its metadata within step_budget
  • mixed or non-skill privileged calls require a matching business approval token

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs dispatcher -- --nocapture

Expected: fail because the dispatcher does not yet know about execution modes.

Step 3: Implement capability enforcement

Before dispatching any tool:

  • resolve the operation context
  • map the tool call to a capability class
  • reject calls outside the current execution mode
  • decrement or validate step_budget for approved bounded flows

Do not rely on prompt text for enforcement.
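
The pre-dispatch check above can be sketched as a single function over the operation context. The CapabilityClass variants, OperationContext, DispatchError, and authorize are hypothetical names, and the rule that Clean mode still validates against the skill's declared capabilities is an assumption of this example.

```rust
/// Hypothetical capability classes; the real set is defined in Task 2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapabilityClass {
    Read,
    Write,
    Export,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

/// Resolved context for the operation that owns the tool call.
pub struct OperationContext {
    pub mode: ExecutionMode,
    /// Capabilities granted by skill metadata or an approval token.
    pub allowed: Vec<CapabilityClass>,
    pub steps_remaining: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    ReadOnly,
    AwaitingApproval,
    OutOfScope,
    BudgetExhausted,
}

/// Decide whether a tool call mapped to `cap` may run in this context.
pub fn authorize(
    ctx: &mut OperationContext,
    cap: CapabilityClass,
) -> Result<(), DispatchError> {
    // ASSUMPTION: safe reads are permitted in every mode and cost no budget.
    if cap == CapabilityClass::Read {
        return Ok(());
    }
    match ctx.mode {
        ExecutionMode::SuspectReadOnly => Err(DispatchError::ReadOnly),
        ExecutionMode::SuspectWaitingApproval => Err(DispatchError::AwaitingApproval),
        ExecutionMode::Clean | ExecutionMode::SuspectBusinessApproved => {
            if !ctx.allowed.contains(&cap) {
                return Err(DispatchError::OutOfScope);
            }
            if ctx.steps_remaining == 0 {
                return Err(DispatchError::BudgetExhausted);
            }
            // Privileged step accepted: consume one unit of step_budget.
            ctx.steps_remaining -= 1;
            Ok(())
        }
    }
}
```

Because the decision depends only on the context and the capability class, no prompt text is consulted at enforcement time.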

Step 4: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs dispatcher -- --nocapture

Expected: dispatch now respects read-only, trusted skill, and business-approved modes.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/agent/dispatcher.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/agent/loop_.rs
git commit -m "feat: enforce execution mode in tool dispatch"

Task 8: Default Skills Prompt Injection to Compact for Safer Runtime Behavior

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
  • Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs

Step 1: Write the failing configuration test

Add a test that asserts the default skill prompt injection mode is Compact unless explicitly configured otherwise.

Step 2: Run the test to verify RED

Run:

cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture

Expected: fail because defaults still point to Full.

Step 3: Implement the default flip

Update config defaults and any prompt-builder defaults that currently assume Full. Keep explicit user config backward compatible.
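
The default flip can be as small as moving the default marker on the mode enum. The enum name SkillsPromptInjectionMode and the resolve_mode helper below are assumptions based on the test filter used in this task, not a confirmed copy of the schema.

```rust
/// Hypothetical stand-in for the config enum in config/schema.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SkillsPromptInjectionMode {
    Full,
    /// Compact becomes the default after this task.
    #[default]
    Compact,
}

/// Explicit user configuration wins; otherwise fall back to the default.
pub fn resolve_mode(
    configured: Option<SkillsPromptInjectionMode>,
) -> SkillsPromptInjectionMode {
    configured.unwrap_or_default()
}
```

An explicit Full in user config still resolves to Full, which is the backward-compatibility guarantee this step asks for.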

Step 4: Re-run the test to verify GREEN

Run:

cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture

Expected: default configuration now resolves to Compact.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/config/schema.rs third_party/zeroclaw/src/agent/prompt.rs third_party/zeroclaw/src/channels/mod.rs
git commit -m "feat: default skills prompt injection to compact"

Task 9: Add Audit Logging and Regression Coverage

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/observability/mod.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
  • Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/tests/prompt_safety_regression.rs

Step 1: Write the failing regression tests

Cover:

  • prompt override attack from user content
  • malicious AGENTS.md bootstrap content
  • trusted skill execution within bounds
  • non-skill privileged request requiring business approval
  • approval token mismatch
  • session history restore preserving degraded mode

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture

Expected: fail because the end-to-end behavior is not wired together yet.

Step 3: Implement audit logging

Record:

  • input hash
  • matched guard rules
  • risk level
  • provenance
  • execution mode transitions
  • approval decisions

Avoid logging raw sensitive content.
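
A sketch of an audit record that stores a hash instead of the raw input. The AuditRecord shape and its method names are hypothetical. FNV-1a is used here only because it needs no dependencies; it is not a cryptographic hash, and a real implementation would more likely use something like SHA-256.

```rust
/// 64-bit FNV-1a, a dependency-free placeholder for a real digest.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical audit record; raw input is never stored.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditRecord {
    pub input_hash: String,
    pub matched_rules: Vec<String>,
    pub risk: String,
    pub provenance: String,
    pub mode_from: String,
    pub mode_to: String,
    pub approval_decision: Option<String>,
}

impl AuditRecord {
    /// Build a record from the raw input, keeping only its hash.
    pub fn for_input(
        raw_input: &str,
        matched_rules: Vec<String>,
        risk: &str,
        provenance: &str,
        mode_from: &str,
        mode_to: &str,
        approval_decision: Option<String>,
    ) -> Self {
        AuditRecord {
            input_hash: format!("{:016x}", fnv1a_64(raw_input.as_bytes())),
            matched_rules,
            risk: risk.to_string(),
            provenance: provenance.to_string(),
            mode_from: mode_from.to_string(),
            mode_to: mode_to.to_string(),
            approval_decision,
        }
    }

    /// Render a single key=value log line.
    pub fn to_log_line(&self) -> String {
        format!(
            "input_hash={} risk={} provenance={} mode={}->{} rules=[{}] approval={}",
            self.input_hash,
            self.risk,
            self.provenance,
            self.mode_from,
            self.mode_to,
            self.matched_rules.join(","),
            self.approval_decision.as_deref().unwrap_or("none"),
        )
    }
}
```

Logging rule names rather than matched text keeps attacker-controlled strings out of the audit trail.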

Step 4: Re-run the regression tests to verify GREEN

Run:

cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture

Expected: the regression suite passes.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/observability/mod.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/tests/prompt_safety_regression.rs
git commit -m "test: add prompt safety regression coverage"

Task 10: Final Verification and Integration Review

Files:

  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/L5-提示词分布与安全改造方案.md
  • Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/README.md

Step 1: Run targeted verification

Run:

cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
cargo test -p zeroclawlabs dispatcher -- --nocapture
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture

Expected: all prompt safety and dispatcher tests pass.

Step 2: Run a broad ZeroClaw package test pass if time permits

Run:

cargo test -p zeroclawlabs -- --nocapture

Expected: no regressions in the vendored package test suite, or a documented list of unrelated existing failures.

Step 3: Update the security design docs

Document:

  • execution modes
  • trusted skill metadata contract
  • business approval flow
  • why non-skill privileged actions are gated

Step 4: Commit the docs

Run:

git add docs/L5-提示词分布与安全改造方案.md docs/README.md
git commit -m "docs: record prompt safety hardening design"

Step 5: Prepare merge review notes

Write a short integration summary covering:

  • changed entry points
  • backward-compatibility expectations
  • any skills that need metadata upgrades
  • rollout recommendation for existing integrators