ZeroClaw Prompt Safety Hardening Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Harden ZeroClaw prompt handling and tool execution so non-skill freeform operations degrade to read-only or business-approved execution, while trusted skill-defined operations retain bounded execution privileges.
Architecture: Build a security gate around the existing prompt and tool-entry paths instead of rewriting the full prompt architecture. The gate classifies prompt-injection risk, records operation provenance (trusted_skill vs non_skill), sanitizes injected workspace/skill content, and enforces execution mode transitions (clean, suspect_readonly, suspect_waiting_approval, suspect_business_approved). Trusted skills gain structured business-operation metadata; non-skill operations require business-level approval before any privileged capability is released.
Tech Stack: Rust, vendored ZeroClaw (third_party/zeroclaw), existing approval/autonomy system, current prompt guard and prompt builder tests, cargo test.
Task 1: Create an Isolated Worktree and Verify a Clean Baseline
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.gitignore
- Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/**
Step 1: Verify the worktree directory is safe to use
Run:
cd /home/zyl/projects/sgClaw/claw
ls -d .worktrees
git check-ignore -v .worktrees
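Expected: .worktrees exists and is ignored by git. If it is not ignored, add it to .gitignore before continuing; that adjustment is what Step 4 commits.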
Step 2: Create the implementation worktree
Run:
cd /home/zyl/projects/sgClaw/claw
git worktree add .worktrees/zeroclaw-prompt-safety-hardening -b zeroclaw-prompt-safety-hardening
Expected: a new branch and worktree are created.
Step 3: Build the baseline in the worktree
Run:
cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
Expected: existing relevant tests pass before any code changes.
Step 4: Commit the clean worktree setup if .gitignore changed
Run:
git add .gitignore
git commit -m "chore: prepare worktree for prompt safety hardening"
Expected: commit only if .gitignore required an adjustment.
Task 2: Add the Core Security-Mode Data Model
Files:
- Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs
Step 1: Write the failing policy tests
Add tests that prove:
- suspicious non-skill input maps to suspect_readonly
- trusted skill operations can request bounded privileged execution
- any out-of-scope capability request downgrades the operation
Use concrete enums and assertions, for example:
assert_eq!(
    ExecutionMode::from_guard_and_provenance(GuardRisk::Suspicious, OperationProvenance::NonSkill),
    ExecutionMode::SuspectReadOnly
);
Step 2: Run the tests to verify RED
Run:
cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs operation_policy -- --nocapture
Expected: fail because the new types do not exist yet.
Step 3: Implement the minimal policy model
Define:
- GuardRisk (Clean, Suspicious, Dangerous)
- OperationProvenance (TrustedSkill, NonSkill, Mixed)
- ExecutionMode (Clean, SuspectReadOnly, SuspectWaitingApproval, SuspectBusinessApproved)
- CapabilityClass for privileged business actions
Add small helper functions that do only state mapping. Do not pull prompt-building logic into this module.
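The state mapping described above could look roughly like the following sketch. Only the (Suspicious, NonSkill) to SuspectReadOnly arm comes from the test in Step 1; every other arm, and the two transition helpers request_approval and grant_approval, are assumptions made for illustration and should be settled by the real policy tests.

```rust
/// Risk level reported by the prompt guard for one inbound turn.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

/// Where the requested operation originated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

impl ExecutionMode {
    /// Pure state mapping; no prompt-building logic lives here.
    pub fn from_guard_and_provenance(risk: GuardRisk, provenance: OperationProvenance) -> Self {
        match (risk, provenance) {
            // Assumption: clean input never degrades the turn.
            (GuardRisk::Clean, _) => ExecutionMode::Clean,
            // Assumption: a trusted skill stays bounded by its own metadata.
            (GuardRisk::Suspicious, OperationProvenance::TrustedSkill) => ExecutionMode::Clean,
            // From the plan: suspicious non-skill input becomes read-only.
            // Assumption: Mixed provenance and Dangerous risk degrade the same way.
            (GuardRisk::Suspicious, _) | (GuardRisk::Dangerous, _) => ExecutionMode::SuspectReadOnly,
        }
    }

    /// Hypothetical helper: a read-only turn asks for business approval.
    pub fn request_approval(self) -> Self {
        match self {
            ExecutionMode::SuspectReadOnly => ExecutionMode::SuspectWaitingApproval,
            other => other,
        }
    }

    /// Hypothetical helper: a waiting turn receives business approval.
    pub fn grant_approval(self) -> Self {
        match self {
            ExecutionMode::SuspectWaitingApproval => ExecutionMode::SuspectBusinessApproved,
            other => other,
        }
    }
}
```

Keeping the helpers as total functions over small Copy enums makes each mapping easy to cover with one assertion per arm.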
Step 4: Re-run the policy tests to verify GREEN
Run:
cargo test -p zeroclawlabs operation_policy -- --nocapture
Expected: the new policy tests pass.
Step 5: Commit
Run:
git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/operation_policy.rs
git commit -m "feat: add prompt security execution mode model"
Task 3: Add Structured Skill Trust Metadata
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/tools/read_skill.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs
Step 1: Write failing skill metadata tests
Add tests that prove:
- SKILL.toml can declare a business operation type, capability list, argument constraints, and step_budget
- markdown-only skills default to unprivileged metadata
- malformed privileged metadata is rejected or downgraded safely
Use a manifest shape like:
[skill]
name = "export-report"
description = "Export the monthly report"
[security]
operation_type = "browser_export_data"
allowed_capabilities = ["browser_read", "browser_export"]
step_budget = 6
approval_mode = "trusted_skill"
Step 2: Run the tests to verify RED
Run:
cargo test -p zeroclawlabs skill -- --nocapture
Expected: fail because the structured metadata fields are missing.
Step 3: Implement minimal structured metadata
Extend Skill with a structured security block, for example:
- operation_type
- business_description
- allowed_capabilities
- arg_constraints
- step_budget
- approval_mode
Default markdown-only skills to unprivileged metadata so existing skills remain compatible.
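One possible shape for the security block is sketched below, using only the standard library. The type names SkillSecurity and ApprovalMode, the field types, and the validated helper are assumptions for illustration; the existing Skill struct and its TOML deserialization are not shown and may dictate different types.

```rust
/// Hypothetical approval mode for a skill; Unprivileged is the safe default.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum ApprovalMode {
    #[default]
    Unprivileged,
    TrustedSkill,
}

/// Hypothetical structured security block attached to a skill.
/// `Default` yields the unprivileged metadata used for markdown-only skills.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SkillSecurity {
    pub operation_type: Option<String>,
    pub business_description: Option<String>,
    pub allowed_capabilities: Vec<String>,
    /// Assumed representation: (argument name, constraint expression) pairs.
    pub arg_constraints: Vec<(String, String)>,
    pub step_budget: u32,
    pub approval_mode: ApprovalMode,
}

impl SkillSecurity {
    /// Downgrade malformed privileged metadata to the unprivileged default.
    pub fn validated(self) -> Self {
        let has_operation = self
            .operation_type
            .as_deref()
            .map_or(false, |op| !op.trim().is_empty());
        let complete =
            has_operation && !self.allowed_capabilities.is_empty() && self.step_budget > 0;
        if self.approval_mode == ApprovalMode::TrustedSkill && !complete {
            SkillSecurity::default()
        } else {
            self
        }
    }

    /// True only when the skill may run bounded privileged steps.
    pub fn is_privileged(&self) -> bool {
        self.approval_mode == ApprovalMode::TrustedSkill
    }
}
```

Downgrading rather than failing the load is one of the two options the tests allow ("rejected or downgraded safely"); rejecting with an error would be equally valid.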
Step 4: Make read_skill expose the metadata
Return or prepend enough structured metadata so the runtime can distinguish trusted skill operations from plain prompt text.
Step 5: Re-run the tests to verify GREEN
Run:
cargo test -p zeroclawlabs skill -- --nocapture
Expected: skill parsing and read_skill tests pass.
Step 6: Commit
Run:
git add third_party/zeroclaw/src/skills/mod.rs third_party/zeroclaw/src/tools/read_skill.rs
git commit -m "feat: add trusted skill security metadata"
Task 4: Sanitize Injected Workspace and Skill Content Before Prompt Assembly
Files:
- Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_sanitizer.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
Step 1: Write failing sanitizer tests
Add tests that prove:
- dangerous bootstrap phrases are removed, escaped, or summarized before prompt injection
- control characters are stripped
- overlong files are truncated with an audit-friendly marker
- safe business content remains readable
Step 2: Run the tests to verify RED
Run:
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
Expected: fail because injected files are still copied verbatim.
Step 3: Implement the sanitizer
Create a small sanitizer that:
- strips control characters
- caps content length
- flags prompt-override phrases
- emits sanitized content plus metadata such as truncated and matched rules
Use this sanitizer in:
- load_openclaw_bootstrap_files
- any shared path in agent/prompt.rs that renders workspace or skill text into the system prompt
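A minimal sketch of such a sanitizer follows, using only the standard library. The function name sanitize, the SanitizedContent struct, the phrase list, and the truncation marker text are all assumptions; the real rule set would presumably be driven by the failing tests from Step 1. This version flags override phrases without removing them.

```rust
/// Hypothetical sanitizer output: cleaned text plus audit metadata.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SanitizedContent {
    pub text: String,
    pub truncated: bool,
    pub matched_rules: Vec<&'static str>,
}

/// Illustrative phrase list only; not the project's actual rules.
const OVERRIDE_PHRASES: &[&str] = &[
    "ignore previous instructions",
    "disregard the system prompt",
];

/// Assumed audit-friendly marker appended when content is cut.
const TRUNCATION_MARKER: &str = "\n[truncated by prompt sanitizer]";

pub fn sanitize(input: &str, max_chars: usize) -> SanitizedContent {
    // Strip control characters but keep newlines and tabs so text stays readable.
    let cleaned: String = input
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();

    // Flag prompt-override phrases case-insensitively, before truncation,
    // so a phrase beyond the cap is still recorded.
    let lowered = cleaned.to_lowercase();
    let matched_rules: Vec<&'static str> = OVERRIDE_PHRASES
        .iter()
        .copied()
        .filter(|phrase| lowered.contains(*phrase))
        .collect();

    // Cap length by characters (not bytes) to avoid splitting a code point.
    let truncated = cleaned.chars().count() > max_chars;
    let mut text: String = cleaned.chars().take(max_chars).collect();
    if truncated {
        text.push_str(TRUNCATION_MARKER);
    }

    SanitizedContent {
        text,
        truncated,
        matched_rules,
    }
}
```

Returning metadata alongside the text lets the caller decide whether a match should escalate the turn's risk level or only be logged.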
Step 4: Re-run the tests to verify GREEN
Run:
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
Expected: prompt-building tests pass with the new sanitized behavior.
Step 5: Commit
Run:
git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/prompt_sanitizer.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/src/agent/prompt.rs
git commit -m "feat: sanitize injected workspace prompt content"
Task 5: Wire PromptGuard into Main Agent and Gateway Entry Points
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_guard.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/mod.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/ws.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
Step 1: Write failing entry-point tests
Add tests that prove:
- suspicious input marks the turn as degraded instead of silently continuing
- dangerous input is blocked
- clean input remains unchanged
Prefer tests that assert on a security decision object instead of brittle prompt strings.
Step 2: Run the tests to verify RED
Run:
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture
Expected: fail because no entry path consumes the guard result.
Step 3: Implement guarded entry evaluation
Before each turn:
- scan the inbound user content
- map the guard result into GuardRisk
- create an execution context carrying risk and provenance
- attach audit details for later logging
Keep the existing PromptGuard regexes unless a test demands a specific adjustment.
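The per-turn evaluation could be sketched as below. ScanOutcome is a hypothetical stand-in for whatever the existing PromptGuard actually returns, and TurnSecurityContext, TurnDecision, and evaluate_turn are illustrative names, not existing ZeroClaw APIs. The point is that tests can assert on a decision object rather than on prompt strings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

/// Hypothetical stand-in for the existing guard's scan result.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScanOutcome {
    Safe,
    Suspicious { matched_rules: Vec<String> },
    Blocked { reason: String },
}

/// Execution context carried through the turn for enforcement and logging.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TurnSecurityContext {
    pub risk: GuardRisk,
    pub provenance: OperationProvenance,
    pub audit_details: Vec<String>,
}

/// Security decision object that entry-point tests can assert on.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TurnDecision {
    Proceed(TurnSecurityContext),
    Degrade(TurnSecurityContext),
    Block(TurnSecurityContext),
}

pub fn evaluate_turn(outcome: ScanOutcome, provenance: OperationProvenance) -> TurnDecision {
    match outcome {
        // Clean input passes through unchanged.
        ScanOutcome::Safe => TurnDecision::Proceed(TurnSecurityContext {
            risk: GuardRisk::Clean,
            provenance,
            audit_details: Vec::new(),
        }),
        // Suspicious input marks the turn as degraded instead of continuing silently.
        ScanOutcome::Suspicious { matched_rules } => TurnDecision::Degrade(TurnSecurityContext {
            risk: GuardRisk::Suspicious,
            provenance,
            audit_details: matched_rules,
        }),
        // Dangerous input is blocked; the reason is kept for the audit trail.
        ScanOutcome::Blocked { reason } => TurnDecision::Block(TurnSecurityContext {
            risk: GuardRisk::Dangerous,
            provenance,
            audit_details: vec![reason],
        }),
    }
}
```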
Step 4: Re-run the tests to verify GREEN
Run:
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture
Expected: suspicious and blocked paths now behave deterministically.
Step 5: Commit
Run:
git add third_party/zeroclaw/src/security/prompt_guard.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/gateway/mod.rs third_party/zeroclaw/src/gateway/ws.rs
git commit -m "feat: enforce prompt guard at runtime entry points"
Task 6: Add Business-Level Privileged Operation Registry and Approval Tokens
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/approval/mod.rs
- Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs
Step 1: Write failing business approval tests
Add tests that prove:
- only operations in the privileged registry can request approval
- approval tokens bind to session_id, operation_type, allowed_capabilities, step_budget, and expiration
- a mismatched or expired approval token is rejected
Step 2: Run the tests to verify RED
Run:
cargo test -p zeroclawlabs business_approval -- --nocapture
Expected: fail because the business approval registry does not exist yet.
Step 3: Implement the registry and token model
Create:
- a privileged business operation registry
- a single-operation approval token
- helper checks for can_request_approval and matches_execution_request
Model approval at the business-operation level, not raw tool calls.
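As a rough sketch of the registry and token under these requirements: the struct layouts, the ExecutionRequest type, and passing the current time as a Unix timestamp (so tests stay deterministic) are assumptions; only the bound fields and the two helper names come from the plan.

```rust
use std::collections::BTreeSet;

/// Hypothetical registry of business operations that may request approval.
pub struct PrivilegedRegistry {
    operations: BTreeSet<String>,
}

impl PrivilegedRegistry {
    pub fn new<I, S>(operations: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            operations: operations.into_iter().map(Into::into).collect(),
        }
    }

    /// Only registered operation types can ask for business approval.
    pub fn can_request_approval(&self, operation_type: &str) -> bool {
        self.operations.contains(operation_type)
    }
}

/// Single-operation approval token bound to the fields the plan lists.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApprovalToken {
    pub session_id: String,
    pub operation_type: String,
    pub allowed_capabilities: BTreeSet<String>,
    pub step_budget: u32,
    /// Assumed representation of expiration: Unix seconds.
    pub expires_at_unix: u64,
}

/// Hypothetical view of one execution request at the business-operation level.
#[derive(Debug, Clone, Copy)]
pub struct ExecutionRequest<'a> {
    pub session_id: &'a str,
    pub operation_type: &'a str,
    pub capability: &'a str,
    pub steps_used: u32,
    pub now_unix: u64,
}

impl ApprovalToken {
    /// Every bound field must match; an expired or exhausted token is rejected.
    pub fn matches_execution_request(&self, req: &ExecutionRequest<'_>) -> bool {
        req.now_unix < self.expires_at_unix
            && req.session_id == self.session_id
            && req.operation_type == self.operation_type
            && self.allowed_capabilities.contains(req.capability)
            && req.steps_used < self.step_budget
    }
}
```

Because the token names an operation type and a capability set rather than a tool name, the same approval covers every tool call that belongs to that one business operation and nothing else.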
Step 4: Extend the existing approval module
Teach the approval module to carry business-level fields through the current request/response flow without breaking old call sites.
Step 5: Re-run the tests to verify GREEN
Run:
cargo test -p zeroclawlabs business_approval -- --nocapture
Expected: the token validation and registry tests pass.
Step 6: Commit
Run:
git add third_party/zeroclaw/src/approval/mod.rs third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/business_approval.rs
git commit -m "feat: add business-level approval registry"
Task 7: Enforce Execution Modes in Tool Dispatch
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/loop_.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs
Step 1: Write failing dispatcher tests
Add tests that prove:
- suspect_readonly allows only safe read capabilities
- trusted_skill can execute capabilities listed in its metadata within step_budget
- mixed or non-skill privileged calls require a matching business approval token
Step 2: Run the tests to verify RED
Run:
cargo test -p zeroclawlabs dispatcher -- --nocapture
Expected: fail because the dispatcher does not yet know about execution modes.
Step 3: Implement capability enforcement
Before dispatching any tool:
- resolve the operation context
- map the tool call to a capability class
- reject calls outside the current execution mode
- decrement or validate step_budget for approved bounded flows
Do not rely on prompt text for enforcement.
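The pre-dispatch check might be sketched like this. The capability classes (Read, Export, Write), the tool names in capability_for_tool, and the OperationContext layout are hypothetical; in this sketch the allowed list stands for whatever a trusted skill's metadata or a matching approval token has granted, and an empty list means nothing privileged is permitted.

```rust
/// Hypothetical capability classes; the real set comes from CapabilityClass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapabilityClass {
    Read,
    Export,
    Write,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

/// Assumed per-operation context resolved before each tool call.
pub struct OperationContext {
    pub mode: ExecutionMode,
    /// Privileged capabilities granted by skill metadata or an approval token.
    pub allowed: Vec<CapabilityClass>,
    pub steps_remaining: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DispatchError {
    NotAllowedInMode,
    OutOfScope,
    StepBudgetExhausted,
}

/// Illustrative tool-to-capability mapping; unknown tools get the strictest class.
pub fn capability_for_tool(tool: &str) -> CapabilityClass {
    match tool {
        "file_read" | "browser_read" => CapabilityClass::Read,
        "browser_export" => CapabilityClass::Export,
        _ => CapabilityClass::Write,
    }
}

/// Enforcement is based on mode and grants only, never on prompt text.
pub fn authorize(ctx: &mut OperationContext, tool: &str) -> Result<CapabilityClass, DispatchError> {
    let capability = capability_for_tool(tool);
    // Assumption: safe reads are permitted in every mode and cost no budget.
    if capability == CapabilityClass::Read {
        return Ok(capability);
    }
    match ctx.mode {
        ExecutionMode::SuspectReadOnly | ExecutionMode::SuspectWaitingApproval => {
            Err(DispatchError::NotAllowedInMode)
        }
        ExecutionMode::Clean | ExecutionMode::SuspectBusinessApproved => {
            if !ctx.allowed.contains(&capability) {
                return Err(DispatchError::OutOfScope);
            }
            if ctx.steps_remaining == 0 {
                return Err(DispatchError::StepBudgetExhausted);
            }
            ctx.steps_remaining -= 1;
            Ok(capability)
        }
    }
}
```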
Step 4: Re-run the tests to verify GREEN
Run:
cargo test -p zeroclawlabs dispatcher -- --nocapture
Expected: dispatch now respects read-only, trusted skill, and business-approved modes.
Step 5: Commit
Run:
git add third_party/zeroclaw/src/agent/dispatcher.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/agent/loop_.rs
git commit -m "feat: enforce execution mode in tool dispatch"
Task 8: Default Skills Prompt Injection to Compact for Safer Runtime Behavior
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
- Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs
Step 1: Write the failing configuration test
Add a test that asserts the default skill prompt injection mode is Compact unless explicitly configured otherwise.
Step 2: Run the test to verify RED
Run:
cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture
Expected: fail because defaults still point to Full.
Step 3: Implement the default flip
Update config defaults and any prompt-builder defaults that currently assume Full. Keep explicit user config backward compatible.
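The default flip itself can be as small as moving one attribute, as in the sketch below. The enum name SkillsPromptInjectionMode and the resolve_mode helper are assumptions for illustration; the actual type in config/schema.rs and its deserialization path may differ.

```rust
/// Assumed shape of the injection-mode setting; Compact is the new default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SkillsPromptInjectionMode {
    Full,
    #[default]
    Compact,
}

/// Hypothetical resolver: an explicit value always wins over the default,
/// which keeps existing user configuration backward compatible.
pub fn resolve_mode(configured: Option<&str>) -> Result<SkillsPromptInjectionMode, String> {
    let normalized = configured.map(|value| value.trim().to_ascii_lowercase());
    match normalized.as_deref() {
        None => Ok(SkillsPromptInjectionMode::default()),
        Some("full") => Ok(SkillsPromptInjectionMode::Full),
        Some("compact") => Ok(SkillsPromptInjectionMode::Compact),
        Some(other) => Err(format!("unknown skills prompt injection mode: {other}")),
    }
}
```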
Step 4: Re-run the test to verify GREEN
Run:
cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture
Expected: default configuration now resolves to Compact.
Step 5: Commit
Run:
git add third_party/zeroclaw/src/config/schema.rs third_party/zeroclaw/src/agent/prompt.rs third_party/zeroclaw/src/channels/mod.rs
git commit -m "feat: default skills prompt injection to compact"
Task 9: Add Audit Logging and Regression Coverage
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/observability/mod.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
- Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/tests/prompt_safety_regression.rs
Step 1: Write the failing regression tests
Cover:
- prompt override attack from user content
- malicious AGENTS.md bootstrap content
- trusted skill execution within bounds
- non-skill privileged request requiring business approval
- approval token mismatch
- session history restore preserving degraded mode
Step 2: Run the tests to verify RED
Run:
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture
Expected: fail because the end-to-end behavior is not wired together yet.
Step 3: Implement audit logging
Record:
- input hash
- matched guard rules
- risk level
- provenance
- execution mode transitions
- approval decisions
Avoid logging raw sensitive content.
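An audit record carrying these fields might look like the sketch below. The AuditRecord layout, the string-typed fields, and the log-line format are assumptions. The hash uses the standard library's DefaultHasher purely as a dependency-free stand-in: it is not cryptographic and its output is not guaranteed stable across Rust releases, so a real implementation would more likely use a cryptographic digest from an existing dependency.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical audit record; it stores a hash, never the raw input.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditRecord {
    pub input_hash: u64,
    pub matched_rules: Vec<String>,
    pub risk: String,
    pub provenance: String,
    /// (from, to) execution modes for this transition.
    pub mode_transition: (String, String),
    /// None when no approval was requested for this turn.
    pub approval_decision: Option<bool>,
}

/// Stand-in hash (non-cryptographic) so raw content stays out of the logs.
pub fn hash_input(raw: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    raw.hash(&mut hasher);
    hasher.finish()
}

impl AuditRecord {
    /// Render one key=value log line containing only non-sensitive fields.
    pub fn to_log_line(&self) -> String {
        let approval = match self.approval_decision {
            Some(true) => "approved",
            Some(false) => "rejected",
            None => "none",
        };
        format!(
            "input_hash={:016x} rules=[{}] risk={} provenance={} mode={}->{} approval={}",
            self.input_hash,
            self.matched_rules.join(","),
            self.risk,
            self.provenance,
            self.mode_transition.0,
            self.mode_transition.1,
            approval,
        )
    }
}
```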
Step 4: Re-run the regression tests to verify GREEN
Run:
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture
Expected: the regression suite passes.
Step 5: Commit
Run:
git add third_party/zeroclaw/src/observability/mod.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/tests/prompt_safety_regression.rs
git commit -m "test: add prompt safety regression coverage"
Task 10: Final Verification and Integration Review
Files:
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/L5-提示词分布与安全改造方案.md
- Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/README.md
Step 1: Run targeted verification
Run:
cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
cargo test -p zeroclawlabs dispatcher -- --nocapture
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture
Expected: all prompt safety and dispatcher tests pass.
Step 2: Run a broad ZeroClaw package test pass if time permits
Run:
cargo test -p zeroclawlabs -- --nocapture
Expected: no regressions in the vendored package test suite, or a documented list of unrelated existing failures.
Step 3: Update the security design docs
Document:
- execution modes
- trusted skill metadata contract
- business approval flow
- why non-skill privileged actions are gated
Step 4: Commit the docs
Run:
git add docs/L5-提示词分布与安全改造方案.md docs/README.md
git commit -m "docs: record prompt safety hardening design"
Step 5: Prepare merge review notes
Write a short integration summary covering:
- changed entry points
- backward-compatibility expectations
- any skills that need metadata upgrades
- rollout recommendation for existing integrators