zyl f51d6b7659 sgclaw: snapshot today's runtime and skill updates

2026-03-30 18:39:49 +08:00

ZeroClaw Prompt Safety Hardening Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Harden ZeroClaw prompt handling and tool execution so non-skill freeform operations degrade to read-only or business-approved execution, while trusted skill-defined operations retain bounded execution privileges.

Architecture: Build a security gate around the existing prompt and tool-entry paths instead of rewriting the full prompt architecture. The gate classifies prompt-injection risk, records operation provenance (trusted_skill vs non_skill), sanitizes injected workspace/skill content, and enforces execution mode transitions (clean, suspect_readonly, suspect_waiting_approval, suspect_business_approved). Trusted skills gain structured business-operation metadata; non-skill operations require business-level approval before any privileged capability is released.

Tech Stack: Rust, vendored ZeroClaw (third_party/zeroclaw), existing approval/autonomy system, current prompt guard and prompt builder tests, cargo test.

Task 1: Create an Isolated Worktree and Verify a Clean Baseline

Files:

Modify: /home/zyl/projects/sgClaw/claw/.gitignore
Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/**

Step 1: Verify the worktree directory is safe to use

Run:

cd /home/zyl/projects/sgClaw/claw
ls -d .worktrees
git check-ignore -v .worktrees

Expected: .worktrees exists and is ignored by git. If git check-ignore prints nothing, add .worktrees/ to .gitignore before continuing.

Step 2: Create the implementation worktree

Run:

cd /home/zyl/projects/sgClaw/claw
git worktree add .worktrees/zeroclaw-prompt-safety-hardening -b zeroclaw-prompt-safety-hardening

Expected: a new branch and worktree are created.

Step 3: Build the baseline in the worktree

Run:

cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture

Expected: existing relevant tests pass before any code changes.

Step 4: Commit the clean worktree setup if .gitignore changed

Run:

git add .gitignore
git commit -m "chore: prepare worktree for prompt safety hardening"

Expected: a commit is created only if .gitignore needed a change; otherwise skip this step.

Task 2: Add the Core Security-Mode Data Model

Files:

Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/operation_policy.rs

Step 1: Write the failing policy tests

Add tests that prove:

suspicious non-skill input maps to suspect_readonly
trusted skill operations can request bounded privileged execution
any out-of-scope capability request downgrades the operation

Use concrete enums and assertions, for example:

assert_eq!(
    ExecutionMode::from_guard_and_provenance(GuardRisk::Suspicious, OperationProvenance::NonSkill),
    ExecutionMode::SuspectReadOnly
);

Step 2: Run the tests to verify RED

Run:

cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs operation_policy -- --nocapture

Expected: fail because the new types do not exist yet.

Step 3: Implement the minimal policy model

Define:

GuardRisk (Clean, Suspicious, Dangerous)
OperationProvenance (TrustedSkill, NonSkill, Mixed)
ExecutionMode (Clean, SuspectReadOnly, SuspectWaitingApproval, SuspectBusinessApproved)
CapabilityClass for privileged business actions

Add small helper functions that do only state mapping. Do not pull prompt-building logic into this module.
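
The model above can be sketched roughly as follows. Only the Suspicious + NonSkill case is fixed by the example assertion in this plan; every other arm of the mapping (Clean risk staying Clean, a suspicious trusted-skill turn staying Clean, Dangerous falling back to read-only here and being blocked at the entry point) is an assumption made for illustration, and CapabilityClass is left out for brevity.

```rust
/// Risk level reported by the prompt guard for one inbound turn.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

/// Where the requested operation originated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

/// Execution modes named in the architecture summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

impl ExecutionMode {
    /// Pure state mapping; no prompt-building logic lives here.
    pub fn from_guard_and_provenance(
        risk: GuardRisk,
        provenance: OperationProvenance,
    ) -> Self {
        match (risk, provenance) {
            // ASSUMPTION: clean input never changes the mode.
            (GuardRisk::Clean, _) => ExecutionMode::Clean,
            // ASSUMPTION: a suspicious turn driven only by a trusted skill
            // stays Clean because the skill metadata already bounds it.
            (GuardRisk::Suspicious, OperationProvenance::TrustedSkill) => ExecutionMode::Clean,
            // Suspicious non-skill or mixed input degrades to read-only.
            // ASSUMPTION: Dangerous also lands here; blocking is expected
            // to happen earlier, at the entry point.
            _ => ExecutionMode::SuspectReadOnly,
        }
    }
}
```

Keeping the mapping in one exhaustive match means a new risk level or provenance value forces a compile-time decision instead of silently inheriting a default.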

Step 4: Re-run the policy tests to verify GREEN

Run:

cargo test -p zeroclawlabs operation_policy -- --nocapture

Expected: the new policy tests pass.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/operation_policy.rs
git commit -m "feat: add prompt security execution mode model"

Task 3: Add Structured Skill Trust Metadata

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/tools/read_skill.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/skills/mod.rs

Step 1: Write failing skill metadata tests

Add tests that prove:

SKILL.toml can declare a business operation type, capability list, argument constraints, and step_budget
markdown-only skills default to unprivileged metadata
malformed privileged metadata is rejected or downgraded safely

Use a manifest shape like:

[skill]
name = "export-report"
description = "Export the monthly report"

[security]
operation_type = "browser_export_data"
allowed_capabilities = ["browser_read", "browser_export"]
step_budget = 6
approval_mode = "trusted_skill"

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs skill -- --nocapture

Expected: fail because the structured metadata fields are missing.

Step 3: Implement minimal structured metadata

Extend Skill with a structured security block, for example:

operation_type
business_description
allowed_capabilities
arg_constraints
step_budget
approval_mode

Default markdown-only skills to unprivileged metadata so existing skills remain compatible.
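
One possible shape for the security block is sketched below. The field names follow the list above, but the concrete types, the ApprovalMode enum, and the normalized helper are hypothetical; the real Skill struct in third_party/zeroclaw may look different, and TOML parsing is deliberately omitted.

```rust
/// HYPOTHETICAL approval modes; only "trusted_skill" appears in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ApprovalMode {
    /// Default for markdown-only skills.
    #[default]
    Unprivileged,
    TrustedSkill,
}

/// HYPOTHETICAL structured security block attached to a skill.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SkillSecurity {
    pub operation_type: Option<String>,
    pub business_description: Option<String>,
    pub allowed_capabilities: Vec<String>,
    pub arg_constraints: Vec<String>,
    pub step_budget: u32,
    pub approval_mode: ApprovalMode,
}

impl SkillSecurity {
    pub fn is_privileged(&self) -> bool {
        self.approval_mode == ApprovalMode::TrustedSkill
    }

    /// Downgrades malformed privileged metadata to the unprivileged default.
    /// ASSUMPTION: "well formed" means a non-empty operation type, at least
    /// one capability, and a positive step budget.
    pub fn normalized(self) -> Self {
        let well_formed = self
            .operation_type
            .as_deref()
            .map_or(false, |op| !op.trim().is_empty())
            && !self.allowed_capabilities.is_empty()
            && self.step_budget > 0;
        if self.is_privileged() && !well_formed {
            SkillSecurity::default()
        } else {
            self
        }
    }
}
```

Deriving Default for the whole block keeps existing markdown-only skills compatible: a skill with no [security] table simply gets SkillSecurity::default().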

Step 4: Make read_skill expose the metadata

Return or prepend enough structured metadata so the runtime can distinguish trusted skill operations from plain prompt text.

Step 5: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs skill -- --nocapture

Expected: skill parsing and read_skill tests pass.

Step 6: Commit

Run:

git add third_party/zeroclaw/src/skills/mod.rs third_party/zeroclaw/src/tools/read_skill.rs
git commit -m "feat: add trusted skill security metadata"

Task 4: Sanitize Injected Workspace and Skill Content Before Prompt Assembly

Files:

Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_sanitizer.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs

Step 1: Write failing sanitizer tests

Add tests that prove:

dangerous bootstrap phrases are removed, escaped, or summarized before prompt injection
control characters are stripped
overlong files are truncated with an audit-friendly marker
safe business content remains readable

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs build_system_prompt -- --nocapture

Expected: fail because injected files are still copied verbatim.

Step 3: Implement the sanitizer

Create a small sanitizer that:

strips control characters
caps content length
flags prompt-override phrases
emits sanitized content plus metadata such as truncated and matched rules

Use this sanitizer in:

load_openclaw_bootstrap_files
any shared path in agent/prompt.rs that renders workspace or skill text into the system prompt
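
A minimal sanitizer along these lines might look like the sketch below. The phrase list, the truncation marker text, and the SanitizedContent name are placeholders chosen for illustration, not values taken from the existing PromptGuard rules. This version only flags override phrases; removing or escaping them would be a further step.

```rust
/// HYPOTHETICAL result type: sanitized text plus audit metadata.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SanitizedContent {
    pub text: String,
    pub truncated: bool,
    pub matched_rules: Vec<&'static str>,
}

// HYPOTHETICAL rule list; real rules would come from the guard configuration.
const OVERRIDE_PHRASES: &[&str] = &[
    "ignore previous instructions",
    "ignore all previous instructions",
    "disregard the system prompt",
];

// HYPOTHETICAL audit-friendly marker appended to truncated content.
const TRUNCATION_MARKER: &str = "\n[truncated by prompt sanitizer]";

pub fn sanitize(input: &str, max_chars: usize) -> SanitizedContent {
    // Strip control characters but keep newlines and tabs for readability.
    let cleaned: String = input
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();

    // Case-insensitive substring match against the override phrases.
    let lowered = cleaned.to_lowercase();
    let matched_rules: Vec<&'static str> = OVERRIDE_PHRASES
        .iter()
        .copied()
        .filter(|phrase| lowered.contains(*phrase))
        .collect();

    // Cap the length on a character boundary, never in the middle of a
    // multi-byte code point.
    let truncated = cleaned.chars().count() > max_chars;
    let mut text: String = cleaned.chars().take(max_chars).collect();
    if truncated {
        text.push_str(TRUNCATION_MARKER);
    }

    SanitizedContent { text, truncated, matched_rules }
}
```

Returning metadata alongside the text lets callers decide whether to drop, escape, or summarize flagged content without re-scanning it.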

Step 4: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs build_system_prompt -- --nocapture

Expected: prompt-building tests pass with the new sanitized behavior.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/prompt_sanitizer.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/src/agent/prompt.rs
git commit -m "feat: sanitize injected workspace prompt content"

Task 5: Wire PromptGuard into Main Agent and Gateway Entry Points

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/prompt_guard.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/mod.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/gateway/ws.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs

Step 1: Write failing entry-point tests

Add tests that prove:

suspicious input marks the turn as degraded instead of silently continuing
dangerous input is blocked
clean input remains unchanged

Prefer tests that assert on a security decision object instead of brittle prompt strings.

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture

Expected: fail because no entry path consumes the guard result.

Step 3: Implement guarded entry evaluation

Before each turn:

scan the inbound user content
map the guard result into GuardRisk
create an execution context carrying risk and provenance
attach audit details for later logging

Keep the existing PromptGuard regexes unless a test demands a specific adjustment.
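
The per-turn evaluation could produce a decision object like the one sketched here, which is also what the entry-point tests would assert on. SecurityDecision, ExecutionContext, and evaluate_turn are hypothetical names; the sketch takes an already-mapped GuardRisk as input and does not call the real PromptGuard API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuardRisk {
    Clean,
    Suspicious,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationProvenance {
    TrustedSkill,
    NonSkill,
    Mixed,
}

/// HYPOTHETICAL per-turn context carried into tool dispatch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExecutionContext {
    pub risk: GuardRisk,
    pub provenance: OperationProvenance,
    /// True when the turn continues with reduced privileges.
    pub degraded: bool,
    /// Names of matched guard rules, kept for later audit logging.
    pub matched_rules: Vec<String>,
}

/// HYPOTHETICAL decision object returned by the guarded entry point.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SecurityDecision {
    Proceed(ExecutionContext),
    Blocked { matched_rules: Vec<String> },
}

pub fn evaluate_turn(
    risk: GuardRisk,
    provenance: OperationProvenance,
    matched_rules: Vec<String>,
) -> SecurityDecision {
    match risk {
        // Dangerous input never reaches the agent loop.
        GuardRisk::Dangerous => SecurityDecision::Blocked { matched_rules },
        // Clean input proceeds unchanged; suspicious input proceeds degraded.
        GuardRisk::Clean | GuardRisk::Suspicious => {
            SecurityDecision::Proceed(ExecutionContext {
                risk,
                provenance,
                degraded: risk == GuardRisk::Suspicious,
                matched_rules,
            })
        }
    }
}
```

Asserting on this value instead of on prompt strings keeps the tests stable when prompt wording changes.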

Step 4: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs agent -- --nocapture

Expected: suspicious input is now marked as degraded and dangerous input is blocked, consistently on every entry path.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/security/prompt_guard.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/gateway/mod.rs third_party/zeroclaw/src/gateway/ws.rs
git commit -m "feat: enforce prompt guard at runtime entry points"

Task 6: Add Business-Level Privileged Operation Registry and Approval Tokens

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/approval/mod.rs
Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/mod.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/security/business_approval.rs

Step 1: Write failing business approval tests

Add tests that prove:

only operations in the privileged registry can request approval
approval tokens bind to session_id, operation_type, allowed_capabilities, step_budget, and expiration
a mismatched or expired approval token is rejected

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs business_approval -- --nocapture

Expected: fail because the business approval registry does not exist yet.

Step 3: Implement the registry and token model

Create:

a privileged business operation registry
a single-operation approval token
helper checks for can_request_approval and matches_execution_request

Model approval at the business-operation level, not raw tool calls.
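
As a rough illustration, the registry and token could be modeled as below. The bound fields mirror the list in Step 1; the struct names, the ExecutionRequest type, and the use of Unix seconds for expiration are assumptions made so the sketch stays deterministic and dependency-free.

```rust
use std::collections::BTreeSet;

/// HYPOTHETICAL registry of business operations that may request approval.
pub struct PrivilegedOperationRegistry {
    operations: BTreeSet<String>,
}

impl PrivilegedOperationRegistry {
    pub fn new<I, S>(operations: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self { operations: operations.into_iter().map(Into::into).collect() }
    }

    pub fn can_request_approval(&self, operation_type: &str) -> bool {
        self.operations.contains(operation_type)
    }
}

/// HYPOTHETICAL single-operation approval token.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApprovalToken {
    pub session_id: String,
    pub operation_type: String,
    pub allowed_capabilities: BTreeSet<String>,
    pub step_budget: u32,
    /// ASSUMPTION: expiration is stored as Unix seconds.
    pub expires_at_unix: u64,
}

/// HYPOTHETICAL view of one privileged step awaiting validation.
#[derive(Debug, Clone, Copy)]
pub struct ExecutionRequest<'a> {
    pub session_id: &'a str,
    pub operation_type: &'a str,
    pub capability: &'a str,
    pub steps_used: u32,
    pub now_unix: u64,
}

impl ApprovalToken {
    /// Every bound field must match; any mismatch or expiry rejects.
    pub fn matches_execution_request(&self, req: &ExecutionRequest<'_>) -> bool {
        self.session_id == req.session_id
            && self.operation_type == req.operation_type
            && self.allowed_capabilities.contains(req.capability)
            && req.steps_used < self.step_budget
            && req.now_unix < self.expires_at_unix
    }
}
```

Passing the current time in as a field, instead of reading the clock inside the check, makes the expiry tests reproducible.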

Step 4: Extend the existing approval module

Teach the approval module to carry business-level fields through the current request/response flow without breaking old call sites.

Step 5: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs business_approval -- --nocapture

Expected: the token validation and registry tests pass.

Step 6: Commit

Run:

git add third_party/zeroclaw/src/approval/mod.rs third_party/zeroclaw/src/security/mod.rs third_party/zeroclaw/src/security/business_approval.rs
git commit -m "feat: add business-level approval registry"

Task 7: Enforce Execution Modes in Tool Dispatch

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/loop_.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/dispatcher.rs

Step 1: Write failing dispatcher tests

Add tests that prove:

suspect_readonly allows only safe read capabilities
trusted_skill can execute capabilities listed in its metadata within step_budget
mixed or non-skill privileged calls require a matching business approval token

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs dispatcher -- --nocapture

Expected: fail because the dispatcher does not yet know about execution modes.

Step 3: Implement capability enforcement

Before dispatching any tool:

resolve the operation context
map the tool call to a capability class
reject calls outside the current execution mode
decrement or validate step_budget for approved bounded flows

Do not rely on prompt text for enforcement.
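
The four checks above might be combined into a single pre-dispatch function such as this sketch. The tool names, the CapabilityClass variants, and the Denied reasons are invented for illustration, and the rule that reads are allowed in every mode without consuming budget is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Clean,
    SuspectReadOnly,
    SuspectWaitingApproval,
    SuspectBusinessApproved,
}

/// HYPOTHETICAL capability classes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapabilityClass {
    Read,
    Write,
    Export,
    Shell,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    UnknownTool,
    ModeForbids(ExecutionMode),
    OutOfScope,
    BudgetExhausted,
}

/// HYPOTHETICAL resolved operation context for the current turn.
pub struct OperationContext {
    pub mode: ExecutionMode,
    /// Capabilities granted by skill metadata or an approval token.
    pub allowed: Vec<CapabilityClass>,
    pub steps_remaining: u32,
}

// HYPOTHETICAL tool names; unknown tools are rejected, not guessed.
pub fn capability_for_tool(tool: &str) -> Option<CapabilityClass> {
    match tool {
        "file_read" | "browser_read" => Some(CapabilityClass::Read),
        "file_write" => Some(CapabilityClass::Write),
        "browser_export" => Some(CapabilityClass::Export),
        "shell" => Some(CapabilityClass::Shell),
        _ => None,
    }
}

pub fn authorize(ctx: &mut OperationContext, tool: &str) -> Result<CapabilityClass, Denied> {
    let cap = capability_for_tool(tool).ok_or(Denied::UnknownTool)?;
    // ASSUMPTION: safe reads are allowed in every mode and cost no budget.
    if cap == CapabilityClass::Read {
        return Ok(cap);
    }
    match ctx.mode {
        ExecutionMode::SuspectReadOnly | ExecutionMode::SuspectWaitingApproval => {
            Err(Denied::ModeForbids(ctx.mode))
        }
        ExecutionMode::Clean | ExecutionMode::SuspectBusinessApproved => {
            if !ctx.allowed.contains(&cap) {
                return Err(Denied::OutOfScope);
            }
            if ctx.steps_remaining == 0 {
                return Err(Denied::BudgetExhausted);
            }
            // Each privileged call consumes one step of the bounded budget.
            ctx.steps_remaining -= 1;
            Ok(cap)
        }
    }
}
```

Because the decision depends only on the context and the tool name, nothing in the prompt text can widen what the dispatcher allows.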

Step 4: Re-run the tests to verify GREEN

Run:

cargo test -p zeroclawlabs dispatcher -- --nocapture

Expected: dispatch now respects read-only, trusted skill, and business-approved modes.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/agent/dispatcher.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/agent/loop_.rs
git commit -m "feat: enforce execution mode in tool dispatch"

Task 8: Default Skills Prompt Injection to Compact for Safer Runtime Behavior

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/prompt.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
Test: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/config/schema.rs

Step 1: Write the failing configuration test

Add a test that asserts the default skill prompt injection mode is Compact unless explicitly configured otherwise.

Step 2: Run the test to verify RED

Run:

cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture

Expected: fail because defaults still point to Full.

Step 3: Implement the default flip

Update config defaults and any prompt-builder defaults that currently assume Full. Keep explicit user config backward compatible.
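
One way to express the flip while keeping explicit configuration untouched is shown in this small sketch. The enum name SkillsPromptInjectionMode is inferred from the test filter used in this task, and resolve_injection_mode is a hypothetical helper; the actual type in config/schema.rs may differ.

```rust
/// ASSUMED name and variants for the skills prompt injection setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SkillsPromptInjectionMode {
    Full,
    /// New default: only compact skill summaries are injected.
    #[default]
    Compact,
}

/// HYPOTHETICAL helper: an explicit user setting always wins, and the
/// default applies only when nothing was configured.
pub fn resolve_injection_mode(
    configured: Option<SkillsPromptInjectionMode>,
) -> SkillsPromptInjectionMode {
    configured.unwrap_or_default()
}
```

Routing every call site through one resolver avoids prompt-builder code paths that still hard-code Full.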

Step 4: Re-run the test to verify GREEN

Run:

cargo test -p zeroclawlabs skills_prompt_injection_mode -- --nocapture

Expected: default configuration now resolves to Compact.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/config/schema.rs third_party/zeroclaw/src/agent/prompt.rs third_party/zeroclaw/src/channels/mod.rs
git commit -m "feat: default skills prompt injection to compact"

Task 9: Add Audit Logging and Regression Coverage

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/observability/mod.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/agent/agent.rs
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/src/channels/mod.rs
Create: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/third_party/zeroclaw/tests/prompt_safety_regression.rs

Step 1: Write the failing regression tests

Cover:

prompt override attack from user content
malicious AGENTS.md bootstrap content
trusted skill execution within bounds
non-skill privileged request requiring business approval
approval token mismatch
session history restore preserving degraded mode

Step 2: Run the tests to verify RED

Run:

cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture

Expected: fail because the end-to-end behavior is not wired together yet.

Step 3: Implement audit logging

Record:

input hash
matched guard rules
risk level
provenance
execution mode transitions
approval decisions

Avoid logging raw sensitive content.
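
An audit record that carries only a hash of the input could be sketched as follows. AuditRecord and its methods are hypothetical. The sketch uses the standard library's DefaultHasher purely to stay dependency-free; that hasher is not cryptographic and its output is not guaranteed stable across Rust releases, so a real implementation would more likely use a cryptographic hash such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// HYPOTHETICAL audit record; note that it has no field for raw input.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditRecord {
    pub input_hash: String,
    pub matched_rules: Vec<String>,
    pub risk: String,
    pub provenance: String,
    /// (from, to) pairs of execution mode transitions.
    pub mode_transitions: Vec<(String, String)>,
    pub approval_decision: Option<String>,
}

/// Placeholder hash: NOT cryptographic, used only to keep the sketch
/// free of external crates.
pub fn hash_input(raw: &str) -> String {
    let mut hasher = DefaultHasher::new();
    raw.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

impl AuditRecord {
    /// Hashes the raw input immediately so it is never stored.
    pub fn new(raw_input: &str, risk: &str, provenance: &str) -> Self {
        Self {
            input_hash: hash_input(raw_input),
            matched_rules: Vec::new(),
            risk: risk.to_string(),
            provenance: provenance.to_string(),
            mode_transitions: Vec::new(),
            approval_decision: None,
        }
    }

    /// Renders a single log line from the structured fields only.
    pub fn to_log_line(&self) -> String {
        format!(
            "input_hash={} risk={} provenance={} rules={:?} transitions={:?} approval={:?}",
            self.input_hash,
            self.risk,
            self.provenance,
            self.matched_rules,
            self.mode_transitions,
            self.approval_decision
        )
    }
}
```

Since the struct cannot hold the original text, a careless log statement has nothing sensitive to leak.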

Step 4: Re-run the regression tests to verify GREEN

Run:

cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture

Expected: the regression suite passes.

Step 5: Commit

Run:

git add third_party/zeroclaw/src/observability/mod.rs third_party/zeroclaw/src/agent/agent.rs third_party/zeroclaw/src/channels/mod.rs third_party/zeroclaw/tests/prompt_safety_regression.rs
git commit -m "test: add prompt safety regression coverage"

Task 10: Final Verification and Integration Review

Files:

Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/L5-提示词分布与安全改造方案.md
Modify: /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening/docs/README.md

Step 1: Run targeted verification

Run:

cd /home/zyl/projects/sgClaw/claw/.worktrees/zeroclaw-prompt-safety-hardening
cargo test -p zeroclawlabs prompt_guard -- --nocapture
cargo test -p zeroclawlabs build_system_prompt -- --nocapture
cargo test -p zeroclawlabs dispatcher -- --nocapture
cargo test -p zeroclawlabs --test prompt_safety_regression -- --nocapture

Expected: all prompt safety and dispatcher tests pass.

Step 2: Run a broad ZeroClaw package test pass if time permits

Run:

cargo test -p zeroclawlabs -- --nocapture

Expected: no regressions in the vendored package test suite, or a documented list of unrelated existing failures.

Step 3: Update the security design docs

Document:

execution modes
trusted skill metadata contract
business approval flow
why non-skill privileged actions are gated

Step 4: Commit the docs

Run:

git add docs/L5-提示词分布与安全改造方案.md docs/README.md
git commit -m "docs: record prompt safety hardening design"

Step 5: Prepare merge review notes

Write a short integration summary covering:

changed entry points
backward-compatibility expectations
any skills that need metadata upgrades
rollout recommendation for existing integrators
