feat: add initial skill authoring workspace

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
木炎
2026-04-02 18:34:56 +08:00
parent a461b0734e
commit 51913555ad
30 changed files with 7114 additions and 0 deletions

@@ -0,0 +1,113 @@
---
name: zhihu-hotlist
description: Use when the user wants to collect, snapshot, summarize, or report Zhihu hot list items and related comment metrics from browser-visible page data.
version: 0.1.0
author: sgclaw
tags:
- zhihu
- browser
- hotlist
---
# Zhihu Hotlist
Collect Zhihu hot list items, optionally collect visible comment metrics from each item's detail page, and render a compact report from the resulting snapshot. Use this skill for hotlist collection and reporting, not for article editing or general Zhihu navigation.
## When to Use
- The user asks to collect Zhihu hot list data.
- The user asks for a snapshot, ranking summary, or report of current Zhihu hot list items.
- The user wants visible comment metrics such as replies, upvotes, favorites, or heart counts from hot items.
- The task needs a structured report from an existing or newly captured snapshot.
Do not use this skill for:
- arbitrary Zhihu page navigation without hotlist collection
- writing or publishing Zhihu articles
- claiming complete data quality when comment collection partially fails
## Workflow
1. Decide whether the task is a collection run, a report run, or both.
2. For collection runs, call the packaged browser script tool `zhihu-hotlist.extract_hotlist` before any generic `getText` probing.
3. For collection rules and guard conditions, follow [collection-flow.md](references/collection-flow.md).
4. Inside the packaged script, prefer stable structured page state first, then broader DOM candidates, then controlled page-text fallback.
5. Produce the `Export Artifact` immediately after the browser data is stable.
6. If the page is blocked by login, captcha, or anti-bot state, fail explicitly instead of collapsing the issue into "no rows".
7. Surface partial failures explicitly instead of hiding them behind a success summary.
8. For report runs, format output using [report-format.md](references/report-format.md).
9. Apply the caution rules in [data-quality.md](references/data-quality.md) whenever metrics are partial, missing, or inferred from fragile selectors.
## SuperRPA Interface Contract
- Inside the sgClaw browser host, prefer `superrpa_browser` for Zhihu page actions. `browser_action` is only the compatibility alias.
- Always pass `expected_domain` as the bare hostname only, for example `www.zhihu.com`.
- All selectors must be valid CSS selectors because the host executes `document.querySelector(...)`.
- Never use XPath or jQuery-style pseudo-selectors such as `:contains(...)`.
- Prefer canonical route navigation such as `https://www.zhihu.com/hot` before fallback click chains.
- The primary deterministic extractor is the packaged browser script tool `zhihu-hotlist.extract_hotlist`.
- Use generic `getText` only as a last-resort fallback when the packaged extractor fails.
- Do not keep thrashing through selector variants once the packaged script has already produced the structured artifact.
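The `expected_domain` rule can be checked before a browser call is issued. The following is a minimal sketch, assuming a hypothetical helper named `toExpectedDomain` that is not part of the skill or the host API; it relies only on the standard `URL` class.

```javascript
// Hypothetical helper (not part of the sgClaw host API): reduce a URL or
// host-like string to the bare hostname expected by `expected_domain`.
function toExpectedDomain(input) {
  const value = String(input || '').trim();
  if (!value) {
    throw new Error('expected_domain input is empty');
  }
  // Add a scheme when it is missing so the URL parser accepts bare hosts.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(value);
  const parsed = new URL(hasScheme ? value : `https://${value}`);
  // `hostname` drops the scheme, port, path, and query.
  return parsed.hostname;
}

console.log(toExpectedDomain('https://www.zhihu.com/hot')); // www.zhihu.com
console.log(toExpectedDomain('www.zhihu.com'));             // www.zhihu.com
```

Normalizing in one place keeps a full URL from being passed where the host expects a bare hostname.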
## Partial-Failure Rule
- If hotlist items are captured but some comment-metric collections fail, report the run as partial.
- Include how many items lacked comment metrics.
- Do not phrase the result as fully complete when `partial_items > 0`.
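As an illustration of the rule above, a run summary can derive its status directly from the `partial_items` counter. This is a sketch only: `summarizeRun` and the `items_missing_comment_metrics` field are hypothetical names, not part of the source implementation.

```javascript
// Hypothetical summary builder. `item_count` and `partial_items` mirror the
// counters named in this skill; the returned field names are assumptions.
function summarizeRun({ item_count, partial_items }) {
  const missing = Number(partial_items) || 0;
  return {
    // A run is complete only when no item lacked comment metrics.
    status: missing > 0 ? 'partial' : 'complete',
    item_count,
    items_missing_comment_metrics: missing,
  };
}

console.log(summarizeRun({ item_count: 10, partial_items: 3 }));
```

Deriving the status from the counter, rather than setting it by hand, keeps the wording of the result from drifting away from the data.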
## Blocked-Page Rule
- If Zhihu responds with a login wall, captcha, security verification page, or anti-bot interstitial, state that condition explicitly.
- Do not misreport those states as ordinary "empty hotlist" outcomes.
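One way to keep blocked pages distinct from empty results is to classify the outcome before reporting. The sketch below reuses the guard phrases from `looksLikeBlockedPage` in the packaged extractor; the `classifyOutcome` function and its three labels are assumptions made for illustration.

```javascript
// Guard phrases copied from the packaged extractor's blocked-page check.
const BLOCKED_PATTERN = /安全验证|异常访问|请完成验证|登录后继续|登录即可查看|验证码|访问受限/;

// Hypothetical classifier: returns 'ok', 'blocked', or 'empty'.
function classifyOutcome(pageText, rowCount) {
  if (rowCount > 0) {
    return 'ok';
  }
  // Only inspect the page text when no rows were extracted, so a hot item
  // whose title mentions a guard phrase is not treated as a block.
  if (BLOCKED_PATTERN.test(String(pageText || ''))) {
    return 'blocked';
  }
  return 'empty';
}

console.log(classifyOutcome('请完成验证后继续访问', 0)); // blocked
console.log(classifyOutcome('知乎热榜', 0));             // empty
```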
## Export Artifact
The primary output of this skill is a structured artifact for downstream Office export; any prose summary is secondary.
Return this shape as soon as hotlist collection is complete:
```json
{
"source": "https://www.zhihu.com/hot",
"sheet_name": "知乎热榜",
"columns": ["rank", "title", "heat"],
"rows": [[1, "标题", "344万"]]
}
```
Rules:
- `sheet_name` must be exactly `知乎热榜`.
- `columns` must remain `["rank", "title", "heat"]`.
- `rows` must preserve the collected ranking order from the page.
- Each row must contain exactly three values: numeric rank, title text, and heat text.
- If fewer than the requested rows are visible, return the visible rows and mark the result as partial.
- After the artifact is complete, stop exploratory tool use and do not resume browser wandering.
- Do not switch to `shell`, `glob_search`, or unrelated file browsing once the hotlist rows are collected.
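The shape rules above can be checked mechanically before the artifact is handed downstream. The following is a minimal sketch; `validateArtifact` is a hypothetical helper, and the error strings are illustrative rather than taken from the source.

```javascript
// Hypothetical validator for the Export Artifact shape described above.
// Returns a list of problems; an empty list means the artifact passes.
function validateArtifact(artifact) {
  if (!artifact || typeof artifact !== 'object') {
    return ['artifact must be an object'];
  }
  const errors = [];
  if (artifact.sheet_name !== '知乎热榜') {
    errors.push('sheet_name must be exactly 知乎热榜');
  }
  const expectedColumns = JSON.stringify(['rank', 'title', 'heat']);
  if (JSON.stringify(artifact.columns) !== expectedColumns) {
    errors.push('columns must be ["rank", "title", "heat"]');
  }
  if (!Array.isArray(artifact.rows)) {
    errors.push('rows must be an array');
    return errors;
  }
  artifact.rows.forEach((row, index) => {
    if (!Array.isArray(row) || row.length !== 3) {
      errors.push(`row ${index}: expected exactly three values`);
      return;
    }
    const [rank, title, heat] = row;
    if (!Number.isInteger(rank) || rank < 1) {
      errors.push(`row ${index}: rank must be a positive integer`);
    }
    if (typeof title !== 'string' || !title) {
      errors.push(`row ${index}: title must be non-empty text`);
    }
    if (typeof heat !== 'string' || !heat) {
      errors.push(`row ${index}: heat must be non-empty text`);
    }
  });
  return errors;
}

const sample = {
  source: 'https://www.zhihu.com/hot',
  sheet_name: '知乎热榜',
  columns: ['rank', 'title', 'heat'],
  rows: [[1, '标题', '344万']],
};
console.log(validateArtifact(sample)); // []
```

This check covers shape only. It does not verify that `rows` preserves the page's ranking order, which has to be guaranteed at collection time.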
## Output
Return a concise result with:
- operation type: `collect` or `report`
- requested `top_n`
- snapshot identifier when available
- item count
- whether comment metrics are complete or partial
- any missing or weak data areas
- the `Export Artifact` block shown above
- an optional short prose summary only after the artifact
## References
- Use [collection-flow.md](references/collection-flow.md) for browser-side collection steps.
- Use [report-format.md](references/report-format.md) for report rendering.
- Use [data-quality.md](references/data-quality.md) before making claims about completeness.
- Use `assets/zhihu_hotlist_flow.source.json` for exact selectors and guard text from the source flow.
## Common Mistakes
- Treating visible hotlist capture as equivalent to complete comment-metric capture.
- Forgetting that report mode can use an existing snapshot instead of recollecting.
- Ignoring weak selectors and generic button captures in comment areas.
- Reporting zeros as if they were confirmed values when the DOM capture may be incomplete.

@@ -0,0 +1,19 @@
[skill]
name = "zhihu-hotlist"
description = "Use when the user wants to collect, snapshot, summarize, or export Zhihu hot list items from the current browser page."
version = "0.1.0"
author = "sgclaw"
tags = ["zhihu", "browser", "hotlist"]
prompts = [
"For live Zhihu hotlist extraction, call zhihu-hotlist.extract_hotlist before generic browser getText probing.",
]
[[tools]]
name = "extract_hotlist"
description = "Primary deterministic extractor for Zhihu hotlist rows on the current page. Use this before generic browser getText probing."
kind = "browser_script"
command = "scripts/extract_hotlist.js"
[tools.args]
top_n = "Maximum number of hotlist rows to return."

@@ -0,0 +1,20 @@
{
"hotlist_url": "https://www.zhihu.com/hot",
"domains": {
"zhihu": "www.zhihu.com"
},
"literals": {
"hotlist_guard": "热榜"
},
"selectors": {
"hotlist_root": "main, body",
"hotlist_item": ".HotList-item, [data-hot-item], section ol li",
"hotlist_title_link": ".HotList-item-title a, h2 a, .ContentItem-title a",
"hotlist_summary": ".HotList-item-summary, .HotItem-content, .RichContent-inner, .ContentItem-excerpt",
"hotlist_heat": ".HotList-item-heat, .HotItem-metrics, .HotItem-hot",
"comment_list": ".Comments-list, .CommentListV2, [data-testid='comment-list'], .CommentList",
"comment_item": ".Comments-list > .CommentItem, .CommentListV2 > .CommentItem, .CommentItemV2, .CommentItem",
"comment_metric": ".CommentItem-metric, .CommentItem-footer button, .ContentItem-actions button, button"
}
}

@@ -0,0 +1,68 @@
# Collection Flow
This skill uses the preserved source flow in `assets/zhihu_hotlist_flow.source.json`.
## Source Model
The source implementation does four things:
1. ensure the browser is on the hotlist page
2. capture hotlist HTML
3. extract the top N items from the page
4. visit each item detail page and try to collect visible comment metrics
## Hotlist Page Detection
- Preferred page URL: `https://www.zhihu.com/hot`
- Domain: `www.zhihu.com`
- Guard text: `热榜`
The source flow first probes the current page for the guard text before deciding whether it must navigate.
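The navigation decision can be sketched as a small predicate. This is an illustration, not the source flow itself: `needsNavigation` is a hypothetical name, and checking the hostname in addition to the guard text is an assumption added here for safety.

```javascript
const HOTLIST_DOMAIN = 'www.zhihu.com';
const HOTLIST_GUARD = '热榜';

// Hypothetical predicate: true when the browser should navigate to the
// hotlist URL, false when the current page already looks like the hotlist.
function needsNavigation(currentUrl, visibleText) {
  let hostname = '';
  try {
    hostname = new URL(currentUrl).hostname;
  } catch (error) {
    // An unparsable URL (for example a blank tab) always requires navigation.
    return true;
  }
  if (hostname !== HOTLIST_DOMAIN) {
    return true;
  }
  // Stay on the page only when the guard text is visible.
  return !String(visibleText || '').includes(HOTLIST_GUARD);
}

console.log(needsNavigation('https://www.zhihu.com/hot', '知乎热榜 1 标题')); // false
console.log(needsNavigation('https://www.zhihu.com/', '首页'));               // true
```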
## Hotlist Extraction
The source selectors look for:
- hotlist root
- hotlist item
- title link
- summary
- heat text
If the page HTML is empty or exposes no items, the collection should be treated as failed.
## Comment Metric Collection
For each hot item:
1. navigate to the item detail page
2. wait for page root
3. scroll toward comments
4. wait for comment list
5. scroll comment list into view
6. capture page HTML
7. parse visible metrics from comment items
## Parsed Metrics
The source collector tries to extract:
- reply count
- upvote count
- favorite count
- heart count
It also preserves unmatched numeric metrics as raw metric fields when possible.
## Count Parsing
The source parser recognizes compact counts such as:
- plain integers
- `万`
- `亿`
- `k`
- `m`
Compact display text is rounded by the page (for example, `344万` stands for roughly 3.44 million), so treat parsed counts as approximations when summarizing.
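A minimal sketch of compact-count parsing follows. It is not the source parser: `parseCompactCount` is a hypothetical name, and reading `k` as thousand and `m` as million is an assumption, since the source only lists the suffixes. `万` is 10^4 and `亿` is 10^8.

```javascript
// Multipliers per suffix. The values for `k` and `m` are assumed.
const UNIT_MULTIPLIERS = { '万': 1e4, '亿': 1e8, k: 1e3, m: 1e6 };

// Hypothetical parser: returns an integer, or null when no number leads the text.
function parseCompactCount(text) {
  const compact = String(text || '').replace(/,/g, '').trim();
  const match = compact.match(/^(\d+(?:\.\d+)?)\s*(万|亿|k|m)?/i);
  if (!match) {
    return null;
  }
  const unit = match[2] ? match[2].toLowerCase() : '';
  const multiplier = unit ? UNIT_MULTIPLIERS[unit] : 1;
  // Round because display values such as 1.2k are already approximations.
  return Math.round(Number(match[1]) * multiplier);
}

console.log(parseCompactCount('344万')); // 3440000
console.log(parseCompactCount('1.2k'));  // 1200
console.log(parseCompactCount('87'));    // 87
```

The result is a lower-precision reconstruction of the displayed value, not the exact site-side count.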

@@ -0,0 +1,46 @@
# Data Quality
This skill can return useful partial data, but it must not overclaim completeness.
## Main Quality Risks
- comment areas may not load for every hot item
- the DOM may expose only visible comments, not the full set
- generic selectors may match the wrong footer controls
- compact text counts can be parsed but still reflect display approximations
## Partial Success Rule
The source implementation tracks partial item failures during comment collection. If some detail pages fail but the run still returns a snapshot:
- report the run as partial
- include how many items were missing comment metrics
- keep the successful hotlist capture separate from comment-metric completeness
## Snapshot Caveats
The source store design keeps:
- `snapshot_id`
- capture timestamp
- page URL
- collector version
- item list
- collection stats
This is enough for reproducible reporting, but it does not guarantee that every metric field was fully captured.
## Recommended Caution Language
Use wording like:
- `热榜列表已采集,评论指标为部分完成。`
- `报告基于最新快照生成,部分条目缺少评论指标。`
- `数字来自页面可见指标,可能低于完整站内统计。`
Avoid wording like:
- `全部评论指标已准确采集`
- `完整真实热度`
- `无缺失`

@@ -0,0 +1,41 @@
# Report Format
The source report mode renders a compact text report from a snapshot.
## Header Line
Use this structure:
```text
知乎热榜报告 <snapshot_id>: 共 <item_count> 条,采集于 <captured_at_ms>
```
## Per-Item Line
Use this structure:
```text
<rank>. <title> | 热度 <heat_text> | 评论指标 <metric_count> 条 | 回复 <reply_total> | 赞同 <upvote_total> | 收藏 <favorite_total> | 红心 <heart_total>
```
## Field Semantics
- `metric_count`: number of collected comment metric records for the item
- `reply_total`: sum of reply counts across collected records
- `upvote_total`: sum of upvote counts across collected records
- `favorite_total`: sum of favorite counts across collected records
- `heart_total`: sum of heart counts across collected records
## Missing-Metric Handling
If an item has no collected comment metrics:
- keep the item in the report
- show metric count as `0`
- explicitly note partial data elsewhere in the result summary if the run was incomplete
## Report Mode Behavior
- If a specific snapshot ID is supplied, report from that snapshot.
- Otherwise, use the latest known snapshot.
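The header line, per-item line, and field semantics described in this file can be combined as in the sketch below. The snapshot field names used here (`items`, `heat_text`, `comment_metrics`, `reply`, `upvote`, `favorite`, `heart`) are assumptions for illustration; the source store may name them differently.

```javascript
// Sum one numeric field across an item's collected metric records.
function sumField(records, field) {
  return records.reduce((total, record) => total + (Number(record[field]) || 0), 0);
}

// Hypothetical renderer for the compact text report.
function renderReport(snapshot) {
  const header = `知乎热榜报告 ${snapshot.snapshot_id}: 共 ${snapshot.items.length} 条,采集于 ${snapshot.captured_at_ms}`;
  const lines = snapshot.items.map((item) => {
    // Items without collected metrics stay in the report with zero totals.
    const metrics = item.comment_metrics || [];
    return [
      `${item.rank}. ${item.title}`,
      `热度 ${item.heat_text}`,
      `评论指标 ${metrics.length} 条`,
      `回复 ${sumField(metrics, 'reply')}`,
      `赞同 ${sumField(metrics, 'upvote')}`,
      `收藏 ${sumField(metrics, 'favorite')}`,
      `红心 ${sumField(metrics, 'heart')}`,
    ].join(' | ');
  });
  return [header, ...lines].join('\n');
}

const exampleSnapshot = {
  snapshot_id: 'snap-1',
  captured_at_ms: 1700000000000,
  items: [
    {
      rank: 1,
      title: '标题',
      heat_text: '344万',
      comment_metrics: [
        { reply: 2, upvote: 10, favorite: 1, heart: 0 },
        { reply: 3, upvote: 5, favorite: 0, heart: 4 },
      ],
    },
    { rank: 2, title: '另一个标题', heat_text: '120万' },
  ],
};
console.log(renderReport(exampleSnapshot));
```

A zero total in this output means only that no value was collected. Whether the run was partial still has to be stated in the result summary.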

@@ -0,0 +1,262 @@
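// Clamp top_n to a positive integer; a non-numeric value falls back to 10 instead of producing NaN.
const requestedLimit = Number(args.top_n || 10);
const limit = Number.isFinite(requestedLimit) ? Math.max(1, Math.floor(requestedLimit)) : 10;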
function cleanText(value) {
return String(value || '')
.replace(/\s+/g, ' ')
.replace(/\u200b/g, '')
.trim();
}
function pickText(root, selectors) {
for (const selector of selectors) {
const node = root.querySelector(selector);
const text = cleanText(node && node.textContent);
if (text) {
return text;
}
}
return '';
}
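// Infer a heat value from free text: prefer a number with a compact unit,
// then a plain number explicitly followed by 热度.
function inferHeat(text) {
const compact = cleanText(text);
const match = compact.match(/(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?/);
if (match) {
return `${match[1]}${match[2]}`.replace('K', 'k').replace('M', 'm');
}
// Require the 热度 suffix so a leading rank or a number in the title is not mistaken for heat.
const plain = compact.match(/(\d+(?:\.\d+)?)\s*热度/);
return plain ? plain[1] : '';
}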
function extractHeatToken(text) {
const compact = cleanText(text);
const match = compact.match(/(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
if (match) {
return `${match[1]}${match[2]}`.replace('K', 'k').replace('M', 'm');
}
return '';
}
function inferRank(item, index) {
const direct = pickText(item, [
'.HotList-item-index',
'.HotItem-index',
'[data-rank]',
'.RankingIndex',
]);
const directNumber = Number.parseInt(direct, 10);
if (Number.isFinite(directNumber) && directNumber > 0) {
return directNumber;
}
const text = cleanText(item.textContent);
const leading = text.match(/^(\d{1,2})\b/);
if (leading) {
return Number.parseInt(leading[1], 10);
}
return index + 1;
}
function collectRows() {
const candidates = collectDomCandidates();
const seenTitles = new Set();
const rows = [];
for (const item of candidates) {
const title = pickText(item, [
'.HotList-item-title',
'.HotList-item-title a',
'.HotItem-content a',
'h2 a',
'h2',
'a[href*="/question/"]',
]);
if (!title || seenTitles.has(title)) {
continue;
}
let heat = pickText(item, [
'.HotList-item-metrics',
'.HotList-item-heat',
'.HotItem-metrics',
'.HotItem-hot',
'[data-heat]',
]);
if (!heat) {
heat = inferHeat(item.textContent);
}
if (!heat) {
continue;
}
seenTitles.add(title);
rows.push([
inferRank(item, rows.length),
title,
heat,
]);
if (rows.length >= limit) {
break;
}
}
return rows;
}
function collectDomCandidates() {
const selectors = [
'.HotList-item',
'.HotItem',
'.HotList-list > *',
'[data-hot-item]',
'section ol li',
'main li',
'main article',
'main [class*="Hot"]',
];
const seen = new Set();
const candidates = [];
selectors.forEach((selector) => {
const nodes = Array.from(document.querySelectorAll(selector));
nodes.forEach((node) => {
if (seen.has(node)) {
return;
}
seen.add(node);
candidates.push(node);
});
});
return candidates;
}
function collectTextSources() {
const selectors = ['.HotList-list', '.HotList', '#root', 'main', 'body'];
const sources = [];
const seen = new Set();
selectors.forEach((selector) => {
const node = document.querySelector(selector);
const rawText = String(node && (node.innerText || node.textContent || '') || '');
const dedupeKey = cleanText(rawText);
if (!dedupeKey || seen.has(dedupeKey)) {
return;
}
seen.add(dedupeKey);
sources.push(rawText);
});
return sources.sort((left, right) => right.length - left.length);
}
function looksLikeBlockedPage(text) {
return /安全验证|异常访问|请完成验证|登录后继续|登录即可查看|验证码|访问受限/.test(text);
}
function shouldIgnoreTextLine(line) {
if (!line) {
return true;
}
if (line === '知乎热榜' || line === '首页 - 知乎' || line === '首页-知乎') {
return true;
}
if (line.startsWith('/ ') || line.startsWith('当前页面 ·') ||
line.startsWith('继续输入任务')) {
return true;
}
return false;
}
function collectRowsFromText() {
const sources = collectTextSources();
for (const source of sources) {
if (!source) {
continue;
}
if (looksLikeBlockedPage(source)) {
throw new Error('知乎页面当前需要登录或完成安全验证,无法读取热榜条目');
}
const rows = parseRowsFromText(source);
if (rows.length) {
return rows.slice(0, limit);
}
}
return [];
}
function parseRowsFromText(text) {
const lines = String(text || '')
.split(/\n+/)
.map(cleanText)
.filter((line) => !!line && !shouldIgnoreTextLine(line));
const seenTitles = new Set();
const rows = [];
let pendingRank = null;
let titleParts = [];
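function pushRow(title, heat) {
const normalizedTitle = cleanText(title);
// Accept the row only when it has a title, a heat value, and is not a duplicate.
if (normalizedTitle && heat && !seenTitles.has(normalizedTitle)) {
seenTitles.add(normalizedTitle);
rows.push([
pendingRank || rows.length + 1,
normalizedTitle,
heat,
]);
}
// Always reset parser state so a rejected row cannot leak into the next title.
pendingRank = null;
titleParts = [];
}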
for (const rawLine of lines) {
let line = rawLine;
const rankOnly = line.match(/^(\d{1,2})$/);
if (rankOnly && !titleParts.length) {
pendingRank = Number(rankOnly[1]);
continue;
}
const rankedLine = line.match(/^(\d{1,2})[.、\s]+(.+)$/);
if (rankedLine) {
pendingRank = Number(rankedLine[1]);
line = cleanText(rankedLine[2]);
}
const inlineMatch = line.match(/^(.*?)(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
if (inlineMatch && cleanText(inlineMatch[1])) {
pushRow(cleanText(inlineMatch[1]), `${inlineMatch[2]}${inlineMatch[3]}`.replace('K', 'k').replace('M', 'm'));
if (rows.length >= limit) {
break;
}
continue;
}
const heatOnly = extractHeatToken(line);
if (heatOnly && titleParts.length) {
pushRow(titleParts.join(' '), heatOnly);
if (rows.length >= limit) {
break;
}
continue;
}
titleParts.push(line);
}
return rows;
}
const domRows = collectRows();
const rows = domRows.length ? domRows : collectRowsFromText();
if (!rows.length) {
throw new Error('未能从页面 DOM 中提取到知乎热榜条目');
}
return {
source: `${location.origin}${location.pathname}`,
sheet_name: '知乎热榜',
columns: ['rank', 'title', 'heat'],
rows,
};