feat: add initial skill authoring workspace
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
113
skills/zhihu-hotlist/SKILL.md
Normal file
@@ -0,0 +1,113 @@
---
name: zhihu-hotlist
description: Use when the user wants to collect, snapshot, summarize, or report Zhihu hot list items and related comment metrics from browser-visible page data.
version: 0.1.0
author: sgclaw
tags:
  - zhihu
  - browser
  - hotlist
---

# Zhihu Hotlist

Collect Zhihu hot list items, optionally collect visible comment metrics from each item’s detail page, and render a compact report from the resulting snapshot. Use this skill for hotlist collection and reporting, not for article editing or general Zhihu navigation.

## When to Use

- The user asks to collect Zhihu hot list data.
- The user asks for a snapshot, ranking summary, or report of current Zhihu hot list items.
- The user wants visible comment metrics such as replies, upvotes, favorites, or heart counts from hot items.
- The task needs a structured report from an existing or newly captured snapshot.

Do not use this skill for:

- arbitrary Zhihu page navigation without hotlist collection
- writing or publishing Zhihu articles
- claiming complete data quality when comment collection partially fails

## Workflow

1. Decide whether the task is a collection run, a report run, or both.
2. For collection runs, call the packaged browser script tool `zhihu-hotlist.extract_hotlist` before any generic `getText` probing.
3. For collection rules and guard conditions, follow [collection-flow.md](references/collection-flow.md).
4. Inside the packaged script, prefer stable structured page state first, then broader DOM candidates, then controlled page-text fallback.
5. Produce the `Export Artifact` immediately after the browser data is stable.
6. If the page is blocked by login, captcha, or anti-bot state, fail explicitly instead of collapsing the issue into "no rows".
7. Surface partial failures explicitly instead of hiding them behind a success summary.
8. For report runs, format output using [report-format.md](references/report-format.md).
9. Apply the caution rules in [data-quality.md](references/data-quality.md) whenever metrics are partial, missing, or inferred from fragile selectors.

## SuperRPA Interface Contract

- Inside the sgClaw browser host, prefer `superrpa_browser` for Zhihu page actions. `browser_action` is only the compatibility alias.
- Always pass `expected_domain` as the bare hostname only, for example `www.zhihu.com`.
- All selectors must be valid CSS selectors because the host executes `document.querySelector(...)`.
- Never use XPath or jQuery-style pseudo-selectors such as `:contains(...)`.
- Prefer canonical route navigation such as `https://www.zhihu.com/hot` before fallback click chains.
- The primary deterministic extractor is the packaged browser script tool `zhihu-hotlist.extract_hotlist`.
- Use generic `getText` only as a last-resort fallback when the packaged extractor fails.
- Do not keep thrashing through selector variants once the packaged script has already produced the structured artifact.
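
The hostname and selector rules in this contract can be checked before a call reaches the host. The sketch below is illustrative only: `isBareHostname` and `looksLikeCssSelector` are hypothetical helper names, not part of `superrpa_browser`, and the selector check is a heuristic that rejects obvious XPath and jQuery-only forms rather than a full CSS parser.

```javascript
// Hypothetical pre-flight checks; these helpers are not part of the host API.

// True only for a bare hostname such as "www.zhihu.com":
// no scheme, no path, no port, no whitespace.
function isBareHostname(value) {
  return /^[a-z0-9-]+(\.[a-z0-9-]+)+$/i.test(String(value || ''));
}

// Heuristic: reject XPath-style strings and jQuery-only pseudo-selectors.
// Passing this check does not prove the selector is valid CSS.
function looksLikeCssSelector(selector) {
  const text = String(selector || '').trim();
  if (!text) {
    return false;
  }
  if (text.startsWith('/') || text.startsWith('(')) {
    return false; // XPath expressions start this way; CSS selectors do not
  }
  if (/:contains\(|:eq\(|:visible\b/.test(text)) {
    return false; // jQuery extensions that document.querySelector rejects
  }
  return true;
}
```

A caller could run both checks and fail fast with a clear message instead of letting `document.querySelector(...)` throw inside the page.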

## Partial-Failure Rule

- If hotlist items are captured but some comment-metric collections fail, report the run as partial.
- Include how many items lacked comment metrics.
- Do not phrase the result as fully complete when `partial_items > 0`.
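
As a rough illustration of this rule, the status can be derived directly from the two counts. `summarizeRun` and the shape of its return value are assumptions made for this sketch; only the `partial_items` name comes from the rule itself.

```javascript
// Hypothetical helper: classify a run from its item counts.
function summarizeRun(itemCount, partialItems) {
  const partial = partialItems > 0;
  return {
    status: partial ? 'partial' : 'complete',
    item_count: itemCount,
    partial_items: partialItems,
    // The note always states how many items lacked comment metrics.
    note: partial
      ? `${partialItems} of ${itemCount} items lack comment metrics`
      : 'comment metrics collected for all items',
  };
}
```

Deriving the wording from the count keeps a success summary from being written while `partial_items` is still nonzero.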

## Blocked-Page Rule

- If Zhihu responds with a login wall, captcha, security verification page, or anti-bot interstitial, state that condition explicitly.
- Do not misreport those states as ordinary "empty hotlist" outcomes.

## Export Artifact

The primary output of this skill is a structured artifact for downstream Office export. Any prose summary is secondary.

Return this shape as soon as hotlist collection is complete:

```json
{
  "source": "https://www.zhihu.com/hot",
  "sheet_name": "知乎热榜",
  "columns": ["rank", "title", "heat"],
  "rows": [[1, "标题", "344万"]]
}
```

Rules:

- `sheet_name` must be exactly `知乎热榜`.
- `columns` must remain `["rank", "title", "heat"]`.
- `rows` must preserve the collected ranking order from the page.
- Each row must contain exactly three values: numeric rank, title text, and heat text.
- If fewer rows than requested are visible, return the visible rows and mark the result as partial.
- After the artifact is complete, stop exploratory tool use and do not resume browser wandering.
- Do not switch to `shell`, `glob_search`, or unrelated file browsing once the hotlist rows are collected.
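
The shape rules above lend themselves to a mechanical check before the artifact is handed downstream. The following is a minimal sketch, assuming a hypothetical `validateArtifact` helper and assuming that "numeric rank" means an integer; neither is defined by the skill itself.

```javascript
// Hypothetical validator for the Export Artifact shape described above.
// Returns a list of problems; an empty list means the artifact passes.
function validateArtifact(artifact) {
  if (!artifact || typeof artifact !== 'object') {
    return ['artifact must be an object'];
  }
  const problems = [];
  if (artifact.sheet_name !== '知乎热榜') {
    problems.push('sheet_name must be exactly 知乎热榜');
  }
  const expected = ['rank', 'title', 'heat'];
  const columns = Array.isArray(artifact.columns) ? artifact.columns : [];
  if (columns.length !== 3 || columns.some((name, i) => name !== expected[i])) {
    problems.push('columns must be ["rank", "title", "heat"]');
  }
  if (!Array.isArray(artifact.rows)) {
    problems.push('rows must be an array');
    return problems;
  }
  artifact.rows.forEach((row, i) => {
    const ok = Array.isArray(row) && row.length === 3 &&
      Number.isInteger(row[0]) && // assumption: rank is an integer
      typeof row[1] === 'string' && typeof row[2] === 'string';
    if (!ok) {
      problems.push(`row ${i} must be [numeric rank, title text, heat text]`);
    }
  });
  return problems;
}
```

The check covers shape only. It cannot tell whether `rows` preserves the page's ranking order or whether the result should be marked partial.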

## Output

Return a concise result with:

- operation type: `collect` or `report`
- requested `top_n`
- snapshot identifier when available
- item count
- whether comment metrics are complete or partial
- any missing or weak data areas
- the `Export Artifact` block shown above
- an optional short prose summary only after the artifact

## References

- Use [collection-flow.md](references/collection-flow.md) for browser-side collection steps.
- Use [report-format.md](references/report-format.md) for report rendering.
- Use [data-quality.md](references/data-quality.md) before making claims about completeness.
- Use `assets/zhihu_hotlist_flow.source.json` for exact selectors and guard text from the source flow.

## Common Mistakes

- Treating visible hotlist capture as equivalent to complete comment-metric capture.
- Forgetting that report mode can use an existing snapshot instead of recollecting.
- Ignoring weak selectors and generic button captures in comment areas.
- Reporting zeros as if they were confirmed values when the DOM capture may be incomplete.
19
skills/zhihu-hotlist/SKILL.toml
Normal file
@@ -0,0 +1,19 @@
[skill]
name = "zhihu-hotlist"
description = "Use when the user wants to collect, snapshot, summarize, or export Zhihu hot list items from the current browser page."
version = "0.1.0"
author = "sgclaw"
tags = ["zhihu", "browser", "hotlist"]

prompts = [
  "For live Zhihu hotlist extraction, call zhihu-hotlist.extract_hotlist before generic browser getText probing.",
]

[[tools]]
name = "extract_hotlist"
description = "Primary deterministic extractor for Zhihu hotlist rows on the current page. Use this before generic browser getText probing."
kind = "browser_script"
command = "scripts/extract_hotlist.js"

[tools.args]
top_n = "Maximum number of hotlist rows to return."
|
||||
20
skills/zhihu-hotlist/assets/zhihu_hotlist_flow.source.json
Normal file
@@ -0,0 +1,20 @@
{
  "hotlist_url": "https://www.zhihu.com/hot",
  "domains": {
    "zhihu": "www.zhihu.com"
  },
  "literals": {
    "hotlist_guard": "热榜"
  },
  "selectors": {
    "hotlist_root": "main, body",
    "hotlist_item": ".HotList-item, [data-hot-item], section ol li",
    "hotlist_title_link": ".HotList-item-title a, h2 a, .ContentItem-title a",
    "hotlist_summary": ".HotList-item-summary, .HotItem-content, .RichContent-inner, .ContentItem-excerpt",
    "hotlist_heat": ".HotList-item-heat, .HotItem-metrics, .HotItem-hot",
    "comment_list": ".Comments-list, .CommentListV2, [data-testid='comment-list'], .CommentList",
    "comment_item": ".Comments-list > .CommentItem, .CommentListV2 > .CommentItem, .CommentItemV2, .CommentItem",
    "comment_metric": ".CommentItem-metric, .CommentItem-footer button, .ContentItem-actions button, button"
  }
}
68
skills/zhihu-hotlist/references/collection-flow.md
Normal file
@@ -0,0 +1,68 @@
# Collection Flow

This skill uses the preserved source flow in `assets/zhihu_hotlist_flow.source.json`.

## Source Model

The source implementation does four things:

1. ensure the browser is on the hotlist page
2. capture hotlist HTML
3. extract the top N items from the page
4. visit each item detail page and try to collect visible comment metrics

## Hotlist Page Detection

- Preferred page URL: `https://www.zhihu.com/hot`
- Domain: `www.zhihu.com`
- Guard text: `热榜`

The source flow first probes the current page for the guard text before deciding whether it must navigate.
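
The probe-then-navigate decision can be sketched as a small pure function. This is an illustration under stated assumptions, not the source flow's code: `needsNavigation` is a hypothetical name, `currentUrl` and `pageText` are assumed to be supplied by the browser host, and the hostname comparison is an extra safeguard added here on top of the guard-text probe.

```javascript
// Hypothetical guard check; the constants mirror the values listed above.
const HOTLIST_DOMAIN = 'www.zhihu.com';
const HOTLIST_GUARD = '热榜';

// Decide whether the flow must navigate to the hotlist before extracting.
function needsNavigation(currentUrl, pageText) {
  let hostname = '';
  try {
    hostname = new URL(currentUrl).hostname;
  } catch (error) {
    return true; // unparseable URL: navigate to the canonical route
  }
  if (hostname !== HOTLIST_DOMAIN) {
    return true; // assumption: a foreign domain never counts as the hotlist
  }
  // On the right domain, the guard text decides.
  return !String(pageText || '').includes(HOTLIST_GUARD);
}
```

When the function returns `true`, the flow would open `https://www.zhihu.com/hot` directly instead of clicking through from the current page.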

## Hotlist Extraction

The source selectors look for:

- hotlist root
- hotlist item
- title link
- summary
- heat text

If the page HTML is empty or exposes no items, the collection should be treated as failed.

## Comment Metric Collection

For each hot item:

1. navigate to the item detail page
2. wait for page root
3. scroll toward comments
4. wait for comment list
5. scroll comment list into view
6. capture page HTML
7. parse visible metrics from comment items

## Parsed Metrics

The source collector tries to extract:

- reply count
- upvote count
- favorite count
- heart count

It also preserves unmatched numeric metrics as raw metric fields when possible.

## Count Parsing

The source parser recognizes compact counts such as:

- plain integers
- `万`
- `亿`
- `k`
- `m`

Compact display text is rounded, so treat parsed counts as approximations when summarizing them.
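
A minimal sketch of compact-count parsing follows. It is not the source parser: `parseCompactCount` is a hypothetical name, and the multipliers are the conventional meanings of each suffix (`万` = 10,000, `亿` = 100,000,000, `k` = 1,000, `m` = 1,000,000), which the source flow does not spell out.

```javascript
// Hypothetical parser; the multipliers are conventional values,
// not values confirmed by the source collector.
const MULTIPLIERS = { '万': 1e4, '亿': 1e8, k: 1e3, m: 1e6 };

// Returns an integer count, or null when the text has no recognizable number.
function parseCompactCount(text) {
  const match = String(text || '')
    .replace(/,/g, '') // drop thousands separators
    .match(/(\d+(?:\.\d+)?)\s*(万|亿|k|m)?/i);
  if (!match) {
    return null;
  }
  const suffix = (match[2] || '').toLowerCase();
  const multiplier = suffix ? MULTIPLIERS[suffix] : 1;
  // Round to absorb floating-point noise from decimal inputs such as "1.2m".
  return Math.round(Number(match[1]) * multiplier);
}
```

A value such as `344万` parses to 3,440,000, but the page has already rounded it, so the parsed integer carries no more precision than the display text did.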
46
skills/zhihu-hotlist/references/data-quality.md
Normal file
@@ -0,0 +1,46 @@
# Data Quality

This skill can return useful partial data, but it must not overclaim completeness.

## Main Quality Risks

- comment areas may not load for every hot item
- the DOM may expose only visible comments, not the full set
- generic selectors may match the wrong footer controls
- compact text counts can be parsed but still reflect display approximations

## Partial Success Rule

The source implementation tracks partial item failures during comment collection. If some detail pages fail but the run still returns a snapshot:

- report the run as partial
- include how many items were missing comment metrics
- keep the successful hotlist capture separate from comment-metric completeness

## Snapshot Caveats

The source store design keeps:

- `snapshot_id`
- capture timestamp
- page URL
- collector version
- item list
- collection stats

This is enough for reproducible reporting, but it does not guarantee that every metric field was fully captured.

## Recommended Caution Language

Use wording like:

- `热榜列表已采集,评论指标为部分完成。` (The hotlist was collected; comment metrics are partially complete.)
- `报告基于最新快照生成,部分条目缺少评论指标。` (The report is based on the latest snapshot; some items lack comment metrics.)
- `数字来自页面可见指标,可能低于完整站内统计。` (Figures come from metrics visible on the page and may be lower than full site statistics.)

Avoid wording like:

- `全部评论指标已准确采集` (All comment metrics were collected accurately)
- `完整真实热度` (Complete, true heat)
- `无缺失` (Nothing missing)
41
skills/zhihu-hotlist/references/report-format.md
Normal file
@@ -0,0 +1,41 @@
# Report Format

The source report mode renders a compact text report from a snapshot.

## Header Line

Use this structure:

```text
知乎热榜报告 <snapshot_id>: 共 <item_count> 条,采集于 <captured_at_ms>
```

## Per-Item Line

Use this structure:

```text
<rank>. <title> | 热度 <heat_text> | 评论指标 <metric_count> 条 | 回复 <reply_total> | 赞同 <upvote_total> | 收藏 <favorite_total> | 红心 <heart_total>
```

## Field Semantics

- `metric_count`: number of collected comment metric records for the item
- `reply_total`: sum of reply counts across collected records
- `upvote_total`: sum of upvote counts across collected records
- `favorite_total`: sum of favorite counts across collected records
- `heart_total`: sum of heart counts across collected records
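
To make the field semantics concrete, the sketch below renders one per-item line from a list of metric records. The item shape (`rank`, `title`, `heat_text`, `metrics`) and the record keys (`reply`, `upvote`, `favorite`, `heart`) are assumptions chosen for illustration; the source store may name them differently.

```javascript
// Hypothetical item shape: { rank, title, heat_text, metrics: [...] }, where
// each metric record may carry reply, upvote, favorite, and heart counts.

// Sum one field across records, counting missing or non-numeric values as 0.
function sumField(records, field) {
  return records.reduce((total, record) => total + (Number(record[field]) || 0), 0);
}

// Render the per-item line in the structure shown above.
function formatItemLine(item) {
  const metrics = Array.isArray(item.metrics) ? item.metrics : [];
  return [
    `${item.rank}. ${item.title}`,
    `热度 ${item.heat_text}`,
    `评论指标 ${metrics.length} 条`,
    `回复 ${sumField(metrics, 'reply')}`,
    `赞同 ${sumField(metrics, 'upvote')}`,
    `收藏 ${sumField(metrics, 'favorite')}`,
    `红心 ${sumField(metrics, 'heart')}`,
  ].join(' | ');
}
```

An item with no collected records still renders, with a metric count of `0` and zero totals. Those zeros mean "nothing collected", not "confirmed zero", so the surrounding summary has to say whether the run was partial.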

## Missing-Metric Handling

If an item has no collected comment metrics:

- keep the item in the report
- show metric count as `0`
- explicitly note partial data elsewhere in the result summary if the run was incomplete

## Report Mode Behavior

- If a specific snapshot ID is supplied, report from that snapshot.
- Otherwise, use the latest known snapshot.
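
The selection rule can be sketched as follows. `selectSnapshot` is a hypothetical helper, and it assumes each stored snapshot exposes `snapshot_id` and `captured_at_ms`, with "latest" taken to mean the largest capture timestamp.

```javascript
// Hypothetical snapshot store entries: { snapshot_id, captured_at_ms, ... }.
function selectSnapshot(snapshots, snapshotId) {
  if (snapshotId) {
    // An explicit ID must match exactly; a miss returns null rather than
    // silently falling back to a different snapshot.
    return snapshots.find((s) => s.snapshot_id === snapshotId) || null;
  }
  // No ID supplied: pick the most recently captured snapshot.
  return snapshots.reduce(
    (latest, s) => (!latest || s.captured_at_ms > latest.captured_at_ms ? s : latest),
    null,
  );
}
```

Returning `null` for an unknown ID is a design choice in this sketch: it lets the caller report the missing snapshot instead of rendering a report from the wrong one.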
262
skills/zhihu-hotlist/scripts/extract_hotlist.js
Normal file
@@ -0,0 +1,262 @@
// Requested row limit. Fall back to 10 when top_n is missing or not numeric,
// and never go below one row.
const requested = Number(args.top_n || 10);
const limit = Number.isFinite(requested) ? Math.max(1, Math.floor(requested)) : 10;

// Collapse whitespace and strip zero-width spaces.
function cleanText(value) {
  return String(value || '')
    .replace(/\s+/g, ' ')
    .replace(/\u200b/g, '')
    .trim();
}

// Return the first non-empty text found under root for the given selectors.
function pickText(root, selectors) {
  for (const selector of selectors) {
    const node = root.querySelector(selector);
    const text = cleanText(node && node.textContent);
    if (text) {
      return text;
    }
  }
  return '';
}

// Find a heat value anywhere in the text, preferring compact forms such as "344万".
function inferHeat(text) {
  const compact = cleanText(text);
  const match = compact.match(/(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?/);
  if (match) {
    return `${match[1]}${match[2]}`.replace('K', 'k').replace('M', 'm');
  }
  const plain = compact.match(/(\d+(?:\.\d+)?)(?:热度)?/);
  return plain ? plain[1] : '';
}

// Like inferHeat, but only accepts a compact heat value at the end of the text.
function extractHeatToken(text) {
  const compact = cleanText(text);
  const match = compact.match(/(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
  if (match) {
    return `${match[1]}${match[2]}`.replace('K', 'k').replace('M', 'm');
  }
  return '';
}

// Prefer an explicit rank node, then a leading number, then the row position.
function inferRank(item, index) {
  const direct = pickText(item, [
    '.HotList-item-index',
    '.HotItem-index',
    '[data-rank]',
    '.RankingIndex',
  ]);
  const directNumber = Number.parseInt(direct, 10);
  if (Number.isFinite(directNumber) && directNumber > 0) {
    return directNumber;
  }

  const text = cleanText(item.textContent);
  const leading = text.match(/^(\d{1,2})\b/);
  if (leading) {
    return Number.parseInt(leading[1], 10);
  }

  return index + 1;
}

// Build [rank, title, heat] rows from DOM candidates, skipping duplicate titles
// and candidates without a heat value.
function collectRows() {
  const candidates = collectDomCandidates();
  const seenTitles = new Set();
  const rows = [];

  for (const item of candidates) {
    const title = pickText(item, [
      '.HotList-item-title',
      '.HotList-item-title a',
      '.HotItem-content a',
      'h2 a',
      'h2',
      'a[href*="/question/"]',
    ]);
    if (!title || seenTitles.has(title)) {
      continue;
    }

    let heat = pickText(item, [
      '.HotList-item-metrics',
      '.HotList-item-heat',
      '.HotItem-metrics',
      '.HotItem-hot',
      '[data-heat]',
    ]);
    if (!heat) {
      heat = inferHeat(item.textContent);
    }
    if (!heat) {
      continue;
    }

    seenTitles.add(title);
    rows.push([
      inferRank(item, rows.length),
      title,
      heat,
    ]);

    if (rows.length >= limit) {
      break;
    }
  }

  return rows;
}

// Gather candidate item nodes, most specific selectors first, without duplicates.
function collectDomCandidates() {
  const selectors = [
    '.HotList-item',
    '.HotItem',
    '.HotList-list > *',
    '[data-hot-item]',
    'section ol li',
    'main li',
    'main article',
    'main [class*="Hot"]',
  ];
  const seen = new Set();
  const candidates = [];
  selectors.forEach((selector) => {
    const nodes = Array.from(document.querySelectorAll(selector));
    nodes.forEach((node) => {
      if (seen.has(node)) {
        return;
      }
      seen.add(node);
      candidates.push(node);
    });
  });
  return candidates;
}

// Collect distinct page-text sources for the fallback parser, longest first.
function collectTextSources() {
  const selectors = ['.HotList-list', '.HotList', '#root', 'main', 'body'];
  const sources = [];
  const seen = new Set();
  selectors.forEach((selector) => {
    const node = document.querySelector(selector);
    const rawText = String(node && (node.innerText || node.textContent || '') || '');
    const dedupeKey = cleanText(rawText);
    if (!dedupeKey || seen.has(dedupeKey)) {
      return;
    }
    seen.add(dedupeKey);
    sources.push(rawText);
  });
  return sources.sort((left, right) => right.length - left.length);
}

// Detect login walls, captchas, and security-verification pages by their visible text.
function looksLikeBlockedPage(text) {
  return /安全验证|异常访问|请完成验证|登录后继续|登录即可查看|验证码|访问受限/.test(text);
}

// Skip empty lines, page titles, and host UI lines that are not hotlist content.
function shouldIgnoreTextLine(line) {
  if (!line) {
    return true;
  }
  if (line === '知乎热榜' || line === '首页 - 知乎' || line === '首页-知乎') {
    return true;
  }
  if (line.startsWith('/ ') || line.startsWith('当前页面 ·') ||
      line.startsWith('继续输入任务')) {
    return true;
  }
  return false;
}

// Text fallback: parse rows from page text, failing explicitly on blocked pages.
function collectRowsFromText() {
  const sources = collectTextSources();
  for (const source of sources) {
    if (!source) {
      continue;
    }
    if (looksLikeBlockedPage(source)) {
      // "The Zhihu page requires login or security verification; hotlist items cannot be read."
      throw new Error('知乎页面当前需要登录或完成安全验证,无法读取热榜条目');
    }

    const rows = parseRowsFromText(source);
    if (rows.length) {
      return rows.slice(0, limit);
    }
  }
  return [];
}

// Parse "rank / title / heat" rows from line-oriented page text.
function parseRowsFromText(text) {
  const lines = String(text || '')
    .split(/\n+/)
    .map(cleanText)
    .filter((line) => !!line && !shouldIgnoreTextLine(line));
  const seenTitles = new Set();
  const rows = [];
  let pendingRank = null;
  let titleParts = [];

  function pushRow(title, heat) {
    const normalizedTitle = cleanText(title);
    const rank = pendingRank || rows.length + 1;
    // Reset the pending state even when the row is rejected, so a duplicate
    // title cannot leak into the next row's title.
    pendingRank = null;
    titleParts = [];
    if (!normalizedTitle || !heat || seenTitles.has(normalizedTitle)) {
      return;
    }
    seenTitles.add(normalizedTitle);
    rows.push([
      rank,
      normalizedTitle,
      heat,
    ]);
  }

  for (const rawLine of lines) {
    let line = rawLine;

    // A line holding only a rank number starts a new row.
    const rankOnly = line.match(/^(\d{1,2})$/);
    if (rankOnly && !titleParts.length) {
      pendingRank = Number(rankOnly[1]);
      continue;
    }

    // A line such as "3. Title" carries the rank inline.
    const rankedLine = line.match(/^(\d{1,2})[.、\s]+(.+)$/);
    if (rankedLine) {
      pendingRank = Number(rankedLine[1]);
      line = cleanText(rankedLine[2]);
    }

    // Title and heat on the same line.
    const inlineMatch = line.match(/^(.*?)(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
    if (inlineMatch && cleanText(inlineMatch[1])) {
      pushRow(cleanText(inlineMatch[1]), `${inlineMatch[2]}${inlineMatch[3]}`.replace('K', 'k').replace('M', 'm'));
      if (rows.length >= limit) {
        break;
      }
      continue;
    }

    // A heat-only line closes the title collected so far.
    const heatOnly = extractHeatToken(line);
    if (heatOnly && titleParts.length) {
      pushRow(titleParts.join(' '), heatOnly);
      if (rows.length >= limit) {
        break;
      }
      continue;
    }

    titleParts.push(line);
  }

  return rows;
}

// Try the DOM first, then fall back to page text.
const domRows = collectRows();
const rows = domRows.length ? domRows : collectRowsFromText();
if (!rows.length) {
  // "Could not extract Zhihu hotlist items from the page DOM."
  throw new Error('未能从页面 DOM 中提取到知乎热榜条目');
}

return {
  source: `${location.origin}${location.pathname}`,
  sheet_name: '知乎热榜',
  columns: ['rank', 'title', 'heat'],
  rows,
};