feat: add initial skill authoring workspace

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 18:34:56 +08:00
parent a461b0734e
commit 51913555ad
30 changed files with 7114 additions and 0 deletions
--- a/skills/zhihu-hotlist/SKILL.md
+++ b/skills/zhihu-hotlist/SKILL.md
@@ -0,0 +1,113 @@
+---
+name: zhihu-hotlist
+description: Use when the user wants to collect, snapshot, summarize, or report Zhihu hot list items and related comment metrics from browser-visible page data.
+version: 0.1.0
+author: sgclaw
+tags:
+  - zhihu
+  - browser
+  - hotlist
+---
+
+# Zhihu Hotlist
+
+Collect Zhihu hot list items, optionally collect visible comment metrics from each item’s detail page, and render a compact report from the resulting snapshot. Use this skill for hotlist collection and reporting, not for article editing or general Zhihu navigation.
+
+## When to Use
+
+- The user asks to collect Zhihu hot list data.
+- The user asks for a snapshot, ranking summary, or report of current Zhihu hot list items.
+- The user wants visible comment metrics such as replies, upvotes, favorites, or heart counts from hot items.
+- The task needs a structured report from an existing or newly captured snapshot.
+
+Do not use this skill for:
+
+- arbitrary Zhihu page navigation without hotlist collection
+- writing or publishing Zhihu articles
+- claiming complete data quality when comment collection partially fails
+
+## Workflow
+
+1. Decide whether the task is a collection run, a report run, or both.
+2. For collection runs, call the packaged browser script tool `zhihu-hotlist.extract_hotlist` before any generic `getText` probing.
+3. For collection rules and guard conditions, follow [collection-flow.md](references/collection-flow.md).
+4. Inside the packaged script, prefer stable structured page state first, then broader DOM candidates, then controlled page-text fallback.
+5. Produce the `Export Artifact` immediately after the browser data is stable.
+6. If the page is blocked by login, captcha, or anti-bot state, fail explicitly instead of collapsing the issue into "no rows".
+7. Surface partial failures explicitly instead of hiding them behind a success summary.
+8. For report runs, format output using [report-format.md](references/report-format.md).
+9. Apply the caution rules in [data-quality.md](references/data-quality.md) whenever metrics are partial, missing, or inferred from fragile selectors.
+
+## SuperRPA Interface Contract
+
+- Inside the sgClaw browser host, prefer `superrpa_browser` for Zhihu page actions. `browser_action` is only the compatibility alias.
+- Always pass `expected_domain` as the bare hostname only, for example `www.zhihu.com`.
+- All selectors must be valid CSS selectors because the host executes `document.querySelector(...)`.
+- Never use XPath or jQuery-style pseudo-selectors such as `:contains(...)`.
+- Prefer canonical route navigation such as `https://www.zhihu.com/hot` before fallback click chains.
+- The primary deterministic extractor is the packaged browser script tool `zhihu-hotlist.extract_hotlist`.
+- Use generic `getText` only as a last-resort fallback when the packaged extractor fails.
+- Do not keep trying further selector variants once the packaged script has already produced the structured artifact.
+
+## Partial-Failure Rule
+
+- If hotlist items are captured but some comment-metric collections fail, report the run as partial.
+- Include how many items lacked comment metrics.
+- Do not phrase the result as fully complete when `partial_items > 0`.
+
+## Blocked-Page Rule
+
+- If Zhihu responds with a login wall, captcha, security verification page, or anti-bot interstitial, state that condition explicitly.
+- Do not misreport those states as ordinary "empty hotlist" outcomes.
+
+## Export Artifact
+
+The primary output of this skill is a structured artifact for downstream Office export. Any prose summary is secondary.
+
+Return this shape as soon as hotlist collection is complete:
+
+```json
+{
+  "source": "https://www.zhihu.com/hot",
+  "sheet_name": "知乎热榜",
+  "columns": ["rank", "title", "heat"],
+  "rows": [[1, "标题", "344万"]]
+}
+```
+
+Rules:
+
+- `sheet_name` must be exactly `知乎热榜`.
+- `columns` must remain `["rank", "title", "heat"]`.
+- `rows` must preserve the collected ranking order from the page.
+- Each row must contain exactly three values: numeric rank, title text, and heat text.
+- If fewer than the requested rows are visible, return the visible rows and mark the result as partial.
+- After the artifact is complete, stop exploratory tool use and do not resume browser wandering.
+- Do not switch to `shell`, `glob_search`, or unrelated file browsing once the hotlist rows are collected.
+
+## Output
+
+Return a concise result with:
+
+- operation type: `collect` or `report`
+- requested `top_n`
+- snapshot identifier when available
+- item count
+- whether comment metrics are complete or partial
+- any missing or weak data areas
+- the `Export Artifact` block shown above
+- an optional short prose summary only after the artifact
+
+## References
+
+- Use [collection-flow.md](references/collection-flow.md) for browser-side collection steps.
+- Use [report-format.md](references/report-format.md) for report rendering.
+- Use [data-quality.md](references/data-quality.md) before making claims about completeness.
+- Use `assets/zhihu_hotlist_flow.source.json` for exact selectors and guard text from the source flow.
+
+## Common Mistakes
+
+- Treating visible hotlist capture as equivalent to complete comment-metric capture.
+- Forgetting that report mode can use an existing snapshot instead of recollecting.
+- Ignoring weak selectors and generic button captures in comment areas.
+- Reporting zeros as if they were confirmed values when the DOM capture may be incomplete.
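The Export Artifact rules in `SKILL.md` can be checked mechanically before the artifact is handed to downstream export. The following is a minimal sketch, not part of this commit: `validateExportArtifact` is a hypothetical helper, and treating "numeric rank" as a positive integer is an assumption.

```javascript
// Hypothetical helper (not in this commit): checks an artifact against the
// Export Artifact rules and returns a list of human-readable problems.
function validateExportArtifact(artifact) {
  if (!artifact || typeof artifact !== 'object') {
    return ['artifact must be an object'];
  }
  const errors = [];
  if (artifact.sheet_name !== '知乎热榜') {
    errors.push('sheet_name must be exactly 知乎热榜');
  }
  const expectedColumns = ['rank', 'title', 'heat'];
  const columns = Array.isArray(artifact.columns) ? artifact.columns : [];
  if (columns.length !== 3 || expectedColumns.some((name, i) => columns[i] !== name)) {
    errors.push('columns must be ["rank", "title", "heat"]');
  }
  if (!Array.isArray(artifact.rows)) {
    errors.push('rows must be an array');
    return errors;
  }
  artifact.rows.forEach((row, i) => {
    if (!Array.isArray(row) || row.length !== 3) {
      errors.push(`row ${i} must contain exactly three values`);
      return;
    }
    const [rank, title, heat] = row;
    // Assumption: "numeric rank" means a positive integer.
    if (!Number.isInteger(rank) || rank < 1) {
      errors.push(`row ${i}: rank must be a positive integer`);
    }
    if (typeof title !== 'string' || !title.trim()) {
      errors.push(`row ${i}: title must be non-empty text`);
    }
    if (typeof heat !== 'string' || !heat.trim()) {
      errors.push(`row ${i}: heat must be non-empty text`);
    }
  });
  return errors;
}

const sample = {
  source: 'https://www.zhihu.com/hot',
  sheet_name: '知乎热榜',
  columns: ['rank', 'title', 'heat'],
  rows: [[1, '标题', '344万']],
};
console.log(validateExportArtifact(sample)); // prints []
```

An empty list means the shape matches. The check does not cover ranking order or whether the run was partial, which still have to be reported separately.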
--- a/skills/zhihu-hotlist/SKILL.toml
+++ b/skills/zhihu-hotlist/SKILL.toml
@@ -0,0 +1,19 @@
+[skill]
+name = "zhihu-hotlist"
+description = "Use when the user wants to collect, snapshot, summarize, or export Zhihu hot list items from the current browser page."
+version = "0.1.0"
+author = "sgclaw"
+tags = ["zhihu", "browser", "hotlist"]
+
+prompts = [
+  "For live Zhihu hotlist extraction, call zhihu-hotlist.extract_hotlist before generic browser getText probing.",
+]
+
+[[tools]]
+name = "extract_hotlist"
+description = "Primary deterministic extractor for Zhihu hotlist rows on the current page. Use this before generic browser getText probing."
+kind = "browser_script"
+command = "scripts/extract_hotlist.js"
+
+[tools.args]
+top_n = "Maximum number of hotlist rows to return."
--- a/skills/zhihu-hotlist/assets/zhihu_hotlist_flow.source.json
+++ b/skills/zhihu-hotlist/assets/zhihu_hotlist_flow.source.json
@@ -0,0 +1,20 @@
+{
+  "hotlist_url": "https://www.zhihu.com/hot",
+  "domains": {
+    "zhihu": "www.zhihu.com"
+  },
+  "literals": {
+    "hotlist_guard": "热榜"
+  },
+  "selectors": {
+    "hotlist_root": "main, body",
+    "hotlist_item": ".HotList-item, [data-hot-item], section ol li",
+    "hotlist_title_link": ".HotList-item-title a, h2 a, .ContentItem-title a",
+    "hotlist_summary": ".HotList-item-summary, .HotItem-content, .RichContent-inner, .ContentItem-excerpt",
+    "hotlist_heat": ".HotList-item-heat, .HotItem-metrics, .HotItem-hot",
+    "comment_list": ".Comments-list, .CommentListV2, [data-testid='comment-list'], .CommentList",
+    "comment_item": ".Comments-list > .CommentItem, .CommentListV2 > .CommentItem, .CommentItemV2, .CommentItem",
+    "comment_metric": ".CommentItem-metric, .CommentItem-footer button, .ContentItem-actions button, button"
+  }
+}
+
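Because the host runs `document.querySelector(...)`, every entry in this asset has to stay plain CSS. The sketch below is a rough lint under that assumption: `findSelectorProblems` is a hypothetical helper, it only catches obvious XPath and jQuery-style patterns, and it does not replace parsing the selectors in a real browser.

```javascript
// Hypothetical lint (not in this commit): flags selector strings that
// obviously violate the "valid CSS selectors only" rule.
function findSelectorProblems(selectors) {
  const problems = [];
  for (const [name, value] of Object.entries(selectors)) {
    // Each asset entry is a comma-separated selector list. A naive split is
    // enough here because none of the entries nest commas inside :is()/:not().
    const parts = String(value).split(',').map((part) => part.trim());
    for (const part of parts) {
      if (!part) {
        problems.push(`${name}: empty selector in list`);
      } else if (part.startsWith('/') || part.startsWith('(')) {
        problems.push(`${name}: looks like XPath: ${part}`);
      } else if (/:(contains|eq)\(/.test(part)) {
        problems.push(`${name}: jQuery-style pseudo-selector: ${part}`);
      }
    }
  }
  return problems;
}

// A subset of the selectors from the asset above.
const assetSelectors = {
  hotlist_item: '.HotList-item, [data-hot-item], section ol li',
  hotlist_heat: '.HotList-item-heat, .HotItem-metrics, .HotItem-hot',
  comment_list: ".Comments-list, .CommentListV2, [data-testid='comment-list'], .CommentList",
};
console.log(findSelectorProblems(assetSelectors)); // prints []
```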
--- a/skills/zhihu-hotlist/references/collection-flow.md
+++ b/skills/zhihu-hotlist/references/collection-flow.md
@@ -0,0 +1,68 @@
+# Collection Flow
+
+This skill uses the preserved source flow in `assets/zhihu_hotlist_flow.source.json`.
+
+## Source Model
+
+The source implementation does four things:
+
+1. ensure the browser is on the hotlist page
+2. capture hotlist HTML
+3. extract the top N items from the page
+4. visit each item detail page and try to collect visible comment metrics
+
+## Hotlist Page Detection
+
+- Preferred page URL: `https://www.zhihu.com/hot`
+- Domain: `www.zhihu.com`
+- Guard text: `热榜`
+
+The source flow first probes the current page for the guard text before deciding whether it must navigate.
+
+## Hotlist Extraction
+
+The source selectors look for:
+
+- hotlist root
+- hotlist item
+- title link
+- summary
+- heat text
+
+If the page HTML is empty or exposes no items, the collection should be treated as failed.
+
+## Comment Metric Collection
+
+For each hot item:
+
+1. navigate to the item detail page
+2. wait for page root
+3. scroll toward comments
+4. wait for comment list
+5. scroll comment list into view
+6. capture page HTML
+7. parse visible metrics from comment items
+
+## Parsed Metrics
+
+The source collector tries to extract:
+
+- reply count
+- upvote count
+- favorite count
+- heart count
+
+It also preserves unmatched numeric metrics as raw metric fields when possible.
+
+## Count Parsing
+
+The source parser recognizes compact counts such as:
+
+- plain integers
+- `万`
+- `亿`
+- `k`
+- `m`
+
+Compact display text is already rounded by the page, so treat counts parsed from it as approximations rather than exact totals.
+
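The count forms listed under "Count Parsing" can be illustrated with a small parser. This is a sketch under stated assumptions, not the source parser: `parseCompactCount` is a hypothetical name, and `k` and `m` are assumed to mean thousand and million (`万` is 10^4 and `亿` is 10^8).

```javascript
// Assumed multipliers for the compact units named in collection-flow.md.
const UNIT_MULTIPLIERS = { '万': 1e4, '亿': 1e8, k: 1e3, m: 1e6 };

// Hypothetical parser: turns display text such as "344万" into a number.
// Returns null when the text does not start with a number.
function parseCompactCount(text) {
  const match = String(text || '')
    .trim()
    .match(/^(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)?/);
  if (!match) {
    return null;
  }
  const value = Number(match[1]);
  const unit = match[2] ? match[2].toLowerCase() : '';
  const multiplier = unit ? UNIT_MULTIPLIERS[unit] : 1;
  // Round to absorb floating-point noise from decimal inputs such as "1.2k".
  return Math.round(value * multiplier);
}

console.log(parseCompactCount('344万')); // prints 3440000
console.log(parseCompactCount('87')); // prints 87
```

The parsed number is only as precise as the display text: `344万` could stand for any value the page rounded to that figure.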
--- a/skills/zhihu-hotlist/references/data-quality.md
+++ b/skills/zhihu-hotlist/references/data-quality.md
@@ -0,0 +1,46 @@
+# Data Quality
+
+This skill can return useful partial data, but it must not overclaim completeness.
+
+## Main Quality Risks
+
+- comment areas may not load for every hot item
+- the DOM may expose only visible comments, not the full set
+- generic selectors may match the wrong footer controls
+- compact text counts can be parsed but still reflect display approximations
+
+## Partial Success Rule
+
+The source implementation tracks partial item failures during comment collection. If some detail pages fail but the run still returns a snapshot:
+
+- report the run as partial
+- include how many items were missing comment metrics
+- keep the successful hotlist capture separate from comment-metric completeness
+
+## Snapshot Caveats
+
+The source store design keeps:
+
+- `snapshot_id`
+- capture timestamp
+- page URL
+- collector version
+- item list
+- collection stats
+
+This is enough for reproducible reporting, but it does not guarantee that every metric field was fully captured.
+
+## Recommended Caution Language
+
+Use wording like:
+
+- `热榜列表已采集，评论指标为部分完成。`
+- `报告基于最新快照生成，部分条目缺少评论指标。`
+- `数字来自页面可见指标，可能低于完整站内统计。`
+
+Avoid wording like:
+
+- `全部评论指标已准确采集`
+- `完整真实热度`
+- `无缺失`
+
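The partial success rule can be applied as a small summary step over the collected items. The sketch below assumes a hypothetical item shape in which each item carries a `metrics` array of comment-metric records. Only `partial_items` and the caution wording come from the skill documents.

```javascript
// Hypothetical summary step: counts items without comment metrics and keeps
// hotlist capture separate from comment-metric completeness.
function summarizeCompleteness(items) {
  const partialItems = items.filter(
    (item) => !Array.isArray(item.metrics) || item.metrics.length === 0,
  ).length;
  const isPartial = partialItems > 0;
  return {
    item_count: items.length,
    partial_items: partialItems,
    comment_metrics: isPartial ? 'partial' : 'complete',
    // Wording taken from "Recommended Caution Language".
    note: isPartial ? '热榜列表已采集，评论指标为部分完成。' : '',
  };
}

const collected = [
  { rank: 1, title: '标题一', metrics: [{ reply: 2 }] },
  { rank: 2, title: '标题二', metrics: [] },
];
console.log(summarizeCompleteness(collected));
```

Even the `complete` status only says that every item returned at least one metric record. It does not mean the DOM exposed the full comment set.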
--- a/skills/zhihu-hotlist/references/report-format.md
+++ b/skills/zhihu-hotlist/references/report-format.md
@@ -0,0 +1,41 @@
+# Report Format
+
+The source report mode renders a compact text report from a snapshot.
+
+## Header Line
+
+Use this structure:
+
+```text
+知乎热榜报告 <snapshot_id>: 共 <item_count> 条，采集于 <captured_at_ms>
+```
+
+## Per-Item Line
+
+Use this structure:
+
+```text
+<rank>. <title> | 热度 <heat_text> | 评论指标 <metric_count> 条 | 回复 <reply_total> | 赞同 <upvote_total> | 收藏 <favorite_total> | 红心 <heart_total>
+```
+
+## Field Semantics
+
+- `metric_count`: number of collected comment metric records for the item
+- `reply_total`: sum of reply counts across collected records
+- `upvote_total`: sum of upvote counts across collected records
+- `favorite_total`: sum of favorite counts across collected records
+- `heart_total`: sum of heart counts across collected records
+
+## Missing-Metric Handling
+
+If an item has no collected comment metrics:
+
+- keep the item in the report
+- show metric count as `0`
+- explicitly note partial data elsewhere in the result summary if the run was incomplete
+
+## Report Mode Behavior
+
+- If a specific snapshot ID is supplied, report from that snapshot.
+- Otherwise, use the latest known snapshot.
+
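The header and per-item templates above can be rendered with plain string templates. This is a minimal sketch, assuming a hypothetical snapshot shape (`snapshot_id`, `captured_at_ms`, `items`, and per-item `metrics` records with `reply`, `upvote`, `favorite`, and `heart` fields). The source store may name these differently.

```javascript
// Sums one numeric field across metric records, treating missing values as 0.
function sumField(records, field) {
  return records.reduce((total, record) => total + (Number(record[field]) || 0), 0);
}

// Hypothetical renderer following the header and per-item line structures.
function renderReport(snapshot) {
  const header = `知乎热榜报告 ${snapshot.snapshot_id}: 共 ${snapshot.items.length} 条，采集于 ${snapshot.captured_at_ms}`;
  const lines = snapshot.items.map((item) => {
    // Items without metrics stay in the report with a metric count of 0.
    const metrics = Array.isArray(item.metrics) ? item.metrics : [];
    return [
      `${item.rank}. ${item.title}`,
      `热度 ${item.heat_text}`,
      `评论指标 ${metrics.length} 条`,
      `回复 ${sumField(metrics, 'reply')}`,
      `赞同 ${sumField(metrics, 'upvote')}`,
      `收藏 ${sumField(metrics, 'favorite')}`,
      `红心 ${sumField(metrics, 'heart')}`,
    ].join(' | ');
  });
  return [header, ...lines].join('\n');
}

const demoSnapshot = {
  snapshot_id: 's1',
  captured_at_ms: 1700000000000,
  items: [
    {
      rank: 1,
      title: '标题',
      heat_text: '344万',
      metrics: [
        { reply: 2, upvote: 10, favorite: 1, heart: 0 },
        { reply: 3, upvote: 5 },
      ],
    },
    { rank: 2, title: '无指标条目', heat_text: '120万' },
  ],
};
console.log(renderReport(demoSnapshot));
```

A `0` in a rendered line only means nothing was collected for that field, so the partial-data note still has to appear in the result summary.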
--- a/skills/zhihu-hotlist/scripts/extract_hotlist.js
+++ b/skills/zhihu-hotlist/scripts/extract_hotlist.js
@@ -0,0 +1,262 @@
+const limit = Math.max(1, Math.floor(Number(args.top_n)) || 10); // non-numeric top_n falls back to 10 instead of NaN
+
+function cleanText(value) {
+  return String(value || '')
+    .replace(/\s+/g, ' ')
+    .replace(/\u200b/g, '')
+    .trim();
+}
+
+function pickText(root, selectors) {
+  for (const selector of selectors) {
+    const node = root.querySelector(selector);
+    const text = cleanText(node && node.textContent);
+    if (text) {
+      return text;
+    }
+  }
+  return '';
+}
+
+function inferHeat(text) {
+  const compact = cleanText(text);
+  const match = compact.match(/(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?/);
+  if (match) {
+    return `${match[1]}${match[2]}`.replace('K', 'k').replace('M', 'm');
+  }
+  const plain = compact.match(/(\d+(?:\.\d+)?)(?:热度)?/);
+  return plain ? plain[1] : '';
+}
+
+function extractHeatToken(text) {
+  const compact = cleanText(text);
+  const match = compact.match(/(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
+  if (match) {
+    return `${match[1]}${match[2]}`.replace('K', 'k').replace('M', 'm');
+  }
+  return '';
+}
+
+function inferRank(item, index) {
+  const direct = pickText(item, [
+    '.HotList-item-index',
+    '.HotItem-index',
+    '[data-rank]',
+    '.RankingIndex',
+  ]);
+  const directNumber = Number.parseInt(direct, 10);
+  if (Number.isFinite(directNumber) && directNumber > 0) {
+    return directNumber;
+  }
+
+  const text = cleanText(item.textContent);
+  const leading = text.match(/^(\d{1,2})\b/);
+  if (leading) {
+    return Number.parseInt(leading[1], 10);
+  }
+
+  return index + 1;
+}
+
+function collectRows() {
+  const candidates = collectDomCandidates();
+  const seenTitles = new Set();
+  const rows = [];
+
+  for (const item of candidates) {
+    const title = pickText(item, [
+      '.HotList-item-title',
+      '.HotList-item-title a',
+      '.HotItem-content a',
+      'h2 a',
+      'h2',
+      'a[href*="/question/"]',
+    ]);
+    if (!title || seenTitles.has(title)) {
+      continue;
+    }
+
+    let heat = pickText(item, [
+      '.HotList-item-metrics',
+      '.HotList-item-heat',
+      '.HotItem-metrics',
+      '.HotItem-hot',
+      '[data-heat]',
+    ]);
+    if (!heat) {
+      heat = inferHeat(item.textContent);
+    }
+    if (!heat) {
+      continue;
+    }
+
+    seenTitles.add(title);
+    rows.push([
+      inferRank(item, rows.length),
+      title,
+      heat,
+    ]);
+
+    if (rows.length >= limit) {
+      break;
+    }
+  }
+
+  return rows;
+}
+
+function collectDomCandidates() {
+  const selectors = [
+    '.HotList-item',
+    '.HotItem',
+    '.HotList-list > *',
+    '[data-hot-item]',
+    'section ol li',
+    'main li',
+    'main article',
+    'main [class*="Hot"]',
+  ];
+  const seen = new Set();
+  const candidates = [];
+  selectors.forEach((selector) => {
+    const nodes = Array.from(document.querySelectorAll(selector));
+    nodes.forEach((node) => {
+      if (seen.has(node)) {
+        return;
+      }
+      seen.add(node);
+      candidates.push(node);
+    });
+  });
+  return candidates;
+}
+
+function collectTextSources() {
+  const selectors = ['.HotList-list', '.HotList', '#root', 'main', 'body'];
+  const sources = [];
+  const seen = new Set();
+  selectors.forEach((selector) => {
+    const node = document.querySelector(selector);
+    const rawText = String(node && (node.innerText || node.textContent || '') || '');
+    const dedupeKey = cleanText(rawText);
+    if (!dedupeKey || seen.has(dedupeKey)) {
+      return;
+    }
+    seen.add(dedupeKey);
+    sources.push(rawText);
+  });
+  return sources.sort((left, right) => right.length - left.length);
+}
+
+function looksLikeBlockedPage(text) {
+  return /安全验证|异常访问|请完成验证|登录后继续|登录即可查看|验证码|访问受限/.test(text);
+}
+
+function shouldIgnoreTextLine(line) {
+  if (!line) {
+    return true;
+  }
+  if (line === '知乎热榜' || line === '首页 - 知乎' || line === '首页-知乎') {
+    return true;
+  }
+  if (line.startsWith('/ ') || line.startsWith('当前页面 ·') ||
+      line.startsWith('继续输入任务')) {
+    return true;
+  }
+  return false;
+}
+
+function collectRowsFromText() {
+  const sources = collectTextSources();
+  for (const source of sources) {
+    if (!source) {
+      continue;
+    }
+    if (looksLikeBlockedPage(source)) {
+      throw new Error('知乎页面当前需要登录或完成安全验证，无法读取热榜条目');
+    }
+
+    const rows = parseRowsFromText(source);
+    if (rows.length) {
+      return rows.slice(0, limit);
+    }
+  }
+  return [];
+}
+
+function parseRowsFromText(text) {
+  const lines = String(text || '')
+    .split(/\n+/)
+    .map(cleanText)
+    .filter((line) => !!line && !shouldIgnoreTextLine(line));
+  const seenTitles = new Set();
+  const rows = [];
+  let pendingRank = null;
+  let titleParts = [];
+
+  function pushRow(title, heat) {
+    const normalizedTitle = cleanText(title);
+    if (!normalizedTitle || !heat || seenTitles.has(normalizedTitle)) {
+      return;
+    }
+    seenTitles.add(normalizedTitle);
+    rows.push([
+      pendingRank || rows.length + 1,
+      normalizedTitle,
+      heat,
+    ]);
+    pendingRank = null;
+    titleParts = [];
+  }
+
+  for (const rawLine of lines) {
+    let line = rawLine;
+
+    const rankOnly = line.match(/^(\d{1,2})$/);
+    if (rankOnly && !titleParts.length) {
+      pendingRank = Number(rankOnly[1]);
+      continue;
+    }
+
+    const rankedLine = line.match(/^(\d{1,2})[.、\s]+(.+)$/);
+    if (rankedLine) {
+      pendingRank = Number(rankedLine[1]);
+      line = cleanText(rankedLine[2]);
+    }
+
+    const inlineMatch = line.match(/^(.*?)(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
+    if (inlineMatch && cleanText(inlineMatch[1])) {
+      pushRow(cleanText(inlineMatch[1]), `${inlineMatch[2]}${inlineMatch[3]}`.replace('K', 'k').replace('M', 'm'));
+      if (rows.length >= limit) {
+        break;
+      }
+      continue;
+    }
+
+    const heatOnly = extractHeatToken(line);
+    if (heatOnly && titleParts.length) {
+      pushRow(titleParts.join(' '), heatOnly);
+      if (rows.length >= limit) {
+        break;
+      }
+      continue;
+    }
+
+    titleParts.push(line);
+  }
+
+  return rows;
+}
+
+const domRows = collectRows();
+const rows = domRows.length ? domRows : collectRowsFromText();
+if (!rows.length) {
+  throw new Error('未能从页面 DOM 或页面文本中提取到知乎热榜条目');
+}
+
+return {
+  source: `${location.origin}${location.pathname}`,
+  sheet_name: '知乎热榜',
+  columns: ['rank', 'title', 'heat'],
+  rows,
+};
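The inline title-plus-heat pattern used by `parseRowsFromText` is easier to follow on a single line of input. The sketch below reuses the same regular expression in a standalone hypothetical helper, `splitTitleAndHeat`, which is not part of the script.

```javascript
// Hypothetical standalone helper mirroring the inline pattern in
// parseRowsFromText: a title followed by a compact heat token at line end.
function splitTitleAndHeat(line) {
  const match = String(line || '')
    .trim()
    .match(/^(.*?)(\d+(?:\.\d+)?)\s*(万|亿|k|K|m|M)(?:热度)?$/);
  if (!match) {
    return null;
  }
  const title = match[1].replace(/\s+/g, ' ').trim();
  if (!title) {
    // A bare heat token such as "344万" is not a title-plus-heat line.
    return null;
  }
  const heat = `${match[2]}${match[3]}`.replace('K', 'k').replace('M', 'm');
  return [title, heat];
}

console.log(splitTitleAndHeat('某个标题 344万热度')); // prints [ '某个标题', '344万' ]
console.log(splitTitleAndHeat('没有热度的行')); // prints null
```

The lazy `(.*?)` group lets digits inside the title (for example a year) stay in the title, because a number only counts as heat when a unit and the end of the line follow it.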