# Collection Flow

This skill uses the preserved source flow in `assets/zhihu_hotlist_flow.source.json`.

## Source Model

The source implementation does four things:

1. ensure the browser is on the hotlist page
2. capture hotlist HTML
3. extract the top N items from the page
4. visit each item detail page and try to collect visible comment metrics
## Hotlist Page Detection

- Preferred page URL: `https://www.zhihu.com/hot`
- Domain: `www.zhihu.com`
- Guard text: `热榜`

The source flow first probes the current page for the guard text before deciding whether it must navigate.
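
The probe described above can be sketched as a small pure function. This is a minimal illustration, not the source flow's code: the function name `needs_navigation` and the choice to require both the domain and the guard text are assumptions.

```python
from urllib.parse import urlparse

HOTLIST_URL = "https://www.zhihu.com/hot"
HOTLIST_DOMAIN = "www.zhihu.com"
GUARD_TEXT = "热榜"


def needs_navigation(current_url: str, page_text: str) -> bool:
    """Return True when the browser should be sent to HOTLIST_URL.

    Assumption: the page counts as the hotlist only if it is on the
    expected domain AND its visible text contains the guard text.
    """
    on_domain = urlparse(current_url).hostname == HOTLIST_DOMAIN
    has_guard = GUARD_TEXT in page_text
    return not (on_domain and has_guard)
```

A caller would navigate to `HOTLIST_URL` only when this returns `True`, which avoids reloading a page that is already showing the hotlist.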
## Hotlist Extraction

The source selectors look for:

- hotlist root
- hotlist item
- title link
- summary
- heat text

If the page HTML is empty or exposes no items, the collection should be treated as failed.
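
The failure rule above can be illustrated with a standard-library parser. This is a sketch under stated assumptions: the class names (`hot-item`, `hot-title`, `hot-summary`, `hot-heat`) are placeholders, since the real selectors live in the source flow JSON, and for brevity the sketch does not scope the search to the hotlist root.

```python
from __future__ import annotations

from html.parser import HTMLParser

# Placeholder class names (assumptions); the real selectors are in the source flow.
ITEM_CLASS = "hot-item"
FIELD_CLASSES = {"hot-title": "title", "hot-summary": "summary", "hot-heat": "heat"}


class HotlistParser(HTMLParser):
    """Collects title, URL, summary, and heat text for each hotlist item."""

    def __init__(self) -> None:
        super().__init__()
        self.items: list[dict[str, str]] = []
        self._field: str | None = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if ITEM_CLASS in classes:
            # A new item starts; later fields attach to it.
            self.items.append({"title": "", "url": "", "summary": "", "heat": ""})
            self._field = None
            return
        if not self.items:
            return
        for cls in classes:
            if cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]
                if self._field == "title":
                    self.items[-1]["url"] = attr.get("href") or ""

    def handle_endtag(self, tag):
        # Simplification: field elements are assumed to contain only text.
        self._field = None

    def handle_data(self, data):
        if self._field and self.items:
            self.items[-1][self._field] += data.strip()


def extract_hotlist(html: str, top_n: int) -> list[dict[str, str]]:
    """Return the top N items, or raise ValueError if collection failed."""
    if not html.strip():
        raise ValueError("hotlist HTML is empty")
    parser = HotlistParser()
    parser.feed(html)
    items = [item for item in parser.items if item["title"]][:top_n]
    if not items:
        raise ValueError("hotlist page exposes no items")
    return items
```

Raising an exception for both failure cases keeps an empty result from being mistaken for a valid but quiet hotlist.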
## Comment Metric Collection

For each hot item:

1. navigate to the item detail page
2. wait for page root
3. scroll toward comments
4. wait for comment list
5. scroll comment list into view
6. capture page HTML
7. parse visible metrics from comment items
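
Steps 1 to 6 above can be sketched against an abstract browser driver. Everything here is hypothetical: the `Browser` protocol is not the API of any real automation library, the selectors `#root` and `.comment-list` are placeholders, and returning `None` when a wait fails is an assumed way of expressing that the collector only tries to collect metrics. Step 7 (parsing) is left to the caller.

```python
from __future__ import annotations

from typing import Protocol


class Browser(Protocol):
    """Hypothetical driver interface used only for this sketch."""

    def goto(self, url: str) -> None: ...
    def wait_for(self, selector: str) -> bool: ...
    def scroll_down(self) -> None: ...
    def scroll_into_view(self, selector: str) -> None: ...
    def html(self) -> str: ...


# Placeholder selectors (assumptions); the real ones are in the source flow.
PAGE_ROOT = "#root"
COMMENT_LIST = ".comment-list"


def capture_comment_html(browser: Browser, item_url: str) -> str | None:
    """Run steps 1-6 and return the page HTML, or None if a wait fails."""
    browser.goto(item_url)                    # 1. navigate to the detail page
    if not browser.wait_for(PAGE_ROOT):       # 2. wait for page root
        return None
    browser.scroll_down()                     # 3. scroll toward comments
    if not browser.wait_for(COMMENT_LIST):    # 4. wait for comment list
        return None
    browser.scroll_into_view(COMMENT_LIST)    # 5. scroll comment list into view
    return browser.html()                     # 6. capture page HTML
```

Keeping the driver behind a small interface lets the step order be tested with a fake browser, without opening a real page.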
## Parsed Metrics

The source collector tries to extract:

- reply count
- upvote count
- favorite count
- heart count

It also preserves unmatched numeric metrics as raw metric fields when possible.
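
One way to picture the split between named metrics and raw metrics is the sketch below. It is illustrative only: the label keywords (`回复`, `赞同`, `收藏`, `喜欢`), the field names, and the `raw_metrics` key are assumptions, not values taken from the source collector. Values are kept as display text; converting them to integers is a separate step.

```python
import re

# Assumed label keywords and field names; the source collector may differ.
LABEL_TO_FIELD = {
    "回复": "reply_count",
    "赞同": "upvote_count",
    "收藏": "favorite_count",
    "喜欢": "heart_count",
}

# A number with an optional compact suffix, e.g. "3", "1.2万", "3.5k".
NUMBER = re.compile(r"\d+(?:\.\d+)?(?:\s*[万亿kKmM])?")


def classify_metrics(visible_texts):
    """Map labelled numeric texts to fields; keep unlabelled ones as raw."""
    metrics = {"raw_metrics": []}
    for text in visible_texts:
        match = NUMBER.search(text)
        if match is None:
            continue  # no number, so not a metric
        field = next(
            (name for label, name in LABEL_TO_FIELD.items() if label in text),
            None,
        )
        if field is None:
            metrics["raw_metrics"].append(text)  # preserve unmatched metric
        else:
            metrics[field] = match.group()
    return metrics
```

For example, `classify_metrics(["赞同 1.2万", "浏览 99"])` would put `1.2万` under `upvote_count` and keep `浏览 99` in `raw_metrics`, so a numeric value with an unrecognized label is not silently dropped.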
## Count Parsing

The source parser recognizes compact counts such as:

- plain integers
- `万`
- `亿`
- `k`
- `m`

Compact display text is rounded (for example, `1.2万` can stand for any value near 12,000), so treat parsed counts as approximations when summarizing them.
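
A minimal sketch of compact-count parsing follows. It is not the source parser: the function name `parse_count`, the comma stripping, and the multipliers for `k` (1,000) and `m` (1,000,000) are assumptions based on common usage, while `万` is 10,000 and `亿` is 100,000,000.

```python
from __future__ import annotations

import re
from decimal import Decimal

# Multipliers per suffix; "k" and "m" values are assumed conventions.
MULTIPLIERS = {
    "": 1,
    "k": 1_000,
    "m": 1_000_000,
    "万": 10_000,
    "亿": 100_000_000,
}

COUNT = re.compile(r"(\d+(?:\.\d+)?)\s*([kKmM万亿]?)")


def parse_count(text: str) -> int | None:
    """Parse the first compact count in text, or return None if none is found."""
    match = COUNT.search(text.replace(",", ""))
    if match is None:
        return None
    number, suffix = match.groups()
    # Decimal avoids binary floating-point error; int() truncates any fraction.
    return int(Decimal(number) * MULTIPLIERS[suffix.lower()])
```

Under these assumptions, `parse_count("1.2万")` gives `12000` and `parse_count("3.5k")` gives `3500`. The result only reflects the precision of the display text, so the original value may differ by up to one unit of the last displayed digit.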