acceptance: stabilize zhihu hotlist excel flow
@@ -1,8 +1,8 @@
 # Zhihu Hotlist Excel Acceptance
 
-- Date: 2026-03-29 20:53:59 +0800
+- Date: 2026-03-29 23:15:47 +0800
 - Mode: real provider + live Zhihu hotlist API + simulated browser pipe
-- Workspace: `/tmp/sgclaw-live-acceptance-5o87r76f`
+- Workspace: `/tmp/sgclaw-live-acceptance-icg8dnwq`
 - Final success: `True`
 - Total score: `100/100`
 
@@ -16,68 +16,67 @@
 
 ## Final Output
 
-- exported_path: `/tmp/sgclaw-live-acceptance-5o87r76f/.sgclaw-zeroclaw-workspace/out/zhihu-hotlist-1774788827307024821.xlsx`
-- final_summary: `完美!我已成功完成知乎热榜数据的收集和导出。根据导出完成合同,我必须包含生成的本地.xlsx文件路径。
+- exported_path: `/tmp/sgclaw-live-acceptance-icg8dnwq/.sgclaw-zeroclaw-workspace/zhihu_hotlist_20260329_2314.xlsx`
+- final_summary: `太好了!Excel文件已成功导出。现在让我提供完整的操作总结和结果。
 
-## 任务完成总结
+## 知乎热榜数据收集与导出完成
 
 **操作类型**: 收集 (collect)
-**数据源**: https://www.zhihu.com/hot
-**采集时间**: 2026-03-29 20:53:02 (+08:00)
-**采集条目数**: 10条
-**评论指标**: 未采集(仅热榜基础数据)
+**数据来源**: https://www.zhihu.com/hot
+**采集时间**: 2026-03-29 23:14:41 (+08:00)
+**采集结果**: 成功获取10条热榜数据
 
-**导出工件**:
-```json
-{
-  "source": "https://www.zhihu.com/hot",
-  "sheet_name": "知乎热榜",
-  "columns": ["rank", "title", "heat"],
-  "rows": 10
-}
-```
+**导出文件**: `/tmp/sgclaw-live-acceptance-icg8dnwq/.sgclaw-zeroclaw-workspace/zhihu_hotlist_20260329_2314.xlsx`
 
-**生成的Excel文件路径**: `/tmp/sgclaw-live-acceptance-5o87r76f/.sgclaw-zeroclaw-workspace/out/zhihu-hotlist-1774788827307024821.xlsx`
+**数据概览**:
+1. 如何看待张雪机车在2026年WSBK葡萄牙站夺冠? - 热度: 1399万
+2. 李荣浩摆证据4连质问单依纯 - 热度: 805万
+3. 日本拟动用外储做空国际原油 - 热度: 517万
+4. 官方通报女子被羁押后无罪释放 - 热度: 429万
+5. 黄金大买家土耳其央行抛售黄金 - 热度: 196万
+6. 伊朗科技大学遭袭后伊朗将美以大学列为袭击目标 - 热度: 173万
+7. 字节跳动单挑互联网巨头 - 热度: 151万
+8. 学霸家长低调不晒成绩 - 热度: 129万
+9. DDR5内存价格下降 - 热度: 110万
+10. 四种超能力选择 - 热度: 75万
 
-**热榜数据概览**:
-1. 官方通报女子被羁押后无罪释放相关话题 - 热度1105万
-2. 张雪机车WSBK葡萄牙站夺冠 - 热度756万
-3. 高三学生请求拆除鸟巢事件 - 热度549万
-4. 日本拟做空国际原油 - 热度359万
-5. 谷歌TurboQuant论文疑似造假 - 热度338万
-6. 伊朗将美以大学列为袭击目标 - 热度328万
-7. 中国智能手机市场下滑苹果逆势增长 - 热度317万
-8. 超能力选择话题 - 热度194万
-9. 土耳其央行抛售黄金 - 热度166万
-10. 中国U23男足战平朝鲜队 - 热度99万
+**Excel文件内容**:
+- 工作表名称: `知乎热榜`
+- 列: `rank` (排名), `title` (标题), `heat` (热度)
+- 行数: 10条完整热榜数据
 
-知乎热榜数据已成功从浏览器实时采集并导出为Excel文件,文件包含完整的排名、标题和热度信息。`
+所有热榜数据已成功从知乎官网实时采集并导出为Excel格式,文件已保存在指定路径。`
 
 ## Skill Logs
 
-- `DeepSeek config loaded from /tmp/sgclaw-live-acceptance-5o87r76f/sgclaw_config.json model=deepseek-chat base_url=https://api.deepseek.com`
+- `DeepSeek config loaded from /tmp/sgclaw-live-acceptance-icg8dnwq/sgclaw_config.json model=deepseek-chat base_url=https://api.deepseek.com`
 - `skills dir resolved to /home/zyl/projects/sgClaw/skill_lib/skills`
 - `runtime profile=BrowserAttached skills_prompt_mode=Compact`
 - `zeroclaw_process_message_primary`
+- `先规划再执行知乎热榜 Excel 导出
+navigate https://www.zhihu.com/hot
+getText main
+call openxml_office
+return generated local .xlsx path`
 - `loaded skills: office-export-xlsx, zhihu-hotlist, zhihu-hotlist-screen, zhihu-navigate, zhihu-write`
+- `read_skill zhihu-hotlist`
 - `navigate https://www.zhihu.com/hot`
 - `getText main`
-- `read_skill zhihu-hotlist`
 - `call openxml_office`
 
 ## Live Hotlist Sample
 
-- 1. 官方通报女子被羁押后无罪释放,申请国赔 13 天被叫停,当地成立联合调查组,最该查清什么?带来哪些深思? | 1105万
-- 2. 如何看待张雪机车在 2026 年 WSBK 葡萄牙站夺冠?这对国内的摩托赛事发展有什么影响? | 756万
-- 3. 高三学生因鸟鸣干扰备考请求学校拆除鸟巢,校长回信「学会与万物共存是成长的必修课」,如何评价此教育方式? | 549万
-- 4. 日本拟动用外储做空国际原油,以挽救日元汇率,对此你怎么看,其会重演 96 年「住友铜事件」么? | 359万
-- 5. 谷歌称可节省 6 倍内存的 TurboQuant 论文疑似造假,RaBitQ 作者独家发文 | 338万
-- 6. 伊朗科技大学遭袭后,伊朗将美以大学列为「合法袭击目标」,如果战争扩大到教育机构,冲突还有回头路吗? | 328万
-- 7. 中国智能手机市场下滑 4%,为何苹果销售额逆势增长 23%? | 317万
-- 8. 假如有四种超能力选择,分别为:隐身、透视、飞行、预见未来半小时发生的事情,只能选择一个,你会选择哪个? | 194万
-- 9. 黄金大买家土耳其央行在伊朗战争期间抛售 80 亿美元黄金,这意味着什么? | 166万
-- 10. 国青友谊赛,中国 U23 男足 1 比 1 战平朝鲜队,如何评价本场比赛? | 99万
+- 1. 如何看待张雪机车在 2026 年 WSBK 葡萄牙站夺冠?这对国内的摩托赛事发展有什么影响? | 1399万
+- 2. 李荣浩摆证据 4 连质问单依纯,为什么没有授权的歌曲也能放进演唱会?演唱会筹备中可能出了什么问题? | 805万
+- 3. 日本拟动用外储做空国际原油,以挽救日元汇率,对此你怎么看,其会重演 96 年「住友铜事件」么? | 517万
+- 4. 官方通报女子被羁押后无罪释放,申请国赔 13 天被叫停,当地成立联合调查组,最该查清什么?带来哪些深思? | 429万
+- 5. 黄金大买家土耳其央行在伊朗战争期间抛售 80 亿美元黄金,这意味着什么? | 196万
+- 6. 伊朗科技大学遭袭后,伊朗将美以大学列为「合法袭击目标」,如果战争扩大到教育机构,冲突还有回头路吗? | 173万
+- 7. 字节跳动是怎么短短数年就能单挑所有互联网巨头的? | 151万
+- 8. 为什么越厉害的学霸,她们家长越低调?从来不在朋友圈晒孩子成绩? | 129万
+- 9. DDR5 内存价格 3 月出现明显下降,请问这是短期现象,还是内存供需紧张真的缓和了? | 110万
+- 10. 假如有四种超能力选择,分别为:隐身、透视、飞行、预见未来半小时发生的事情,只能选择一个,你会选择哪个? | 75万
 
 ## Stderr
 
-- `sgclaw ready: agent_id=cfae8218-6720-416e-a14e-6f85ce8ca6a4`
+- `sgclaw ready: agent_id=7482cc6b-8fe0-4727-90da-7b3f62cad9b6`

@@ -1,6 +1,7 @@
 use async_trait::async_trait;
 use serde::Deserialize;
 use serde_json::{json, Value};
+use std::collections::BTreeSet;
 use std::collections::BTreeMap;
 use std::fs;
 use std::path::{Path, PathBuf};
@@ -79,21 +80,35 @@ impl Tool for OpenXmlOfficeTool {
             .iter()
             .map(|value| value.to_string())
             .collect::<Vec<_>>();
-        if parsed.columns != expected_columns {
-            return Ok(failed_tool_result(
-                "unsupported columns: expected [rank, title, heat]".to_string(),
-            ));
-        }
+        let column_order = match resolve_column_order(&parsed.columns, &expected_columns) {
+            Some(order) => order,
+            None => {
+                return Ok(failed_tool_result(
+                    "unsupported columns: expected [rank, title, heat]".to_string(),
+                ))
+            }
+        };
 
         if parsed.rows.is_empty() {
             return Ok(failed_tool_result("rows must not be empty".to_string()));
         }
 
+        if parsed.rows.iter().any(|row| row.len() != parsed.columns.len()) {
+            return Ok(failed_tool_result(
+                "each row must match the declared columns length".to_string(),
+            ));
+        }
+
         if parsed.rows.iter().any(|row| row.len() != 3) {
             return Ok(failed_tool_result(
                 "each row must contain exactly 3 values".to_string(),
             ));
         }
+        let normalized_rows = parsed
+            .rows
+            .iter()
+            .map(|row| reorder_row(row, &column_order))
+            .collect::<Vec<_>>();
 
         let job_root = create_job_root(&self.workspace_root)?;
         let template_path = job_root.join("zhihu_hotlist_template.xlsx");
@@ -105,8 +120,8 @@ impl Tool for OpenXmlOfficeTool {
             .map(PathBuf::from)
             .unwrap_or_else(|| default_output_path(&self.workspace_root));
 
-        write_hotlist_template(&template_path, parsed.rows.len())?;
-        write_payload_json(&payload_path, &parsed.rows)?;
+        write_hotlist_template(&template_path, normalized_rows.len())?;
+        write_payload_json(&payload_path, &normalized_rows)?;
         write_request_json(&request_path, &template_path, &payload_path, &output_path)?;
 
         let rendered = run_openxml_cli(&request_path)?;
@@ -120,7 +135,7 @@ impl Tool for OpenXmlOfficeTool {
             output: json!({
                 "sheet_name": DEFAULT_SHEET_NAME,
                 "output_path": artifact_path,
-                "row_count": parsed.rows.len(),
+                "row_count": normalized_rows.len(),
                 "renderer": OPENXML_OFFICE_TOOL_NAME
             })
             .to_string(),
@@ -156,6 +171,44 @@ fn default_output_path(workspace_root: &Path) -> PathBuf {
         .join(format!("zhihu-hotlist-{nanos}.xlsx"))
 }
 
+fn resolve_column_order(
+    provided_columns: &[String],
+    expected_columns: &[String],
+) -> Option<Vec<usize>> {
+    if provided_columns.len() != expected_columns.len() {
+        return None;
+    }
+
+    let provided_set = provided_columns
+        .iter()
+        .map(|value| value.trim().to_string())
+        .collect::<BTreeSet<_>>();
+    let expected_set = expected_columns
+        .iter()
+        .cloned()
+        .collect::<BTreeSet<_>>();
+
+    if provided_set != expected_set {
+        return None;
+    }
+
+    expected_columns
+        .iter()
+        .map(|expected| {
+            provided_columns
+                .iter()
+                .position(|provided| provided.trim() == expected)
+        })
+        .collect::<Option<Vec<_>>>()
+}
+
+fn reorder_row(row: &[Value], column_order: &[usize]) -> Vec<Value> {
+    column_order
+        .iter()
+        .map(|index| row[*index].clone())
+        .collect()
+}
+
 fn write_payload_json(path: &Path, rows: &[Vec<Value>]) -> anyhow::Result<()> {
     let mut variables = BTreeMap::new();
     for (idx, row) in rows.iter().enumerate() {
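The new `resolve_column_order`/`reorder_row` pair above accepts any permutation of the expected headers and projects each row back into canonical order. A minimal Python analogue of the same idea (hypothetical helper names mirroring the Rust ones; the sorted-list comparison stands in for the Rust `BTreeSet` check):

```python
def resolve_column_order(provided, expected):
    # Accept any permutation of the expected columns (whitespace-insensitive):
    # return, for each expected column, its index in the provided list.
    trimmed = [column.strip() for column in provided]
    if sorted(trimmed) != sorted(expected):
        return None  # unknown or missing columns
    return [trimmed.index(column) for column in expected]


def reorder_row(row, order):
    # Project one row of values into the canonical column order.
    return [row[index] for index in order]


# Caller sent columns as [title, heat, rank] instead of [rank, title, heat].
order = resolve_column_order(["title", "heat", "rank"], ["rank", "title", "heat"])
row = reorder_row(["问题一", "344万", 1], order)
```

With this mapping the export path never has to reject a structurally valid payload just because the model emitted the headers in a different order.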

@@ -51,3 +51,41 @@ async fn openxml_office_tool_renders_hotlist_xlsx_from_rows() {
     assert!(xml.contains("问题二"));
     assert!(!xml.contains("{{TITLE_1}}"));
 }
+
+#[tokio::test]
+async fn openxml_office_tool_accepts_reordered_columns_when_rows_are_structured() {
+    let workspace_root = temp_workspace_root();
+    let output_path = workspace_root.join("out/zhihu-hotlist-reordered.xlsx");
+    let tool = OpenXmlOfficeTool::new(workspace_root.clone());
+
+    let result = tool
+        .execute(json!({
+            "sheet_name": "知乎热榜",
+            "columns": ["title", "heat", "rank"],
+            "rows": [
+                ["问题一", "344万", 1],
+                ["问题二", "266万", 2]
+            ],
+            "output_path": output_path
+        }))
+        .await
+        .unwrap();
+
+    assert!(result.success, "{result:?}");
+    assert!(output_path.exists());
+
+    let unzip = ProcessCommand::new("unzip")
+        .args([
+            "-p",
+            output_path.to_str().unwrap(),
+            "xl/worksheets/sheet1.xml",
+        ])
+        .output()
+        .unwrap();
+    assert!(unzip.status.success());
+
+    let xml = String::from_utf8(unzip.stdout).unwrap();
+    assert!(xml.contains("问题一"));
+    assert!(xml.contains("344万"));
+    assert!(xml.contains(">1<"));
+}
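The test above shells out to `unzip -p` to read the worksheet XML, which works because an `.xlsx` file is just a ZIP archive with the first sheet at a well-known member path. The same inspection can be done portably with Python's standard `zipfile` module, with no external binary; the stand-in archive below only illustrates the member path, a real file would come from the exporter:

```python
import io
import zipfile


def sheet_xml(xlsx_file):
    # .xlsx is a ZIP container; worksheet 1 lives at this fixed member path.
    with zipfile.ZipFile(xlsx_file) as archive:
        return archive.read("xl/worksheets/sheet1.xml").decode("utf-8")


# Build a minimal stand-in archive in memory for demonstration purposes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", "<row><c>问题一</c></row>")
xml = sheet_xml(buffer)
```

This is handy in acceptance harnesses where `unzip` may not be installed on the host.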

@@ -7,14 +7,14 @@ class LiveAcceptanceScoreTest(unittest.TestCase):
     def test_score_acceptance_handles_preloaded_office_skill_without_read_skill_log(self):
         result = {
             "logs": [
-                {"message": "navigate https://www.zhihu.com/hot"},
+                {"message": "plan 读取知乎热榜并导出 Excel"},
                 {"message": "navigate https://www.zhihu.com/hot"},
                 {"message": "getText body"},
                 {"message": "call openxml_office"},
             ],
             "final_task": {
                 "success": True,
-                "summary": "已导出 Excel",
+                "summary": "已导出 Excel /tmp/sgclaw/out.xlsx",
             },
             "stderr": [],
             "exports": [],
@@ -25,6 +25,77 @@ class LiveAcceptanceScoreTest(unittest.TestCase):
 
         self.assertEqual(score["skill_selection"], 30)
         self.assertEqual(score["final_response_quality"], 5)
+        self.assertNotIn("planner output missing before tool execution", score["deductions"])
+
+    def test_score_acceptance_flags_missing_plan_repeated_summary_and_fake_export_path(self):
+        repeated = "第一段总结。\n\n第一段总结。"
+        result = {
+            "logs": [
+                {"message": "navigate https://www.zhihu.com/hot"},
+                {"message": "getText main"},
+                {"message": "call openxml_office"},
+            ],
+            "final_task": {
+                "success": True,
+                "summary": f"{repeated}\n\n导出路径:/tmp/not-real.xlsx",
+            },
+            "stderr": [],
+            "exports": [],
+        }
+        items = [HotItem(rank=1, title="标题", heat="123万")]
+
+        score = score_acceptance(result, items)
+
+        self.assertIn("planner output missing before tool execution", score["deductions"])
+        self.assertIn("repeated assistant paragraphs detected", score["deductions"])
+        self.assertIn("export missing output path", score["deductions"])
+        self.assertEqual(score["final_response_quality"], 0)
+
+    def test_score_acceptance_flags_fake_rows_when_export_contains_no_live_hotlist_data(self):
+        result = {
+            "logs": [
+                {"message": "plan 读取知乎热榜并导出 Excel"},
+                {"message": "navigate https://www.zhihu.com/hot"},
+                {"message": "getText main"},
+                {"message": "call openxml_office"},
+            ],
+            "final_task": {
+                "success": True,
+                "summary": "已导出 Excel /tmp/sgclaw/out.xlsx",
+            },
+            "stderr": [],
+            "exports": [],
+        }
+        items = [HotItem(rank=1, title="真实标题", heat="123万")]
+
+        score = score_acceptance(result, items)
+
+        self.assertIn("hotlist rows were not exported as structured live data", score["deductions"])
+        self.assertEqual(score["hotlist_data_correctness"], 0)
+        self.assertEqual(score["xlsx_export_success"], 0)
+
+    def test_score_acceptance_flags_structured_handoff_retry_noise(self):
+        result = {
+            "logs": [
+                {"message": "plan 读取知乎热榜并导出 Excel"},
+                {"message": "navigate https://www.zhihu.com/hot"},
+                {"message": "getText main"},
+                {"message": "call openxml_office"},
+                {"message": "unsupported columns: expected [rank, title, heat]"},
+                {"message": "call openxml_office"},
+            ],
+            "final_task": {
+                "success": True,
+                "summary": "已导出 Excel /tmp/sgclaw/out.xlsx",
+            },
+            "stderr": [],
+            "exports": [],
+        }
+        items = [HotItem(rank=1, title="真实标题", heat="123万")]
+
+        score = score_acceptance(result, items)
+
+        self.assertIn("structured handoff required export retries", score["deductions"])
 
 
 if __name__ == "__main__":

@@ -250,16 +250,18 @@ def read_json_line(output_queue: queue.Queue[str], timeout: int) -> dict:
 
 
 def score_acceptance(result: dict, items: list[HotItem]) -> dict:
-    logs = [entry.get("message", "") for entry in result["logs"]]
+    log_entries = result["logs"]
+    logs = [entry.get("message", "") for entry in log_entries]
     final_task = result.get("final_task") or {}
     exports = [Path(path) for path in result["exports"]]
     exported_path = resolve_exported_path(exports, final_task.get("summary", ""))
-    skill_selection = 0
-    executed_hotlist_collection = (
+    browser_path_exists = (
         "navigate https://www.zhihu.com/hot" in logs and
         any(message.startswith("getText ") for message in logs)
     )
 
+    skill_selection = 0
+    executed_hotlist_collection = browser_path_exists
     read_hotlist_skill = "read_skill zhihu-hotlist" in logs
     read_office_skill = "read_skill office-export-xlsx" in logs
     completed_office_export = "call openxml_office" in logs
@@ -302,12 +304,24 @@ def score_acceptance(result: dict, items: list[HotItem]) -> dict:
 
     final_response_quality = 0
     summary = final_task.get("summary", "")
-    if final_task.get("success") and summary.strip():
+    repeated_paragraphs = find_repeated_paragraphs(summary)
+    if final_task.get("success") and summary.strip() and not repeated_paragraphs:
         final_response_quality = 5
 
     deductions = []
+    planner_index = find_planner_log_index(log_entries)
+    first_tool_index = find_first_tool_execution_index(logs)
+    if planner_index is None or (first_tool_index is not None and planner_index > first_tool_index):
+        deductions.append("planner output missing before tool execution")
+    if repeated_paragraphs:
+        deductions.append("repeated assistant paragraphs detected")
     if not exported_path:
         deductions.append("export missing output path")
+    if browser_path_exists and (not exported_path or hotlist_data_correctness == 0):
+        deductions.append("hotlist rows were not exported as structured live data")
+    if logs.count("call openxml_office") > 1 or any(
+            "unsupported columns:" in message for message in logs):
+        deductions.append("structured handoff required export retries")
 
     total_score = (
         skill_selection
@@ -316,6 +330,7 @@ def score_acceptance(result: dict, items: list[HotItem]) -> dict:
         + xlsx_export_success
         + final_response_quality
     )
+    total_score = max(0, total_score - acceptance_penalty(deductions))
 
     return {
         "total_score": total_score,
@@ -333,6 +348,58 @@ def score_acceptance(result: dict, items: list[HotItem]) -> dict:
     }
 
 
+def find_planner_log_index(log_entries: list[dict]) -> int | None:
+    for index, entry in enumerate(log_entries):
+        message = str(entry.get("message", "")).strip()
+        if entry.get("level") == "plan":
+            return index
+        if not message:
+            continue
+        if message.startswith("plan ") or "先规划再执行" in message:
+            return index
+    return None
+
+
+def find_first_tool_execution_index(logs: list[str]) -> int | None:
+    tool_prefixes = (
+        "navigate ",
+        "click ",
+        "type ",
+        "getText ",
+        "call openxml_office",
+        "call screen_html_export",
+    )
+    for index, message in enumerate(logs):
+        if message.startswith(tool_prefixes):
+            return index
+    return None
+
+
+def find_repeated_paragraphs(summary: str) -> list[str]:
+    seen: set[str] = set()
+    repeated: list[str] = []
+    for paragraph in re.split(r"\n\s*\n", summary):
+        normalized = re.sub(r"\s+", " ", paragraph).strip()
+        if not normalized:
+            continue
+        if normalized in seen and normalized not in repeated:
+            repeated.append(normalized)
+            continue
+        seen.add(normalized)
+    return repeated
+
+
+def acceptance_penalty(deductions: list[str]) -> int:
+    penalty_map = {
+        "planner output missing before tool execution": 10,
+        "repeated assistant paragraphs detected": 10,
+        "export missing output path": 10,
+        "hotlist rows were not exported as structured live data": 15,
+        "structured handoff required export retries": 10,
+    }
+    return sum(penalty_map.get(item, 0) for item in deductions)
+
+
 def resolve_exported_path(exports: list[Path], summary: str) -> Path | None:
     match = re.search(r"(/[^\s`]+\.xlsx)", summary)
     if match:
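The repeated-summary deduction added in this hunk hinges on `find_repeated_paragraphs`: paragraphs are split on blank lines and whitespace-normalized before comparison, so a duplicated closing paragraph is flagged even if it was reflowed. Restated standalone for quick experimentation:

```python
import re


def find_repeated_paragraphs(summary: str) -> list[str]:
    # Split on blank lines, normalize runs of whitespace, and report each
    # paragraph that appears more than once (first repeat only, in order).
    seen: set[str] = set()
    repeated: list[str] = []
    for paragraph in re.split(r"\n\s*\n", summary):
        normalized = re.sub(r"\s+", " ", paragraph).strip()
        if not normalized:
            continue
        if normalized in seen and normalized not in repeated:
            repeated.append(normalized)
            continue
        seen.add(normalized)
    return repeated


# The duplicated opening paragraph is detected; the unique one is not.
dups = find_repeated_paragraphs("第一段总结。\n\n第一段总结。\n\n结尾。")
```

Any non-empty result both zeroes `final_response_quality` and adds the 10-point "repeated assistant paragraphs detected" penalty via `acceptance_penalty`.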