# GSS 2025/11/12

0. locomo_eval_test_results

| results (f1) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 11.9 | 7.3 | 5 | 18 | 0 | 10.8 |
| 3 | 8.5 | 16 | 6.7 | 17 | 0 | 11.5 |

| results (bleu1) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 8.6 | 5.8 | 3.5 | 15 | 0 | 8.8 |
| 3 | 5.8 | 13.5 | 5.7 | 13.9 | 0 | 9.3 |

| results (sbert) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 46.5 | 43.7 | 35.8 | 43.6 | 0 | 33.9 |
| 3 | 42.7 | 43.9 | 33.5 | 43.1 | 0 | 33.1 |

------------------

1. Question answering performance of Base models.
- context length: 4096
- f1 score

| Base model (f1) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| phi4:14b | 20.3 | 9.8 | 23.5 | 16.8 | 2.9 | 13.37 |
| gemma3:12b | 20.8 | 8.3 | 21.4 | **18.0** | 0.4 | 13.04 |
| gpt-oss:20b | 20.6 | **15.5** | **25.3** | 15.3 | 7.8 | 14.88 |
| qwen3:30b | **22.6** | 7.1 | 16.7 | 17.7 | **24.7** | **18.21** |

| Base model (bleu1) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| phi4:14b | 8.3 | 5.5 | **6.6** | 9.6 | 4.8 | 7.5 |
| gemma3:12b | **9.3** | 5.0 | 5.5 | 10.7 | 7.6 | **8.6** |
| gpt-oss:20b | 7.1 | **7.7** | 6.5 | 9.4 | 4.3 | 7.5 |
| qwen3:30b | 8.3 | 3.4 | 6.4 | **11.2** | **7.7** | 8.5 |

| Base model (sbert) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| phi4:14b | 40.5 | 19.4 | 40 | 29 | 20.3 | 27.7 |
| gemma3:12b | 45.7 | 22.3 | 40.9 | 33.3 | **28.2** | 32.5 |
| gpt-oss:20b | 45.2 | **32.1** | **45.4** | 31.7 | 22.1 | 32.2 |
| qwen3:30b | **46.6** | 25.8 | 40.5 | **34** | 27.4 | **33.3** |

2. Question answering performance of Mem0. (10 conversations)

| mem0 (f1) | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | **30.0** | **46.1** | 15.3 | **35.7** | **35.8** |
| gemma3:12b | 21.9 | 45.5 | 12.8 | 30.7 | 31.7 |
| gpt-oss:20b | 22.2 | 41.9 | **20.8** | 27.7 | 29.6 |
| qwen3:30b | 5.9 | 4.6 | 9.2 | 5.4 | 5.5 |

| mem0 (bleu1) | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | **23.4** | **40.6** | 12.3 | **30.4** | **30.3** |
| gemma3:12b | 14.0 | 37.3 | 10.1 | 12.8 | 25.7 | 25.5 |
| gpt-oss:20b | 13.8 | 30.7 | **15.6** | 23.5 | 23.2 |
| qwen3:30b | 0.5 | 0.2 | 0.5 | 0.3 | 0.3 |

| mem0 (sbert) | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | **57.6** | 72.4 | 40.5 | **53.7** | **57.6** |
| gemma3:12b | 47.4 | **72.8** | 35.6 | 46.5 | 52 |
| gpt-oss:20b | 46.7 | 62.9 | **44.1** | 46.8 | 50.4 |
| qwen3:30b | 21.7 | 9.5 | 23.0 | 18.8 | 17.6 |

| mem0 (time) | conv. count | avg. add time (s) | avg. answer time (s) |
| --- | --- | --- | --- |
| phi4:14b | 3 | **15.9** | **1.38** |
| gemma3:12b | 3 | 21.2 | 5.66 |
| gpt-oss:20b | 3 | 28.6 | 5.11 |
| qwen3:30b | 3 | 27.6 | 5.37 |

* gpt-oss: structured output could not be used; 2.1% (21/999) of the answers contained unexpected output.
* qwen3: performance was too poor, so only one round was tested.

3. Question answering performance of Amem (10 conversations)

| Amem (f1) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| phi4:14b | 17.4 | 21.4 | 9.2 | 34.7 | 29.4 | 27.8 |
| gemma3:12b | 5.2 | 3.4 | 3.6 | 11.1 | **49.5** | 17.3 |
| gpt-oss:20b | 2.4 | 0.9 | 2.9 | 3.3 | 9.1 | 4.0 |
| qwen3:30b | **24.7** | **29.1** | **9.89** | **35.1** | 22.0 | **28.8** |

| Amem (bleu1) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| phi4:14b | 15.1 | 19.5 | **8.2** | 29.5 | 28.6 | **24.7** |
| gemma3:12b | 8.0 | 6.0 | 4.6 | 9.3 | **48.6** | 17.2 |
| gpt-oss:20b | 2.1 | 1.2 | 2.1 | 2.3 | 9 | 3.6 |
| qwen3:30b | **16.2** | **24.4** | 8.1 | **30.3** | 19.6 | 24.2 |

| Amem (sbert) | multi-hop | temporal | open-domain | single-hop | adversarial | overall |
| --- | --- | --- | --- | --- | --- | --- |
| phi4:14b | 40.9 | 47.4 | 36.0 | 49.1 | **34.0** | 43.8 |
| gemma3:12b | 32.7 | 31.4 | 25.6 | 33.5 | 52.8 | 37.2 |
| gpt-oss:20b | 14.9 | 11.6 | 16.6 | 13.8 | 15.9 | 14.3 |
| qwen3:30b | **47.9** | **58.6** | **36.3** | **50.2** | 28.7 | **46.0** |

| Amem (time) | conv. count | avg. add time (s) | avg. answer time (s) |
| --- | --- | --- | --- |
| phi4:14b | 5 | 14.52 | 7.26 |
| gemma3:12b | 5 | 16.11 | 8.65 |
| gpt-oss:20b | 5 | 5.80 | 12.08 |
| qwen3:30b | 5 | 11.83 | 13.75 |

Note 1: gemma3 does not follow the JSON format when answering.

> Question 14: What career path has Caroline decided to persue?
> 2025-11-13 04:59:03,253 - INFO - Prediction: {"counseling or mental health work more.": "counseling or mental health work"}

Note 2: gpt-oss fails to output valid JSON when adding memory, which leaves the keywords/tags empty and degrades performance.

> [Warning] gpt-oss output not JSON parseable:
> Response: We need to produce JSON with keywords, context, tags. Content is a short conversation. We need to identify most salient keywords focusing on nouns, verbs, key concepts. The speaker: Melanie says: ": [
> "Hey Caroline! Good to see you! I'm swamped with the kids & work. What's up with you? Anything new?"
> ]
> memory keywords: []memory tags: []

4. ***(Updated) Question answering performance of memobase.***

| memobase (f1) | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | 4.5 | 3.5 | 4.3 | 5.0 | 4.5 |
| gemma3:12b | 11.9 | 7.9 | 6.2 | 12.9 | 11.3 |
| gpt-oss:20b | 7.7 | 6.6 | 5.9 | 7.8 | 7.4 |
| qwen3:30b | | | | | |

| memobase (bleu1) | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | 2.9 | 3.5 | 1.8 | 5.0 | 2.1 |
| gemma3:12b | 8.4 | 6.1 | 3.5 | 2.4 | 7.14 |
| gpt-oss:20b | 6.2 | 5.0 | 3.2 | 4.5 | 4.9 |
| qwen3:30b | | | | | |

| memobase (sbert) | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | 32.5 | 18.4 | 34.3 | 31.2 | 29.0 |
| gemma3:12b | 34.2 | 20.2 | 37.6 | 31.2 | 29.8 |
| gpt-oss:20b | 30.2 | 17.6 | 34.8 | 25.9 | 25.5 |
| qwen3:30b | | | | | |

5. Mem0 extract/embed/update time_stats

| mem0 (time) | conv. count | avg. extract time (s) | avg. embed time (s) | avg. update time (s) | total time (s) |
| --- | --- | --- | --- | --- | --- |
| phi4:14b | 10 | 426.2 | 555.5 | 2539.5 | 3521.2 |
| phi4:14b | 8 | 438.89 | 573.22 | 2628.46 | 3640.57 |
| gemma3:12b | 2 | 542.1 | 3289.3 | 601.2 | 4432.6 |
| gpt-oss:20b | | | | | |
| qwen3:30b | | | | | |

6. langmem (10 conv.)

| PetrosStav/gemma3-tools:12b | multi-hop | temporal | open-domain | single-hop | overall |
| --- | --- | --- | --- | --- | --- |
| f1 | 6.3 | 3 | 16.2 | 4.3 | 5.0 |
| bleu1 | 3.1 | 3.4 | 8.6 | 3.3 | 3.5 |
| sbert | 21.0 | 11.6 | 30.5 | 16.5 | 17.0 |

7. memobase issues
- During add, threading may waste time or cause problems -> could be switched to sequential.

8. *(Updated) Amem time stats (phi4)*

| Amem | conv. count | avg. update time (s) | avg. embed time (s) |
| --- | --- | --- | --- |
| phi4 | 8 | 4833.1 | 571.3 |
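
The gpt-oss warning recorded under the Amem results shows the model writing free-form reasoning where a JSON object with keywords, context, and tags was expected, which left the keyword and tag lists empty. One possible mitigation, sketched here purely as an assumption and not taken from the Amem code, is a tolerant parser that scans the response for the first decodable JSON object and flags failures explicitly. The helper names `extract_first_json_object` and `parse_memory_metadata` and the `parse_ok` flag are hypothetical; only the field names `keywords`, `context`, and `tags` come from the log above.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This brace did not open valid JSON; try the next one.
            start = text.find("{", start + 1)
    return None


def _as_str_list(value):
    """Keep only string items of a list; anything else becomes an empty list."""
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, str)]


def parse_memory_metadata(response):
    """Hypothetical helper: normalize a model response into memory metadata."""
    parsed = extract_first_json_object(response)
    ok = parsed is not None
    parsed = parsed or {}
    context = parsed.get("context", "")
    return {
        "keywords": _as_str_list(parsed.get("keywords")),
        "context": context if isinstance(context, str) else "",
        "tags": _as_str_list(parsed.get("tags")),
        "parse_ok": ok,  # lets the caller count or retry failed parses
    }


if __name__ == "__main__":
    reply = 'We need to produce JSON. {"keywords": ["kids", "work"], "context": "greeting", "tags": ["family"]}'
    print(parse_memory_metadata(reply))
```

Counting the responses where `parse_ok` is false would give a failure rate comparable to the 21/999 figure noted for Mem0. Note that this sketch would not catch the gemma3 case in Note 1, where the output is valid JSON but uses an unexpected key; that would need a separate schema check.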