---
type: slide
slideOptions:
spotlight:
enabled: false
#allottedMinutes: 2 # Minutes allotted for a slide.
---
# summer 2023
## week 1, 2
### 2048, OOG
----
## 2048
----

----

----

avg score = 90699.6, max score = 262080
----

avg score = 109204, max score = 287996
---
## OOG
----
## config
* gumbel_alphazero
* actor_num_simulation=500
* learner_batch_size=512
* actor_mcts_think_time_limit=0.1
----

----

----

----

----

----

---
# week 3
## Accelerating Self-Play Learning in AlphaZero
----
## Major General Improvements
1. Playout Cap Randomization
2. Forced Playouts and Policy Target Pruning
3. Global Pooling
4. Auxiliary Policy Targets
----
## 2. Forced Playouts and Policy Target Pruning
* Forced Playout
* encourages exploration at the root
* each root child c is forced to receive at least N_forced(c) = sqrt(k * P(c) * sum of playouts) playouts (k = 2 in KataGo)
----
* Policy Target Pruning
at the end of the search:
* identify the child c* with the most playouts
* for each other child c, subtract up to N_forced(c) playouts as long as PUCT(c) < PUCT(c*) (see the sketch below)
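A minimal sketch of the pruning step, assuming a simplified PUCT (no puct_base log term) and illustrative names such as `Child` and `prune_policy_target` (not KataGo's or MiniZero's actual code):
```cpp
#include <cmath>
#include <vector>

struct Child {
    float prior;   // policy prior P(c)
    int visits;    // playout count N(c)
    float q;       // mean value Q(c)
};

// Simplified PUCT, used here only to decide how far a child's count may be reduced.
float puct(const Child& c, int total_visits, float c_puct = 1.25f) {
    return c.q + c_puct * c.prior * std::sqrt((float)total_visits) / (1.0f + c.visits);
}

// Keep the most-visited child c* untouched; from every other child subtract up to
// N_forced(c) playouts, stopping before its PUCT would reach PUCT(c*).
// The reduced visit counts become the policy training target.
void prune_policy_target(std::vector<Child>& children, float k = 2.0f) {
    if (children.empty()) return;
    int total = 0;
    for (const Child& c : children) total += c.visits;
    Child* best = &children[0];
    for (Child& c : children)
        if (c.visits > best->visits) best = &c;
    const float best_puct = puct(*best, total);
    for (Child& c : children) {
        if (&c == best) continue;
        int n_forced = (int)std::sqrt(k * c.prior * (float)total);  // forced playouts given during search
        while (n_forced > 0 && c.visits > 0) {
            Child reduced = c;
            --reduced.visits;
            if (puct(reduced, total) >= best_puct) break;  // pruning further would overtake c*
            c = reduced;
            --n_forced;
        }
    }
}
```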
---
# week 4
## Gomoku and Threat Space Search
----
### basic threats

----
### potential threats

----
### notations for threat space search
* gain square
* cost squares
* rest squares
* A is dependent on B
* dependency tree of A
* two dependency trees in conflict
----

----
### search tree

----


----
### algorithm
1. find all threats on the given board
2. for each threat:
* DFS over dependent threats until depth == limit or a winning sequence is found (see the sketch below)
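A rough sketch of that search in C++, using the gain/cost/rest terminology from the notation slide; `Board`, `find_threats`, `is_win`, and `depends_on` are hypothetical interfaces, not the actual implementation:
```cpp
#include <cstddef>
#include <vector>

struct Square { int x, y; };

struct Threat {
    Square gain;               // gain square: the attacker's move that creates the threat
    std::vector<Square> cost;  // cost squares: the defender's forced replies
    std::vector<Square> rest;  // rest squares: attacker stones the threat relies on
};

// Hypothetical interfaces; the real ones would wrap the engine's board and threat detector.
struct Board {
    void play(const Square& s, bool attacker);
    void undo();
};
std::vector<Threat> find_threats(const Board& b);
bool is_win(const Board& b);
// true when `next` depends on `prev`, i.e. prev's gain square is among next's rest squares
bool depends_on(const Threat& next, const Threat& prev);

// DFS through the dependency tree rooted at `prev`:
// only follow threats enabled by the previous gain square.
bool tss_dfs(Board& b, const Threat& prev, int depth, int depth_limit) {
    if (is_win(b)) return true;              // winning threat sequence found
    if (depth >= depth_limit) return false;  // depth == limit: give up on this branch
    for (const Threat& next : find_threats(b)) {
        if (!depends_on(next, prev)) continue;
        b.play(next.gain, /*attacker=*/true);                // play the gain square
        for (const Square& s : next.cost) b.play(s, false);  // and the forced replies
        bool found = tss_dfs(b, next, depth + 1, depth_limit);
        for (std::size_t i = 0; i < next.cost.size() + 1; ++i) b.undo();
        if (found) return true;
    }
    return false;
}

// 1. find all threats on the board; 2. DFS each threat's dependency tree.
bool threat_space_search(Board& b, int depth_limit) {
    for (const Threat& root : find_threats(b)) {
        b.play(root.gain, true);
        for (const Square& s : root.cost) b.play(s, false);
        bool found = tss_dfs(b, root, 1, depth_limit);
        for (std::size_t i = 0; i < root.cost.size() + 1; ++i) b.undo();
        if (found) return true;
    }
    return false;
}
```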
---
# week 5, 6
## debug
## [dependency-based search](http://fragrieu.free.fr/SearchingForSolutions.pdf)
----


---
# week 7
## Testing Gomoku
----
## 3bx_threat vs 3bx
2323-vs-1230
Total games: 100
Wins: 57
Losses: 43
Win rate: 57.0%
----
## 8bx_threat_weight_257800 vs oldest_best
### 2 sec MCTS think time
Total games: 50
Wins: 26
Losses: 24
Win rate: 52.0%
----
## 8bx_threat_weight_257800 vs oldest_best
### 1 sec MCTS think time
Total games: 50
Wins: 31
Losses: 19
Win rate: 62.0%
----
## 8bx_threat_weight_257800 vs oldest_best
### 5 sec MCTS think time
Total games: 50
Wins: 22
Losses: 28
Win rate: 44.0%
----
## how the program lost

----

----

----
## one-ply-search
----
## The above happened because I forgot to turn off the noise, orz
---
# week 8
## Computer Olympiad Outer-Open-Gomoku Competition
----
## Minizero-OOG
* Dependency-Based Search similar to TSS
* Gumbel
* tricks from KataGo such as
* Global Pooling
* Auxiliary Policy Targets
* Auxiliary Soft Policy Target
* 2066 iterations
* 332800 nn steps
* learning rate: 0.002
----
## training process

----

----

----

----

----
## National Taiwan Normal University
* Threat-Space Search and some double-two joseki
* no Gumbel
* tricks from KataGo
* trained for 2000 iterations but used the 1200-iteration model
----
### team1 Corking
* only uses domain knowledge
* no neural network
**2:0**
----
### team2 Stone_OOG
**2:0**
----
### team3 Peace_OOG
**0:2**
----
## our configuration
* 30 sec think time
* infinite actor_num_simulation
----
## Another team from IIS, Academia Sinica
### Minizero_TSSOOG
* TSS
* 800 iterations
* 20 residual blocks
**2:0**
----
## our configuration
* 20 sec think time
* infinite actor_num_simulation
----
## National Yang Ming Chiao Tung University
### clap_OOG
* no Gumbel
* tricks from KataGo such as
* Forced Playouts and Policy Target Pruning
* Global Pooling
* Auxiliary Policy Targets
* trained for over a month
**0:2**
----
## our configuration
* 15 sec think time
* infinite actor_num_simulation
----
## Université Paris-Dauphine, LAMSADE, CNRS
### Marvin
* no domain knowledge
* minimax search
* 15565 self-play trained for around a month
* 16 residual blocks
* same algorithm as last year but retrained
**1:1**
----
## Ohto Katsuki
### Asura
* random MCTS
**2:0**
----
## result
* score: 9W5L
* rank: 4/8
* Peace_OOG, clap_OOG, Marvin, Minizero_OOG, stone_OOG, Minizero_TSSOOG, corking, asura
* title: None
---
# TAAI 2023
----
## Minizero-OOG
* 8898 iterations
* version: 332800.pt -> 889800.pt
* learning rate: 2e-3 -> 2e-5
* simulation count: 100 -> 200
----
#### elo_rating
* 1 sec think time
* mostly 250 games for every 50 iterations

----
## accuracy_policy

----
## loss_policy

----
## loss_value

----
## returns

----
## game length

----
## time

----
## result
0:2 clap_OOG
0:1:1 Peace_OOG
1:1 Corking
* rank: 3/4
* title: bronze medal
----
## game recap

---
# winter 2024
## opening book
----
### structure
* black and white openings
* state strings to action strings
* BFS tree
* serialize, deserialize
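A minimal sketch of such a book, assuming the plain-text `State:` / `Actions:` format shown in the /black/book.txt slide below (class and method names are illustrative):
```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Opening book: maps a state string (the moves played so far) to its book actions.
class OpeningBook {
public:
    void add(const std::string& state, const std::vector<std::string>& actions) { book_[state] = actions; }

    // An empty vector means the position is not in the book.
    std::vector<std::string> lookup(const std::string& state) const {
        auto it = book_.find(state);
        return it == book_.end() ? std::vector<std::string>{} : it->second;
    }

    // Serialize as alternating "State:" / "Actions:" lines.
    void save(const std::string& path) const {
        std::ofstream out(path);
        for (const auto& [state, actions] : book_) {
            out << "State:" << (state.empty() ? "" : " " + state) << "\n";
            out << "Actions:";
            for (const auto& a : actions) out << " " << a;
            out << "\n";
        }
    }

    // Deserialize from the same format.
    void load(const std::string& path) {
        std::ifstream in(path);
        std::string line, state;
        while (std::getline(in, line)) {
            if (line.rfind("State:", 0) == 0) {
                state = line.size() > 7 ? line.substr(7) : "";
            } else if (line.rfind("Actions:", 0) == 0) {
                std::istringstream ss(line.size() > 8 ? line.substr(8) : "");
                std::vector<std::string> actions;
                for (std::string a; ss >> a;) actions.push_back(a);
                book_[state] = actions;
            }
        }
    }

private:
    std::map<std::string, std::vector<std::string>> book_;  // BFS tree flattened to a map
};
```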

----
### design (black)
* let minizero think longer
* get the best action from MCTS for black
* get bf actions from the NN policy for white
* bf = max(4, bf)
* bf is dynamically adjusted based on the policy distribution
* depth: 15~17
the same procedure is used for the white book (a sketch of the build loop follows)
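A rough sketch of that build loop for the black book; `mcts_best_action` and `top_policy_actions` are hypothetical wrappers around the engine (not MiniZero's actual API), and the dynamic bf adjustment is only noted in a comment:
```cpp
#include <algorithm>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical engine wrappers: a long-think MCTS move and the top-n NN policy moves.
std::string mcts_best_action(const std::string& state);
std::vector<std::string> top_policy_actions(const std::string& state, int n);

// BFS over the opening tree: at black-to-move states store the single MCTS move,
// at white-to-move states branch on the top policy replies, down to the depth limit.
std::map<std::string, std::vector<std::string>> build_black_book(int depth_limit, int bf) {
    std::map<std::string, std::vector<std::string>> book;
    std::deque<std::pair<std::string, int>> queue{{"", 0}};  // (state string, depth)
    while (!queue.empty()) {
        auto [state, depth] = queue.front();
        queue.pop_front();
        if (depth >= depth_limit) continue;
        if (depth % 2 == 0) {  // black to move: a single book move from MCTS
            std::string best = mcts_best_action(state);
            book[state] = {best};
            queue.emplace_back(state.empty() ? best : state + " " + best, depth + 1);
        } else {               // white to move: expand bf replies from the NN policy
            // bf = max(4, bf); the dynamic adjustment based on the policy distribution is omitted here
            std::vector<std::string> replies = top_policy_actions(state, std::max(4, bf));
            book[state] = replies;
            for (const auto& r : replies) queue.emplace_back(state + " " + r, depth + 1);
        }
    }
    return book;
}
```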
----
### /black/book.txt
```
State:
Actions: F14
State: F14
Actions: F12 L4 M4 F10 G11 E11 L6 F11 B15 G13
State: F14 B15
Actions: G11
State: F14 E11
Actions: G11
State: F14 F10
Actions: F12
State: F14 F11
Actions: E11
State: F14 F12
Actions: H11
State: F14 G11
Actions: E11
State: F14 G13
Actions: E11
State: F14 L4
Actions: F12
State: F14 L6
Actions: F11
State: F14 M4
Actions: F12
```
---
## opening book optimization
#### Zobrist hashing + board rotation + is_resign/terminal check
----
### idea
* BFS tree pruning
* a smaller tree that covers more positions (effectively a bigger opening book)
* less file storage
----
### cons
* hash collisions
* more memory on the stack
* a larger constant factor
* more complicated code
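A minimal sketch of the hashing idea for a 15x15 board, canonicalizing over the four rotations only; the key width, the handling of reflections, and the is_resign/terminal check are assumptions, not the actual implementation:
```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <random>
#include <vector>

constexpr int kBoardSize = 15;
constexpr int kCells = kBoardSize * kBoardSize;
using ZobristTable = std::array<std::array<uint64_t, 2>, kCells>;

// One random 64-bit key per (cell, color), generated once at startup.
ZobristTable make_zobrist_table(uint64_t seed = 1229) {
    std::mt19937_64 rng(seed);
    ZobristTable table{};
    for (auto& cell : table)
        for (auto& key : cell) key = rng();
    return table;
}

// stones[i] in {0 = empty, 1 = black, 2 = white}.
uint64_t zobrist_hash(const std::vector<int>& stones, const ZobristTable& table) {
    uint64_t h = 0;
    for (int i = 0; i < kCells; ++i)
        if (stones[i] != 0) h ^= table[i][stones[i] - 1];
    return h;
}

// Rotate the board 90 degrees clockwise.
std::vector<int> rotate90(const std::vector<int>& stones) {
    std::vector<int> out(kCells);
    for (int r = 0; r < kBoardSize; ++r)
        for (int c = 0; c < kBoardSize; ++c)
            out[c * kBoardSize + (kBoardSize - 1 - r)] = stones[r * kBoardSize + c];
    return out;
}

// Canonical hash: minimum over the four rotations, so rotated duplicates collapse
// to a single book entry (smaller tree, at the cost of possible hash collisions).
uint64_t canonical_hash(std::vector<int> stones, const ZobristTable& table) {
    uint64_t best = zobrist_hash(stones, table);
    for (int k = 0; k < 3; ++k) {
        stones = rotate90(stones);
        best = std::min(best, zobrist_hash(stones, table));
    }
    return best;
}
```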
----
### /black/hash_book.txt
```
State: 104546213
Actions: 958128308
State: 122126856
Actions: 40186728
State: 331119591
Actions: 40186728
State: 352918596
Actions: 958128308
State: 425125379
Actions: 2102596696
State: 621287631
Actions: 958128308
State: 1013464112
Actions: 958128308
State: 1399999131
Actions: 958128308
State: 1460801621
Actions: 958128308
State: 1813905182
Actions: 1701229151
State: 1819583497
Actions: 1929289568
State: 1929289568
Actions: 1631773831 1318495056 636260149 1806203235 518217854 1975065960 562877947 1475774383 1744274212 1959108805
```
----
### pruning & collision experiment
version = 11512 iterations
think_time = 1 second
depth = 13
* without pruning

----
### pruning & collision experiment
version = 11512 iterations
think_time = 1 second
depth = 13
```
if (visited.find(hash_st) != visited.end()) continue;  // skip states whose hash was already visited
if (console.getMove() == "end") continue;              // skip positions where the game has ended (resign/terminal)
```
* with pruning

---
### version evaluation from a lost game

* percentage of correct moves predicted by the NN policy: 34% (rounded to the nearest whole number)
----
* 196000_iter 57%

----
* 332800_iter 51%

----
* 504800_iter 48%

----
* 604800_iter 65%

----
* 654800_iter 71%

----
* 684800_iter 70%

----
* 704800_iter 78%

----
* 744800_iter 79%

----
* 754800_iter 79%

----
* 764800_iter 66%

----
* 784800_iter 70%

----
* 844800_iter 57%

----
* 878700_iter 46%

----
* 972200_iter 33%

----
* 1151200_iter 42%

----
* 1163500_iter 43%

----
### bracket

----
#### bracket result

----
### config for the bracket
```
# Program
program_seed=0
program_auto_seed=false
program_quiet=false
# Actor
actor_num_threads=4
actor_num_parallel_games=32
actor_num_simulation=1000000
actor_mcts_puct_base=19652
actor_mcts_puct_init=1.25
actor_mcts_reward_discount=1
actor_mcts_value_rescale=false
actor_mcts_think_batch_size=16
actor_mcts_think_time_limit=1 # MCTS time limit (in seconds), 0 represents searching without using the time limit
actor_select_action_by_count=true
actor_select_action_by_softmax_count=false
actor_select_action_softmax_temperature=1
actor_select_action_softmax_temperature_decay=false # decay temperature based on zero_end_iteration; use 1, 0.5, and 0.25 for 0%-50%, 50%-75%, and 75%-100% of total iterations, respectively
actor_use_random_rotation_features=true # randomly rotate input features, currently only supports alphazero mode
actor_use_dirichlet_noise=false
actor_dirichlet_noise_alpha=0.03 # 1 / sqrt(num of actions)
actor_dirichlet_noise_epsilon=0.25
actor_use_gumbel=false
actor_use_gumbel_noise=false
actor_gumbel_sample_size=16
actor_gumbel_sigma_visit_c=50
actor_gumbel_sigma_scale_c=1
actor_resign_threshold=-0.9
# Zero
zero_server_port=1229
zero_training_directory=
zero_num_games_per_iteration=2000
zero_start_iteration=0
zero_end_iteration=100
zero_replay_buffer=20
zero_disable_resign_ratio=0.1
zero_actor_intermediate_sequence_length=0 # board games: 0; atari: 200
zero_actor_ignored_command=reset_actors # format: command1 command2 ...
zero_actor_stop_after_enough_games=false
zero_server_accept_different_model_games=true
# Learner
learner_use_per=false # Prioritized Experience Replay
learner_per_alpha=1 # Prioritized Experience Replay
learner_per_init_beta=1 # Prioritized Experience Replay
learner_per_beta_anneal=true # linearly anneal PER init beta to 1 based on zero_end_iteration
learner_training_step=100
learner_training_display_step=100
learner_batch_size=1024
learner_muzero_unrolling_step=5
learner_n_step_return=0 # board games: 0, atari: 10
learner_learning_rate=0.02
learner_momentum=0.9
learner_weight_decay=0.0001
learner_value_loss_scale=1
learner_num_thread=8
# Network
nn_file_name=
nn_num_input_channels=4
nn_input_channel_height=15
nn_input_channel_width=15
nn_num_hidden_channels=128
nn_hidden_channel_height=15
nn_hidden_channel_width=15
nn_num_action_feature_channels=1
nn_num_blocks=8
nn_action_size=225
nn_num_value_hidden_channels=128
nn_discrete_value_size=1 # set to 1 for the games which doesn't use discrete value
nn_type_name=alphazero # alphazero/muzero
# Environment
env_board_size=15
env_gomoku_rule=outer_open # normal/outer_open
```
----
#### elo_rating every 50 iterations
* 1 sec think time
* mostly 250 games

----
#### elo_rating every 500 iterations
* from 3898 iterations to 11398 iterations
* 1 sec think time
* 250 games

----
#### elo_rating every 1000 iterations
* from 3898 iterations to 10898 iterations
* 1 sec think time
* 250 games

----
## possible reasons
* larger think batch size
* more simulation count
* lower learning rate
---
## version and config check vs clap_OOG with 400 simulations
----
### config
```
# Program
program_seed=0
program_auto_seed=false
program_quiet=false
# Actor
actor_num_threads=4
actor_num_parallel_games=32
actor_num_simulation=400
actor_mcts_puct_base=19652
actor_mcts_puct_init=1.25
actor_mcts_reward_discount=1
actor_mcts_value_rescale=false
actor_mcts_think_batch_size=16
actor_mcts_think_time_limit=1 # MCTS time limit (in seconds), 0 represents searching without using the time limit
actor_select_action_by_count=true
actor_select_action_by_softmax_count=false
actor_select_action_softmax_temperature=1
actor_select_action_softmax_temperature_decay=false # decay temperature based on zero_end_iteration; use 1, 0.5, and 0.25 for 0%-50%, 50%-75%, and 75%-100% of total iterations, respectively
actor_use_random_rotation_features=true # randomly rotate input features, currently only supports alphazero mode
actor_use_dirichlet_noise=false
actor_dirichlet_noise_alpha=0.03 # 1 / sqrt(num of actions)
actor_dirichlet_noise_epsilon=0.25
actor_use_gumbel=false
actor_use_gumbel_noise=false
actor_gumbel_sample_size=16
actor_gumbel_sigma_visit_c=50
actor_gumbel_sigma_scale_c=1
actor_resign_threshold=-0.9
# Zero
zero_server_port=1229
zero_training_directory=
zero_num_games_per_iteration=2000
zero_start_iteration=0
zero_end_iteration=100
zero_replay_buffer=20
zero_disable_resign_ratio=0.1
zero_actor_intermediate_sequence_length=0 # board games: 0; atari: 200
zero_actor_ignored_command=reset_actors # format: command1 command2 ...
zero_actor_stop_after_enough_games=false
zero_server_accept_different_model_games=true
# Learner
learner_use_per=false # Prioritized Experience Replay
learner_per_alpha=1 # Prioritized Experience Replay
learner_per_init_beta=1 # Prioritized Experience Replay
learner_per_beta_anneal=true # linearly anneal PER init beta to 1 based on zero_end_iteration
learner_training_step=100
learner_training_display_step=100
learner_batch_size=1024
learner_muzero_unrolling_step=5
learner_n_step_return=0 # board games: 0, atari: 10
learner_learning_rate=0.02
learner_momentum=0.9
learner_weight_decay=0.0001
learner_value_loss_scale=1
learner_num_thread=8
# Network
nn_file_name=
nn_num_input_channels=4
nn_input_channel_height=15
nn_input_channel_width=15
nn_num_hidden_channels=128
nn_hidden_channel_height=15
nn_hidden_channel_width=15
nn_num_action_feature_channels=1
nn_num_blocks=8
nn_action_size=225
nn_num_value_hidden_channels=128
nn_discrete_value_size=1 # set to 1 for the games which doesn't use discrete value
nn_type_name=alphazero # alphazero/muzero
# Environment
env_board_size=15
env_gomoku_rule=outer_open # normal/outer_open
```
----
### 7748_iter with 400 simulations and different think batch sizes
* 16 think batch size: 20% win rate
* 8 think batch size: 30% win rate
* 4 think batch size: 34% win rate
* 2 think batch size: 33% win rate
* 1 think batch size: 39% win rate
----
### best version vs clap
```
1788-vs-12345
Total games: 217
winner: 12345
Wins: 102
Losses: 76
Win rate: 47.004608294930875%
Win rate without ties: 57.30337078651686%
```
* win_rate: 42%
---
## new model
----
### reason
* wrong winning rule
* only exactly five in a row counted as winning
* six or more in a row did not count (see the sketch below)
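A small sketch of the corrected check, assuming a flat board vector where (r, c) is the move just played (illustrative, not the engine's code):
```cpp
#include <vector>

// Count consecutive stones of `color` through (r, c) along direction (dr, dc).
int count_line(const std::vector<int>& board, int size, int r, int c, int dr, int dc, int color) {
    int n = 1;  // the stone at (r, c) itself
    for (int i = 1; ; ++i) {  // forward
        int rr = r + dr * i, cc = c + dc * i;
        if (rr < 0 || rr >= size || cc < 0 || cc >= size || board[rr * size + cc] != color) break;
        ++n;
    }
    for (int i = 1; ; ++i) {  // backward
        int rr = r - dr * i, cc = c - dc * i;
        if (rr < 0 || rr >= size || cc < 0 || cc >= size || board[rr * size + cc] != color) break;
        ++n;
    }
    return n;
}

// Freestyle rule: five OR MORE in a row wins. The bug treated only an exact five
// as a win, so overlines (six or more) were not recognized.
bool wins_freestyle(const std::vector<int>& board, int size, int r, int c, int color) {
    const int dirs[4][2] = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};
    for (const auto& d : dirs)
        if (count_line(board, size, r, c, d[0], d[1], color) >= 5) return true;  // >= 5, not == 5
    return false;
}
```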
----
### improvement
* baseline_softmax vs baseline_count
```
7800-vs-7801
Total games: 250
winner: 7801
Wins: 179
Losses: 71
Win rate: 71.6%
Win rate without ties: 71.6%
```
* use count instead of softmax count when the simulation count is small (illustrated below)
```
actor_select_action_by_count=true
actor_select_action_by_softmax_count=false
```
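A small illustration of the difference between the two selection rules, with the temperature fixed at 1 and illustrative function names:
```cpp
#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

// actor_select_action_by_count=true: play the most-visited root child (argmax).
int select_by_count(const std::vector<int>& visits) {
    return (int)std::distance(visits.begin(),
                              std::max_element(visits.begin(), visits.end()));
}

// actor_select_action_by_softmax_count=true (temperature 1): sample a child with
// probability proportional to its visit count. With a small simulation count the
// counts are noisy, so sampling picks clearly worse moves more often than argmax.
int select_by_softmax_count(const std::vector<int>& visits, std::mt19937& rng) {
    std::discrete_distribution<int> dist(visits.begin(), visits.end());
    return dist(rng);
}
```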
----
### training in progress
#### new baseline model
#### new baseline model with TSS
#### new baseline model with new NN
----
### baseline vs old
* new baseline weight_iter_237400.pt
* old model weight_iter_231200.pt
* new baseline win rate: 84%
```
10004-vs-10005
Total games: 250
winner: 10004
Wins: 202
Losses: 39
Win rate: 80.80000000000001%
Win rate without ties: 83.81742738589212%
```
----
### baseline vs baseline + NN
* baseline weight_iter_186600.pt
* baseline + NN weight_iter_186600.pt
* baseline win rate: 75%
```
10006-vs-10007
Total games: 250
winner: 10007
Wins: 172
Losses: 58
Win rate: 68.8%
Win rate without ties: 74.78260869565217%
```
----
### baseline vs baseline + TSS
* baseline weight_iter_400000.pt
* baseline + TSS weight_iter_400000.pt
* baseline + TSS win rate: 53%
```
8377-vs-8378
Total games: 250
winner: 8377
Wins: 109
Losses: 98
Win rate: 43.6%
Win rate without ties: 52.65700483091788%
```
---
## TCGA 2024
----
### Minizero_OOG
* 15bx, 1500 iterations
* parallel search during the competition
* TSS and board history features
* no KataGo tricks or TSS
----
### rules
* OOG-freestyle
* 45 minutes per player
* if the time limit is exceeded and one of the agents believes it is winning, the game continues
----
### vs clap_OOG
* black 30 sec
* white 50 sec
### vs peace_OOG
* black 35 sec
* white 50 sec
----
### result
2:0 clap_OOG
1:1 peace_OOG
* rank: 1/3
* title: gold medal
---
## slide link: https://hackmd.io/@diegowu/B1xwKd7q2