---
type: slide
slideOptions:
spotlight:
enabled: false
#allottedMinutes: 2 # Minutes allotted for a slide.
---
# summer 2023
## week 1, 2
### 2048, OOG
----
## 2048
----

----

----

avg score = 90699.6, max score = 262080
----

avg score = 109204, max score = 287996
---
## OOG
----
## config
* gumbel_alphazero
* actor_num_simulation=500
* learner_batch_size=512
* actor_mcts_think_time_limit=0.1
----

----

----

----

----

----

---
# week 3
## Accelerating Self-Play Learning in AlphaZero
----
## Major General Improvements
1. Playout Cap Randomization
2. Forced Playouts and Policy Target Pruning
3. Global Pooling
4. Auxiliary Policy Targets
----
## 2. Forced Playouts and Policy Target Pruning
* Forced Playout
* encourages exploration at the root
* each root child c is forced to receive at least N_forced(c) = sqrt(k * P(c) * sum of playouts) playouts (k = 2 in KataGo)
----
* Policy Target Pruning
at the end of the search:
* identify the child c* with the most playouts
* for each other child c, subtract up to N_forced(c) playouts as long as PUCT(c) < PUCT(c*) (see the sketch below)
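A minimal sketch of the pruning step, assuming a simplified PUCT (no puct_base log term) and illustrative names such as `Child` and `prune_policy_target` (not KataGo's or MiniZero's actual code):
```cpp
#include <cmath>
#include <vector>

struct Child {
    float prior;   // policy prior P(c)
    int visits;    // playout count N(c)
    float q;       // mean value Q(c)
};

// Simplified PUCT, used here only to decide how far a child's count may be reduced.
float puct(const Child& c, int total_visits, float c_puct = 1.25f) {
    return c.q + c_puct * c.prior * std::sqrt((float)total_visits) / (1.0f + c.visits);
}

// Keep the most-visited child c* untouched; from every other child subtract up to
// N_forced(c) playouts, stopping before its PUCT would reach PUCT(c*).
// The reduced visit counts become the policy training target.
void prune_policy_target(std::vector<Child>& children, float k = 2.0f) {
    if (children.empty()) return;
    int total = 0;
    for (const Child& c : children) total += c.visits;
    Child* best = &children[0];
    for (Child& c : children)
        if (c.visits > best->visits) best = &c;
    const float best_puct = puct(*best, total);
    for (Child& c : children) {
        if (&c == best) continue;
        int n_forced = (int)std::sqrt(k * c.prior * (float)total);  // forced playouts given during search
        while (n_forced > 0 && c.visits > 0) {
            Child reduced = c;
            --reduced.visits;
            if (puct(reduced, total) >= best_puct) break;  // pruning further would overtake c*
            c = reduced;
            --n_forced;
        }
    }
}
```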
---
# week 4
## Gomoku and Threat Space Search
----
### basic threats

----
### potential threats

----
### notations for threat space search
* gain square
* cost squares
* rest squares
* A is dependent on B
* dependency tree of A
* two dependency trees in conflict
----

----
### search tree

----


----
### algorithm
1. find all threats on the given board
2. for each threat:
* DFS over dependent threats until depth == limit or a winning sequence is found (see the sketch below)
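A rough sketch of that search in C++, using the gain/cost/rest terminology from the notation slide; `Board`, `find_threats`, `is_win`, and `depends_on` are hypothetical interfaces, not the actual implementation:
```cpp
#include <cstddef>
#include <vector>

struct Square { int x, y; };

struct Threat {
    Square gain;               // gain square: the attacker's move that creates the threat
    std::vector<Square> cost;  // cost squares: the defender's forced replies
    std::vector<Square> rest;  // rest squares: attacker stones the threat relies on
};

// Hypothetical interfaces; the real ones would wrap the engine's board and threat detector.
struct Board {
    void play(const Square& s, bool attacker);
    void undo();
};
std::vector<Threat> find_threats(const Board& b);
bool is_win(const Board& b);
// true when `next` depends on `prev`, i.e. prev's gain square is among next's rest squares
bool depends_on(const Threat& next, const Threat& prev);

// DFS through the dependency tree rooted at `prev`:
// only follow threats enabled by the previous gain square.
bool tss_dfs(Board& b, const Threat& prev, int depth, int depth_limit) {
    if (is_win(b)) return true;              // winning threat sequence found
    if (depth >= depth_limit) return false;  // depth == limit: give up on this branch
    for (const Threat& next : find_threats(b)) {
        if (!depends_on(next, prev)) continue;
        b.play(next.gain, /*attacker=*/true);                // play the gain square
        for (const Square& s : next.cost) b.play(s, false);  // and the forced replies
        bool found = tss_dfs(b, next, depth + 1, depth_limit);
        for (std::size_t i = 0; i < next.cost.size() + 1; ++i) b.undo();
        if (found) return true;
    }
    return false;
}

// 1. find all threats on the board; 2. DFS each threat's dependency tree.
bool threat_space_search(Board& b, int depth_limit) {
    for (const Threat& root : find_threats(b)) {
        b.play(root.gain, true);
        for (const Square& s : root.cost) b.play(s, false);
        bool found = tss_dfs(b, root, 1, depth_limit);
        for (std::size_t i = 0; i < root.cost.size() + 1; ++i) b.undo();
        if (found) return true;
    }
    return false;
}
```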
---
# week 5, 6
## debug
## [dependency-based search](http://fragrieu.free.fr/SearchingForSolutions.pdf)
----


---
# week 7
## Testing Gomoku
----
## 3bx_threat vs 3bx
2323-vs-1230
Total games: 100
Wins: 57
Losses: 43
Win rate: 57.0%
----
## 8bx_threat_weight_257800 vs oldest_best
### 2 sec MCTS think time
Total games: 50
Wins: 26
Losses: 24
Win rate: 52.0%
----
## 8bx_threat_weight_257800 vs oldest_best
### 1 sec MCTS think time
Total games: 50
Wins: 31
Losses: 19
Win rate: 62.0%
----
## 8bx_threat_weight_257800 vs oldest_best
### 5 sec MCTS think time
Total games: 50
Wins: 22
Losses: 28
Win rate: 44.0%
----
## how the program lost

----

----

----
## one-ply-search
----
## The above happened because I forgot to turn off the noise, orz
---
# week 8
## Computer Olympiad Outer-Open-Gomoku Competition
----
## Minizero-OOG
* Dependency-Based Search similar to TSS
* Gumbel
* tricks from KataGo such as
* Global Pooling
* Auxiliary Policy Targets
* Auxiliary Soft Policy Target
* 2066 iterations
* 332800 nn steps
* learning rate: 0.002
----
## training process

----

----

----

----

----
## National Taiwan Normal University
* Threat-Space Search and some double-two joseki
* no Gumbel
* tricks from KataGo
* trained for 2000 iterations but used the 1200-iteration model
----
### team1 Corking
* only uses domain knowledge
* no neural network
**2:0**
----
### team2 Stone_OOG
**2:0**
----
### team3 Peace_OOG
**0:2**
----
## our configuration
* 30 sec think time
* infinite actor_num_simulation
----
## Another team from IIS, Academia Sinica
### Minizero_TSSOOG
* TSS
* 800 iterations
* 20 residual blocks
**2:0**
----
## our configuration
* 20 sec think time
* infinite actor_num_simulation
----
## National Yang Ming Chiao Tung University
### clap_OOG
* no Gumbel
* tricks from KataGo such as
* Forced Playouts and Policy Target Pruning
* Global Pooling
* Auxiliary Policy Targets
* trained for over a month
**0:2**
----
## our configuration
* 15 sec think time
* infinite actor_num_simulation
----
## Université Paris-Dauphine, LAMSADE, CNRS
### Marvin
* no domain knowledge
* minimax search
* 15565 self-play trained for around a month
* 16 residual blocks
* same algorithm as last year but retrained
**1:1**
----
## Ohto Katsuki
### Asura
* random MCTS
**2:0**
----
## result
* score: 9W5L
* rank: 4/8
* Peace_OOG, clap_OOG, Marvin, Minizero_OOG, stone_OOG, Minizero_TSSOOG, corking, asura
* title: None
---
# TAAI 2023
----
## Minizero-OOG
* 8898 iterations
* version: 332800.pt -> 889800.pt
* learning rate: 2e-3 -> 2e-5
* simulation count: 100 -> 200
----
#### elo_rating
* 1 sec think time
* mostly 250 games for every 50 iterations

----
## accuracy_policy

----
## loss_policy

----
## loss_value

----
## returns

----
## game length

----
## time

----
## result
0:2 clap_OOG
0:1:1 Peace_OOG
1:1 Corking
* rank: 3/4
* title: bronze medal
----
## game recap

---
# winter 2024
## opening book
----
### structure
* black and white openings
* state strings to action strings
* BFS tree
* serialize, deserialize
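A minimal sketch of such a book, assuming the plain-text `State:` / `Actions:` format shown in the /black/book.txt slide below (class and method names are illustrative):
```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Opening book: maps a state string (the moves played so far) to its book actions.
class OpeningBook {
public:
    void add(const std::string& state, const std::vector<std::string>& actions) { book_[state] = actions; }

    // An empty vector means the position is not in the book.
    std::vector<std::string> lookup(const std::string& state) const {
        auto it = book_.find(state);
        return it == book_.end() ? std::vector<std::string>{} : it->second;
    }

    // Serialize as alternating "State:" / "Actions:" lines.
    void save(const std::string& path) const {
        std::ofstream out(path);
        for (const auto& [state, actions] : book_) {
            out << "State:" << (state.empty() ? "" : " " + state) << "\n";
            out << "Actions:";
            for (const auto& a : actions) out << " " << a;
            out << "\n";
        }
    }

    // Deserialize from the same format.
    void load(const std::string& path) {
        std::ifstream in(path);
        std::string line, state;
        while (std::getline(in, line)) {
            if (line.rfind("State:", 0) == 0) {
                state = line.size() > 7 ? line.substr(7) : "";
            } else if (line.rfind("Actions:", 0) == 0) {
                std::istringstream ss(line.size() > 8 ? line.substr(8) : "");
                std::vector<std::string> actions;
                for (std::string a; ss >> a;) actions.push_back(a);
                book_[state] = actions;
            }
        }
    }

private:
    std::map<std::string, std::vector<std::string>> book_;  // BFS tree flattened to a map
};
```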

----
### design (black)
* let minizero think longer
* get the best action from MCTS for black
* get bf actions from the NN policy for white
* bf = max(4, bf)
* bf is dynamically adjusted based on the policy distribution
* depth: 15~17
the same procedure is used for the white book (a sketch of the build loop follows)
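A rough sketch of that build loop for the black book; `mcts_best_action` and `top_policy_actions` are hypothetical wrappers around the engine (not MiniZero's actual API), and the dynamic bf adjustment is only noted in a comment:
```cpp
#include <algorithm>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical engine wrappers: a long-think MCTS move and the top-n NN policy moves.
std::string mcts_best_action(const std::string& state);
std::vector<std::string> top_policy_actions(const std::string& state, int n);

// BFS over the opening tree: at black-to-move states store the single MCTS move,
// at white-to-move states branch on the top policy replies, down to the depth limit.
std::map<std::string, std::vector<std::string>> build_black_book(int depth_limit, int bf) {
    std::map<std::string, std::vector<std::string>> book;
    std::deque<std::pair<std::string, int>> queue{{"", 0}};  // (state string, depth)
    while (!queue.empty()) {
        auto [state, depth] = queue.front();
        queue.pop_front();
        if (depth >= depth_limit) continue;
        if (depth % 2 == 0) {  // black to move: a single book move from MCTS
            std::string best = mcts_best_action(state);
            book[state] = {best};
            queue.emplace_back(state.empty() ? best : state + " " + best, depth + 1);
        } else {               // white to move: expand bf replies from the NN policy
            // bf = max(4, bf); the dynamic adjustment based on the policy distribution is omitted here
            std::vector<std::string> replies = top_policy_actions(state, std::max(4, bf));
            book[state] = replies;
            for (const auto& r : replies) queue.emplace_back(state + " " + r, depth + 1);
        }
    }
    return book;
}
```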
----
### /black/book.txt
```
State:
Actions: F14
State: F14
Actions: F12 L4 M4 F10 G11 E11 L6 F11 B15 G13
State: F14 B15
Actions: G11
State: F14 E11
Actions: G11
State: F14 F10
Actions: F12
State: F14 F11
Actions: E11
State: F14 F12
Actions: H11
State: F14 G11
Actions: E11
State: F14 G13
Actions: E11
State: F14 L4
Actions: F12
State: F14 L6
Actions: F11
State: F14 M4
Actions: F12
```
---
## opening book optimization
#### Zobrist hashing + board rotation + is_resign/terminal check
----
### idea
* BFS tree pruning
* a smaller tree that covers more positions (effectively a bigger opening book)
* less file storage
----
### cons
* hash collisions
* more memory on the stack
* a larger constant factor
* more complicated code
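A minimal sketch of the hashing idea for a 15x15 board, canonicalizing over the four rotations only; the key width, the handling of reflections, and the is_resign/terminal check are assumptions, not the actual implementation:
```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <random>
#include <vector>

constexpr int kBoardSize = 15;
constexpr int kCells = kBoardSize * kBoardSize;
using ZobristTable = std::array<std::array<uint64_t, 2>, kCells>;

// One random 64-bit key per (cell, color), generated once at startup.
ZobristTable make_zobrist_table(uint64_t seed = 1229) {
    std::mt19937_64 rng(seed);
    ZobristTable table{};
    for (auto& cell : table)
        for (auto& key : cell) key = rng();
    return table;
}

// stones[i] in {0 = empty, 1 = black, 2 = white}.
uint64_t zobrist_hash(const std::vector<int>& stones, const ZobristTable& table) {
    uint64_t h = 0;
    for (int i = 0; i < kCells; ++i)
        if (stones[i] != 0) h ^= table[i][stones[i] - 1];
    return h;
}

// Rotate the board 90 degrees clockwise.
std::vector<int> rotate90(const std::vector<int>& stones) {
    std::vector<int> out(kCells);
    for (int r = 0; r < kBoardSize; ++r)
        for (int c = 0; c < kBoardSize; ++c)
            out[c * kBoardSize + (kBoardSize - 1 - r)] = stones[r * kBoardSize + c];
    return out;
}

// Canonical hash: minimum over the four rotations, so rotated duplicates collapse
// to a single book entry (smaller tree, at the cost of possible hash collisions).
uint64_t canonical_hash(std::vector<int> stones, const ZobristTable& table) {
    uint64_t best = zobrist_hash(stones, table);
    for (int k = 0; k < 3; ++k) {
        stones = rotate90(stones);
        best = std::min(best, zobrist_hash(stones, table));
    }
    return best;
}
```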
----
### /black/hash_book.txt
```
State: 104546213
Actions: 958128308
State: 122126856
Actions: 40186728
State: 331119591
Actions: 40186728
State: 352918596
Actions: 958128308
State: 425125379
Actions: 2102596696
State: 621287631
Actions: 958128308
State: 1013464112
Actions: 958128308
State: 1399999131
Actions: 958128308
State: 1460801621
Actions: 958128308
State: 1813905182
Actions: 1701229151
State: 1819583497
Actions: 1929289568
State: 1929289568
Actions: 1631773831 1318495056 636260149 1806203235 518217854 1975065960 562877947 1475774383 1744274212 1959108805
```
----
### pruning & collision experiment
version = 11512 iterations
think_time = 1 second
depth = 13
* without pruning

----
### pruning & collision experiment
version = 11512 iterations
think_time = 1 second
depth = 13
```
if (visited.find(hash_st) != visited.end()) continue;  // skip states whose hash was already visited
if (console.getMove() == "end") continue;              // skip positions where the game has ended (resign/terminal)
```
* with pruning

---
### version evaluation from a lost game

* percentage of correct moves predicted by the NN policy: 34% (rounded to the nearest whole number)
----
* 196000_iter 57%

----
* 332800_iter 51%

----
* 504800_iter 48%

----
* 604800_iter 65%

----
* 654800_iter 71%

----
* 684800_iter 70%

----
* 704800_iter 78%

----
* 744800_iter 79%

----
* 754800_iter 79%

----
* 764800_iter 66%

----
* 784800_iter 70%

----
* 844800_iter 57%

----
* 878700_iter 46%

----
* 972200_iter 33%

----
* 1151200_iter 42%

----
* 1163500_iter 43%

----
### bracket

----
#### bracket result

----
### config for the bracket
```
# Program
program_seed=0
program_auto_seed=false
program_quiet=false
# Actor
actor_num_threads=4
actor_num_parallel_games=32
actor_num_simulation=1000000
actor_mcts_puct_base=19652
actor_mcts_puct_init=1.25
actor_mcts_reward_discount=1
actor_mcts_value_rescale=false
actor_mcts_think_batch_size=16
actor_mcts_think_time_limit=1 # MCTS time limit (in seconds), 0 represents searching without using the time limit
actor_select_action_by_count=true
actor_select_action_by_softmax_count=false
actor_select_action_softmax_temperature=1
actor_select_action_softmax_temperature_decay=false # decay temperature based on zero_end_iteration; use 1, 0.5, and 0.25 for 0%-50%, 50%-75%, and 75%-100% of total iterations, respectively
actor_use_random_rotation_features=true # randomly rotate input features, currently only supports alphazero mode
actor_use_dirichlet_noise=false
actor_dirichlet_noise_alpha=0.03 # 1 / sqrt(num of actions)
actor_dirichlet_noise_epsilon=0.25
actor_use_gumbel=false
actor_use_gumbel_noise=false
actor_gumbel_sample_size=16
actor_gumbel_sigma_visit_c=50
actor_gumbel_sigma_scale_c=1
actor_resign_threshold=-0.9
# Zero
zero_server_port=1229
zero_training_directory=
zero_num_games_per_iteration=2000
zero_start_iteration=0
zero_end_iteration=100
zero_replay_buffer=20
zero_disable_resign_ratio=0.1
zero_actor_intermediate_sequence_length=0 # board games: 0; atari: 200
zero_actor_ignored_command=reset_actors # format: command1 command2 ...
zero_actor_stop_after_enough_games=false
zero_server_accept_different_model_games=true
# Learner
learner_use_per=false # Prioritized Experience Replay
learner_per_alpha=1 # Prioritized Experience Replay
learner_per_init_beta=1 # Prioritized Experience Replay
learner_per_beta_anneal=true # linearly anneal PER init beta to 1 based on zero_end_iteration
learner_training_step=100
learner_training_display_step=100
learner_batch_size=1024
learner_muzero_unrolling_step=5
learner_n_step_return=0 # board games: 0, atari: 10
learner_learning_rate=0.02
learner_momentum=0.9
learner_weight_decay=0.0001
learner_value_loss_scale=1
learner_num_thread=8
# Network
nn_file_name=
nn_num_input_channels=4
nn_input_channel_height=15
nn_input_channel_width=15
nn_num_hidden_channels=128
nn_hidden_channel_height=15
nn_hidden_channel_width=15
nn_num_action_feature_channels=1
nn_num_blocks=8
nn_action_size=225
nn_num_value_hidden_channels=128
nn_discrete_value_size=1 # set to 1 for the games which doesn't use discrete value
nn_type_name=alphazero # alphazero/muzero
# Environment
env_board_size=15
env_gomoku_rule=outer_open # normal/outer_open
```
----
#### elo_rating every 50 iterations
* 1 sec think time
* mostly 250 games

----
#### elo_rating every 500 iterations
* from 3898 iterations to 11398 iterations
* 1 sec think time
* 250 games

----
#### elo_rating every 1000 iterations
* from 3898 iterations to 10898 iterations
* 1 sec think time
* 250 games

----
## possible reasons
* larger think batch size
* more simulation count
* lower learning rate
---
## version and config check vs clap_OOG with 400 simulations
----
### config
```
# Program
program_seed=0
program_auto_seed=false
program_quiet=false
# Actor
actor_num_threads=4
actor_num_parallel_games=32
actor_num_simulation=400
actor_mcts_puct_base=19652
actor_mcts_puct_init=1.25
actor_mcts_reward_discount=1
actor_mcts_value_rescale=false
actor_mcts_think_batch_size=16
actor_mcts_think_time_limit=1 # MCTS time limit (in seconds), 0 represents searching without using the time limit
actor_select_action_by_count=true
actor_select_action_by_softmax_count=false
actor_select_action_softmax_temperature=1
actor_select_action_softmax_temperature_decay=false # decay temperature based on zero_end_iteration; use 1, 0.5, and 0.25 for 0%-50%, 50%-75%, and 75%-100% of total iterations, respectively
actor_use_random_rotation_features=true # randomly rotate input features, currently only supports alphazero mode
actor_use_dirichlet_noise=false
actor_dirichlet_noise_alpha=0.03 # 1 / sqrt(num of actions)
actor_dirichlet_noise_epsilon=0.25
actor_use_gumbel=false
actor_use_gumbel_noise=false
actor_gumbel_sample_size=16
actor_gumbel_sigma_visit_c=50
actor_gumbel_sigma_scale_c=1
actor_resign_threshold=-0.9
# Zero
zero_server_port=1229
zero_training_directory=
zero_num_games_per_iteration=2000
zero_start_iteration=0
zero_end_iteration=100
zero_replay_buffer=20
zero_disable_resign_ratio=0.1
zero_actor_intermediate_sequence_length=0 # board games: 0; atari: 200
zero_actor_ignored_command=reset_actors # format: command1 command2 ...
zero_actor_stop_after_enough_games=false
zero_server_accept_different_model_games=true
# Learner
learner_use_per=false # Prioritized Experience Replay
learner_per_alpha=1 # Prioritized Experience Replay
learner_per_init_beta=1 # Prioritized Experience Replay
learner_per_beta_anneal=true # linearly anneal PER init beta to 1 based on zero_end_iteration
learner_training_step=100
learner_training_display_step=100
learner_batch_size=1024
learner_muzero_unrolling_step=5
learner_n_step_return=0 # board games: 0, atari: 10
learner_learning_rate=0.02
learner_momentum=0.9
learner_weight_decay=0.0001
learner_value_loss_scale=1
learner_num_thread=8
# Network
nn_file_name=
nn_num_input_channels=4
nn_input_channel_height=15
nn_input_channel_width=15
nn_num_hidden_channels=128
nn_hidden_channel_height=15
nn_hidden_channel_width=15
nn_num_action_feature_channels=1
nn_num_blocks=8
nn_action_size=225
nn_num_value_hidden_channels=128
nn_discrete_value_size=1 # set to 1 for the games which doesn't use discrete value
nn_type_name=alphazero # alphazero/muzero
# Environment
env_board_size=15
env_gomoku_rule=outer_open # normal/outer_open
```
----
### 7748_iter with 400 simulations and different think batch sizes
* 16 think batch size: 20% win rate
* 8 think batch size: 30% win rate
* 4 think batch size: 34% win rate
* 2 think batch size: 33% win rate
* 1 think batch size: 39% win rate
----
### best version vs clap
```
1788-vs-12345
Total games: 217
winner: 12345
Wins: 102
Losses: 76
Win rate: 47.004608294930875%
Win rate without ties: 57.30337078651686%
```
* win_rate: 42%
---
## new model
----
### reason
* wrong winning rule
* only exactly five in a row counted as winning
* six or more in a row did not count (see the sketch below)
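A small sketch of the corrected check, assuming a flat board vector where (r, c) is the move just played (illustrative, not the engine's code):
```cpp
#include <vector>

// Count consecutive stones of `color` through (r, c) along direction (dr, dc).
int count_line(const std::vector<int>& board, int size, int r, int c, int dr, int dc, int color) {
    int n = 1;  // the stone at (r, c) itself
    for (int i = 1; ; ++i) {  // forward
        int rr = r + dr * i, cc = c + dc * i;
        if (rr < 0 || rr >= size || cc < 0 || cc >= size || board[rr * size + cc] != color) break;
        ++n;
    }
    for (int i = 1; ; ++i) {  // backward
        int rr = r - dr * i, cc = c - dc * i;
        if (rr < 0 || rr >= size || cc < 0 || cc >= size || board[rr * size + cc] != color) break;
        ++n;
    }
    return n;
}

// Freestyle rule: five OR MORE in a row wins. The bug treated only an exact five
// as a win, so overlines (six or more) were not recognized.
bool wins_freestyle(const std::vector<int>& board, int size, int r, int c, int color) {
    const int dirs[4][2] = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};
    for (const auto& d : dirs)
        if (count_line(board, size, r, c, d[0], d[1], color) >= 5) return true;  // >= 5, not == 5
    return false;
}
```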
----
### improvement
* baseline_softmax vs baseline_count
```
7800-vs-7801
Total games: 250
winner: 7801
Wins: 179
Losses: 71
Win rate: 71.6%
Win rate without ties: 71.6%
```
* use count instead of softmax count when the simulation count is small (illustrated below)
```
actor_select_action_by_count=true
actor_select_action_by_softmax_count=false
```
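A small illustration of the difference between the two selection rules, with the temperature fixed at 1 and illustrative function names:
```cpp
#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

// actor_select_action_by_count=true: play the most-visited root child (argmax).
int select_by_count(const std::vector<int>& visits) {
    return (int)std::distance(visits.begin(),
                              std::max_element(visits.begin(), visits.end()));
}

// actor_select_action_by_softmax_count=true (temperature 1): sample a child with
// probability proportional to its visit count. With a small simulation count the
// counts are noisy, so sampling picks clearly worse moves more often than argmax.
int select_by_softmax_count(const std::vector<int>& visits, std::mt19937& rng) {
    std::discrete_distribution<int> dist(visits.begin(), visits.end());
    return dist(rng);
}
```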
----
### training in progress
#### new baseline model
#### new baseline model with TSS
#### new baseline model with new NN
----
### baseline vs old
* new baseline weight_iter_237400.pt
* old model weight_iter_231200.pt
* new baseline win rate: 84%
```
10004-vs-10005
Total games: 250
winner: 10004
Wins: 202
Losses: 39
Win rate: 80.80000000000001%
Win rate without ties: 83.81742738589212%
```
----
### baseline vs baseline + NN
* baseline weight_iter_186600.pt
* baseline + NN weight_iter_186600.pt
* baseline win rate: 75%
```
10006-vs-10007
Total games: 250
winner: 10007
Wins: 172
Losses: 58
Win rate: 68.8%
Win rate without ties: 74.78260869565217%
```
----
### baseline vs baseline + TSS
* baseline weight_iter_400000.pt
* baseline + TSS weight_iter_400000.pt
* baseline + TSS win rate: 53%
```
8377-vs-8378
Total games: 250
winner: 8377
Wins: 109
Losses: 98
Win rate: 43.6%
Win rate without ties: 52.65700483091788%
```
---
## TCGA 2024
----
### Minizero_OOG
* 15bx, 1500 iterations
* parallel search during the competition
* TSS and board history features
* no KataGo tricks or TSS
----
### rules
* OOG-freestyle
* 45 minutes per player
* if the time limit is exceeded and one of the agents believes it is winning, the game continues
----
### vs clap_OOG
* black 30 sec
* white 50 sec
### vs peace_OOG
* black 35 sec
* white 50 sec
----
### result
2:0 clap_OOG
1:1 peace_OOG
* rank: 1/3
* title: gold medal
---
## slide link: https://hackmd.io/@diegowu/B1xwKd7q2