# Test Run 2024 (Jan.–Feb.)

I tested some candidate options. Each candidate run plays around 130k–150k games. Each match plays 400 games against the baseline (default settings).

### 1. Use Gumbel approximate Q policy temperature

The original approximate Q is $Q_{\text{approx}} = \sum_{b:\,N(b) \neq 0} P(b)\,Q(b)$. We apply a temperature ($t = 4.0$) to the policy (a sketch appears in the appendix at the end of this post). This method was inspired by shengkelong's test run. I was surprised that it could improve the performance; I have no idea why. ~~Win-Rate: 59%~~ After fixing some behavior and self-playing around 0.5M games on 9x9, I don't see significant improvement. I have removed this function.

### 2. Use the optimistic policy

The optimistic policy was introduced by KataGo, but KataGo never uses the optimistic policy for self-play. In my short test run it looked good; however, after longer training it shows some biases. Not a good idea. ~~Win-Rate: 65%~~

### 3. Forbid the random move during the fast search phase

Tested this feature in v0.6.1. The original playout cap randomization performs the fast search without any exploration settings; since v0.6.1 I have added some randomization during the fast search phase. It looks like the random moves improve the strength. Win-Rate: 42.25%

### 4. Optimistic policy with KLD weighting

See [code](https://github.com/CGLemon/Sayuri/commit/e0d3a11cf8ecc247f827c183c2e11bc242991187). This method is inspired by [Policy Surprise Weighting](https://github.com/lightvector/KataGo/blob/master/docs/KataGoMethods.md#policy-surprise-weighting); a sketch of the weighting idea is in the appendix. I tested this method several times, and every run shows a similar improvement. ~~Win-Rate: 81.75%~~ After longer training on 19x19, I found that the optimistic policy with KLD weighting is worse than the plain optimistic policy.

### 5. Large batch size

Use a larger batch size for the training process. We also adjust the learning rate and steps per epoch, making sure the number of samples and the per-batch learning rate are equal to the small-batch case (see the appendix for a sketch). It looks like we do not need the large batch size. Win-Rate: 4.5%

### 6. Territory scoring rule

For the next UEC Cup, I implemented territory scoring (aka the Japanese-like rule). After fixing some bugs, it looks like the network can understand this rule. Because the next UEC Cup is close, I am afraid I won't be able to check the optimistic-policy items. The next main run will start soon.
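### Appendix: code sketches

Below is a minimal sketch of the temperature-scaled approximate Q from item 1. It assumes the temperature is applied as $P(b)^{1/t}$ with renormalization; the exact form used in the run is not documented here, so the function name and arrays are illustrative only.

```python
import numpy as np

def approx_q_with_temperature(prior, q_values, visits, t=4.0):
    """Gumbel-style approximate Q over visited children.

    Assumed form (not confirmed by the post): the policy prior is
    flattened with temperature t, i.e. P(b) -> P(b)^(1/t), then
    renormalized, and Q_approx = sum_{b: N(b) != 0} P_t(b) * Q(b).
    """
    prior = np.asarray(prior, dtype=np.float64)
    q_values = np.asarray(q_values, dtype=np.float64)
    visited = np.asarray(visits) > 0

    # Apply temperature to the policy prior; t = 4.0 flattens it.
    p_t = prior ** (1.0 / t)
    p_t /= p_t.sum()

    # Mix Q only over children that received at least one visit.
    return float(np.sum(p_t[visited] * q_values[visited]))
```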
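The KLD weighting in item 4 follows the spirit of Policy Surprise Weighting: positions where the search result diverges strongly from the raw network policy get a larger training weight. A minimal sketch, assuming the weight is the KL divergence from the network prior to the search target; the exact normalization or clipping in the linked commit may differ.

```python
import numpy as np

def kld_weight(policy_prior, policy_target, eps=1e-10):
    """KL(target || prior): how much the search 'surprised' the network.

    Illustrative only; Sayuri's commit may rescale or clip differently.
    """
    p = np.asarray(policy_target, dtype=np.float64) + eps
    q = np.asarray(policy_prior, dtype=np.float64) + eps
    p /= p.sum()  # renormalize after adding eps for numerical safety
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example: weight each training sample relative to the batch mean,
# so surprising positions contribute more to the gradient:
#   weights = klds / klds.mean()
```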
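For item 5, one standard way to keep a large-batch run comparable to the small-batch baseline is the linear scaling rule: scale the learning rate with the batch size (so the effective step per sample is unchanged) and divide the steps per epoch by the same factor (so each epoch sees the same number of samples). Whether the run used exactly this rule is not stated, and the concrete numbers below are assumptions, not the actual run settings.

```python
# Hypothetical baseline configuration (assumed values, not the real run).
base_batch, base_lr, base_steps = 256, 0.005, 4000

scale = 4                               # e.g. move to batch size 1024
batch = base_batch * scale              # 1024
lr = base_lr * scale                    # keep the per-sample step size
steps_per_epoch = base_steps // scale   # keep samples seen per epoch

print(batch, lr, steps_per_epoch)       # 1024 0.02 1000
```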