# Main Run 2024-3-21

* 2024-3-21
    * Start the training.
    * The version is v0.7.0.
    * learning rate = 0.005
    * batch size = 256
    * The network size is 6bx96c.
    * **Area Scoring** only (T^T). Maybe the next run will play both **Area Scoring** and **Territory Scoring**.

</br>

* 2024-3-25
    * Played 475k games.
    * The loss is unstable. Drop the learning rate to 0.0025 (from 0.005).

</br>

* 2024-3-27
    * Played 590k games.
    * Accumulated around $1.707 \times 10^{8}$ 20bx256c eval queries.
    * The strength is between LZ091 and LZ092. I will update the match games later.
    * Halt the 6bx96c training.
    * The loss is still unstable after playing another 100k games. Maybe 6bx96c cannot handle the 19x19 board well? Here is the whole policy loss curve. You can see that the loss values oscillate at around 450000 steps and again at around 680000 steps.
    * ![policy-loss-6bx96c](https://hackmd.io/_uploads/SynfJBZkA.png)

</br>

* 2024-3-28
    * Start the 10bx128c training.
    * learning rate = 0.005
    * batch size = 256
    * current replay buffer is 150000 games.

</br>

* 2024-3-30
    * Played 695k games (10bx128c played 105k games).
    * Accumulated around $3.061 \times 10^{8}$ 20bx256c eval queries.
    * The strength is better than Leela Zero with the [116](https://zero.sjeng.org/networks/39d465076ed1bdeaf4f85b35c2b569f604daa60076cbee9bbaab359f92a7c1c4.gz) weights. The Elo difference is +53.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 116.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 400 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     124          106          230 (57.50%)
      Leela Zero 0.17    94           76          170 (42.50%)
      ```

</br>

* 2024-4-3
    * Played 905k games (10bx128c played 315k games).
    * Drop the learning rate to 0.0025 (from 0.005).

</br>

* 2024-4-5
    * Played 1040k games (10bx128c played 450k games).
    * Accumulated around $7.509 \times 10^{8}$ 20bx256c eval queries.
    * Halt the 10bx128c training.
    * The strength is between LZ116 and LZ117.

</br>

* 2024-4-8
    * Start the 15bx192c training.
    * learning rate = 0.0025
    * batch size = 256
    * current replay buffer is 200000 games.

</br>

* 2024-4-9
    * Played 1095k games (15bx192c played 55k games).
    * Accumulated around $9.905 \times 10^{8}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [127](https://zero.sjeng.org/networks/3f6c8dd85e888bec8b0bcc3006c33954e4be5df8a24660b03fcf3e128fd54338.gz) weights. The Elo difference is -6.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 127.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 596 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     143          150          293 (49.16%)
      Leela Zero 0.17   148          155          303 (50.84%)
      ```

</br>
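The Elo differences quoted in these match reports are consistent with the standard logistic conversion from the overall win rate (57.50% is roughly +53, 49.16% is roughly -6). Below is a minimal Python sketch of that conversion; it is a hypothetical helper for reading the tables, not part of Sayuri or its match tool.

```python
import math

def elo_difference(win_rate: float) -> float:
    """Elo gap implied by a head-to-head win rate under the logistic Elo model."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# Examples from the tables above.
print(round(elo_difference(0.5750)))  # vs LZ116: about +53
print(round(elo_difference(0.4916)))  # vs LZ127: about -6
```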
* 2024-4-12
    * Test a new network structure, mixer convolution: a transformer-like block without attention. The mixer's policy loss and WDL loss seem significantly better than the ResNet's. Each network is 6bx128c and trained on 150k games.
    * ![ploss](https://hackmd.io/_uploads/SJ8A47Ix0.png)
    * ![wloss](https://hackmd.io/_uploads/B1qTEmUlA.png)
    * [shengkelong](https://github.com/shengkelong) reported that he did not get good results with a transformer: the performance is bad and the relations in the attention map are a mess. One possible solution he proposed is Global Pooling or [Patch Embedding](https://arxiv.org/abs/2010.11929), which compresses the spatial size. But it is high risk. Think about why AlphaGo never uses Global Pooling: it may lose the local information needed by the policy head.
    * Another solution is to remove the attention mechanism, like [MLP-Mixer](https://arxiv.org/abs/2105.01601). My current version uses depthwise convolution instead of token-mixing. You may see related works [here](https://arxiv.org/abs/2203.06717) and [here](https://arxiv.org/abs/2201.09792). Note that we don't use any Patch Embedding. A rough sketch of this kind of block is shown after the 2024-5-6 entry below.
    * Relative performance: Residual (Winograd) > Mixer > Residual (Im2col).

</br>

* 2024-4-19
    * Played 1380k games (15bx192c played 285k games). Accumulated around $2.826 \times 10^{9}$ 20bx256c eval queries. The current weights' strength should be between LZ130 and LZ135. We lack GPUs now, so I only release the weights and do not provide a full match result.
    * Drop the learning rate to 0.002 (from 0.0025) because the last 80k games did not bring significant improvement. I am not really sure, so I only change it a little.
    * Test the 10b mixer network performance. After playing around 400 games, the mixer's win rate is around 54%. I think we get some benefit from the new structure and will add mixer blocks in the next training step. However, there are still some problems:
        * Although the value head performance of the SWA weights is good, the value head is unstable during the training process.
        * The evals per second are slower than conv3x3 (Winograd).

</br>

* 2024-5-6
    * Played 1640k games (15bx192c played 600k games).
    * Accumulated around $5.9118 \times 10^{9}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [151](https://zero.sjeng.org/networks/672342b58e62910f461cce138b8186b349539dbe98807bf202ab91a72b19d0c7.gz) weights. The Elo difference is +28.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 151.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 686 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     175          196          371 (54.08%)
      Leela Zero 0.17   147          168          315 (45.92%)
      ```
    * The strength is about the same as Leela Zero with the [154](https://zero.sjeng.org/networks/7ff174e6ecc146c30ad1d6fe60a7089eacee65dfe4cce051bf24e076abfc0b68.gz) weights. The Elo difference is -2.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 154.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 773 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     174          210          384 (49.68%)
      Leela Zero 0.17   176          213          389 (50.32%)
      ```
    * Learning rate changes:
        * Drop the learning rate to 0.0016 (from 0.002) when playing 1435k games. (2024-4-25)
        * Drop the learning rate to 0.00125 (from 0.0016) when playing 1545k games. (2024-5-1)
    * After the longer training, the pure mixer block does not perform as well as the residual block. Although the mixer block has a large receptive field, counter-intuitively it is not good at the life and death of large dragons. We tried some hybrid structures and found that residual blocks help with this.

<div id="sayuri-art" align="center">
    <br/>
    <h3>ownership of mixer-block</h3>
    <img src="https://hackmd.io/_uploads/rJPVVHLfA.png" alt="mixer-view" width="384"/>
    <h3>ownership of residual-block</h3>
    <img src="https://hackmd.io/_uploads/SygxSrIGC.png" alt="residual-view" width="384"/>
    <h3>ownership of hybrid-block (residual + mixer)</h3>
    <img src="https://hackmd.io/_uploads/HkgAHH8GC.png" alt="hybrid-view" width="384"/>
</div>
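As referenced in the 2024-4-12 entry, here is a rough PyTorch sketch of the kind of depthwise-convolution mixer block described there: a large-kernel depthwise convolution takes the place of MLP-Mixer's token-mixing MLP, and 1x1 convolutions act as the channel MLP. The block name, kernel size, expansion factor, and normalization choices are my own illustrative assumptions, not Sayuri's exact implementation.

```python
import torch
import torch.nn as nn

class DepthwiseMixerBlock(nn.Module):
    """Illustrative mixer block: a depthwise conv mixes spatially (the
    token-mixing step) and pointwise 1x1 convs mix channels, each wrapped
    in a residual connection."""
    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 2):
        super().__init__()
        # Spatial mixing: one large-kernel depthwise convolution per channel.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels),
            nn.BatchNorm2d(channels),
        )
        # Channel mixing: pointwise convolutions acting like a per-point MLP.
        self.channel = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * expansion, channels, 1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial(x)   # token-mixing step
        x = x + self.channel(x)   # channel-mixing step
        return x

if __name__ == "__main__":
    block = DepthwiseMixerBlock(channels=128)
    out = block(torch.randn(1, 128, 19, 19))
    print(out.shape)  # torch.Size([1, 128, 19, 19])
```

Because the depthwise convolution mixes spatially while keeping the full 19x19 layout, no Patch Embedding is required and the policy head keeps its local information.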
</br>

* 2024-5-10
    * Played 1710k games (15bx192c played 670k games).
    * Remove the KLD weighting for the optimistic policy.

</br>

* 2024-5-11
    * Played 1720k games (15bx192c played 680k games).
    * Accumulated around $6.6862 \times 10^{9}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [157](https://zero.sjeng.org/networks/d351f06e446ba10697bfd2977b4be52c3de148032865eaaf9efc9796aea95a0c.gz) weights. The Elo difference is +10.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 157.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 914 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     213          257          470 (51.42%)
      Leela Zero 0.17   200          244          444 (48.58%)
      ```

</br>

* 2024-5-17
    * Played 1840k games (15bx192c played 800k games).
    * Accumulated around $7.8433 \times 10^{9}$ 20bx256c eval queries.
    * Now use the SWA weights for the self-play games.
    * Drop the learning rate to 0.001 (from 0.00125) when playing 1825k games. (2024-5-16)

</br>

* 2024-5-19
    * Update the experimental executable and mixer-block weights [here](https://drive.google.com/drive/folders/1wH3pdEOHq1DNYuSvQbRK4q_ly3WgjXNa?usp=drive_link). Please read the ```README.txt``` file first. Compared with the baseline, the new net is about +70 Elo better than the ResNet-only version.

</br>

* 2024-5-21
    * Played 1915k games (15bx192c played 875k games).
    * Accumulated around $8.557 \times 10^{9}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [173](https://zero.sjeng.org/networks/33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.gz) weights. The Elo difference is -6.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 173.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 1000 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     222          270          492 (49.20%)
      Leela Zero 0.17   230          278          508 (50.80%)
      ```

</br>

* 2024-5-22
    * Played 1935k games (15bx192c played 895k games).
    * Accumulated around $8.747 \times 10^{9}$ 20bx256c eval queries.
    * Halt the 15bx192c training.
    * Some future work:
        * Based on Kobayashi's reply, a Gumbel-based model has difficulty learning to pass under Territory Scoring. My current implementation shows the same result. I guess the main reason is Sequential Halving: the agent refuses to pass because the win rate of pass is 0%, so it thinks other moves have a better win rate.
        * Version 4 weights support other activation functions, like Mish or Swish.
        * Improve my GTP match tool. Hiroshi Yamashita suggested that I follow floodgate's design, but that is not what I need. I also tried the CGOS implementation, but the performance is not better. Maybe BayesElo is my next choice.

</br>

* 2024-5-30
    * Start the 20bx256c training.
    * learning rate = 0.0005
    * batch size = 128
    * cPUCT = 0.5 (from 1.25); a sketch of the selection rule this constant enters is shown after this entry.
    * current replay buffer is 300000 games.
    * Please use this [version](https://github.com/CGLemon/Sayuri/releases/tag/dev-2024-6-2) or later for the last 20b network.

</br>
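For context on the cPUCT change above: in an AlphaZero-style search, cPUCT scales the exploration bonus that the prior policy and visit counts add on top of a move's value estimate, so lowering it from 1.25 to 0.5 makes selection greedier with respect to Q. Below is a minimal sketch of the textbook PUCT rule; Sayuri's actual selection formula, utility terms, and defaults may differ.

```python
import math

def puct_score(q: float, prior: float, parent_visits: int,
               child_visits: int, cpuct: float) -> float:
    """Textbook AlphaZero-style PUCT: value estimate plus an exploration
    bonus scaled by cpuct, the prior, and the visit counts."""
    exploration = cpuct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

# With a smaller cpuct the bonus shrinks, so the search trusts Q sooner
# and spends fewer playouts on low-prior, low-visit moves.
print(puct_score(q=0.52, prior=0.10, parent_visits=400, child_visits=30, cpuct=1.25))
print(puct_score(q=0.52, prior=0.10, parent_visits=400, child_visits=30, cpuct=0.50))
```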
* 2024-6-1
    * Played 1975k games (20bx256c played 40k games).
    * Accumulated around $9.518 \times 10^{9}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [174](https://zero.sjeng.org/networks/c9d70c413e589d338743bfa83783e23f378dc0b9aa98940a2fbd1d852dab8781.gz) weights. The Elo difference is +8.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 174.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 606 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     149          161          310 (51.16%)
      Leela Zero 0.17   142          154          296 (48.84%)
      ```
    * The strength is better than Leela Zero with the [ELFv0](http://zero.sjeng.org/networks/62b5417b64c46976795d10a6741801f15f857e5029681a42d02c9852097df4b9.gz) weights. The Elo difference is +60.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w ELFv0.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 400 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     105          129          234 (58.50%)
      Leela Zero 0.17    71           95          166 (41.50%)
      ```

</br>

* 2024-6-14
    * Played 2210k games (20bx256c played 275k games).
    * Accumulated around $1.4005 \times 10^{10}$ 20bx256c eval queries.
    * Drop the learning rate to 0.0003 (from 0.0005).
    * Progress is slow. It looks like the Gumbel effect cannot help the current weights.

</br>

* 2024-6-17
    * Played 2250k games (20bx256c played 315k games).
    * We double the playouts/visits for self-play (from 400 to 800).
    * A really weird result: the current weights have already reached ELFv1 level, yet they are weaker than LZ190. Based on [Computer Go Rating](https://github.com/breakwa11/GoAIRatings), ELFv1 should be stronger than LZ190 (see the cross-check after the 2024-6-20 entry below).

</br>

* 2024-6-20
    * Played 2285k games (20bx256c played 350k games).
    * Accumulated around $1.5722 \times 10^{10}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [190](https://zero.sjeng.org/networks/ef09cd530927e16599add3b4fc3215a37dce265296ccbb1f377669b3c469e60b.gz) weights. The Elo difference is -10.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w 190.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 581 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     124          158          282 (48.54%)
      Leela Zero 0.17   132          167          299 (51.46%)
      ```
    * The strength is about the same as Leela Zero with the [ELFv1](http://zero.sjeng.org/networks/d13c40993740cb77d85c838b82c08cc9c3f0fbc7d8c3761366e5d59e8f371cbd.gz) weights. The Elo difference is +16.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w ELFv1.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 446 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0      95          138          233 (52.24%)
      Leela Zero 0.17    85          128          213 (47.76%)
      ```

![sayuri-elo](https://hackmd.io/_uploads/By6vMLXLA.png)

</br>
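A quick cross-check of the weird result noted on 2024-6-17: if Elo differences composed transitively, the two head-to-head results above would place ELFv1 roughly 26 Elo below LZ190 through Sayuri, the opposite ordering from the public rating list. A back-of-the-envelope sketch under that simplifying assumption (logistic model plus transitivity):

```python
import math

def elo_difference(win_rate: float) -> float:
    """Elo gap implied by a head-to-head win rate (logistic model)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

sayuri_vs_lz190 = elo_difference(0.4854)   # about -10
sayuri_vs_elfv1 = elo_difference(0.5224)   # about +16

# (Sayuri - ELFv1) - (Sayuri - LZ190) = LZ190 - ELFv1
print(round(sayuri_vs_elfv1 - sayuri_vs_lz190))  # about 26
```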
* 2024-6-27
    * Played 2385k games (20bx256c played 450k games).
    * Accumulated around $1.8455 \times 10^{10}$ 20bx256c eval queries.
    * The strength is about the same as Leela Zero with the [ELFv2](http://zero.sjeng.org/networks/05dbca157002b9fd618145d22803beae0f7e4015af48ac50f246b9316e315544.gz) weights. The Elo difference is -2.
    * Sayuri: ```-w current_weights -t 1 -b 1 -p 400 --lcb-reduction 0 --score-utility-factor 0.1 --cpuct-init 0.5 --use-optimistic-policy --random-moves-factor 0.1 --random-moves-temp 0.8```
    * Leela Zero: ```-w ELFv2.gz --noponder -v 400 -g -t 1 -b 1 --timemanage off --randomcnt 30 --randomtemp 0.8```
    * Game result (played 876 games with Leela Zero):
      ```
      Name              black won    white won    total (win-rate)
      Sayuri v0.7.0     194          242          436 (49.77%)
      Leela Zero 0.17   196          244          440 (50.23%)
      ```
    * Compared with ELF OpenGo, our engine reduces the computation by around 250 times, surpassing KataGo g104's 50 times.
    * I think I will keep this run going because I am interested in the Gumbel issue on the later 20b network. However, can I afford the computation? Actually, I don't get **any** resources or help from my professor (```our Lab's budget becomes the professor's personal bonus :-)```).

</br>

* 2024-7-8
    * Played 2525k games (20bx256c played 590k games).
    * Accumulated around $2.2215 \times 10^{10}$ 20bx256c eval queries.
    * Fix the target policy issue, but it seems we don't get the benefit in the current games. Will explain it later.
    * In order to maximize strength before the UEC cup, drop the learning rate again.
        * learning rate = 0.00015 (from 0.0003)
        * batch size = 128
        * current replay buffer is 325000 games.

</br>

* Summary
    * Halt this run after UEC16 because we are busy with features for the next run, including a slight improvement of the network, fixing the Japanese-like rules, fixing the target policy, and policy surprise sampling.
    * We mention a disadvantage of the completed-Q target policy (part of Gumbel AlphaZero). Looking at the formula, the target policy is $P_{\text{target}}(a)=\text{Softmax}(P_{\text{logit}}(a) + \sigma(Q(a)))$, where $\sigma(\cdot)$ can be **any** monotonic function. In practice its scale is proportional to the total number of visits, which means we may trust Q too much in high-visit conditions. For example, the target policy becomes essentially one-hot once the number of visits exceeds 800, which makes the output policy too sharp. What's more, based on Yuki Kobayashi's results, it is hard to gain any benefit once the number of visits is over 200. Our solution is switching back to the AlphaZero target policy. (A toy numeric illustration of this sharpening effect is given at the end of this summary.)
    * The mixer block is not a novel network structure for the game of Go. For example, [Maru](https://drive.google.com/file/d/1H_yoY-dxkF0-PrGyeb0f2NozW6R3ri5h/view) at UEC16 also used the same idea to improve performance. But based on my experience, these transformer-like and transformer-inspired modules do not work well for Go, because (1) the speed (evaluations per second) is terrible, and (2) these modules may lose information. To overcome these disadvantages, (1) we remove as many conv 1x1 layers as possible and use a customized CUDA implementation instead of cuDNN, and (2) we only add the mixer blocks at the tail of the tower.
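To make the completed-Q sharpening described above concrete, here is a toy version of the target-policy formula, with a $\sigma$ whose scale simply grows linearly with the visit count. The logits, Q values, and the scale constant are made-up stand-ins for illustration, not Sayuri's or Gumbel AlphaZero's actual numbers.

```python
import numpy as np

def target_policy(logits, q, total_visits, c_scale=1.0):
    """Toy completed-Q target: softmax(logits + sigma(Q)), where sigma's
    scale grows with the total visit count. c_scale is a stand-in constant."""
    z = np.asarray(logits) + c_scale * total_visits * np.asarray(q)
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([1.0, 0.8, 0.5])     # made-up prior policy logits
q      = np.array([0.52, 0.51, 0.50])  # made-up value estimates

print(target_policy(logits, q, total_visits=50))   # roughly [0.58, 0.29, 0.13]
print(target_policy(logits, q, total_visits=800))  # roughly [1.00, 0.00, 0.00]
```

With the same small Q gaps, the target stays spread out at low visit counts but collapses to a near one-hot distribution at 800 visits, which is the over-sharpening described in the summary.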