## Symlog Predictions

> 1. Symlog predictions are designed to deal with varying scales of rewards and values.
> 2. Absolute and Huber losses fail because they stagnate learning.
> 3. Normalizing targets based on running statistics introduces non-stationarity into the optimization? (Need to check.)
> 4. The bi-symmetric logarithmic function:
> $\mathrm{symlog}(x)=\mathrm{sign}(x)\ln(|x|+1)$, whose inverse is $\mathrm{symexp}(x)=\mathrm{sign}(x)(\exp(|x|)-1)$. (A small code sketch is at the end of these notes.)
> 5. Its effectiveness may come from focusing more on small values and paying less attention to large ones: it compresses large magnitudes while staying roughly linear near zero, and it treats positive and negative values symmetrically.
> 6. They claim that with the symlog function there is no longer any need to truncate large rewards, introduce reward normalization, or handle extreme values (spikes?) separately. The idea is striking, but there are obvious limitations.

## World Model Learning

> 1. The recurrent structure of the overall model is common, but there are still things that need further understanding, such as the use of $x$ and the continuity of $z$ and $h$. A flow chart would help me understand it better.
>
> <img src="https://i.imgur.com/KiDbmBv.png" width = "300" height = "200" div align=center />
>
> 2. What role does the stop-gradient method play?
> Intuitively the loss uses it to pay more attention to the $p_\phi(z_t|h_t)$ term. But why? (See the stop-gradient sketch at the end of these notes.)
> 3. It is not completely clear to me why fixed hyperparameters can handle varying problems with this method. Maybe this claim mainly comes from its experimental results.

## Actor Critic Learning

> 1. $R^{\lambda}_t=r_t+{\gamma}c_t\left((1-\lambda)v_{\psi}(s_{t+1})+{\lambda}R^{\lambda}_{t+1}\right)$, seems like a simple linear combination? (A sketch of the backward recursion is at the end of these notes.)
> 2. Discretize the resulting range into a sequence $B$ of $K = 255$ equally spaced buckets? Needs further understanding.
> 3. The form is similar to distributional RL, but changed to a discrete version.
> 4. The twohot method seems interesting and new to me (sketched at the end of these notes).
> 5. $L_{critic}(\psi)=\sum^{T}_{t=1}E_{b_i\sim{y_t}}\left[-\ln{p_{\psi}(b_i|s_t)}\right]=-\sum^T_{t=1}y^T_t\ln{p_{\psi}(\cdot|s_t)}$. Does it mean sampling a $b_i$ from the distribution $y_t$ and simply maximizing its log-likelihood? I understand the theory, but I feel it would be difficult for me to directly reproduce the code for this. (A sketch is at the end of these notes.)
> 6. Apart from the points mentioned above, the other tricks used in the Actor Learning section seem common.
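
Since symlog/symexp is just an elementwise transform (item 4 of Symlog Predictions), here is a minimal JAX sketch to check the intuition from item 5 that large magnitudes get compressed while values near zero are barely changed; the test values are made up:

```python
import jax.numpy as jnp

def symlog(x):
    # sign(x) * ln(|x| + 1): roughly identity near zero, logarithmic for large |x|
    return jnp.sign(x) * jnp.log(jnp.abs(x) + 1.0)

def symexp(x):
    # Inverse transform: sign(x) * (exp(|x|) - 1)
    return jnp.sign(x) * (jnp.exp(jnp.abs(x)) - 1.0)

x = jnp.array([-1000.0, -1.0, 0.0, 0.5, 1000.0])
print(symlog(x))           # large magnitudes shrink to about +/-6.9, small values barely move
print(symexp(symlog(x)))   # recovers the original values, so predictions can be decoded back
```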
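
On the stop-gradient question in the World Model Learning section, my current reading is that the KL between posterior and prior appears twice, each time with one side stopped, and with a larger weight on the term that trains the prior $p_\phi(z_t|h_t)$; that asymmetry is why the loss "pays more attention" to the prior. A rough JAX sketch under that reading; the coefficients (0.5 / 0.1) and the 1-nat clip are my assumptions about the defaults, not something I have verified here:

```python
import jax.numpy as jnp
from jax.lax import stop_gradient

def categorical_kl(q, p, eps=1e-8):
    # KL(q || p) for categorical distributions, summed over the class dimension.
    return jnp.sum(q * (jnp.log(q + eps) - jnp.log(p + eps)), axis=-1)

def kl_balance(post, prior, beta_dyn=0.5, beta_rep=0.1, free_bits=1.0):
    # Dynamics term: the posterior is stopped, so gradients only move the prior
    # p(z_t | h_t) toward the encoder's posterior.
    dyn = categorical_kl(stop_gradient(post), prior)
    # Representation term: the prior is stopped, so gradients only gently
    # regularize the posterior toward the prior.
    rep = categorical_kl(post, stop_gradient(prior))
    # Clipping below free_bits keeps already-small KL terms from being over-optimized.
    return beta_dyn * jnp.maximum(dyn, free_bits) + beta_rep * jnp.maximum(rep, free_bits)
```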
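
For item 1 of Actor Critic Learning: the $\lambda$-return is indeed a linear blend of the one-step bootstrap and the longer return, but it has to be unrolled backward over the imagined trajectory. A sketch; the discount and $\lambda$ values are placeholders and the array names are mine:

```python
import jax.numpy as jnp

def lambda_returns(rewards, continues, values, bootstrap, gamma=0.997, lam=0.95):
    # rewards, continues, values: length-T arrays from an imagined rollout;
    # bootstrap is the critic value of the state after the final step.
    T = rewards.shape[0]
    next_values = jnp.concatenate([values[1:], jnp.asarray(bootstrap)[None]])
    returns = []
    R = jnp.asarray(bootstrap)
    # Walk backward through time:
    # R_t = r_t + gamma * c_t * ((1 - lambda) * v(s_{t+1}) + lambda * R_{t+1})
    for t in reversed(range(T)):
        R = rewards[t] + gamma * continues[t] * ((1.0 - lam) * next_values[t] + lam * R)
        returns.append(R)
    return jnp.stack(returns[::-1])
```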
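
For items 2 and 4 of Actor Critic Learning: twohot encoding writes a scalar target as weights on the two neighboring buckets, so that the weighted sum of bucket centers reproduces the target exactly. A sketch, assuming $K=255$ equally spaced buckets over a $[-20, 20]$ range in symlog space; the range and the choice to encode the symlog of the return are my assumptions:

```python
import jax.numpy as jnp

K = 255
buckets = jnp.linspace(-20.0, 20.0, K)   # equally spaced bucket centers (range is an assumption)

def twohot(x, buckets=buckets):
    # Spread probability mass over the two buckets surrounding x so that
    # sum(weights * buckets) equals x (after clipping to the bucket range).
    x = jnp.clip(x, buckets[0], buckets[-1])
    below = jnp.sum(buckets <= x) - 1            # index of the closest bucket at or below x
    above = jnp.minimum(below + 1, K - 1)
    gap = buckets[above] - buckets[below]
    w_above = jnp.where(gap > 0, (x - buckets[below]) / jnp.maximum(gap, 1e-8), 1.0)
    return jnp.zeros(K).at[below].add(1.0 - w_above).at[above].add(w_above)

# Example: encode a return of 37.0 in symlog space and decode it again.
target = jnp.sign(37.0) * jnp.log(jnp.abs(37.0) + 1.0)
y = twohot(target)
print(jnp.count_nonzero(y), jnp.sum(y * buckets))   # two nonzero weights, decodes back to ~symlog(37)
```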
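
For item 5: my understanding is that no sampling of $b_i$ is actually needed. Because the twohot target $y_t$ has at most two nonzero entries, the expectation $E_{b_i\sim y_t}[-\ln p_\psi(b_i|s_t)]$ collapses exactly into the weighted cross-entropy on the right-hand side, which is what gets implemented. A sketch assuming the critic outputs logits over the $K$ buckets; the function and argument names are mine:

```python
import jax
import jax.numpy as jnp

def critic_loss(logits, twohot_targets):
    # logits: (T, K) critic outputs; twohot_targets: (T, K) twohot-encoded lambda returns.
    # Cross-entropy against the soft twohot labels; no sampling of b_i is required,
    # since the expectation over y_t is just this weighted sum.
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.sum(twohot_targets * log_probs, axis=-1).mean()

def decode_value(logits, buckets):
    # The scalar value prediction is the expected bucket center, mapped back with symexp.
    expected = jnp.sum(jax.nn.softmax(logits, axis=-1) * buckets, axis=-1)
    return jnp.sign(expected) * (jnp.exp(jnp.abs(expected)) - 1.0)
```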