Bridging the Gap Between Value and Policy

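# Bridging the Gap Between Value and Policy

## 1. What problem is being solved?

Learning methods split into on-policy and off-policy, and each has its own drawback. On-policy methods suffer from sample inefficiency; off-policy methods do "not stably interact with function approximation", which leads to bias and unstable behavior. One of the paper's main goals is to combine the two and remove the drawbacks of both [(1)Introduction]. This is why every iteration performs one on-policy update and one off-policy update: the two keep each other in check. [(5)PCL]

![](https://i.imgur.com/1Bdh2Yp.png)

## 2. What exactly is "Path Consistency"?

To answer this, we first need the difference between this paper and soft Q-learning and A2C. The paper says the main difference lies in "multi-step path-wise consistencies" [(6)Related Work]. This can be explained in two parts.

The first part is multi-step. As I understand it, the point of multi-step is to look at many states at once: a path may perform well early on and poorly later, and with multi-step I do not take only the well-performing part into account.

The second part is "path-wise consistency". The paper states that when the current state s and the next state s' satisfy the so-called "temporal consistency property" (for every state and action):

![](https://i.imgur.com/Leoa7LP.png)

we are "strictly guaranteed" that pi = pi* and V = V*. In other words, the policy and the value are both optimal:

![](https://i.imgur.com/UcjiQmo.png)

Moreover, this "temporal consistency property" generalizes to multiple steps, giving the so-called "extended temporal consistency property":

![](https://i.imgur.com/CZ43OyM.png)

[(4)Consistency Between Optimal Value & Policy]

So why not use multi-step Q-learning? Because multi-step Q-learning uses the "Bellman hard-max backup": it first runs through several states and then computes the temporal error of the Q-value to do the update. This is not sound off-policy. After I put a trajectory into the replay buffer and then update the Q network, the rewards I get when I pull that trajectory back out of the buffer are the old rewards, whereas running with the new Q should yield better rewards. This is what the paper means by "the rewards received after a non-optimal action do not relate to the hard-max Q-values Q°." [(5.2)Connections to Actor-Critic and Q-learning]

According to Theorem 1 above, Path Consistency Learning does not run into this problem. With the entropy term in control, the relationship between the rewards and the latest network is guaranteed, and it holds both on-policy and off-policy. For every path, equations (11) and (13) are guaranteed to hold, hence the name "Path Consistency".

To achieve this, we need "Bridging the Gap Between Value and Policy", that is, writing the relationship between value and policy in a single equation:

![](https://i.imgur.com/Kpp3lMV.png)

[(4)Consistency Between Optimal Value & Policy]

This echoes the paper's title: "Bridging the Gap Between Value and Policy Based Reinforcement Learning". A2C and Soft Q-learning can ensure the same thing, but compared with this paper they lack two features, multi-step and simultaneous on- and off-policy updates, so they train less well than the method in this paper.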
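
To make the multi-step consistency from section 2 more concrete, here is a minimal NumPy sketch of the consistency error along one sub-trajectory. It assumes the d-step form `-V(s_0) + gamma^d * V(s_d) + sum_j gamma^j * (r_j - tau * log pi(a_j | s_j))`, which is my reading of the extended temporal consistency property; the function names, the argument layout, and the single-state toy MDP are all hypothetical choices for illustration, not the paper's code.

```python
import numpy as np


def path_consistency_error(values, rewards, log_probs, gamma=0.99, tau=0.1):
    """Soft consistency error for one sub-trajectory of d steps (a sketch).

    values:    V(s_0), ..., V(s_d)                      (length d + 1)
    rewards:   r(s_0, a_0), ..., r(s_{d-1}, a_{d-1})    (length d)
    log_probs: log pi(a_j | s_j) for the same d steps   (length d)
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    d = len(rewards)
    if len(values) != d + 1 or len(log_probs) != d:
        raise ValueError("need d rewards, d log-probs and d + 1 values")
    discounts = gamma ** np.arange(d)
    # Discounted sum of entropy-regularized rewards along the path.
    soft_return = np.sum(discounts * (rewards - tau * log_probs))
    return -values[0] + gamma ** d * values[-1] + soft_return


def soft_optimal_single_state(rewards, gamma, tau):
    """Closed-form soft-optimal V and log pi for a toy MDP with ONE state.

    Assumption for illustration only: every action returns to the same
    state, so V* = tau * logsumexp(r / tau) / (1 - gamma) and
    pi* = softmax(r / tau).
    """
    rewards = np.asarray(rewards, dtype=float)
    soft_max = tau * np.log(np.sum(np.exp(rewards / tau)))
    v_star = soft_max / (1.0 - gamma)
    log_pi_star = (rewards - soft_max) / tau
    return v_star, log_pi_star
```

In the toy MDP, plugging the soft-optimal value and policy into `path_consistency_error` should give an error of zero (up to floating-point noise) for any action sequence, including ones the optimal policy would rarely pick. That is the off-policy property the notes describe. Replacing the log-probabilities with those of a non-optimal policy, for example a uniform one, makes the error nonzero, and that nonzero error is what training would try to shrink.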