###### tags: `reward design`
# reward design

# old reward
## policy
### preprocess
$d = \text{goal\_position} - \text{endpoint\_position}$
$d *= \text{self.edge}$
### init
$r = -||d||$
### bonus
```python=
# check the tighter threshold first; otherwise the 1.5 bonus can never fire
if distance < 0.015:
    r += 1.5
elif distance < 0.03:
    r += 0.5
```

# new reward (11/9)
## result
- 200 episodes: https://drive.google.com/file/d/11Kw9nDR1tqqpp9lskFintnki8Xv21eqY/view?usp=sharing
- ~1000 episodes: https://drive.google.com/file/d/1s-AIROvN7PSVSKNcjpugIkRSCpiHBcEs/view?usp=sharing

### distance
*[plot: distance reward]*
### orientation
*[plot: orientation reward]*
### total
*[plots: total reward]*

## policy
### preprocess
$d = \text{goal\_position} - \text{endpoint\_position}$
$d *= \text{self.edge}$
### init
$r = ||\text{prev\_d}|| - ||d||$
### bonus
```python=
if distance < NEAR_DISTANCE:
    on_goal += 1
    if left_bonus_count > 0:
        # bonus decays linearly with elapsed steps; it turns negative past MAX_STEP / 2
        r += BONUS_REWARD * (MAX_STEP - 2 * finished_step) / MAX_STEP
        left_bonus_count -= 1
    if on_goal >= ON_GOAL_FINISH_COUNT:
        done = True
else:
    on_goal = 0
```
### orientation
$r += 1000(R[4] - 0.001)$
### postprocess
$\text{prev\_d} = d$
$\text{finished\_step} += 1$
## training parameter
> max_steps_per_episode=100 (**ddpg.gin**)
```python=
# constants
NEAR_DISTANCE = 0.05
BONUS_REWARD = 10
ON_GOAL_FINISH_COUNT = 5
MAX_STEP = 100

# init
finished_step = 0
prev_d = self.goal_np_p - np.array(self.ENDP.getPosition())
left_bonus_count = ON_GOAL_FINISH_COUNT
```
## Note
- Another idea still under consideration: scale the earned reward down by the number of steps already taken.
- If the new reward does not work well, try initializing $r$ from both the distance to the goal and how much closer the arm got since the previous step.
- `left_bonus_count` exists to stop the agent from farming `BONUS_REWARD` by repeatedly leaving and re-entering the target region before `done`.
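Putting the pieces above together, here is a minimal, self-contained sketch of one step of the new reward. It is an illustration rather than the training code: the function name `step_reward`, the explicit state arguments, the `r4` parameter (standing in for the `R[4]` orientation term), and the reading of `distance` in the bonus pseudocode as $||d||$ after scaling are all assumptions introduced here.

```python=
import numpy as np

# constants from the training-parameter section
NEAR_DISTANCE = 0.05
BONUS_REWARD = 10
ON_GOAL_FINISH_COUNT = 5
MAX_STEP = 100

def step_reward(goal_pos, end_pos, prev_d, edge, r4,
                finished_step, on_goal, left_bonus_count):
    """One step of the new (11/9) reward; returns (r, done) plus updated state."""
    # preprocess: scaled offset between goal and end-effector positions
    d = (np.asarray(goal_pos) - np.asarray(end_pos)) * edge
    distance = np.linalg.norm(d)

    # init: reward the progress made since the previous step
    r = np.linalg.norm(prev_d) - distance

    # bonus: near the goal, pay a step-decayed bonus a limited number of times
    done = False
    if distance < NEAR_DISTANCE:
        on_goal += 1
        if left_bonus_count > 0:
            r += BONUS_REWARD * (MAX_STEP - 2 * finished_step) / MAX_STEP
            left_bonus_count -= 1
        if on_goal >= ON_GOAL_FINISH_COUNT:
            done = True
    else:
        on_goal = 0

    # orientation shaping: r4 stands in for the R[4] term used above
    r += 1000 * (r4 - 0.001)

    # postprocess
    prev_d = d
    finished_step += 1
    return r, done, prev_d, finished_step, on_goal, left_bonus_count
```

Returning the updated state explicitly also makes the anti-farming role of `left_bonus_count` easy to unit-test outside the simulator.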
# PPO Agent reward design
### original
Panda: "I'm so afraid of being punished that I'd rather not move at all."
The reward's absolute value can be seen to stay at only about $10^{-9}$.
*[plots: reward curve]*

### second revision
`newObservation[0:7]` holds the motor positions that robotController reports back to supervisorController.
```python=
# penalize coming within 0.05 rad of any joint limit
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or\
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or\
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or\
   newObservation[3]-(-3.072)<0.05 or -0.0698-newObservation[3]<0.05 or\
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or\
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or\
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if over the limit, reward = -1
else:
    # newObservation[-1] is the current distance to the target
    if newObservation[-1] < 0.01:
        reward = 10
    elif newObservation[-1] < 0.05:
        reward = 5
    elif newObservation[-1] < 0.1:
        reward = 1
    else:
        # positive when closer than the previous step (preL2norm)
        reward = -(newObservation[-1] - supervisor.preL2norm)
```
The episode is set to stop once the distance falls below 0.01. At the moment the arm approaches the target at first, then hovers around the sphere, but finally starts moving away, possibly to avoid touching the target and ending the episode.

### third revision
Changed the reward mechanism (see the bonus section above): positive rewards now decay with the number of elapsed steps.
```python=
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or\
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or\
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or\
   newObservation[3]-(-3.072)<0.05 or -0.0698-newObservation[3]<0.05 or\
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or\
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or\
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if over the limit, reward = -1
else:
    if newObservation[-1] < 0.01:
        reward = 10
    elif newObservation[-1] < 0.05:
        reward = 5
    elif newObservation[-1] < 0.1:
        reward = 1
    else:
        reward = -(newObservation[-1] - supervisor.preL2norm)

if reward > 0:
    # scale positive rewards down as the episode progresses
    reward = reward * (supervisor.stepsPerEpisode - step) / supervisor.stepsPerEpisode
```

### fourth revision
The reward is now mainly made of deductions.
```python=
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or\
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or\
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or\
   newObservation[3]-(-3.072)<0.05 or -0.0698-newObservation[3]<0.05 or\
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or\
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or\
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if over the limit, reward = -1
else:
    if newObservation[-1] < 0.01:
        reward = 10
    elif newObservation[-1] < 0.05:
        reward = 5
    elif newObservation[-1] < 0.1:
        reward = 1
    else:
        # sign flipped relative to the second revision
        reward = (newObservation[-1] - supervisor.preL2norm)
```

### fifth revision: longer run
The second revision was left to train for much longer and shows a gradual convergence trend. Simulated world time: 47 hr.
*[plots: training curves]*
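One maintainability note: the seven near-limit conditions repeated in every revision above can be collapsed into a table-driven check. This is only a sketch; `JOINT_LIMITS`, `MARGIN`, and `near_joint_limit` are names introduced here, not part of the existing controllers, with the limit values copied from the inline conditions.

```python=
# Panda joint limits (rad), copied from the inline conditions above
JOINT_LIMITS = [
    (-2.897, 2.897),
    (-1.763, 1.763),
    (-2.8973, 2.8973),
    (-3.072, -0.0698),
    (-2.8973, 2.8973),
    (-0.0175, 3.7525),
    (-2.897, 2.897),
]
MARGIN = 0.05  # same 0.05 rad margin as the inline checks

def near_joint_limit(obs):
    """True if any of the first seven joint positions is within MARGIN of a limit."""
    return any(q - lo < MARGIN or hi - q < MARGIN
               for q, (lo, hi) in zip(obs[:7], JOINT_LIMITS))
```

Each revision's opening branch then reads `reward = -1 if near_joint_limit(newObservation) else ...`, and a single table keeps the limits from drifting out of sync across copies.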