###### tags: `reward design`
# reward design

# old reward
## policy
### preprocess
$d = \text{goal\_position} - \text{endpoint\_position}$
$d *= \text{self.edge}$
### init
$r = -||d||$
### bonus
```python=
# check the tighter threshold first; otherwise the 1.5 bonus can never fire
if distance < 0.015:
    r += 1.5
elif distance < 0.03:
    r += 0.5
```

# new reward (11/9)
## result
- 200 episodes: https://drive.google.com/file/d/11Kw9nDR1tqqpp9lskFintnki8Xv21eqY/view?usp=sharing
- ~1000 episodes: https://drive.google.com/file/d/1s-AIROvN7PSVSKNcjpugIkRSCpiHBcEs/view?usp=sharing

### distance
*[plot: distance reward]*
### orientation
*[plot: orientation reward]*
### total
*[plots: total reward]*

## policy
### preprocess
$d = \text{goal\_position} - \text{endpoint\_position}$
$d *= \text{self.edge}$
### init
$r = ||\text{prev\_d}|| - ||d||$
### bonus
```python=
if distance < NEAR_DISTANCE:
    on_goal += 1
    if left_bonus_count > 0:
        # bonus decays linearly with elapsed steps; it turns negative past MAX_STEP / 2
        r += BONUS_REWARD * (MAX_STEP - 2 * finished_step) / MAX_STEP
        left_bonus_count -= 1
    if on_goal >= ON_GOAL_FINISH_COUNT:
        done = True
else:
    on_goal = 0
```
### orientation
$r += 1000(R[4] - 0.001)$
### postprocess
$\text{prev\_d} = d$
$\text{finished\_step} += 1$
## training parameter
> max_steps_per_episode=100 (**ddpg.gin**)
```python=
# constants
NEAR_DISTANCE = 0.05
BONUS_REWARD = 10
ON_GOAL_FINISH_COUNT = 5
MAX_STEP = 100

# init
finished_step = 0
prev_d = self.goal_np_p - np.array(self.ENDP.getPosition())
left_bonus_count = ON_GOAL_FINISH_COUNT
```
## Note
- Another idea still under consideration: scale the earned reward down by the number of steps already taken.
- If the new reward does not work well, try initializing $r$ from both the distance to the goal and how much closer the arm got since the previous step.
- `left_bonus_count` exists to stop the agent from farming `BONUS_REWARD` by repeatedly leaving and re-entering the target region before `done`.
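Putting the pieces above together, here is a minimal, self-contained sketch of one step of the new reward. It is an illustration rather than the training code: the function name `step_reward`, the explicit state arguments, the `r4` parameter (standing in for the `R[4]` orientation term), and the reading of `distance` in the bonus pseudocode as $||d||$ after scaling are all assumptions introduced here.

```python=
import numpy as np

# constants from the training-parameter section
NEAR_DISTANCE = 0.05
BONUS_REWARD = 10
ON_GOAL_FINISH_COUNT = 5
MAX_STEP = 100

def step_reward(goal_pos, end_pos, prev_d, edge, r4,
                finished_step, on_goal, left_bonus_count):
    """One step of the new (11/9) reward; returns (r, done) plus updated state."""
    # preprocess: scaled offset between goal and end-effector positions
    d = (np.asarray(goal_pos) - np.asarray(end_pos)) * edge
    distance = np.linalg.norm(d)

    # init: reward the progress made since the previous step
    r = np.linalg.norm(prev_d) - distance

    # bonus: near the goal, pay a step-decayed bonus a limited number of times
    done = False
    if distance < NEAR_DISTANCE:
        on_goal += 1
        if left_bonus_count > 0:
            r += BONUS_REWARD * (MAX_STEP - 2 * finished_step) / MAX_STEP
            left_bonus_count -= 1
        if on_goal >= ON_GOAL_FINISH_COUNT:
            done = True
    else:
        on_goal = 0

    # orientation shaping: r4 stands in for the R[4] term used above
    r += 1000 * (r4 - 0.001)

    # postprocess
    prev_d = d
    finished_step += 1
    return r, done, prev_d, finished_step, on_goal, left_bonus_count
```

Returning the updated state explicitly also makes the anti-farming role of `left_bonus_count` easy to unit-test outside the simulator.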
# PPO Agent reward design
### original
Panda: "I'm so afraid of being punished that I'd rather not move at all."
The reward's absolute value can be seen to stay at only about $10^{-9}$.
*[plots: reward curve]*

### second revision
`newObservation[0:7]` holds the motor positions that robotController reports back to supervisorController.
```python=
# penalize coming within 0.05 rad of any joint limit
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or\
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or\
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or\
   newObservation[3]-(-3.072)<0.05 or -0.0698-newObservation[3]<0.05 or\
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or\
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or\
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if over the limit, reward = -1
else:
    # newObservation[-1] is the current distance to the target
    if newObservation[-1] < 0.01:
        reward = 10
    elif newObservation[-1] < 0.05:
        reward = 5
    elif newObservation[-1] < 0.1:
        reward = 1
    else:
        # positive when closer than the previous step (preL2norm)
        reward = -(newObservation[-1] - supervisor.preL2norm)
```
The episode is set to stop once the distance falls below 0.01. At the moment the arm approaches the target at first, then hovers around the sphere, but finally starts moving away, possibly to avoid touching the target and ending the episode.

### third revision
Changed the reward mechanism (see the bonus section above): positive rewards now decay with the number of elapsed steps.
```python=
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or\
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or\
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or\
   newObservation[3]-(-3.072)<0.05 or -0.0698-newObservation[3]<0.05 or\
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or\
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or\
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if over the limit, reward = -1
else:
    if newObservation[-1] < 0.01:
        reward = 10
    elif newObservation[-1] < 0.05:
        reward = 5
    elif newObservation[-1] < 0.1:
        reward = 1
    else:
        reward = -(newObservation[-1] - supervisor.preL2norm)

if reward > 0:
    # scale positive rewards down as the episode progresses
    reward = reward * (supervisor.stepsPerEpisode - step) / supervisor.stepsPerEpisode
```

### fourth revision
The reward is now mainly made of deductions.
```python=
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or\
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or\
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or\
   newObservation[3]-(-3.072)<0.05 or -0.0698-newObservation[3]<0.05 or\
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or\
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or\
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if over the limit, reward = -1
else:
    if newObservation[-1] < 0.01:
        reward = 10
    elif newObservation[-1] < 0.05:
        reward = 5
    elif newObservation[-1] < 0.1:
        reward = 1
    else:
        # sign flipped relative to the second revision
        reward = (newObservation[-1] - supervisor.preL2norm)
```

### fifth revision: longer run
The second revision was left to train for much longer and shows a gradual convergence trend. Simulated world time: 47 hr.
*[plots: training curves]*
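One maintainability note: the seven near-limit conditions repeated in every revision above can be collapsed into a table-driven check. This is only a sketch; `JOINT_LIMITS`, `MARGIN`, and `near_joint_limit` are names introduced here, not part of the existing controllers, with the limit values copied from the inline conditions.

```python=
# Panda joint limits (rad), copied from the inline conditions above
JOINT_LIMITS = [
    (-2.897, 2.897),
    (-1.763, 1.763),
    (-2.8973, 2.8973),
    (-3.072, -0.0698),
    (-2.8973, 2.8973),
    (-0.0175, 3.7525),
    (-2.897, 2.897),
]
MARGIN = 0.05  # same 0.05 rad margin as the inline checks

def near_joint_limit(obs):
    """True if any of the first seven joint positions is within MARGIN of a limit."""
    return any(q - lo < MARGIN or hi - q < MARGIN
               for q, (lo, hi) in zip(obs[:7], JOINT_LIMITS))
```

Each revision's opening branch then reads `reward = -1 if near_joint_limit(newObservation) else ...`, and a single table keeps the limits from drifting out of sync across copies.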