# Deepbots Walkthrough

## Official Deepbots Introduction

https://github.com/aidudezzz/deepbots

Deepbots is a simple framework which is used as "middleware" between the free and open-source Cyberbotics' Webots robot simulator and Reinforcement Learning algorithms. When it comes to Reinforcement Learning, the OpenAI gym environment has been established as the most used interface between the actual application and the RL algorithm. Deepbots is a framework which follows the OpenAI gym environment interface logic in order to be used by Webots applications.

### Overview of the deepbots package

![](https://i.imgur.com/snnG5Oo.png)

## supervisor.py

Get the selected action from the agent:

```python==129
selectedAction, actionProb = agent.work(observation, type_="selectAction")
```

Step the environment and observe the result; the four return values correspond to <code>get_observation</code>, <code>get_reward</code>, <code>is_done</code>, and <code>get_info</code>:

```python==129
newObservation, reward, done, info = supervisor.step([selectedAction])
```

:::info
:pushpin: Because I want the reward to depend on the post-step <code>newObservation</code>, I added the following code right after the line above to compute the reward myself.

```python==141
# compute reward here
## do not get too close to the joint limit values
# joint position limits:
# [-2.897, 2.897], [-1.763, 1.763], [-2.8973, 2.8973], [-3.072, -0.0698],
# [-2.8973, 2.8973], [-0.0175, 3.7525], [-2.897, 2.897]
if newObservation[0]-(-2.897)<0.05 or 2.897-newObservation[0]<0.05 or \
   newObservation[1]-(-1.763)<0.05 or 1.763-newObservation[1]<0.05 or \
   newObservation[2]-(-2.8973)<0.05 or 2.8973-newObservation[2]<0.05 or \
   newObservation[3]-(-3.072)<0.05 or -0.0697976-newObservation[3]<0.05 or \
   newObservation[4]-(-2.8973)<0.05 or 2.8973-newObservation[4]<0.05 or \
   newObservation[5]-(-0.0175)<0.05 or 3.7525-newObservation[5]<0.05 or \
   newObservation[6]-(-2.897)<0.05 or 2.897-newObservation[6]<0.05:
    reward = -1  # if one of the motors is near its limit, reward = -1
else:
    if newObservation[-1] < 0.01:
        reward = 10  #*((supervisor.stepsPerEpisode - step)/supervisor.stepsPerEpisode)
    elif newObservation[-1] < 0.05:
        reward = 5   #*((supervisor.stepsPerEpisode - step)/supervisor.stepsPerEpisode)
    elif newObservation[-1] < 0.1:
        reward = 1   #*((supervisor.stepsPerEpisode - step)/supervisor.stepsPerEpisode)
    else:
        # positive when the distance to the target decreased since the last step
        reward = -(newObservation[-1] - supervisor.preL2norm)
    supervisor.preL2norm = newObservation[-1]

# print("Beaker: ", supervisor.beaker.getOrientation(), "=>", supervisor.beaker.getOrientation()[4])
reward = reward + (supervisor.beaker.getOrientation()[4] - 1.0)  # we want the beaker to remain horizontal
```
:::
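Side note: the seven hand-written comparisons above are easy to get wrong when a limit changes. The same check can be written as a loop over a limits table; this is only a maintainability sketch, not part of the original controller, and the `JOINT_LIMITS` table and helper name are my own.

```python
# Equivalent, table-driven version of the limit check above (illustrative only).
JOINT_LIMITS = [(-2.897, 2.897), (-1.763, 1.763), (-2.8973, 2.8973),
                (-3.072, -0.0698), (-2.8973, 2.8973), (-0.0175, 3.7525),
                (-2.897, 2.897)]

def near_joint_limit(observation, margin=0.05):
    """Return True if any joint position is within `margin` of its limit."""
    return any(pos - low < margin or high - pos < margin
               for pos, (low, high) in zip(observation, JOINT_LIMITS))
```

With this helper the whole condition collapses to `if near_joint_limit(newObservation): reward = -1`, and the limits live in one place for both the reward check and `motorToRange` below.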
Next, we store the observation of the environment in the agent's memory, so the agent can train on it later.

```python==172
trans = Transition(observation, selectedAction, actionProb, reward, newObservation)
agent.storeTransition(trans)
```

Finally, we update the neural network. <code>done</code> comes from the <code>supervisor.step([selectedAction])</code> call above.

```python==176
if done:
    if step == 0:
        print("0 Step but done?")
        continue
    print("done gogo")
    # Save the episode's score
    supervisor.episodeScoreList.append(supervisor.episodeScore)
    agent.trainStep(batchSize=step)
    solved = supervisor.solved()  # Check whether the task is solved
    agent.save('')
    break
```

The <code>fp</code> lines below are something I wrote to log the training progress; they feed the trend plots I often showed at meetings, so when training looks unhealthy you can inspect the curve instead of wasting time. To draw the plot, run <code>python3 checkConvergence.py</code> in the same directory as supervisor.py.

```python==191
fp = open("Episode-score.txt", "a")
fp.write(str(supervisor.episodeScore) + '\n')
fp.close()
```
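checkConvergence.py itself is not listed in this handbook, so the following is only a guess at its shape: a minimal script that reads Episode-score.txt and plots the score trend with matplotlib. Treat the file handling and labels as my assumptions, not the actual script.

```python
# Hypothetical reconstruction of checkConvergence.py; the real script may differ.
import matplotlib.pyplot as plt

with open("Episode-score.txt") as fp:
    scores = [float(line) for line in fp if line.strip()]

plt.plot(scores)  # one point per episode, in the order they were logged
plt.xlabel("Episode")
plt.ylabel("Episode score")
plt.title("Training convergence")
plt.show()
```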
The remaining lines run after training stops: once you have defined a criterion over a certain number of training episodes, they report whether the task counts as solved. <code>solved</code> corresponds to <code>solved()</code>.

```python==197
if not solved:
    print("Task is not solved, deploying agent for testing...")
elif solved:
    print("Task is solved, deploying agent for testing...")
```

## robotController.py

The user-defined function below keeps any command sent to the arm within each motor's position range. Note that the supervisorController, where the out-of-range command is generated, still deducts reward for it.

```python==3
def motorToRange(motorPosition, i):
    # Clamp the target position of joint i to its physical limits
    # (requires `import numpy as np` at the top of the file).
    if i == 0:
        motorPosition = np.clip(motorPosition, -2.897, 2.897)
    elif i == 1:
        motorPosition = np.clip(motorPosition, -1.763, 1.763)
    elif i == 2:
        motorPosition = np.clip(motorPosition, -2.8973, 2.8973)
    elif i == 3:
        motorPosition = np.clip(motorPosition, -3.072, -0.07)
    elif i == 4:
        motorPosition = np.clip(motorPosition, -2.8973, 2.8973)
    elif i == 5:
        motorPosition = np.clip(motorPosition, -0.0175, 3.7525)
    elif i == 6:
        motorPosition = np.clip(motorPosition, -2.897, 2.897)
    return motorPosition
```

The class constructor sets up the position sensors and motors we need.

```python==21
class PandaRobot(RobotEmitterReceiverCSV):
    def __init__(self):
        super().__init__()
        self.positionSensorList = []
        for i in range(7):
            positionSensorName = 'positionSensor' + str(i + 1)
            positionSensor = self.robot.getPositionSensor(positionSensorName)
            positionSensor.enable(self.get_timestep())
            self.positionSensorList.append(positionSensor)
        self.motorList = []
        for i in range(7):
            motorName = 'motor' + str(i + 1)
            motor = self.robot.getMotor(motorName)  # Get the motor handle
            motor.setPosition(float('inf'))         # Infinite position enables velocity control
            motor.setVelocity(0.0)                  # Zero out starting velocity
            self.motorList.append(motor)
        motorName = 'finger motor L'
        motor = self.robot.getMotor(motorName)  # Get the gripper motor handle
        motor.setPosition(0.02)                 # Set target finger position
        motor.setVelocity(0.2)                  # Limit finger speed
        self.motorList.append(motor)
        motorName = 'finger motor R'
        motor = self.robot.getMotor(motorName)  # Get the gripper motor handle
        motor.setPosition(0.02)                 # Set target finger position
        motor.setVelocity(0.2)                  # Limit finger speed
        self.motorList.append(motor)
```

The class's create_message communicates with supervisorController.py: it reads each position sensor, converts the value to a string, and returns the list as the message.

```python==47
def create_message(self):
    # Read the sensor values, convert them to strings and save them in a list
    message = [str(self.positionSensorList[0].getValue()), str(self.positionSensorList[1].getValue()),
               str(self.positionSensorList[2].getValue()), str(self.positionSensorList[3].getValue()),
               str(self.positionSensorList[4].getValue()), str(self.positionSensorList[5].getValue()),
               str(self.positionSensorList[6].getValue())]
    return message
```

The class's use_message_data runs when the supervisor sends information back (such as the action the agent selected); it decodes the message here and drives the motors accordingly. The action arrives as a single integer encoding one of three choices per joint (hold, +0.05 rad, or -0.05 rad), i.e., a seven-digit base-3 number; a round-trip example is sketched after the code.

```python==55
def use_message_data(self, message):
    # print("robot get this message: ", message)
    code = int(message[0])
    setVelocityList = []
    # decode the action: seven base-3 digits, one per joint
    for i in range(7):
        setVelocityList.append(code % 3)
        code = int(code / 3)
    # print("decode message to action: ", setVelocityList)

    # # version 1: command velocities
    # for i in range(7):
    #     action = setVelocityList[i]
    #     if action == 2:
    #         motorSpeed = -1.0
    #     elif action == 1:
    #         motorSpeed = 1.0
    #     else:
    #         motorSpeed = 0.0
    #     self.motorList[i].setVelocity(motorSpeed)  # Set the motors' velocities based on the action received

    # version 2: command positions
    for i in range(7):
        action = setVelocityList[i]
        if action == 2:
            motorPosition = self.positionSensorList[i].getValue() - 0.05  # step joint i down
        elif action == 1:
            motorPosition = self.positionSensorList[i].getValue() + 0.05  # step joint i up
        else:
            motorPosition = self.positionSensorList[i].getValue()         # hold joint i
        motorPosition = motorToRange(motorPosition, i)
        self.motorList[i].setVelocity(2.5)
        self.motorList[i].setPosition(motorPosition)
```
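To make the encoding concrete, here is a small, self-contained round trip between a per-joint action list and the integer code that use_message_data expects. The helper names are mine, for illustration only; the supervisor side of this project presumably performs the matching encode step.

```python
# Illustrative round trip for the base-3 action encoding (helper names are mine).
def encode_action(perJointActions):
    """Pack seven per-joint actions (each 0, 1 or 2) into one integer."""
    code = 0
    for action in reversed(perJointActions):
        code = code * 3 + action
    return code

def decode_action(code):
    """Unpack the integer into per-joint actions, as in use_message_data."""
    actions = []
    for _ in range(7):
        actions.append(code % 3)
        code //= 3
    return actions

assert decode_action(encode_action([2, 0, 1, 1, 0, 2, 0])) == [2, 0, 1, 1, 0, 2, 0]
```

Seven ternary digits give 3^7 = 2187 possible codes, which is the size of the discrete action space the agent chooses from.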