# Multi-Agent RL (MARL) Taxonomy

![](https://i.imgur.com/Urjxt4q.png)

a. Analysis of emergent behaviors: all agents are placed in the environment and trained independently, with no sharing or communication between them.
b. Learning communication: agents exchange messages and share information during learning to reach better results.
c. Learning cooperation: agents learn to cooperate with each other during training to reach better results.
d. Agents modeling agents: agents do not communicate directly; instead, each one models the others from their observable behavior and uses that model to inform its own decisions.

**Decentralized system**

![](https://hackmd.io/_uploads/Hy2LQeuVh.png)

In the decentralized approach, each agent is trained on its own, without taking the other agents into account. Every agent treats the others as part of the environment. This makes the environment non-stationary, so convergence is not guaranteed.

**Centralized approach**

![](https://hackmd.io/_uploads/rkIpmg_En.png)

A single policy is learned for all agents. It takes the current state of the environment as input and outputs a joint policy for the agents to cooperate under a global reward.

**Self-Play**

Training agents for adversarial games is hard because the agent needs an opponent to train against. If that opponent is set too strong, the agent loses every game and learns a poor policy.

![](https://i.imgur.com/dMjMGy1.png)

Self-Play solves this problem: the agent being trained plays both sides, player and opponent, and the opponent's level is raised as the agent itself improves.

**Input**
- Game state: the current state of the game is used as input to decide the next action.
- Opponent policy: the agent also takes the opponent's policy into account.
- Policy parameters: the agent's policy parameters, which keep being updated during training.

**Output**
- Action: based on the game state and the policy parameters, the agent outputs an action.
- Reward: after the agent acts, the environment returns a reward, which the agent uses to update its policy parameters.
- New parameters: continuously updated and carried into the next round of training.

**Advantages**
Training data is generated automatically without any external data, and the resulting agents learn stronger play.

---

# Cooperative Environments

Idea 1: place two agents in a level with a door controlled by a pressure switch (an agent has to stand on it) and a coin behind the door. The two agents must cooperate to collect the coin and clear the level.

Idea 2: a box that can only be pushed by two or more agents, used to escape or to press a button. A sketch of how the shared reward for such setups could be wired up follows below.
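Both ideas depend on a shared reward: neither agent can reach the coin or move the box alone, so the reward has to go to the team rather than to an individual. Below is a minimal sketch of how that could look with ML-Agents' `SimpleMultiAgentGroup`; the `DoorEnvController` class, its fields, and the reward values are hypothetical and not taken from the notes above.

```
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

// Hypothetical controller for the "switch + door + coin" idea:
// both agents share one group reward, so cooperation is what gets reinforced.
public class DoorEnvController : MonoBehaviour
{
    public List<Agent> agents;              // the two cooperating agents
    public int maxEnvironmentSteps = 5000;  // arbitrary episode length

    SimpleMultiAgentGroup m_AgentGroup;
    int m_StepCount;

    void Start()
    {
        m_AgentGroup = new SimpleMultiAgentGroup();
        foreach (var agent in agents)
        {
            // rewards and episode ends are now handled per group, not per agent
            m_AgentGroup.RegisterAgent(agent);
        }
    }

    void FixedUpdate()
    {
        m_StepCount += 1;

        // small existential penalty so the group tries to finish quickly
        m_AgentGroup.AddGroupReward(-1f / maxEnvironmentSteps);

        if (m_StepCount >= maxEnvironmentSteps)
        {
            // timed out: end the episode without a terminal reward
            m_AgentGroup.GroupEpisodeInterrupted();
            ResetScene();
        }
    }

    // hypothetical hook, called by the coin's trigger collider when an agent reaches it
    public void OnCoinCollected()
    {
        // both agents get the reward, including the one still standing on the switch
        m_AgentGroup.AddGroupReward(1f);
        m_AgentGroup.EndGroupEpisode();
        ResetScene();
    }

    void ResetScene()
    {
        m_StepCount = 0;
        // reset the agents, door, switch, and coin positions here
    }
}
```

Group rewards like this are normally trained with the `poca` trainer (MA-POCA) rather than the `ppo` trainer used for the volleyball example later on.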
---

# Competitive/Adversarial Environments

Two teams each try to beat the other and win; one-on-one badminton or volleyball are simple examples. Below is a one-on-one volleyball game built on this idea.

![](https://i.imgur.com/u90TqTj.png)

Rule: the ball must not touch the ground.
Goal: make the ball land on the opponent's court.
Agents: two competing agents, one per side.

**Environment setup:**

First place the court, the boundaries, the net, and the ball. Then place the objects each side needs: the players (agents) and the floors for the purple and blue sides. Once the objects are placed, set up the environment variables, starting with the overall game environment (which objects the game contains and what the rules are).

```
public enum Team
{
    Blue = 0,
    Purple = 1,
    Default = 2
}

public enum Event // trigger areas
{
    HitPurpleGoal = 0,    // floor on the blue side (purple scores)
    HitBlueGoal = 1,      // floor on the purple side (blue scores)
    HitOutOfBounds = 2,   // out of bounds
    HitIntoBlueArea = 3,  // blue side of the court
    HitIntoPurpleArea = 4 // purple side of the court
}
```

```
void Start()
{
    blueAgentRb = blueAgent.GetComponent<Rigidbody>();
    purpleAgentRb = purpleAgent.GetComponent<Rigidbody>();
    ballRb = ball.GetComponent<Rigidbody>();

    // randomly choose which side serves first
    var spawnSideList = new List<int> { -1, 1 };
    ballSpawnSide = spawnSideList[Random.Range(0, 2)];

    blueGoalRenderer = blueGoal.GetComponent<Renderer>();
    purpleGoalRenderer = purpleGoal.GetComponent<Renderer>();
    RenderersList.Add(blueGoalRenderer);
    RenderersList.Add(purpleGoalRenderer);

    volleyballSettings = FindObjectOfType<VolleyballSettings>();

    ResetScene();
}
```

```
public void ResolveEvent(Event triggerEvent) // called when the ball touches a trigger
{
    switch (triggerEvent)
    {
        case Event.HitOutOfBounds:
            if (lastHitter == Team.Blue)
            {
                // apply penalty to blue agent
            }
            else if (lastHitter == Team.Purple)
            {
                // apply penalty to purple agent
            }

            blueAgent.EndEpisode();
            purpleAgent.EndEpisode();
            ResetScene();
            break;

        case Event.HitBlueGoal:
            // blue wins: turn the floor blue
            StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.blueGoalMaterial, RenderersList, .5f));

            blueAgent.EndEpisode();
            purpleAgent.EndEpisode();
            ResetScene();
            break;

        case Event.HitPurpleGoal:
            // purple wins: turn the floor purple
            StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.purpleGoalMaterial, RenderersList, .5f));

            blueAgent.EndEpisode();
            purpleAgent.EndEpisode();
            ResetScene();
            break;

        case Event.HitIntoBlueArea:
            if (lastHitter == Team.Purple)
            {
                purpleAgent.AddReward(1);
            }
            break;

        case Event.HitIntoPurpleArea:
            if (lastHitter == Team.Blue)
            {
                blueAgent.AddReward(1);
            }
            break;
    }
}
```

An episode is one round of interaction between the agents and the environment: observing the state, taking actions, and receiving feedback, until the episode ends and the scene is reset.

```
public void ResetScene()
{
    resetTimer = 0;

    lastHitter = Team.Default; // reset last hitter

    foreach (var agent in AgentsList)
    {
        // randomise starting positions and rotations
        var randomPosX = Random.Range(-2f, 2f);
        var randomPosZ = Random.Range(-2f, 2f);
        var randomPosY = Random.Range(0.5f, 3.75f); // depends on jump height
        var randomRot = Random.Range(-45f, 45f);

        agent.transform.localPosition = new Vector3(randomPosX, randomPosY, randomPosZ);
        agent.transform.eulerAngles = new Vector3(0, randomRot, 0);

        agent.GetComponent<Rigidbody>().velocity = default(Vector3);
    }

    ResetBall();
}

void ResetBall()
{
    var randomPosX = Random.Range(-2f, 2f);
    var randomPosZ = Random.Range(6f, 10f);
    var randomPosY = Random.Range(6f, 8f);

    // alternate which side the ball spawns on
    ballSpawnSide = -1 * ballSpawnSide;

    if (ballSpawnSide == -1)
    {
        ball.transform.localPosition = new Vector3(randomPosX, randomPosY, randomPosZ);
    }
    else if (ballSpawnSide == 1)
    {
        ball.transform.localPosition = new Vector3(randomPosX, randomPosY, -1 * randomPosZ);
    }

    ballRb.angularVelocity = Vector3.zero;
    ballRb.velocity = Vector3.zero;
}
```

Setup on the ball itself:

```
void OnTriggerEnter(Collider other) // trigger areas
{
    if (other.gameObject.CompareTag("boundary")) // out of bounds
    {
        // ball went out of bounds
        envController.ResolveEvent(Event.HitOutOfBounds);
    }
    else if (other.gameObject.CompareTag("blueBoundary")) // blue side of the court
    {
        // ball hit into blue side
        envController.ResolveEvent(Event.HitIntoBlueArea);
    }
    else if (other.gameObject.CompareTag("purpleBoundary")) // purple side of the court
    {
        // ball hit into purple side
        envController.ResolveEvent(Event.HitIntoPurpleArea);
    }
    else if (other.gameObject.CompareTag("purpleGoal")) // floor on the blue side (purple scores)
    {
        // ball hit purple goal (blue side court)
        envController.ResolveEvent(Event.HitPurpleGoal);
    }
    else if (other.gameObject.CompareTag("blueGoal")) // floor on the purple side (blue scores)
    {
        // ball hit blue goal (purple side court)
        envController.ResolveEvent(Event.HitBlueGoal);
    }
}
```

After the environment is set up, it looks like this:

{%youtube oWDZuzzNhCo %}

With the environment in place, the next part is the agent. The most important pieces for reinforcement learning are the observations and the rewards.

```
public override void CollectObservations(VectorSensor sensor) { /* observations */ }
public override void OnActionReceived(ActionBuffers actions)  { /* actions and reward assignment */ }
public override void Heuristic(in ActionBuffers actionsOut)   { /* keyboard control */ }
```

```
public override void CollectObservations(VectorSensor sensor)
{
    // Agent rotation (1 float)
    sensor.AddObservation(this.transform.rotation.y);

    // Vector from agent to ball (direction to ball) (3 floats)
    Vector3 toBall = new Vector3(
        (ballRb.transform.position.x - this.transform.position.x) * agentRot,
        (ballRb.transform.position.y - this.transform.position.y),
        (ballRb.transform.position.z - this.transform.position.z) * agentRot);

    sensor.AddObservation(toBall.normalized);

    // Distance from the ball (1 float); magnitude is the length of the toBall vector
    sensor.AddObservation(toBall.magnitude);

    // Agent velocity (3 floats)
    sensor.AddObservation(agentRb.velocity);

    // Ball velocity (3 floats)
    sensor.AddObservation(ballRb.velocity.y);
    sensor.AddObservation(ballRb.velocity.z * agentRot);
    sensor.AddObservation(ballRb.velocity.x * agentRot);
}
```
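In total the method above adds 1 + 3 + 1 + 3 + 3 = 11 floats, so the vector observation size in the agent's Behavior Parameters should be 11 (assuming no observation stacking). The x and z components are multiplied by `agentRot`, so each agent sees the court in its own frame and both teams can share a single `Volleyball` policy. The notes never show where `agentRot` is set; below is a hypothetical sketch, assuming a `teamId` field of the `Team` enum (the exact sign convention is an assumption as well).

```
using Unity.MLAgents;

// Hypothetical fragment of the volleyball agent class, shown only to illustrate agentRot.
public class VolleyballAgentSketch : Agent
{
    public Team teamId;   // assumed field holding this agent's team
    float agentRot;       // mirroring factor used in CollectObservations and MoveAgent

    public override void Initialize()
    {
        // Opposite signs per team mirror the x/z axes across the net,
        // so a hit "towards the opponent" looks the same from either side.
        agentRot = (teamId == Team.Blue) ? 1f : -1f;
    }
}
```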
```
public override void OnActionReceived(ActionBuffers actionBuffers)
{
    MoveAgent(actionBuffers.DiscreteActions);
}

public void MoveAgent(ActionSegment<int> act)
{
    var grounded = CheckIfGrounded();
    var dirToGo = Vector3.zero;
    var rotateDir = Vector3.zero;

    var dirToGoForwardAction = act[0];
    var rotateDirAction = act[1];
    var dirToGoSideAction = act[2];
    var jumpAction = act[3];

    if (dirToGoForwardAction == 1)
        dirToGo = (grounded ? 1f : 0.5f) * transform.forward * 1f;
    else if (dirToGoForwardAction == 2)
        dirToGo = (grounded ? 1f : 0.5f) * transform.forward * volleyballSettings.speedReductionFactor * -1f;

    if (rotateDirAction == 1)
        rotateDir = transform.up * -1f;
    else if (rotateDirAction == 2)
        rotateDir = transform.up * 1f;

    if (dirToGoSideAction == 1)
        dirToGo = (grounded ? 1f : 0.5f) * transform.right * volleyballSettings.speedReductionFactor * -1f;
    else if (dirToGoSideAction == 2)
        dirToGo = (grounded ? 1f : 0.5f) * transform.right * volleyballSettings.speedReductionFactor;

    if (jumpAction == 1)
    {
        if ((jumpingTime <= 0f) && grounded)
        {
            Jump();
        }
    }

    transform.Rotate(rotateDir, Time.fixedDeltaTime * 200f);
    agentRb.AddForce(agentRot * dirToGo * volleyballSettings.agentRunSpeed, ForceMode.VelocityChange);

    // makes the agent physically "jump"
    if (jumpingTime > 0f)
    {
        jumpTargetPos = new Vector3(agentRb.position.x,
            jumpStartingPos.y + volleyballSettings.agentJumpHeight,
            agentRb.position.z) + agentRot * dirToGo;

        MoveTowards(jumpTargetPos, agentRb, volleyballSettings.agentJumpVelocity,
            volleyballSettings.agentJumpVelocityMaxChange);
    }

    // provides a downward force to end the jump
    if (!(jumpingTime > 0f) && !grounded)
    {
        agentRb.AddForce(Vector3.down * volleyballSettings.fallingForce, ForceMode.Acceleration);
    }

    // controls the jump sequence
    if (jumpingTime > 0f)
    {
        jumpingTime -= Time.fixedDeltaTime;
    }
}
```

```
public override void Heuristic(in ActionBuffers actionsOut)
{
    var discreteActionsOut = actionsOut.DiscreteActions;

    if (Input.GetKey(KeyCode.D))
    {
        // rotate right
        discreteActionsOut[1] = 2;
    }
    if (Input.GetKey(KeyCode.W) || Input.GetKey(KeyCode.UpArrow))
    {
        // forward
        discreteActionsOut[0] = 1;
    }
    if (Input.GetKey(KeyCode.A))
    {
        // rotate left
        discreteActionsOut[1] = 1;
    }
    if (Input.GetKey(KeyCode.S) || Input.GetKey(KeyCode.DownArrow))
    {
        // backward
        discreteActionsOut[0] = 2;
    }
    if (Input.GetKey(KeyCode.LeftArrow))
    {
        // move left
        discreteActionsOut[2] = 1;
    }
    if (Input.GetKey(KeyCode.RightArrow))
    {
        // move right
        discreteActionsOut[2] = 2;
    }
    discreteActionsOut[3] = Input.GetKey(KeyCode.Space) ? 1 : 0;
}
```

![](https://hackmd.io/_uploads/Hku4TUF43.png)

Once everything is configured, training can start. The trainer configuration:

```
behaviors:
  Volleyball:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0002
      beta: 0.003
      epsilon: 0.15
      lambd: 0.93
      num_epoch: 4
      learning_rate_schedule: constant
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.96
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 20000000
    time_horizon: 1000
    summary_freq: 20000
```

```
mlagents-learn config/Volleyball.yaml --run-id=V1
```

Training progress can then be monitored in TensorBoard at http://localhost:6006/.

![](https://hackmd.io/_uploads/SkSpdk9Vh.png)

{%youtube QYxTRyxgfjs %}

{%youtube yfdTLwsUyKE %}

![](https://i.imgur.com/yMwrppx.png)

---

# Adversarial and Cooperative

This combines the two environment types above; soccer and basketball are basic examples. The agents not only have to watch the ball, but also the opposing agents and their own teammates.

https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/1-unity-ml-agents-arm-game-ai

# Behavior Designer

![](https://hackmd.io/_uploads/H10r6y9Nn.png)

Reward: the mean reward refers to the "Cumulative Reward - The mean cumulative episode reward over all agents. Should increase during a successful training session." The Std. of reward simply describes the standard deviation of the reward, which can also be read as a margin of error. It is worth checking out some more example environments to better understand how rewards work.
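Beyond the built-in mean and standard deviation of the reward, custom values can be logged to the same TensorBoard run through ML-Agents' `StatsRecorder`, which helps explain why a reward curve moves. A small sketch follows; the `RallyStats` class, the stat key, and the `rallyHits` counter are made up for illustration.

```
using Unity.MLAgents;
using UnityEngine;

// Hypothetical helper: tracks how long rallies last, next to the reward curves in TensorBoard.
public class RallyStats : MonoBehaviour
{
    int rallyHits; // assumed to be incremented each time either agent touches the ball

    public void OnBallHit()
    {
        rallyHits += 1;
    }

    public void OnRallyEnded()
    {
        // Shows up in TensorBoard under this key, averaged over each summary period.
        Academy.Instance.StatsRecorder.Add("Volleyball/Rally Hits", rallyHits);
        rallyHits = 0;
    }
}
```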