# Multi-Agent Reinforcement Learning (MARL) Classification Methods

a. Analysis of emergent behaviors:
Put all agents into the environment and train each one independently, with no sharing or communication between them.
b. Learning communication:
During training, agents exchange messages and share information with each other to achieve better results.
c. Learning cooperation:
During training, agents learn to cooperate with one another to achieve better results.
d. Agents modeling agents:
The agents do not communicate, but each one models the others by observing their externally visible behavior, and uses that model to inform its own decisions.
**Decentralized system**

In the decentralized approach, each agent is trained independently, without taking the existence of the other agents into account.
In this case, every agent treats all the other agents as part of the environment.
The environment therefore becomes non-stationary from each agent's point of view, so convergence is not guaranteed.
**Centralized approach**

A single policy is learned for all agents.
It takes the current state of the environment as input and outputs a policy under which the agents cooperate.
The reward is global.
**Self-Play**
Training agents for adversarial games is difficult: the agent being trained needs an opponent, but if that opponent is set too strong, the agent loses every game and ends up learning a poor policy.

Self-Play solves this problem: the agent being trained acts as both the player and the opponent, and the opponent's level is raised whenever the agent itself levels up.
**Input**
(Game state) -> the current state of the game is used as input to decide the next action
(Opponent policy) -> the agent also takes the opponent's policy into account as input
(Policy parameters) -> the agent's policy parameters, which are continually updated during training
**Output**
(Action) -> based on the game state and the policy parameters, the agent outputs an action
(Reward) -> after the action, the environment returns a reward, which the agent uses to update its policy parameters
(New parameters) -> continually updated for use in the next round of training
**Advantages**
Training data can be generated automatically without any external data, and increasingly strong behavior can be trained.
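In Unity ML-Agents, self-play works by giving the two opponents the same behavior name but different team IDs, and enabling a `self_play` section in the trainer configuration; the trainer then plays the current policy against snapshots of its past selves. A minimal sketch of the team assignment (the component fields and class name below are illustrative, not taken from this project):
```
using Unity.MLAgents.Policies;
using UnityEngine;

// Illustrative only: both agents share one behavior name but get different TeamIds,
// so the trainer can treat them as self-play opponents.
public class SelfPlayTeamSetup : MonoBehaviour
{
    public BehaviorParameters blueAgentParams;
    public BehaviorParameters purpleAgentParams;

    void Awake()
    {
        blueAgentParams.TeamId = 0;
        purpleAgentParams.TeamId = 1;
    }
}
```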
---
# Cooperative environments
Idea 1:
Place two agents in a level with a door controlled by a pressure switch (an agent has to stand on it) and a coin behind the door; the two agents must cooperate to collect the coin and clear the level (a rough sketch of the switch follows below).
Idea 2:
Design a box that can only be pushed by two or more agents, used either to escape or to press a button.
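As a rough sketch of idea 1 (plain Unity code with made-up names such as `door`, not from an existing project), the pressure switch can simply count the agents standing on it and keep the door open while at least one agent is present:
```
using UnityEngine;

// Hypothetical pressure switch for the cooperative door idea above.
public class PressurePlate : MonoBehaviour
{
    public GameObject door;   // the door blocking the coin
    int agentsOnPlate = 0;

    void OnTriggerEnter(Collider other)
    {
        if (other.CompareTag("agent"))
        {
            agentsOnPlate++;
            door.SetActive(false); // open the door while the plate is pressed
        }
    }

    void OnTriggerExit(Collider other)
    {
        if (other.CompareTag("agent"))
        {
            agentsOnPlate--;
            if (agentsOnPlate <= 0)
            {
                door.SetActive(true); // close the door once nobody is on the plate
            }
        }
    }
}
```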
---
**Competitive/Adversarial environments**
Two teams both try to defeat each other to win; one-on-one badminton or volleyball are simple examples.
Below is the one-on-one volleyball game that was built.

Rule: the ball must not touch the ground on your own side
Goal: make the ball land on the opponent's court
Agents: two separate agents (a multi-agent setup)
**Environment setup:**
First place the court, the boundaries, the net, and the ball.
Then place the objects needed by the purple and blue sides: the players (agents) and the floor.
Once the objects are placed, set up the environment variables.
Start with the overall game environment (which objects the game contains and what the rules are).
```
public enum Team
{
    Blue = 0,
    Purple = 1,
    Default = 2
}

public enum Event // trigger regions on the court
{
    HitPurpleGoal = 0,    // floor on the blue side (purple scores)
    HitBlueGoal = 1,      // floor on the purple side (blue scores)
    HitOutOfBounds = 2,   // out of bounds
    HitIntoBlueArea = 3,  // ball entered the blue side
    HitIntoPurpleArea = 4 // ball entered the purple side
}
```
```
void Start()
{
    blueAgentRb = blueAgent.GetComponent<Rigidbody>();
    purpleAgentRb = purpleAgent.GetComponent<Rigidbody>();
    ballRb = ball.GetComponent<Rigidbody>();

    // randomly pick which side serves first
    var spawnSideList = new List<int> { -1, 1 };
    ballSpawnSide = spawnSideList[Random.Range(0, 2)];

    blueGoalRenderer = blueGoal.GetComponent<Renderer>();
    purpleGoalRenderer = purpleGoal.GetComponent<Renderer>();
    RenderersList.Add(blueGoalRenderer);
    RenderersList.Add(purpleGoalRenderer);

    volleyballSettings = FindObjectOfType<VolleyballSettings>();

    ResetScene();
}
```
```
public void ResolveEvent(Event triggerEvent) // called when the ball touches one of the trigger regions
{
    switch (triggerEvent)
    {
        case Event.HitOutOfBounds:
            if (lastHitter == Team.Blue)
            {
                // apply penalty to blue agent
            }
            else if (lastHitter == Team.Purple)
            {
                // apply penalty to purple agent
            }
            blueAgent.EndEpisode();
            purpleAgent.EndEpisode();
            ResetScene();
            break;

        case Event.HitBlueGoal:
            // blue wins
            // turn floor blue
            StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.blueGoalMaterial, RenderersList, .5f));
            blueAgent.EndEpisode();
            purpleAgent.EndEpisode();
            ResetScene();
            break;

        case Event.HitPurpleGoal:
            // purple wins
            // turn floor purple
            StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.purpleGoalMaterial, RenderersList, .5f));
            blueAgent.EndEpisode();
            purpleAgent.EndEpisode();
            ResetScene();
            break;

        case Event.HitIntoBlueArea:
            if (lastHitter == Team.Purple)
            {
                // reward purple for getting the ball over the net
                purpleAgent.AddReward(1);
            }
            break;

        case Event.HitIntoPurpleArea:
            if (lastHitter == Team.Blue)
            {
                // reward blue for getting the ball over the net
                blueAgent.AddReward(1);
            }
            break;
    }
}
```
An episode is one round of interaction between the agents and the environment: observing the state, taking actions, and receiving feedback, until the episode ends and the scene is reset.
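`ResetScene()` below zeroes a `resetTimer`, which suggests the controller also caps episode length. A sketch of that common ML-Agents pattern, placed in the same environment controller (the `MaxEnvironmentSteps` field is an assumption and is not shown in the original script):
```
// Assumed episode-length cap inside the environment controller (not shown in the original code).
public int MaxEnvironmentSteps = 5000;

void FixedUpdate()
{
    resetTimer += 1;
    if (MaxEnvironmentSteps > 0 && resetTimer >= MaxEnvironmentSteps)
    {
        // force the rally to end so episodes cannot run forever
        blueAgent.EndEpisode();
        purpleAgent.EndEpisode();
        ResetScene();
    }
}
```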
```
public void ResetScene()
{
    resetTimer = 0;
    lastHitter = Team.Default; // reset last hitter

    foreach (var agent in AgentsList)
    {
        // randomise starting positions and rotations
        var randomPosX = Random.Range(-2f, 2f);
        var randomPosZ = Random.Range(-2f, 2f);
        var randomPosY = Random.Range(0.5f, 3.75f); // depends on jump height
        var randomRot = Random.Range(-45f, 45f);

        agent.transform.localPosition = new Vector3(randomPosX, randomPosY, randomPosZ);
        agent.transform.eulerAngles = new Vector3(0, randomRot, 0);
        agent.GetComponent<Rigidbody>().velocity = default(Vector3);
    }

    ResetBall();
}

void ResetBall()
{
    var randomPosX = Random.Range(-2f, 2f);
    var randomPosZ = Random.Range(6f, 10f);
    var randomPosY = Random.Range(6f, 8f);

    // alternate which side serves
    ballSpawnSide = -1 * ballSpawnSide;

    if (ballSpawnSide == -1)
    {
        ball.transform.localPosition = new Vector3(randomPosX, randomPosY, randomPosZ);
    }
    else if (ballSpawnSide == 1)
    {
        ball.transform.localPosition = new Vector3(randomPosX, randomPosY, -1 * randomPosZ);
    }

    ballRb.angularVelocity = Vector3.zero;
    ballRb.velocity = Vector3.zero;
}
```
The script attached to the volleyball:
```
void OnTriggerEnter(Collider other) // the ball entered a trigger region
{
    if (other.gameObject.CompareTag("boundary"))
    {
        // ball went out of bounds
        envController.ResolveEvent(Event.HitOutOfBounds);
    }
    else if (other.gameObject.CompareTag("blueBoundary"))
    {
        // ball hit into the blue side (reward for purple)
        envController.ResolveEvent(Event.HitIntoBlueArea);
    }
    else if (other.gameObject.CompareTag("purpleBoundary"))
    {
        // ball hit into the purple side (reward for blue)
        envController.ResolveEvent(Event.HitIntoPurpleArea);
    }
    else if (other.gameObject.CompareTag("purpleGoal"))
    {
        // ball hit the purple goal, i.e. the floor on the blue side (purple scores)
        envController.ResolveEvent(Event.HitPurpleGoal);
    }
    else if (other.gameObject.CompareTag("blueGoal"))
    {
        // ball hit the blue goal, i.e. the floor on the purple side (blue scores)
        envController.ResolveEvent(Event.HitBlueGoal);
    }
}
```
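The event handling above relies on `lastHitter` being kept up to date, which is not shown in these snippets. A plausible sketch is a collision callback on each agent that reports its team back to the controller (the `UpdateLastHitter` method and `teamId` field are hypothetical names):
```
// Hypothetical: on the agent, report ball touches to the environment controller.
void OnCollisionEnter(Collision collision)
{
    if (collision.gameObject.CompareTag("ball"))
    {
        envController.UpdateLastHitter(teamId);
    }
}

// Hypothetical: in the environment controller, remember which team touched the ball last.
public void UpdateLastHitter(Team team)
{
    lastHitter = team;
}
```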
After the environment is set up, it looks like this:
{%youtube oWDZuzzNhCo %}
With the environment in place, the next step is the agent itself. The most important parts of a reinforcement learning agent are its observations and its rewards.
```
public override void CollectObservations(VectorSensor sensor)
{ /* what the agent observes */ }

public override void OnActionReceived(ActionBuffers actions)
{ /* actions and reward assignment */ }

public override void Heuristic(in ActionBuffers actionsOut)
{ /* keyboard control for manual testing */ }
```
```
public override void CollectObservations(VectorSensor sensor)
{
    // Agent rotation (1 float)
    sensor.AddObservation(this.transform.rotation.y);

    // Vector from agent to ball (direction to ball) (3 floats)
    Vector3 toBall = new Vector3((ballRb.transform.position.x - this.transform.position.x) * agentRot,
                                 (ballRb.transform.position.y - this.transform.position.y),
                                 (ballRb.transform.position.z - this.transform.position.z) * agentRot);
    sensor.AddObservation(toBall.normalized);

    // Distance from the ball (1 float); magnitude is the length of the toBall vector
    sensor.AddObservation(toBall.magnitude);

    // Agent velocity (3 floats)
    sensor.AddObservation(agentRb.velocity);

    // Ball velocity (3 floats)
    sensor.AddObservation(ballRb.velocity.y);
    sensor.AddObservation(ballRb.velocity.z * agentRot);
    sensor.AddObservation(ballRb.velocity.x * agentRot);
}
```
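Several observations are multiplied by `agentRot`, which mirrors the x and z components so that both players observe the court in one shared frame of reference regardless of which side they stand on. How `agentRot` is initialised is not shown here; one reasonable sketch is to set it from the agent's team when an episode begins (the placement and sign convention below are assumptions):
```
// Assumed initialisation of agentRot: one team uses +1 and the other -1 so that
// "towards the net" means the same thing in both agents' observations and actions.
public override void OnEpisodeBegin()
{
    agentRot = (teamId == Team.Blue) ? 1f : -1f;
}
```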
```
public override void OnActionReceived(ActionBuffers actionBuffers)
{
    MoveAgent(actionBuffers.DiscreteActions);
}

public void MoveAgent(ActionSegment<int> act)
{
    var grounded = CheckIfGrounded();
    var dirToGo = Vector3.zero;
    var rotateDir = Vector3.zero;

    var dirToGoForwardAction = act[0];
    var rotateDirAction = act[1];
    var dirToGoSideAction = act[2];
    var jumpAction = act[3];

    if (dirToGoForwardAction == 1)
        dirToGo = (grounded ? 1f : 0.5f) * transform.forward * 1f;
    else if (dirToGoForwardAction == 2)
        dirToGo = (grounded ? 1f : 0.5f) * transform.forward * volleyballSettings.speedReductionFactor * -1f;

    if (rotateDirAction == 1)
        rotateDir = transform.up * -1f;
    else if (rotateDirAction == 2)
        rotateDir = transform.up * 1f;

    if (dirToGoSideAction == 1)
        dirToGo = (grounded ? 1f : 0.5f) * transform.right * volleyballSettings.speedReductionFactor * -1f;
    else if (dirToGoSideAction == 2)
        dirToGo = (grounded ? 1f : 0.5f) * transform.right * volleyballSettings.speedReductionFactor;

    if (jumpAction == 1)
    {
        if (((jumpingTime <= 0f) && grounded))
        {
            Jump();
        }
    }

    transform.Rotate(rotateDir, Time.fixedDeltaTime * 200f);
    agentRb.AddForce(agentRot * dirToGo * volleyballSettings.agentRunSpeed, ForceMode.VelocityChange);

    // makes the agent physically "jump"
    if (jumpingTime > 0f)
    {
        jumpTargetPos = new Vector3(agentRb.position.x, jumpStartingPos.y + volleyballSettings.agentJumpHeight, agentRb.position.z) + agentRot * dirToGo;
        MoveTowards(jumpTargetPos, agentRb, volleyballSettings.agentJumpVelocity, volleyballSettings.agentJumpVelocityMaxChange);
    }

    // provides a downward force to end the jump
    if (!(jumpingTime > 0f) && !grounded)
    {
        agentRb.AddForce(Vector3.down * volleyballSettings.fallingForce, ForceMode.Acceleration);
    }

    // controls the jump sequence
    if (jumpingTime > 0f)
    {
        jumpingTime -= Time.fixedDeltaTime;
    }
}
```
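`MoveAgent` uses three helpers that are not included above: `CheckIfGrounded`, `Jump`, and `MoveTowards`. Sketches of what they might look like, assuming the jump is timed by `jumpingTime` (the actual project may implement them differently):
```
// Hypothetical helper implementations; the real project may differ.
bool CheckIfGrounded()
{
    // short raycast below the agent to test whether it is standing on something
    return Physics.Raycast(transform.position, Vector3.down, 1.1f);
}

void Jump()
{
    // start the jump sequence consumed by MoveAgent
    jumpingTime = 0.2f;
    jumpStartingPos = agentRb.position;
}

void MoveTowards(Vector3 targetPos, Rigidbody rb, float targetVel, float maxVel)
{
    // steer the rigidbody's velocity toward the jump target, limited to maxVel per step
    var moveToPos = targetPos - rb.worldCenterOfMass;
    var velocityTarget = Time.fixedDeltaTime * targetVel * moveToPos;
    if (!float.IsNaN(velocityTarget.x))
    {
        rb.velocity = Vector3.MoveTowards(rb.velocity, velocityTarget, maxVel);
    }
}
```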
```
public override void Heuristic(in ActionBuffers actionsOut)
{
    var discreteActionsOut = actionsOut.DiscreteActions;

    if (Input.GetKey(KeyCode.D))
    {
        // rotate right
        discreteActionsOut[1] = 2;
    }
    if (Input.GetKey(KeyCode.W) || Input.GetKey(KeyCode.UpArrow))
    {
        // forward
        discreteActionsOut[0] = 1;
    }
    if (Input.GetKey(KeyCode.A))
    {
        // rotate left
        discreteActionsOut[1] = 1;
    }
    if (Input.GetKey(KeyCode.S) || Input.GetKey(KeyCode.DownArrow))
    {
        // backward
        discreteActionsOut[0] = 2;
    }
    if (Input.GetKey(KeyCode.LeftArrow))
    {
        // move left
        discreteActionsOut[2] = 1;
    }
    if (Input.GetKey(KeyCode.RightArrow))
    {
        // move right
        discreteActionsOut[2] = 2;
    }

    discreteActionsOut[3] = Input.GetKey(KeyCode.Space) ? 1 : 0;
}
```
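`Heuristic` is only consulted when no trained (or training) policy is driving the agent. To test the keyboard controls by hand, the agent's Behavior Parameters can be switched to heuristic mode in the Inspector, or from code as in this small sketch (the class name is illustrative):
```
using Unity.MLAgents.Policies;
using UnityEngine;

// Illustrative: force an agent to be driven by its Heuristic() method for manual testing.
public class HeuristicSwitcher : MonoBehaviour
{
    void Awake()
    {
        var behaviorParameters = GetComponent<BehaviorParameters>();
        behaviorParameters.BehaviorType = BehaviorType.HeuristicOnly;
    }
}
```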

Once everything is adjusted, training can begin. The trainer configuration (PPO) looks like this:
```
behaviors:
  Volleyball:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0002
      beta: 0.003
      epsilon: 0.15
      lambd: 0.93
      num_epoch: 4
      learning_rate_schedule: constant
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.96
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 20000000
    time_horizon: 1000
    summary_freq: 20000
```
```
# start training with the config above
mlagents-learn config/Volleyball.yaml --run-id=V1

# training curves can then be viewed in TensorBoard at
# http://localhost:6006/
```

{%youtube QYxTRyxgfjs %}
{%youtube yfdTLwsUyKE %}

---
# Adversarial and Cooperative
Combining the two environments above; soccer and basketball are basic examples. Each agent must observe not only the ball but also the opposing agents and its own teammates.
https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/1-unity-ml-agents-arm-game-ai
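For these mixed settings, ML-Agents provides multi-agent groups: teammates registered in the same group share a group reward, while the two groups compete with each other. A small sketch using `SimpleMultiAgentGroup` (the controller class and method names are illustrative, not code from the linked article):
```
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

// Hypothetical controller for a team-vs-team game: teammates share a group reward,
// and the two groups are adversaries.
public class TeamGameController : MonoBehaviour
{
    public List<Agent> blueTeam;
    public List<Agent> purpleTeam;

    SimpleMultiAgentGroup blueGroup;
    SimpleMultiAgentGroup purpleGroup;

    void Start()
    {
        blueGroup = new SimpleMultiAgentGroup();
        purpleGroup = new SimpleMultiAgentGroup();
        foreach (var agent in blueTeam) blueGroup.RegisterAgent(agent);
        foreach (var agent in purpleTeam) purpleGroup.RegisterAgent(agent);
    }

    public void OnBlueScores()
    {
        // cooperative part: every blue agent shares the reward
        blueGroup.AddGroupReward(1f);
        // adversarial part: the purple team is penalized
        purpleGroup.AddGroupReward(-1f);
        blueGroup.EndGroupEpisode();
        purpleGroup.EndGroupEpisode();
    }
}
```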
# Behavior Designer

Reward: the mean reward refers to the "Cumulative Reward - the mean cumulative episode reward over all agents. Should increase during a successful training session."
The Std. of Reward is simply the standard deviation of the reward, which can be read as a margin of error. Looking at more of the example environments is a good way to build intuition for how rewards behave.