# Research Log
## Tentative Timeline / Goals
| Create Date | Due Date | Topic | Task | Notes | Status |
|------|-----|-----|----|---|------|
| 10/26/22 | 11/02/22 | Discovery | Get the Cleanup and Harvest environments running and play with parameters to start characterizing the conditions that necessitate inter-agent communication (these environments are particularly relevant to action sharing) | (10/26/22) Finally got through dependency hell... fix the GPU dependency and run experiments | <p style='color:white;background-color:green'>Complete</p> |
| 11/02/22 | 11/09/22 | Initial Experimentation | Profile why the environment is so slow. Run more experiments. | Keep running more experiments and organize them better... e.g. track the plots systematically. | <p style='color:white;background-color:Orange'>In Progress</p> |
| 11/09/22 | 11/16/22 | Initial Experimentation | <ul> <li>Fix up the plots</li> <li> Plot different algorithms against each other and just look at one thing at a time (mean, max, etc...) </li> <li> Keep toying with the env, e.g. try a 1D version and try making it literally as small as possible. </li> <li> Once that's done - pick baseline env instantiations to run all the algorithms against and compare</li> <li> Write a render method to try and inspect the policy in action </li> <li> Look for rollouts / generate them if they don't exist </li> <li> Look for the final policy / weights or generate them if they don't exist </li></ul>| Keep good track of what experiments have been run. Stay organized | <p style='color:white;background-color:Orange'>In Progress</p> |
| 11/16/22 | 11/23/22 | TBD | TBD | | TBD |
| 11/23/22 | 11/30/22 | TBD | TBD | | TBD |
| 11/30/22 | 12/07/22 | TBD | TBD | | TBD |
| 12/07/22 | 12/14/22 | TBD | TBD | (last meeting before break) | TBD |
| 12/14/22 | 12/21/22 | TBD | TBD | | TBD |
| 12/21/22 | 12/28/22 | TBD | TBD | | TBD |
| 12/28/22 | 01/04/23 | TBD | TBD | | TBD |
| 01/04/23 | 01/11/23 | TBD | TBD | | TBD |
| 01/11/23 | 01/18/23 | TBD | TBD | | TBD |
| 01/18/23 | 01/25/23 | TBD | TBD | | TBD |
| 01/25/23 | 02/01/23 | TBD | TBD | | TBD |
| 02/01/23 | 02/08/23 | TBD | TBD | | TBD |
| 02/08/23 | 02/15/23 | TBD | TBD | | TBD |
| 02/15/23 | 02/22/23 | TBD | TBD | | TBD |
| 02/22/23 | 03/01/23 | TBD | TBD | | TBD |
| 03/01/23 | 03/08/23 | TBD | TBD | | TBD |
| 03/08/23 | 03/15/23 | TBD | TBD | | TBD |
| 03/15/23 | 03/22/23 | TBD | TBD | | TBD |
| 03/22/23 | 03/29/23 | TBD | TBD | | TBD |
| 03/29/23 | 04/05/23 | TBD | TBD | | TBD |
| 04/05/23 | 04/12/23 | TBD | TBD | | TBD |
| 04/12/23 | 04/19/23 | TBD | TBD | | TBD |
| 04/19/23 | 04/26/23 | TBD | TBD | | TBD |
| 04/26/23 | 05/03/23 | TBD | TBD | | TBD |
## Comparing different shared information (09/28/22)
### Actions
- Complete actions
- [Decentralized multi-agent reinforcement learning with shared actions (2020)](https://arxiv.org/pdf/2003.10185.pdf)
- Decentralized Learning
- Cooperative context
- Goal is to maximize collective reward
- **Intuition**
- This work attempts to find the (provably) optimal policy, and suggests that this necessitates complete state information
- Retrospective actions
- [Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning (2019)](https://arxiv.org/pdf/1810.08647.pdf)
- After selecting an action, consider how taking other (counterfactual) actions would have influenced other agents' behavior, by looking at how their actions would have changed
- A large change in how other agents would act is considered "high social influence" and is rewarded
- Centralized training is used to compute $c_t^k$, the causal influence reward (a rough form of this reward is written out after this list)
- **Intuition**
- Actions that strongly affect (or counterfactually would affect) the actions of other agents are considered "influential"
- Why are "influential" actions necessarily desirable?
- Authors show this is equivalent to having high mutual information between actions
- This leads to high levels of coordination / cooperation while still giving diverse policies
- This is particularly challenging in decentralized MARL algorithms
- Effective in traditionally difficult "social dilemma" environments
- **When might communicating actions be useful?**
- This could be most useful in the case where the primary concern is effective cooperation / coordination
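
For reference, my reading of the causal influence reward from the paper (reconstructed from memory, so the exact conditioning may differ slightly): agent $k$ is rewarded for the KL divergence between what the other agents do given its actual action and what they would do under the marginal over its counterfactual actions $\tilde{a}_t^k$:

$$c_t^k = \sum_{j \neq k} D_{KL}\left[\, p(a_t^j \mid a_t^k, s_t^j) \;\middle\|\; \sum_{\tilde{a}_t^k} p(a_t^j \mid \tilde{a}_t^k, s_t^j)\, p(\tilde{a}_t^k \mid s_t^j) \right]$$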
### State
- Observation correlated with state
- [Learning to Communicate with Deep Reinforcement Learning (2016)](https://arxiv.org/pdf/1605.06676.pdf)
- private observation $o_t^a$ correlated to $s_t$
- Learned
- Reinforced Inter-Agent Learning (RIAL)
- learns to share a binary-valued message, treating the message as just another action selected via RL (no gradients flow between agents)
- Differentiable Inter-Agent Learning (DIAL)
- integrates learning the policy and the communication, allowing gradients to flow between agents through a real-valued message channel during centralized training (a rough sketch of this kind of discretise/regularise channel follows this list)
- Environment: the prisoner / lightbulb (switch) riddle
- **Intuition**
- Sharing environment / state information is an effective way to expand exploration of the environment
- I think this is only true under the cooperative / shared reward paradigm, otherwise you'd be seeing the states, but wouldn't necessarily know whether they're good / bad / useful
- *Aside: This paper also developed several benchmarks that require communication - could be useful for testing*
- **When might communicating state be useful?**
- This could be most useful in the case where the primary concern is that the environment is difficult to explore, leading to suboptimal policies
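
Since DIAL's trick is the differentiable channel, here is a minimal sketch of a DRU-style (discretise/regularise) message unit as I understand it from the paper; the default noise level `sigma` and the exact logistic squashing are my own placeholders, not the authors' code:

```python
import numpy as np

def dru(message: float, sigma: float = 2.0, training: bool = True) -> float:
    """Discretise/Regularise Unit (sketch).

    During centralized training the real-valued message is perturbed with
    Gaussian noise and squashed, so gradients can flow through the channel;
    at execution time it is hard-thresholded to a binary signal.
    """
    if training:
        noisy = message + np.random.normal(0.0, sigma)
        return 1.0 / (1.0 + np.exp(-noisy))   # logistic squash, differentiable
    return 1.0 if message > 0 else 0.0        # discretize for decentralized execution
```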
### Policy
- Low-dimensional policy fingerprint
- [Stabilizing Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)](https://arxiv.org/pdf/1702.08887.pdf)
- Utilizes a multi-agent form of importance sampling to decay data made obsolete by non-stationarity
- augments tuples in the replay buffer with the probability of the joint action according to the current policies - this is used for the importance sampling
- Also conditions agents' value functions on a "fingerprint" of other agents' policies
- specifically, they augment the input to the Q-function with the training iteration $e$ and the exploration rate $\epsilon$ (a minimal sketch of this augmentation is after this list)
- That's it... not direct information from the policy, but rather meta-data indicating how well developed the other policies are.
- **Intuition**
- Deep Q-Learning does seem hard due to the non-stationarity making the replay buffer ineffective. This aims to correct that using these two methods in combination.
- The fact they use both of these makes it a little more difficult to isolate the effect of communicating information about the policy from the effect of the modified replay buffer.
- Mean field - average of neighbor policies
- [Mean Field Multi-Agent Reinforcement Learning (2018)](https://arxiv.org/pdf/1802.05438.pdf)
- approximates the effect of the joint action on an agent by the mean action of its neighbors, reducing the many-agent interaction to a two-body (agent vs. mean field) interaction
- **When might communicating policy information be useful?**
- This could be most useful in the case where the primary concern is non-stationarity (specifically non-stationarity induced by the agents, not necessarily environmental non-stationarity)
- Sharing actions and/or intentions can be an effective way to alert agents about what else is going on in the environment outside of their immediate state
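
As a concrete (toy) picture of the fingerprint idea above: the observation fed to each agent's Q-network is simply concatenated with metadata about how far along training is. This is my own sketch, not the authors' code:

```python
import numpy as np

def augment_with_fingerprint(obs: np.ndarray, train_iteration: int, epsilon: float) -> np.ndarray:
    """Append a low-dimensional 'policy fingerprint' -- the training iteration e and
    the exploration rate epsilon -- to an agent's observation before it reaches the
    Q-network, so samples from different stages of training can be disambiguated
    in the replay buffer."""
    fingerprint = np.array([float(train_iteration), epsilon], dtype=np.float32)
    return np.concatenate([obs.astype(np.float32).ravel(), fingerprint])
```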
### Combinations
- Action probabilities and State
- [Multi-Agent Reinforcement Learning for Networked Systems Control (2020)](https://arxiv.org/pdf/2004.01339.pdf)
- Targets easing non-stationarity by maximizing the information shared between agents
- Cleanup
- In Cleanup (a public goods game) agents must clean a river before apples can grow, but are not able to harvest apples while cleaning.
- agents must efficiently time harvesting apples and cleaning the river, and allow agents cleaning the river a chance to consume apples.
- Potential Gridworld Mapping:
- ```# goal states == # agents```, and to obtain maximum reward they must be collected in a specific order... so the agents need to learn to coordinate temporally (wait for the respective agents to reach their goal states in the optimal order); a small reward sketch for this idea follows the section notes below
- Harvest
- In Harvest (a common pool resource game), apples respawn at a rate proportional to the amount of nearby apples; if apples are harvested too quickly, they will not grow back.
- agents must spatially distribute their harvesting, and abstain from consuming apples too quickly in order to harvest sustainably.
- Potential Gridworld Mapping
- No respawning of apples. We have ```# goal states == # agents```, and the episode is over (max reward is obtained) when all apples have been picked up. Optimally, each agent should go after the apple nearest them - requiring spatial coordination.
*The code for these games, including hyperparameter settings and apple and waste respawn probabilities, can be found [here](https://github.com/eugenevinitsky/sequential_social_dilemma_games).*
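
To make the Cleanup-style gridworld mapping a bit more concrete, here is one way the "goals must be collected in a specific order" reward could be expressed. The function name and the bonus value are placeholders I made up for illustration:

```python
from typing import List

def ordered_goal_reward(collection_order: List[int], required_order: List[int]) -> float:
    """Toy reward for the proposed gridworld mapping: each collected goal is worth 1,
    with a bonus only if the goals were collected in the prescribed order, forcing
    the agents to coordinate temporally."""
    base = float(len(collection_order))                          # 1 point per goal collected
    bonus = 5.0 if collection_order == required_order else 0.0   # placeholder bonus value
    return base + bonus
```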
- TODO: Make a post in Slack with these assumptions
## Assumptions (10/12/22)
- Environmental
- **Stationary**
- The non-stationarity we are addressing comes from multi-agent interaction within the environment. We're assuming all environments themselves are stationary
- Agents
- **Agents are homogeneous**
- all agents in the environment leverage identical parameterizations and learning algorithms
- Additionally, communication is universal and uniform. That is, all agents communicate with all other agents, and all share the same type of information (i.e. actions, policies, etc...)
- Initial experiments consider **cooperative tasks** (i.e. optimal policies require agents to learn to work together)
- Experimental
- **Actions are taken simultaneously**
- Later could explore sequential actions
## First Test Environments (10/26/22)
### Harvest
### Cleanup
- TODO: Check whether the policy is already being stored - if not, store it myself so I can feed states through and observe the behavior
#### 11/09/22
**Smallest Board Rewards over Iterations - MOA**

**Smallest Board Profiling Plot (where is this spending all its time?) - MOA**

**Small Board Rewards over Iterations - MOA**

**Small Board Profiling Plot (where is this spending all its time?) - MOA**

**Small Board PPO Results**

**Small Board PPO**

#### 11/16/22
##### Note on Markov Game Formulation
With standard MDPs we have:
- State space $S$
- Action space $A$
- Transition function $T: S \times A \rightarrow P(S)$
- Reward function $R: S \times A \rightarrow \mathcal{R}$
Extending this to a Markov *Game* (multi-agent MDP formulation):
- State space $S$
- A *collection* of *Action Sets* $\mathcal{A} = \{A_1, A_2, ..., A_n\}$, one per agent
- A modified transition function which takes the current state and an action from each agent, $\mathcal{T}: S \times A_1 \times ... \times A_n \rightarrow P(S)$
- A reward function per agent, $R_i: S \times A_1 \times ... \times A_n \rightarrow \mathcal{R}$ (a minimal code sketch of this interface follows)
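
A minimal code-level view of this formulation, purely illustrative; the method names mirror the usual multi-agent env interface (reset/step) rather than any specific library:

```python
from typing import Any, Dict, Tuple

AgentID = str
State = Any
Action = int

class MarkovGame:
    """Sketch of a Markov Game: one shared state, per-agent action sets,
    a joint transition function, and a per-agent reward function."""

    def reset(self) -> State:
        """Return the initial state s_0."""
        raise NotImplementedError

    def step(self, state: State, actions: Dict[AgentID, Action]
             ) -> Tuple[State, Dict[AgentID, float], bool]:
        """Given the current state and one action per agent, sample the next state
        from T(s, a_1, ..., a_n) and return per-agent rewards R_i plus a done flag."""
        raise NotImplementedError
```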
##### Note on Underlying Markov Game Dynamics
I found a formulation in the social dilemma games literature that seems very useful. It is closely tied to game-theoretic payoff matrices, but it may help characterize Markov Games more generally, and could potentially be extended to formally define / differentiate collaborative / adversarial environments, as well as to formalize what an optimal policy should look like... for example, whether an optimal policy should require cooperation or sabotage:
Given (forgive the overlapping notation...):
- Reward for mutual cooperation $R^C$ (need to define this explicitly... is this separate from the Markov Game $R$?)
- Punishment for mutual defecting $P^D$
- "Sucker" outcome obtained by player who cooperates when the other defects $S^C$
- "Temptation" outcome for player who defects against cooperation $T^D$
With these definitions we can define a *Matrix Game* as a social dilemma game in which the following inequalities define the payoff matrix:
1. $R^C > P^D$: mutual cooperation is preferred to mutual defection
2. $R^C > S^C$: mutual cooperation is preferred to being exploited
3. $2R^C > T^D + S^C$: mutual cooperation is preferred to an equal probability of unilateral cooperation and defection
4. Either:
a. $T^D > R^C$: a *greed* game, in which exploiting a cooperator is better than cooperating, or
b. $P^D > S^C$: a *fear* game, in which mutual defection is preferred to being exploited
To extend this into the generic MARL context, we'd want to define **Cooperation** and **Defection** rigorously. In a simple $2 \times 2$ payoff matrix, where the agents have literal "cooperate" and "defect" actions, this is trivial. But in more complex environments it's less clear: cooperation is less apparent on a step-by-step basis, and instead looks more like agent(s) forgoing individual payoff in a manner that maximizes the total payoff summed over all agents. Defining this rigorously remains an open question. (A quick programmatic check of the four inequalities above is sketched after the example payoff matrix below.)
These particular inequalities might only apply to social dilemma games, but similar formal notation for relationships like this could be useful more broadly, even if difficult to distill. For example, in a two (2) agent game I might want the expected total reward under jointly optimal *cooperative* policies to exceed the expected total reward under individually optimal *uncooperative* policies: $\mathbb{E}[R_{\pi_C^*}^1 + R_{\pi_C^*}^2] > \mathbb{E}[R_{\pi_U^*}^1 + R_{\pi_U^*}^2]$. Note that this doesn't imply that the expected reward for either *individual* optimal policy dominates the other: $\mathbb{E}[R_{\pi_C^*}^i] > \mathbb{E}[R_{\pi_U^*}^i]$. A simple example would be an environment with two actions per agent where rewards follow the payoff matrix:
$$\begin{bmatrix}
0,0 & 0,2\\
2,0 & 1,1
\end{bmatrix}$$
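
A quick programmatic restatement of the four matrix-game inequalities above (just the definitions, nothing more):

```python
def is_social_dilemma(R_C: float, P_D: float, S_C: float, T_D: float) -> bool:
    """Check the matrix-game social dilemma conditions: mutual cooperation beats
    mutual defection and being exploited, beats a 50/50 mix of exploiting /
    being exploited, and there is either a 'greed' or a 'fear' incentive to defect."""
    prefers_cooperation = R_C > P_D            # (1)
    exploitation_hurts  = R_C > S_C            # (2)
    beats_mixing        = 2 * R_C > T_D + S_C  # (3)
    greed = T_D > R_C                          # (4a)
    fear  = P_D > S_C                          # (4b)
    return prefers_cooperation and exploitation_hurts and beats_mixing and (greed or fear)

# e.g., the classic Prisoner's Dilemma payoffs satisfy all four conditions:
# is_social_dilemma(R_C=3, P_D=1, S_C=0, T_D=5)  # -> True
```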
##### Quick Overview of Algorithms Being Compared
- Proximal Policy Optimization (PPO)
- On-Policy
- Discrete or Continuous action spaces
- Essentially a Policy Gradient Algorithm
- Aims to take the largest update step possible, without causing performance to degrade.
- Two Main Variants:
- Penalty
- penalizes the KL-divergence in the objective function instead of constraining it (as in TRPO).
- penalty coefficient can change dynamically over time so it scales appropriately
- Clip
- I think this is the more commonly used variant
- no KL term in the objective
- no constraints
- just 'clips' the objective function so that the new policy never moves too far from the old policy (the clipped objective is written out below)
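For reference, the standard PPO-Clip surrogate objective, with probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ and advantage estimate $\hat{A}_t$:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$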
##### Updated Plots

##### Other Updates
- I have a render function
- need to figure out how to get it to play nice with ray tune experiments...
- Logging out final policies for all agents
- from here, I can come up with interesting states to send through the policies to try and characterize how these different policies behave
- for example, I can hand-craft particular states for which I solve the optimal trajectory, and run policies over these states (a rough sketch of this probing loop follows this list) to quantify
- how far from optimal is a policy? (under these conditions)
- attempt to characterize qualitative policy behavior
- gather insight into how different forms of communication qualitatively affect a policy under different circumstances
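
A rough sketch of what that probing loop could look like; `policy`, `rollout_fn`, and `optimal_returns` are all placeholders for whatever I end up restoring from the saved checkpoints and solving by hand for each handcrafted state:

```python
from typing import Any, Callable, Dict

Observation = Any
Action = int

def suboptimality_gap(policy: Callable[[Observation], Action],
                      rollout_fn: Callable[[Callable[[Observation], Action], Observation], float],
                      handcrafted_states: Dict[str, Observation],
                      optimal_returns: Dict[str, float]) -> Dict[str, float]:
    """For each handcrafted starting state, roll the trained policy out and report
    how far its return falls short of the hand-solved optimum (0 means it matched)."""
    gaps = {}
    for name, state in handcrafted_states.items():
        achieved = rollout_fn(policy, state)           # episode return starting from this state
        gaps[name] = optimal_returns[name] - achieved  # suboptimality gap for this scenario
    return gaps
```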
#### 12/7/22 Plots
##### Env: Cleanup
>A public goods dilemma in which agents get a reward for consuming apples, but must use a cleaning beam to clean a river in order for apples to grow. While an agent is cleaning the river, other agents can exploit it by consuming the apples that appear.
```
@ = wall
A = Apple spawn point
P = Player spawn point
```
##### Model of Other Agents (MOA)
>Finally, we show that influence agents can be trained
independently, when each agent is equipped with an internal
neural network Model of Other Agents (MOA), which has
been trained to predict the actions of every other agent. The
agent can then simulate counterfactual actions and use its
own internal MOA to predict how these will affect other
agents, thereby computing its own intrinsic influence reward.
Influence agents can thus learn socially, only through
observing other agents’ actions, and without requiring a
centralized controller or access to another agent’s reward
function. Therefore, the influence reward offers us a simple,
general and effective way of overcoming long-standing
unrealistic assumptions and limitations in this field of
research, including centralized training and the sharing of
reward functions or policy parameters
<pre>
Map: 1D Small
@@@@@@@@@@@@
@A P A P A@
@@@@@@@@@@@@
</pre>

<pre>
Map 2D Small
@@@@@@@@@@@@
@A P A A@
@A A A AA@
@A A P A@
@@@@@@@@@@@@
</pre>

<pre>
Map: 1D Med
@@@@@@@@@@@@@@@@@@@@@@@@
@A P A A AAAA PA@
@@@@@@@@@@@@@@@@@@@@@@@@
</pre>

<pre>
Map 2D Med
@@@@@@@@@@@@@@@@@@@@@@@@
@A A A A A @
@A AAA PAA AAA @
@AAAAAA AAAAAA @
@A AAAAA @
@A AAA P @
@A A AAA A A AA @
@@@@@@@@@@@@@@@@@@@@@@@@
</pre>

<pre>
Map: 1D Large
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@A P AAAAAAA A A AA AAAA A P @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
</pre>

<pre>
Map 2D Large
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@A P A A A A AA AA@
@A A A AA AAAAAAAAAAAAAAAA @
@A A A @
@AA AAAAAAA AAAAAAAAA A A AA A A@
@A A A AA AAAA AAA AAA@
@A A A AAA A A A A A@
@A A A AA AAAA @
@A A A AAAAA AA AA @
@A A A P @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
</pre>

#### Fixed states to test on trained policy
A few environmental levers for creating interesting states to probe trained policies (a small map-parsing helper is sketched after the example maps below):
- **Starting Resources**
- Number / Location of apples
- Number / Location of player spawn points
- **Relationship between Agents**
- Can (not) navigate the same squares
- Local regions surrounding the agents (abundant / sparse)
- **Sizing**
- Large vs. Small Map
- Does the size of the map influence the effectiveness of the communication? e.g. do larger state spaces benefit from the additional information, while small state spaces can be navigated without it?
- **Rate**
- How quickly do apples regenerate?
<pre>
"Basically Nothing"
- Minimal Map, no room for agents to move anywhere
@@@@@@
@APPA@
@@@@@@
"Hallway"
- Minimal 1D Map
- Small, Medium, Large
@@@@@@@@@
@A P P A@
@@@@@@@@@
"Surrounded By Apples"
- An abundance of apples
- Small, Medium, Large
@@@@@@@
@AAAAA@
@AAPAA@
@AAAAA@
@AAPAA@
@AAAAA@
@@@@@@@
"Sparse"
- Very few apples
- Small, Medium, Large
@@@@@@@@@@@@@@@@@@@@@
@A P@
@ @
@ @
@ @
@ A @
@ @
@ @
@ @
@P A@
@@@@@@@@@@@@@@@@@@@@@
</pre>
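
Since all of these test states are just ASCII grids, a small parser makes them easy to generate programmatically. This is my own helper (the integer codes are arbitrary), not part of the sequential_social_dilemma_games code:

```python
import numpy as np

# arbitrary integer codes for the map legend: @ = wall, A = apple spawn, P = player spawn
CELL_CODES = {"@": 0, " ": 1, "A": 2, "P": 3}

def parse_ascii_map(ascii_map: str) -> np.ndarray:
    """Turn an ASCII map (rows of @, A, P, and spaces) into an integer grid,
    padding ragged rows with empty cells."""
    rows = ascii_map.strip("\n").splitlines()
    width = max(len(row) for row in rows)
    grid = np.full((len(rows), width), CELL_CODES[" "], dtype=np.int8)
    for i, row in enumerate(rows):
        for j, ch in enumerate(row):
            grid[i, j] = CELL_CODES.get(ch, CELL_CODES[" "])
    return grid

# example: the "Basically Nothing" map above
# parse_ascii_map("@@@@@@\n@APPA@\n@@@@@@").shape  # -> (3, 6)
```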
##### Schelling Diagram
> Explicitly, a Schelling diagram [45](https://www.jstor.org/stable/173406), [18](https://arxiv.org/pdf/1707.06600.pdf) depicts the relative payoffs for a single cooperator or defector given a fixed number of other cooperators. Thus Schelling diagrams are a natural and convenient generalization of payoff matrices to multi-agent settings. Game-theoretic properties like Nash equilibria are readily visible in Schelling diagrams; see [45](https://www.jstor.org/stable/173406) for additional details and intuition.
(Diagram setup: n-agents = 5, D = 15x15x3)
What this shows is the individual-agent payoff as a function of the number of agents that cooperate. As long as there is at least one cooperating agent, greedy policies are more personally beneficial, but as the number of cooperating agents increases, everyone does better - even the defecting agents.

- what was the objective here? individual or collective?
- How can I measure the benefit of communication?
- Converge Faster
- Converge Higher
- play with the distribution of apples / agents spawning
- focus on the 1d case first...
- what happens when the apples are all over here <-- and players all spawn --> ?
- also, to be clear, we're focusing on training right now
- is there some variant of that Schelling Diagram we can come up with? (a rough sketch of how one could be computed empirically is below)
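
One rough way to build a Schelling-style diagram empirically, assuming I can label each agent in a rollout as "cooperating" or "defecting" and record its individual return: for each count of *other* cooperators, average the return of cooperators and of defectors separately. Everything here is a sketch under that labeling assumption:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def schelling_curves(episodes: List[List[Tuple[bool, float]]]
                     ) -> Dict[str, Dict[int, float]]:
    """episodes: one list per episode of (is_cooperator, individual_return) per agent.
    Returns average cooperator / defector payoff keyed by the number of *other* cooperators."""
    buckets: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for agents in episodes:
        n_coop = sum(1 for is_coop, _ in agents if is_coop)
        for is_coop, ret in agents:
            others = n_coop - (1 if is_coop else 0)   # cooperators other than this agent
            buckets[("cooperator" if is_coop else "defector", others)].append(ret)
    curves: Dict[str, Dict[int, float]] = {"cooperator": {}, "defector": {}}
    for (label, others), rets in buckets.items():
        curves[label][others] = sum(rets) / len(rets)
    return curves
```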