# SSL for RL Notes
## Reachability
**General intuition**: the representations of two states should be *similar* if they are *reachable* from each other (i.e. similarity should reflect the number of actions it takes to reach one state from the other)
#### Fastest Reach Time criterion
Optimize the latent representation,
$$
||z_i - z_j|| \propto \min_\pi \eta_\pi(s_i \rightarrow s_j) \quad \forall s_i, s_j \in \mathcal{S}\,,
$$
where $\eta_\pi(s_i \rightarrow s_j)$ is the expected number of steps, under some policy $\pi$, to reach state $s_j$ from state $s_i$ (for the first time). So $\min_\pi \eta_\pi(s_i \rightarrow s_j)$ is the "optimal" number of steps to go from $s_i$ to $s_j$. Further, the policy $\pi^* = \arg\min_\pi \eta_\pi(s_i \rightarrow s_j)$ is the optimal goal-conditioned policy for reaching state $s_j$ from $s_i$.
To make things simpler, we can restrict the optimization to pairs with $\eta < n$ (e.g. $n = 10$), so the representation only captures local reachability.
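A minimal sketch of what optimizing the local criterion could look like (assumptions: flat state vectors, a toy MLP encoder `phi`, and $(s_t, s_{t+k})$ pairs taken from stored trajectories; note that the within-trajectory gap $k$ only upper-bounds $\min_\pi \eta_\pi$, so this is a crude surrogate):

```python
import torch
import torch.nn as nn

# Hypothetical encoder; the input/latent sizes are placeholders, not tied to any env.
phi = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 32))
opt = torch.optim.Adam(phi.parameters(), lr=3e-4)

def local_reach_loss(states, n=10):
    """states: (T, state_dim) tensor from one sampled trajectory.
    Regress ||z_t - z_{t+k}|| onto k for all pairs with k < n (the local criterion).
    Caveat: k is the reach time of the *behaviour* policy, an upper bound on min_pi eta_pi."""
    z = phi(states)                                   # (T, latent_dim)
    T, loss, count = states.shape[0], 0.0, 0
    for k in range(1, min(n, T)):
        d = torch.norm(z[:-k] - z[k:], dim=1)         # distances between s_t and s_{t+k}
        loss = loss + ((d - k) ** 2).mean()           # push ||z_t - z_{t+k}|| toward k
        count += 1
    return loss / max(count, 1)

# Dummy trajectory standing in for real rollout data.
traj = torch.randn(50, 6)
opt.zero_grad()
local_reach_loss(traj).backward()
opt.step()
```

Regressing the distance directly onto $k$ (rather than only enforcing proportionality) also sidesteps the trivial $z = \mathbf{0}$ solution mentioned below.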
<span style="color:red">
TODO
</span>
- How to optimize this objective efficiently?
- Can we estimate $\eta_\pi (s_i \rightarrow s_j)$ for a fixed policy using TD-style methods? (see the sketch after this list)
- Since computing $\min_\pi \eta_\pi$ itself seems to require solving an MDP?
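For the fixed-policy question above, $\eta_\pi(s \rightarrow g)$ is an expected first-passage time and satisfies $\eta_\pi(g \rightarrow g) = 0$ and $\eta_\pi(s \rightarrow g) = 1 + \mathbb{E}_{s'}[\eta_\pi(s' \rightarrow g)]$ for $s \neq g$, so a TD(0)-style update does seem possible. A tabular sketch on a toy ring MDP with a random fixed policy (all names and sizes here are illustrative):

```python
import numpy as np

# Tabular TD(0) sketch for the expected first-passage time eta_pi(s -> g) under a
# *fixed* policy, using the recursion
#   eta(g -> g) = 0,    eta(s -> g) = 1 + E[eta(s' -> g)]  for s != g.
# Toy setup: a ring of states with a random-walk policy; everything here is illustrative.
n_states, alpha, n_episodes = 8, 0.1, 2000
eta = np.zeros((n_states, n_states))            # eta[s, g]: estimated steps from s to g
rng = np.random.default_rng(0)

for _ in range(n_episodes):
    s = rng.integers(n_states)
    for _ in range(50):                          # roll out the fixed (random-walk) policy
        s_next = (s + rng.choice([-1, 1])) % n_states
        for g in range(n_states):                # update every goal from each transition
            target = 0.0 if s == g else 1.0 + eta[s_next, g]
            eta[s, g] += alpha * (target - eta[s, g])
        s = s_next

print(eta[0])                                    # estimated hitting times from state 0
```

Getting $\min_\pi \eta_\pi$ would then look like a stochastic shortest-path problem per goal (reward $-1$ per step until $g$ is reached), which is presumably why it feels like solving an MDP.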
<span style="color:red">
TODO: is there something we can prove here (e.g. similar to the policy improvement theorem for values, and/or a triangle inequality)? Maybe it is totally trivial; will need to write it out. Some intuitions:
</span>
- Let $\eta(s_i, s_j) = \min_\pi \eta_\pi (s_i \rightarrow s_j)$; I think $\eta(s_1, s_3) \leq \eta(s_1, s_2) + \eta(s_2, s_3)$ (a quick numerical check is sketched after this list)
- I think it needs some ergodicity assumptions (i.e. all states are reachable from all other states under some policy); the above breaks down if, for instance, $s_2$ and $s_3$ are both absorbing / terminating states, since $\eta(s_2, s_3)$ is then infinite
- If we have the triangle inequality above, this makes $\eta$ a quasimetric, i.e. $\eta(s,s') = 0$ if $s = s'$ but symmetry may fail (somewhat like KL), or a metric if we also have symmetry, $\eta(s,s') = \eta(s',s)$ (although this latter point probably is not important).
- Although, if we are defining $||z_i - z_j||$ to be an (e.g. Euclidean) norm then we are actually enforcing symmetry too (i.e. we cannot have a policy independent state representation that is asymmetric)...
- Alternatively we can project $z$ nonlinearly before enforcing the distance constraint?
- Are we actually trying to learn a Euclidean embedding of the state space? (I don't know enough about topology to define this)
- The trivial solution exists here, i.e. $z = \textbf{0}$ for all states is a valid solution (proportionality constant of zero), so we will need to regularize
- Are there other identities to be had here (e.g. a Bellman form, a policy improvement theorem, etc.) that can lead to efficient algorithms (that do not require searching over all decision rules to define the distance)?
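A quick numerical sanity check of the triangle-inequality intuition above, on a random deterministic MDP where $\min_\pi \eta_\pi$ reduces to shortest-path length in the transition graph (toy sizes, nothing here is load-bearing):

```python
import numpy as np

# Numerical check of the triangle inequality for eta(s_i, s_j) = min_pi eta_pi on a
# random *deterministic* MDP, where the optimal reach time is just the shortest-path
# length in the transition graph (computed with Floyd-Warshall).
rng = np.random.default_rng(0)
n, n_actions = 12, 3
succ = rng.integers(n, size=(n, n_actions))      # succ[s, a] = deterministic next state

eta = np.full((n, n), np.inf)
np.fill_diagonal(eta, 0.0)
for s in range(n):
    eta[s, succ[s]] = np.minimum(eta[s, succ[s]], 1.0)   # one-step edges
for m in range(n):                               # Floyd-Warshall relaxation
    eta = np.minimum(eta, eta[:, [m]] + eta[[m], :])

# Unreachable pairs stay at inf, in which case the inequality below holds trivially;
# this is where the ergodicity / reachability assumption matters.
a, b, c = np.meshgrid(range(n), range(n), range(n), indexing="ij")
assert np.all(eta[a, c] <= eta[a, b] + eta[b, c] + 1e-9)  # triangle inequality
print("symmetric:", np.allclose(eta, eta.T))              # usually False: eta is a quasimetric
```

In deterministic MDPs the optimal reach time is exactly a shortest-path distance, so the triangle inequality holds there; whether the same argument goes through cleanly for expected first-passage times in stochastic MDPs is the part to write out.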
#### Rough Brainstorm
- Generally: how do we push the state representations of "reachable" states to be close together?
- Can: sample a trajectory and push the state observations in that trajectory to be close together (i.e. observations in the same trajectory are positive pairs for SSL; a contrastive sketch follows this list)
- Can we: propagate some notion of similarity across the state space while only using local learning?
- Similar intuition to how TD learning can leverage only one-step returns yet still allow accurate value estimates to propagate across the entire state space
- Might need to construct a similarity measurement that has a Bellman form for this to work properly in theory
- Alternatively, if we have a model that only measures local similarity, we can also roll out the model to estimate more distant similarity
- E.g. learn the state representation by making states that fall within a 10-step window of the same trajectory similar
- A transition model in this representation space could then estimate the similarity between more distant states more efficiently? (i.e. are we using a transition model to construct state representations, rather than for action selection?)
- Can try to learn $f(s,s') \rightarrow n$, where $n$ is the number of steps between states $s$ and $s'$.
- This can help us train a goal-conditioned policy $\pi(s,s')$
- Can we also treat $f(s,s') \rightarrow n$ as an energy-like function for self-supervised learning?
- I.e. this is very similar to the _Joint embedding EBM_, which has the general form of learning an energy function $F(x,y) \rightarrow \mathbb{R}$, where we make inputs $x$ and $y$ similar by pushing down their energy $F(x,y)$.
- Maybe we use the $n$-step prediction as the main objective, or maybe we add it as an additional regularizer; unsure.
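A contrastive sketch of the positive-pair idea above (InfoNCE-style; assumptions: flat state vectors, a small MLP encoder, and batches of $(s_t, s_{t+k})$ pairs with $k < n$ drawn from stored trajectories, with the other pairs in the batch acting as negatives):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# InfoNCE-style sketch of "states within n steps of each other are positive pairs".
# A real version would build the batch by sampling (s_t, s_{t+k}) with k < n from
# stored trajectories; the dims below are placeholders.
phi = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 32))

def infonce_loss(anchors, positives, temperature=0.1):
    """anchors, positives: (B, state_dim); row i of `positives` is < n steps from row i
    of `anchors`. The other rows in the batch act as negatives."""
    za = F.normalize(phi(anchors), dim=1)
    zp = F.normalize(phi(positives), dim=1)
    logits = za @ zp.t() / temperature            # (B, B) cosine-similarity matrix
    labels = torch.arange(za.shape[0])            # the matching index is the positive
    return F.cross_entropy(logits, labels)

# Dummy batch standing in for (s_t, s_{t+k}) pairs sampled from replay trajectories.
s_t = torch.randn(64, 6)
s_tk = s_t + 0.1 * torch.randn(64, 6)
loss = infonce_loss(s_t, s_tk)
loss.backward()
```

A step-count head $f(s, s') \rightarrow n$ could be trained on the same pairs and treated as the energy directly; whether it should replace the contrastive term or just regularize it is the open question above.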
##### Some possibly related papers I found via a brief search
- [Learning Actionable Representations with Goal-Conditioned Policies](https://arxiv.org/abs/1811.07819), ICLR 2019
- [Planning with Goal-Conditioned Policies](https://arxiv.org/abs/1911.08453), NeurIPS 2019
- [Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning](https://arxiv.org/abs/2101.05265), ICLR 2021
# Research Updates
#### 2021 Nov 4
- Training a goal-conditioned policy (state-based SAC + HER)
- From the Stable-Baselines3 HER tutorial [here](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#hindsight-experience-replay-her)
- Environment: parking to a goal, from [highway-env](https://github.com/eleurent/highway-env)
<img src="https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif" alt="drawing" width="300"/>
- States: `[x, y, vx, vy, cos_h, sin_h]`
- Actions: `[acceleration, steering]`
- Evaluating if $||z_{t} - z_{t+k}|| \propto k$ arises naturally in the learned representation,
- Using a trained agent (~0.99 success rate of reaching the goal), some examples of the similarity structure of the policy outputs (top, dim=2) and the latent representation before the policy head (bottom, dim=256)
- Each matrix is constructed using a single sampled trajectory, so it is $T \times T$, where $T$ is the length of the trajectory (a sketch of this computation is at the end of this section)
- Sample trajectory (top: policy, bottom: latent)


(all distances, so larger = less similar, i.e. correlation distance = 1 - correlation)
- Sample trajectory 2


- Sample trajectory 3


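A sketch of how the $T \times T$ matrices above can be computed from a single rollout (here `latents` and `actions` are random placeholders standing in for the (T, 256) pre-policy features and (T, 2) policy outputs recorded while running the trained agent):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Build the T x T correlation-distance matrices (1 - correlation, so larger = less similar)
# for a single rollout. `latents` / `actions` are random placeholders for the features and
# policy outputs collected from the trained SAC+HER agent.
T = 40
latents = np.random.randn(T, 256)
actions = np.random.randn(T, 2)

latent_dist = squareform(pdist(latents, metric="correlation"))   # (T, T), bottom plots
action_dist = squareform(pdist(actions, metric="correlation"))   # (T, T), top plots

# Average each off-diagonal to see whether distance grows with the step gap k.
gap_profile = [np.diagonal(latent_dist, offset=k).mean() for k in range(1, T)]
print(gap_profile[:5])
```

Averaging each off-diagonal gives a quick read on whether the latent distance grows (roughly linearly) with the step gap $k$.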