# **meeting 08/29**
**Advisor: Prof. Chih-Yu Wang \
Presenter: Shao-Heng Chen \
Date: Aug 29, 2023**
<!-- Chih-Yu Wang -->
<!-- Wei-Ho Chung -->
## **Paper reading**
- C. Huang, R. Mo and C. Yuen, "[Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning](https://ieeexplore.ieee.org/abstract/document/9110869)," in *IEEE Journal on Selected Areas in Communications*, vol. 38, no. 8, pp. 1839-1850, Aug. 2020. (Cited by 397)
### **System Model**
- Downlink RIS-aided MU-MISO system
- System settings
- $K$ single-antenna users
- $M$-antenna BS (where $M \geq K$)
    - $N$-element RIS
- Schematic diagram
- Severe signal **blockage** between BS and users (no direct path from BS to users)

- The channel model
    - Assume a **frequency-flat fading** channel (i.e., a narrowband assumption)
- $\mathbf{H}_1 \in \mathbb{C}^{N \times M}$ is the BS-RIS channel
- the subscript $1$ indicates the first half of the channel
- $\mathbf{h}_{k, 2} \in \mathbb{C}^{N \times 1}$ is the RIS-user $k$ channel
- the subscript $2$ indicates the second half
- Some assumptions
    - **Perfect CSI** is assumed to be known at both the BS and the RIS
- obtaining CSI at the RIS is a challenging task though
- **Ideal reflection** (lossless)
- $|\mathbf{\Phi}(n, n)|^2 = 1$
- The signal received at the $k$-th user is given as
$$
\begin{equation*}
y_{k} = \mathbf {h}_{k,2}^{T}\mathbf{\Phi }\mathbf {H}_{1}\mathbf {Gx}+w_{k}
\end{equation*}
$$
    - $y_k$ is the complex scalar signal received at the $k$-th user (a small numerical sketch of the full model follows this list)
    - $w_k$ is the zero-mean complex AWGN with variance $\sigma_n^2$ at the $k$-th user
- $\mathbf{x} \in \mathbb{C}^{K \times 1}$ is the data streams for $K$ users
- a column vector of dimension $K \times 1$
        - with zero-mean, unit-variance entries, $\mathcal {E}\{|x_k|^2\} = 1$
- $\mathbf{G} \in \mathbb{C}^{M \times K}$ is the transmit beamforming matrix (applied at the BS)
- $\mathbf{\Phi} \triangleq diag(\phi_1, ..., \phi_N) \in \mathbb{C}^{N \times N}$ is the diagonal phase shift matrix applied at the RIS
- $\mathbf{\Phi}(n, n) = \phi_n = e^{j\varphi_n}$
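To make the dimensions concrete, here is a minimal NumPy sketch of the received-signal model above. The system sizes, the Rayleigh-style channel draws, and the noise level are my own illustrative assumptions, not the paper's simulation settings.

```python
# Minimal sketch of y_k = h_{k,2}^T Phi H_1 G x + w_k for all K users at once.
# All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 4, 16, 4            # BS antennas, RIS elements, users (assumed sizes)
sigma_n = 0.1                 # noise standard deviation (assumed)

def crandn(*shape):
    """Unit-variance circularly symmetric complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H1  = crandn(N, M)            # BS-RIS channel H_1
H2  = crandn(K, N)            # row k is h_{k,2}^T (RIS-user k channel)
G   = crandn(M, K)            # transmit beamforming matrix at the BS
phi = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, N))   # unit-modulus RIS coefficients
Phi = np.diag(phi)            # diagonal phase-shift matrix
x   = crandn(K)               # unit-variance data symbols, one per user
w   = sigma_n * crandn(K)     # AWGN at the users

y = H2 @ Phi @ H1 @ G @ x + w # y[k] is the complex scalar received by user k
print(y.shape)                # (K,)
```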
### **Problem Formulation**
- This paper adopts the **ergodic sum rate** as the metric to evaluate the system performance
    - I had not come across this metric before, and the difference between the ergodic sum rate and the plain (instantaneous) sum rate was not clear to me
    - So, what is the ergodic achievable rate?
        - It refers to the average data rate that a communication system can achieve over a long period of time, considering the inherent randomness and variations in the channel conditions
        - It is the average rate at which data is transmitted over the channel; the "ergodic" averaging is an expectation over the channel fading realizations (a small Monte Carlo sketch after the formulas below illustrates this averaging)
        - It explains the usefulness and efficiency of massive MIMO technology and is usually derived from the Shannon theorem
- The formula of ergodic-rate is given by
$$
\begin{equation*}
{R_k} = {\log _2}\left( {1 + SIN{R_k}} \right)
\end{equation*}
$$
- The ergodic-sum-rate for $K$ users is written as
$$
\begin{equation*}
{R_{sum}} = \sum\limits_{k = 1}^K E \left\{ {{R_k}} \right\} \approx K \cdot {\log _2}\left( {1 + SIN{R_k}} \right)
\end{equation*}
$$
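As far as I understand it, the "ergodic" part simply means taking an expectation of the instantaneous rate over the channel realizations. The sketch below illustrates that averaging with a Monte Carlo loop; the exponential SINR draws are only a placeholder for whatever SINR a given channel realization would produce.

```python
# Monte Carlo illustration of the ergodic (sum) rate: average log2(1 + SINR_k)
# over many channel realizations. The SINR samples here are placeholders.
import numpy as np

rng = np.random.default_rng(1)
K, num_realizations = 4, 10_000
avg_sinr = 10.0                                  # assumed average SINR (linear scale)

rates = np.empty((num_realizations, K))
for t in range(num_realizations):
    sinr = rng.exponential(avg_sinr, size=K)     # stand-in for the per-user SINR of one channel draw
    rates[t] = np.log2(1.0 + sinr)               # instantaneous per-user rate R_k

ergodic_rate_per_user = rates.mean(axis=0)       # E{R_k}, approximated by the sample mean
ergodic_sum_rate = ergodic_rate_per_user.sum()   # R_sum = sum_k E{R_k}
print(ergodic_sum_rate)
```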
- The above received signal model can be further written as
$$
\begin{equation*}
y_{k} = \mathbf {h}_{k,2}^{T}\mathbf{\Phi }\mathbf {H}_{1}\mathbf {g}_{k}x_{k}+\sum _{n, \ n\neq k}^{K}\mathbf {h}_{k,2}^{T}\mathbf{\Phi }\mathbf {H}_{1}\mathbf {g}_{n}x_{n}+w_{k}
\end{equation*}
$$
    - where $\mathbf{g}_k$ is the $k$-th column of the matrix $\mathbf{G}$
- Without joint detection of data streams for all users, the second term is treated as cochannel interference.
- The SINR at the $k$-th user is given by
$$
\begin{equation*}
\rho _{k} = \frac {|\mathbf {h}_{k,2}^{T}\mathbf {\Phi H}_{1}\mathbf {g}_{k}|^{2}}{\sum\limits_{n, \ n\neq k}^{K} | \mathbf {h}_{k,2}^{T}\mathbf {\Phi H}_{1}\mathbf {g}_{n}|^{2}+\sigma _{n}^{2}}
\end{equation*}
$$
- The objective is to find the optimal $\mathbf{G}$ and $\mathbf{\Phi}$ that maximize the sum rate; the optimization problem can be formulated as
$$
\begin{align*}
\max \limits_{\mathbf{G}, \mathbf{\Phi }} \;\; &C(\mathbf {G},\mathbf{\Phi }, \mathbf {h}_{k,2},\mathbf {H}_{1}) \\
\textrm {s.t.} \;\; &tr\{\mathbf{G}\mathbf{G}^{\mathcal {H}} \} \leq P_{t} \\
&|\phi _{n}|=1 \;\; \forall n=1,2, \ldots, N.
\end{align*}
$$
    - The first constraint ensures that the total transmission power at the BS does not exceed the budget $P_t$
    - The second constraint is the unit-modulus constraint on the RIS phase shifts (in our case, however, we will have phase-dependent amplitude with discrete phases); a small sketch of the objective and these constraints follows below
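Below is a hedged NumPy sketch of the objective and the two constraints, using the same shapes as the signal-model sketch earlier. The helper names (`sum_rate`, `project_to_constraints`) are my own; the paper does not prescribe this implementation.

```python
# Sum-rate objective C(G, Phi, h_{k,2}, H_1) and a simple way to enforce the
# power and unit-modulus constraints (illustrative helpers, not the paper's code).
import numpy as np

def sum_rate(G, phi, H1, H2, sigma_n2):
    """C = sum_k log2(1 + SINR_k) for one channel realization.

    H2 has h_{k,2}^T as its k-th row; phi holds the N RIS coefficients.
    """
    A = H2 @ np.diag(phi) @ H1 @ G            # A[k, n] = h_{k,2}^T Phi H_1 g_n
    signal = np.abs(np.diag(A)) ** 2          # desired-link power of user k
    interference = np.sum(np.abs(A) ** 2, axis=1) - signal
    sinr = signal / (interference + sigma_n2)
    return float(np.sum(np.log2(1.0 + sinr)))

def project_to_constraints(G, phi, P_t):
    """Scale G so that tr(G G^H) <= P_t and force |phi_n| = 1 for all n."""
    power = np.real(np.trace(G @ G.conj().T))
    if power > P_t:
        G = G * np.sqrt(P_t / power)
    phi = phi / np.abs(phi)
    return G, phi

# Example (with the variables from the earlier sketch):
#   G, phi = project_to_constraints(G, phi, P_t=1.0)
#   print(sum_rate(G, phi, H1, H2, sigma_n**2))
```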
### **Method**
- Deep Deterministic Policy Gradient (DDPG)
- Actor-Critic Architecture:
- DDPG is an extension of the actor-critic architecture
- In this setup, there are 2 main components: the actor and the critic
- The actor is responsible for learning a policy that maps states to actions (approximate the action)
- The critic evaluates the quality of the actions chosen by the actor
- Deterministic Policy:
- Unlike some other DRL algorithms that learn stochastic policies (where the agent selects actions based on a probability distribution), DDPG learns a deterministic policy
- This means that given a state, the actor **directly outputs the best action** to take
- Experience Replay:
- Similar to other DRL algorithms, DDPG utilizes experience replay
- This involves storing past experiences (state, action, reward, next state) in a replay buffer and sampling mini-batches from it during training
- Experience replay helps in breaking the temporal correlations in the data, leading to more stable learning
- Target Networks:
        - DDPG employs **2 sets** of neural networks for both the actor and the critic: the "target" networks and the "training" (online) networks
        - The target networks are used to estimate the target value function and the next action, while the training (online) networks are the ones updated during training
        - **Soft updates** (blending in a small fraction of the training network's weights) are applied to the target networks to stabilize learning
<!-- - Target Q-Network:
- The critic in DDPG learns the Q-function, which estimates the expected cumulative reward starting from a given state and taking a specific action
- This Q-network is trained using a target Q-network to reduce the learning's inherent instability in deep networks -->
- Policy Gradient Update:
- The actor is updated using policy gradients, with the goal of maximizing the expected cumulative reward
        - The gradients are propagated through the critic network using the chain rule of derivatives (i.e., backpropagation)
- This process guides the actor to select actions that have higher estimated Q-values
- Ornstein-Uhlenbeck Noise:
- To **add exploration** in the continuous action space, DDPG uses Ornstein-Uhlenbeck noise
        - This noise process introduces temporally correlated noise to the actions, allowing the agent to explore the action space more effectively (a small sketch of such a noise process follows this list)
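Here is a small sketch of an Ornstein-Uhlenbeck noise process as it is commonly used for DDPG exploration; the θ, σ, and dt values are generic defaults I chose for illustration, not values taken from the paper.

```python
# Ornstein-Uhlenbeck process: temporally correlated noise added to the
# deterministic actor output during training. Parameter values are assumptions.
import numpy as np

class OUNoise:
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=np.float64)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x

# usage: exploratory_action = actor_output + noise.sample()
# (followed by re-projection onto the power / unit-modulus constraints)
```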
- DDPG network
<img src='https://hackmd.io/_uploads/HkIwzJcah.png' width=65% height=65%>
<img src='https://hackmd.io/_uploads/SJer9k9ph.png' width=85% height=65%>
- The training network
- The updates on the training critic network are given as follows
$$
\begin{align*}
\mathbf{\theta }_{c}^{(t+1)} &= \mathbf{\theta }_{c}^{(t)} - \mu _{c} \nabla _{\mathbf{\theta }_{c}^{(train)}} \ell (\mathbf{\theta }_{c}^{(train)}) \\
\ell (\mathbf{\theta }_{c}^{(train)}) &= \bigg (r^{(t)}+\gamma q(\mathbf{\theta }_{c}^{(target)}|s^{(t+1)},a') - q(\mathbf{\theta }_{c}^{(train)}|s^{(t)},a^{(t)}) \bigg)^{2}
\end{align*}
$$
- $\mu_c$ is the learning rate for the update on training critic network
            - $a'$ is the action output by the **target** actor network for the next state $s^{(t+1)}$
- $\mathbf{\theta }_{c}^{(train)}$ is the training critic network
- $\mathbf{\theta }_{c}^{(target)}$ is the target critic network
                - The target network's parameters are periodically updated toward those of the training network
                - The target network therefore changes much more slowly than the training network
            - $\nabla _{\mathbf{\theta }_{c}^{(train)}} \ell (\mathbf{\theta }_{c}^{(train)})$ is the gradient of the loss with respect to the training critic network's parameters
- The update on the training actor network is given as
$$
\begin{align*}
\mathbf{\theta }_{a}^{(t+1)} = \mathbf{\theta }_{a}^{(t)} - \mu _{a} \nabla _{a}q(\mathbf{\theta }_{c}^{(target)}|s^{(t)}, a) \, \nabla _{\mathbf{\theta }_{a}^{(train)}} \pi(\mathbf{\theta }_{a}^{(train)}|s^{(t)})
\end{align*}
$$
            - $\mu_a$ is the learning rate for the update on the training actor network
- $\mathbf{\theta }_{a}^{(train)}$ is the training actor network
            - $\pi(\mathbf{\theta }_{a}^{(train)}|s^{(t)})$ is the output of the training actor network given input $s^{(t)}$
            - $\nabla _{a}q(\mathbf{\theta }_{c}^{(target)}|s^{(t)}, a)$ is the gradient of the target critic network's output with respect to the action
            - $\nabla _{\mathbf{\theta }_{a}^{(train)}} \pi(\mathbf{\theta }_{a}^{(train)}|s^{(t)})$ is the gradient of the training actor network's output with respect to its own parameters
- Note
                - The update of the training actor network $\mathbf{\theta }_{a}^{(train)}$ is affected by the target critic network $\mathbf{\theta }_{c}^{(target)}$ through the gradient of the target critic network with respect to the action $a$
                    - which ensures that the next action is chosen in the direction that increases the $Q$-value function
- The target network
- The updates on the target critic network and the target actor network are given as follows
$$
\begin{align*}
\mathbf{\theta }_{c}^{(target)} \leftarrow &\tau_{c} \mathbf{\theta }_{c}^{(train)} + (1 - \tau _{c}) \mathbf{\theta }_{c}^{(target)} \\
\mathbf{\theta }_{a}^{(target)} \leftarrow &\tau_{a} \mathbf{\theta }_{a}^{(train)} + (1 - \tau_{a}) \mathbf{\theta }_{a}^{(target)}
\end{align*}
$$
            - where $\tau_{c}, \tau_{a}$ are the soft-update rates for the target critic network and the target actor network, respectively (a consolidated sketch of all three update steps follows below)
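The critic update, the actor update, and the two soft target updates above can be condensed into a single training step. The PyTorch sketch below is my own rendering under generic assumptions (small MLPs, a mini-batch already sampled from the replay buffer, arbitrary hyper-parameters); it is not the paper's released code. Note that, following the equations above, the actor gradient is taken through the *target* critic, whereas vanilla DDPG usually uses the training critic there.

```python
# One DDPG training step: critic regression toward the TD target, actor update
# through the critic's action-gradient, and soft (Polyak) target updates.
# Network sizes and hyper-parameters below are illustrative assumptions.
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 32, 8                                        # assumed D_s, D_a
actor  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)      # target networks
for p in list(actor_t.parameters()) + list(critic_t.parameters()):
    p.requires_grad_(False)                                          # targets are never trained directly

opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)                # mu_a
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)               # mu_c
gamma, tau = 0.99, 1e-3                                              # discount, soft-update rate

def ddpg_step(s, a, r, s_next):
    """s, a, r, s_next: mini-batch tensors from the replay buffer (r has shape [batch, 1])."""
    # Critic: minimize (r + gamma * q_target(s', a') - q_train(s, a))^2
    with torch.no_grad():
        a_next = actor_t(s_next)                                     # a' from the target actor
        y = r + gamma * critic_t(torch.cat([s_next, a_next], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor: ascend the target critic's Q-value of the actor's own action (per the equations above)
    actor_loss = -critic_t(torch.cat([s, actor(s)], dim=1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # Soft target updates: theta_target <- tau * theta_train + (1 - tau) * theta_target
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```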
### **Algorithm**
- *State*
- The state $s^{(t)}$ at time step $t$ consists of 4 components:
- the transmission power at the $t$-th time step
- the received power of users at the $t$-th time step
- the action from the $(t - 1)$-th time step
- the channel matrix $\mathbf {H}_{1} \in \mathbb{C}^{N \times M}$ and $\mathbf{h}_{k,2} \in \mathbb{C}^{N \times 1}$
    - The real part and the imaginary part are separated into independent input ports
    - The dimension of the state space is $D_s = 2K + 2K^2 + (2MK + 2N) + (2NM + 2NK)$
        - I don't fully understand why the dimension of the received-power part is $2K^2$; my guess is that it counts the $K \times K$ complex effective gains $\mathbf{h}_{k,2}^{T}\mathbf{\Phi}\mathbf{H}_{1}\mathbf{g}_{n}$ with real and imaginary parts split (see the dimension check after this list)
- *Action*
- The action is simply constructed by the transmit beamforming matrix $\mathbf {G} \in \mathbb{C}^{M \times K}$ and the phase shift matrix $\mathbf{\Phi} \in \mathbb{C}^{N \times N}$
- The dimension of the action space is $D_a = 2MK + 2N$
- *Reward*
- The reward at the $t$-th time step is determined as the sum rate capacity $C(\mathbf{G}^{(t)}, \mathbf{\Phi}^{(t)}, \mathbf {h}_{k,2}, \mathbf {H}_{1})$
- Given the instantaneous channels $\mathbf {H}_{1}, \mathbf{h}_{k,2}, \forall k$ and the action $\mathbf{G}^{(t)}, \mathbf{\Phi}^{(t)}$ obtained from the actor network
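A quick sanity check of the quoted dimensions, under my reading that the $2K^2$ term counts the $K \times K$ complex effective gains (this is an assumption, not something stated explicitly above):

```python
# Sanity check of D_s = 2K + 2K^2 + (2MK + 2N) + (2NM + 2NK) and D_a = 2MK + 2N.
M, N, K = 4, 16, 4                          # illustrative sizes

D_transmit_power = 2 * K                    # per-user transmit power, real/imag split
D_received_power = 2 * K * K                # K x K gains h_{k,2}^T Phi H_1 g_n, real/imag (my reading)
D_prev_action    = 2 * M * K + 2 * N        # previous action: G (M x K) and the N phase shifts
D_channels       = 2 * N * M + 2 * N * K    # H_1 and all h_{k,2}, real/imag
D_s = D_transmit_power + D_received_power + D_prev_action + D_channels
D_a = 2 * M * K + 2 * N                     # action: G and the N phase shifts
print(D_s, D_a)
```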
<!-- - Algorithm Description
<img src='https://hackmd.io/_uploads/ryl0yV962.png' width=50% height=50%>
-->
### **Results**
- The Benchmarks and Settings
- Weighted Minimum Mean Square Error (WMMSE) algorithm
- Iterative algorithm based on Fractional Programming (FP) with the ZF beamforming
- The hyper-parameter settings
<img src='https://hackmd.io/_uploads/B1JZhN5a3.png' width=50% height=50%>
- The sum rate versus $P_t$ for the proposed DRL-based algorithm in comparison with the two benchmarks
- The sum rates increase with the transmit power $P_t$
<img src='https://hackmd.io/_uploads/HkDDnEqTh.png' width=70% height=50%>
- The sum rate as a function of the number of elements $N$ for the proposed DRL-based algorithm and the two benchmarks ($P_t = 20$ dB, $M = K = 64$)
- The average sum rates increase with the number of elements $N$
<img src='https://hackmd.io/_uploads/rymIaV5ph.png' width=70% height=50%>
- The impact of $P_t$ on the performance
- The rewards as a function of time steps
<img src='https://hackmd.io/_uploads/HyEtgH9a2.png' width=70% height=80%>
        - The average reward is given by
$$
\begin{align*}
\textrm {average}\_{}{\textrm {reward}}(K_{i})=\frac {\sum _{k=1}^{K_{i}} \textrm {reward} (k)}{K_{i}}, \;\; K_{i}=1,2,\ldots,K, \\
\end{align*}
$$
        - The reward converges faster at low SNR ($P_t = 5$ dB) than at high SNR ($P_t = 20$ dB)
            - With higher SNR, the dynamic range of the instantaneous rewards is larger, resulting in more fluctuations and worse convergence
- The average rewards versus time steps under different $P_t = \{ −10dB, 0dB, 10dB, 20dB, 30dB \}$
        - The SNR has a significant effect on the convergence rate and performance
<img src='https://hackmd.io/_uploads/By2xzrqah.png' width=70% height=70%>
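The average-reward curve in these figures is just the running mean of the instantaneous rewards, matching the formula a few lines above; a short equivalent (with a placeholder reward trace):

```python
# Running-average reward over the first K_i steps, for K_i = 1, ..., K.
import numpy as np

rewards = np.random.default_rng(2).random(1000)               # placeholder reward trace
average_reward = np.cumsum(rewards) / np.arange(1, rewards.size + 1)
```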
- The impact of element number $N$ on the performance
- The average rewards versus time steps under different system parameter settings
        - Increasing the number of elements $N$ does not increase the convergence time
<img src='https://hackmd.io/_uploads/Hy4SmHqp2.png' width=70% height=70%>
- The sum rate as a function of $P_t$ under two scenarios
- The average sum rates increase with the transmit power $P_t$
<img src='https://hackmd.io/_uploads/H1pYXScph.png' width=70% height=70%>
- The CDF of sum rate for various system settings
- The average sum rates improve with the transmission power $P_t$ and the number of RIS elements $N$
<img src='https://hackmd.io/_uploads/Byl6mr56n.png' width=70% height=70%>
- In the proposed DRL algorithm, they use constant learning and decaying rates for the critic and actor neural networks
- The average rewards versus steps under different learning rates
        - Different learning rates have a significant influence on the performance
<img src='https://hackmd.io/_uploads/SyiRmrqa3.png' width=70% height=70%>
- The average rewards versus steps under different decaying rates
        - Different decaying rates exert less influence on the performance and convergence rate
<img src='https://hackmd.io/_uploads/ryg2rrca3.png' width=70% height=70%>
## **Future works**
- Since this paper has been cited by 397 other papers, I'll search among them to find newer papers that might be suitable for our needs
- My plan for next several weeks is to watch some Deep Reinforcement Learning courses on Udemy
<img src='https://hackmd.io/_uploads/HkM7PV9an.png' width=90% height=90%>
## **References**
- C. Huang, R. Mo and C. Yuen, "[Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning](https://ieeexplore.ieee.org/abstract/document/9110869)," in *IEEE Journal on Selected Areas in Communications*, vol. 38, no. 8, pp. 1839-1850, Aug. 2020. (Cited by 397)
- L. Kibona, J. Liu and Y. Liu, "[Ergodic Sum-rate Analysis for Massive MIMO under Imperfect Channel State Information](https://ieeexplore.ieee.org/document/9021601)," *2019 Photonics & Electromagnetics Research Symposium - Fall (PIERS - Fall)*, Xiamen, China, 2019, pp. 3243-3249. (Section 2.3, "Ergodic Achievable Sum Rate")