# **meeting 08/29**

**Advisor: Prof. Chih-Yu Wang \ Presenter: Shao-Heng Chen \ Date: Aug 29, 2023**

<!-- Chih-Yu Wang -->
<!-- Wei-Ho Chung -->

## **Paper reading**

- C. Huang, R. Mo and C. Yuen, "[Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning](https://ieeexplore.ieee.org/abstract/document/9110869)," in *IEEE Journal on Selected Areas in Communications*, vol. 38, no. 8, pp. 1839-1850, Aug. 2020. (Cited by 397)

### **System Model**

- Downlink RIS-aided MU-MISO system
- System settings
    - $K$ single-antenna users
    - $M$-antenna BS (where $M \geq K$)
    - $N$-element RIS
- Schematic diagram
    - Severe signal **blockage** between the BS and the users (no direct path from the BS to the users)

    ![](https://hackmd.io/_uploads/r11rZhFT3.png)

- The channel model
    - Assume a **frequency-flat fading** channel (narrowband?)
    - $\mathbf{H}_1 \in \mathbb{C}^{N \times M}$ is the BS-RIS channel
        - the subscript $1$ indicates the first half of the channel
    - $\mathbf{h}_{k, 2} \in \mathbb{C}^{N \times 1}$ is the RIS-user $k$ channel
        - the subscript $2$ indicates the second half
- Some assumptions
    - **Perfect CSI** is known at both the BS and the RIS
        - obtaining CSI at the RIS is a challenging task though
    - **Ideal reflection** (lossless)
        - $|\mathbf{\Phi}(n, n)|^2 = 1$
- The signal received at the $k$-th user is given as
$$
\begin{equation*}
y_{k} = \mathbf{h}_{k,2}^{T}\mathbf{\Phi}\mathbf{H}_{1}\mathbf{G}\mathbf{x}+w_{k}
\end{equation*}
$$
    - $y_k$ is the signal received at the $k$-th user (a complex scalar?)
    - $w_k$ is the zero-mean (complex?) AWGN with variance $\sigma_n^2$ at the $k$-th user
    - $\mathbf{x} \in \mathbb{C}^{K \times 1}$ is the data streams for the $K$ users
        - a column vector of dimension $K \times 1$
        - with zero-mean, unit-variance entries, $\mathcal{E}(|x|^2) = 1$
    - $\mathbf{G} \in \mathbb{C}^{M \times K}$ is the transmit beamforming matrix (applied at the BS)
    - $\mathbf{\Phi} \triangleq \mathrm{diag}(\phi_1, \ldots, \phi_N) \in \mathbb{C}^{N \times N}$ is the diagonal phase-shift matrix applied at the RIS
        - $\mathbf{\Phi}(n, n) = \phi_n = e^{j\varphi_n}$
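As a quick sanity check on the signal model above, a minimal NumPy sketch of $y_k = \mathbf{h}_{k,2}^T\mathbf{\Phi}\mathbf{H}_1\mathbf{G}\mathbf{x} + w_k$; the dimensions, i.i.d. Rayleigh channels, and random beamformer/phases below are placeholder assumptions for illustration, not the paper's settings:

```python
import numpy as np

# Placeholder system dimensions (assumptions, not the paper's settings)
M, N, K = 4, 16, 2          # BS antennas, RIS elements, users
sigma_n = 0.1               # noise standard deviation

rng = np.random.default_rng(0)

# i.i.d. Rayleigh channels, just for illustration
H1 = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)  # BS-RIS
h2 = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)  # row k = h_{k,2}^T

# Random unit-modulus RIS phase shifts and a toy transmit beamformer
Phi = np.diag(np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, N)))
G = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2 * M)

x = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)   # unit-variance symbols
w = sigma_n * (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)

# y_k = h_{k,2}^T Phi H_1 G x + w_k, stacked over all K users
y = h2 @ Phi @ H1 @ G @ x + w
print(y.shape)   # (K,)
```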
### **Problem Formulation**

- This paper adopts the **ergodic sum rate** as a metric to evaluate the system performance
    - Never heard of it before; I'm not quite sure what it means or what the difference is between the ergodic sum rate and the sum rate
- So, what is the ergodic achievable rate?
    - It refers to the average (data?) rate that a communication system can achieve over a long period of time, considering the inherent randomness and variations in the channel conditions.
    - It is the average (data? bit?) rate at which the signal is transmitted over the channel
    - It explains the usefulness and efficiency of massive MIMO technology and is usually derived from the Shannon theorem
- The formula of the ergodic rate is given by
$$
\begin{equation*}
R_k = \log_2\left(1 + SINR_k\right)
\end{equation*}
$$
- The ergodic sum rate for $K$ users is written as
$$
\begin{equation*}
R_{sum} = \sum\limits_{k = 1}^K E\left\{R_k\right\} \approx K \cdot \log_2\left(1 + SINR_k\right)
\end{equation*}
$$
- The above received-signal model can be further written as
$$
\begin{equation*}
y_{k} = \mathbf{h}_{k,2}^{T}\mathbf{\Phi}\mathbf{H}_{1}\mathbf{g}_{k}x_{k}+\sum_{n, \ n\neq k}^{K}\mathbf{h}_{k,2}^{T}\mathbf{\Phi}\mathbf{H}_{1}\mathbf{g}_{n}x_{n}+w_{k}
\end{equation*}
$$
    - where $\mathbf{g}_k$ is the $k$-th column vector of the matrix $\mathbf{G}$
    - Without joint detection of the data streams for all users, the second term is treated as co-channel interference.
- The SINR at the $k$-th user is given by
$$
\begin{equation*}
\rho_{k} = \frac{|\mathbf{h}_{k,2}^{T}\mathbf{\Phi}\mathbf{H}_{1}\mathbf{g}_{k}|^{2}}{\sum\limits_{n, \ n\neq k}^{K} |\mathbf{h}_{k,2}^{T}\mathbf{\Phi}\mathbf{H}_{1}\mathbf{g}_{n}|^{2}+\sigma_{n}^{2}}
\end{equation*}
$$
- The objective is to find the optimal $\mathbf{G}$ and $\mathbf{\Phi}$ that maximize the sum rate, and the optimization problem can be formulated as
$$
\begin{align*}
\max\limits_{\mathbf{G}, \mathbf{\Phi}} \;\; &C(\mathbf{G},\mathbf{\Phi}, \mathbf{h}_{k,2},\mathbf{H}_{1}) \\
\textrm{s.t.} \;\; &\mathrm{tr}\{\mathbf{G}\mathbf{G}^{\mathcal{H}}\} \leq P_{t} \\
&|\phi_{n}|=1 \;\; \forall n=1,2, \ldots, N.
\end{align*}
$$
    - The first constraint ensures that we don't exceed the total transmission power allowed at the BS
    - The second constraint is the unit-modulus constraint for the RIS (but in our case, we'll have phase-dependent amplitude with discrete phases)
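A small NumPy sketch of the per-user SINR $\rho_k$ and the instantaneous sum rate $C = \sum_k \log_2(1+\rho_k)$, continuing the placeholder variables (`h2`, `Phi`, `H1`, `G`, `sigma_n`) from the earlier snippet:

```python
import numpy as np

def sum_rate(h2, Phi, H1, G, sigma_n):
    """Instantaneous sum rate C = sum_k log2(1 + rho_k) for the RIS-aided MU-MISO model."""
    # Effective end-to-end gain for every (user k, data stream n) pair
    A = h2 @ Phi @ H1 @ G                        # A[k, n] = h_{k,2}^T Phi H_1 g_n
    P = np.abs(A) ** 2                           # received power terms |.|^2
    signal = np.diag(P)                          # desired-signal power of user k
    interference = P.sum(axis=1) - signal        # co-channel interference (n != k)
    rho = signal / (interference + sigma_n ** 2) # per-user SINR
    return np.sum(np.log2(1.0 + rho))

# Example, reusing the placeholder channels/beamformer defined above:
# C = sum_rate(h2, Phi, H1, G, sigma_n)
```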
### **Method**

- Deep Deterministic Policy Gradient (DDPG)
    - Actor-Critic Architecture:
        - DDPG is an extension of the actor-critic architecture
        - In this setup, there are 2 main components: the actor and the critic
        - The actor is responsible for learning a policy that maps states to actions (approximates the action)
        - The critic evaluates the quality of the actions chosen by the actor
    - Deterministic Policy:
        - Unlike some other DRL algorithms that learn stochastic policies (where the agent selects actions based on a probability distribution), DDPG learns a deterministic policy
        - This means that given a state, the actor **directly outputs the best action** to take
    - Experience Replay:
        - Similar to other DRL algorithms, DDPG utilizes experience replay
        - This involves storing past experiences (state, action, reward, next state) in a replay buffer and sampling mini-batches from it during training
        - Experience replay helps break the temporal correlations in the data, leading to more stable learning
    - Target Networks:
        - DDPG employs **2 sets** of neural networks for both the actor and the critic: the "target" networks and the "online (training?)" networks
        - The target networks are used to estimate the value function and the optimal action, while the online (training) networks are updated during training
        - **Soft updates** (using a small fraction of the online (training) network's weights) are applied to the target networks to stabilize learning
    <!-- - Target Q-Network:
        - The critic in DDPG learns the Q-function, which estimates the expected cumulative reward starting from a given state and taking a specific action
        - This Q-network is trained using a target Q-network to reduce the learning's inherent instability in deep networks -->
    - Policy Gradient Update:
        - The actor is updated using policy gradients, with the goal of maximizing the expected cumulative reward
        - The gradients are propagated through the critic network using the chain rule of derivatives (forward pass?)
        - This process guides the actor to select actions that have higher estimated Q-values
    - Ornstein-Uhlenbeck Noise:
        - To **add exploration** in the continuous action space, DDPG uses Ornstein-Uhlenbeck noise
        - This noise process introduces temporally correlated noise to the actions, allowing the agent to explore the action space more effectively
- DDPG network

    <img src='https://hackmd.io/_uploads/HkIwzJcah.png' width=65% height=65%>
    <img src='https://hackmd.io/_uploads/SJer9k9ph.png' width=85% height=65%>

- The training network (see the sketch after this list)
    - The updates on the training critic network are given as follows
$$
\begin{align*}
\mathbf{\theta}_{c}^{(t+1)} &= \mathbf{\theta}_{c}^{(t)} - \mu_{c} \Delta_{\mathbf{\theta}_{c}^{(train)}} \ell(\mathbf{\theta}_{c}^{(train)}) \\
\ell(\mathbf{\theta}_{c}^{(train)}) &= \bigg(r^{(t)}+\gamma q(\mathbf{\theta}_{c}^{(target)}|s^{(t+1)},a') - q(\mathbf{\theta}_{c}^{(train)}|s^{(t)},a^{(t)}) \bigg)^{2}
\end{align*}
$$
        - $\mu_c$ is the learning rate for the update on the training critic network
        - $a'$ is the action output from the **target?** actor network
        - $\mathbf{\theta}_{c}^{(train)}$ is the training critic network
        - $\mathbf{\theta}_{c}^{(target)}$ is the target critic network
            - The parameters of the target network are updated from those of the training network in certain time slots
            - The update on the target network is much slower than that on the training network
        - $\Delta_{\mathbf{\theta}_{c}^{(train)}} \ell(\mathbf{\theta}_{c}^{(train)})$ is the gradient with respect to the training critic network
    - The update on the training actor network is given as
$$
\begin{align*}
\mathbf{\theta}_{a}^{(t+1)} = \mathbf{\theta}_{a}^{(t)} - \mu_{a} \Delta_{a} q(\mathbf{\theta}_{c}^{(target)}|s^{(t)}, a) \Delta_{\mathbf{\theta}_{a}^{(train)}} \pi(\mathbf{\theta}_{a}^{(train)}|s^{(t)})
\end{align*}
$$
        - $\mu_a$ is the learning rate for the update on the training actor network
        - $\mathbf{\theta}_{a}^{(train)}$ is the training actor network
        - $\pi(\mathbf{\theta}_{a}^{(train)}|s^{(t)})$ is the output of the training actor network given input $s^{(t)}$
        - $\Delta_{a} q(\mathbf{\theta}_{c}^{(target)}|s^{(t)}, a)$ is the gradient of the target critic network with respect to the action
        - $\Delta_{\mathbf{\theta}_{a}^{(train)}} \pi(\mathbf{\theta}_{a}^{(train)}|s^{(t)})$ is the gradient of the training actor network with respect to its own parameters
    - Note
        - The update of the training actor network $\mathbf{\theta}_{a}^{(train)}$ is affected by the target critic network $\mathbf{\theta}_{c}^{(target)}$ through the gradient of the target critic network with respect to the action $a$
            - which ensures that the next selection of action moves in the favorable direction for optimizing the $Q$ value function
- The target network
    - The updates on the target critic network and the target actor network are given as follows
$$
\begin{align*}
\mathbf{\theta}_{c}^{(target)} \leftarrow &\tau_{c} \mathbf{\theta}_{c}^{(train)} + (1 - \tau_{c}) \mathbf{\theta}_{c}^{(target)} \\
\mathbf{\theta}_{a}^{(target)} \leftarrow &\tau_{a} \mathbf{\theta}_{a}^{(train)} + (1 - \tau_{a}) \mathbf{\theta}_{a}^{(target)}
\end{align*}
$$
        - where $\tau_{c}, \tau_{a}$ are the learning rates for updating the target critic network and the target actor network, respectively
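A minimal PyTorch sketch of the three updates above (critic regression, deterministic policy gradient, and soft target updates), assuming generic `actor(s)`/`critic(s, a)` modules, their Adam optimizers, and a replay mini-batch `(s, a, r, s_next)`; this only illustrates the equations, it is not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, batch,
                gamma=0.99, tau=0.001):
    """One DDPG step: critic loss, actor policy-gradient loss, soft target updates."""
    s, a, r, s_next = batch   # tensors of shape [B, D_s], [B, D_a], [B, 1], [B, D_s]

    # Critic update: minimize (r + gamma * q_target(s', a') - q_train(s, a))^2,
    # where a' comes from the target actor network
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend Q(s, pi(s)); autograd applies the chain rule dQ/da * dpi/dtheta_a.
    # (The notes above write this gradient through the target critic; standard DDPG uses the
    # training critic, as done here.)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta_target <- tau * theta_train + (1 - tau) * theta_target
    for net, net_t in ((critic, critic_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

The target networks would be initialized as copies of the training networks (e.g. `copy.deepcopy(actor)`), consistent with the soft-update rule above.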
### **Algorithm**

- *State*
    - The state $s^{(t)}$ at time step $t$ consists of 4 components:
        - the transmission power at the $t$-th time step
        - the received power of the users at the $t$-th time step
        - the action from the $(t - 1)$-th time step
        - the channel matrices $\mathbf{H}_{1} \in \mathbb{C}^{N \times M}$ and $\mathbf{h}_{k,2} \in \mathbb{C}^{N \times 1}$
    - The real part and the imaginary part are separated as independent input ports
    - The dimension of the state space is $D_s = 2K + 2K^2 + (2MK + 2N) + (2NM + 2NK)$
        - I don't understand why the dimension of the received power is $2K^2$
- *Action*
    - The action is simply constructed from the transmit beamforming matrix $\mathbf{G} \in \mathbb{C}^{M \times K}$ and the phase shift matrix $\mathbf{\Phi} \in \mathbb{C}^{N \times N}$
    - The dimension of the action space is $D_a = 2MK + 2N$
- *Reward*
    - The reward at the $t$-th time step is determined as the sum-rate capacity $C(\mathbf{G}^{(t)}, \mathbf{\Phi}^{(t)}, \mathbf{h}_{k,2}, \mathbf{H}_{1})$
        - given the instantaneous channels $\mathbf{H}_{1}, \mathbf{h}_{k,2}, \forall k$ and the action $\mathbf{G}^{(t)}, \mathbf{\Phi}^{(t)}$ obtained from the actor network

<!-- - Algorithm Description

    <img src='https://hackmd.io/_uploads/ryl0yV962.png' width=50% height=50%> -->
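A small NumPy sketch of how the action vector (and the action part of the state) could be flattened to match the dimension counts above; the exact ordering/normalization of the entries, and what the $2K^2$ received-power block contains, are my assumptions rather than something spelled out in these notes:

```python
import numpy as np

def flatten_complex(*arrays):
    """Stack real and imaginary parts of complex arrays into one real-valued vector."""
    return np.concatenate([np.concatenate([a.real.ravel(), a.imag.ravel()]) for a in arrays])

M, N, K = 4, 16, 2                      # placeholder dimensions
G = np.zeros((M, K), dtype=complex)     # transmit beamforming matrix
phi = np.ones(N, dtype=complex)         # diagonal of Phi (unit modulus)

action = flatten_complex(G, phi)        # action vector, D_a = 2MK + 2N
assert action.size == 2 * M * K + 2 * N

# State dimension per the notes: Tx power (2K), received power (2K^2),
# previous action (2MK + 2N), and channels (2NM + 2NK)
D_s = 2 * K + 2 * K ** 2 + (2 * M * K + 2 * N) + (2 * N * M + 2 * N * K)
print(D_s)
```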
### **Results**

- The Benchmarks and Settings
    - Weighted Minimum Mean Square Error (WMMSE) algorithm
    - Iterative algorithm based on Fractional Programming (FP) with ZF beamforming
    - The hyper-parameter settings

        <img src='https://hackmd.io/_uploads/B1JZhN5a3.png' width=50% height=50%>

- The sum rate versus $P_t$, comparing the proposed DRL-based algorithm with the two benchmarks
    - The sum rates increase with the transmit power $P_t$

    <img src='https://hackmd.io/_uploads/HkDDnEqTh.png' width=70% height=50%>

- The sum rate as a function of the element number $N$ for the proposed DRL-based algorithm as well as the two benchmarks ($P_t = 20$ dB, $M = K = 64$)
    - The average sum rates increase with the number of elements $N$

    <img src='https://hackmd.io/_uploads/rymIaV5ph.png' width=70% height=50%>

- The impact of $P_t$ on the performance
    - The rewards as a function of time steps

        <img src='https://hackmd.io/_uploads/HyEtgH9a2.png' width=70% height=80%>

        - The average reward is given by
$$
\begin{align*}
\textrm{average\_reward}(K_{i})=\frac{\sum_{k=1}^{K_{i}} \textrm{reward}(k)}{K_{i}}, \;\; K_{i}=1,2,\ldots,K
\end{align*}
$$
        - The rewards converge faster at low SNR $(P_t = 5$ dB$)$ than at high SNR $(P_t = 20$ dB$)$
            - With higher SNR, the dynamic range of the instant rewards is large, resulting in more fluctuations and worse convergence
    - The average rewards versus time steps under different $P_t = \{-10, 0, 10, 20, 30\}$ dB
        - The SNR has a significant effect on the convergence rate and performance

        <img src='https://hackmd.io/_uploads/By2xzrqah.png' width=70% height=70%>

- The impact of the element number $N$ on the performance
    - The average rewards versus time steps under different system parameter settings
        - Increasing the number of elements $N$ doesn't increase the convergence time

        <img src='https://hackmd.io/_uploads/Hy4SmHqp2.png' width=70% height=70%>

    - The sum rate as a function of $P_t$ under two scenarios
        - The average sum rates increase with the transmit power $P_t$

        <img src='https://hackmd.io/_uploads/H1pYXScph.png' width=70% height=70%>

    - The CDF of the sum rate for various system settings
        - The average sum rates improve with the transmission power $P_t$ and the number of RIS elements $N$

        <img src='https://hackmd.io/_uploads/Byl6mr56n.png' width=70% height=70%>

- In the proposed DRL algorithm, they use constant learning and decaying rates for the critic and actor neural networks
    - The average rewards versus steps under different learning rates
        - Different learning rates have a great influence on the performance

        <img src='https://hackmd.io/_uploads/SyiRmrqa3.png' width=70% height=70%>

    - The average rewards versus steps under different decaying rates
        - Different decaying rates exert less influence on the performance and convergence rate

        <img src='https://hackmd.io/_uploads/ryg2rrca3.png' width=70% height=70%>

## **Future works**

- Since this paper has been cited by 397 other papers, I'll search among them to find newer papers that might be suitable for our needs
- My plan for the next several weeks is to watch some Deep Reinforcement Learning courses on Udemy

    <img src='https://hackmd.io/_uploads/HkM7PV9an.png' width=90% height=90%>

## **References**

- C. Huang, R. Mo and C. Yuen, "[Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning](https://ieeexplore.ieee.org/abstract/document/9110869)," in *IEEE Journal on Selected Areas in Communications*, vol. 38, no. 8, pp. 1839-1850, Aug. 2020. (Cited by 397)
- L. Kibona, J. Liu and Y. Liu, "[Ergodic Sum-rate Analysis for Massive MIMO under Imperfect Channel State Information](https://ieeexplore.ieee.org/document/9021601)," *2019 Photonics & Electromagnetics Research Symposium - Fall (PIERS - Fall)*, Xiamen, China, 2019, pp. 3243-3249.