# **meeting 10/17**

**Advisor: Prof. Chih-Yu Wang \
Presenter: Shao-Heng Chen \
Date: Oct 17, 2023**

## **Current Progress**

Here are some of the things I've been working on over the past two weeks.

### **True Discrete action space version**

- ```discrete_env.py```
- ```spaces.Discrete()``` is a truly discrete action space, but because only one discrete action can be taken at a time (one per time step), I maintain a sliding window of size Ns to cover the phase control of all RIS elements (a usage sketch follows the snippet)

```python
# Discrete actions
# action: RIS matrix
self.bits = bits
self.n_actions = 2 ** bits
self.action_dim = self.Ns
self.action_space = spaces.Discrete(start=0, n=self.n_actions)

from collections import deque
self.prev_actions = deque(maxlen=self.action_dim)
```
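- Below is a minimal, hypothetical sketch of how such a window could be filled during a rollout; the loop, variable names, and the random-action stand-in are illustrative assumptions, not the actual ```discrete_env.py``` logic:

```python
import numpy as np
from collections import deque

# Hypothetical sketch: one Discrete action arrives per step and is pushed
# into a length-Ns window, so the full phase configuration only emerges
# after Ns steps.
Ns, bits = 16, 3
n_actions = 2 ** bits
prev_actions = deque(maxlen=Ns)

for t in range(Ns):
    action = np.random.randint(n_actions)  # stand-in for the agent's prediction
    prev_actions.append(action)            # once full, the oldest entry is evicted

phase_indices = np.array(prev_actions)     # one phase index per RIS element
print(phase_indices.shape)                 # (16,)
```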
### **Normalize Box action space**

- ```numpy_env.py``` (```temp_env.py```)
- I set the range of ```spaces.Box()``` to $-1$ and $1$ to satisfy the recommendation that the action space be normalized and symmetric, and then convert the values back to the corresponding indices inside ```reset()``` and ```step()```. There are two intuitive conversion schemes:

```python
# Discrete actions
# action: RIS matrix
self.bits = bits                # 3
self.n_actions = 2 ** bits      # 8
self.action_dim = self.Ns       # 16
self.action_space = spaces.Box(low=-1, high=1,
                               shape=(self.action_dim,), dtype=np.float32)

spacing_degree = 360. / self.n_actions
act = [i for i in range(self.n_actions)]
# for bits = 3: deg = [-150, -105, -60, -15, 30, 75, 120, 165]
deg = [spacing_degree*i - 180. - 15. for i in range(1, self.n_actions + 1)]
rad = np.radians(deg).tolist()
self.angle_set_deg = { key:val for (key, val) in zip(act, deg) }
self.angle_set_rad = { key:val for (key, val) in zip(act, rad) }

self.splits = [-1 + 2 / self.n_actions * i for i in range(0, self.n_actions + 1)]
```

- Method 1: first map the floats in $[-1, 1]$ to floats in $[0, 2^{\text{bits}} - 1]$, then convert to integer indices
    - Since both the domain and the codomain are contiguous intervals, a direct linear interpolation suffices

<img src='https://hackmd.io/_uploads/rkJq4YG-p.png' width=90% height=90%>

$$
\begin{align*}
\because &\;\; \frac{y - y_0}{x - x_0} = \frac{y_1 - y_0}{x_1 - x_0} \\
\Rightarrow \;\; y &= y_0 + (x - x_0)\cdot\frac{y_1 - y_0}{x_1 - x_0} \\
&= 0 + (x - (-1))\cdot\frac{2^{\text{bits}} - 1 - 0}{1 - (-1)} \\
\therefore \;\; y &= (x + 1)\cdot\frac{2^{\text{bits}} - 1}{2}
\end{align*}
$$

```python
def _linear_interpolation(self, x):
    y = (x + 1) / 2 * (self.n_actions - 1)
    return y
```

- Then simply round to the nearest integer

```python
actions = self.action_space.sample()
rescaled_actions = np.round(self._linear_interpolation(actions)).astype(int)
```

- A moment's thought, however, reveals an uneven-sampling problem (verified empirically in the sketch below)
    - e.g. ```[0, 0.5)``` maps to ```0```, ```[0.5, 1.5)``` maps to ```1```, ```[1.5, 2.5)``` maps to ```2```, and so on, with ```[6.5, 7]``` mapping to ```7```
    - The two boundary buckets are only half as wide as the interior ones, so no matter what the agent predicts, the conversion itself makes two of the phases appear less frequently, which doesn't seem reasonable
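- A quick standalone check of this bias; a sketch assuming the agent's outputs are uniform in $[-1, 1]$ and ```n_actions = 8```:

```python
import numpy as np

# Empirically measure how often each index is produced by the
# round-after-interpolation scheme under uniform [-1, 1] inputs.
n_actions = 8
x = np.random.uniform(-1, 1, size=1_000_000)
y = (x + 1) / 2 * (n_actions - 1)       # the linear interpolation above
idx = np.round(y).astype(int)
freq = np.bincount(idx, minlength=n_actions) / x.size
print(freq)
# indices 1..6 each receive ~1/7 of the mass, but indices 0 and 7
# only ~1/14, because their rounding buckets are half as wide
```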
- Method 2: think in terms of intervals. For example, with ```3-bit``` resolution, i.e. $8$ phase choices, the split points are ```[-1, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1]```
- Each action is then mapped to the index of the interval it falls into (a vectorized variant is sketched after the code):

```python
def _intervals_to_indices(self, act):
    for j in range(0, self.n_actions):
        if self.splits[j] <= act < self.splits[j + 1]:
            # print(f"{j}: {self.splits[j]} {act} {self.splits[j + 1]}")
            return j
    # act == 1.0 falls outside every half-open interval,
    # so map it to the last index instead of a random one
    return self.n_actions - 1

def _rescale(self, actions):
    # method 1: rounding
    # np.round(self._linear_interpolation(actions)).astype(int)

    # method 2: convert intervals to indices
    indices = np.zeros(self.action_dim, dtype=int)  # one index per RIS element (Ns)
    for i, act in enumerate(actions):
        indices[i] = self._intervals_to_indices(act)
    return indices

# e.g. inside reset()
init_action = self._rescale(actions=self.action_space.sample())
```
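- The per-element loop can also be replaced by a single vectorized call; a sketch assuming the same ```splits``` (```np.digitize``` bins values by ```bins[i-1] <= x < bins[i]```, and dropping the outer edges makes ```act == 1.0``` land in the last bin):

```python
import numpy as np

# Vectorized alternative to _intervals_to_indices / _rescale
n_actions = 8
splits = np.array([-1 + 2 / n_actions * i for i in range(n_actions + 1)])

def rescale(actions):
    # use only the interior split points, so outputs range over 0..n_actions-1
    return np.digitize(actions, splits[1:-1])

actions = np.random.uniform(-1, 1, size=16)
print(rescale(actions))
```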
- Example 1:

```
--------------------------------------------------------------------------
['raw_actions']
content: [ 0.86260501 0.25409108 0.67931822 0.05806967 -0.71057042 0.49869931 0.36905372 0.08236469 0.28365709 -0.08693747 -0.73377 -0.56789524 -0.81521461 -0.46595009 -0.04973559 -0.03200363]
--------------------------------------------------------------------------
['rescaled_actions']
content: [6.51911755 4.3893188 5.87761378 3.70324384 1.01300354 5.24544757 4.79168803 3.78827642 4.49279983 3.19571885 0.931805 1.51236665 0.64674885 1.8691747 3.32592542 3.38798728]
--------------------------------------------------------------------------
['rounding_to_indices']
content: [7 4 6 4 1 5 5 4 4 3 1 2 1 2 3 3]
--------------------------------------------------------------------------
['intervals_to_indices']
content: [7 5 6 4 1 5 5 4 5 3 1 1 0 2 3 3]
--------------------------------------------------------------------------
['splits']
content: [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
--------------------------------------------------------------------------
7: 0.75 0.8626050135776033 1.0
5: 0.25 0.254091084375899 0.5
6: 0.5 0.6793182222884977 0.75
4: 0.0 0.05806966993733753 0.25
1: -0.75 -0.7105704167528657 -0.5
5: 0.25 0.49869930707737953 0.5
5: 0.25 0.3690537225556607 0.5
4: 0.0 0.08236469086912535 0.25
5: 0.25 0.2836570929141733 0.5
3: -0.25 -0.08693747262204732 0.0
1: -0.75 -0.7337700006462966 -0.5
1: -0.75 -0.5678952430761326 -0.5
0: -1.0 -0.8152146149596089 -0.75
2: -0.5 -0.4659500867446056 -0.25
3: -0.25 -0.04973559305090958 0.0
3: -0.25 -0.032003634110262524 0.0
--------------------------------------------------------------------------
```

- Example 2:

```
--------------------------------------------------------------------------
['raw_actions']
contents: [-0.13645365 0.86694455 0.26510687 0.25270982 -0.74446027 -0.14332935 0.59276364 0.32181936 -0.54876036 0.07306222 0.05676657 0.50935485 -0.83035216 0.27335135 0.11853613 0.56464119]
--------------------------------------------------------------------------
['rescaled_actions']
content: [3.02241224 6.53430594 4.42787405 4.38448439 0.89438906 2.99834729 5.57467276 4.62636776 1.57933874 3.75571778 3.69868299 5.28274198 0.59376745 4.45672972 3.91487646 5.47624418]
--------------------------------------------------------------------------
['rounding_to_indices']
content: [3 7 4 4 1 3 6 5 2 4 4 5 1 4 4 5]
--------------------------------------------------------------------------
['intervals_to_indices']
content: [3 7 5 5 1 3 6 5 1 4 4 6 0 5 4 6]
--------------------------------------------------------------------------
['splits']
content: [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
--------------------------------------------------------------------------
3: -0.25 -0.13645364616140365 0.0
7: 0.75 0.8669445542987182 1.0
5: 0.25 0.26510687105090147 0.5
5: 0.25 0.2527098249105435 0.5
1: -0.75 -0.7444602675309533 -0.5
3: -0.25 -0.1433293456869662 0.0
6: 0.5 0.5927636444961728 0.75
5: 0.25 0.3218193591280276 0.5
1: -0.75 -0.5487603595500505 -0.5
4: 0.0 0.07306222292277642 0.25
4: 0.0 0.05676656758243781 0.25
6: 0.5 0.5093548506679428 0.75
0: -1.0 -0.8303521559803033 -0.75
5: 0.25 0.27335134932760363 0.5
4: 0.0 0.11853613056239842 0.25
6: 0.5 0.5646411935042213 0.75
--------------------------------------------------------------------------
```

### **Apply GPU acceleration**

- Implemented a GPU-accelerated version of both the True Discrete and the Box Discrete environments
    - ```discrete_torch_env.py``` and ```torch_env.py```
- Both pass ```env_checker.check_env()``` from ```gym``` and ```stable-baselines3```
- Just in case we switch to another RL library or hand-roll an algorithm in the future, I will keep maintaining both the ```numpy``` and the ```torch``` versions
- GPU acceleration gives roughly a 2x speedup, i.e. it cuts the execution time in half (a hypothetical way to measure this is sketched below)
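A minimal sketch of how such a speedup could be measured; ```NumpyEnv``` / ```TorchEnv``` and their constructor arguments are hypothetical stand-ins for the actual classes in ```numpy_env.py``` / ```torch_env.py```:

```python
import time

def benchmark(env, n_steps=10_000):
    """Time n_steps of random-action stepping in the given env."""
    env.reset()
    start = time.perf_counter()
    for _ in range(n_steps):
        action = env.action_space.sample()
        _, _, done, *_ = env.step(action)   # tolerate 4- or 5-tuple returns
        if done:
            env.reset()
    return time.perf_counter() - start

# Hypothetical usage (constructor signatures assumed):
# t_np = benchmark(NumpyEnv(bits=3))
# t_gpu = benchmark(TorchEnv(bits=3, device="cuda"))
# print(f"numpy: {t_np:.2f}s, torch: {t_gpu:.2f}s, speedup: {t_np / t_gpu:.1f}x")
```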
## **Learn and Save**

### **True discrete action space**

- ```numpy``` implementation
![](https://hackmd.io/_uploads/HkLZEnwg6.png)
- ```torch``` implementation
![](https://hackmd.io/_uploads/H1DQGFkZa.png)
- Comparison between the two
![](https://hackmd.io/_uploads/HJgtGYkba.png)

## **Load and Predict**

### **Results**

- When running inference, I found that the agent has actually learned nothing: it emits a nearly identical action in every episode while the rewards swing wildly

```
(sb3) C:\Users\paulc\Downloads\RIS-MISO-DRL>python load_and_predict.py 2023-10-16
2023-10-16
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------ episide: 1 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [-360.32953]
--------------------------------------------
------------ episide: 2 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [17.118927]
--------------------------------------------
------------ episide: 3 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [118.018456]
--------------------------------------------
------------ episide: 4 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [127.62513]
--------------------------------------------
------------ episide: 5 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. -1. -1. 1. 1.]]
reward: [51.577286]
--------------------------------------------
------------ episide: 6 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [-174.78543]
--------------------------------------------
------------ episide: 7 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. -1.]]
reward: [-54.175446]
--------------------------------------------
------------ episide: 8 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. 1. 1. -1. 1.]]
reward: [-25.15158]
--------------------------------------------
------------ episide: 9 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. 1. -1. 0.9999995 -0.9855779]]
reward: [53.733917]
--------------------------------------------
------------ episide: 10 ------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. -1. -1. 1. 1.]]
reward: [149.57697]
--------------------------------------------
```

```
(sb3) C:\Users\paulc\Downloads\RIS-MISO-DRL>python load_and_predict.py 2023-10-17
2023-10-17
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------ episide: 1 ------------
step: 1/3
action: [-0.02131669 0.0490021 0.00824801 0.01051491 0.07197212 -0.0462294 -0.01631422 -0.04279681 -0.03263782 -0.026789 -0.05933313 0.02598276 0.01468066 -0.02560562 -0.15320425 0.06579673]
reward: 54.2869873046875
indices: [3 4 4 4 4 3 3 3 3 3 3 4 4 3 3 4]
obs: [3. 4. 4. 4. 4. 3. 3. 3. 3. 3. 3. 4. 4. 3. 3. 4.]
--------------------------------------------
step: 2/3
action: [-0.00331428 0.03132646 0.00887772 -0.00822616 0.07711727 -0.0391115 -0.0017681 -0.04044004 -0.04086838 -0.04706105 -0.04594048 0.03371527 -0.00198596 -0.01590204 -0.11628495 0.06917711]
reward: 54.2869873046875
indices: [3 4 4 3 4 3 3 3 3 3 3 4 3 3 3 4]
obs: [3. 4. 4. 3. 4. 3. 3. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 3/3
action: [-0.0060223 0.03252952 0.01223794 -0.01143168 0.07552139 -0.0400262 -0.00502498 -0.04340912 -0.03510792 -0.04401602 -0.04444296 0.02969764 -0.00109319 -0.02033744 -0.12010101 0.06528159]
reward: 54.2869873046875
indices: [3 4 4 3 4 3 3 3 3 3 3 4 3 3 3 4]
obs: [5. 5. 4. 5. 3. 2. 2. 6. 7. 7. 3. 4. 6. 6. 0. 6.]
--------------------------------------------
------------ episide: 2 ------------
step: 1/3
action: [-0.01807443 0.03684139 0.00533386 -0.00280623 -0.00448481 -0.04475321 0.02278076 -0.03370964 -0.01108776 -0.0517271 -0.0558898 0.02972344 -0.04232261 -0.0260974 -0.11799991 0.05132137]
reward: -26.043609619140625
indices: [3 4 4 3 3 3 4 3 3 3 3 4 3 3 3 4]
obs: [3. 4. 4. 3. 3. 3. 4. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 2/3
action: [ 0.01238814 0.0299063 -0.01049508 0.01096653 0.07127298 -0.01620197 0.03347323 -0.00860694 -0.0350171 -0.06126057 -0.0719668 0.04238606 -0.0091862 0.00415695 -0.07294588 0.07278994]
reward: -26.043609619140625
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 4. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 4. 3. 4.]
--------------------------------------------
step: 3/3
action: [ 0.02103273 0.02837837 -0.01036341 0.01712001 0.07781161 -0.02096083 0.04075465 -0.00289786 -0.0437214 -0.06717494 -0.07946283 0.04880828 -0.00270583 0.00693063 -0.06607414 0.07997666]
reward: -26.043609619140625
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [5. 7. 2. 4. 6. 1. 2. 2. 5. 0. 3. 1. 1. 5. 5. 2.]
--------------------------------------------
------------ episide: 3 ------------
step: 1/3
action: [ 0.05199433 -0.01210307 -0.00384858 0.02168908 0.11057067 -0.00924241 0.06541479 -0.04730614 -0.0769427 -0.07981339 -0.07270621 0.09210572 -0.03299163 0.03383501 -0.04911559 0.11411195]
reward: -358.31024169921875
indices: [4 3 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 3. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 4. 3. 4.]
--------------------------------------------
step: 2/3
action: [ 0.04207677 -0.0011614 0.0264533 0.02537564 0.08660375 -0.0531966 0.02446204 -0.03892119 -0.06152189 -0.05672617 -0.06550181 0.06565554 -0.02404642 -0.00049087 -0.08765226 0.09832495]
reward: -358.31024169921875
indices: [4 3 4 4 4 3 4 3 3 3 3 4 3 3 3 4]
obs: [4. 3. 4. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 3/3
action: [ 0.03900013 -0.00065568 0.0234519 0.02473833 0.08915967 -0.05321316 0.0213735 -0.03646151 -0.06049293 -0.05427819 -0.06441119 0.06433693 -0.0224547 0.00132045 -0.08930709 0.09611174]
reward: -358.31024169921875
indices: [4 3 4 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [1. 7. 0. 3. 2. 1. 0. 6. 1. 4. 0. 5. 2. 4. 1. 4.]
--------------------------------------------
------------ episide: 4 ------------
step: 1/3
action: [ 0.02609184 0.03017071 -0.0056693 0.01308658 0.0290275 -0.07856165 0.02822203 -0.02433331 -0.05612957 -0.08991605 -0.08471522 0.0802256 -0.01516445 -0.02843781 -0.08050141 0.09025438]
reward: -25.100173950195312
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 3 3 4]
obs: [4. 4. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 2/3
action: [ 0.05831401 0.00690948 -0.01142158 0.01969866 0.0891875 -0.04466347 0.0318232 -0.02673637 -0.07499496 -0.08569863 -0.07822106 0.07705219 -0.0190332 0.01112465 -0.05420144 0.10679524]
reward: -25.100173950195312
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 4. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 4. 3. 4.]
--------------------------------------------
step: 3/3
action: [ 0.0597463 0.00696555 -0.01015426 0.0191804 0.09275264 -0.04336767 0.03165875 -0.02653229 -0.07436275 -0.08480518 -0.08059401 0.07674515 -0.01913859 0.00935793 -0.05882743 0.10956027]
reward: -25.100173950195312
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 4. 6. 0. 6. 0. 6. 6. 7. 7. 5. 7. 3. 4. 2. 7.]
--------------------------------------------
```

- Main suspected reasons it fails to learn
    - too few (or too many) possible ```action```s
    - the ```state``` is too complex and too high-dimensional
    - the ```reward``` is not conducive to learning
- Other possible reasons
    - I'm currently using each algorithm's default hyperparameters without any tuning, and the default settings very likely don't suit this customized environment
- Testing directions (a sanity-check sketch follows this list)
    - Try a more simplified problem, such as Average MSE, to see whether the agent can actually work properly
    - Try a similar problem that is known to work, such as Sum Rate Maximization
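As a concrete starting point for the first testing direction, here is a minimal sanity-check sketch: a toy environment with the same Box action interface whose optimal action is known, so a healthy training setup should solve it easily. Everything here (the class name, the target vector, the negative-MSE reward) is a hypothetical stand-in, not the actual RIS environment:

```python
import numpy as np
import gym
from gym import spaces

class SanityCheckEnv(gym.Env):
    """Toy env: reward is the negative MSE between the action and a fixed
    target vector, so a working agent should converge to the target."""

    def __init__(self, dim=16):
        super().__init__()
        self.dim = dim
        self.target = np.random.uniform(-1, 1, size=dim).astype(np.float32)
        self.action_space = spaces.Box(low=-1, high=1, shape=(dim,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-1, high=1, shape=(dim,), dtype=np.float32)

    def reset(self):
        return self.target.copy()  # the observation reveals the target

    def step(self, action):
        reward = -float(np.mean((action - self.target) ** 2))
        return self.target.copy(), reward, True, {}

# If an agent cannot maximize this reward, the problem lies in the
# training setup rather than in the RIS environment itself.
```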