# **meeting 10/17**
**Advisor: Prof. Chih-Yu Wang \
Presenter: Shao-Heng Chen \
Date: Oct 17, 2023**
<!-- Chih-Yu Wang -->
<!-- Wei-Ho Chung -->
## **Current Progress**
Here are some of the things I've been working on over the past two weeks.
### **True Discrete action space version**
- ```discrete_env.py```
- ```space.Discrete()``` is a truly discrete action space, but since only one discrete action can be taken per time step, a sliding window of size Ns is maintained so that the phases of all RIS elements still get controlled (a standalone sketch of this idea follows the snippet below)
```python
# Discrete actions
# action: one RIS phase index per time step
from collections import deque

self.bits = bits
self.n_actions = 2 ** bits                         # number of quantized phases
self.action_dim = self.Ns                           # one phase per RIS element
self.action_space = spaces.Discrete(n=self.n_actions, start=0)
# sliding window keeping the most recent Ns phase indices
self.prev_actions = deque(maxlen=self.action_dim)
```
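A standalone sketch of the sliding-window idea, under my own assumptions (using the same ```Ns = 16```, ```bits = 3``` noted elsewhere, and a plain index-to-phase mapping rather than the environment's own angle table); this is an illustration, not the exact ```step()``` logic in ```discrete_env.py```:
```python
import numpy as np
from collections import deque

# Illustration only: one phase index arrives per time step; once the deque
# holds Ns of them, every RIS element has a phase and the full vector is built.
Ns, bits = 16, 3
n_actions = 2 ** bits
prev_actions = deque(maxlen=Ns)

rng = np.random.default_rng(0)
for t in range(Ns):
    a = int(rng.integers(n_actions))        # stand-in for action_space.sample()
    prev_actions.append(a)                  # newest index in, oldest one drops out

indices = np.array(prev_actions)            # one phase index per RIS element
# simple index -> phase mapping for illustration (the env uses its own angle_set_rad)
phases = 2 * np.pi * indices / n_actions
print(indices, phases.round(2))
```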
### **Normalize Box action space**
- ```numpy_env.py``` (```temp_env.py```)
- Set the value range of ```space.Box()``` to $-1$ to $1$ to follow the recommendation of a normalized and symmetric action space, then convert back to the corresponding indices inside ```reset()``` and ```step()```. There are two intuitive conversion schemes:
```python
# Discrete actions
# action: RIS matrix
self.bits = bits # 3
self.n_actions = 2 ** bits # 8
self.action_dim = self.Ns # 16
self.action_space = spaces.Box(low=-1, high=1,
shape=(self.action_dim,), dtype=np.float32)
spacing_degree = 360. / self.n_actions
act = list(range(self.n_actions))
deg = [spacing_degree * i - 180. - 15. for i in range(1, self.n_actions + 1)]
rad = np.radians(deg).tolist()
self.angle_set_deg = dict(zip(act, deg))
self.angle_set_rad = dict(zip(act, rad))
# interval boundaries in [-1, 1], one bin per discrete phase
self.splits = [-1 + 2 / self.n_actions * i for i in range(self.n_actions + 1)]
```
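Before going through the two schemes, here is a quick standalone check of the lookup tables constructed above (same expressions, hard-coded for ```bits = 3```); the resulting ```splits``` matches the list printed in the example logs further down:
```python
# same construction as above, hard-coded for bits = 3 (a standalone check)
bits = 3
n_actions = 2 ** bits
spacing_degree = 360. / n_actions
deg = [spacing_degree * i - 180. - 15. for i in range(1, n_actions + 1)]
splits = [-1 + 2 / n_actions * i for i in range(n_actions + 1)]

print(deg)     # [-150.0, -105.0, -60.0, -15.0, 30.0, 75.0, 120.0, 165.0]
print(splits)  # [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
```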
- Scheme 1: first map the float in $[-1, 1]$ to a float in $[0, 2^{bits} - 1]$, then convert that to an integer index
- Since this just maps one contiguous range onto another, a direct linear interpolation is enough
<img src='https://hackmd.io/_uploads/rkJq4YG-p.png' width=90% height=90%>
<!-- $$
\begin{align*}
\because &\;\; \frac{y - y_0}{x - x_0} = \frac{y_1 - y_0}{x_1 - x_0} \\
\Rightarrow \;\; y &= y_0 + (x - x_0)\cdot\frac{y_1 - y_0}{x_1 - x_0} \\
&= 0 + (x - (-1))\cdot\frac{2^{bits} - 1 - 0}{1 - (-1)} \\
\therefore \;\; y &= (x + 1)\cdot\frac{2^{bits} - 1}{2}
\end{align*}
$$ -->
```python
def _linear_interpolation(self, x):
    # map x in [-1, 1] linearly onto [0, n_actions - 1]
    y = (x + 1) / 2 * (self.n_actions - 1)
    return y
```
- Then simply round to the nearest integer:
```python
actions = self.action_space.sample()
rescaled_actions = np.round(self._linear_interpolation(actions)).astype(int)
```
- A moment's thought, however, reveals a sampling-imbalance problem
- For example, on the interpolated $[0, 7]$ axis, ```[0, 0.5)``` rounds to ```0```, ```[0.5, 1.5)``` to ```1```, ```[1.5, 2.5)``` to ```2```, and so on, up to ```[6.5, 7]``` which rounds to ```7```
- So no matter what the Agent predicts, the conversion itself makes two of the phases appear less often, which does not seem reasonable; the quick check below illustrates this
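A quick standalone check of this imbalance (my own sketch, not environment code):
```python
import numpy as np

# uniformly sample [-1, 1], apply the same linear interpolation + rounding,
# and count how often each index shows up
bits = 3
n_actions = 2 ** bits
x = np.random.default_rng(0).uniform(-1, 1, size=100_000)
y = np.round((x + 1) / 2 * (n_actions - 1)).astype(int)
print(np.bincount(y, minlength=n_actions) / len(x))
# indices 1..6 each cover a width-1 bin on the [0, 7] axis, while 0 and 7
# only get half a bin, so they show up roughly half as often
```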
- Scheme 2: think in terms of intervals. For example, with ```3-bit``` resolution, i.e. $8$ possible phases, the interval boundaries are ```[-1, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1]```
- Each action is then mapped to the index of the interval it falls into:
```python
def _intervals_to_indices(self, act):
    # return the index of the interval in self.splits that act falls into
    for j in range(self.n_actions):
        if self.splits[j] <= act < self.splits[j + 1]:
            # print(f"{j}: {self.splits[j]} {act} {self.splits[j + 1]}")
            return j
    # act == 1.0 (the inclusive upper bound of the Box) falls through the
    # half-open intervals above, so map it to the last index
    return self.n_actions - 1

def _rescale(self, actions):
    # method 1: rounding
    # np.round(self._linear_interpolation(actions)).astype(int)
    # method 2: convert intervals to indices
    indices = np.zeros(self.action_dim, dtype=int)   # one index per RIS element (Ns)
    for i, act in enumerate(actions):
        indices[i] = self._intervals_to_indices(act)
    return indices

init_action = self._rescale(actions=self.action_space.sample())
```
- Example 1:
```
--------------------------------------------------------------------------
['raw_actions']
content:
[ 0.86260501 0.25409108 0.67931822 0.05806967 -0.71057042 0.49869931
0.36905372 0.08236469 0.28365709 -0.08693747 -0.73377 -0.56789524
-0.81521461 -0.46595009 -0.04973559 -0.03200363]
--------------------------------------------------------------------------
['rescaled_actions']
content:
[6.51911755 4.3893188 5.87761378 3.70324384 1.01300354 5.24544757
4.79168803 3.78827642 4.49279983 3.19571885 0.931805 1.51236665
0.64674885 1.8691747 3.32592542 3.38798728]
--------------------------------------------------------------------------
['rounding_to_indices']
content:
[7 4 6 4 1 5 5 4 4 3 1 2 1 2 3 3]
--------------------------------------------------------------------------
['intervals_to_indices']
content:
[7 5 6 4 1 5 5 4 5 3 1 1 0 2 3 3]
--------------------------------------------------------------------------
['splits']
content:
[-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
--------------------------------------------------------------------------
7: 0.75 0.8626050135776033 1.0
5: 0.25 0.254091084375899 0.5
6: 0.5 0.6793182222884977 0.75
4: 0.0 0.05806966993733753 0.25
1: -0.75 -0.7105704167528657 -0.5
5: 0.25 0.49869930707737953 0.5
5: 0.25 0.3690537225556607 0.5
4: 0.0 0.08236469086912535 0.25
5: 0.25 0.2836570929141733 0.5
3: -0.25 -0.08693747262204732 0.0
1: -0.75 -0.7337700006462966 -0.5
1: -0.75 -0.5678952430761326 -0.5
0: -1.0 -0.8152146149596089 -0.75
2: -0.5 -0.4659500867446056 -0.25
3: -0.25 -0.04973559305090958 0.0
3: -0.25 -0.032003634110262524 0.0
--------------------------------------------------------------------------
```
- Example 2:
```
-------------------------------------------------------------------------
['raw_actions']
contents:
[-0.13645365 0.86694455 0.26510687 0.25270982 -0.74446027 -0.14332935
0.59276364 0.32181936 -0.54876036 0.07306222 0.05676657 0.50935485
-0.83035216 0.27335135 0.11853613 0.56464119]
-------------------------------------------------------------------------
['rescaled_actions']
content:
[3.02241224 6.53430594 4.42787405 4.38448439 0.89438906 2.99834729
5.57467276 4.62636776 1.57933874 3.75571778 3.69868299 5.28274198
0.59376745 4.45672972 3.91487646 5.47624418]
--------------------------------------------------------------------------
['rounding_to_indices']
content:
[3 7 4 4 1 3 6 5 2 4 4 5 1 4 4 5]
--------------------------------------------------------------------------
['intervals_to_indices']
content:
[3 7 5 5 1 3 6 5 1 4 4 6 0 5 4 6]
--------------------------------------------------------------------------
['splits']
content:
[-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
--------------------------------------------------------------------------
3: -0.25 -0.13645364616140365 0.0
7: 0.75 0.8669445542987182 1.0
5: 0.25 0.26510687105090147 0.5
5: 0.25 0.2527098249105435 0.5
1: -0.75 -0.7444602675309533 -0.5
3: -0.25 -0.1433293456869662 0.0
6: 0.5 0.5927636444961728 0.75
5: 0.25 0.3218193591280276 0.5
1: -0.75 -0.5487603595500505 -0.5
4: 0.0 0.07306222292277642 0.25
4: 0.0 0.05676656758243781 0.25
6: 0.5 0.5093548506679428 0.75
0: -1.0 -0.8303521559803033 -0.75
5: 0.25 0.27335134932760363 0.5
4: 0.0 0.11853613056239842 0.25
6: 0.5 0.5646411935042213 0.75
--------------------------------------------------------------------------
```
### **Apply GPU acceleration**
- Implemented a GPU-accelerated version of both the true Discrete and the Box Discrete environments
- ```discrete_torch_env.py``` and ```torch_env.py```
- Both pass ```env_checker.check_env()``` from ```gym``` and from ```stable-baselines3```
- In case we later switch to another RL library or hand-code the algorithms, I will keep maintaining both the ```numpy``` and the ```torch``` versions
- GPU acceleration gives roughly a 2x speedup, cutting the execution time in half
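For reference, this is the kind of check both versions pass (the class name ```TorchRISEnv``` is a placeholder for whatever ```torch_env.py``` actually exports):
```python
# stable-baselines3's environment checker; gym/gymnasium ships an equivalent one
from stable_baselines3.common.env_checker import check_env

from torch_env import TorchRISEnv   # placeholder class name

env = TorchRISEnv()                 # constructor arguments omitted here
check_env(env, warn=True)           # warns / raises if the custom env violates the API
```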
## **Learn and Save**
### **True discrete action space**
- ```numpy``` implementation

- ```torch``` implementation

- Comparison between these two
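For context, the learn-and-save flow looks roughly like the following minimal sketch; the algorithm, hyperparameters, and file names are placeholders, not the settings actually used for these experiments:
```python
from stable_baselines3 import PPO

from numpy_env import RISEnv               # placeholder class name

env = RISEnv()                             # constructor arguments omitted here
model = PPO("MlpPolicy", env, verbose=1)   # default hyperparameters
model.learn(total_timesteps=100_000)
model.save("ppo_ris")                      # writes ppo_ris.zip next to the script
```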

## **Load and Predict**
### **Results**
- Running inference revealed that the Agent has not actually learned anything:
```
(sb3) C:\Users\paulc\Downloads\RIS-MISO-DRL>python load_and_predict.py
2023-10-16
2023-10-16
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------
episide: 1
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [-360.32953]
--------------------------------------------
------------
episide: 2
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [17.118927]
--------------------------------------------
------------
episide: 3
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [118.018456]
--------------------------------------------
------------
episide: 4
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [127.62513]
--------------------------------------------
------------
episide: 5
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. -1. -1. 1. 1.]]
reward: [51.577286]
--------------------------------------------
------------
episide: 6
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1.]]
reward: [-174.78543]
--------------------------------------------
------------
episide: 7
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. -1.]]
reward: [-54.175446]
--------------------------------------------
------------
episide: 8
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. 1. 1. -1. 1.]]
reward: [-25.15158]
--------------------------------------------
------------
episide: 9
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1.
1. -1. 1. -1. -1. 1.
1. -1. 0.9999995 -0.9855779]]
reward: [53.733917]
--------------------------------------------
------------
episide: 10
------------
step: 1/1
action: [[ 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. 1. -1. -1. 1. 1.]]
reward: [149.57697]
--------------------------------------------
```
```
(sb3) C:\Users\paulc\Downloads\RIS-MISO-DRL>python load_and_predict.py
2023-10-17
2023-10-17
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------
episide: 1
------------
step: 1/3
action: [-0.02131669 0.0490021 0.00824801 0.01051491 0.07197212 -0.0462294
-0.01631422 -0.04279681 -0.03263782 -0.026789 -0.05933313 0.02598276
0.01468066 -0.02560562 -0.15320425 0.06579673]
reward: 54.2869873046875
indices: [3 4 4 4 4 3 3 3 3 3 3 4 4 3 3 4]
obs: [3. 4. 4. 4. 4. 3. 3. 3. 3. 3. 3. 4. 4. 3. 3. 4.]
--------------------------------------------
step: 2/3
action: [-0.00331428 0.03132646 0.00887772 -0.00822616 0.07711727 -0.0391115
-0.0017681 -0.04044004 -0.04086838 -0.04706105 -0.04594048 0.03371527
-0.00198596 -0.01590204 -0.11628495 0.06917711]
reward: 54.2869873046875
indices: [3 4 4 3 4 3 3 3 3 3 3 4 3 3 3 4]
obs: [3. 4. 4. 3. 4. 3. 3. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 3/3
action: [-0.0060223 0.03252952 0.01223794 -0.01143168 0.07552139 -0.0400262
-0.00502498 -0.04340912 -0.03510792 -0.04401602 -0.04444296 0.02969764
-0.00109319 -0.02033744 -0.12010101 0.06528159]
reward: 54.2869873046875
indices: [3 4 4 3 4 3 3 3 3 3 3 4 3 3 3 4]
obs: [5. 5. 4. 5. 3. 2. 2. 6. 7. 7. 3. 4. 6. 6. 0. 6.]
--------------------------------------------
------------
episide: 2
------------
step: 1/3
action: [-0.01807443 0.03684139 0.00533386 -0.00280623 -0.00448481 -0.04475321
0.02278076 -0.03370964 -0.01108776 -0.0517271 -0.0558898 0.02972344
-0.04232261 -0.0260974 -0.11799991 0.05132137]
reward: -26.043609619140625
indices: [3 4 4 3 3 3 4 3 3 3 3 4 3 3 3 4]
obs: [3. 4. 4. 3. 3. 3. 4. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 2/3
action: [ 0.01238814 0.0299063 -0.01049508 0.01096653 0.07127298 -0.01620197
0.03347323 -0.00860694 -0.0350171 -0.06126057 -0.0719668 0.04238606
-0.0091862 0.00415695 -0.07294588 0.07278994]
reward: -26.043609619140625
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 4. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 4. 3. 4.]
--------------------------------------------
step: 3/3
action: [ 0.02103273 0.02837837 -0.01036341 0.01712001 0.07781161 -0.02096083
0.04075465 -0.00289786 -0.0437214 -0.06717494 -0.07946283 0.04880828
-0.00270583 0.00693063 -0.06607414 0.07997666]
reward: -26.043609619140625
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [5. 7. 2. 4. 6. 1. 2. 2. 5. 0. 3. 1. 1. 5. 5. 2.]
--------------------------------------------
------------
episide: 3
------------
step: 1/3
action: [ 0.05199433 -0.01210307 -0.00384858 0.02168908 0.11057067 -0.00924241
0.06541479 -0.04730614 -0.0769427 -0.07981339 -0.07270621 0.09210572
-0.03299163 0.03383501 -0.04911559 0.11411195]
reward: -358.31024169921875
indices: [4 3 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 3. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 4. 3. 4.]
--------------------------------------------
step: 2/3
action: [ 0.04207677 -0.0011614 0.0264533 0.02537564 0.08660375 -0.0531966
0.02446204 -0.03892119 -0.06152189 -0.05672617 -0.06550181 0.06565554
-0.02404642 -0.00049087 -0.08765226 0.09832495]
reward: -358.31024169921875
indices: [4 3 4 4 4 3 4 3 3 3 3 4 3 3 3 4]
obs: [4. 3. 4. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 3/3
action: [ 0.03900013 -0.00065568 0.0234519 0.02473833 0.08915967 -0.05321316
0.0213735 -0.03646151 -0.06049293 -0.05427819 -0.06441119 0.06433693
-0.0224547 0.00132045 -0.08930709 0.09611174]
reward: -358.31024169921875
indices: [4 3 4 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [1. 7. 0. 3. 2. 1. 0. 6. 1. 4. 0. 5. 2. 4. 1. 4.]
--------------------------------------------
------------
episide: 4
------------
step: 1/3
action: [ 0.02609184 0.03017071 -0.0056693 0.01308658 0.0290275 -0.07856165
0.02822203 -0.02433331 -0.05612957 -0.08991605 -0.08471522 0.0802256
-0.01516445 -0.02843781 -0.08050141 0.09025438]
reward: -25.100173950195312
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 3 3 4]
obs: [4. 4. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 3. 3. 4.]
--------------------------------------------
step: 2/3
action: [ 0.05831401 0.00690948 -0.01142158 0.01969866 0.0891875 -0.04466347
0.0318232 -0.02673637 -0.07499496 -0.08569863 -0.07822106 0.07705219
-0.0190332 0.01112465 -0.05420144 0.10679524]
reward: -25.100173950195312
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 4. 3. 4. 4. 3. 4. 3. 3. 3. 3. 4. 3. 4. 3. 4.]
--------------------------------------------
step: 3/3
action: [ 0.0597463 0.00696555 -0.01015426 0.0191804 0.09275264 -0.04336767
0.03165875 -0.02653229 -0.07436275 -0.08480518 -0.08059401 0.07674515
-0.01913859 0.00935793 -0.05882743 0.10956027]
reward: -25.100173950195312
indices: [4 4 3 4 4 3 4 3 3 3 3 4 3 4 3 4]
obs: [4. 4. 6. 0. 6. 0. 6. 6. 7. 7. 5. 7. 3. 4. 2. 7.]
--------------------------------------------
```
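The loop that produces this kind of output looks roughly like the following sketch; the algorithm class, file names, and the gymnasium-style ```reset()```/```step()``` signatures are my assumptions, not the actual ```load_and_predict.py```:
```python
from stable_baselines3 import PPO

from numpy_env import RISEnv           # placeholder class name

env = RISEnv()
model = PPO.load("ppo_ris")            # placeholder file name
obs, info = env.reset()
for step in range(3):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"step: {step + 1}/3, reward: {reward}")
```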
- Main reasons it fails to learn
    - too few (or possibly too many) ```action```s
    - ```state``` too complex, dimensionality too high
    - ```reward``` not conducive to learning (see the sketch at the end of this section)
- Other possible causes
    - each algorithm is currently run with its default hyperparameters, without any tuning, and the default settings are probably not suitable for this customized environment
- Directions to test
    - try a simplified problem, e.g. average MSE, to check whether the Agent can actually work at all
    - try a similar problem that is known to work, e.g. Sum Rate Maximization
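One concrete thing that might be worth trying for the reward point above (a sketch, nothing tested yet, assuming the raw reward scale seen in the logs, roughly $-360$ to $+150$, hurts learning): let SB3's ```VecNormalize``` normalize observations and rewards. ```RISEnv``` is again a placeholder class name.
```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

from numpy_env import RISEnv         # placeholder class name

venv = DummyVecEnv([lambda: RISEnv()])
venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_reward=10.0)
model = PPO("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=100_000)
venv.save("vecnormalize.pkl")        # normalization statistics needed again at inference time
```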