---
# System prepended metadata

title: 'Lecture 8 : Deep Learning Software'
tags: [cs231]

---

# Lecture 8 : Deep Learning Software
---
## 1.CPU vs. GPU

### CPU
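Central Processing Unit (CPU)
- Components
A CPU consists of a control unit (CU), an arithmetic logic unit (ALU), registers, and cache memory, which exchange data and instructions over a bus.
    - Control unit: directs the flow of data and instructions
    - Arithmetic logic unit: performs arithmetic and logical operations
    - Registers: small memory that temporarily holds data
    - Cache memory: holds data and instructions that the CPU accessed recently or uses often
    (Data and instructions in the CPU's registers must be exchanged with main memory, but main memory is far slower than the registers. Without a cache, the CPU would spend much of its time waiting on main memory and waste its capacity.)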
![](https://i.imgur.com/On7kXCI.png)
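- Side note:
![](https://i.imgur.com/FD4Ozel.png)
- How to read the specs:
![](https://i.imgur.com/2DZcesr.png)
    - Intel model numbers:
    ![](https://i.imgur.com/G7h9oVP.png)
        - i7: performance tier, i9 > i7 > i5 > i3
        - Four-digit number: the leading 9 means 9th generation; for the remaining 700, a larger number means a higher spec
        - X: the top tier; Intel processors with an X are extremely fast and extremely expensive
        - K: unlocked for overclocking, which gives a short-term performance boost; mainly aimed at gaming and priced higher
        - H: uses a higher-performance graphics processor
        - U: low voltage, usually found in thin-and-light laptops
        - Y: ultra-low voltage, usually found in 2-in-1 tablets
        - T: power-optimized, so it draws less power
        - Q: quad-core
    - AMD model numbers:
    ![](https://i.imgur.com/FJvUUBB.png)
        - Series, from low to high: APU / Athlon / Ryzen R3 / R5 / R7 / R9
        - Suffix: X → overclockable, G → has integrated graphics
    - Core count: multiple cores are like multiple workers, and many hands make light work!
    - Clock speed (base frequency and max turbo frequency): the number of clock cycles the processor runs per second; a higher clock usually means a faster processor
        - Base frequency: the frequency the processor normally runs at
        - Max turbo frequency: the upper limit the clock can reach
    - Cache memory: the processor is so fast that it needs data delivered quickly to sustain its performance, but main memory is too slow to keep up. A cache is therefore placed in between to hold data that is reused heavily, acting as a buffer between the fast device and the slow one. The usual rule of thumb: the larger the cache, the better.
        - Access speed: cache > main memory > disk
        - Capacity: disk > main memory > cache
        - Cost per byte: cache > main memory > disk
- CPU product lines (low end to high end)
    - Intel: Celeron, Pentium, Core i3, i5, i7, i9
    - AMD: Athlon, Ryzen R3, R5, R7, R9
    - The tiers roughly line up as i3 = R3, i5 = R5, i7 = R7, i9 = R9, so they are easy to compare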
### GPU
- How they differ
Roughly 5% of a CPU is ALU, versus roughly 40% of a GPU
![](https://i.imgur.com/FtjrUl4.png)
- CUDA
    - About: CUDA is a general-purpose parallel computing architecture from NVIDIA. It runs only on NVIDIA GPUs, so AMD users are out of luck.
    Here is a [news item](https://3c.ltn.com.tw/news/38730) from last year.
    ![](https://i.imgur.com/pPqFb3M.png =300x200)
    - [CUDA processing flow](https://blog.csdn.net/qq_42596142/article/details/103157468?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-3.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-3.nonecase)
        - 1. Initialize and copy the required data to the GPU
        - 2. Configure the launch and call the CUDA kernel function
        - 3. The CPU and GPU compute concurrently
        - 4. Copy the results back to host memory
        ![](https://i.imgur.com/tcqjKE7.png)


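- OpenCL: Open Computing Language
OpenCL is an open framework standard for writing programs, parallel programs in particular, for heterogeneous platforms. Such a platform can be made up of multi-core CPUs, GPUs, or other types of processors. The lecture slides note that it is "usually slow" :(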

![](https://i.imgur.com/vfrYKzH.png)

---
## 2.Tensorflow
### (1)Creating constants and variables
``` python
import tensorflow as tf
# Build the computation graph (TensorFlow 1.x API)
ts_c = tf.constant(2, name='ts_c')
ts_x = tf.Variable(ts_c + 5, name='ts_x')
# Run the computation graph
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print('ts_c=', sess.run(ts_c))
    print('ts_x=', sess.run(ts_x))
```
### (2)placeholder
Use a placeholder when a value should only be supplied at the time the graph is run.
``` python
import tensorflow as tf
width = tf.placeholder("int32")
height = tf.placeholder("int32")
area = tf.multiply(width, height)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    # feed_dict supplies the placeholder values
    print('area=', sess.run(area, feed_dict={width: 6, height: 8}))
```
### (3)TensorBoard
Reference: https://blog.csdn.net/sinat_33761963/article/details/62433234

Pass a `name` when creating tf.placeholder and tf.multiply; those names are what the TensorBoard graph displays.
``` python
import tensorflow as tf
width = tf.placeholder("int32", name='width')
height = tf.placeholder("int32", name='height')
area = tf.multiply(width, height, name='area')

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print('area=', sess.run(area, feed_dict={width: 6, height: 8}))
    # Write the graph to a log directory so TensorBoard can display it
    train_writer = tf.summary.FileWriter('log/area', sess.graph)
    train_writer.close()
```
To launch TensorBoard, run the command below and point it at the log directory.
TensorBoard reads that directory and displays its contents.
``` shell
tensorboard --logdir=log/area
```
``` shell
Starting TensorBoard 41 on port 6006
(You can navigate to http://127.0.1.1:6006)
```
The web interface is impressive:
![](https://i.imgur.com/cFgVo5e.png)
![](https://i.imgur.com/bxbtbQL.png)
Click the "+" on a node to see what is inside it
![](https://i.imgur.com/PalDTj9.png)
![](https://i.imgur.com/0ZhtKmQ.png)


### (4)Choosing between CPU and GPU
[Official documentation](https://www.tensorflow.org/guide/gpu?hl=zh-cn)
- tf.device
    - gpu
        ``` python
        import tensorflow as tf
        size = 500
        with tf.device("/gpu:0"):
            W = tf.random_normal([size, size],name='W')
            X = tf.random_normal([size, size],name='X')
            mul = tf.matmul(W, X,name='mul')
            sum_result = tf.reduce_sum(mul,name='sum')

        tfconfig = tf.ConfigProto(log_device_placement=True)
        with tf.Session(config=tfconfig) as sess:
            result = sess.run(sum_result)
        ```  
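        - Passing log_device_placement=True to ConfigProto() prints the device each operation runs on
        ![](https://i.imgur.com/oDqPM52.png)
        - If the GPU build of TensorFlow is installed, the machine has a supported GPU, and the graphics driver, CUDA, and cuDNN are set up correctly, a Session runs on the GPU by default, using GPU:0: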
        ![](https://i.imgur.com/3oDd1dh.png)
    - cpu
        ``` python
        import tensorflow as tf
        size = 500
        with tf.device("/cpu:0"):
            W = tf.random_normal([size, size],name='W')
            X = tf.random_normal([size, size],name='X')
            mul = tf.matmul(W, X,name='mul')
            sum_result = tf.reduce_sum(mul,name='sum')

        tfconfig = tf.ConfigProto(log_device_placement=True)
        with tf.Session(config=tfconfig) as sess:
            result = sess.run(sum_result)
        ```
        - TensorFlow distinguishes GPUs as /gpu:0, /gpu:1, and so on, whereas CPUs are not numbered individually and are all addressed as /cpu:0
        ![](https://i.imgur.com/9YtCpAZ.png)
    - Combined
        ``` python
        import tensorflow as tf
        # Creates a graph (assumes at least four GPUs, since it uses gpu:2 and gpu:3).
        c = []
        for d in ['/device:gpu:2', '/device:gpu:3']:
          with tf.device(d):
            a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
            b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
            c.append(tf.matmul(a, b))
        with tf.device('/cpu:0'):
          sum = tf.add_n(c)
        # Creates a session with log_device_placement set to True.
        sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
        # Runs the op.
        print(sess.run(sum))
        ```
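- Via environment variables
    - Select GPUs
    ``` python
    import os
    # Make the GPU numbering in the program match the hardware (PCI bus) order, which avoids confusion.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Use the first and third GPUs
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
    ```
    - Disable the GPU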

    ``` python
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    ```
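
The environment-variable approach can be sketched as follows. `CUDA_VISIBLE_DEVICES` needs to be set before the framework initializes CUDA (in practice, before it is imported), and the GPUs that remain visible are renumbered from 0 inside the process. The helper `visible_to_physical` is a hypothetical name used only for illustration: it parses the variable and does not query any real hardware.

```python
import os

# Set these before TensorFlow/PyTorch is imported so they take effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"


def visible_to_physical(env_value):
    """Map logical device index -> physical GPU id (hypothetical helper).

    Only parses the string; it does not check that the GPUs exist.
    """
    mapping = {}
    for token in env_value.split(","):
        token = token.strip()
        if not token:
            continue
        physical = int(token)
        if physical < 0:      # a negative id hides itself and everything after it
            break
        mapping[len(mapping)] = physical
    return mapping


mapping = visible_to_physical(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # {0: 0, 1: 2}: "/gpu:1" in the program is physical GPU 2
```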
---
## 3. PyTorch
Main building blocks of PyTorch:
1. Tensors
2. Autograd module
3. Optim module
4. nn module

### (1)PyTorch Tensors

![](https://i.imgur.com/YHG6MbH.png =300x400)

1. torch.randn(): create random tensors for data and weights
Torch defines 7 CPU tensor types and 8 GPU tensor types:
![](https://i.imgur.com/s4rNRVx.png)
2. forward pass
Multiplication:
torch.mul(a, b): element-wise multiplication; the shapes must match (or be broadcastable), e.g. a (1,2) and b (1,2) -> (1,2)
torch.mm(a, b): matrix multiplication; only the inner dimensions must match, e.g. a (1,2) and b (2,3) -> (1,3)
Clamping values:
`torch.clamp(input, min, max, out=None) -> Tensor`
    ``` 
    min,  if x < min
     x ,  if min <= x <= max
    max,  if x > max
     ```
3. backward pass
4. Compute the gradients and update the weights
5. run on GPU
![](https://i.imgur.com/FGcF113.png =300x100)
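
The shape rules for `torch.mul` and `torch.mm` and the piecewise definition of `torch.clamp` can be checked with a small pure-Python sketch. The helper names (`elementwise_mul`, `matmul`, `clamp`) are made up for illustration and work on nested lists rather than real tensors; the ReLU step assumes the forward pass uses a clamp with min=0.

```python
def elementwise_mul(a, b):
    """Like torch.mul on same-shaped 2-D lists: multiply entry by entry."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def matmul(a, b):
    """Like torch.mm: (n, k) times (k, m) gives (n, m)."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def clamp(x, lo, hi):
    """Like torch.clamp on a scalar: limit x to the range [lo, hi]."""
    return max(lo, min(x, hi))


a = [[1.0, -2.0]]                       # shape (1, 2)
b = [[3.0, 4.0]]                        # shape (1, 2)
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # shape (2, 3)

print(elementwise_mul(a, b))   # [[3.0, -8.0]]       -> shape (1, 2)
h = matmul(a, w)
print(h)                       # [[1.0, -2.0, -4.0]] -> shape (1, 3)
# ReLU written as a clamp with min=0 and no upper bound
relu = [[clamp(v, 0.0, float("inf")) for v in row] for row in h]
print(relu)                    # [[1.0, 0.0, 0.0]]
```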

### (2)Autograd module

![](https://i.imgur.com/yW1c0qK.png =400x300)
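- PyTorch builds its computation graph dynamically, and autograd provides automatic differentiation on top of it.
- In the PyTorch version used in the lecture, a tensor has to be wrapped in a Variable before autograd can track it. A Variable behaves almost exactly like a tensor but carries the extra attributes below. (Since PyTorch 0.4, Variable has been merged into Tensor, and `requires_grad=True` is set on the tensor directly.)
- Variable(): a wrapper around a tensor
    - data: holds the tensor
    - grad: holds the gradient
    - creator (called grad_fn in later versions): shows whether the Variable is the result of a computation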
![](https://i.imgur.com/7bchIYs.png)
![](https://i.imgur.com/46d7ugv.png)

https://www.jianshu.com/p/cbce2dd60120
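
To make the three fields concrete, here is a toy scalar version of reverse-mode automatic differentiation. The class `Var` is hypothetical and far simpler than PyTorch's autograd (scalars only, two operations, no graph optimizations); it is only meant to show how `data`, `grad`, and `grad_fn` relate.

```python
class Var:
    """Toy scalar variable with data, grad and grad_fn (hypothetical class)."""

    def __init__(self, data, grad_fn=None, parents=()):
        self.data = data          # the stored value
        self.grad = 0.0           # d(output)/d(self), filled in by backward()
        self.grad_fn = grad_fn    # None for leaves, a name for computed results
        self._parents = parents   # pairs of (parent Var, local derivative)

    def __mul__(self, other):
        return Var(self.data * other.data, "mul",
                   ((self, other.data), (other, self.data)))

    def __add__(self, other):
        return Var(self.data + other.data, "add", ((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate the upstream gradient, then pass it to parents.
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)


x = Var(3.0)
w = Var(2.0)
b = Var(1.0)
y = w * x + b          # the graph is recorded while this line runs

y.backward()
print(y.data, y.grad_fn)        # 7.0 add
print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0
print(x.grad_fn)                # None: x is a leaf, not a computed result
```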

Note: you can also define your own autograd functions instead of relying on PyTorch's built-in ones
![](https://i.imgur.com/U3TS4xm.png)


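### (3)nn module
PyTorch ships ready-made neural-network building blocks (layers, activation functions, losses) that can be used directly,
similar to Keras for TensorFlow.

![](https://i.imgur.com/VfrIaUQ.png)
- Put the w1, w2 structure straight into nn layers
- Specify which activation function to use
- Loss functions are provided too
- Feed the data x into the model
- Pass the prediction into the loss function
- Run the backward pass
- Update the weights

Everything comes prepackaged and ready to use!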

## 4. Optim module
![](https://i.imgur.com/zWS2E4E.png =300x350)

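- optimizer = torch.optim.Adam(): choose the optimization method; SGD, Adagrad, and others are also available
- optimizer.zero_grad(): reset the parameter gradients to 0 at the start of each batch, clearing what the previous batch left behind
- loss.backward(): compute the gradients for the current batch
- optimizer.step(): update the parameters using the current gradients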
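
The zero_grad / backward / step rhythm, and the reason the gradients have to be cleared, can be seen in a toy pure-Python loop. `Param`, `SGD`, and `backward` here are hypothetical stand-ins written for this sketch, not PyTorch classes; the point is that the backward step adds to the stored gradient instead of overwriting it.

```python
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0


class SGD:
    """Toy optimizer with the same three-call rhythm (hypothetical class)."""

    def __init__(self, params, lr):
        self.params, self.lr = params, lr

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0

    def step(self):
        for p in self.params:
            p.value -= self.lr * p.grad


def backward(w, x, y):
    """Accumulate d/dw of the loss (w*x - y)**2 into w.grad."""
    w.grad += 2 * (w.value * x - y) * x


w = Param(0.0)
opt = SGD([w], lr=0.1)
batches = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]   # (x, y) pairs; the best w is 2

for x, y in batches:
    opt.zero_grad()      # clear the gradient left over from the previous batch
    backward(w, x, y)    # compute the gradient for this batch
    opt.step()           # update the parameter
print(round(w.value, 4))  # moves from 0 toward 2

# Without zero_grad the gradients from earlier batches pile up:
w2 = Param(0.0)
backward(w2, 1.0, 2.0)
backward(w2, 1.0, 2.0)
print(w2.grad)           # -8.0, twice the single-batch gradient of -4.0
```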

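Note: each batch of data produces one gradient computation and one update of the network.
Note: out of the box, only L2 regularization can be configured (through weight_decay).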

`optimizer = optim.Adam(model.parameters(),lr=learning_rate,weight_decay=0.01)`
*weight_decay (float, optional)(L2 penalty):  (default: 0)*
https://pytorch.org/docs/stable/optim.html

Note: to add L1 you have to write it yourself, and there are many ways to do it
![](https://i.imgur.com/SHHJ5t6.png =300x130)
https://stackoverflow.com/questions/42704283/adding-l1-l2-regularization-in-pytorch
https://stackoverflow.com/questions/44641976/in-pytorch-how-to-add-l1-regularizer-to-activations
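
As a rough sketch of what "writing L1 yourself" means: the penalty is the sum of absolute weight values, scaled by a strength `lam`, added to the data loss before the backward pass. The numbers and names below are made up for illustration and use plain floats. In PyTorch the same sum is usually taken over `model.parameters()`, for example with `p.abs().sum()` for each parameter.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|): the term added to the loss (lam is an assumed strength)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum(w**2), shown for comparison with the L1 term."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.5, 0.0, 2.0]
data_loss = 1.25                 # placeholder value standing in for the model's loss

total_l1 = data_loss + l1_penalty(weights, lam=0.01)
total_l2 = data_loss + l2_penalty(weights, lam=0.01)
print(total_l1, total_l2)
```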


## 5. Data loaders
![](https://i.imgur.com/OFyMKp9.png =300x300)

#### Data loaders make it easy to process data in batches

![](https://i.imgur.com/oBVNtf6.png =300x100)
- batch_size=8: each batch holds 8 samples
- 10 epochs: the full dataset is passed through training 10 times
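
The relationship between batch size and epochs can be sketched with a small generator. `data_loader` below is a hypothetical pure-Python stand-in, not PyTorch's `DataLoader`, and the 20-sample dataset is made up.

```python
import random


def data_loader(dataset, batch_size, shuffle=True, seed=None):
    """Yield the dataset in mini-batches (toy stand-in for a real data loader)."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


dataset = [(x, 2 * x) for x in range(20)]   # 20 made-up (input, target) pairs

for epoch in range(10):                      # 10 epochs: 10 passes over all data
    batches = list(data_loader(dataset, batch_size=8, seed=epoch))
    # 20 samples with batch_size=8 -> batches of 8, 8 and 4
    sizes = [len(b) for b in batches]
print(sizes)   # [8, 8, 4]
```

Every sample appears exactly once per epoch; only the order and grouping change between epochs.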

## 6. Static and Dynamic graphs
![](https://i.imgur.com/l7YjXCH.png)
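### Static computation graphs:
Once graph is built, can serialize it and run it without the code that built the graph!

- The graph is built once and reused on every run, so the code that built it is not needed at execution time
- The graph can be optimized before it runs; the optimization itself has an up-front cost
- For CPUs/GPUs, operations can be placed ahead of time in the arrangement that makes the best use of the hardware, enabling efficient distribution and parallelism; this is how the up-front cost is amortized
- Inflexible: control flow has to go through TensorFlow-specific functions (the lecturer wrapped it in some special function, which looked hard to read)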
![](https://i.imgur.com/uMDVZ4X.png =250x150)

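### Dynamic computation graphs:
Graph building and execution are intertwined, so always need to keep code around!
- The graph is rebuilt on every run
- The code has to be kept, because the graph is created by tracing the program as it executes
- Flexible: the code is plain Python, simple and clear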
 ![](https://i.imgur.com/8ROQyrD.png)
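
The contrast between the two styles can be imitated in plain Python. The `Placeholder` and `Multiply` classes are hypothetical miniatures written for this sketch, not TensorFlow objects; they only show that a static graph is described first and evaluated later, while dynamic code runs as it is written.

```python
# Static style: describe the computation first, run it later (possibly many times).
class Placeholder:
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) * self.right.evaluate(feed)


width, height = Placeholder("width"), Placeholder("height")
area = Multiply(width, height)            # nothing is computed yet

print(area.evaluate({"width": 6, "height": 8}))   # 48
print(area.evaluate({"width": 2, "height": 5}))   # 10, same graph reused


# Dynamic style: ordinary Python runs immediately, so control flow is just Python.
def dynamic_area(width, height):
    if width < 0 or height < 0:           # a plain `if`, no special graph op needed
        return 0
    return width * height


print(dynamic_area(6, 8))    # 48
print(dynamic_area(-1, 8))   # 0
```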


---
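## Questions
|        Group        |<center> Question </center>|
|:-------------------:|----------------------|
|       Group 1       |1. How does cuDNN speed up computation? How should it be used in practice?|
|       Group 2       |1. On page 56 of the 2017 slides, new_w1 and new_w2 are created. What happens if they are passed directly to sess.run()? Why assign to new_w1 instead of using w1.assign on w1 itself? <br> 2. On slide p. 107, why call optimizer.zero_grad()? And how are L1 and L2 added? |
|       Group 3       |<center>Presenting group</center>|
|       Group 4       |1. I have no questions about TensorFlow and GPUs<br>2. I'm skipping class again |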

---
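## Answers

### Group 1
Reference: https://zhuanlan.zhihu.com/p/83971195
1. How does cuDNN speed up computation? How should it be used in practice?

- What is CUDA?
    CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform from the GPU vendor NVIDIA. It lets the GPU be used to solve complex computational problems.

- What is cuDNN?
    NVIDIA cuDNN is a GPU-accelerated library for deep neural networks. It emphasizes performance, ease of use, and low memory overhead. cuDNN can be integrated into higher-level machine learning frameworks, such as UC Berkeley's popular Caffe. Its simple drop-in design lets developers focus on designing and implementing neural network models rather than on performance tuning, while still getting high-performance parallel computation on the GPU.

- How does cuDNN speed things up?
    The most direct way to speed up a network on the GPU is to use the cuDNN library. By recasting convolutional neural network computations as matrix operations that are friendlier to the GPU, cuDNN can noticeably speed up training of the whole network.
    ![](https://i.imgur.com/lUSOtu9.png =300x200)
- How is it set up?
   Install CUDA first, then install cuDNN, for example by unpacking the archive:
   `$ tar -xzvf cudnn-9.0-linux-x64-v7.tgz`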

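- Summary
    - CPUs suit serial computation and are good at control logic.
    - GPUs are good at massively parallel, compute-intensive work, which suits training AI models.
    - CUDA is NVIDIA's framework for managing and allocating the GPU's compute units.
    - cuDNN is a GPU-accelerated library for deep neural networks.
    - When training on a GPU, cuDNN is optional: training works without it, but according to the referenced article it can run about twice as fast with cuDNN enabled.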
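
The idea of turning a convolution into a matrix operation, mentioned in the answer above, can be sketched with the classic im2col lowering: every patch of the image is unrolled into a row, so the whole convolution becomes one matrix-vector product. This is an illustrative pure-Python version under simplifying assumptions (one channel, stride 1, no padding) and is not claimed to be what cuDNN does internally; the function names are made up.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one row (stride 1, no padding)."""
    rows, cols = len(image), len(image[0])
    patches = []
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            patches.append([image[i + di][j + dj]
                            for di in range(k) for dj in range(k)])
    return patches


def conv2d_as_matmul(image, kernel):
    """Cross-correlation (what deep learning calls convolution) as a matrix product."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_cols = len(image[0]) - k + 1
    # Each output value is a dot product of one patch row with the flat kernel.
    flat_out = [sum(p * w for p, w in zip(patch, flat_kernel))
                for patch in im2col(image, k)]
    return [flat_out[r:r + out_cols] for r in range(0, len(flat_out), out_cols)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_as_matmul(image, kernel))   # [[6, 8], [12, 14]]
```

The patch matrix is larger than the original image, so this trades memory for a single large, regular multiplication, which is the kind of workload GPUs handle well.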

### Group 2
- p.56
![](https://i.imgur.com/FppEQwX.jpg)
tf.group() creates a single operation that groups all the operations passed to it.
``` python
import tensorflow as tf
w = tf.Variable(1)
mul = tf.multiply(w, 2)
add = tf.add(w, 2)
group = tf.group(mul, add)
tuple = tf.tuple([mul, add])
# Both sess.run(group) and sess.run(tuple) evaluate Tensor(add) and Tensor(mul).
# The difference: tf.group() returns an `op`,
# while tf.tuple() returns a list of tensors.
# As a result, sess.run(tuple) returns the values of Tensor(mul) and Tensor(add),
# whereas sess.run(group) returns None.
```
[tensorflow group source code](https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/ops/control_flow_ops.py#L2863-L2921)
``` python
def group(*inputs, **kwargs):
  """Create an op that groups multiple operations.
  When this op finishes, all ops in `inputs` have finished. This op has no
  output.
  See also `tf.tuple` and
  `tf.control_dependencies`.
  Args:
    *inputs: Zero or more tensors to group.
    name: A name for this operation (optional).
  Returns:
    An Operation that executes all its inputs.
  Raises:
    ValueError: If an unknown keyword argument is provided.
  """
```
- p.107
![](https://i.imgur.com/ITdKh9K.png)


### Group 4
Bye-bye

![](https://i.imgur.com/d2c2eMY.png)