Lecture 8 : Deep Learning Software

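---
tags: cs231
---

# Lecture 8 : Deep Learning Software

---

## 1.CPU vs. GPU

### CPU

Central Processing Unit (CPU)

- Components

  A CPU consists of a control unit (CU), an arithmetic logic unit (ALU), registers, and cache memory, which exchange data and instructions over a bus.
  - Control unit: directs the flow of data and instructions
  - Arithmetic logic unit: performs arithmetic and logical (comparison) operations
  - Registers: small memory that temporarily holds data
  - Cache: holds data and instructions that the CPU accessed recently or uses often. (The registers inside the CPU must exchange data and instructions with main memory, but main memory is far slower than the registers. Without a cache the CPU would keep waiting for main memory to send and receive data, wasting CPU resources.)

  ![](https://i.imgur.com/On7kXCI.png)
- Side note:
  ![](https://i.imgur.com/FD4Ozel.png)
- How to read the specs:
  ![](https://i.imgur.com/2DZcesr.png)
  - Intel model names:
    ![](https://i.imgur.com/G7h9oVP.png)
    - i7: performance tier, i9 > i7 > i5 > i3
    - Four-digit number: the leading 9 means 9th generation; for the remaining 700, a larger number means better specs
    - X: "extreme"; Intel processors with an X are very fast and very expensive
    - K: unlocked for overclocking, which boosts performance for short periods; mainly for gaming, and priced higher
    - H: comes with a better graphics processor
    - U: low voltage, usually found in thin-and-light laptops
    - Y: ultra-low voltage, usually found in 2-in-1 tablets
    - T: power-optimized, so it uses less power
    - Q: quad-core
  - AMD model names:
    ![](https://i.imgur.com/FJvUUBB.png)
    - Series, from low to high: APU / Athlon / Ryzen R3 / R5 / R7 / R9
    - Suffix: X → overclocking-oriented, G → integrated graphics
  - Core count: more cores are like more workers, and more hands make the work go faster
  - Clock speed (base frequency and max turbo frequency): how many cycles the processor runs per second; a higher clock usually means a faster processor
    - Base frequency: the frequency the processor normally runs at
    - Max turbo frequency: the upper limit the clock can reach
  - Cache: the processor is so fast that it needs to read data quickly to sustain its performance, but main memory is too slow. A cache is therefore added to hold data that is reused heavily, acting as a buffer between the fast and slow devices. As a rule of thumb, a larger cache is better.
    - Access speed: cache > memory > disk
    - Capacity: disk > memory > cache
    - Manufacturing cost: cache > memory > disk
- CPU series (from low-end to high-end)
  - Intel: Celeron, Pentium, Core i3, i5, i7, i9
  - AMD: Athlon, Ryzen R3, R5, R7, R9
  - The tiers roughly correspond (i3 ≈ R3, i5 ≈ R5, i7 ≈ R7, i9 ≈ R9), so they are easy to compare

### GPU

- How they differ: about 5% of a CPU is ALU, while about 40% of a GPU is ALU
  ![](https://i.imgur.com/FtjrUl4.png)
- CUDA
  - Description: CUDA is a general-purpose parallel computing architecture from NVIDIA. Sorry, AMD users, you cannot use it. Here is a [news article](https://3c.ltn.com.tw/news/38730) from last year.
    ![](https://i.imgur.com/pPqFb3M.png =300x200)
  - [CUDA processing flow](https://blog.csdn.net/qq_42596142/article/details/103157468?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-3.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-3.nonecase)
    1. Initialize and copy the required data to the GPU
    2. Set the execution configuration and call the CUDA kernel function
    3. The CPU and GPU compute at the same time
    4.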
       Copy the results back to host memory

    ![](https://i.imgur.com/tcqjKE7.png)
- OpenCL: Open Computing Language. OpenCL provides an open framework standard for writing programs, especially parallel programs, for heterogeneous platforms. Such a platform can be made up of multi-core CPUs, GPUs, or other types of processors. The lecture slides say it is usually slower :(
  ![](https://i.imgur.com/vfrYKzH.png)

---

## 2.Tensorflow

### (1) Creating constants and variables

``` python
import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

# Build the computational graph
ts_c = tf.constant(2, name='ts_c')
ts_x = tf.Variable(ts_c + 5, name='ts_x')

# Run the computational graph
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print('ts_c=', sess.run(ts_c))
    print('ts_x=', sess.run(ts_x))
```

### (2) placeholder

Use a placeholder when a value should only be supplied at the time the graph is run.

``` python
import tensorflow as tf

width = tf.placeholder("int32")
height = tf.placeholder("int32")
area = tf.multiply(width, height)

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    # feed_dict passes in the values
    print('area=', sess.run(area, feed_dict={width: 6, height: 8}))
```

### (3) TensorBoard

Reference: https://blog.csdn.net/sinat_33761963/article/details/62433234

When creating tf.placeholder and tf.multiply, pass a name; that name is shown in the TensorBoard graph.

``` python
import tensorflow as tf

width = tf.placeholder("int32", name='width')
height = tf.placeholder("int32", name='height')
area = tf.multiply(width, height, name='area')

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print('area=', sess.run(area, feed_dict={width: 6, height: 8}))
    # Write the data to be shown in TensorBoard to a log file
    tf.summary.merge_all()
    train_writer = tf.summary.FileWriter('log/area', sess.graph)
    train_writer.close()  # flush the log to disk
```

Start TensorBoard with the command below. The log directory must be given; TensorBoard reads that directory and displays its contents.

``` shell
tensorboard --logdir=log/area
```

``` shell
Starting TensorBoard 41 on port 6006
(You can navigate to http://127.0.1.1:6006)
```

The web interface is impressive.

![](https://i.imgur.com/cFgVo5e.png)
![](https://i.imgur.com/bxbtbQL.png)

Click the "+" on a node to see what is inside it.

![](https://i.imgur.com/PalDTj9.png)
![](https://i.imgur.com/0ZhtKmQ.png)

### (4) Choosing CPU or GPU

[Official documentation](https://www.tensorflow.org/guide/gpu?hl=zh-cn)

- tf.device
  - gpu
    ``` python
    import tensorflow as tf

    size = 500
    with tf.device("/gpu:0"):
        W = tf.random_normal([size,
                              size], name='W')
        X = tf.random_normal([size, size], name='X')
        mul = tf.matmul(W, X, name='mul')
        sum_result = tf.reduce_sum(mul, name='sum')

    tfconfig = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=tfconfig) as sess:
        result = sess.run(sum_result)
    ```
    - Passing `log_device_placement=True` to `ConfigProto()` prints the device each operation runs on
      ![](https://i.imgur.com/oDqPM52.png)
    - If the GPU build of TensorFlow is installed, the machine has a supported GPU, and the graphics driver, CUDA, and cuDNN are installed correctly, a Session runs on the GPU by default, using GPU:0:
      ![](https://i.imgur.com/3oDd1dh.png)
  - cpu
    ``` python
    import tensorflow as tf

    size = 500
    with tf.device("/cpu:0"):
        W = tf.random_normal([size, size], name='W')
        X = tf.random_normal([size, size], name='X')
        mul = tf.matmul(W, X, name='mul')
        sum_result = tf.reduce_sum(mul, name='sum')

    tfconfig = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=tfconfig) as sess:
        result = sess.run(sum_result)
    ```
    - TensorFlow distinguishes GPUs as /gpu:0 and /gpu:1, whereas CPUs are not numbered per device and are all addressed as /cpu:0
      ![](https://i.imgur.com/9YtCpAZ.png)
  - Combined
    ``` python
    import tensorflow as tf

    # Creates a graph.
    # Assumes GPUs 2 and 3 exist; adjust the device names to your machine.
    c = []
    for d in ['/device:gpu:2', '/device:gpu:3']:
        with tf.device(d):
            a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
            b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
            c.append(tf.matmul(a, b))
    with tf.device('/cpu:0'):
        total = tf.add_n(c)
    # Creates a session with log_device_placement set to True.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    # Runs the op.
    print(sess.run(total))
    ```
- Via environment variables
  - Select GPUs
    ``` python
    import os

    # Order GPUs by PCI bus ID so that the GPU indices seen by the program
    # match the hardware order; leaving this out can cause needless confusion.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Use the first and third GPUs
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
    ```
  - Disable the GPU
    ``` python
    import os

    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    ```

---

## 3. PyTorch

Main components of PyTorch:
1. PyTorch tensors (Tensors)
2. Autograd module
3. Optim module
4. nn module

### (1) PyTorch tensors (Tensors)

![](https://i.imgur.com/YHG6MbH.png =300x400)

1. torch.randn(): create random tensors for data and weights

   Torch defines 7 CPU tensor types and 8 GPU tensor types:
   ![](https://i.imgur.com/s4rNRVx.png)
2.
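   forward pass

   Multiplication:
   - torch.mul(a, b): element-wise multiplication; the shapes must match (or be broadcastable), e.g. a (1,2) and b (1,2) -> (1,2)
   - torch.mm(a, b): matrix multiplication; the shapes need not be equal, but the number of columns of a must equal the number of rows of b, e.g. a (1,2) and b (2,3) -> (1,3)

   Clamping values: `torch.clamp(input, min, max, out=None) -> Tensor`
   ```
   min, if x < min
   x,   if min <= x <= max
   max, if x > max
   ```
3. backward pass
4. Compute the gradients and update the weights
5. run on GPU
   ![](https://i.imgur.com/FGcF113.png =300x100)

### (2) Autograd module

![](https://i.imgur.com/yW1c0qK.png =400x300)

- PyTorch builds its computational graph dynamically, and autograd provides automatic differentiation over it.
- In the PyTorch version used in the lecture, tensor data must first be wrapped in a Variable before autograd can be used. A Variable is essentially the same as a tensor; the difference is the extra attributes below. (Since PyTorch 0.4, Variable has been merged into Tensor, and `requires_grad=True` plays the same role.)
- Variable(): a wrapper around a tensor
  - data: holds the tensor
  - grad: holds the gradient
  - creator (named grad_fn in later versions): its value tells you whether the Variable is the result of a computation

![](https://i.imgur.com/7bchIYs.png)
![](https://i.imgur.com/46d7ugv.png)

https://www.jianshu.com/p/cbce2dd60120

＊You can also define your own autograd functions instead of using PyTorch's built-in ones.

![](https://i.imgur.com/U3TS4xm.png)

### (3) nn module

PyTorch already packages the building blocks of a neural network (layers, activation functions, losses), which can be used directly, much like Keras for TensorFlow.

![](https://i.imgur.com/VfrIaUQ.png)

- Put the w1, w2 architecture directly into nn layers
- Tell it which activation function to use
- Loss functions are available too
- Feed the data x into the model
- Feed pred into the loss function
- Run backward
- Update the weights

Everything comes in one package, ready to use!

## 4. Optim module

![](https://i.imgur.com/zWS2E4E.png =300x350)

- optimizer = torch.optim.Adam(): choose the optimization method; SGD, Adagrad, and others are also available
- optimizer.zero_grad(): reset the model's parameter gradients to zero for each batch, clearing the gradients from the previous batch (otherwise they accumulate)
- loss.backward(): compute the current gradients
- optimizer.step(): update the parameters using the current gradients

＊For each batch of data: compute the gradients once and update the network once.

＊Out of the box, only an L2 penalty can be set:

`optimizer = optim.Adam(model.parameters(),lr=learning_rate,weight_decay=0.01)`

*weight_decay (float, optional)(L2 penalty): (default: 0)*

https://pytorch.org/docs/stable/optim.html

＊To add L1 you have to write it yourself, and there are many ways to do so.

![](https://i.imgur.com/SHHJ5t6.png =300x130)

https://stackoverflow.com/questions/42704283/adding-l1-l2-regularization-in-pytorch
https://stackoverflow.com/questions/44641976/in-pytorch-how-to-add-l1-regularizer-to-activations

## 5. Data loaders

![](https://i.imgur.com/OFyMKp9.png =300x300)

#### Data loaders make it easy to process data in batches

![](https://i.imgur.com/oBVNtf6.png =300x100)

- batch_size = 8: 8 samples per batch
- Epoch 10: train over the whole dataset 10 times

## 6. Static and Dynamic graphs

![](https://i.imgur.com/l7YjXCH.png)

### Static computational graphs:

Once graph is built, can serialize it and run it without the code that built the graph!

- The graph is built once and reused on every run, so the graph-building code is not needed at execution time
- The graph can be optimized before it is run; the optimization has some up-front cost
- For CPU/GPU, operators can be placed ahead of time in the layout that gives the best utilization, so work is distributed and parallelized efficiently; this is where the up-front cost is amortized
- Not flexible: control flow has to be expressed with TensorFlow-specific functions (the lecturer wrapped it in some special function, and it looked hard to read)

![](https://i.imgur.com/uMDVZ4X.png =250x150)

### Dynamic computational graphs:

Graph building and execution are intertwined, so always need to keep code around!

- The graph is rebuilt on every run
- The code must be kept, because execution works by tracing the program as it runs
- Flexible: the code is ordinary Python, simple and clear

![](https://i.imgur.com/8ROQyrD.png)

---

## Questions

| Group |<center> Question </center>|
|:-------------------:|----------------------|
| Group 1 |1. How does cuDNN speed up computation? How should it be used in practice?|
| Group 2 |1. On page 56 of the 2017 slides, new_w1 and new_w2 are created. What happens if they are passed directly to sess.run()? Why use new_w1 instead of assigning back to w1 with w1.assign? <br> 2. On p. 107 of the slides, why call optimizer.zero_grad()? And how are L1 and L2 added? |
| Group 3 |<center>Presenting group</center>|
| Group 4 |1. I have no questions about TensorFlow and GPUs<br>2. I am going to skip class again |

---

## Answers

### Group 1

Reference: https://zhuanlan.zhihu.com/p/83971195

1. How does cuDNN speed up computation? How should it be used in practice?
   - What is CUDA?
     CUDA (Compute Unified Device Architecture) is a computing platform from the GPU vendor NVIDIA. CUDA™ is a general-purpose parallel computing architecture that lets the GPU solve complex computational problems.
   - What is cuDNN?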
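     NVIDIA cuDNN is a GPU-accelerated library for deep neural networks. It emphasizes performance, ease of use, and low memory overhead. cuDNN can be integrated into higher-level machine learning frameworks, such as the popular Caffe software from UC Berkeley. Its simple, drop-in design lets developers focus on designing and implementing neural network models rather than tuning performance, while still getting high-performance parallel computation on the GPU.
   - How does cuDNN achieve the speed-up?
     The most direct way is to use the cuDNN library itself. By transforming the computations of a convolutional neural network into matrix operations that are friendlier to the GPU, cuDNN can noticeably improve the training speed of the whole network.
     ![](https://i.imgur.com/lUSOtu9.png =300x200)
   - How is it set up?
     Install CUDA first, then install cuDNN: `$ tar -xzvf cudnn-9.0-linux-x64-v7.tgz`
   - Summary
     - CPUs suit serial computation and are good at logic and control.
     - GPUs are good at highly parallel, compute-intensive work, which suits training AI algorithms.
     - CUDA is NVIDIA's framework for managing and allocating compute units.
     - cuDNN is a GPU acceleration library for deep neural networks.
     - When training a model on a GPU you can choose to use cuDNN for acceleration. Training on the GPU also works without cuDNN, but according to the referenced article, training with cuDNN can be about twice as fast as GPU training without it.

### Group 2

- p.56

  ![](https://i.imgur.com/FppEQwX.jpg)

  tf.group() creates a single operation that groups all the operations passed in as arguments.

  ``` python
  import tensorflow as tf

  w = tf.Variable(1)
  mul = tf.multiply(w, 2)
  add = tf.add(w, 2)
  group = tf.group(mul, add)
  tup = tf.tuple([mul, add])
  # sess.run(group) and sess.run(tup) both evaluate Tensor(add) and Tensor(mul).
  # The difference: tf.group() returns an `op`,
  # while tf.tuple() returns a list of tensors.
  # As a result, sess.run(tup) returns the values of Tensor(mul) and Tensor(add),
  # whereas sess.run(group) does not.
  ```

  [tensorflow group source code](https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/ops/control_flow_ops.py#L2863-L2921)

  ``` python
  def group(*inputs, **kwargs):
    """Create an op that groups multiple operations.

    When this op finishes, all ops in `inputs` have finished. This op has no
    output.

    See also `tf.tuple` and `tf.control_dependencies`.

    Args:
      *inputs: Zero or more tensors to group.
      name: A name for this operation (optional).

    Returns:
      An Operation that executes all its inputs.

    Raises:
      ValueError: If an unknown keyword argument is provided.
    """
  ```

- p.107

  ![](https://i.imgur.com/ITdKh9K.png)

### Group 4, bye-bye

![](https://i.imgur.com/d2c2eMY.png)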
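
Returning to Group 2's p.107 question (why `optimizer.zero_grad()` is called, and how L1 and L2 penalties enter the update), the idea can be sketched without any framework. The block below is a minimal plain-Python sketch, not PyTorch itself: `TinyParam`, `backward`, `zero_grad`, `sign`, and `step` are hypothetical stand-ins written for this note, and the loss is assumed to be (w·x − y)² for a single scalar weight.

``` python
class TinyParam:
    """Hypothetical stand-in for a trainable scalar weight (not a PyTorch class)."""

    def __init__(self, value):
        self.data = value   # current weight
        self.grad = 0.0     # accumulated gradient


def backward(param, x, y):
    """Add d/dw of (w*x - y)**2 to param.grad; note the +=, gradients accumulate."""
    param.grad += 2.0 * (param.data * x - y) * x


def zero_grad(param):
    """Clear the accumulated gradient before the next batch."""
    param.grad = 0.0


def sign(v):
    """Return -1, 0, or 1: a subgradient of |v|."""
    return (v > 0) - (v < 0)


def step(param, lr, weight_decay=0.0, l1_lambda=0.0):
    """One gradient-descent update with optional L2 and L1 penalties."""
    grad = param.grad
    grad += weight_decay * param.data      # gradient of (weight_decay / 2) * w**2
    grad += l1_lambda * sign(param.data)   # subgradient of l1_lambda * |w|
    param.data -= lr * grad


w = TinyParam(1.0)
backward(w, x=1.0, y=3.0)
backward(w, x=1.0, y=3.0)   # no zero_grad in between: the gradient doubles
print(w.grad)

zero_grad(w)
backward(w, x=1.0, y=3.0)   # after clearing: only this batch's gradient
print(w.grad)

step(w, lr=0.1, weight_decay=0.5, l1_lambda=0.1)
print(w.data)
```

The `weight_decay * w` term mirrors how an L2 penalty shows up in the gradient, which is what the `weight_decay` argument in the notes controls. The L1 term is the part you would have to add yourself, for example by adding the sum of absolute parameter values to the loss before calling backward.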