# SIMD Introduction
###### tags: `sysprog2016`

contributed by <`yenWu`>

:::info
Lecturer: [jserv](http://wiki.csie.ncku.edu.tw/User/jserv) / Course forum: [2016 System Software Course](https://www.facebook.com/groups/system.software2016/)
:mega: Back to the "[Advanced Computer Systems: Theory and Implementation](http://wiki.csie.ncku.edu.tw/sysprog/schedule)" course schedule
:::

---

# What Is SIMD and How Do We Use It?
1. SIMD Mode vs Scalar Mode
2. SIMD Memory access
3. SIMD application

----

## SIMD Mode vs Scalar Mode
![](https://i.imgur.com/e9u70x5.png)
* SIMD stands for Single Instruction, Multiple Data: several words are loaded at once and operated on in parallel. Scalar mode is the ordinary approach of processing one element after another.

----

## SIMD Memory access
### SIMD
![](https://i.imgur.com/YGrzijR.png =400x150)
### Scalar
![](https://i.imgur.com/tbBZ9fi.png =400x150)

----

## SIMD application
![](https://i.imgur.com/fPRT37q.png =650x400)
1. CPU (MMX/SSE/AVX)
2. GPU
3. DSP

---

# SIMD Optimization
1. Auto/Semi-Auto Method
2. Compiler Intrinsics
3. Specific Framework/Infrastructure
4. Coding in Assembly

----

## Auto/Semi-Auto Method
![](https://i.imgur.com/Nm1BtQ1.png)
* With OpenMP you can leave the original code untouched and only add `#pragma omp simd`; the compiler then turns the annotated vector loop into SIMD code. Remember that the loop has to satisfy the conditions OpenMP places on this directive.

----

## Compiler Intrinsics
* Intel SSE/AVX
* ARM NEON/MIPS ASE

> An intrinsic is a mechanism through which the compiler, using function-call syntax, lets developers request specific hardware instructions; it often maps directly to machine code. [name=jserv]

----

## Data Parallel Frameworks
* ==OpenCL/CUDA==/C++ AMP
* ==OpenVX/Halide==
* SIMD-Optimized Libraries
    * Apple Accelerate
    * OpenCV
    * ffmpeg/x264
    * fftw/Ne10

----

## Coding in Assembly
![](https://i.imgur.com/N51zxGM.png)
### We can see that the assembly code performs better, but why?
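----

## Example: scalar loop, `#pragma omp simd`, and intrinsics

For comparison with hand-written assembly, the first two optimization routes listed earlier can be sketched in C. This is only a minimal sketch: the function names `add_scalar`, `add_omp_simd`, and `add_sse` are hypothetical, the target is assumed to be an x86 CPU with SSE, and the pragma is assumed to be enabled with a flag such as `-fopenmp-simd` (without it the compiler simply ignores the pragma).

```c
#include <stddef.h>

/* Assumption: x86 with SSE. On other targets the code falls back to scalar. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Scalar mode: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Semi-auto method: the loop body is unchanged, the pragma asks the
 * compiler to vectorize it (effective only with OpenMP SIMD enabled). */
void add_omp_simd(const float *a, const float *b, float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Compiler intrinsics: four floats per iteration in a 128-bit register. */
void add_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Boundary handling: leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The unaligned load/store variants are used so the sketch works for any pointer; with buffers known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead.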
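----

## Coding in Assembly
* Assembly lets us control code size, cycle counts, and register usage precisely.
* Anyone who did the HW3 prefetch assignment will know that computing the `Prefetch Distance` requires the cycle count of the whole loop.

---

## The difficult part of SIMD
* Finding Parallelism in Algorithm
> Only a good parallel algorithm can bring out the performance SIMD is capable of.
* Boundary handling
> Because SIMD reads several words at a time, alignment is an important issue.

----

## The difficult part of SIMD
* Divergence
> There is no support for adding up four ints in one step and returning the sum.
* Register Spilling
> When more values are in use than there are SIMD registers, the extra ones have to be stored on the stack, and that overhead is very large.

----

## The difficult part of SIMD
* Unsupported Operations
> Not every operation has a SIMD form. Division is the usual example: SSE/AVX, for instance, provide no packed integer division, probably because a divider is hard to build and slow.

---

# What is important in SIMD?
![](https://i.imgur.com/RGnJxqu.png)

----

## Data Parallel Algorithm
* Map reduce

----

## Thread Level Parallel Programming
* multi-core
* each core usually has only one SIMD unit

----

## Instruction Level Parallel Programming
* Instruction pipeline
* Superscalar
* Hyper-Threading
* Out-of-order execution

----

## Instruction pipeline
![](https://i.imgur.com/8I7tBHf.png)

----

## Superscalar
![](https://i.imgur.com/8wADkg3.png)

----

## Hyper-Threading
![](https://i.imgur.com/y6t3dP2.png)

----

## Out-of-order execution
![](https://i.imgur.com/OhY0N68.png)

----

## Architecture
* Registers (xmm0 ~ xmm15 and ymm0 ~ ymm15 on x86-64; only 0 ~ 7 in 32-bit mode)
* Memory Hierarchy

----

## XMM register
![](https://i.imgur.com/exHCwa2.png)

----

## Memory Hierarchy
![](https://i.imgur.com/Or5nAKc.png)
* Cache locality

---

## Math vs Computer
![](https://i.imgur.com/NolK9C4.png)

----

## Math vs Computer
* The figure above illustrates something important: reading along a column and reading along a row have very different latency. The cache is filled one line at a time, so reading a column requires fetching 4 lines, whereas reading a row needs only one.
* Next comes the linear algebra: matrix multiplication can be computed using row accesses only. Finally, choosing 4\*4 is deliberate. First, it matches the size of a SIMD register (a 128-bit register holds one row of four 32-bit values). Second, many large matrix computations can be split into 4\*4 matrix operations whose results are then combined.

---

###### tag <`yenWU`> <`SIMD`> <`prefetch`>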
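----

## Example: 4\*4 matrix multiply using rows only

The claim that matrix multiplication can be computed with row accesses only can be sketched as follows: row `i` of the result is the sum over `k` of `a[i][k]` times row `k` of `b`, so `b` is never read by column. This is a minimal sketch under stated assumptions: the function names `mat4_mul_naive` and `mat4_mul_rows` are hypothetical, matrices are stored row-major as flat arrays of 16 floats, and the target is assumed to be an x86 CPU with SSE.

```c
/* Assumption: x86 with SSE. On other targets the code falls back to scalar. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Textbook version: the inner loop walks down a column of b,
 * so every step touches a different row of b. */
void mat4_mul_naive(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = sum;
        }
}

/* Row-only version: c_row(i) = sum over k of a[i][k] * b_row(k). */
void mat4_mul_rows(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++) {
#if HAVE_SSE
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 scale = _mm_set1_ps(a[i * 4 + k]);  /* broadcast one element */
            __m128 row   = _mm_loadu_ps(b + k * 4);    /* one whole row of b */
            acc = _mm_add_ps(acc, _mm_mul_ps(scale, row));
        }
        _mm_storeu_ps(c + i * 4, acc);                 /* write one row of c */
#else
        for (int j = 0; j < 4; j++)
            c[i * 4 + j] = 0.0f;
        for (int k = 0; k < 4; k++)
            for (int j = 0; j < 4; j++)
                c[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
#endif
    }
}
```

Each row of `b` fits exactly in one XMM register, which is one reason the 4\*4 size is convenient; a larger matrix could be processed by applying the same routine to its 4\*4 blocks and accumulating the partial products.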