SIMD Introduction

tags: `sysprog2016`

contributed by <yenWu>

Lecturer: jserv / Course forum: 2016 System Software Course

What's SIMD and how do we use it?

SIMD Mode vs Scalar Mode
SIMD Memory access
SIMD application

SIMD Mode vs Scalar Mode

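SIMD stands for Single Instruction, Multiple Data: one instruction loads several words at once and operates on all of them in parallel. Scalar mode is the ordinary approach, which processes one element after another.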
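To make the contrast concrete, here is a minimal sketch of the same addition in both modes. It assumes a GCC or Clang compiler, because the `v4f` type relies on their `vector_size` extension; the names `add_scalar`, `add_simd4`, and `v4f` are made up for this example.

```c
#include <stddef.h>

/* GCC/Clang vector extension (an assumption of this sketch):
 * four floats packed into one 16-byte value. */
typedef float v4f __attribute__((vector_size(16)));

/* Scalar mode: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD mode: a single vector addition covers all four lanes. */
v4f add_simd4(v4f a, v4f b)
{
    return a + b;
}
```

On a target with 128-bit SIMD registers, the compiler can typically lower `a + b` in `add_simd4` to one packed add instruction, whereas `add_scalar` needs four iterations for the same work unless it is auto-vectorized.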

SIMD Memory access

SIMD

Scalar

SIMD application

CPU(MMX/SSE/AVX)
GPU
DSP

SIMD Optimization

Auto/Semi-Auto Method
Compiler Intrinsics
Specific Framework/Infrastructure
Coding in Assembly

Auto/Semi-Auto Method

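With OpenMP you leave the original code untouched and only add `#pragma omp simd`; OpenMP then turns your vector computation into SIMD code. Keep in mind that the loop has to satisfy certain OpenMP conditions.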
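As a rough illustration, the pragma sits directly above an ordinary loop. The function `saxpy` below is a hypothetical example, not code from the course; `restrict` is added on the assumption that the two arrays never overlap, which is one of the conditions that lets the loop be vectorized safely.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Every iteration is independent of the others,
 * which is the basic condition for applying `omp simd`. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp` or `-fopenmp-simd`; without either flag it is ignored and the loop still runs correctly as scalar code.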

Compiler Intrinsics

Intel SSE/AVX
ARM NEON/MIPS ASE

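An intrinsic is a mechanism by which the compiler, using function-call syntax, lets the developer specify a hardware instruction; it usually maps directly to machine code. (jserv)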
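A small sketch of what this looks like with Intel SSE intrinsics follows. The wrapper name `add4` is hypothetical, and the `#if` guard with a scalar fallback is only there so that the snippet also builds on non-x86 targets.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Adds 4 floats from a and b and writes the 4 results to out. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats, any alignment */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* _mm_add_ps maps to ADDPS */
#else
    for (size_t i = 0; i < 4; i++)           /* scalar fallback, no SSE */
        out[i] = a[i] + b[i];
#endif
}
```

Each `_mm_*` call looks like a function but normally compiles to a single instruction, which is what "maps directly to machine code" refers to.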

Data Parallel Frameworks

OpenCL/CUDA/C++ AMP
OpenVX/Halide
SIMD-Optimized Libraries
- Apple Accelerate
- OpenCV
- FFmpeg/x264
- FFTW/Ne10

Coding in Assembly

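We can see that code written in assembly achieves better performance, but why?

Assembly lets us precisely control code size, cycle count, and register usage.
Anyone who did the HW3 prefetch assignment will know that computing the prefetch distance requires the cycle count of the whole loop.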
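The dependence on the loop's cycle count can be sketched with a common rule of thumb: look ahead by roughly ceil(memory latency / cycles per iteration) iterations. The two cycle figures below are invented for illustration and would have to be measured on real hardware; `sum_with_prefetch` is a hypothetical function, and `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>

/* Hypothetical numbers, for illustration only. */
#define MEM_LATENCY_CYCLES 200   /* assumed cost of a cache miss */
#define LOOP_CYCLES         12   /* assumed cycles per loop iteration */

/* Iterations to look ahead: ceil(latency / cycles per iteration). */
#define PF_DIST ((MEM_LATENCY_CYCLES + LOOP_CYCLES - 1) / LOOP_CYCLES)

long sum_with_prefetch(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* Hint: read access (0), keep in all cache levels (3). */
            __builtin_prefetch(&data[i + PF_DIST], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

With these assumed figures the distance works out to 17 iterations. Without a reliable cycle count for the loop body, which assembly gives you and compiled code may not, the distance can only be guessed.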

The difficult part of SIMD

Finding Parallelism in Algorithm

Only a good parallel algorithm can deliver the performance that SIMD is capable of.

Boundary handling

Because SIMD always loads several words at once, alignment is an important issue.
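A minimal sketch of the two boundary concerns, alignment and leftover elements, is shown here. It assumes GCC or Clang for the `vector_size` extension; `is_aligned16` and `add_arrays` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef float v4f __attribute__((vector_size(16)));  /* GCC/Clang extension */

/* True when p sits on a 16-byte boundary, the width of a 128-bit register. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    /* Main body: full groups of 4. memcpy keeps unaligned pointers safe. */
    for (; i + 4 <= n; i += 4) {
        v4f va, vb, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;
        memcpy(out + i, &vr, sizeof vr);
    }
    /* Tail: the last n % 4 elements do not fill a vector, so go scalar. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

When a buffer is known to be 16-byte aligned (for example one obtained from C11 `aligned_alloc`), aligned load instructions can be used instead, which is why code often checks alignment up front.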

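Divergence

There is no single instruction that adds the four ints in one register together and returns the sum.

When the code needs more values than there are dedicated SIMD registers, the extras have to be spilled to the stack, and the overhead of doing so is very large.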
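For the four-int sum, a typical workaround is a short sequence of shuffles and adds. The sketch below uses SSE2 intrinsics; `hsum4` is a hypothetical name, and the scalar branch exists only so the snippet builds on targets without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#endif

/* Returns v[0] + v[1] + v[2] + v[3]. */
int hsum4(const int *v)
{
#if defined(__SSE2__)
    __m128i x  = _mm_loadu_si128((const __m128i *)v);           /* [a, b, c, d] */
    __m128i hi = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)); /* [c, d, a, b] */
    __m128i s  = _mm_add_epi32(x, hi);              /* [a+c, b+d, a+c, b+d] */
    __m128i sw = _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)); /* swap pairs */
    s = _mm_add_epi32(s, sw);                       /* every lane: a+b+c+d */
    return _mm_cvtsi128_si32(s);                    /* extract lane 0 */
#else
    return v[0] + v[1] + v[2] + v[3];               /* scalar fallback */
#endif
}
```

Two shuffles and two adds replace what would be one instruction if a full horizontal add existed, so reductions like this are best done once after the loop rather than in every iteration.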

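Unsupported Operations

SIMD instruction sets do not cover every operation. Division is the usual example: SSE and AVX have no packed integer division instruction, and where floating-point division is available it is slow, probably because a divider is hard to build and takes many cycles.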
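One possible workaround on x86, sketched below, is to convert the integers to floats, use the floating-point divide (`_mm_div_ps`), and truncate back. This is only a sketch under a stated assumption: inputs are non-negative and below 2^24 so that each value is exactly representable as a float, with non-zero divisors. `div4_i32` is a hypothetical name.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: int <-> float conversions */
#endif

/* out[i] = a[i] / b[i] for 4 ints.
 * Assumes 0 <= a[i] < 2^24 and 0 < b[i] < 2^24 (exact as float). */
void div4_i32(const int *a, const int *b, int *out)
{
#if defined(__SSE2__)
    __m128 fa = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)a));
    __m128 fb = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)b));
    __m128 q  = _mm_div_ps(fa, fb);                 /* packed float divide */
    /* Convert back to int, truncating toward zero like C's `/`. */
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(q));
#else
    for (int i = 0; i < 4; i++)                     /* scalar fallback */
        out[i] = a[i] / b[i];
#endif
}
```

When the divisor is a known constant, multiplying by a precomputed reciprocal or using shifts is usually preferred over an actual divide.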

What is important in SIMD?

Data Parallel Algorithm

Map reduce
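As an illustration of the map-reduce shape, the sketch below computes a sum of squares: the map step squares each element independently and the reduce step adds the results. It is plain C, with four partial sums standing in for the four lanes of a 128-bit register; `sum_of_squares` is a hypothetical name.

```c
#include <stddef.h>

long sum_of_squares(const int *x, size_t n)
{
    long acc[4] = {0, 0, 0, 0};   /* one partial sum per "lane" */
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += (long)x[i + lane] * x[i + lane];  /* map, per lane */

    long total = acc[0] + acc[1] + acc[2] + acc[3];  /* reduce across lanes */
    for (; i < n; i++)
        total += (long)x[i] * x[i];                  /* leftover elements */
    return total;
}
```

Because the lanes never depend on each other inside the main loop, this form is the kind a compiler or a hand-written intrinsic version can map onto SIMD instructions.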

Thread Level Parallel Programming

multi-core
each core usually has only one SIMD unit
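The two levels can be combined. A minimal sketch using OpenMP's combined `parallel for simd` construct is shown here, as one possible way to do it; `scale` is a hypothetical function.

```c
/* Threads split the iteration range across cores; within each thread,
 * its chunk is vectorized for that core's SIMD unit. */
void scale(float *x, float k, long n)
{
    #pragma omp parallel for simd
    for (long i = 0; i < n; i++)
        x[i] *= k;
}
```

The pragma needs `-fopenmp` with GCC or Clang to take effect; without it the loop simply runs on one thread.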

Instruction Level Parallel Programming

Instruction pipeline
Superscalar
HyperThreading
Out-of-order execution

Architecture

Register(xmm0 ~ xmm7, ymm0 ~ ymm15)
Memory Hierarchy

XMM register

Memory Hierarchy

Cache locality

Math vs Computer

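The figure illustrates an important point: reading a column and reading a row incur very different latency. The cache loads one line at a time, so reading a column requires fetching 4 lines, whereas reading a row needs only one.
Next comes the linear algebra: a matrix multiplication can be computed using row accesses only. Finally, the choice of 4*4 is deliberate. First, it matches the size of a SIMD register; second, many large matrix operations can be broken into many 4*4 matrix operations whose results are then combined.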
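One way to get row-only access, sketched here as an assumption rather than the method used in the course, is to transpose the second matrix first so that both operands of the inner product are read along rows. Matrices are flat row-major arrays of 16 floats; `transpose4` and `matmul4_rows` are hypothetical names.

```c
#define N 4

/* Transpose a row-major 4x4 matrix: t[j][i] = m[i][j]. */
void transpose4(const float *m, float *t)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            t[j * N + i] = m[i * N + j];
}

/* c = a * b, where bt is b already transposed.
 * The inner loop walks row i of a and row j of bt, both contiguous. */
void matmul4_rows(const float *a, const float *bt, float *c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * bt[j * N + k];
            c[i * N + j] = sum;
        }
}
```

Each row of four floats is 16 bytes, so one row fits exactly in a 128-bit register, and the inner product over `k` becomes a natural candidate for a packed multiply followed by a reduction.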

tag <yenWU> <SIMD> <prefetch>
