# 2017q1 Homework3 (software-pipelining)
contributed by <`rayleigh0407`>
## Preliminary work
- Reading the paper
1. INTRODUCTION
- Only relatively simple software prefetching algorithms have appeared in state-of-the-art compilers such as icc [ICC] and gcc [GCC-4.0].
- First, there are few rigorous guidelines on how best to insert prefetch intrinsics. Secondly, the complexity of the interaction between software and hardware prefetching is not well understood.
- GHB hardware prefetcher & stream-based prefetcher
2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
- The paper uses a stream prefetcher, a GHB prefetcher, and a content-based prefetcher as its hardware-based prefetching mechanisms.
- ![](http://i.imgur.com/iGABVqv.png)
- ![](http://i.imgur.com/zkCZubp.png)
- ![](http://i.imgur.com/jwFWbCr.png)
- Software prefetches are issued with the x86 SSE extensions (`_mm_prefetch`).
- ==**Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.**==
- ![](http://i.imgur.com/c1EL12c.png)
- Prefetch the value located D elements ahead of the array position currently needed, where D is the prefetch distance.
- $$ D \ge \lceil {l \over s}\rceil $$
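- l: prefetch latency. s: length of the shortest path through the whole loop body.
> Sorry, I did not manage to organize the key points beforehand.
> -- rayleigh0407
- Reading the code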
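
To make the prefetch distance formula concrete, here is a minimal C sketch. The helper names `prefetch_distance` and `sum_with_prefetch` are hypothetical and do not come from the paper or the homework code. It assumes l and s are both measured in cycles. For example, with an assumed latency l = 200 and shortest path s = 30, the formula gives D = 7.

```c
#include <stddef.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* D >= ceil(l / s), computed with integer arithmetic.
 * latency (l) and shortest_path (s) are assumed to be in cycles, s > 0. */
static size_t prefetch_distance(size_t latency, size_t shortest_path)
{
    return (latency + shortest_path - 1) / shortest_path;
}

/* Sum an array while requesting the element d iterations ahead.
 * The bounds check keeps the prefetch address inside the array. */
static long sum_with_prefetch(const int *a, size_t n, size_t d)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + d < n)
            _mm_prefetch((const char *)&a[i + d], _MM_HINT_T0);
        total += a[i];
    }
    return total;
}
```

A prefetch is only a hint, so the result of `sum_with_prefetch` is the same for any `d`; only the timing can change. The best distance on a given machine would still have to be measured.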
**_mm_loadu_si128**
```
dst[127:0] := MEM[mem_addr+127:mem_addr]
```
**_mm_unpacklo_epi32**
```
INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
dst[31:0] := src1[31:0]
dst[63:32] := src2[31:0]
dst[95:64] := src1[63:32]
dst[127:96] := src2[63:32]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
```
**_mm_unpackhi_epi32**
```
INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
dst[31:0] := src1[95:64]
dst[63:32] := src2[95:64]
dst[95:64] := src1[127:96]
dst[127:96] := src2[127:96]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
```
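
The two 32-bit interleave operations above can be tried out with a small C sketch. The function name `unpack32_demo` and the sample values are made up for illustration. Lane 0 of each array corresponds to bits `[31:0]` in the pseudocode.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Interleave the low two lanes (into lo) and the high two lanes (into hi)
 * of two vectors holding four 32-bit integers each. */
static void unpack32_demo(const int32_t a[4], const int32_t b[4],
                          int32_t lo[4], int32_t hi[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)lo, _mm_unpacklo_epi32(va, vb));
    _mm_storeu_si128((__m128i *)hi, _mm_unpackhi_epi32(va, vb));
}
```

With `a = {0, 1, 2, 3}` and `b = {10, 11, 12, 13}`, `lo` becomes `{0, 10, 1, 11}` and `hi` becomes `{2, 12, 3, 13}`.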
**_mm_unpacklo_epi64**
```
INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
dst[63:0] := src1[63:0]
dst[127:64] := src2[63:0]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
```
**_mm_unpackhi_epi64**
```
INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){
dst[63:0] := src1[127:64]
dst[127:64] := src2[127:64]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
```
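
The 64-bit variants move pairs of 32-bit lanes as a unit. A minimal sketch, with a hypothetical function name `unpack64_demo` and made-up sample values:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Combine the low 64-bit halves (into lo) and the high 64-bit halves
 * (into hi) of two vectors; each half holds two 32-bit integers. */
static void unpack64_demo(const int32_t a[4], const int32_t b[4],
                          int32_t lo[4], int32_t hi[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)lo, _mm_unpacklo_epi64(va, vb));
    _mm_storeu_si128((__m128i *)hi, _mm_unpackhi_epi64(va, vb));
}
```

With `a = {0, 1, 2, 3}` and `b = {10, 11, 12, 13}`, `lo` becomes `{0, 1, 10, 11}` and `hi` becomes `{2, 3, 12, 13}`.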
**_mm_storeu_si128**
```
MEM[mem_addr+127:mem_addr] := a[127:0]
```
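
The `u` in `_mm_loadu_si128` and `_mm_storeu_si128` means the address does not have to be 16-byte aligned. The sketch below copies 16 bytes between arbitrary addresses. The helper name `copy16_unaligned` is hypothetical.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Copy exactly 16 bytes from src to dst; neither pointer needs
 * 16-byte alignment because the unaligned load/store forms are used. */
static void copy16_unaligned(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}
```

This is why a transpose routine built on these two intrinsics can read rows starting at any column offset of the source matrix.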
With these SSE intrinsics, the transpose proceeds as follows:
$$
Step1\ : \ I\ =\
\left[
\begin{matrix}
0_0 & 0_1 & 0_2 & 0_3 \\
1_0 & 1_1 & 1_2 & 1_3 \\
2_0 & 2_1 & 2_2 & 2_3 \\
3_0 & 3_1 & 3_2 & 3_3 \\
\end{matrix}
\right]\\
Step2\ : \ T\ = \
\left[
\begin{matrix}
0_0 & 1_0 & 0_1 & 1_1 \\
2_0 & 3_0 & 2_1 & 3_1 \\
0_2 & 1_2 & 0_3 & 1_3 \\
2_2 & 3_2 & 2_3 & 3_3 \\
\end{matrix}
\right]\\
Step3\ : \ I\ =\
\left[
\begin{matrix}
0_0 & 1_0 & 2_0 & 3_0 \\
0_1 & 1_1 & 2_1 & 3_1 \\
0_2 & 1_2 & 2_2 & 3_2 \\
0_3 & 1_3 & 2_3 & 3_3 \\
\end{matrix}
\right]\\
$$
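
The three steps above can be sketched in C for a single 4x4 block of 32-bit integers. This assumes the block is stored contiguously in row-major order. The function name `transpose4x4_sse` is hypothetical, and the variable names `I0..I3` and `T0..T3` simply follow the matrices $I$ and $T$ above.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Transpose one 4x4 block of 32-bit integers (row-major, 16 elements). */
static void transpose4x4_sse(const int32_t *src, int32_t *dst)
{
    /* Step 1: load the four rows of I. */
    __m128i I0 = _mm_loadu_si128((const __m128i *)(src + 0));
    __m128i I1 = _mm_loadu_si128((const __m128i *)(src + 4));
    __m128i I2 = _mm_loadu_si128((const __m128i *)(src + 8));
    __m128i I3 = _mm_loadu_si128((const __m128i *)(src + 12));

    /* Step 2: interleave 32-bit lanes of row pairs to form T. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1); /* 0_0 1_0 0_1 1_1 */
    __m128i T1 = _mm_unpacklo_epi32(I2, I3); /* 2_0 3_0 2_1 3_1 */
    __m128i T2 = _mm_unpackhi_epi32(I0, I1); /* 0_2 1_2 0_3 1_3 */
    __m128i T3 = _mm_unpackhi_epi32(I2, I3); /* 2_2 3_2 2_3 3_3 */

    /* Step 3: combine 64-bit halves to obtain the transposed rows. */
    I0 = _mm_unpacklo_epi64(T0, T1); /* 0_0 1_0 2_0 3_0 */
    I1 = _mm_unpackhi_epi64(T0, T1); /* 0_1 1_1 2_1 3_1 */
    I2 = _mm_unpacklo_epi64(T2, T3); /* 0_2 1_2 2_2 3_2 */
    I3 = _mm_unpackhi_epi64(T2, T3); /* 0_3 1_3 2_3 3_3 */

    _mm_storeu_si128((__m128i *)(dst + 0), I0);
    _mm_storeu_si128((__m128i *)(dst + 4), I1);
    _mm_storeu_si128((__m128i *)(dst + 8), I2);
    _mm_storeu_si128((__m128i *)(dst + 12), I3);
}
```

For a larger matrix, the same steps would be applied block by block, with the row stride replaced by the matrix width.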