# 2017q1 Homework3 (software-pipelining)
contributed by <`rayleigh0407`>

## Prerequisites
- Reading the paper
    1. INTRODUCTION
        - Only relatively simple software prefetching algorithms have appeared in state-of-the-art compilers like icc [ICC] and gcc [GCC-4.0].
        - First, there are few rigorous guidelines on how best to insert prefetch intrinsics. Secondly, the complexity of the interaction between software and hardware prefetching is not well understood.
        - GHB hardware prefetcher & stream-based prefetcher
    2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
        - The hardware-based prefetching mechanisms considered are the stream prefetcher, the GHB prefetcher, and the content-based prefetcher.
        - ![](http://i.imgur.com/iGABVqv.png)
        - ![](http://i.imgur.com/zkCZubp.png)
        - ![](http://i.imgur.com/jwFWbCr.png)
        - Software prefetching uses the x86 SSE extensions (`_mm_prefetch`).
        - ==**Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.**==
        - ![](http://i.imgur.com/c1EL12c.png)
        - Prefetch the array element that lies D positions ahead of the one currently needed, where D is the prefetch distance.
        - $$ D \ge \left\lceil \frac{l}{s} \right\rceil $$
        - $l$: prefetch latency. $s$: length of the shortest path through the loop body.
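The prefetch-distance rule above can be sketched in C as follows. This is only a minimal illustration, not code from the paper or from the assignment: the helper names `prefetch_distance` and `sum_with_prefetch`, the choice of the `_MM_HINT_T0` hint, and the assumption that $l$ and $s$ are both measured in cycles are all mine.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical helper: smallest D satisfying D >= ceil(l / s).
 * latency (l) and shortest_path (s) are assumed to be in cycles,
 * and shortest_path must be greater than zero. */
static size_t prefetch_distance(size_t latency, size_t shortest_path)
{
    return (latency + shortest_path - 1) / shortest_path;
}

/* Hypothetical loop: sum an array while requesting a[i + d] ahead of time,
 * so the data is (ideally) in cache when iteration i + d needs it. */
static long sum_with_prefetch(const int *a, size_t n, size_t d)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + d < n)  /* stay inside the array */
            _mm_prefetch((const char *)&a[i + d], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```

For instance, with an assumed latency of 200 cycles and a 30-cycle loop body, `prefetch_distance(200, 30)` gives 7, so the loop would request `a[i + 7]` while it is working on `a[i]`. The result of the sum is the same with or without the prefetch; only the timing can change.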
> Sorry, I was not able to organize the key points first.
> -- rayleigh0407

- Reading the code

**_mm_loadu_si128**
```
dst[127:0] := MEM[mem_addr+127:mem_addr]
```

**_mm_unpacklo_epi32**
```
INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[31:0]
    dst[63:32] := src2[31:0]
    dst[95:64] := src1[63:32]
    dst[127:96] := src2[63:32]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
```

**_mm_unpackhi_epi32**
```
INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[95:64]
    dst[63:32] := src2[95:64]
    dst[95:64] := src1[127:96]
    dst[127:96] := src2[127:96]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
```

**_mm_unpacklo_epi64**
```
INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[63:0]
    dst[127:64] := src2[63:0]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
```

**_mm_unpackhi_epi64**
```
INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[127:64]
    dst[127:64] := src2[127:64]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
```

**_mm_storeu_si128**
```
MEM[mem_addr+127:mem_addr] := a[127:0]
```

With these SSE intrinsics, the transpose of a 4x4 block of 32-bit integers proceeds as follows, where $a_b$ denotes the element in row $a$, column $b$ of the original block. Step 1 loads the four rows into $I$. Step 2 interleaves 32-bit elements of the row pairs (rows 0 and 1, rows 2 and 3) with `_mm_unpacklo_epi32` and `_mm_unpackhi_epi32`, giving $T$. Step 3 interleaves 64-bit halves of the rows of $T$ with `_mm_unpacklo_epi64` and `_mm_unpackhi_epi64`, which leaves the transposed block in $I$, ready to be stored.

$$
Step1\ : \ I\ =\ \left[
\begin{matrix}
0_0 & 0_1 & 0_2 & 0_3 \\
1_0 & 1_1 & 1_2 & 1_3 \\
2_0 & 2_1 & 2_2 & 2_3 \\
3_0 & 3_1 & 3_2 & 3_3 \\
\end{matrix}
\right]\\
Step2\ : \ T\ =\ \left[
\begin{matrix}
0_0 & 1_0 & 0_1 & 1_1 \\
2_0 & 3_0 & 2_1 & 3_1 \\
0_2 & 1_2 & 0_3 & 1_3 \\
2_2 & 3_2 & 2_3 & 3_3 \\
\end{matrix}
\right]\\
Step3\ : \ I\ =\ \left[
\begin{matrix}
0_0 & 1_0 & 2_0 & 3_0 \\
0_1 & 1_1 & 2_1 & 3_1 \\
0_2 & 1_2 & 2_2 & 3_2 \\
0_3 & 1_3 & 2_3 & 3_3 \\
\end{matrix}
\right]\\
$$

##