
2017q1 Homework3 (software-pipelining)

contributed by <rayleigh0407>

Preliminary work

  • Reading the paper
  1. INTRODUCTION
    • only relatively simple software prefetching algorithms have appeared in state-of-the-art compilers like icc [ICC] and gcc [GCC-4.0]
    • First, there are few rigorous guidelines on how best to insert prefetch intrinsics. Secondly, the complexity of the interaction between software and hardware prefetching is not well understood.
    • GHB hardware prefetcher & stream-based prefetcher
  2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
    • use stream prefetcher, GHB prefetcher, and content-based prefetcher as hardware-based prefetching mechanisms.
    • use x86 SSE extensions (_mm_prefetch)
    • Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
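    • Prefetch the value located D elements past the array position currently needed, where D is the prefetch distance.
    • $D \geq \lceil l / s \rceil$
    • l: prefetch latency. s: length of the shortest path through the loop body.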
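The prefetch-distance rule can be illustrated with a minimal C sketch. The function names `prefetch_distance` and `sum_with_prefetch`, and the cycle counts used to exercise them, are hypothetical and not taken from the paper or the homework code. The sketch only shows how D could be derived from l and s and then used with `_mm_prefetch`.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* Smallest D that satisfies D >= ceil(l / s).
 * l: prefetch latency in cycles (assumed value supplied by the caller)
 * s: cycles on the shortest path through one loop iteration */
static size_t prefetch_distance(size_t l, size_t s)
{
    assert(s > 0);
    return (l + s - 1) / s; /* integer ceiling division */
}

/* Sum an array while prefetching the element d iterations ahead. */
static long sum_with_prefetch(const int *a, size_t n, size_t d)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside the array so no out-of-range pointer is formed. */
        if (i + d < n)
            _mm_prefetch((const char *) &a[i + d], _MM_HINT_T0);
        total += a[i];
    }
    return total;
}
```

For example, with an assumed latency of 200 cycles and a 50-cycle loop body, `prefetch_distance(200, 50)` gives a distance of 4 iterations. A smaller D would issue the prefetch too late to hide the latency fully.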

Sorry, I have not organized the key points yet.
rayleigh0407

  • Reading the code

_mm_loadu_si128

dst[127:0] := MEM[mem_addr+127:mem_addr]

_mm_unpacklo_epi32

INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ 
    dst[31:0] := src1[31:0]
    dst[63:32] := src2[31:0] 
    dst[95:64] := src1[63:32] 
    dst[127:96] := src2[63:32] 
    RETURN dst[127:0] 
} 

dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])

_mm_unpackhi_epi32

INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[95:64]
    dst[63:32] := src2[95:64]
    dst[95:64] := src1[127:96]
    dst[127:96] := src2[127:96]
    RETURN dst[127:0] 
} 

dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
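To make the two 32-bit interleave operations concrete, here is a small C sketch. The helper name `interleave_dwords` is hypothetical; it only wraps the real intrinsics `_mm_loadu_si128`, `_mm_unpacklo_epi32`, `_mm_unpackhi_epi32`, and `_mm_storeu_si128` so the results can be inspected as plain arrays.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 integer intrinsics */
#include <stdint.h>

/* Interleave the 32-bit lanes of a and b.
 * lo receives a0 b0 a1 b1, hi receives a2 b2 a3 b3. */
static void interleave_dwords(const int32_t a[4], const int32_t b[4],
                              int32_t lo[4], int32_t hi[4])
{
    assert(a && b && lo && hi);
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) lo, _mm_unpacklo_epi32(va, vb));
    _mm_storeu_si128((__m128i *) hi, _mm_unpackhi_epi32(va, vb));
}
```

With `a = {0, 1, 2, 3}` and `b = {10, 11, 12, 13}`, `lo` becomes `{0, 10, 1, 11}` and `hi` becomes `{2, 12, 3, 13}`.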

_mm_unpacklo_epi64

INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[63:0]
    dst[127:64] := src2[63:0]
    RETURN dst[127:0] 
} 

dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])

_mm_unpackhi_epi64

INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ 
    dst[63:0] := src1[127:64]
    dst[127:64] := src2[127:64]
    RETURN dst[127:0] 
} 

dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])

_mm_storeu_si128

MEM[mem_addr+127:mem_addr] := a[127:0]

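With these SSE operations, the transposition proceeds as follows (each two-digit entry is the row and column index of an element):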

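Step1: load the four rows

$$I = \begin{bmatrix} 00 & 01 & 02 & 03 \\ 10 & 11 & 12 & 13 \\ 20 & 21 & 22 & 23 \\ 30 & 31 & 32 & 33 \end{bmatrix}$$

Step2: interleave 32-bit elements of row pairs (unpacklo/unpackhi epi32)

$$T = \begin{bmatrix} 00 & 10 & 01 & 11 \\ 20 & 30 & 21 & 31 \\ 02 & 12 & 03 & 13 \\ 22 & 32 & 23 & 33 \end{bmatrix}$$

Step3: interleave 64-bit halves (unpacklo/unpackhi epi64)

$$I = \begin{bmatrix} 00 & 10 & 20 & 30 \\ 01 & 11 & 21 & 31 \\ 02 & 12 & 22 & 32 \\ 03 & 13 & 23 & 33 \end{bmatrix}$$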
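The three steps can be written out as a short C sketch. This is a simplified, fixed-size 4x4 version under the assumption of a row-major `int32_t` matrix; the function name `transpose4x4` is hypothetical and this is not the homework's full-image routine. The lane comments follow the row/column labels used in the steps.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 integer intrinsics */
#include <stdint.h>

/* Transpose a row-major 4x4 int32 matrix from src into dst.
 * All rows are loaded before any store, so src == dst also works. */
static void transpose4x4(const int32_t *src, int32_t *dst)
{
    assert(src && dst);

    /* Step 1: load the four rows. */
    __m128i I0 = _mm_loadu_si128((const __m128i *) (src + 0));
    __m128i I1 = _mm_loadu_si128((const __m128i *) (src + 4));
    __m128i I2 = _mm_loadu_si128((const __m128i *) (src + 8));
    __m128i I3 = _mm_loadu_si128((const __m128i *) (src + 12));

    /* Step 2: interleave 32-bit elements of row pairs. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1); /* 00 10 01 11 */
    __m128i T1 = _mm_unpacklo_epi32(I2, I3); /* 20 30 21 31 */
    __m128i T2 = _mm_unpackhi_epi32(I0, I1); /* 02 12 03 13 */
    __m128i T3 = _mm_unpackhi_epi32(I2, I3); /* 22 32 23 33 */

    /* Step 3: interleave 64-bit halves to form the transposed rows. */
    I0 = _mm_unpacklo_epi64(T0, T1); /* 00 10 20 30 */
    I1 = _mm_unpackhi_epi64(T0, T1); /* 01 11 21 31 */
    I2 = _mm_unpacklo_epi64(T2, T3); /* 02 12 22 32 */
    I3 = _mm_unpackhi_epi64(T2, T3); /* 03 13 23 33 */

    _mm_storeu_si128((__m128i *) (dst + 0), I0);
    _mm_storeu_si128((__m128i *) (dst + 4), I1);
    _mm_storeu_si128((__m128i *) (dst + 8), I2);
    _mm_storeu_si128((__m128i *) (dst + 12), I3);
}
```

The unaligned load and store variants are used so the sketch makes no assumption about the alignment of `src` or `dst`.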