# 2016q3 Homework3 (software-pipelining)

contributed by < `hunng` >

## Key Points from the Paper

[Paper](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

### 1. INTRODUCTION

- Only relatively simple software prefetching algorithms have appeared in state-of-the-art compilers such as icc and gcc.
- Programmers often have to insert prefetching intrinsics manually.
- There are few rigorous guidelines on how best to insert prefetch intrinsics.
- The complexity of the interaction between software and hardware prefetching is not well understood.
- The paper groups the benchmarks into three categories:
    - positive (performance improvement from software prefetching in excess of 5%)
        - milc, GemsFDTD, mcf, lbm, libquantum, and gcc
    - negative (performance degradation of more than 5% from software prefetching)
        - sphinx3 and leslie3d
    - neutral
        - bzip2, cactusADM, soplex, and bwaves

### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING

- Unit-stride cache-line accesses are called *streams*.
- Accesses with stride distances greater than two cache lines are called *strided*.

#### 2.1 Software Prefetch Intrinsics

- The paper uses software prefetch intrinsics that are part of the x86 SSE SIMD extensions.
- The intrinsic for x86 systems is `_mm_prefetch(char *ptr, int hint)`.
- It requires the programmer to specify both a prefetch address and a prefetch usage hint.
- One intrinsic is roughly translated to two instructions for direct address prefetches, or four instructions for indirect memory addresses.

Table II. Different Prefetch Hints in Intrinsic

| HINT               | L1 | L2 | L3 | Remarks                                            |
|--------------------|----|----|----|----------------------------------------------------|
| _MM_HINT_T0 (T0)   | O  | O  | O  | Temporal data with respect to all level caches     |
| _MM_HINT_T1 (T1)   | X  | O  | O  | Temporal data with respect to first level cache    |
| _MM_HINT_T2 (T2)   | X  | X  | O  | Temporal data with respect to second level cache   |
| _MM_HINT_NTA (NTA) | O  | X  | X  | Non-temporal data with respect to all level caches |

#### 2.2 Prefetch Classification

Table III. Prefetch Classifications

| Classification | Accuracy | Timeliness                                                       |
|----------------|----------|------------------------------------------------------------------|
| Timely         | O        | best prefetching                                                 |
| Late           | O        | demand request arrives before the prefetched block is serviced   |
| Early          | O        | demand request arrives after the block is evicted from a cache   |
| Redundant_dc   | O        | the same cache block is already in the data cache                |
| Redundant_mshr | O        | the same cache block is already in the MSHR                      |
| Incorrect      | X        | block is never used even after the block is evicted from a cache |

#### 2.3 Software Prefetch Distance

- Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
- The average memory latency and the average execution time of one loop iteration can vary, so the prefetch distance should be chosen large enough to hide the latency:

$$ D \geq \left\lceil\frac{l}{s}\right\rceil $$

- $D$ is called the *prefetch distance*, defined as how far ahead of the current memory address a prefetch should be requested (see the sketch at the end of this section).
- $l$ is the *prefetch latency*.
- $s$ is the length of the shortest path through the loop body.

#### 2.4 Direct and Indirect Memory Indexing

- Direct memory indexing
    - can be easily prefetched by hardware, since the memory addresses show regular stream/stride behavior.
- Indirect memory indexing
    - requires special hardware prefetching mechanisms;
    - is relatively simple to compute in software, so software prefetching tends to be more effective than hardware prefetching here;
    - still has a higher overhead than direct memory access prefetching.
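A minimal sketch of how the intrinsic and the prefetch distance fit together. `DIST` and `sum_with_prefetch` are illustrative names, not from the paper, and the distance value is an assumption:

```c
#include <xmmintrin.h>  /* _mm_prefetch(), _MM_HINT_* */

#define DIST 16  /* hypothetical prefetch distance, chosen as in Section 2.3 */

/* Sum an array while prefetching DIST elements ahead of the current access. */
long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)  /* do not prefetch past the end of the array */
            _mm_prefetch((const char *) &a[i + DIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```

Per Table II, `_MM_HINT_T0` asks for the block in all cache levels; a direct-indexed access like this one is exactly the kind a hardware stream prefetcher could also cover.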
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING

#### 3.1 Benefits of Software Prefetching over Hardware Prefetching

1. Large Number of Streams (Limited Hardware Resources)
    - The number of streams a stream prefetcher can track is limited by hardware resources, such as stream detectors and book-keeping mechanisms.
    - e.g. software prefetching can insert prefetch requests independently for each stream in lbm, whereas hardware prefetchers are typically confused when too many streams are present.
2. Short Streams
    - Hardware prefetchers require training time to detect the direction and distance of a stream or stride; they need at least two cache misses to detect the direction of a stream.
    - If a stream is extremely short, there will not be enough cache misses to train the hardware prefetcher and load useful cache blocks.
3. Irregular Memory Access
    - Software can prefetch irregular data structures by inserting appropriate prefetch intrinsics.
4. Cache Locality Hint
    - Hardware prefetchers in today's high-performance processors such as Nehalem [Intel 2009] and Barcelona [AMD] place the data in the lower-level cache of the requesting core, whereas most software-prefetched data is placed directly into the L1 cache.
    - Inserting prefetched blocks into the lower-level (L2 or L3) cache greatly reduces pollution of the higher-level (L1) cache, but the L2-to-L1 latency can degrade performance significantly.
    - Software prefetching offers greater flexibility to place data in the right cache level.
5. Loop Bounds
    - Several software techniques prevent prefetch requests from going out of array bounds, such as loop unrolling, software pipelining, and using branch instructions.
    - This is not possible in hardware, especially when the hardware prefetcher is aggressive (large prefetch distance and high prefetch degree).

#### 3.2 Negative Impacts of Software Prefetching

1. Increased Instruction Count
    - Unlike hardware prefetching, software prefetch instructions consume fetch and execution bandwidth and machine resources.
2. Static Insertion
    - The decision to prefetch and the choice of parameters are made statically (by the programmer or the compiler), and therefore cannot adapt to runtime behavioral changes such as varying memory latency, effective cache size, and bandwidth, especially on heterogeneous architectures.
3. Code Structure Change
    - When the number of instructions in a loop is extremely small, it can be challenging to insert software prefetch instructions that are timely, because there is not enough computation between the prefetch instructions and the demand requests.
    - Loop splitting: breaking the loop into multiple loops that have the same body but iterate over different contiguous portions of the index range (see the sketch after this list).
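A sketch of loop splitting, assuming a unit-stride loop; `DIST` and `scale` are illustrative names. The main loop prefetches `DIST` iterations ahead, and a separate epilogue loop covers the tail of the index range so that no prefetch falls outside the array:

```c
#include <xmmintrin.h>

#define DIST 16  /* hypothetical prefetch distance */

void scale(float *a, int n, float k)
{
    int i;
    for (i = 0; i + DIST < n; i++) {   /* main loop: prefetching is safe */
        _mm_prefetch((const char *) &a[i + DIST], _MM_HINT_T0);
        a[i] *= k;
    }
    for (; i < n; i++)                 /* epilogue: last DIST iterations */
        a[i] *= k;
}
```

Both loops share the same body; only the index ranges differ, which is what keeps the prefetches in bounds without per-iteration branching on the bound.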
#### 3.3 Synergistic Effects when Using Software and Hardware Prefetching Together

1. Handling Multiple Streams
    - Using software and hardware prefetching together can help cover more data streams.
2. Positive Training
    - Software prefetch requests can actually help train the hardware prefetcher.
    - If a block prefetched by software prefetching arrives late, a trained hardware prefetcher can improve prefetch timeliness.

#### 3.4 Antagonistic Effects when Using Software and Hardware Prefetching Together

1. Negative Training
    - Software prefetch requests can slow down hardware prefetcher training: if blocks prefetched by software prefetching hide part of one or more streams, the hardware prefetcher will not be trained properly.
    - Software prefetch instructions can also trigger overly aggressive hardware prefetches, which results in early requests.
2. Harmful Software Prefetching
    - Hardware prefetchers generally have lower accuracy than software prefetchers.
    - When software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth can further reduce the effectiveness of hardware prefetching.

### 4. EXPERIMENTAL METHODOLOGY

#### 4.1 Prefetch Intrinsic Insertion Algorithm

- Profile each application to identify prefetch candidates.
- A prefetch candidate is any load whose L1 misses per thousand instructions (MPKI) exceeds 0.05.
- To make the choice of prefetch distance systematic, compute it as follows (a worked example follows below):
    - $K$ is a constant factor and $L$ is an average memory latency; $K = 4$ and $L = 300$ are used for all benchmarks.
    - $IPC_{bench}$ is the profiled average IPC of each benchmark.
    - $W_{loop}$ is the average instruction count of one loop iteration.
    - Since $IPC_{bench}$ and $W_{loop}$ are determined from the profile data, the choice of distance is effectively profile-driven.

$$ Distance = \frac{K \times L \times IPC_{bench}}{W_{loop}} $$

- For loads on complex data structures, such as RDS, hash, C++ overloaded operators, and short functions without any loops, prefetch intrinsics are not used.
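As a hypothetical worked example (the $IPC_{bench}$ and $W_{loop}$ values here are illustrative, not taken from the paper): with $K = 4$, $L = 300$ cycles, $IPC_{bench} = 1.5$, and $W_{loop} = 30$ instructions,

$$ Distance = \frac{4 \times 300 \times 1.5}{30} = 60 $$

so the intrinsic would be inserted roughly 60 loop iterations ahead of the demand access.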
#### 4.2 Simulation Methodology

- MacSim, a trace-driven and cycle-level simulator, is used to evaluate software and hardware prefetching effects.

### 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING

#### 5.1 Overhead and Limitations of Software Prefetching

1. Instruction Overhead
    - One of the major negative impacts of software prefetching.
    - For GemsFDTD, bwaves, and leslie3d, the total number of instructions increases by more than 50%.
    - These extra instructions come not just from prefetch instructions but also from regular instructions needed to handle indirect memory accesses and index calculations.
    - Surprisingly, GemsFDTD is still in the positive group.
2. Software Prefetching Overhead
    - The overheads are measured by selectively eliminating each source:
        - cache pollution (SW+P): would be caused by early or incorrect prefetches, but is small in the experiments
        - bandwidth consumption (SW+B): mostly small values; current machines provide enough bandwidth for single-threaded applications
        - memory access latency (SW+L): larger values; software prefetching does not completely hide memory latency
        - redundant prefetch overhead (SW+R): the negative effect of redundant prefetch instructions is negligible even though the instruction count increases significantly
        - instruction overhead (SW+I): even though the number of instructions increases significantly for GemsFDTD, bwaves, and leslie3d, the actual time overhead is not high
3. The Effect of Prefetch Distance
    - Most benchmarks from the neutral and negative groups are insensitive to the prefetch distance.
4. Static Distance vs. Machine Configuration
    - One limitation of static prefetch distances is that the optimal distance might vary by machine.
    - However, static prefetch distance variance does not impact performance significantly even if the machine configuration changes at runtime.
5. Cache-Level Insertion Policy
    - An out-of-order processor can easily tolerate L1 misses, so hardware prefetchers usually insert prefetched blocks into the last-level cache to reduce the chance of polluting the L1 cache.
    - When the L1 cache miss rate is higher than what the processor can tolerate, however, even an L1 cache miss can degrade performance.
    - Since software prefetching is generally accurate, prefetched blocks can safely be inserted directly into the L1 cache.
    - T0 performs better than the other hints in all evaluated benchmarks (see the sketch after this list).
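A short sketch of what the cache-level insertion choice looks like in code, following Table II; the function names are illustrative:

```c
#include <xmmintrin.h>

/* T0 targets all levels including L1: best when the L1 miss rate is high
 * and the prefetch addresses are accurate. */
void prefetch_into_l1(const float *p)
{
    _mm_prefetch((const char *) p, _MM_HINT_T0);  /* into L1/L2/L3 */
}

/* T1 skips L1: reduces L1 pollution, at the cost of L2-to-L1 latency
 * on the eventual demand access. */
void prefetch_skipping_l1(const float *p)
{
    _mm_prefetch((const char *) p, _MM_HINT_T1);  /* into L2/L3 only */
}
```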
#### 5.2 Effect of Using Hardware and Software Prefetching Together

1. Hardware Prefetcher Training Effects
    - Two different hardware prefetcher training policies are compared:
        1. NT (SW+GHB, SW+STR): the hardware prefetcher's training algorithm ignores software prefetch requests and is trained only by demand cache misses
        2. Train (SW+GHB+T, SW+STR+T): training includes software prefetch requests
    - sphinx3: for both GHB and STR, the benchmark shows a small benefit from training with software prefetches.
    - milc, gcc, and soplex: negative effects for both GHB and STR.
    - mcf and bwaves: largely unaffected by the training policy.
    - The remaining benchmarks: the effect of the training policy is prefetcher-specific, for instance positive with GHB and negative with STR.
    - **It is generally better not to train the hardware prefetcher with software prefetch requests.**
2. Prefetch Coverage
    - The coverage of software prefetching is higher than that of the hardware prefetchers, but the software coverage of the benchmarks in the neutral and negative groups is lower than that of the hardware prefetchers.
    - This lower coverage is the main reason for the performance loss in the neutral and negative groups.
3. Prefetching Classification
    - Only bzip2 and soplex show more than 10% early prefetches.
    - For mcf, 10% of prefetches are late, due to a dependent load instruction that the prefetch intrinsic generates.
    - By eliminating the latency overhead (SW+L), the performance of mcf improves by 9%.
    - Even though a significant number of redundant prefetches exists in many benchmarks, they have little negative effect on performance.

#### 5.3 Using the GCC Compiler

- The intrinsics are inserted after compiling the code with gcc.
- The gcc compiler inserts more software prefetches than icc, except for mcf and lbm.
- gcc still cannot cover all cache misses.

#### 5.4 Real System Experiments

- The effect of software prefetching is measured on two recent Intel processors, Core 2 and Nehalem.
- In the real-system measurements, software prefetching reduces the performance of cactusADM, soplex, bwaves, sphinx3, and leslie3d by a small amount.
- All benchmarks in the positive group show benefits from using software prefetches.
- Due to the multiple levels of prefetchers in Nehalem (L3 to L2 and L2 to L1), the benefits of software prefetching are reduced.

#### 5.5 Profile-Guided Optimization (PGO)

- The performance of PGO lies between that of Base and SW.

#### 5.6 Hardware Prefetcher for Short Streams

- One weakness of hardware prefetching is the difficulty of exploiting short streams.
- ASD: on average, ASD provides a 7% performance benefit on milc without software prefetching; however, software prefetching itself can still provide more than a 2x speedup.
- Software prefetching is much more effective than ASD for prefetching short streams.

#### 5.7 Content Directed Prefetching

- CDP targets linked and other irregular data structures.
- CDP provides 3.9% and 6.5% improvements on mcf and gcc, respectively.
- Software prefetching is more effective for irregular data structures.

#### 5.8 Manual Tuning

- One cannot be sure that the insertion algorithm uses software prefetching to its full potential, so the benchmarks are also manually and exhaustively tuned:
    - remove useless prefetches
    - adjust the prefetch distance of each software prefetch
    - apply additional overhead-reduction techniques, such as loop unrolling and prologue/epilogue transformations
- There are no significant performance differences between manual tuning and the algorithm, especially in the positive group, which shows that the algorithm works well as a guideline for inserting software prefetch intrinsics.

#### 5.9 Summary of New Findings

1. Hardware prefetchers can under-exploit even regular access patterns; software prefetching is frequently more effective.
2. The prefetch distance does not need to be set precisely: performance is insensitive to it as long as the distance is greater than the minimum distance that hides the memory latency.
3. When the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective.
4. Having some extra instructions (software prefetch instructions) does not increase execution time by much.
5. Software prefetching can be used to train a hardware prefetcher and yield some performance improvement; however, it can also degrade performance.

### 6. CASE STUDIES: SOURCE CODE ANALYSIS OF INDIVIDUAL BENCHMARKS

## Object-Oriented Programming with C

## AVX

[instruction set](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2)

## Homework

### Rewriting with Encapsulation

First, design the object. I plan to put `src` and `out` into this object; the `w` and `h` fields are omitted and replaced by the `TEST_W` and `TEST_H` macros.

```c=
typedef struct _prefetch_object prefetch;
typedef void (*func_t)(prefetch *);

struct _prefetch_object {
    int *src;
    int *out;
    func_t transpose;   /* selected transpose implementation */
};
```

Then use a function `init_prefetch` to initialize it:

```c=
#include <stdlib.h>
#include <time.h>

void init_prefetch(prefetch **self)
{
    *self = malloc(sizeof(prefetch));
    (*self)->src = (int *) malloc(sizeof(int) * TEST_W * TEST_H);
    (*self)->out = (int *) malloc(sizeof(int) * TEST_W * TEST_H);

    /* fill src with random data */
    srand(time(NULL));
    for (int y = 0; y < TEST_H; y++)
        for (int x = 0; x < TEST_W; x++)
            *((*self)->src + y * TEST_W + x) = rand();
}
```

After this replacement, the timings differ slightly because `src` is now generated separately; the original version will be removed later.

```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
sse prefetch: 74236 us
sse: 136378 us
naive: 272622 us
sse prefetch: 74436 us
sse: 128086 us
naive: 248119 us
```

After the tests pass, split each function into its own `.c` file. Take `naive_transpose.c` as an example:

```c=
#include <stdlib.h>
#include <time.h>

/* TEST_W and TEST_H are assumed to be defined elsewhere (e.g. in a header) */

typedef struct _prefetch_object prefetch;
typedef void (*func_t)(prefetch *);

struct _prefetch_object {
    int *src;
    int *out;
    func_t transpose;
};

static void naive_transpose_object(prefetch *p)
{
    for (int x = 0; x < TEST_W; x++)
        for (int y = 0; y < TEST_H; y++)
            *(p->out + x * TEST_H + y) = *(p->src + y * TEST_W + x);
}

void init_prefetch(prefetch **self)
{
    *self = malloc(sizeof(prefetch));
    (*self)->src = (int *) malloc(sizeof(int) * TEST_W * TEST_H);
    (*self)->out = (int *) malloc(sizeof(int) * TEST_W * TEST_H);
    (*self)->transpose = naive_transpose_object;

    srand(time(NULL));
    for (int y = 0; y < TEST_H; y++)
        for (int x = 0; x < TEST_W; x++)
            *((*self)->src + y * TEST_W + x) = rand();
}

void clean(prefetch **self)
{
    free((*self)->out);
    free((*self)->src);
    free(*self);        /* also release the object itself */
    *self = NULL;
}
```
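A minimal sketch of how the encapsulated object could be driven, assuming the typedefs above are visible through a shared header; this `main` is hypothetical and not part of the original code:

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    prefetch *p;
    init_prefetch(&p);     /* allocates and fills src, wires up transpose */

    clock_t start = clock();
    p->transpose(p);       /* dispatch through the function pointer */
    clock_t end = clock();

    printf("transpose: %ld us\n",
           (long) ((end - start) * 1000000 / CLOCKS_PER_SEC));

    clean(&p);
    return 0;
}
```

Because each implementation file (naive, sse, sse prefetch) defines its own `init_prefetch`, the same driver can be linked against any of them without source changes.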