# CS 203 Reading Notes - 10/18
WRL Technical Note TN-14
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-14.pdf
## 1. Introduction
- Cache performance has become increasingly important.
- CPU cycle times have been decreasing faster than main memory access times, so each cache miss costs more cycles.
## 2. Baseline Design

## 3. Reducing Conflict Misses: Miss Caching and Victim Caching
### Types of Misses
:::info
1. conflict
2. compulsory
3. capacity
4. coherence: Invalidation to preserve multiprocessor cache consistency.
(Refer to Appendix B-3 for points 1, 2, 3)
:::
* Conflict misses: direct-mapped (20~40% of all misses) > set-associative
* Overall performance: direct-mapped > set-associative (faster access time outweighs the higher miss rate)
### 3.1 Miss Caching
Add associativity to a direct-mapped cache by placing a small miss cache on-chip between a first-level cache and the access port to the second-level cache (Figure 4).

* When a miss occurs -> return data to the miss cache **and** the direct-mapped cache
* When the direct-mapped cache is probed, the miss cache is probed in parallel.
* If the access misses in the direct-mapped cache but hits in the miss cache, reload from the miss cache in the next cycle.
* Replaces the long off-chip miss penalty with a one-cycle on-chip miss.
* Removes more **data** conflict misses than **instruction** conflict misses.
* Instruction conflicts tend to be widely spaced because the instructions within one procedure will not conflict with each other as long as the procedure size is less than the cache size, which is almost always the case.
* The miss cache is most effective at removing conflict misses (probe/fill logic sketched below).
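
A minimal Python sketch of the probe/fill logic described above, assuming a toy direct-mapped L1 and an LRU-replaced miss cache (sizes and names are illustrative, not from the paper):

```python
from collections import OrderedDict

class MissCache:
    """Toy model: a direct-mapped L1 plus a small fully associative
    miss cache probed in parallel (a sketch, not the paper's design)."""

    def __init__(self, l1_lines=1024, line_bytes=16, miss_entries=4):
        self.l1_lines = l1_lines
        self.line_bytes = line_bytes
        self.l1 = [None] * l1_lines        # one tag per direct-mapped set
        self.mc = OrderedDict()            # line address -> True, in LRU order
        self.miss_entries = miss_entries

    def access(self, addr):
        line = addr // self.line_bytes
        idx = line % self.l1_lines
        if self.l1[idx] == line:
            return "l1_hit"
        if line in self.mc:                # probed in parallel with the L1
            self.mc.move_to_end(line)
            self.l1[idx] = line            # reload the L1 in the next cycle
            return "miss_cache_hit"        # one-cycle penalty instead of off-chip
        self.l1[idx] = line                # true miss: fill BOTH the L1 and the
        self.mc[line] = True               # miss cache (the duplication that
        if len(self.mc) > self.miss_entries:
            self.mc.popitem(last=False)    # victim caching later removes)
        return "l2_miss"
```

For example, two addresses that map to the same L1 set stop thrashing after one round trip each:

```python
c = MissCache()
for a in [0, 16384, 0, 16384]:       # both map to L1 set 0
    print(c.access(a))               # l2_miss, l2_miss, miss_cache_hit, miss_cache_hit
```
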
### 3.2 Victim Caching
* With the miss cache (**fully associative cache**) above, when a cache miss occurs the same line is placed in both the miss cache and the direct-mapped cache --> redundancy.
* Better use: load the **victim** line evicted from the direct-mapped cache into the fully associative cache instead.
* Miss in the direct-mapped cache, hit in the victim cache --> swap the hit line with the victim from the direct-mapped cache.

*Note that victim caches consisting of just one line are useful, in contrast to miss caches which must have two lines to be useful.*
* A relatively small victim cache can achieve much of the performance gain of a set-associative cache, as shown in Figure 9 (swap logic sketched below).
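
The swap logic as a sketch, mirroring the toy model above but loading the evicted victim instead of the missed line (again with illustrative sizes):

```python
from collections import OrderedDict

class VictimCache:
    """Toy model: on an L1 miss the *victim* line moves into the small
    fully associative cache; a victim-cache hit swaps the two lines."""

    def __init__(self, l1_lines=1024, line_bytes=16, victim_entries=4):
        self.l1_lines = l1_lines
        self.line_bytes = line_bytes
        self.l1 = [None] * l1_lines
        self.vc = OrderedDict()            # line address -> True, in LRU order
        self.victim_entries = victim_entries

    def access(self, addr):
        line = addr // self.line_bytes
        idx = line % self.l1_lines
        victim = self.l1[idx]
        if victim == line:
            return "l1_hit"
        if line in self.vc:                # hit under an L1 miss: swap
            del self.vc[line]
            if victim is not None:
                self.vc[victim] = True     # the victim takes the hit's slot
            self.l1[idx] = line
            return "victim_cache_hit"
        self.l1[idx] = line                # true miss: missed line fills L1 only;
        if victim is not None:             # the displaced victim (not the missed
            self.vc[victim] = True         # line) enters the victim cache
            if len(self.vc) > self.victim_entries:
                self.vc.popitem(last=False)
        return "l2_miss"
```

Because entries never duplicate lines still in the L1, even a single entry adds useful associativity, matching the one-line note above:

```python
v = VictimCache(victim_entries=1)
for a in [0, 16384, 0, 16384]:       # same ping-pong as before
    print(v.access(a))               # l2_miss, l2_miss, victim_cache_hit, victim_cache_hit
```
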

### 3.3 The Effect of Direct-Mapped Cache Size on Victim Cache Performance
* Direct-mapped cache $\nearrow$, relative size of victim cache $\searrow$
* $\Rightarrow$ Percentage of conflict misses $\searrow$, performance gain from victim cache $\searrow$.
### 3.4 The Effect of Line Size on Victim Cache Performance
* For the same cache size, longer line size (i.e., larger block size)
* $\Rightarrow$ Percentage of conflict misses $\nearrow$
* $\Rightarrow$ Benefit of victim cache $\nearrow$
### 3.5 Victim Caches and Second-Level Caches
* Size of cache $\nearrow$
* Percentage of ++conflict++, ++compulsory++ misses $\nearrow$.
* Percentage of ++capacity++ misses $\searrow$.
* Number of conflict misses increases with increasing line sizes.
* $\Rightarrow$ L2\$ that have larger line sizes may also benefit from a victim cache.
* If conflicts occur at the L1\$, they may also occur at the L2\$ $\Rightarrow$ including an L2\$ victim cache could be useful.
### 3.6 Miss Caches, Victim Caches, and Error Correction
* Miss caches can aid **yield enhancement** and **fault tolerance**: they duplicate recently fetched lines, so a defective or erroneous line in the cache can be supplied from the miss cache.
* A victim cache as described is not useful for correcting misses due to parity errors, since it holds the victim rather than a duplicate of the erroneous line.
* But when a miss is caused by a parity error:
* Load the victim cache with the missed data instead of the victim.
* $\Rightarrow$ The victim cache then provides both error recovery and performance enhancement.
## 4. Reducing Capacity and Compulsory Misses
* Two approaches, both forms of prefetch:
1. Longer line sizes.
2. Explicit prefetching methods.
* But $\nearrow$ line size can cause $\nearrow$ miss rate and $\nearrow$ amount of data transferred.
### 4.1 Reducing Capacity and Compulsory Misses with Long Lines
* Increasing line size may not be beneficial once conflict misses are taken into account (Figure 14).

* The difference in **spatial locality** between ++instructions and data++ causes the lines in Figure 14 to diverge.
* Data tends to be more scattered.
* Note that performance of instruction cache may differ for individual programs.
* A miss cache can mitigate the $\nearrow$ in conflict misses caused by longer lines.
### 4.2 Reducing Capacity and Compulsory Misses with Prefetch Techniques
* *Prefetch always* prefetches after every reference; *prefetch on miss* and *tagged prefetch* are the other classical techniques compared in 4.2.6.
#### 4.2.1 Stream Buffers
* Goal: start the prefetch before a tag transition can take place (i.e., earlier than tagged prefetch) $\Rightarrow$ stream buffers.
* A stream buffer:
* Consists of a series of entries.
* Each entry consists of a tag, an available bit, and a data line.
* When a miss occurs:
* Stream buffer begins prefetching successive lines starting at the miss target.
* The tag for each prefetched address is entered into its stream buffer entry.
* Available bit is set to false.
* When the **prefetch data returns it is placed in the entry** with its tag and the available bit is set to true.
* The buffer is a simple FIFO **queue**.
* Only the head of the queue has a tag comparator.
* $\Rightarrow$ If a reference misses the head, the whole stream buffer is flushed and restarted.
* A more complicated buffer could provide already-fetched lines out of sequence (see 4.2.3); a sketch of the basic FIFO follows.
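A sketch of the FIFO mechanism under simplifying assumptions (prefetched lines are modeled as immediately available; a real buffer would hold the available bit false until memory responds):

```python
from collections import deque

class StreamBuffer:
    """Toy sequential stream buffer: a FIFO of prefetched line tags.
    Only the head has a comparator; a head miss flushes the whole stream."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()                # prefetched line-address tags
        self.next_line = 0                 # next successive line to prefetch

    def restart(self, miss_line):
        """On a cache+buffer miss: flush, then prefetch successive lines
        starting just past the miss target (which memory sends to the cache)."""
        self.fifo.clear()
        self.next_line = miss_line + 1
        self.refill()

    def refill(self):
        # issue prefetches until the FIFO is full; here they complete
        # instantly, standing in for tag entry + available-bit handshake
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += 1

    def lookup(self, line):
        """Probe on an L1 miss; returns True if the head supplies the line."""
        if self.fifo and self.fifo[0] == line:
            self.fifo.popleft()            # head line moves into the cache
            self.refill()                  # entries shift up; prefetch one more
            return True
        self.restart(line)                 # miss anywhere but head: flush all
        return False
```

For a purely sequential miss stream, only the first lookup goes off-chip:

```python
sb = StreamBuffer()
print([sb.lookup(line) for line in [100, 101, 102, 103]])
# [False, True, True, True]
```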

#### 4.2.2 Multi-Way Stream Buffers
* The stream buffer presented in the previous section removes:
* 72% of the **instruction** cache misses
* But only 25% of the **data** cache misses
* $\because$ Data references tend to consist of interleaved streams of data from different sources.
* Fix: several stream buffers in parallel; when all buffers miss, the LRU stream buffer is cleared and restarted at the miss address.
* Performance enhancement (4-way buffer):
* Instruction: virtually unchanged.
* Data: removes ~43% of misses in the test programs (LRU reallocation sketched below).
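A sketch of the LRU reallocation, reusing the `StreamBuffer` class from the sketch in 4.2.1 (the interface is my assumption, not the paper's):

```python
class MultiWayStreamBuffer:
    """Four stream buffers probed in parallel; when every buffer misses,
    the least-recently-used one is flushed and restarted at the miss."""

    def __init__(self, ways=4, depth=4):
        self.buffers = [StreamBuffer(depth) for _ in range(ways)]
        self.lru = list(range(ways))       # front = least recently used way

    def lookup(self, line):
        for i, buf in enumerate(self.buffers):
            if buf.fifo and buf.fifo[0] == line:   # one comparator per head
                buf.fifo.popleft()
                buf.refill()
                self.lru.remove(i)         # mark way i most recently used
                self.lru.append(i)
                return True
        victim = self.lru.pop(0)           # all ways missed: reallocate LRU
        self.buffers[victim].restart(line)
        self.lru.append(victim)
        return False
```

Two interleaved sequential streams (e.g. two arrays walked together) now each keep their own buffer instead of flushing a single one on every alternation:

```python
m = MultiWayStreamBuffer()
print([m.lookup(l) for l in [100, 200, 101, 201, 102, 202]])
# [False, False, True, True, True, True]
```
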
#### 4.2.3 Quasi-Sequential Stream Buffers
* Place a comparator at each location in the stream buffer.
* Performance enhancement:
* Instruction: removes 76% of cache misses, over a 4% improvement over the purely sequential buffer.
* Useful when code is skipped, e.g. when *then* or *else* clauses are skipped in *if* statements.
* Data: not reported for the single-way case, but quasi-sequential lookup also improves the performance of a four-way data stream buffer.
* The hardware for a few extra comparators is small for a single buffer.
* But it multiplies for multi-way buffers (lookup variant sketched below).
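The quasi-sequential variant as a small change to the same sketch: a comparator at every entry, so a hit below the head skips the intervening lines instead of flushing (hypothetical code, extending the `StreamBuffer` class above):

```python
class QuasiSequentialStreamBuffer(StreamBuffer):
    """Comparators at every entry: lines can be supplied slightly out of
    sequence, so short forward skips (then/else clauses) keep the stream."""

    def lookup(self, line):
        try:
            i = list(self.fifo).index(line)    # comparator at every entry
        except ValueError:
            self.restart(line)                 # a true miss still flushes
            return False
        for _ in range(i + 1):                 # drop skipped lines + the hit
            self.fifo.popleft()
        self.refill()
        return True
```

For example, skipping line 102 (a skipped *else* clause) no longer flushes the stream:

```python
q = QuasiSequentialStreamBuffer()
print([q.lookup(l) for l in [100, 101, 103, 104]])
# [False, True, True, True]
```
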
#### 4.2.4 Stream Buffer Performance vs. Cache Size

#### 4.2.6 Comparison to Classical Prefetch Performance
* Stream buffer approaches are much more feasible to implement.
* Stream buffers can take advantage of pipelined memory systems (unlike prefetch on miss or tagged prefetch for sequential reference patterns).
* Have lower latency requirements on prefetched data than the other prefetching techniques.
* The extra hardware required by a stream buffer is often comparable to the additional tag storage required by tagged prefetch.
### 4.3 Combining Long Lines and Stream Buffers
* Long cache lines and stream buffers can be used advantageously together.
* The strengths and weaknesses of **long lines** and **stream buffers** are complementary.
* Long lines:
* Benefit from spatial locality.
* But excessively long lines pollute the cache.
* Stream buffer:
* Does not pollute the cache, since data enters the cache only when requested on a miss.
* Needs at least one reference to successive data or the stream will be flushed without being used.
* Stream buffer alone performs better than long lines.
## 5. Conclusions
* Small miss caches (e.g., 2 to 5 entries) have been shown to be effective in reducing data cache conflict misses for direct-mapped caches in the range of 1K to 8K bytes.
* Effectively remove tight conflicts where misses alternate between several addresses that map to the same line in the cache.
* **Miss caches** are increasingly beneficial as line sizes increase and the percentage of conflict misses increases.
* % of conflict misses $\nearrow$, % of these misses removable by a miss cache also $\nearrow$.
* **Victim caches** are an improvement to miss caching that saves the victim of the cache miss instead of the target in a small associative cache.
* **Victim caches** are even more effective at removing **conflict misses** than **miss caches**.
* **Stream buffers** prefetch cache lines sequentially after a missed cache line. They store each line until it is requested by a cache miss (if ever).
* Avoid unnecessary pollution of the cache.
* Useful at reducing the number of **capacity** and **compulsory misses**.
* Take full advantage of the memory bandwidth available in **pipelined** memory systems for sequential references (tagged prefetch and prefetch on miss do not).
* Stream buffers can also tolerate longer memory system latencies
* They prefetch data ++much further in advance++ than other prefetch techniques.
* Can also compensate for instruction conflict misses, $\because$ those access patterns tend to be sequential as well.
* **Multi-way stream buffers** are a set of stream buffers.
* Replaced in LRU order.
* Useful for data references that contain **interleaved accesses** (e.g. array operations).

Last notes:
This study has concentrated on applying victim caches and stream buffers to first-level caches.
An interesting area for future work is the application of these techniques to second-level caches.