# Paper notes Maxime
> Just a place for me to dump my thoughts and findings of every paper I read, so these notes don't get lost.
> [name=Maxime Cannoodt]
## A Closer Look at Lightweight Graph Reordering
> Faldu, P., Diamond, J., & Grot, B. (2019). A Closer Look at Lightweight Graph Reordering. 2019 IEEE International Symposium on Workload Characterization (IISWC), 1–13. https://doi.org/10.1109/IISWC47752.2019.9041948
### Abstract
- Common property of graphs is **Power-law distribution of vertex connectivity**
- Naive improvement: **lightweight skew-aware vertex reordering** placing hot vertices adjacent in memory
- => Destroys community structure within the graph, which may negate the performance gain
- Authors propose ***Degree-Based Grouping (DBG)*** as a novel lightweight technique
- Coarse reordering that preserves graph structure while reducing cache footprint of hot vertices
### Introduction
- Graph analytics workloads exhibit highly irregular memory access patterns, resulting in poor spatial and temporal locality => high cache miss rates
- **Natural** or **scale-free graphs**: graphs that exhibit the skewed power law distribution
- Vertices usually require 4 to 16 bytes
- Modern cache block size is typically 64-128 bytes
- **Skew-aware techniques**
- *Sort Reordering*: Sort vertices by degree. But this destroys spatial locality of connected vertices
- All skew-aware techniques are generally effective at reducing LLC misses
- Fine-grain reordering increases misses in the higher-level (L1/L2) caches
- => Balance must be struck between temporal and spatial locality
- **DBG** strikes this balance by maintaining a level of spatial locality while also grouping hot vertices together
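A minimal sketch of the kind of coarse, structure-preserving reordering DBG performs (the power-of-two degree classes and descending order are my assumptions; the paper defines its own group boundaries):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Assign each vertex a coarse degree class (powers of two here), then
// stable-sort by class: hot vertices move to the front, but the original
// relative order -- and hence community structure -- survives within a group.
std::vector<uint32_t> dbg_order(const std::vector<uint32_t>& degree) {
    std::vector<uint32_t> perm(degree.size());
    std::iota(perm.begin(), perm.end(), 0);  // identity: old vertex ids
    auto cls = [&](uint32_t v) {
        return static_cast<int>(std::log2(degree[v] + 1));  // coarse class
    };
    std::stable_sort(perm.begin(), perm.end(),
                     [&](uint32_t a, uint32_t b) { return cls(a) > cls(b); });
    return perm;  // perm[i] = old id of the vertex placed at new position i
}
```

Compare with full Sort Reordering, which orders strictly by degree and therefore scrambles neighbors that used to sit next to each other; the stable sort over coarse classes is what limits that damage.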
### Background and motivation
- Properties of natural graphs are explained
    - power-law distribution
- community structure
- Short overview of CSR structure
- Objectives of high performance reordering:
- O1 = low reorder time
- O2 = high cache efficiency
- O3 = structure preservation
- Limitations of skew aware techniques
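For reference, the CSR layout the paper summarizes can be sketched like this (names and the edge-list builder are illustrative, not from the paper):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Compressed Sparse Row: offsets[v] .. offsets[v+1] delimit vertex v's slice
// of the concatenated adjacency array `neighbors`.
struct CSR {
    std::vector<uint32_t> offsets;    // |V| + 1 entries
    std::vector<uint32_t> neighbors;  // |E| entries
};

CSR build_csr(uint32_t n, const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
    CSR g;
    g.offsets.assign(n + 1, 0);
    for (const auto& e : edges) g.offsets[e.first + 1]++;               // out-degrees
    for (uint32_t v = 0; v < n; ++v) g.offsets[v + 1] += g.offsets[v];  // prefix sum
    g.neighbors.resize(edges.size());
    std::vector<uint32_t> cursor(g.offsets.begin(), g.offsets.end() - 1);
    for (const auto& e : edges) g.neighbors[cursor[e.first]++] = e.second;
    return g;
}
```

This also makes the reordering objectives concrete: a reordering permutes vertex ids, so both `offsets`/`neighbors` must be rebuilt (O1: reorder time), while the new id assignment decides which vertices share cache lines (O2) and whether communities stay contiguous (O3).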
### New keywords:
- **Gorder**: a community-aware reordering technique. The paper references it as effective but costly, claiming it is impractical because of its high reorder time.
---
## The GAP Benchmark Suite
> Beamer, S., Asanović, K., & Patterson, D. (2015). The GAP Benchmark Suite. 1–16. http://arxiv.org/abs/1508.03619
### Abstract
- help standardize graph processing evaluation to more easily compare research efforts and quantify solutions
- graph framework developers can demonstrate the generality of their model by implementing the benchmark's kernels
- Algorithm designers can demonstrate their contribution using the input graphs and reference implementations of GAP
- Hardware platform designers and performance analysts can use the suite as a representative workload of graph processing
---
## Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
> Basak, A., Li, S., Hu, X., Oh, S. M., Xie, X., Zhao, L., Jiang, X., & Xie, Y. (2019). Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 373–386. https://doi.org/10.1109/HPCA.2019.00051
### Abstract
- "performance is bounded by the inefficiencies in the memory subsystem for single-machine in-memory graph analytics"
- "We find that the **load-load dependency chains** involving different application data types form the primary bottleneck in achieving high memory-level parallelism."
- Shared L3 cache shows higher performance sensitivity
- This paper proposes the **DROPLET** prefetcher
### Introduction
- Importance of single-machine performance for common-case graph analytics vs distributed systems
- No need for partitioning between systems
- Cheap high density memory enables the entire graph to remain in-memory
- Less programming efforts compared to distributed systems
### Background
- "Hardware prefetchers located in the cache hierarchy monitor simple access patterns"
### Characterization
Core and MLP analysis:
- **Observation 1**: "instruction window size is not the factor impeding MLP"
- **Observation 2**: "Load-load dependency chains prevent achieving high MLP"
- **Observation 3**: "Graph property data is the consumer in a dependency chain"
Cache Hierarchy analysis:
- **Observation 4**: "The private L2 cache shows negligible performance sensitivity, whereas the shared LLC shows higher performance sensitivity."
- **Observation 5**: Property data is the primary beneficiary of the LLC capacity.
- **Observation 6**: Graph structure cache lines have the largest reuse distance among all the data types; graph property cache lines have reuse distances too large to be serviced by the L2 cache.
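The dependency chain behind Observations 2 and 3 shows up in any CSR traversal. In a pull-style loop (illustrative names, not the paper's code), the property load cannot issue until the structure load that produces its address completes:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// acc[v] = sum of `property` over v's neighbors in a CSR graph. The load of
// property[neighbors[e]] depends on the load of neighbors[e] (which in turn
// depends on offsets[v]): a load-load chain the core cannot overlap, so MLP
// stays low even with a large instruction window.
std::vector<float> pull_sum(const std::vector<uint32_t>& offsets,
                            const std::vector<uint32_t>& neighbors,
                            const std::vector<float>& property) {
    std::vector<float> acc(offsets.size() - 1, 0.0f);
    for (uint32_t v = 0; v + 1 < offsets.size(); ++v)
        for (uint32_t e = offsets[v]; e < offsets[v + 1]; ++e)
            acc[v] += property[neighbors[e]];  // consumer load in the chain
    return acc;
}
```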
### New keywords:
- **VLDP**: Variable Length Delta Prefetcher, upon which DROPLET shows improved performance
---
## Optimizing Indirect Memory References with milk
> Kiriansky, V., Zhang, Y., & Amarasinghe, S. (2016). Optimizing Indirect Memory References with milk. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 299–312. https://doi.org/10.1145/2967938.2967948
### Abstract
- DRAM bandwidth is growing much more slowly than CPU core count
- DRAM latency has been virtually stagnant
- "*Traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.*"
### Introduction
- Our compiler transformations dynamically partition the index values in d to capture all available locality in looped memory accesses
- "MILK's **DRAM-conscious Clustering** transforms loops into three logical phases:"
- **Collection**: dynamically generated indices of indirect accesses are collected in cache local buffers
    - **Distribution**: Indices are written to DRAM-resident partitions using efficient sequential DRAM accesses
- **Delivery**: Each partition's deferred updates are read from DRAM and processed, along with dependent statements.
- References targeting the same cache line are grouped to maximize locality, and each group can be processed by a single virtual processor to eliminate synchronization
- ***USES GAP for results!!!***
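A hedged sketch of the three phases for a scatter loop `hist[idx[i]] += 1` (the region size and bucketing scheme are my assumptions, not MILK's actual planner, which works at cache-line and DRAM-page granularity):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Deferred, clustered version of `for i: hist[idx[i]] += 1`.
std::vector<uint32_t> clustered_histogram(const std::vector<uint32_t>& idx,
                                          std::size_t bins) {
    const std::size_t REGION = 16;  // indices per partition: an assumption
    std::size_t nparts = (bins + REGION - 1) / REGION;
    // Collection + Distribution: bucket each generated index by the region of
    // the target array it falls in, so nearby updates end up together
    // (sequential appends stand in for MILK's streaming DRAM writes).
    std::vector<std::vector<uint32_t>> parts(nparts);
    for (uint32_t i : idx) parts[i / REGION].push_back(i);
    // Delivery: apply each partition's deferred updates; every partition
    // touches only a small, cache-resident slice of `hist`.
    std::vector<uint32_t> hist(bins, 0);
    for (const auto& p : parts)
        for (uint32_t i : p) hist[i]++;
    return hist;
}
```

The payoff is that the random accesses in Delivery hit a working set of `REGION` counters instead of the whole array, trading extra sequential traffic for far fewer random DRAM accesses.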
### Background
- The memory hierarchy is optimized for sequential accesses, often at the expense of random accesses
    - A lot of hardware details proving why this is the case
- Consequence: a lot of graph processing applications (and similar) are bound by memory latency rather than bandwidth!
- A low level of MLP is also a cause of underutilized memory bandwidth
- Result of this: very little opportunity for OoO execution... -> serialized execution
- Hardware prefetchers increase MLP, but these only exist for sequential accesses, not random requests
- More reasons why architectural improvements for random accesses are difficult
- Their conclusion: **MILK is more effective than hardware level improvements**
---
## Array Tracking Prefetcher for Indirect Accesses
> Cavus, M., Sendag, R., & Yi, J. J. (2019). Array Tracking Prefetcher for Indirect Accesses. Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018, 132–139. https://doi.org/10.1109/ICCD.2018.00028
### Abstract
- Proposes **Array Tracking Prefetcher (ATP)**
- Combination of hardware and software
### Background and introduction
- Indirect memory accesses appear in data structures that are implemented as nested arrays, e.g. `A[B[i]]`
- Entries of array B can easily be prefetched by a hardware stream prefetcher because accesses to B are linear
- No spatial locality in accesses to A! -> **impossible to hide memory latency to A through prefetching**
- Software prefetching brings overhead:
- Prefetching instructions for address calculation, border checking, etc.
    - Benefit is limited by the lack of run-time information needed for optimal placement of prefetch instructions
- Hardware prefetchers for indirect memory accesses must be complex in order to capture irregular access patterns
- **This paper**: combine the strengths of hardware and software prefetching for a hybrid solution
- "Providing hints to the hardware mechanism is better than a pure hardware mechanism because the hardware mechanism can configure itself based on the behavior of the software."
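A minimal sketch of the software side of this trade-off: a hand-inserted prefetch for `A[B[i]]` using the GCC/Clang builtin `__builtin_prefetch`. The distance `DIST` is exactly the run-time-dependent parameter the paper says pure software struggles to pick, and which a hint-configured hardware mechanism like ATP can adapt:

```cpp
#include <cassert>
#include <cstddef>

// Sum A[B[i]]: B is streamed (a hardware stream prefetcher covers it), but A
// is hit at random, so we prefetch the element needed DIST iterations ahead.
long indirect_sum(const long* A, const int* B, std::size_t n) {
    const std::size_t DIST = 16;  // prefetch distance: a tuning assumption
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + DIST < n)
            __builtin_prefetch(&A[B[i + DIST]], /*rw=*/0, /*locality=*/0);
        sum += A[B[i]];
    }
    return sum;
}
```

Note the overheads the paper lists are visible here: the extra address calculation and the bounds check run on every iteration, whether or not the prefetch helps.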