# Paper notes Maxime

> Just a place for me to dump my thoughts and findings of every paper I read, so these notes don't get lost.
> [name=Maxime Cannoodt]

## A Closer Look at Lightweight Graph Reordering

> Faldu, P., Diamond, J., & Grot, B. (2019). A Closer Look at Lightweight Graph Reordering. 2019 IEEE International Symposium on Workload Characterization (IISWC), 1–13. https://doi.org/10.1109/IISWC47752.2019.9041948

### Abstract

- Common property of graphs is a **power-law distribution of vertex connectivity**
- Naive improvement: **lightweight skew-aware vertex reordering**, placing hot vertices adjacent in memory
- => Destroys the community structure within the graph, which may negate the performance gain
- Authors propose ***Degree-Based Grouping (DBG)*** as a novel lightweight technique
- Coarse reordering that preserves graph structure while reducing the cache footprint of hot vertices

### Introduction

- Graph analytics exhibit highly irregular memory access patterns, resulting in poor spatial and temporal locality => high cache miss rates
- **Natural** or **scale-free graphs**: graphs that exhibit the skewed power-law distribution
- Vertices usually require 4 to 16 bytes
- Modern cache block size is typically 64-128 bytes
- **Skew-aware techniques**
    - *Sort Reordering*: sort vertices by degree. But this destroys the spatial locality of connected vertices
    - All skew-aware techniques are generally effective at reducing LLC misses
    - Fine-grain reordering increases higher-level cache misses
    - => A balance must be struck between temporal and spatial locality
- **DBG** meets halfway by maintaining a level of spatial locality while also grouping hot vertices together

### Background and motivation

- Properties of natural graphs are explained
    - power-law distribution
    - community structure
- Short overview of the CSR structure
- Objectives of high-performance reordering:
    - O1 = low reorder time
    - O2 = high cache efficiency
    - O3 = structure preservation
- Limitations of skew-aware techniques

### New keywords:

- **Gorder**: a community-aware reordering technique. The paper references it as effective but costly, claiming it is impractical because of its high reorder time.

---

## The GAP Benchmark Suite

> Beamer, S., Asanović, K., & Patterson, D. (2015). The GAP Benchmark Suite. 1–16. http://arxiv.org/abs/1508.03619

### Abstract

- Helps standardize graph processing evaluation to more easily compare research efforts and quantify solutions
- Graph framework developers can demonstrate the generality of their model by implementing the benchmark's kernels
- Algorithm designers can demonstrate their contribution by evaluating on the input graphs and reference implementations of GAP
- Hardware platform designers and performance analysts can use the suite as a representative workload of graph processing

---

## Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads

> Basak, A., Li, S., Hu, X., Oh, S. M., Xie, X., Zhao, L., Jiang, X., & Xie, Y. (2019). Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 373–386. https://doi.org/10.1109/HPCA.2019.00051

### Abstract

- "performance is bounded by the inefficiencies in the memory subsystem for single-machine in-memory graph analytics"
- "We find that the **load-load dependency chains** involving different application data types form the primary bottleneck in achieving high memory-level parallelism." (see the sketch after this list)
- The shared L3 cache shows higher performance sensitivity
- This paper proposes the **DROPLET** prefetcher
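A minimal sketch of such a load-load dependency chain, assuming a CSR layout; the array names (`offsets`, `neighbors`, `property`) are my own, not the paper's. Each load's address depends on the value produced by the previous load, so the loads serialize and cap the achievable MLP:

```cpp
#include <cstdint>
#include <vector>

// Sum the property values of v's neighbors in a CSR graph.
// Illustrative only: three dependent loads per edge form the chain
// structure -> structure -> property; property data sits at the end
// of the chain, i.e. it is the "consumer" in the paper's terms.
double sum_neighbor_properties(uint32_t v,
                               const std::vector<uint32_t>& offsets,
                               const std::vector<uint32_t>& neighbors,
                               const std::vector<double>& property) {
    double sum = 0.0;
    for (uint32_t e = offsets[v]; e < offsets[v + 1]; ++e) { // load 1: edge range of v
        uint32_t u = neighbors[e]; // load 2: address depends on load 1
        sum += property[u];        // load 3: random access, address depends on load 2
    }
    return sum;
}
```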
### Introduction

- Importance of single-machine performance for common-case graph analytics vs distributed systems:
    - No need for partitioning between systems
    - Cheap high-density memory enables the entire graph to remain in-memory
    - Less programming effort compared to distributed systems

### Background

- "Hardware prefetchers located in the cache hierarchy monitor simple access patterns"

### Characterization

Core and MLP analysis:

- **Observation 1**: "instruction window size is not the factor impeding MLP"
- **Observation 2**: "Load-load dependency chains prevent achieving high MLP"
- **Observation 3**: "Graph property data is the consumer in a dependency chain"

Cache hierarchy analysis:

- **Observation 4**: "The private L2 cache shows negligible performance sensitivity, whereas the shared LLC shows higher performance sensitivity."
- **Observation 5**: Property data is the primary beneficiary of the LLC capacity.
- **Observation 6**: Graph structure cachelines have the largest reuse distance among all the data types; graph property cachelines have a larger reuse distance than the L2 cache can service.

### New keywords:

- **VLDP**: Variable Length Delta Prefetcher, upon which DROPLET shows improved performance

---

## Optimizing Indirect Memory References with milk

> Kiriansky, V., Zhang, Y., & Amarasinghe, S. (2016). Optimizing Indirect Memory References with milk. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 299–312. https://doi.org/10.1145/2967938.2967948

### Abstract

- DRAM bandwidth is increasing much more slowly than the number of CPU cores
- DRAM latency has been virtually stagnant
- "*Traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.*"

### Introduction

- Their compiler transformations dynamically partition the index values in `d` to capture all available locality in looped indirect memory accesses
- "MILK's **DRAM-conscious Clustering** transforms loops into three logical phases:" (a sketch of these phases follows at the end of this section)
    - **Collection**: dynamically generated indices of indirect accesses are collected in cache-local buffers
    - **Distribution**: indices are written to DRAM-resident partitions using efficient sequential DRAM accesses
    - **Delivery**: each partition's deferred updates are read from DRAM and processed, along with dependent statements
- References targeting the same cache line are grouped to maximize locality, and can be processed by a single virtual processor to eliminate synchronization
- ***USES GAP for results!!!***

### Background

- The memory hierarchy is optimized for sequential accesses, often at the expense of random accesses
- Lots of hardware details to prove why this is
- Consequence: a lot of graph processing applications (and similar) are bound by memory latency rather than bandwidth!
- A low level of MLP is also a cause of underutilized memory bandwidth
- Result of this: very little opportunity for OoO execution... -> serialized execution
- Hardware prefetchers increase MLP, but these only exist for sequential accesses, not random requests
- More reasons why architectural improvements for random accesses are difficult
- Their conclusion: **MILK is more effective than hardware-level improvements**
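A hand-written sketch of the three phases for the indirect update pattern `count[d[i]]++`; this is my illustration of the idea, not MILK's actual compiler output, and the names and partition count are invented:

```cpp
#include <cstdint>
#include <vector>

// Deferred, clustered version of `for (i) count[d[i]]++`.
// Assumes count is non-empty and every index in d is < count.size().
void clustered_increment(const std::vector<uint32_t>& d,
                         std::vector<uint64_t>& count,
                         size_t num_partitions = 256) {
    // Each partition covers one contiguous slice of `count`.
    const size_t slice = (count.size() + num_partitions - 1) / num_partitions;

    // Collection: indirect indices are buffered instead of dereferenced.
    // Distribution: appending to a buffer is a sequential write, so the
    // formerly random traffic becomes streaming writes.
    std::vector<std::vector<uint32_t>> buffers(num_partitions);
    for (uint32_t idx : d)
        buffers[idx / slice].push_back(idx);

    // Delivery: each buffer's deferred updates touch only one slice of
    // `count`, which fits in cache while that buffer is processed.
    for (const auto& buf : buffers)
        for (uint32_t idx : buf)
            ++count[idx];
}
```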
---

## Array Tracking Prefetcher for Indirect Accesses

> Cavus, M., Sendag, R., & Yi, J. J. (2019). Array Tracking Prefetcher for Indirect Accesses. Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018, 132–139. https://doi.org/10.1109/ICCD.2018.00028

### Abstract

- Proposes the **Array Tracking Prefetcher (ATP)**
- A combination of hardware and software prefetching

### Background and introduction

- Indirect memory accesses appear in data structures that are implemented as nested arrays, e.g. `A[B[i]]`
- Entries of array B can easily be prefetched by a hardware stream prefetcher because accesses to B are linear
- No spatial locality in accesses to A! -> **impossible to hide the memory latency of accesses to A with a simple stream prefetcher**
- Software prefetching brings overhead (see the sketch below):
    - Extra prefetch instructions for address calculation, bounds checking, etc.
    - The benefit is limited by the lack of run-time information needed for optimal placement of prefetch instructions
- Hardware prefetchers for indirect memory accesses must be complex in order to capture irregular access patterns
- **This paper**: combine the strengths of hardware and software prefetching into a hybrid solution
- "Providing hints to the hardware mechanism is better than a pure hardware mechanism because the hardware mechanism can configure itself based on the behavior of the software."
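For contrast with ATP's hybrid approach, here is a pure software-prefetch version of the `A[B[i]]` pattern using the generic GCC/Clang `__builtin_prefetch` intrinsic. This is not ATP itself, just a baseline that makes both overheads from the list above concrete; the prefetch distance of 16 is a static guess, which is exactly the missing run-time information the paper points out:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pure software prefetching for sum += A[B[i]].
uint64_t sum_indirect(const std::vector<uint32_t>& B,
                      const std::vector<uint64_t>& A) {
    constexpr size_t DIST = 16; // prefetch distance: a static guess; the
                                // optimal value depends on run-time behavior
    uint64_t sum = 0;
    for (size_t i = 0; i < B.size(); ++i) {
        if (i + DIST < B.size()) {          // overhead: extra bounds check
            // overhead: B[i + DIST] must itself be loaded first to
            // compute the future address into A
            __builtin_prefetch(&A[B[i + DIST]], /*rw=*/0, /*locality=*/1);
        }
        sum += A[B[i]];
    }
    return sum;
}
```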