# Paper outline
## Title
## Abstract
### Context
#### Graph
- Graph processing is important for big-data applications
- Powers a wide range of applications (finance, networking, business logistics)
- Abstracting real-life networks -> same algorithms, different graphs
### Need
- Searches on large graphs are heavily memory-latency bound: many high-latency DRAM accesses
- Highly irregular nature of graph access patterns -> traditional prefetchers and caches perform poorly
- Difficult to parallelize
- Importance of single machine performance for common-case graph analytics vs distributed systems
### Task & objective
- We identify the bottlenecks
- How to address these bottlenecks by modifying the processor architecture or introducing new hardware and software components
### Findings & conclusions
## Main document
### Introduction
- Situate graph processing in modern day computational needs
- How is graph processing traditionally done?
- Why is graph processing on multi-purpose processors increasingly important?
#### What done
- Different workload characterizations to find bottlenecks
- Custom prefetchers based on graph data structure
- Graph reordering to optimize cache locality
- Domain-specialized branch predictors for graph processing
- Optimizing indirect memory references
### Background
- What are graphs? (real short)
- Which properties do "natural graphs" exhibit (that can be exploited) ?
- How are graphs stored? (CSR)
- What does a typical graph processing application/algorithm look like ?
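The CSR layout and the shape of a typical vertex-centric kernel can be sketched as follows (a minimal illustration in the spirit of the frameworks below, not code from any of them; struct and function names are ours):

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row (CSR): two flat arrays.
// offsets[v] .. offsets[v+1] delimit vertex v's slice of the neighbor array.
struct CSRGraph {
    std::vector<uint32_t> offsets;    // size |V| + 1
    std::vector<uint32_t> neighbors;  // size |E|, concatenated adjacency lists
};

// Typical pull-style kernel: for every vertex, read a value from each
// neighbor. The inner load data[neighbors[e]] is the irregular,
// data-dependent access that dominates graph-workload behavior.
std::vector<double> sum_neighbor_values(const CSRGraph& g,
                                        const std::vector<double>& data) {
    std::vector<double> out(g.offsets.size() - 1, 0.0);
    for (uint32_t v = 0; v + 1 < g.offsets.size(); ++v)
        for (uint32_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e)
            out[v] += data[g.neighbors[e]];  // indirect access
    return out;
}
```

Algorithms such as PageRank and BFS reduce to variations of this loop over different per-vertex data.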
### Typical benchmarks (*which order in paper??*)
> Beamer, S., Asanović, K., & Patterson, D. (2015). The GAP Benchmark Suite. 1–16. http://arxiv.org/abs/1508.03619
Graph processing frameworks:
- GAP by Scott Beamer
- Ligra
- Galois
### Graph processing workload characterization (bottlenecks)
> Basak, A., Li, S., Hu, X., Oh, S. M., Xie, X., Zhao, L., Jiang, X., & Xie, Y. (2019). Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 373–386. https://doi.org/10.1109/HPCA.2019.00051
> Beamer, S. (2016). Understanding and Improving Graph Algorithm Performance. In UC Berkeley Dissertations. http://www.scottbeamer.net/pubs/beamer-thesis.pdf
> Beamer, S., Asanović, K., & Patterson, D. (2015). GAIL: The graph algorithm iron law. Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 2015. https://doi.org/10.1145/2833179.2833187
> Beamer, S., Asanović, K., & Patterson, D. (2015). Locality exists in graph processing: Workload characterization on an ivy bridge server. Proceedings - 2015 IEEE International Symposium on Workload Characterization, IISWC 2015. https://doi.org/10.1109/IISWC.2015.12
> Samara, A., & Tuck, J. (2020). The Case for Domain-Specialized Branch Predictors for Graph-Processing. IEEE Computer Architecture Letters, 19(2), 101–104. https://doi.org/10.1109/LCA.2020.3005895
- Memory-level parallelism (MLP) bottlenecks
- Cache bottlenecks
- Explain dependency chain
- Impact of L1, L2, L3 size
- Power-law distribution:
    a small number of vertices is responsible for a high fraction of all connections in the graph; these hubs exhibit high reuse yet are sparsely distributed in memory -> underutilization of on-chip cache capacity
- Memory-hierarchy latency and bandwidth bottlenecks (Cavus et al.)
    - Many graph processing applications (and similar workloads) are bound by memory latency rather than bandwidth
    - A low level of MLP is a cause of underutilized memory bandwidth
    - Result: effectively sequential execution on an OoO processor in practice
    - Cause: the cache and DRAM architecture is optimized for sequential access, resulting in poor random-access performance
    - Existing hardware prefetchers increase MLP, but only for sequential access streams, not random requests
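The dependency chain behind this low MLP can be made concrete with a small sketch (illustrative code, our naming): each address in the traversal depends on the result of an earlier load, so cache misses serialize instead of overlapping.

```cpp
#include <cstdint>
#include <vector>

// Load-load dependency chain in a CSR traversal: three loads per access,
// each address derived from the previous load's result. The OoO core cannot
// issue prop[u] until neighbors[e] has returned, so DRAM latencies add up
// instead of overlapping (low memory-level parallelism).
double traverse_chain(const std::vector<uint32_t>& offsets,
                      const std::vector<uint32_t>& neighbors,
                      const std::vector<double>& prop, uint32_t v) {
    double acc = 0.0;
    uint32_t begin = offsets[v];      // load 1: index array
    uint32_t end = offsets[v + 1];
    for (uint32_t e = begin; e < end; ++e) {
        uint32_t u = neighbors[e];    // load 2: address depends on load 1
        acc += prop[u];               // load 3: address depends on load 2
    }
    return acc;
}
```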
### Solutions
#### Graph reordering
> Balaji, V., & Lucia, B. (2018). When is Graph Reordering an Optimization? Proceedings of the 2018 IEEE International Symposium on Workload Characterization, IISWC 2018, 203–214.
> Faldu, P., Diamond, J., & Grot, B. (2019). A Closer Look at Lightweight Graph Reordering. Proceedings of the 2019 IEEE International Symposium on Workload Characterization, IISWC 2019, 1–13. https://doi.org/10.1109/IISWC47752.2019.9041948
> Wei, H., Yu, J. X., Lu, C., & Lin, X. (2016). Speedup graph processing by graph ordering. Proceedings of the ACM SIGMOD International Conference on Management of Data, 26-June-20, 1813–1828. https://doi.org/10.1145/2882903.2915220
> Lakhotia, K., Singapura, S., Kannan, R., & Prasanna, V. (2018). ReCALL: Reordered Cache Aware Locality Based Graph Processing. Proceedings - 24th IEEE International Conference on High Performance Computing, HiPC 2017, 273–282. https://doi.org/10.1109/HiPC.2017.00039
> Arai, J., Shiokawa, H., Yamamuro, T., Onizuka, M., & Iwamura, S. (2016). Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analysis. Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016. https://doi.org/10.1109/IPDPS.2016.110
> Samara, A., & Tuck, J. (2020). The Case for Domain-Specialized Branch Predictors for Graph-Processing. IEEE Computer Architecture Letters, 19(2), 101–104. https://doi.org/10.1109/LCA.2020.3005895
> Maass, S., Min, C., Kashyap, S., Kang, W., Kumar, M., & Kim, T. (2017). MOSAIC: Processing a trillion-edge graph on a single machine. Proceedings of the 12th European Conference on Computer Systems, EuroSys 2017. https://doi.org/10.1145/3064176.3064191
From *A Closer Look at Lightweight Graph Reordering (2019)*:
- Why reorder graphs
- Graph algorithms show poor spatial locality
- Destroys MLP in OoO superscalar processors because of load-load dependency chains
- Exploit power law in graphs
- When does reordering achieve a speedup?
- Depends on structure of graph
- Depends on how lightweight reordering algorithm is (overhead in preprocessing may outweigh speedup in memory access)
- *Skew-aware* techniques
- *Sort Reordering*
- Sort
- Hub Sorting
- Hub Clustering
- Gorder (referenced by Faldu et al as effective but costly)
- Rabbit (Arai et al)
- Degree-Based Grouping (Faldu et al)
- Limitations of skew-aware techniques
- Offline vs. online reordering:
For some applications the graph structure is known beforehand and reused many times; there, a heavyweight reordering technique can amortize its preprocessing cost.
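A skew-aware reordering pass can be sketched as follows (a Sort-style full-degree ordering; Hub Sorting would restrict it to vertices above a degree threshold; function name is ours):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Skew-aware reordering sketch: assign the lowest new IDs to the
// highest-degree vertices, so the frequently reused "hub" property data
// becomes densely packed into few cache lines.
// Returns perm where perm[old_id] = new_id.
std::vector<uint32_t> hub_sort_permutation(const std::vector<uint32_t>& degree) {
    std::vector<uint32_t> order(degree.size());
    std::iota(order.begin(), order.end(), 0);      // order = old IDs
    std::stable_sort(order.begin(), order.end(),
                     [&](uint32_t a, uint32_t b) { return degree[a] > degree[b]; });
    std::vector<uint32_t> perm(degree.size());
    for (uint32_t new_id = 0; new_id < order.size(); ++new_id)
        perm[order[new_id]] = new_id;
    return perm;
}
```

The resulting permutation is applied to both the vertex-property arrays and the CSR arrays; this relabeling step (plus the degree sort itself) is the preprocessing overhead the techniques above try to keep low.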
#### Prefetchers
> Basak, A., Li, S., Hu, X., Oh, S. M., Xie, X., Zhao, L., Jiang, X., & Xie, Y. (2019). Analysis and optimization of the memory hierarchy for graph processing workloads. Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019. https://doi.org/10.1109/HPCA.2019.00051
> Ainsworth, S., & Jones, T. M. (2016). Graph prefetching using data structure knowledge. Proceedings of the International Conference on Supercomputing. https://doi.org/10.1145/2925426.2926254
> Cavus, M., Sendag, R., & Yi, J. J. (2019). Array Tracking Prefetcher for Indirect Accesses. Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018, 132–139. https://doi.org/10.1109/ICCD.2018.00028
> Yu, X., Hughes, C. J., Satish, N., & Devadas, S. (2015). IMP: Indirect memory prefetcher. Proceedings of the Annual International Symposium on Microarchitecture, MICRO. https://doi.org/10.1145/2830772.2830807
> Ainsworth, S., & Jones, T. M. (2018). An event-triggered programmable prefetcher for irregular workloads. ACM SIGPLAN Notices. https://doi.org/10.1145/3173162.3173189
> Green, O., Fox, J., Young, J., Shirako, J., & Bader, D. (2019). Performance impact of memory channels on sparse and irregular algorithms. 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms, IA3 2019. https://doi.org/10.1109/IA349570.2019.00016
> Malicevic, J., Lepers, B., & Zwaenepoel, W. (2017). Everything you always wanted to know about multicore graph processing but were afraid to ask. Proceedings of the 2017 USENIX Annual Technical Conference, USENIX ATC 2017.
- Software prefetchers
    - Insert instructions that prefetch data into the cache a tunable number of cycles (delta) before it is needed
- (+) No extra HW needed
- (+) Can detect complex patterns
- (-) No optimizations at runtime (only static optimizations)
- (-) Adds extra instructions that potentially have real data dependencies and can flood the CPU
- Hardware prefetchers
- Add extra specialized HW to prefetch data
- (+) Fast and does not consume CPU cycles
    - (-) Is either expensive or not very effective (e.g. irregular accesses are very hard to predict)
- (DROPLET, Basak et al)
- Hybrid prefetchers
- Has both a HW as well as a SW component
- Combine the low runtime cost of HW with the more advanced detection patterns SW can have
- Different possibilities of interaction between the components
        - Add tags in different places in the code that hint the HW how to prefetch
- ATP
- Event based with SW written by programmer
        - Paper: *An event-triggered programmable prefetcher for irregular workloads* — software-programmed, but uses extra HW
"Graph prefetching using data structure knowledge" introduces a configured prefetcher to improve performance for breadth-first searches and sequential iteration on the commonly used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it.
#### Segmenting/partitioning the graph
> Zhang, Y., Kiriansky, V., Mendis, C., Amarasinghe, S., & Zaharia, M. (2017). Making caches work for graph analytics. Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017, 293–302. https://doi.org/10.1109/BigData.2017.8257937
> Badawy, A.-H. A., & Yeung, D. (2017). Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis. IEEE Computer Architecture Letters, 16(2), 119–122. https://doi.org/10.1109/lca.2017.2695178
> Lakhotia, K., Kannan, R., & Prasanna, V. (2018). Accelerating PageRank using partition-centric processing. Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018, 427–440. https://www.usenix.org/system/files/conference/atc18/atc18-lakhotia.pdf
Making caches work for graph analytics:
- Divides vertex data into cache-sized segments and partitions the edges into subgraphs over these segments
- All random accesses are kept within the cache
- Intermediate updates from each segment are kept in a buffer -> eliminates random writes
- CSR segmenting improves cache utilization by working on one cache-sized segment of vertex data at a time
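The segmenting idea can be sketched as follows (heavily simplified, our naming; the real technique pre-builds per-segment subgraphs and merges buffered updates in a separate cache-friendly pass, whereas this sketch rescans the edge list per segment):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Segmenting sketch: split the source-vertex range into segments small
// enough to fit in cache; each pass touches only one segment of src_val,
// so every random read hits the cached segment. seg_size stands in for
// "number of vertex values that fit in the LLC".
std::vector<double> segmented_gather(
        const std::vector<std::pair<uint32_t, uint32_t>>& edges,  // (src, dst)
        const std::vector<double>& src_val,
        size_t n_dst, size_t seg_size) {
    std::vector<double> out(n_dst, 0.0);
    for (size_t lo = 0; lo < src_val.size(); lo += seg_size) {
        size_t hi = std::min(src_val.size(), lo + seg_size);
        // Only edges whose source lies in [lo, hi) are processed this pass.
        for (const auto& [s, d] : edges)
            if (s >= lo && s < hi) out[d] += src_val[s];
    }
    return out;
}
```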
Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis:
- Addresses the problem of optimizing graph-based programs for multicore processors -> modern processors are multicore (Moore's Law now delivers cores, not frequency)
- Characterize the importance of properly partitioning graphs among cores at multiple levels of the cache hierarchy
Accelerating PageRank using Partition-Centric Processing:
#### Compiler optimisations
> Kiriansky, V., Zhang, Y., & Amarasinghe, S. (2016). Optimizing Indirect Memory References with milk. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 299–312. https://doi.org/10.1145/2967938.2967948
- MILK (Kiriansky et al)
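The idea behind milk-style optimization of indirect memory references can be sketched as follows (not milk's actual OpenMP-like syntax, just the underlying defer-sort-apply transformation; names are ours):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Instead of applying indirect updates out[idx[i]] += v immediately
// (random writes), log (index, value) pairs, sort them by index, then
// apply in order: the write stream becomes nearly sequential.
std::vector<double> deferred_updates(const std::vector<uint32_t>& idx,
                                     const std::vector<double>& val,
                                     size_t n) {
    std::vector<std::pair<uint32_t, double>> log;
    log.reserve(idx.size());
    for (size_t i = 0; i < idx.size(); ++i)
        log.emplace_back(idx[i], val[i]);        // defer the random write
    std::sort(log.begin(), log.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    std::vector<double> out(n, 0.0);
    for (const auto& [k, v] : log) out[k] += v;  // locality-friendly applies
    return out;
}
```

The transformation trades extra work (logging and sorting) for locality, which pays off when the update array is much larger than the cache.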
#### Related work
- In-memory Graph processing (PIM)
- Cache blocking
- Vertex scheduling
- GPGPU and accelerators
## Conclusion
- Revisit the bottlenecks together with the proposed solutions
- Discuss the opportunity to combine the different techniques
- Open problems ?