# Persistent Caching
###### tags: `GPU`
###### members: @林孟賢 @席致寧
[paper](https://www.overleaf.com/3161447213kchpscspbmxk)
[proposal](https://docs.google.com/document/d/1LPxwv7mzae4n1qHnJnDLL_UtjeJf0ngaChi5H_2DfHM/edit?usp=sharing)
[report](https://drive.google.com/file/d/152gvtqASZaRv1jkt5Guig8FHHtJiFg0R/view?usp=sharing)
[spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0)
[slide](https://docs.google.com/presentation/d/1Thz_tu13iYqtAW99xmj1XgWEpikkIbobYvLYku2fkGw/edit?usp=sharing)
[oral test](https://docs.google.com/presentation/d/12lGzoFUw9xnw112_kMmgaDO5RW0ygAmvc0cVbPRxzM4/edit?usp=sharing)
[reference paper](https://hackmd.io/867BaaD-QIe1ytWs7FXYpQ)
[DAC paper bullet point](https://hackmd.io/ES09vnUQT-KgcvPX5h2fVw?both)
## Idea




* spare register ([reference paper](https://ieeexplore.ieee.org/abstract/document/6835970))
    * Due to resource constraints, there may be unallocated registers in a GPU.
    * If a thread warp has terminated, the registers allocated to that warp are free.
* register file access
    * newer versions of gpgpu-sim contain an **operand collector**, which includes a crossbar and can access every register file bank
* compiler assist
    * a programming interface similar to OpenMP
    * requires the user to mark the independent for loop
    * requires the compiler to compute the number of registers used in each iteration
    * **reaching-definition analysis** can identify **loop-invariant variables**
* runtime
    * at runtime we need a table that records each thread's index in the for loop
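The transformation the modified PTX listings below apply can be illustrated on the host side: the pointer-increment form carries loop state in induction registers, while the index form recomputes each address from a single loop index, which is all the runtime table has to track per thread. A minimal C++ sketch (names and data are illustrative, not taken from the benchmarks):

```cpp
#include <cassert>
#include <vector>

// Pointer-increment form: loop state lives in an induction pointer
// that must be updated every iteration (like the add.s64 on %rd26).
float sum_ptr_form(const std::vector<float>& v, int start, int end) {
    float acc = 0.0f;
    const float* p = v.data() + start;   // induction pointer
    for (int i = start; i < end; ++i) {
        acc += *p;
        ++p;                              // per-iteration pointer update
    }
    return acc;
}

// Index form: the address is recomputed from the loop index alone,
// so a hardware table only needs (start, end, index) per thread.
float sum_index_form(const std::vector<float>& v, int start, int end) {
    float acc = 0.0f;
    for (int i = start; i < end; ++i)
        acc += v[i];                      // address derived from i
    return acc;
}
```

Both forms compute the same sum; the second is the shape the hardware table can drive directly.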
### Compare to Software Solution
additional space required by the software solution:
* `length[]`
    * records the number of iterations each thread in a warp needs
    * additional space : O(V)
* `start[]`
    * records the starting position of each thread in a warp
    * additional space : O(V)
* `origin_thread[]`
    * records which thread each edge comes from
    * additional space : O(E)

additional space required by the hardware solution:
* `index_table`
    * records the start and end of each thread in a warp
    * additional space : O(V)

1. The hardware solution generates the table at run time; once a warp finishes its computation the space can be released, so the actual space used is much smaller.
2. The software solution incurs preprocessing-kernel overhead.
3. The software solution must first store its data (start, length, ...) in global memory, whereas the hardware solution extends a hardware component to hold the table generated at run time.
4. The software solution performs more memory accesses.
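As a host-side illustration of the software solution's preprocessing kernel, the sketch below derives the three helper arrays listed above from CSR row offsets. The array names follow the notes; `row_offsets` and the struct are assumptions made for the sketch:

```cpp
#include <cassert>
#include <vector>

// The three arrays the software solution must precompute and keep
// in global memory.  Names follow the notes; layout is illustrative.
struct SoftwareTables {
    std::vector<int> length;         // iterations per thread, O(V)
    std::vector<int> start;          // start offset per thread, O(V)
    std::vector<int> origin_thread;  // owning thread of each edge, O(E)
};

// Build the tables from CSR row offsets (row_offsets has V+1 entries).
SoftwareTables build_tables(const std::vector<int>& row_offsets) {
    int V = static_cast<int>(row_offsets.size()) - 1;
    SoftwareTables t;
    t.length.resize(V);
    t.start.resize(V);
    t.origin_thread.resize(row_offsets.back());
    for (int v = 0; v < V; ++v) {
        t.start[v] = row_offsets[v];
        t.length[v] = row_offsets[v + 1] - row_offsets[v];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            t.origin_thread[e] = v;   // edge e belongs to thread v
    }
    return t;
}
```

This is the extra pass (and the extra global-memory traffic) that the hardware `index_table` avoids.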
## Related paper list
* hardware
    * Graph-Waving architecture: Efficient execution of graph applications on GPUs
        * maps each node to a warp instead of a thread
        * requires changes to the user's code
        * needs to eliminate duplicate scalar instructions
    * Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
        * dynamically launches "thread blocks" instead of dynamically launching "kernels", alleviating kernel-launch overhead and increasing parallelism
        * does not fully utilize the SIMD lanes when a node's degree is small
    * GraphPEG: Accelerating Graph Processing on GPUs
        * uses a "hardware function" to collect the needed edges into "shared memory", then processes the collected edges
        * solves all of the aforementioned problems, but adds a lot of extra hardware
    * Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists
    * Active Thread Compaction for GPU Path Tracing
* software
    * gunrock
    * Accelerating CUDA Graph Algorithms at Maximum Warp
        * [paper](https://stanford-ppl.github.io/website/papers/ppopp070a-hong.pdf)
    * Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths
        * [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6877269)
    * Scalable GPU Graph Traversal
        * [paper](https://mgarland.org/files/papers/gpubfs.pdf)
* benchmark
    * Pannotia: Understanding Irregular GPGPU Graph Applications
        * [paper](https://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf)
## problems with the equivalent software solution
* atomic
## solve active/inactive

## graph
[pannotia](https://github.com/pannotia/pannotia/wiki)
[datasets](https://www.cc.gatech.edu/dimacs10/archive/coauthor.shtml)
## PTX
### SPMV
* original code:
```
cvta.to.global.u64 %rd2, %rd13;
cvta.to.global.u64 %rd17, %rd12;
cvta.to.global.u64 %rd18, %rd10;
mul.wide.s32 %rd19, %r15, 4; #move
add.s64 %rd26, %rd17, %rd19; #move
add.s64 %rd25, %rd18, %rd19; #move
BB0_3:
.pragma "nounroll";
ld.global.u32 %r14, [%rd26]; # PC=0x100
mul.wide.s32 %rd20, %r14, 4;
add.s64 %rd21, %rd2, %rd20;
ld.global.f32 %f4, [%rd21];
ld.global.f32 %f5, [%rd25];
mul.f32 %f6, %f5, %f4;
atom.shared.add.f32 %f7, [%r3], %f6;
add.s64 %rd26, %rd26, 4;
add.s64 %rd25, %rd25, 4;
add.s32 %r15, %r15, 1;
setp.lt.s32 %p3, %r15, %r5;
@%p3 bra BB0_3; # PC=0x158
```
* modified code
```
cvta.to.global.u64 %rd2, %rd13;
cvta.to.global.u64 %rd17, %rd12;
cvta.to.global.u64 %rd18, %rd10; # PC=0x0e0
add.s32 %r1, %r5, %r15; # PC=0x0e8
BB0_3:
.pragma "nounroll";
mul.wide.s32 %rd19, %r15, 4; # PC=0x0f0
add.s64 %rd26, %rd17, %rd19;
add.s64 %rd25, %rd18, %rd19;
ld.global.u32 %r14, [%rd26];
mul.wide.s32 %rd20, %r14, 4;
add.s64 %rd21, %rd2, %rd20;
ld.global.f32 %f4, [%rd21];
ld.global.f32 %f5, [%rd25];
mul.f32 %f6, %f5, %f4;
atom.shared.add.f32 %f7, [%r3], %f6;
@%p3 bra BB0_3; # PC=0x140
```
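For reference, the loop above is the inner loop of a CSR sparse matrix-vector product. A hedged single-threaded C++ reconstruction (variable names are assumptions, and the `atom.shared.add` accumulation is replaced by a plain add for the sketch):

```cpp
#include <cassert>
#include <vector>

// One row of CSR SpMV: acc = sum over the row's edges of
// val[e] * x[col[e]].  Mirrors the BB0_3 loop: load col[e],
// load x[col[e]], load val[e], multiply, accumulate.
float spmv_row(const std::vector<int>& col,
               const std::vector<float>& val,
               const std::vector<float>& x,
               int start, int end) {
    float acc = 0.0f;
    for (int e = start; e < end; ++e)
        acc += val[e] * x[col[e]];
    return acc;
}
```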
### Pagerank
* original code :
```c=
#pragma unroll 1
for (int edge = start; edge < end; edge++) {
int nid = col[edge];
atomicAdd(&page_rank2[nid], page_rank1[tid] / (float)(end - start));
}
```
```=
mul.wide.s32 %rd16, %r16, 4;
add.s64 %rd19, %rd14, %rd16; //0x0e8 create table
BB0_5:
.pragma "nounroll";
ld.global.u32 %r14, [%rd19]; //0x0f0 col[edge]
mul.wide.s32 %rd17, %r14, 4;
add.s64 %rd18, %rd2, %rd17;
ld.global.f32 %f2, [%rd3];
div.rn.f32 %f3, %f2, %f1;
atom.global.add.f32 %f4, [%rd18], %f3;
add.s64 %rd19, %rd19, 4; //0x120 edge++
add.s32 %r16, %r16, 1; //0x128 loop counter
setp.lt.s32 %p4, %r16, %r15; //0x130 loop counter < iteration
@%p4 bra BB0_5; //0x138 branch
```
* when execution reaches the instruction at 0x0e8, switch the execution mode (the table has not been built yet) :
```=
sub.s32 %r13, %r15, %r16; //PC = 0x0d0
cvt.rn.f32.s32%f1, %r13; //PC = 0x0d8
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd16, %r16, 4; //PC = 0x0e0 r16 : index, taken from the table
add.s64 %rd19, %rd14, %rd16; //PC = 0x0e8 rd14 : base address
ld.global.u32 %r14, [%rd19];
mul.wide.s32 %rd17, %r14, 4;
add.s64 %rd18, %rd2, %rd17;
ld.global.f32 %f2, [%rd3];
div.rn.f32 %f3, %f2, %f1;
atom.global.add.f32 %f4, [%rd18], %f3; //PC = 0x118
add.s32 %r16, %r16, 1;
setp.lt.s32 %p4, %r16, %r15;
@%p4 bra BB0_5; //PC = 0x130
```
* %r16 : start, %r15 : end, build the table
* fetch the value of rd19 according to the table
* information gpgpu-sim needs :
    * start index
    * end index
    * last codeblock PC (line 11)
    * exit PC (line 15)
```=
//PC = 0x0c8
sub.s32 %r13, %r15, %r16; //PC = 0x0d0
cvt.rn.f32.s32%f1, %r13; //PC = 0x0d8 create table
add.s32 %r1, %r15, %r16; //PC = 0x0e0 set index
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd16, %r16, 4; //PC = 0x0e8 r16 : index, taken from the table
add.s64 %rd19, %rd14, %rd16; //PC = 0x0f0 rd14 : base address
ld.global.u32 %r14, [%rd19]; //PC = 0x0f8 rd19 redirect???
mul.wide.s32 %rd17, %r14, 4; //PC = 0x100
add.s64 %rd18, %rd2, %rd17; //PC = 0x108
ld.global.f32 %f2, [%rd3]; //PC = 0x110 rd3 redirect???
div.rn.f32 %f3, %f2, %f1; //PC = 0x118
atom.global.add.f32 %f4, [%rd18], %f3; //PC = 0x120
@%p4 bra BB0_5; //PC = 0x128
//PC = 0x130
```
read reg:
write reg: rd16, rd19, r14, rd17, rd18, f2, f3, f4
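The table-driven control flow sketched above can be modeled behaviourally: per lane, the table holds the current index, the end bound, the loop's last PC, and the exit PC, replacing the deleted add/setp/bra sequence. A toy single-lane model (the 8-byte instruction pitch and field names are assumptions, not the gpgpu-sim implementation):

```cpp
#include <cassert>

// Per-lane entry of the hypothetical hardware index table.
struct IndexTableEntry {
    int index;     // current loop index (start on table creation)
    int end;       // loop bound
    int last_pc;   // PC of the last instruction in the loop body
    int exit_pc;   // PC of the first instruction after the loop
};

// Returns the next PC for a lane that just executed `pc`.  At the
// body's last PC the table advances the index and decides whether
// to branch back, standing in for the removed add/setp/bra.
int next_pc(IndexTableEntry& e, int pc, int loop_head_pc) {
    if (pc != e.last_pc)
        return pc + 8;                 // assumed fixed 8-byte encoding
    ++e.index;                         // hardware-side index update
    return (e.index < e.end) ? loop_head_pc : e.exit_pc;
}
```

When the index reaches `end`, the entry can be destroyed and its space released, which is the footprint advantage noted in the comparison above.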
### Pagerank_spmv
```
cvta.to.global.u64 %rd2, %rd12;
cvta.to.global.u64 %rd17, %rd10;
cvta.to.global.u64 %rd18, %rd11;
mul.wide.s32 %rd19, %r15, 4; #move
add.s64 %rd26, %rd17, %rd19; #move
add.s64 %rd25, %rd18, %rd19; #move
add.s32 %r1, %r4, %r15; #add
BB2_3:
.pragma "nounroll";
ld.global.u32 %r14, [%rd26]; #PC=0x4e0
mul.wide.s32 %rd20, %r14, 4; #PC=0x4e8
add.s64 %rd21, %rd2, %rd20; #PC=0x4f0
ld.global.f32 %f4, [%rd21]; #PC=0x4f8
ld.global.f32 %f5, [%rd25]; #PC=0x500
mul.f32 %f6, %f5, %f4; #PC=0x508
atom.shared.add.f32 %f7, [%r5], %f6; #PC=0x510
add.s64 %rd26, %rd26, 4; #PC=0x518 delete
add.s64 %rd25, %rd25, 4; #PC=0x520 delete
add.s32 %r15, %r15, 1; #PC=0x528 delete
setp.lt.s32 %p3, %r15, %r4; #PC=0x530 delete
@%p3 bra BB2_3; #PC=0x538
ld.shared.f32 %f10, [%r5];
BB2_5:
cvta.to.global.u64 %rd22, %rd13;
shl.b64 %rd23, %rd1, 2;
add.s64 %rd24, %rd22, %rd23;
ld.global.f32 %f8, [%rd24];
add.f32 %f9, %f10, %f8;
st.global.f32 [%rd24], %f9;
BB2_6:
ret;
```
#### parallel
```
cvta.to.global.u64 %rd2, %rd12;
cvta.to.global.u64 %rd17, %rd10;
cvta.to.global.u64 %rd18, %rd11; #PC=0x4c0
add.s32 %r1, %r4, %r15; #PC=0x4c8 add
BB2_3:
.pragma "nounroll";
mul.wide.s32 %rd19, %r15, 4; #PC=0x4d0 move
add.s64 %rd26, %rd17, %rd19; #PC=0x4d8 move
add.s64 %rd25, %rd18, %rd19; #PC=0x4e0 move
ld.global.u32 %r14, [%rd26]; #PC=0x4e8
mul.wide.s32 %rd20, %r14, 4; #PC=0x4f0
add.s64 %rd21, %rd2, %rd20; #PC=0x4f8
ld.global.f32 %f4, [%rd21]; #PC=0x500
ld.global.f32 %f5, [%rd25]; #PC=0x508
mul.f32 %f6, %f5, %f4; #PC=0x510
atom.shared.add.f32 %f7, [%r5], %f6; #PC=0x518
@%p3 bra BB2_3; #PC=0x520
ld.shared.f32 %f10, [%r5];
BB2_5:
cvta.to.global.u64 %rd22, %rd13;
shl.b64 %rd23, %rd1, 2;
add.s64 %rd24, %rd22, %rd23;
ld.global.f32 %f8, [%rd24];
add.f32 %f9, %f10, %f8;
st.global.f32 [%rd24], %f9;
BB2_6:
ret;
```
### BC
```
//PC=0x0d0
BB0_4:
setp.ge.s32 %p4, %r22, %r21; //PC=0x0d8
@%p4 bra BB0_11; //PC=0x0e0
cvta.to.global.u64 %rd18, %rd11; //PC=0x0e8
cvta.to.global.u64 %rd19, %rd9; //PC=0x0f0
add.s32 %r5, %r13, 1; //PC=0x0f8
add.s64 %rd2, %rd18, %rd14; //PC=0x100
cvta.to.global.u64 %rd4, %rd12; //PC=0x108 create table
add.s32 %r1, %r21, %r22; //PC=0x110 extend set index
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd21, %r22, 4; //PC=0x118 move, set register
add.s64 %rd27, %rd19, %rd21; //PC=0x120 move
ld.global.u32 %r7, [%rd27]; //PC=0x128
mul.wide.s32 %rd23, %r7, 4; //PC=0x130
add.s64 %rd6, %rd13, %rd23; //PC=0x138
ld.global.u32 %r23, [%rd6]; //PC=0x140
setp.gt.s32 %p5, %r23, -1; //PC=0x148
@%p5 bra BB0_8; //PC=0x150
mov.u32 %r19, 1; //PC=0x158
st.global.u32 [%rd4], %r19; //PC=0x160 *cont = 1;
add.s32 %r20, %r13, 1; //PC=0x168
st.global.u32 [%rd6], %r20; //PC=0x170 d[w] = dist + 1;
mov.u32 %r23, %r5; //PC=0x178
BB0_8:
setp.ne.s32 %p6, %r23, %r5; //PC=0x180 break point r23: d[w], r5:
@%p6 bra BB0_10; //PC=0x188
add.s64 %rd26, %rd18, %rd23; //PC=0x190
ld.global.f32 %f1, [%rd2]; //PC=0x198
atom.global.add.f32 %f2, [%rd26], %f1; //PC=0x1a0
BB0_10:
add.s32 %r22, %r22, 1; //delete
add.s64 %rd27, %rd27, 4; //delete
setp.lt.s32 %p7, %r22, %r21; //delete
@%p7 bra BB0_6; //PC=0x1a8 finish or not / destroy table
BB0_11:
ret; //PC=0x1b0
```
### BC_WL
#### original
```
cvta.to.global.u64 %rd19, %rd12;
cvta.to.global.u64 %rd20, %rd10;
add.s32 %r6, %r14, 1;
add.s64 %rd2, %rd19, %rd18; #PC=0x100
mul.wide.s32 %rd22, %r23, 4; #PC=0x108 move
add.s64 %rd28, %rd20, %rd22; #PC=0x110 move
cvta.to.global.u64 %rd4, %rd13; #PC=0x118
cvta.to.global.u64 %rd23, %rd11; #PC=0x120
add.s32 %r1, %r22, %r23; #PC=0x128 add
BB0_5:
.pragma "nounroll";
ld.global.u32 %r8, [%rd28]; #PC=0x130
mul.wide.s32 %rd24, %r8, 4;
add.s64 %rd6, %rd23, %rd24;
ld.global.u32 %r24, [%rd6];
setp.gt.s32 %p4, %r24, -1;
@%p4 bra BB0_7;
mov.u32 %r20, 1;
st.global.u32 [%rd4], %r20;
add.s32 %r21, %r14, 1;
st.global.u32 [%rd6], %r21;
mov.u32 %r24, %r6;
BB0_7:
setp.ne.s32 %p5, %r24, %r6;
@%p5 bra BB0_9;
add.s64 %rd27, %rd19, %rd24;
ld.global.f32 %f1, [%rd2];
atom.global.add.f32 %f2, [%rd27], %f1;
BB0_9:
add.s32 %r23, %r23, 1; #remove
add.s64 %rd28, %rd28, 4; #remove
setp.lt.s32 %p6, %r23, %r22; #remove
@%p6 bra BB0_5;
```
#### modified
```
cvta.to.global.u64 %rd19, %rd12;
cvta.to.global.u64 %rd20, %rd10;
add.s32 %r6, %r14, 1; #PC=0x100
add.s64 %rd2, %rd19, %rd18; #PC=0x108
cvta.to.global.u64 %rd4, %rd13; #PC=0x110
cvta.to.global.u64 %rd23, %rd11; #PC=0x118
add.s32 %r1, %r22, %r23; #PC=0x120
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd22, %r23, 4; #PC=0x128
add.s64 %rd28, %rd20, %rd22;
ld.global.u32 %r8, [%rd28];
mul.wide.s32 %rd24, %r8, 4;
add.s64 %rd6, %rd23, %rd24;
ld.global.u32 %r24, [%rd6];
setp.gt.s32 %p4, %r24, -1;
@%p4 bra BB0_7;
mov.u32 %r20, 1;
st.global.u32 [%rd4], %r20;
add.s32 %r21, %r14, 1;
st.global.u32 [%rd6], %r21;
mov.u32 %r24, %r6;
BB0_7:
setp.ne.s32 %p5, %r24, %r6;
@%p5 bra BB0_9;
add.s64 %rd27, %rd19, %rd24;
ld.global.f32 %f1, [%rd2];
atom.global.add.f32 %f2, [%rd27], %f1;
BB0_9:
@%p6 bra BB0_5; #PC=0x1b8
```
### Color
#### original
```
BB0_4:
mov.u32 %r33, -1;
setp.ge.s32 %p4, %r2, %r30;
@%p4 bra BB0_9;
add.s32 %r5, %r30, -1;
mov.u32 %r33, -1;
cvta.to.global.u64 %rd17, %rd7;
mul.wide.s32 %rd18, %r2, 4; //remove
add.s64 %rd28, %rd17, %rd18; //remove
cvta.to.global.u64 %rd3, %rd10;
cvta.to.global.u64 %rd22, %rd8;
mov.u32 %r31, %r2; //create table
add.s32 %r1, %r31, %r30; //add, set index to table
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd18, %r31, 4; //move
add.s64 %rd28, %rd17, %rd18; //move
ld.global.u32 %r25, [%rd28]; //PC=0x130
mul.wide.s32 %rd20, %r25, 4;
add.s64 %rd21, %rd12, %rd20;
ld.global.u32 %r26, [%rd21];
setp.eq.s32 %p5, %r26, -1;
setp.ne.s32 %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;
BB0_7:
mov.u32 %r27, 1;
st.global.u32 [%rd3], %r27;
ld.global.u32 %r28, [%rd28];
mul.wide.s32 %rd23, %r28, 4;
add.s64 %rd24, %rd22, %rd23;
ld.global.u32 %r29, [%rd24];
max.s32 %r33, %r29, %r33;
BB0_8:
add.s64 %rd28, %rd28, 4; //remove
add.s32 %r31, %r31, 1; //remove
setp.lt.s32 %p8, %r31, %r30; //remove r31 : index, r30 : end
@%p8 bra BB0_6;
BB0_9:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd13;
st.global.u32 [%rd27], %r33;
BB0_10:
ret;
```
#### modified
```
BB0_4:
mov.u32 %r33, -1;
setp.ge.s32 %p4, %r2, %r30;
@%p4 bra BB0_9;
add.s32 %r5, %r30, -1;
mov.u32 %r33, -1;
cvta.to.global.u64 %rd17, %rd7;
cvta.to.global.u64 %rd3, %rd10;
cvta.to.global.u64 %rd22, %rd8;
mov.u32 %r31, %r2; //PC=0x118 create table
add.s32 %r1, %r31, %r30; //PC=0x120 add, set index to table
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd18, %r31, 4; //PC=0x128 move
add.s64 %rd28, %rd17, %rd18; //PC=0x130 move
ld.global.u32 %r25, [%rd28]; //PC=0x138
mul.wide.s32 %rd20, %r25, 4; //PC=0x140
add.s64 %rd21, %rd12, %rd20; //PC=0x148
ld.global.u32 %r26, [%rd21]; //PC=0x150
setp.eq.s32 %p5, %r26, -1; //PC=0x158
setp.ne.s32 %p6, %r2, %r5; //PC=0x160
and.pred %p7, %p5, %p6; //PC=0x168
@!%p7 bra BB0_8; //PC=0x170
bra.uni BB0_7; //PC=0x178
BB0_7:
mov.u32 %r27, 1; //PC=0x180
st.global.u32 [%rd3], %r27; //PC=0x188
ld.global.u32 %r28, [%rd28]; //PC=0x190
mul.wide.s32 %rd23, %r28, 4; //PC=0x198
add.s64 %rd24, %rd22, %rd23; //PC=0x1a0
ld.global.u32 %r29, [%rd24]; //PC=0x1a8
max.s32 %r33, %r29, %r33; //PC=0x1b0
BB0_8:
@%p8 bra BB0_6; //PC=0x1b8
BB0_9:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd13;
st.global.u32 [%rd27], %r33;
BB0_10:
ret;
```
#### parallel
original
```
add.s32 %r6, %r32, -1;
cvta.to.global.u64 %rd18, %rd8;
mul.wide.s32 %rd19, %r2, 4; //moved
add.s64 %rd29, %rd18, %rd19; //moved
cvta.to.global.u64 %rd4, %rd11;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r33, %r2; //
add.s32 %r1, %r32, %r33; //PC=, add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd19, %r33, 4; //PC=, move
add.s64 %rd29, %rd18, %rd19; //PC=, move
ld.global.u32 %r22, [%rd29]; //PC=0x158
mul.wide.s32 %rd21, %r22, 4; //60
add.s64 %rd22, %rd13, %rd21; //68
ld.global.u32 %r23, [%rd22]; //70
setp.eq.s32 %p5, %r23, -1; //78
setp.ne.s32 %p6, %r2, %r6; //80
and.pred %p7, %p5, %p6; //88
@!%p7 bra BB0_8; //90
bra.uni BB0_7; //98
BB0_7:
mov.u32 %r24, 1; //a0
st.global.u32 [%rd4], %r24; //a8
ld.global.u32 %r25, [%rd29]; //b0
mul.wide.s32 %rd24, %r25, 4; //b8
add.s64 %rd25, %rd23, %rd24; //c0
ld.global.u32 %r26, [%rd25]; //PC=0x1c8
atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d0
BB0_8:
add.s32 %r33, %r33, 1; //PC=0x1d8, remove
add.s64 %rd29, %rd29, 4; //PC=0x1e0, remove
setp.lt.s32 %p8, %r33, %r32; //PC=0x1e8, remove
@%p8 bra BB0_6; //PC=0x1f0
ld.shared.u32 %r34, [%r5]; //PC=0x1f8
BB0_10:
cvta.to.global.u64 %rd26, %rd12;
add.s64 %rd28, %rd26, %rd17;
st.global.u32 [%rd28], %r34;
BB0_11:
ret;
```
modified
```
add.s32 %r6, %r32, -1;
cvta.to.global.u64 %rd18, %rd8;
cvta.to.global.u64 %rd4, %rd11;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r33, %r2; //PC=0x140
add.s32 %r1, %r32, %r33; //PC=0x148, add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd19, %r33, 4; //PC=0x150, move
add.s64 %rd29, %rd18, %rd19; //PC=0x158, move
ld.global.u32 %r22, [%rd29]; //PC=0x160
mul.wide.s32 %rd21, %r22, 4; //PC=0x168
add.s64 %rd22, %rd13, %rd21; //PC=0x170
ld.global.u32 %r23, [%rd22]; //PC=0x178
setp.eq.s32 %p5, %r23, -1; //PC=0x180
setp.ne.s32 %p6, %r2, %r6; //PC=0x188
and.pred %p7, %p5, %p6; //PC=0x190
@!%p7 bra BB0_8; //PC=0x198
bra.uni BB0_7; //PC=0x1a0
BB0_7:
mov.u32 %r24, 1; //PC=0x1a8
st.global.u32 [%rd4], %r24; //PC=0x1b0
ld.global.u32 %r25, [%rd29]; //PC=0x1b8
mul.wide.s32 %rd24, %r25, 4; //PC=0x1c0
add.s64 %rd25, %rd23, %rd24; //PC=0x1c8
ld.global.u32 %r26, [%rd25]; //PC=0x1d0
atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d8
BB0_8:
@%p8 bra BB0_6; //PC=0x1e0
ld.shared.u32 %r34, [%r5]; //PC=0x1e8
BB0_10:
cvta.to.global.u64 %rd26, %rd12;
add.s64 %rd28, %rd26, %rd17;
st.global.u32 [%rd28], %r34;
BB0_11:
ret;
```
### Color_wl
#### original
```
add.s32 %r7, %r33, -1;
cvta.to.global.u64 %rd18, %rd8;
mul.wide.s32 %rd19, %r3, 4; #move
add.s64 %rd29, %rd18, %rd19; #move
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd10;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r34, %r3;
add.s32 %r1, %r33, %r34; #add
BB1_5:
.pragma "nounroll";
ld.global.u32 %r23, [%rd29]; #PC=0x348
mul.wide.s32 %rd21, %r23, 4;
add.s64 %rd22, %rd20, %rd21;
ld.global.u32 %r24, [%rd22];
setp.eq.s32 %p4, %r24, -1;
setp.ne.s32 %p5, %r3, %r7;
and.pred %p6, %p4, %p5;
@!%p6 bra BB1_7;
bra.uni BB1_6;
BB1_6:
mov.u32 %r25, 1;
st.global.u32 [%rd3], %r25;
ld.global.u32 %r26, [%rd29];
mul.wide.s32 %rd24, %r26, 4;
add.s64 %rd25, %rd23, %rd24;
ld.global.u32 %r27, [%rd25];
atom.shared.max.s32 %r32, [%r6], %r27;
BB1_7:
add.s32 %r34, %r34, 1; #remove
add.s64 %rd29, %rd29, 4; #remove
setp.lt.s32 %p7, %r34, %r33; #remove
@%p7 bra BB1_5;
```
#### modified
```
add.s32 %r7, %r33, -1;
cvta.to.global.u64 %rd18, %rd8;
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd10;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r34, %r3; #PC=0x140
add.s32 %r1, %r33, %r34; #PC=0x148
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd19, %r34, 4; #PC=0x150
add.s64 %rd29, %rd18, %rd19;
ld.global.u32 %r23, [%rd29];
mul.wide.s32 %rd21, %r23, 4;
add.s64 %rd22, %rd20, %rd21;
ld.global.u32 %r24, [%rd22];
setp.eq.s32 %p4, %r24, -1;
setp.ne.s32 %p5, %r3, %r7;
and.pred %p6, %p4, %p5;
@!%p6 bra BB0_7;
bra.uni BB0_6;
BB0_6:
mov.u32 %r25, 1;
st.global.u32 [%rd3], %r25;
ld.global.u32 %r26, [%rd29];
mul.wide.s32 %rd24, %r26, 4;
add.s64 %rd25, %rd23, %rd24;
ld.global.u32 %r27, [%rd25];
atom.shared.max.s32 %r32, [%r6], %r27;
BB0_7:
@%p7 bra BB0_5; #PC=0x1e0
```
### color max min
#### origin
```
add.s32 %r5, %r52, -1;
cvta.to.global.u64 %rd21, %rd6;
mul.wide.s32 %rd22, %r2, 4; #move
add.s64 %rd37, %rd21, %rd22; #move
cvta.to.global.u64 %rd26, %rd9;
cvta.to.global.u64 %rd27, %rd7;
mov.u32 %r53, %r2; #PC=0x178
add.s32 %r1, %r52, %r53; #add
BB0_6:
.pragma "nounroll";
ld.global.u32 %r27, [%rd37]; #PC=0x180
mul.wide.s32 %rd24, %r27, 4;
add.s64 %rd25, %rd12, %rd24;
ld.global.u32 %r28, [%rd25];
setp.eq.s32 %p5, %r28, -1;
setp.ne.s32 %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;
BB0_7:
mov.u32 %r29, 1;
st.global.u32 [%rd26], %r29;
ld.global.u32 %r30, [%rd37];
mul.wide.s32 %rd28, %r30, 4;
add.s64 %rd29, %rd27, %rd28;
ld.global.u32 %r31, [%rd29];
atom.shared.max.s32 %r36, [%r24], %r31;
ld.global.u32 %r37, [%rd37];
mul.wide.s32 %rd30, %r37, 4;
add.s64 %rd31, %rd27, %rd30;
ld.global.u32 %r38, [%rd31];
atom.shared.min.s32 %r41, [%r26], %r38;
BB0_8:
add.s32 %r53, %r53, 1; #delete
add.s64 %rd37, %rd37, 4; #delete
setp.lt.s32 %p8, %r53, %r52; #delete
@%p8 bra BB0_6; #PC=0x240
ld.shared.u32 %r55, [%r24];
ld.shared.u32 %r54, [%r26];
```
#### parallel
```
add.s32 %r5, %r52, -1;
cvta.to.global.u64 %rd21, %rd6;
cvta.to.global.u64 %rd26, %rd9;
cvta.to.global.u64 %rd27, %rd7;
mov.u32 %r53, %r2; #PC=0x168
add.s32 %r1, %r52, %r53; #PC=0x170 add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd22, %r53, 4; #PC=0x178 move
add.s64 %rd37, %rd21, %rd22; #PC=0x180 move
ld.global.u32 %r27, [%rd37]; #PC=0x
mul.wide.s32 %rd24, %r27, 4;
add.s64 %rd25, %rd12, %rd24;
ld.global.u32 %r28, [%rd25];
setp.eq.s32 %p5, %r28, -1;
setp.ne.s32 %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;
BB0_7:
mov.u32 %r29, 1;
st.global.u32 [%rd26], %r29;
ld.global.u32 %r30, [%rd37];
mul.wide.s32 %rd28, %r30, 4;
add.s64 %rd29, %rd27, %rd28;
ld.global.u32 %r31, [%rd29];
atom.shared.max.s32 %r36, [%r24], %r31;
ld.global.u32 %r37, [%rd37];
mul.wide.s32 %rd30, %r37, 4;
add.s64 %rd31, %rd27, %rd30;
ld.global.u32 %r38, [%rd31];
atom.shared.min.s32 %r41, [%r26], %r38;
BB0_8:
@%p8 bra BB0_6; #PC=0x230
ld.shared.u32 %r55, [%r24];
ld.shared.u32 %r54, [%r26];
```
### mis
#### origin
```
cvta.to.global.u64 %rd4, %rd10;
cvta.to.global.u64 %rd19, %rd9;
mul.wide.s32 %rd20, %r29, 4; #move
add.s64 %rd28, %rd19, %rd20; #move
BB1_6:
.pragma "nounroll";
ld.global.u32 %r9, [%rd28]; #PC=0x210
mul.wide.s32 %rd21, %r9, 4;
add.s64 %rd22, %rd1, %rd21;
ld.global.u32 %r25, [%rd22];
setp.ne.s32 %p5, %r25, -1;
@%p5 bra BB1_8;
add.s64 %rd24, %rd4, %rd21;
ld.global.u32 %r26, [%rd24];
atom.shared.min.s32 %r27, [%r6], %r26;
BB1_8:
add.s32 %r29, %r29, 1; #delete
add.s64 %rd28, %rd28, 4; #delete
setp.lt.s32 %p6, %r29, %r28; #delete
@%p6 bra BB1_6;
ld.shared.u32 %r30, [%r6];
BB1_10:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd18;
st.global.u32 [%rd27], %r30;
BB1_11:
ret;
```
#### modified
```
cvta.to.global.u64 %rd4, %rd10;
cvta.to.global.u64 %rd19, %rd9; #PC=0x1f8
add.s32 %r1, %r28, %r29; #PC=0x200 add
BB1_6:
.pragma "nounroll";
mul.wide.s32 %rd20, %r29, 4; #PC=0x208 move
add.s64 %rd28, %rd19, %rd20; #PC=0x210 move
ld.global.u32 %r9, [%rd28]; #PC=0x218
mul.wide.s32 %rd21, %r9, 4; #PC=0x220
add.s64 %rd22, %rd1, %rd21; #PC=0x228
ld.global.u32 %r25, [%rd22]; #PC=0x230
setp.ne.s32 %p5, %r25, -1; #PC=0x238
@%p5 bra BB1_8; #PC=0x240
add.s64 %rd24, %rd4, %rd21; #PC=0x248
ld.global.u32 %r26, [%rd24]; #PC=0x250
atom.shared.min.s32 %r27, [%r6], %r26; #PC=0x258
BB1_8:
@%p6 bra BB1_6; #PC=0x260
ld.shared.u32 %r30, [%r6]; #PC=0x268
BB1_10:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd18;
st.global.u32 [%rd27], %r30;
BB1_11:
ret;
```
### mis_wl
#### origin
```
cvta.to.global.u64 %rd2, %rd10;
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd9; #PC=0x1f8
mul.wide.s32 %rd21, %r30, 4; #PC=0x200 move
add.s64 %rd29, %rd20, %rd21; #PC=0x208 move
add.s32 %r1, %r29, %r30; #add
BB1_5:
.pragma "nounroll";
ld.global.u32 %r10, [%rd29]; #PC=0x210
mul.wide.s32 %rd22, %r10, 4;
add.s64 %rd23, %rd3, %rd22;
ld.global.u32 %r26, [%rd23];
setp.ne.s32 %p4, %r26, -1;
@%p4 bra BB1_7;
add.s64 %rd25, %rd2, %rd22;
ld.global.u32 %r27, [%rd25];
atom.shared.min.s32 %r28, [%r7], %r27;
BB1_7:
add.s32 %r30, %r30, 1; #remove
add.s64 %rd29, %rd29, 4; #remove
setp.lt.s32 %p5, %r30, %r29; #remove
@%p5 bra BB1_5;
```
#### modified
```
cvta.to.global.u64 %rd2, %rd10; #PC=0x1e8
cvta.to.global.u64 %rd3, %rd11; #PC=0x1f0
cvta.to.global.u64 %rd20, %rd9; #PC=0x1f8
add.s32 %r1, %r29, %r30; #PC=0x200
BB1_5:
.pragma "nounroll";
mul.wide.s32 %rd21, %r30, 4; #PC=0x208
add.s64 %rd29, %rd20, %rd21;
ld.global.u32 %r10, [%rd29];
mul.wide.s32 %rd22, %r10, 4;
add.s64 %rd23, %rd3, %rd22;
ld.global.u32 %r26, [%rd23];
setp.ne.s32 %p4, %r26, -1;
@%p4 bra BB1_7;
add.s64 %rd25, %rd2, %rd22;
ld.global.u32 %r27, [%rd25];
atom.shared.min.s32 %r28, [%r7], %r27;
BB1_7:
@%p5 bra BB1_5; #PC=0x260
```
### sssp
#### origin
```
cvta.to.global.u64 %rd19, %rd11;
cvta.to.global.u64 %rd20, %rd10;
mul.wide.s32 %rd21, %r24, 4; #move
add.s64 %rd28, %rd20, %rd21; #move
add.s64 %rd27, %rd19, %rd21; #move
BB0_3:
.pragma "nounroll";
ld.global.u32 %r19, [%rd28]; #PC=0x100
mul.wide.s32 %rd22, %r19, 4;
add.s64 %rd23, %rd17, %rd22;
ld.global.u32 %r20, [%rd23];
ld.global.u32 %r21, [%rd27];
add.s32 %r22, %r20, %r21;
atom.shared.min.s32 %r23, [%r6], %r22;
add.s64 %rd28, %rd28, 4; #delete
add.s64 %rd27, %rd27, 4; #delete
add.s32 %r24, %r24, 1; #delete
setp.lt.s32 %p3, %r24, %r4; #delete
@%p3 bra BB0_3;
ld.shared.u32 %r25, [%r6];
BB0_5:
cvta.to.global.u64 %rd24, %rd13;
shl.b64 %rd25, %rd1, 2;
add.s64 %rd26, %rd24, %rd25;
st.global.u32 [%rd26], %r25;
BB0_6:
ret;
```
#### modified
```
cvta.to.global.u64 %rd19, %rd11;
cvta.to.global.u64 %rd20, %rd10; #PC=0x0e0
add.s32 %r1, %r4, %r24; #PC=0x0e8 add
BB0_3:
.pragma "nounroll";
mul.wide.s32 %rd21, %r24, 4; #PC=0x0f0 move
add.s64 %rd28, %rd20, %rd21; #PC=0x0f8 move
add.s64 %rd27, %rd19, %rd21; #PC=0x100 move
ld.global.u32 %r19, [%rd28]; #PC=0x108
mul.wide.s32 %rd22, %r19, 4; #PC=0x110
add.s64 %rd23, %rd17, %rd22; #PC=0x118
ld.global.u32 %r20, [%rd23];
ld.global.u32 %r21, [%rd27];
add.s32 %r22, %r20, %r21;
atom.shared.min.s32 %r23, [%r6], %r22;
@%p3 bra BB0_3; #PC=0x140
ld.shared.u32 %r25, [%r6];
BB0_5:
cvta.to.global.u64 %rd24, %rd13;
shl.b64 %rd25, %rd1, 2;
add.s64 %rd26, %rd24, %rd25;
st.global.u32 [%rd26], %r25;
BB0_6:
ret;
```
---
## outcome
[spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0)
### normal
* pagerank
* speedup : 2.02x
* iterations : 26.6%
* memory access number: 62.9%
* BFS
* speedup : 1.08x
* iterations : 84.9%
* memory access number: 41.4%
* Color Max(pagerank graph)
* speedup : 2.32x
* iterations : 26.6%
* memory access number: 15.2%
* MIS(pagerank graph)
* speedup : 1.91x
* iterations : 26.6%
* memory access number: 21.6%
* SSSP
* speedup : 1.97x
* iterations : 26.6%
* memory access number: 13.1%
### bypass l1 cache
* Color Max
* speedup : 1.49x
* iterations : 99.9%
* memory access number: 66.2%
### working list
* BFS
* gpgpu-sim bug in global atomic increment
* possibly because an atomic's result is not visible immediately
* the wl code has a bug
* speedup : x
* iterations : %
* memory access number: %
---
## Run Benchmarks
[experiment spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1338197220)
[professor's spreadsheet](https://docs.google.com/spreadsheets/d/14ibl-NsZcrZzOVNC3gcKF7Ow2tnZt8DKzSlhPU888ak/edit#gid=283140300)

### Loop Unroll
```cpp=
#pragma unroll 1
for(int i=start; i<end; i++){
...
}
```
* `#pragma unroll N` tells the compiler how many copies of the loop body to unroll
* N=1 keeps a single copy, i.e. it **turns off** loop unrolling
### atomic operation
* instructions.cc
    * atom_impl
    * thread->m_last_dram_callback.function_wt
* cuda-sim.cc
    * ptx_exec_inst
    * inst.add_callback(lane_id, last_callback().function_wt, last_callback().instruction, this, threads, true /*atomic*/);
* abstract_hardware_model.cc
    * do_atomic
    * cb.function_wt(cb.instruction, cb.thread, cb.threads);
* abstract_hardware_model.h
    * add_callback
    * dram_callback_t
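A rough stand-in for the callback chain above: `atom_impl` records a function pointer and its arguments in a `dram_callback_t`, and `do_atomic` later invokes it when the memory access completes. The sketch below mimics that flow with simplified types (the real gpgpu-sim signatures differ):

```cpp
#include <cassert>

// Simplified dram_callback_t: a function pointer plus its argument.
// In gpgpu-sim the record carries the instruction and thread instead.
struct dram_callback_sketch {
    void (*function)(int*);   // stand-in for function_wt
    int* data;                // stand-in for instruction/thread args
};

// The deferred atomic body that function_wt would point at.
void atomic_add_one(int* p) { *p += 1; }

// add_callback registers the record; do_atomic fires it later,
// which is why an atomic's result is not visible immediately.
int run_deferred_atomic(int initial) {
    int cell = initial;
    dram_callback_sketch cb{atomic_add_one, &cell};  // add_callback
    cb.function(cb.data);                            // do_atomic
    return cell;
}
```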
### 1. gpgpusim.config
```
#lemonsien
-gpgpu_perfect_mem 0
-gpgpu_concurrent_kernel_sm 1
-scheduler_kind 2
-partition_method 1
...
...
# high level architecture configuration
-gpgpu_n_clusters 30
...
...
# Device Limits
-gpgpu_kernel_launch_latency 0
```
* cache allocation policy : must be changed to on-miss for hit-reserved to be handled correctly
    * otherwise every access is treated as a miss
* -scheduler_kind
    * 0 : origin
    * 1 : paver
    * 2 : paver2
* -partition_method
    * 0 : BCS
    * 1 : k-way
    * 2 : recursive (not working)
### 2. Run
```
#pannotia/fw
./fw ../../../../data_dirs/pannotia/fw/data/256_16384.gr &> output_memstall &
```
```
#pannotia/bc
./bc ../../../../data_dirs/pannotia/bc/data/1k_128k.gr &> output_paver_30groups_MST_ideal &
```
```
#pannotia/color_max
./color_max ../../../../data_dirs/pannotia/color_max/data/ecology1.graph 1 &> output_paver_30groups_MST_ideal &
```
```
#pannotia/sssp
./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 1 &> output_paver_30groups_MST_ideal &
```
```
#pannotia/pagerank
./pagerank ../../../../data_dirs/pannotia/pagerank/data/coAuthorsDBLP.graph 1 &> output_paver2 &
```
```
#pannotia/sssp
./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 0 &> output_memstall2 &
```
```
#lonestar/mst
./run &> output_paver_30groups_MST_ideal &
```
```
#lonestar/sssp
./sssp ../../../../../data_dirs/cuda/lonestargpu-2.0/inputs/USA-road-d.FLA.gr &> output_memstall2 &
```
```
#mis
./mis data/ecology1.graph 1 &> output_memstall &
```
## Git
http://blog.gogojimmy.net/2012/02/29/git-scenario/
```
git commit -am "comments"
git push -u origin branch_name
git pull origin branch_name
```
#### Log in
```
git config --global user.name "<user name>"
git config --global user.email "<email>"
```
check configure:
```
git config --list
or
cat ~/.gitconfig
```
#### Check status
```
git status
```
#### View commit history
```
git log
```
#### Add new (untracked) files
```
git add file_name
```
Add all files at once:
```
git add .
```
#### Commit (records a message for the changed files)
```
git commit -m "commit message"
```
After `git add .`, skip the editor and commit in one step:
```
git commit -am "commit message"
```
#### Branch
Create a new branch:
```
git branch new_branch
```
List existing branches:
```
git branch
```
Switch branches:
```
git checkout target_branch
```
#### Merge branch_name into the current branch
```
git merge branch_name
```
## Perfect Memory
* memory access
    * memory is not actually accessed: "**go to memory**" is immediately followed by "**return to cluster**"


### memory stall
* kinds
    * memory_cycle
        * memory access not done
    * shared_cycle
    * constant_cycle
    * texture_cycle
* memory priority
    * ICNT_RC_FAIL = BK_CONF > COAL_STALL
* stall-type enum values
    * NO_RC_FAIL = 0
    * BK_CONF
    * MSHR_RC_FAIL
    * ICNT_RC_FAIL
    * COAL_STALL
    * TLB_STALL
    * DATA_PORT_STALL
    * WB_ICNT_RC_FAIL
    * WB_CACHE_RSRV_FAIL
    * N_MEM_STAGE_STALL_TYPE
## Related papers
### irregular benchmark
* Graph-Waving architecture: Efficient execution of graph applications on GPUs
* [link](https://www.sciencedirect.com/science/article/pii/S0743731520303889)
* node to warp mapping, eliminate duplicate execution, shrink the warp size.
* Fine-Grained Management of Thread Blocks for Irregular Applications
* [link](https://ieeexplore.ieee.org/abstract/document/8988776?casa_token=mPwWuVZ8uekAAAAA:g3-72tUqQEa0RrzLppSWgs1erUiF5kwMhDruh5wyMgWfSJos8Q9pVCEiALDTIbiXQi0t7yPwGj4) 2019
* irregular application(graph)的狀態會隨著時間而改變(因為少數node關聯度很高)
* 用前1個kernel的TB去預測下一個kernel的TB的狀態
* 及時追蹤每一個TB的現在的狀態
* 狀態包括Load-Store unit stall,Floating-Point Unit的使用度
* 根據這些狀態去調整SM中TB的數目,允許preemption
* gain 17% and 13% improvements of throughput over previously proposed static and dynamic scheduling strategies, respectively
* Dynamic Thread Block Launch: A Lightweight Execution Mechanism toSupport Irregular Applications on GPUs
* [link](https://dl.acm.org/doi/epdf/10.1145/2872887.2750393) 2015
* 為了加速irregular application,這篇論文讓thread可以產生child thread block
* 譬如:有些graph的每個thread會用for loop去traverse某個node的neighbor,每個node的neighbor不同時,會很分散。若是create一個child kernel或child thread block去concurrent執行所有的neighbor會很有用
* 1.21x speedup over the original flat implementation and average 1.40x over the implementations using device-kernel launch functionality
* A quantitative study of irregular programs on GPUs
* [link](https://ieeexplore.ieee.org/abstract/document/6402918)
* Pannotia: Understanding irregular GPGPU graph applications
* [link](https://ieeexplore.ieee.org/abstract/document/6704684)
* Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription
* [link](https://ieeexplore.ieee.org/document/9139797)
* working sets of irregular applications can often be divided into cold and hot data structures.
* cold allocations are accessed seldom and sparsely
* hot data structures have dense sequential access
* propose a dynamic page placement strategy for irregular general purpose applications differentiate between hot and cold allocations
* performance improvement of 22% to 78%
* Optimizing non-coalesced memory access for irregular applications with GPU computing
* [link](https://link.springer.com/content/pdf/10.1631/FITEE.1900262.pdf) 2020
* software solution with no hardware extensions to solve dynamic irregularities
* data reordering
* eliminate indirect memory access by copying to a new location and reordering
* Index redirection
* reduce overhead
* for loop拿到kernel外面,run multiple streams of kernels(1 for count, others for remap)
* For loop-carried applications, reordered data are cached for the next iteration or other kernels to eliminate possible redundant data reorganization
* 7.0%–7.17% overall performance improvement
* Managing DRAM Latency Divergence in Irregular GPGPU Applications
* [link](https://www.cs.utah.edu/~rajeev/pubs/sc14.pdf) 2014
* different requests from a warp are serviced together, or in quick succession, without interference from other warps
* warp-aware memory controllers
* boost performance by 10.1%
* Graph Processing on GPUs: A Survey
* [link](https://www.dcs.warwick.ac.uk/~liganghe/papers/ACM-Computing-Surveys-2017.pdf) 2018
* graph processing
* data layout
* memory access pattern
* workload mapping
* GPU programming
* new research challenges and opportunities
* Graph processing on hybrid systems
* Graph processing on new GPU architecture
* Dynamic graph processing on GPUs
* the graph structure can be updated frequently in runtime
* dynamically maintain balanced workload mapping with the rapid changes of graphs
* Machine-learning applications
* Anatomy of GPU Memory System for Multi-Application Execution
* [link](https://www.cs.utexas.edu/users/skeckler/pubs/MEMSYS_2015_Anatomy.pdf) 2015
* study the memory system of GPUs in a concurrently executing **multi-application** environment
* **L2-MPKI** and **attained bandwidth** are the two knobs that can be used to drive memory scheduling decisions for achieving better performance
* Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads
* [link](https://arxiv.org/pdf/2007.07131.pdf)
* 1.33x and 13% improvement in performance and energy savings respectively
* On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
* [link](https://dl.acm.org/doi/pdf/10.1145/1950365.1950408)
* On-GPU Thread-Data Remapping for Branch Divergence Reduction
* [link](https://dl.acm.org/doi/10.1145/3242089)
* Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
* [link](https://ieeexplore.ieee.org/abstract/document/7349653)
* use compiler to transform the original code to the template
* Spare register aware prefetching for graph algorithms on GPUs
* [link](https://ieeexplore.ieee.org/abstract/document/6835970)
* uses spare registers to store prefetched edges
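The data reordering / index redirection idea summarized above (the non-coalesced-memory-access paper) can be sketched on the host side. This is only an illustrative sketch, not the paper's GPU implementation; the function and variable names are made up:

```python
# Illustrative sketch of data reordering: the kernel's indirect access
# out[i] = f(data[idx[i]]) is non-coalesced; gathering data[idx[i]] into a
# contiguous buffer once turns it into a unit-stride (coalesced) read.

def reorder(data, idx):
    """Gather data[idx[i]] into a new contiguous array (one pass)."""
    return [data[j] for j in idx]

data = [10, 20, 30, 40, 50]
idx = [4, 0, 2, 2]                 # scattered, input-dependent indices
contig = reorder(data, idx)        # [50, 10, 30, 30]

# Index redirection: after reordering, iteration i reads contig[i] directly.
out = [x * 2 for x in contig]      # [100, 20, 60, 60]
```

For loop-carried applications, `contig` would be cached and reused by later iterations or kernels, as the paper notes.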
### memory access divergence
* DaCache: Memory Divergence-Aware GPU Cache Management
### pre-launch (inter-kernel)
* BlockMaestro: Enabling Programmer-Transparent Task-based Execution in GPU Systems
* [link](https://ieeexplore.ieee.org/document/9499869)
* uses a bipartite graph to find TBs with dependencies (locality), and pre-launches the later TBs to reduce launch time
* can be done entirely by the compiler, so no code changes are needed
* does not work for **input-dependent** cases
### inter-kernel locality
* Inter-kernel reuse-aware thread block scheduling
* group TBs
* align each group of the next kernel with the corresponding group of the previous kernel
* reverse
* work stealing, one TB at a time
### intra-kernel locality
* PAVER: Locality Graph-Based Thread Block Scheduling for GPUs
* [link](https://dl.acm.org/doi/fullHtml/10.1145/3451164) 2021
* uses graph theory to extract the locality among TBs
* uses graph partitioning to split the graph into groups; each group is assigned to one SM
* work stealing steals more than one TB at a time to preserve locality
* does not work for **input-dependent** cases
* [note](https://hackmd.io/q-vSikPLSdmNqHPz1GwQwA?view)
* performance speedup by 29%, 49.1%, and 41.2%
* Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling
* [link](https://ieeexplore.ieee.org/abstract/document/7898501)
* at run time, computes the TB that best matches the TBs currently on the SM as the candidate
* does not work for **input-dependent** cases
* Improving GPGPU resource utilization through alternative thread block scheduling
* [link](https://ieeexplore.ieee.org/abstract/document/6835937?casa_token=jQrF9KO5vLMAAAAA:1_H9ANBV8mdo4iuC_Nz5bmcPPx4aKsjx4l15Y_1WnhdE8RBbm2hvXO6kdiSmRKZfK2NCJIoG_AU)
* workload type
* TB increase, performance increase
* TB increase, performance increase then saturate
* TB increase, performance decrease
* TB increase, performance increase then decrease
* core activity
* ACTIVE
* INACTIVE
* Idle
* Mem_stall
* Core_stall
* the number of TBs in an SM affects performance; the maximum is not necessarily the best
* <img src="https://i.imgur.com/udazh1W.png" width="40%" height="40%" />
* consecutive CTAs are assigned to the same cores
### analyze data locality
* LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs
### warp scheduler
* A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization
### multiple streams scheduler
* Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications
* Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels
* Increasing GPU throughput using kernel interleaved thread block scheduling
### graph
* Practical Optimal Cache Replacement for Graphs
* uses the transpose (CSC<->CSR) to quickly find the cache line whose next access is farthest in the future
* suited to graph-traversal applications (in/out edges)
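The optimal replacement that the paper above approximates is Belady's MIN: evict the line whose next access is farthest in the future. A trace-driven sketch for a fully associative cache (illustrative only; the paper derives next-access times cheaply from the CSC<->CSR transpose instead of scanning the trace as done here):

```python
def belady_misses(trace, capacity):
    """Count misses under optimal (Belady MIN) replacement,
    fully associative cache holding `capacity` lines."""
    n = len(trace)
    # next_use[i]: index of the next access to trace[i]'s line (inf if none).
    next_use = [float("inf")] * n
    last = {}
    for i in range(n - 1, -1, -1):
        next_use[i] = last.get(trace[i], float("inf"))
        last[trace[i]] = i
    cache, nxt, misses = set(), {}, 0
    for i, line in enumerate(trace):
        if line not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the resident line re-used farthest in the future.
                victim = max(cache, key=lambda l: nxt[l])
                cache.remove(victim)
            cache.add(line)
        nxt[line] = next_use[i]  # refresh this line's next-use time
    return misses

print(belady_misses(["A", "B", "C", "A", "B", "A"], 2))  # 4
```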
## Time: 1/21/2022 ~ 1/26/2022
* goal
* sssp_spmv, parboil: spmv
* node to warp mapping
* data driven
## Time: 1/12/2022 ~ 1/20/2022
* progress
* pagerank 5 graph testing
* pagerank spmv
* color max min
## Time: 10/21/2021 ~ 10/28/2021
* child thread block
* change program?
* warp formation
* can do?
* TB formation
## Time: 10/18/2021 ~ 10/25/2021
* understand CSB
* pagerank issues
* compress the sparse matrix
* cuSha
## Time: 10/14/2021 ~ 10/21/2021
* ~~unroll the third level of color~~
* run sssp on NY
* run color on the large graph G3
* port other benchmarks
* ~~write the pagerank results to the spreadsheet~~
* pagerank issues
* atomicAdd ?
* remove ITER
* check perfect memory / cache.nonflush for each kernel
* memory bound issue?
* outcome
* color-max
* slightly faster
* fewer memory accesses, but higher miss rate
* unrolling the node values is even slower; memory accesses decrease further, but the miss rate rises further
* the graph is well balanced
* page-rank
* much slower
* fewer memory accesses, but higher miss rate
* the graph is very unbalanced
## Issue
* race condition (sssp : min)
* possible solution : atomic or synchronization
* the two nested for loops cannot be unrolled
## Time: 10/7/2021 ~ 10/14/2021
* print the post-remap cycle count for each kernel
* run the remap on the GPU (faster for debugging)
* resolve branch divergence (sort vertices by edge count)
* gpgpu-sim_simulations/benchmarks/src/cuda/rodinia/3.1/cuda/bfs/
## Time: 9/24/2021 ~ 9/30/2021
check whether memory coalescing is broken
identify where the memory bound actually is
refer to data reorder (non-coalesced)
## Time: 9/6/2021 - 9/12/2021
林孟賢: add the ideal-cache miss rate, per-SM cycle counts (to judge load balance), and per-kernel miss rates
席致寧: add mst to the group, run benchmarks
## Time: 8/23/2021 - 9/5/2021
林孟賢: implement PAVER
席致寧: run benchmarks
## Time: 8/6/2021 - 8/12/2021
* embeddingbag
fixed embeddingbag
cache access pattern of embedding with 2 keys
```
-3221225472 57 => input(indices)
-3221241536 1
-3221241472 7909 => weightFeat(weight)
-3221241408 1
-3221241344 7909 => weightFeat(weight)
-3221241152 1
-3221241088 7445 => weightFeat(weight)
-3221241280 1
-3221241216 7445 => weightFeat(weight)
-3221225536 1
-3221225600 57 => input(indices)
-3221225664 1
-3221225728 57
-3221225792 1
-3221225856 57
-3221225920 1
-3221225984 57
```
* Conclusion
* threads access weightFeat randomly (input-dependent), and the locality disappears as the dictionary grows
* MIS
* print address access counts
* shared/not shared histogram
* Observations
* [spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1340093626 )
* When locality is high for both shared and not-shared data, our cache partitioning approach has no advantage.
* If running on multiple SMs preserves the shared locality while the not-shared locality disappears, our approach may work. But our shared set is derived from inter-kernel locality, so whether it is effective intra-kernel is hard to say.
* If our method (PCG) can freely count hits, pending hits, and sector misses, then under a 4/6 split it can achieve results comparable to LRU.
* Terminology
* ideal: only counts how many times a cache line is **accessed**
* normal: only counts how many times a cache line is hit (including the first miss), ignoring pending hits, sector misses, etc.
* shared: cache lines accessed by **our tagged PCs** or **high-access-count PCs**
* not shared: the complement of shared
* both: cache lines accessed by both shared and not shared
* Outcome
* comparison between ideal and LRU
* kernel 2
* ideal, miss rate 0.243
<img src="https://i.imgur.com/l9ygFRx.png" width="50%" height="50%" />
* LRU, miss rate 0.287
<img src="https://i.imgur.com/lkn0XIW.png" width="50%" height="50%" />
* Conclusion
* LRU is already close to ideal (total miss rate 0.54 vs 0.48)
* compared with shared, the not-shared cache lines are actually not kept well
* bypass l1 cache
* slower
* high-frequency PCs get high priority to stay in the cache (with LRU)
* slower
* ideal algorithm
* block addr should be changed to addr
* sector misses must be taken into account
* miss-rate computation
* tag_array computes the read miss rate (which is exactly what we want)
* individual MISS or SECTOR_MISS counts differ from gpgpu-sim's, but the totals match (so it makes no difference in our case)
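Our reading of the "ideal" metric above, as a quick sketch: rank cache lines by access count, assume the top `capacity` lines stay resident for the whole kernel (one cold miss each), and count every other access as a miss. Illustrative only; it ignores the addr-vs-block-addr and sector-miss corrections noted above:

```python
from collections import Counter

def ideal_miss_rate(trace, capacity):
    """Frequency-based 'ideal' bound: only considers how many
    times each cache line is accessed, as defined in the notes."""
    counts = Counter(trace)
    # Keep the `capacity` most-accessed lines resident for the whole run.
    kept = {line for line, _ in counts.most_common(capacity)}
    misses = len(kept)  # one cold miss per resident line
    misses += sum(c for line, c in counts.items() if line not in kept)
    return misses / len(trace)

# A is accessed 4x, B 3x, C once; with capacity 2 we keep A and B.
print(ideal_miss_rate(["A"] * 4 + ["B"] * 3 + ["C"], 2))  # 3/8 = 0.375
```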
## Time: 7/30/2021 - 8/5/2021
* split the cache into two parts
* shared (80%)
* not shared (20%)
* replacement policy
* shared
* LIFO
* LRU
* not shared
* LRU
* compared with ideal
* cta scheduler reverse
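The way-partition plan above can be prototyped as a single set split into a shared region (80% of the ways, LIFO or LRU) and a not-shared region (20%, LRU). A minimal sketch; the real change would live in gpgpu-sim's tag_array, and the class name here is made up:

```python
class PartitionedSet:
    """One cache set split into shared and not-shared regions."""
    def __init__(self, ways, shared_frac=0.8, shared_policy="LRU"):
        self.shared_ways = max(1, int(ways * shared_frac))
        self.other_ways = ways - self.shared_ways
        self.shared, self.other = [], []     # last element = most recent
        self.shared_policy = shared_policy

    def access(self, line, is_shared):
        """Return True on hit, False on miss (the line is then inserted)."""
        region, cap = ((self.shared, self.shared_ways) if is_shared
                       else (self.other, self.other_ways))
        if line in region:
            if not (is_shared and self.shared_policy == "LIFO"):
                region.remove(line)          # LRU: refresh recency
                region.append(line)
            return True
        if len(region) >= cap:
            if is_shared and self.shared_policy == "LIFO":
                region.pop()                 # LIFO evicts the newest line
            else:
                region.pop(0)                # LRU evicts the oldest line
        region.append(line)
        return False
```

A 10-way set yields an 8/2 split, matching the 80/20 partition above; replaying the same trace through both policies gives the "compared with ideal" numbers.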
---
* add a counter to each cache line to record hit counts
* set l1 to tag_array
* print cache access
* s1
* 1 cta
* 1 warp
* sector misses must be considered
* future work
* print cta access status
## Time: 7/22/2021 - 7/29/2021
* implemented call/assign by reference/value
* implemented on vec-add; the results match expectations
## Time: 7/19/2021 - 7/26/2021
* 7/16
* basic C syntax OK
* bear did not work, so CUDA code that requires make cannot be compiled yet
* 7/19
* started on the data-member part
* 7/20
* the data-member part mostly works
* found problems with member functions
* 7/21
* restructured the data-member data structure to fit the member-function operations
* started on member functions
* 7/22
* basic member-function cases work
* found that member functions fail to locate data members under inheritance
* locate the inherited class
* TODO
* **declare** to **definition** (getDefinition)
* change from treating everything as pass by reference to distinguishing pass by reference/value
* alias (propagation of variables)
* consider control flow (if, else, etc.)
## Time: 7/12/2021 - 7/19/2021
### 1. Successes Last Week
<!--
* What you managed/plan to accomplish this week - hopefully in line with goals - if not that's okay too.
-->
* 席致寧
* 7/9 ~ 7/16
* [[筆記] build and run gpgpu-sim](https://hackmd.io/WsOKV8MmTfS1-Y5gyrS2UQ?view)
* 7/16 ~ 7/23
* change c_Kernel from constant memory to global memory
* change the for loop to multiple streams
* [[筆記] streams & memory](https://hackmd.io/WTA8Vz_8Slyveitd7JDYMw?view)
* 7/23 ~ 7/30
* [BuildMI()](https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class) : extend machine instruction
* [[筆記]memory](https://hackmd.io/MFf9C4AHRU2z6oy6X9DRXg)
* [[筆記]LLVM](https://hackmd.io/HKv_OZvXSmWvkLXHTu6BaA)
* 7/30 ~ 8/6
* [NVPTXInstrInfo.td](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td)
* 8/6 ~ 8/13
* compiling LLVM
* 8/13 ~ 8/20
* [P-OPT: Practical Optimal Cache Replacement for Graph Analytics Architectures](https://hackmd.io/iCDyMqTNQ8a1DzwtSSV4Jg)
* [Internal of LLVM](https://hackmd.io/bwlGi-ntTxCF1pTvB0N-YA)
### 2. Progress and Problems Last Week
<!--
* What have you done last week ?
* What problems encountered last week ?
* What caused you to be slow on a particular task? What roadblocks did you encounter?
-->
* After discussing with **李睦樂**, we think it is theoretically feasible, but practical issues such as implementation details and whether the clang API supports them are still being explored
* the limitation is that some kernel parameters are only determined at runtime
* for example, in C = A+B, only at runtime do we learn that C and A point to the same object, so at compile time we cannot know that A's value is not read-only
* current idea: give each variable and object a counter, increment it on every write, and finally check which variables qualify
<font color= #DCDCDC>
* allocate for each **variable** a record of its **write count**, **kind (normal, array, pointer, reference)**, and a **pointer (to another variable's record)**
* finally, use each variable's **own record** and the **record it points to** to decide whether it is constant
* a count above 2 means non-constant
* need to trace **call by value**, **call by reference**, etc. (we hit problems identifying the variables we want, because more than one variable may appear on **either side of =**)
* determine the level of each variable on the **right-hand side of =**
* *: level - 1
* &: level + 1
* binary operator: keep the smaller level
* use post-order traversal with a stack to associate the lowest-level variable with the variable on the **left-hand side of =**
* used together with children()
* http://web.mit.edu/freebsd/head/contrib/llvm/tools/clang/lib/AST/StmtPrinter.cpp
</font>
* write instructions include assign, compound assign, and inc/dec
* the written variable is the first variable on the **left-hand side of =** (this may be problematic)
<!--* 用bear -vvvv make這調指令以獲得compile_commands.json-->
* on sp and mst, **bear** and **compiledb** cannot obtain compile_commands.json
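The level rule in the greyed-out notes above (`*` lowers the level by 1, `&` raises it by 1, a binary operator keeps the smaller) can be checked on a toy expression tree. The notes propose a post-order traversal with a stack over the Clang AST; plain recursion below is equivalent, and the tuple-based tree is a made-up stand-in for the AST:

```python
# Toy check of the level rule for variables on the right-hand side of '=':
# '*' (deref) lowers the level by 1, '&' (address-of) raises it by 1,
# and a binary operator keeps the smaller of its operands' levels.

def level(node):
    """Post-order walk computing the indirection level of an expression."""
    kind = node[0]
    if kind == "var":
        return 0
    if kind == "deref":                  # *expr
        return level(node[1]) - 1
    if kind == "addrof":                 # &expr
        return level(node[1]) + 1
    if kind == "binop":                  # a op b: keep the smaller level
        return min(level(node[1]), level(node[2]))
    raise ValueError(f"unknown node kind: {kind}")

# p = &x     -> RHS level 1
print(level(("addrof", ("var", "x"))))                          # 1
# y = *p + z -> RHS level min(-1, 0) == -1, so *p is the association target
print(level(("binop", ("deref", ("var", "p")), ("var", "z"))))  # -1
```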
### 3. Not Resolved Problems
<!--* The checking list of problems you have not resolved yet.-->
* we hit problems identifying the variables we want, because more than one variable may appear on **either side of =**
### 4. Goals for next week
* 席致寧
* shader.cc (/src/gpgpu-sim)
* issue_warp
* build an array inside each TB to collect ld.global addresses, then find the intersection across different TBs
* warp_inst_t (/src/abstract_hardware_model.cc)
* class warp_inst_t()
* add an array to store the addresses
* memory_space_t (/src/abstract_hardware_model.cc)
* is_global()
* 林孟賢
* treat every assign as pass by reference, run it on different CUDA benchmarks to make sure the CUDA code runs successfully, and identify possible problems
* vecadd: ok
* ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables test_case/vector-add.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
* pathfinder:
* ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables pathfinder.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
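The goal above (collect ld.global addresses per TB, then intersect across TBs) can be prototyped outside the simulator first. Illustrative sketch; in gpgpu-sim the addresses would come from warp_inst_t, and the 128-byte line size is an assumption:

```python
def tb_shared_lines(tb_a_addrs, tb_b_addrs, line_size=128):
    """Intersect the cache lines touched by two TBs' ld.global addresses."""
    lines_a = {addr // line_size for addr in tb_a_addrs}
    lines_b = {addr // line_size for addr in tb_b_addrs}
    return lines_a & lines_b

tb0 = [0x1000, 0x1004, 0x2000]
tb1 = [0x1040, 0x3000]             # 0x1040 falls in the same line as 0x1000
print(sorted(tb_shared_lines(tb0, tb1)))   # [32]
```

A large intersection suggests the two TBs share locality and should be scheduled on the same SM.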