nycu-cas lab
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Write
        • Owners
        • Signed-in users
        • Everyone
        Owners Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.

      Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Explore these features while you wait
      Complete general settings
      Bookmark and like published notes
      Write a few more notes
      Complete general settings
      Write a few more notes
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Help
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Write
Owners
  • Owners
  • Signed-in users
  • Everyone
Owners Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.

    Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Explore these features while you wait
    Complete general settings
    Bookmark and like published notes
    Write a few more notes
    Complete general settings
    Write a few more notes
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # Persistent Caching ###### tags: `GPU` ###### members: @林孟賢 @席致寧 [paper](https://www.overleaf.com/3161447213kchpscspbmxk) [proposal](https://docs.google.com/document/d/1LPxwv7mzae4n1qHnJnDLL_UtjeJf0ngaChi5H_2DfHM/edit?usp=sharing) [report](https://drive.google.com/file/d/152gvtqASZaRv1jkt5Guig8FHHtJiFg0R/view?usp=sharing) [spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0) [slide](https://docs.google.com/presentation/d/1Thz_tu13iYqtAW99xmj1XgWEpikkIbobYvLYku2fkGw/edit?usp=sharing) [oral test](https://docs.google.com/presentation/d/12lGzoFUw9xnw112_kMmgaDO5RW0ygAmvc0cVbPRxzM4/edit?usp=sharing) [reference ppaer](https://hackmd.io/867BaaD-QIe1ytWs7FXYpQ) [DAC paper bullet point](https://hackmd.io/ES09vnUQT-KgcvPX5h2fVw?both) ## Idea ![](https://i.imgur.com/vIoz63Q.png) ![](https://i.imgur.com/oMXxn4J.png) <!-- ![](https://i.imgur.com/CycLrc0.png)--> ![](https://i.imgur.com/kcoBzo5.png) ![](https://i.imgur.com/bxPxGry.png) * spare register([reference paper](https://ieeexplore.ieee.org/abstract/document/6835970)) * Due to resource constraints there may be unallocated registers in a GPU. * If a thread warp has terminated, then registers allocated to the warp are free. * register file access * ![](https://i.imgur.com/9QIySp9.png) * new version of gpgpu-sim contain **Operand collector** which contain a cross-bar, and can access all register file * ![](https://i.imgur.com/pdAiXtc.png) * compiler assist * programming interface like openMP * need user to specify the independent for loop * need compiler to calculate the usage of register numbers in each iteration. * **到達定植分析**可以找出**循環不便變量** * runtime * in runtime we need a table to record the index of the for loop in each thread ### Compare to Software Solution additional space required by software solution : * `length[]` * 紀錄一個warp裡面每條thread所需要的iteration * additional space : O(V) * `start[]` * 紀錄一個warp裡面每條thread的起始位置 * additional space : O(V) * `origin_thread[]` * 紀錄每條邊來自於那一條thread * additional space : O(E) additional space required by hardware solution : * `index_table` * 紀錄一個warp裡面每條thread的start 跟 end * additional space : O(V) 1. hardware solution是rum time生成table, 當某條warp完成運算後即可release掉空間,所以實際上花的空間不會那麼多 2. software solution 會有 preprocessing kernel overhead 3. software solution要先將data(start, length, ...)存在global memory,hardware solution則是extend hardware component來存run time生成的table 4. software solution的memory access次數會變多 ## Related paper list * hardware * Graph-Waving architecture: Efficient execution of graph applications on GPUs * map each node to a warp instead of thread * need to change the user’s code * need to eliminate duplicate scaler instruction * Dynamic Thread Block Launch: A Lightweight Execution Mechanism toSupport Irregular Applications on GPUs * use dynamic launch “thread block” to replace dynamic launch “kernel”, thus alleviate the overhead of the “kernel” and increase parallelism * not fully utilized the SIMD lane when degree of the node is small * GraphPEG: Accelerating Graph Processing on GPUs * use “hardware function” to collect needed edges to the “shared memory”, then process on the collected edge * solve all the aforemention problem, but add many additional hardwares * Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists * Active Thread Compaction for GPU Path Tracing * software * gunrock * Accelerating CUDA Graph Algorithms at Maximum Warp * [paper](https://stanford-ppl.github.io/website/papers/ppopp070a-hong.pdf) * Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths * [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6877269) * Scalable GPU Graph Traversal * [paper](https://mgarland.org/files/papers/gpubfs.pdf) * benchmark * Pannotia: Understanding Irregular GPGPU Graph Applications * [paper](https://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf) ## problem of equivalence software solution * atomic * ## solve active/inactive ![](https://i.imgur.com/bRnCHU5.gif) ## graph [pannotia](https://github.com/pannotia/pannotia/wiki) [datasets](https://www.cc.gatech.edu/dimacs10/archive/coauthor.shtml) ## PTX ### SPMV * original code: ``` cvta.to.global.u64 %rd2, %rd13; cvta.to.global.u64 %rd17, %rd12; cvta.to.global.u64 %rd18, %rd10; mul.wide.s32 %rd19, %r15, 4; #move add.s64 %rd26, %rd17, %rd19; #move add.s64 %rd25, %rd18, %rd19; #move BB0_3: .pragma "nounroll"; ld.global.u32 %r14, [%rd26]; # PC=0x100 mul.wide.s32 %rd20, %r14, 4; add.s64 %rd21, %rd2, %rd20; ld.global.f32 %f4, [%rd21]; ld.global.f32 %f5, [%rd25]; mul.f32 %f6, %f5, %f4; atom.shared.add.f32 %f7, [%r3], %f6; add.s64 %rd26, %rd26, 4; add.s64 %rd25, %rd25, 4; add.s32 %r15, %r15, 1; setp.lt.s32 %p3, %r15, %r5; @%p3 bra BB0_3; # PC=0x158 ``` * modified code ``` cvta.to.global.u64 %rd2, %rd13; cvta.to.global.u64 %rd17, %rd12; cvta.to.global.u64 %rd18, %rd10; # PC=0x0e0 add.s32 %r1, %r5, %r15; # PC=0x0e8 BB0_3: .pragma "nounroll"; mul.wide.s32 %rd19, %r15, 4; # PC=0x0f0 add.s64 %rd26, %rd17, %rd19; add.s64 %rd25, %rd18, %rd19; ld.global.u32 %r14, [%rd26]; mul.wide.s32 %rd20, %r14, 4; add.s64 %rd21, %rd2, %rd20; ld.global.f32 %f4, [%rd21]; ld.global.f32 %f5, [%rd25]; mul.f32 %f6, %f5, %f4; atom.shared.add.f32 %f7, [%r3], %f6; @%p3 bra BB0_3; # PC=0x140 ``` ### Pagerank * original code : ```c= #pragma unroll 1 for (int edge = start; edge < end; edge++) { int nid = col[edge]; atomicAdd(&page_rank2[nid], page_rank1[tid] / (float)(end - start)); } ``` ```= mul.wide.s32 %rd16, %r16, 4; add.s64 %rd19, %rd14, %rd16; //0x0e8 create table BB0_5: .pragma "nounroll"; ld.global.u32 %r14, [%rd19]; //0x0f0 col[edge] mul.wide.s32 %rd17, %r14, 4; add.s64 %rd18, %rd2, %rd17; ld.global.f32 %f2, [%rd3]; div.rn.f32 %f3, %f2, %f1; atom.global.add.f32 %f4, [%rd18], %f3; add.s64 %rd19, %rd19, 4; //0x120 edge++ add.s32 %r16, %r16, 1; //0x128 loop counter setp.lt.s32 %p4, %r16, %r15; //0x130 loop counter < iteration @%p4 bra BB0_5; //0x138 branch ``` * instruction找到0x0e8就換方式執行(還沒建表格) : ```= sub.s32 %r13, %r15, %r16; //PC = 0x0d0 cvt.rn.f32.s32%f1, %r13; //PC = 0x0d8 BB0_5: .pragma "nounroll"; mul.wide.s32 %rd16, %r16, 4; //PC = 0x0e0 r16 : index,從table拿 add.s64 %rd19, %rd14, %rd16; //PC = 0x0e8 rd14 : base address ld.global.u32 %r14, [%rd19]; mul.wide.s32 %rd17, %r14, 4; add.s64 %rd18, %rd2, %rd17; ld.global.f32 %f2, [%rd3]; div.rn.f32 %f3, %f2, %f1; atom.global.add.f32 %f4, [%rd18], %f3; //PC = 0x118 add.s32 %r16, %r16, 1; setp.lt.s32 %p4, %r16, %r15; @%p4 bra BB0_5; //PC = 0x130 ``` * %r16 : start, %r15 : end, 建表格 * 根據表格抓rd19的值 * gpu-sim要得到的資訊 : * start index * end index * last codeblock PC (第11行) * exit PC (第15行) ```= //PC = 0x0c8 sub.s32 %r13, %r15, %r16; //PC = 0x0d0 cvt.rn.f32.s32%f1, %r13; //PC = 0x0d8 create table add.s32 %r1, %r15, %r16; //PC = 0x0e0 set index BB0_5: .pragma "nounroll"; mul.wide.s32 %rd16, %r16, 4; //PC = 0x0e8 r16 : index, 從table拿 add.s64 %rd19, %rd14, %rd16; //PC = 0x0f0 rd14 : base address ld.global.u32 %r14, [%rd19]; //PC = 0x0f8 rd19 redirect??? mul.wide.s32 %rd17, %r14, 4; //PC = 0x100 add.s64 %rd18, %rd2, %rd17; //PC = 0x108 ld.global.f32 %f2, [%rd3]; //PC = 0x110 rd3 redirect??? div.rn.f32 %f3, %f2, %f1; //PC = 0x118 atom.global.add.f32 %f4, [%rd18], %f3; //PC = 0x120 @%p4 bra BB0_5; //PC = 0x128 //PC = 0x130 ``` read reg: write reg: rd16, rd19, r14, rd17, rd18, f2, f3, f4 ### Pagerank_spmv ``` cvta.to.global.u64 %rd2, %rd12; cvta.to.global.u64 %rd17, %rd10; cvta.to.global.u64 %rd18, %rd11; mul.wide.s32 %rd19, %r15, 4; #move add.s64 %rd26, %rd17, %rd19; #move add.s64 %rd25, %rd18, %rd19; #move add.s32 %r1, %r4, %r15; #add BB2_3: .pragma "nounroll"; ld.global.u32 %r14, [%rd26]; #PC=0x4e0 mul.wide.s32 %rd20, %r14, 4; #PC=0x4e8 add.s64 %rd21, %rd2, %rd20; #PC=0x4f0 ld.global.f32 %f4, [%rd21]; #PC=0x4f8 ld.global.f32 %f5, [%rd25]; #PC=0x500 mul.f32 %f6, %f5, %f4; #PC=0x508 atom.shared.add.f32 %f7, [%r5], %f6; #PC=0x510 add.s64 %rd26, %rd26, 4; #PC=0x518 delete add.s64 %rd25, %rd25, 4; #PC=0x520 delete add.s32 %r15, %r15, 1; #PC=0x528 delete setp.lt.s32 %p3, %r15, %r4; #PC=0x530 delete @%p3 bra BB2_3; #PC=0x538 ld.shared.f32 %f10, [%r5]; BB2_5: cvta.to.global.u64 %rd22, %rd13; shl.b64 %rd23, %rd1, 2; add.s64 %rd24, %rd22, %rd23; ld.global.f32 %f8, [%rd24]; add.f32 %f9, %f10, %f8; st.global.f32 [%rd24], %f9; BB2_6: ret; ``` #### parallel ``` cvta.to.global.u64 %rd2, %rd12; cvta.to.global.u64 %rd17, %rd10; cvta.to.global.u64 %rd18, %rd11; #PC=0x4c0 add.s32 %r1, %r4, %r15; #PC=0x4c8 add BB2_3: .pragma "nounroll"; mul.wide.s32 %rd19, %r15, 4; #PC=0x4d0 move add.s64 %rd26, %rd17, %rd19; #PC=0x4d8 move add.s64 %rd25, %rd18, %rd19; #PC=0x4e0 move ld.global.u32 %r14, [%rd26]; #PC=0x4e8 mul.wide.s32 %rd20, %r14, 4; #PC=0x4f0 add.s64 %rd21, %rd2, %rd20; #PC=0x4f8 ld.global.f32 %f4, [%rd21]; #PC=0x500 ld.global.f32 %f5, [%rd25]; #PC=0x508 mul.f32 %f6, %f5, %f4; #PC=0x510 atom.shared.add.f32 %f7, [%r5], %f6; #PC=0x518 @%p3 bra BB2_3; #PC=0x520 ld.shared.f32 %f10, [%r5]; BB2_5: cvta.to.global.u64 %rd22, %rd13; shl.b64 %rd23, %rd1, 2; add.s64 %rd24, %rd22, %rd23; ld.global.f32 %f8, [%rd24]; add.f32 %f9, %f10, %f8; st.global.f32 [%rd24], %f9; BB2_6: ret; ``` ### BC ``` //PC=0x0d0 BB0_4: setp.ge.s32 %p4, %r22, %r21; //PC=0x0d8 @%p4 bra BB0_11; //PC=0x0e0 cvta.to.global.u64 %rd18, %rd11; //PC=0x0e8 cvta.to.global.u64 %rd19, %rd9; //PC=0x0f0 add.s32 %r5, %r13, 1; //PC=0x0f8 add.s64 %rd2, %rd18, %rd14; //PC=0x100 cvta.to.global.u64 %rd4, %rd12; //PC=0x108 create table add.s32 %r1, %r21, %r22; //PC=0x110 extend set index BB0_6: .pragma "nounroll"; mul.wide.s32 %rd21, %r22, 4; //PC=0x118 move, set register add.s64 %rd27, %rd19, %rd21; //PC=0x120 move ld.global.u32 %r7, [%rd27]; //PC=0x128 mul.wide.s32 %rd23, %r7, 4; //PC=0x130 add.s64 %rd6, %rd13, %rd23; //PC=0x138 ld.global.u32 %r23, [%rd6]; //PC=0x140 setp.gt.s32 %p5, %r23, -1; //PC=0x148 @%p5 bra BB0_8; //PC=0x150 mov.u32 %r19, 1; //PC=0x158 st.global.u32 [%rd4], %r19; //PC=0x160 *cont = 1; add.s32 %r20, %r13, 1; //PC=0x168 st.global.u32 [%rd6], %r20; //PC=0x170 d[w] = dist + 1; mov.u32 %r23, %r5; //PC=0x178 BB0_8: setp.ne.s32 %p6, %r23, %r5; //PC=0x180 break point r23: d[w], r5: @%p6 bra BB0_10; //PC=0x188 add.s64 %rd26, %rd18, %rd23; //PC=0x190 ld.global.f32 %f1, [%rd2]; //PC=0x198 atom.global.add.f32 %f2, [%rd26], %f1; //PC=0x1a0 BB0_10: add.s32 %r22, %r22, 1; //delete add.s64 %rd27, %rd27, 4; //delete setp.lt.s32 %p7, %r22, %r21; //delete @%p7 bra BB0_6; //PC=0x1a8 finish or not / destroy table BB0_11: ret; //PC=0x1b0 ``` ### BC_WL #### original ``` cvta.to.global.u64 %rd19, %rd12; cvta.to.global.u64 %rd20, %rd10; add.s32 %r6, %r14, 1; add.s64 %rd2, %rd19, %rd18; #PC=0x100 mul.wide.s32 %rd22, %r23, 4; #PC=0x108 move add.s64 %rd28, %rd20, %rd22; #PC=0x110 move cvta.to.global.u64 %rd4, %rd13; #PC=0x118 cvta.to.global.u64 %rd23, %rd11; #PC=0x120 add.s32 %r1, %r22, %r23; #PC=0x128 add BB0_5: .pragma "nounroll"; ld.global.u32 %r8, [%rd28]; #PC=0x130 mul.wide.s32 %rd24, %r8, 4; add.s64 %rd6, %rd23, %rd24; ld.global.u32 %r24, [%rd6]; setp.gt.s32 %p4, %r24, -1; @%p4 bra BB0_7; mov.u32 %r20, 1; st.global.u32 [%rd4], %r20; add.s32 %r21, %r14, 1; st.global.u32 [%rd6], %r21; mov.u32 %r24, %r6; BB0_7: setp.ne.s32 %p5, %r24, %r6; @%p5 bra BB0_9; add.s64 %rd27, %rd19, %rd24; ld.global.f32 %f1, [%rd2]; atom.global.add.f32 %f2, [%rd27], %f1; BB0_9: add.s32 %r23, %r23, 1; #remove add.s64 %rd28, %rd28, 4; #remove setp.lt.s32 %p6, %r23, %r22; #remove @%p6 bra BB0_5; ``` #### modified ``` cvta.to.global.u64 %rd19, %rd12; cvta.to.global.u64 %rd20, %rd10; add.s32 %r6, %r14, 1; #PC=0x100 add.s64 %rd2, %rd19, %rd18; #PC=0x108 cvta.to.global.u64 %rd4, %rd13; #PC=0x110 cvta.to.global.u64 %rd23, %rd11; #PC=0x118 add.s32 %r1, %r22, %r23; #PC=0x120 BB0_5: .pragma "nounroll"; mul.wide.s32 %rd22, %r23, 4; #PC=0x128 add.s64 %rd28, %rd20, %rd22; ld.global.u32 %r8, [%rd28]; mul.wide.s32 %rd24, %r8, 4; add.s64 %rd6, %rd23, %rd24; ld.global.u32 %r24, [%rd6]; setp.gt.s32 %p4, %r24, -1; @%p4 bra BB0_7; mov.u32 %r20, 1; st.global.u32 [%rd4], %r20; add.s32 %r21, %r14, 1; st.global.u32 [%rd6], %r21; mov.u32 %r24, %r6; BB0_7: setp.ne.s32 %p5, %r24, %r6; @%p5 bra BB0_9; add.s64 %rd27, %rd19, %rd24; ld.global.f32 %f1, [%rd2]; atom.global.add.f32 %f2, [%rd27], %f1; BB0_9: @%p6 bra BB0_5; #PC=0x1b8 ``` ### Color #### original original ``` BB0_4: mov.u32 %r33, -1; setp.ge.s32 %p4, %r2, %r30; @%p4 bra BB0_9; add.s32 %r5, %r30, -1; mov.u32 %r33, -1; cvta.to.global.u64 %rd17, %rd7; mul.wide.s32 %rd18, %r2, 4; //remove add.s64 %rd28, %rd17, %rd18; //remove cvta.to.global.u64 %rd3, %rd10; cvta.to.global.u64 %rd22, %rd8; mov.u32 %r31, %r2; //create table add.s32 %r1, %r31, %r30; //add, set index to table BB0_6: .pragma "nounroll"; mul.wide.s32 %rd18, %r31, 4; //move add.s64 %rd28, %rd17, %rd18; //move ld.global.u32 %r25, [%rd28]; //PC=0x130 mul.wide.s32 %rd20, %r25, 4; add.s64 %rd21, %rd12, %rd20; ld.global.u32 %r26, [%rd21]; setp.eq.s32 %p5, %r26, -1; setp.ne.s32 %p6, %r2, %r5; and.pred %p7, %p5, %p6; @!%p7 bra BB0_8; bra.uni BB0_7; BB0_7: mov.u32 %r27, 1; st.global.u32 [%rd3], %r27; ld.global.u32 %r28, [%rd28]; mul.wide.s32 %rd23, %r28, 4; add.s64 %rd24, %rd22, %rd23; ld.global.u32 %r29, [%rd24]; max.s32 %r33, %r29, %r33; BB0_8: add.s64 %rd28, %rd28, 4; //remove add.s32 %r31, %r31, 1; //remove setp.lt.s32 %p8, %r31, %r30; //remove r31 : index, r30 : end @%p8 bra BB0_6; BB0_9: cvta.to.global.u64 %rd25, %rd11; add.s64 %rd27, %rd25, %rd13; st.global.u32 [%rd27], %r33; BB0_10: ret; ``` modified ``` BB0_4: mov.u32 %r33, -1; setp.ge.s32 %p4, %r2, %r30; @%p4 bra BB0_9; add.s32 %r5, %r30, -1; mov.u32 %r33, -1; cvta.to.global.u64 %rd17, %rd7; cvta.to.global.u64 %rd3, %rd10; cvta.to.global.u64 %rd22, %rd8; mov.u32 %r31, %r2; //PC=0x118 create table add.s32 %r1, %r31, %r30; //PC=0x120 add, set index to table BB0_6: .pragma "nounroll"; mul.wide.s32 %rd18, %r31, 4; //PC=0x128 move add.s64 %rd28, %rd17, %rd18; //PC=0x130 move ld.global.u32 %r25, [%rd28]; //PC=0x138 mul.wide.s32 %rd20, %r25, 4; //PC=0x140 add.s64 %rd21, %rd12, %rd20; //PC=0x148 ld.global.u32 %r26, [%rd21]; //PC=0x150 setp.eq.s32 %p5, %r26, -1; //PC=0x158 setp.ne.s32 %p6, %r2, %r5; //PC=0x160 and.pred %p7, %p5, %p6; //PC=0x168 @!%p7 bra BB0_8; //PC=0x170 bra.uni BB0_7; //PC=0x178 BB0_7: mov.u32 %r27, 1; //PC=0x180 st.global.u32 [%rd3], %r27; //PC=0x188 ld.global.u32 %r28, [%rd28]; //PC=0x190 mul.wide.s32 %rd23, %r28, 4; //PC=0x198 add.s64 %rd24, %rd22, %rd23; //PC=0x1a0 ld.global.u32 %r29, [%rd24]; //PC=0x1a8 max.s32 %r33, %r29, %r33; //PC=0x1b0 BB0_8: @%p8 bra BB0_6; //PC=0x1b8 BB0_9: cvta.to.global.u64 %rd25, %rd11; add.s64 %rd27, %rd25, %rd13; st.global.u32 [%rd27], %r33; BB0_10: ret; ``` #### parallel original ``` add.s32 %r6, %r32, -1; cvta.to.global.u64 %rd18, %rd8; mul.wide.s32 %rd19, %r2, 4; //moved add.s64 %rd29, %rd18, %rd19; //moved cvta.to.global.u64 %rd4, %rd11; cvta.to.global.u64 %rd23, %rd9; mov.u32 %r33, %r2; // add.s32 %r1, %r32, %r33; //PC=, add BB0_6: .pragma "nounroll"; mul.wide.s32 %rd19, %r33, 4; //PC=, move add.s64 %rd29, %rd18, %rd19; //PC=, move ld.global.u32 %r22, [%rd29]; //PC=0x158 mul.wide.s32 %rd21, %r22, 4; //60 add.s64 %rd22, %rd13, %rd21; //68 ld.global.u32 %r23, [%rd22]; //70 setp.eq.s32 %p5, %r23, -1; //78 setp.ne.s32 %p6, %r2, %r6; //80 and.pred %p7, %p5, %p6; //88 @!%p7 bra BB0_8; //90 bra.uni BB0_7; //98 BB0_7: mov.u32 %r24, 1; //a0 st.global.u32 [%rd4], %r24; //a8 ld.global.u32 %r25, [%rd29]; //b0 mul.wide.s32 %rd24, %r25, 4; //b8 add.s64 %rd25, %rd23, %rd24; //c0 ld.global.u32 %r26, [%rd25]; //PC=0x1c8 atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d0 BB0_8: add.s32 %r33, %r33, 1; //PC=0x1d8, remove add.s64 %rd29, %rd29, 4; //PC=0x1e0, remove setp.lt.s32 %p8, %r33, %r32; //PC=0x1e8, remove @%p8 bra BB0_6; //PC=0x1f0 ld.shared.u32 %r34, [%r5]; //PC=0x1f8 BB0_10: cvta.to.global.u64 %rd26, %rd12; add.s64 %rd28, %rd26, %rd17; st.global.u32 [%rd28], %r34; BB0_11: ret; ``` modified ``` add.s32 %r6, %r32, -1; cvta.to.global.u64 %rd18, %rd8; cvta.to.global.u64 %rd4, %rd11; cvta.to.global.u64 %rd23, %rd9; mov.u32 %r33, %r2; //PC=0x140 add.s32 %r1, %r32, %r33; //PC=0x148, add BB0_6: .pragma "nounroll"; mul.wide.s32 %rd19, %r33, 4; //PC=0x150, move add.s64 %rd29, %rd18, %rd19; //PC=0x158, move ld.global.u32 %r22, [%rd29]; //PC=0x160 mul.wide.s32 %rd21, %r22, 4; //PC=0x168 add.s64 %rd22, %rd13, %rd21; //PC=0x170 ld.global.u32 %r23, [%rd22]; //PC=0x178 setp.eq.s32 %p5, %r23, -1; //PC=0x180 setp.ne.s32 %p6, %r2, %r6; //PC=0x188 and.pred %p7, %p5, %p6; //PC=0x190 @!%p7 bra BB0_8; //PC=0x198 bra.uni BB0_7; //PC=0x1a0 BB0_7: mov.u32 %r24, 1; //PC=0x1a8 st.global.u32 [%rd4], %r24; //PC=0x1b0 ld.global.u32 %r25, [%rd29]; //PC=0x1b8 mul.wide.s32 %rd24, %r25, 4; //PC=0x1c0 add.s64 %rd25, %rd23, %rd24; //PC=0x1c8 ld.global.u32 %r26, [%rd25]; //PC=0x1d0 atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d8 BB0_8: @%p8 bra BB0_6; //PC=0x1e0 ld.shared.u32 %r34, [%r5]; //PC=0x1e8 BB0_10: cvta.to.global.u64 %rd26, %rd12; add.s64 %rd28, %rd26, %rd17; st.global.u32 [%rd28], %r34; BB0_11: ret; ``` ### Color_wl original ``` add.s32 %r7, %r33, -1; cvta.to.global.u64 %rd18, %rd8; mul.wide.s32 %rd19, %r3, 4; #move add.s64 %rd29, %rd18, %rd19; #move cvta.to.global.u64 %rd3, %rd11; cvta.to.global.u64 %rd20, %rd10; cvta.to.global.u64 %rd23, %rd9; mov.u32 %r34, %r3; add.s32 %r1, %r33, %r34; #add BB0_5: .pragma "nounroll"; ld.global.u32 %r23, [%rd29]; #PC=0x348 mul.wide.s32 %rd21, %r23, 4; add.s64 %rd22, %rd20, %rd21; ld.global.u32 %r24, [%rd22]; setp.eq.s32 %p4, %r24, -1; setp.ne.s32 %p5, %r3, %r7; and.pred %p6, %p4, %p5; @!%p6 bra BB1_7; bra.uni BB1_6; BB1_6: mov.u32 %r25, 1; st.global.u32 [%rd3], %r25; ld.global.u32 %r26, [%rd29]; mul.wide.s32 %rd24, %r26, 4; add.s64 %rd25, %rd23, %rd24; ld.global.u32 %r27, [%rd25]; atom.shared.max.s32 %r32, [%r6], %r27; BB1_7: add.s32 %r34, %r34, 1; #remove add.s64 %rd29, %rd29, 4; #remove setp.lt.s32 %p7, %r34, %r33; #remove @%p7 bra BB1_5; ``` modified ``` add.s32 %r7, %r33, -1; cvta.to.global.u64 %rd18, %rd8; cvta.to.global.u64 %rd3, %rd11; cvta.to.global.u64 %rd20, %rd10; cvta.to.global.u64 %rd23, %rd9; mov.u32 %r34, %r3; #PC=0x140 add.s32 %r1, %r33, %r34; #PC=0x148 BB0_5: .pragma "nounroll"; mul.wide.s32 %rd19, %r34, 4; #PC=0x150 add.s64 %rd29, %rd18, %rd19; ld.global.u32 %r23, [%rd29]; mul.wide.s32 %rd21, %r23, 4; add.s64 %rd22, %rd20, %rd21; ld.global.u32 %r24, [%rd22]; setp.eq.s32 %p4, %r24, -1; setp.ne.s32 %p5, %r3, %r7; and.pred %p6, %p4, %p5; @!%p6 bra BB0_7; bra.uni BB0_6; BB0_6: mov.u32 %r25, 1; st.global.u32 [%rd3], %r25; ld.global.u32 %r26, [%rd29]; mul.wide.s32 %rd24, %r26, 4; add.s64 %rd25, %rd23, %rd24; ld.global.u32 %r27, [%rd25]; atom.shared.max.s32 %r32, [%r6], %r27; BB0_7: @%p7 bra BB0_5; #PC=0x1e0 ``` ### color max min ### origin ``` add.s32 %r5, %r52, -1; cvta.to.global.u64 %rd21, %rd6; mul.wide.s32 %rd22, %r2, 4; #move add.s64 %rd37, %rd21, %rd22; #move cvta.to.global.u64 %rd26, %rd9; cvta.to.global.u64 %rd27, %rd7; mov.u32 %r53, %r2; #PC=0x178 add.s32 %r1, %r52, %r53; #add BB0_6: .pragma "nounroll"; ld.global.u32 %r27, [%rd37]; #PC=0x180 mul.wide.s32 %rd24, %r27, 4; add.s64 %rd25, %rd12, %rd24; ld.global.u32 %r28, [%rd25]; setp.eq.s32 %p5, %r28, -1; setp.ne.s32 %p6, %r2, %r5; and.pred %p7, %p5, %p6; @!%p7 bra BB0_8; bra.uni BB0_7; BB0_7: mov.u32 %r29, 1; st.global.u32 [%rd26], %r29; ld.global.u32 %r30, [%rd37]; mul.wide.s32 %rd28, %r30, 4; add.s64 %rd29, %rd27, %rd28; ld.global.u32 %r31, [%rd29]; atom.shared.max.s32 %r36, [%r24], %r31; ld.global.u32 %r37, [%rd37]; mul.wide.s32 %rd30, %r37, 4; add.s64 %rd31, %rd27, %rd30; ld.global.u32 %r38, [%rd31]; atom.shared.min.s32 %r41, [%r26], %r38; BB0_8: add.s32 %r53, %r53, 1; #delete add.s64 %rd37, %rd37, 4; #delete setp.lt.s32 %p8, %r53, %r52; #delete @%p8 bra BB0_6; #PC=0x240 ld.shared.u32 %r55, [%r24]; ld.shared.u32 %r54, [%r26]; ``` ### parallel ``` add.s32 %r5, %r52, -1; cvta.to.global.u64 %rd21, %rd6; cvta.to.global.u64 %rd26, %rd9; cvta.to.global.u64 %rd27, %rd7; mov.u32 %r53, %r2; #PC=0x168 add.s32 %r1, %r52, %r53; #PC=0x170 add BB0_6: .pragma "nounroll"; mul.wide.s32 %rd22, %r53, 4; #PC=0x178 move add.s64 %rd37, %rd21, %rd22; #PC=0x180 move ld.global.u32 %r27, [%rd37]; #PC=0x mul.wide.s32 %rd24, %r27, 4; add.s64 %rd25, %rd12, %rd24; ld.global.u32 %r28, [%rd25]; setp.eq.s32 %p5, %r28, -1; setp.ne.s32 %p6, %r2, %r5; and.pred %p7, %p5, %p6; @!%p7 bra BB0_8; bra.uni BB0_7; BB0_7: mov.u32 %r29, 1; st.global.u32 [%rd26], %r29; ld.global.u32 %r30, [%rd37]; mul.wide.s32 %rd28, %r30, 4; add.s64 %rd29, %rd27, %rd28; ld.global.u32 %r31, [%rd29]; atom.shared.max.s32 %r36, [%r24], %r31; ld.global.u32 %r37, [%rd37]; mul.wide.s32 %rd30, %r37, 4; add.s64 %rd31, %rd27, %rd30; ld.global.u32 %r38, [%rd31]; atom.shared.min.s32 %r41, [%r26], %r38; BB0_8: @%p8 bra BB0_6; #PC=0x230 ld.shared.u32 %r55, [%r24]; ld.shared.u32 %r54, [%r26]; ``` ### mis #### origin ``` cvta.to.global.u64 %rd4, %rd10; cvta.to.global.u64 %rd19, %rd9; mul.wide.s32 %rd20, %r29, 4; #move add.s64 %rd28, %rd19, %rd20; #move BB1_6: .pragma "nounroll"; ld.global.u32 %r9, [%rd28]; #PC=0x210 mul.wide.s32 %rd21, %r9, 4; add.s64 %rd22, %rd1, %rd21; ld.global.u32 %r25, [%rd22]; setp.ne.s32 %p5, %r25, -1; @%p5 bra BB1_8; add.s64 %rd24, %rd4, %rd21; ld.global.u32 %r26, [%rd24]; atom.shared.min.s32 %r27, [%r6], %r26; BB1_8: add.s32 %r29, %r29, 1; #delete add.s64 %rd28, %rd28, 4; #delete setp.lt.s32 %p6, %r29, %r28; #delete @%p6 bra BB1_6; ld.shared.u32 %r30, [%r6]; BB1_10: cvta.to.global.u64 %rd25, %rd11; add.s64 %rd27, %rd25, %rd18; st.global.u32 [%rd27], %r30; BB1_11: ret; ``` #### modified ``` cvta.to.global.u64 %rd4, %rd10; cvta.to.global.u64 %rd19, %rd9; #PC=0x1f8 add.s32 %r1, %r28, %r29; #PC=0x200 add BB1_6: .pragma "nounroll"; mul.wide.s32 %rd20, %r29, 4; #PC=0x208 move add.s64 %rd28, %rd19, %rd20; #PC=0x210 move ld.global.u32 %r9, [%rd28]; #PC=0x218 mul.wide.s32 %rd21, %r9, 4; #PC=0x220 add.s64 %rd22, %rd1, %rd21; #PC=0x228 ld.global.u32 %r25, [%rd22]; #PC=0x230 setp.ne.s32 %p5, %r25, -1; #PC=0x238 @%p5 bra BB1_8; #PC=0x240 add.s64 %rd24, %rd4, %rd21; #PC=0x248 ld.global.u32 %r26, [%rd24]; #PC=0x250 atom.shared.min.s32 %r27, [%r6], %r26; #PC=0x258 BB1_8: @%p6 bra BB1_6; #PC=0x260 ld.shared.u32 %r30, [%r6]; #PC=0x268 BB1_10: cvta.to.global.u64 %rd25, %rd11; add.s64 %rd27, %rd25, %rd18; st.global.u32 [%rd27], %r30; BB1_11: ret; ``` ### mis_wl #### origin ``` cvta.to.global.u64 %rd2, %rd10; cvta.to.global.u64 %rd3, %rd11; cvta.to.global.u64 %rd20, %rd9; #PC=0x1f8 mul.wide.s32 %rd21, %r30, 4; #PC=0x200 move add.s64 %rd29, %rd20, %rd21; #PC=0x208 move add.s32 %r1, %r29, %r30; #add BB1_5: .pragma "nounroll"; ld.global.u32 %r10, [%rd29]; #PC=0x210 mul.wide.s32 %rd22, %r10, 4; add.s64 %rd23, %rd3, %rd22; ld.global.u32 %r26, [%rd23]; setp.ne.s32 %p4, %r26, -1; @%p4 bra BB1_7; add.s64 %rd25, %rd2, %rd22; ld.global.u32 %r27, [%rd25]; atom.shared.min.s32 %r28, [%r7], %r27; BB1_7: add.s32 %r30, %r30, 1; #remove add.s64 %rd29, %rd29, 4; #remove setp.lt.s32 %p5, %r30, %r29; #remove @%p5 bra BB1_5; ``` #### modified ``` cvta.to.global.u64 %rd2, %rd10; #PC=0x1e8 cvta.to.global.u64 %rd3, %rd11; #PC=0x1f0 cvta.to.global.u64 %rd20, %rd9; #PC=0x1f8 add.s32 %r1, %r29, %r30; #PC=0x200 BB1_5: .pragma "nounroll"; mul.wide.s32 %rd21, %r30, 4; #PC=0x208 add.s64 %rd29, %rd20, %rd21; ld.global.u32 %r10, [%rd29]; mul.wide.s32 %rd22, %r10, 4; add.s64 %rd23, %rd3, %rd22; ld.global.u32 %r26, [%rd23]; setp.ne.s32 %p4, %r26, -1; @%p4 bra BB1_7; add.s64 %rd25, %rd2, %rd22; ld.global.u32 %r27, [%rd25]; atom.shared.min.s32 %r28, [%r7], %r27; BB1_7: @%p5 bra BB1_5; #PC=0x260 ``` ### sssp #### origin ``` cvta.to.global.u64 %rd19, %rd11; cvta.to.global.u64 %rd20, %rd10; mul.wide.s32 %rd21, %r24, 4; #move add.s64 %rd28, %rd20, %rd21; #move add.s64 %rd27, %rd19, %rd21; #move BB0_3: .pragma "nounroll"; ld.global.u32 %r19, [%rd28]; #PC=0x100 mul.wide.s32 %rd22, %r19, 4; add.s64 %rd23, %rd17, %rd22; ld.global.u32 %r20, [%rd23]; ld.global.u32 %r21, [%rd27]; add.s32 %r22, %r20, %r21; atom.shared.min.s32 %r23, [%r6], %r22; add.s64 %rd28, %rd28, 4; #delete add.s64 %rd27, %rd27, 4; #delete add.s32 %r24, %r24, 1; #delete setp.lt.s32 %p3, %r24, %r4; #delete @%p3 bra BB0_3; ld.shared.u32 %r25, [%r6]; BB0_5: cvta.to.global.u64 %rd24, %rd13; shl.b64 %rd25, %rd1, 2; add.s64 %rd26, %rd24, %rd25; st.global.u32 [%rd26], %r25; BB0_6: ret; ``` #### modified ``` cvta.to.global.u64 %rd19, %rd11; cvta.to.global.u64 %rd20, %rd10; #PC=0x0e0 add.s32 %r1, %r4, %r24; #PC=0x0e8 add BB0_3: .pragma "nounroll"; mul.wide.s32 %rd21, %r24, 4; #PC=0x0f0 move add.s64 %rd28, %rd20, %rd21; #PC=0x0f8 move add.s64 %rd27, %rd19, %rd21; #PC=0x100 move ld.global.u32 %r19, [%rd28]; #PC=0x108 mul.wide.s32 %rd22, %r19, 4; #PC=0x110 add.s64 %rd23, %rd17, %rd22; #PC=0x118 ld.global.u32 %r20, [%rd23]; ld.global.u32 %r21, [%rd27]; add.s32 %r22, %r20, %r21; atom.shared.min.s32 %r23, [%r6], %r22; @%p3 bra BB0_3; #PC=0x140 ld.shared.u32 %r25, [%r6]; BB0_5: cvta.to.global.u64 %rd24, %rd13; shl.b64 %rd25, %rd1, 2; add.s64 %rd26, %rd24, %rd25; st.global.u32 [%rd26], %r25; BB0_6: ret; ``` --- ## outcome [spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0) ### normal * pagerank * speedup : 2.02x * iterations : 26.6% * memory access number: 62.9% * BFS * speedup : 1.08x * iterations : 84.9% * memory access number: 41.4% * Color Max(pagerank graph) * speedup : 2.32x * iterations : 26.6% * memory access number: 15.2% * MIS(pagerank graph) * speedup : 1.91x * iterations : 26.6% * memory access number: 21.6% * SSSP * speedup : 1.97x * iterations : 26.6% * memory access number: 13.1% ### bypass l1 cache * Color Max * speedup : 1.49x * iterations : 99.9% * memory access number: 66.2% ### working list * BFS * gpgpu-sim's bug in global atomic increment * might because atom's outcome will not shown immediately * wl code has bug * speedup : x * iterations : % * memory access number: % --- ## Run Benchmarks [experiment spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1338197220) [教授spread sheet](https://docs.google.com/spreadsheets/d/14ibl-NsZcrZzOVNC3gcKF7Ow2tnZt8DKzSlhPU888ak/edit#gid=283140300) ![](https://i.imgur.com/sPtCltQ.png) ### Loop Unroll ```cpp= #pragma unroll 1 for(int i=start; i<end; i++){ ... } ``` * `#pragma unroll N`會告訴compiler要unroll幾層loop * N=1只會unroll一層,即**turn off** loop unroll ### atomic operation * instruciton.cc * atom_impl * thread->m_last_dram_callback.function_wt * cuda-sim.cc * ptx_exec_inst * inst.add_callback(lane_id, last_callback().function_wt, last_callback().instruction, this, threads, true /*atomic*/); * abstract_hardware_model.cc * do_atomic * cb.function_wt(cb.instruction, cb.thread, cb.threads); * abstract_hardware_model.h * add_callback * dram_callback_t ### 1. gpgpusim.config ``` #lemonsien -gpgpu_perfect_mem 0 -gpgpu_concurrent_kernel_sm 1 -scheduler_kind 2 -partition_method 1 ... ... # high level architecture configuration -gpgpu_n_clusters 30 ... ... # Device Limits -gpgpu_kernel_launch_latency 0 ``` -cache allocate policy : 改成on miss, hit reserved才會正確 * 否則都是當成miss -scheduler_kind * 0:origin * 1:paver * 2:paver2 -partition_method * 0:BCS * 1:k-way * 2:recursive(not working) ### 2. Run ``` #pannotia/fw ./fw ../../../../data_dirs/pannotia/fw/data/256_16384.gr &> output_memstall & ``` ``` #pannotia/bc ./bc ../../../../data_dirs/pannotia/bc/data/1k_128k.gr &> output_paver_30groups_MST_ideal & ``` ``` #pannotia/color_max ./color_max ../../../../data_dirs/pannotia/color_max/data/ecology1.graph 1 &> output_paver_30groups_MST_ideal & ``` ``` #pannotai/sssp ./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 1 &> output_paver_30groups_MST_ideal & ``` ``` #pannotia/pagerank ./pagerank ../../../../data_dirs/pannotia/pagerank/data/coAuthorsDBLP.graph 1 &> output_paver2 & ``` ``` #pannotia/sssp ./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 0 &> output_memstall2 & ``` ``` #lonestar/mst ./run &> output_paver_30groups_MST_ideal & ``` ``` #lonestar/sssp ./sssp ../../../../../data_dirs/cuda/lonestargpu-2.0/inputs/USA-road-d.FLA.gr &> output_memstall2 & ``` ``` #mis ./mis data/ecology1.graph 1 &> output_memstall & ``` ## Git http://blog.gogojimmy.net/2012/02/29/git-scenario/ ``` git commit -am "comments" git push -u origin branch_name git pull origin branch_name ``` #### Log in ``` git config --global user.name "<使用者名字>" git config --global user.email "<電子信箱>" ``` check configure: ``` git config --list or cat ~/.gitconfig ``` #### 看狀態 ``` git status ``` #### 看歷史commit ``` git log ``` #### 加入新(untracked)檔案 ``` git add file_name ``` 一次加全部檔案: ``` git add . ``` #### 存檔, 會紀錄改過的檔案的訊息 ``` git commit -m "commit訊息" ``` 先git add .後略過編輯器提交 commit ``` git commit -am "commit訊息" ``` #### Branch 新增branch ``` git branch new_branch ``` 查看有哪些branch ``` git branch ``` 切換branch ``` git checkout target_branch ``` #### merge branch_name到目前的branch ``` git merge branch_name ``` ## Perfect Memory * access memory * 不access memory,"**去memory**" 馬上接 "**回cluster**" ![](https://i.imgur.com/5LjCpZQ.png) ![](https://i.imgur.com/ObHo7W6.png) ### memory stall * kinds * memory_cycle * memory access not done * shared_cycle * constant_cycle * texture_cycle * memory priority * ICNT_RC_FAIL = BK_CONF > COAL_STALL NO_RC_FAIL = 0, BK_CONF, MSHR_RC_FAIL, ICNT_RC_FAIL, COAL_STALL, TLB_STALL, DATA_PORT_STALL, WB_ICNT_RC_FAIL, WB_CACHE_RSRV_FAIL, N_MEM_STAGE_STALL_TYPE ## 相關論文 ### irregular benchmark * Graph-Waving architecture: Efficient execution of graph applications on GPUs * [link](https://www.sciencedirect.com/science/article/pii/S0743731520303889) * node to warp mapping, eliminate duplicate execution, shrink the warp size. * Fine-Grained Management of Thread Blocks for Irregular Applications * [link](https://ieeexplore.ieee.org/abstract/document/8988776?casa_token=mPwWuVZ8uekAAAAA:g3-72tUqQEa0RrzLppSWgs1erUiF5kwMhDruh5wyMgWfSJos8Q9pVCEiALDTIbiXQi0t7yPwGj4) 2019 * irregular application(graph)的狀態會隨著時間而改變(因為少數node關聯度很高) * 用前1個kernel的TB去預測下一個kernel的TB的狀態 * 及時追蹤每一個TB的現在的狀態 * 狀態包括Load-Store unit stall,Floating-Point Unit的使用度 * 根據這些狀態去調整SM中TB的數目,允許preemption * gain 17% and 13% improvements of throughput over previously proposed static and dynamic scheduling strategies, respectively * Dynamic Thread Block Launch: A Lightweight Execution Mechanism toSupport Irregular Applications on GPUs * [link](https://dl.acm.org/doi/epdf/10.1145/2872887.2750393) 2015 * 為了加速irregular application,這篇論文讓thread可以產生child thread block * 譬如:有些graph的每個thread會用for loop去traverse某個node的neighbor,每個node的neighbor不同時,會很分散。若是create一個child kernel或child thread block去concurrent執行所有的neighbor會很有用 * 1.21x speedup over the original flat implementation and average 1.40x over the implementations using device-kernel launch functionality * A quantitative study of irregular programs on GPUs * [link](https://ieeexplore.ieee.org/abstract/document/6402918) * Pannotia: Understanding irregular GPGPU graph applications * [link](https://ieeexplore.ieee.org/abstract/document/6704684) * Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription * [link](https://ieeexplore.ieee.org/document/9139797) * working sets of irregular applications can often be divided into cold and hot data structures. * cold allocations are accessed seldom and sparsely * hot data structures have dense sequential access * propose a dynamic page placement strategy for irregular general purpose applications differentiate between hot and cold allocations * performance improvement of 22% to 78% * Optimizing non-coalesced memory access for irregular applications with GPU computing * [link](https://link.springer.com/content/pdf/10.1631/FITEE.1900262.pdf) 2020 * software solution with no hardware extensions to solve dynamic irregularities * data reordering * eliminate indirect memory access by copying to a new location and reordering * Index redirection * reduce overhead * for loop拿到kernel外面,run multiple streams of kernels(1 for count, others for remap) * For loop-carried applications, reordered data are cached for the next iteration or other kernels to eliminate possible redundant data reorganization * 7.0%–7.17% overall performance improvement * Managing DRAM Latency Divergence in Irregular GPGPU Applications * [link](https://www.cs.utah.edu/~rajeev/pubs/sc14.pdf) 2014 * different requests from a warp are serviced together, or in quick succession, without interference from other warps * warp-aware memory controllers * boost performance by 10.1% * Graph Processing on GPUs: A Survey * [link](https://www.dcs.warwick.ac.uk/~liganghe/papers/ACM-Computing-Surveys-2017.pdf) 2018 * graph processing * data layout * memory access pattern * workload mapping * GPU programming * new research challenges and opportunities * Graph processing on hybrid systems * Graph processing on new GPU architecture * Dynamic graph processing on GPUs * the graph structure can be updated frequently in runtime * dynamically maintain balanced workload mapping with the rapid changes of graphs * Machine-learning applications * Anatomy of GPU Memory System for Multi-Application Execution * [link](https://www.cs.utexas.edu/users/skeckler/pubs/MEMSYS_2015_Anatomy.pdf) 2015 * study the memory system of GPUs in a concurrently executing **multi-application** environment * **L2-MPKI** and **attained bandwidth** are the two knobs that can be used to drive memory scheduling decisions for achieving better performance * Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads * [link](https://arxiv.org/pdf/2007.07131.pdf) * 1.33x and 13% improvement in performance and energy savings respectively * On-the-Fly Elimination of Dynamic Irregularities for GPU Computing * [link](https://dl.acm.org/doi/pdf/10.1145/1950365.1950408) * On-GPU Thread-Data Remapping for Branch Divergence Reduction * [link](https://dl.acm.org/doi/10.1145/3242089) * Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations * [link](https://ieeexplore.ieee.org/abstract/document/7349653) * use compiler to transform the original code to the template * Spare register aware prefetching for graph algorithms on GPUs * [link](https://ieeexplore.ieee.org/abstract/document/6835970) * use spare register to store the prefetch edge ### memory access divergence * DaCache: Memory Divergence-Aware GPU Cache Management ### pre-launch (inter-kernel) * BlockMaestro: Enabling Programmer-Transparent Task-based Execution in GPU Systems * [link](https://ieeexplore.ieee.org/document/9499869) * 用bipartite graph找出有dependency(locality)的TBs,並讓後面的TB pre-launch以減少launch time * 用編譯器就能解決,所以不用改code * **input-dependent** is not working ### inter-kernel locality * Inter-kernel reuse-aware thread block scheduling * group TBs * 下一個kernel的group與上一個kernel對齊 * reverse * work stealing,一次一個 ### intra-kernel locality * PAVER: Locality Graph-Based Thread Block Scheduling for GPUs * [link](https://dl.acm.org/doi/fullHtml/10.1145/3451164) 2021 * 用graph theory將TBs的locality都找出來 * 用graph partition將graph切成groups,每個group分給一個SM * work stealing,偷不只一個以維持locality * **input-dependent** is not working * [note](https://hackmd.io/q-vSikPLSdmNqHPz1GwQwA?view) * performance speedup by 29%, 49.1%, and 41.2% * Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling * [link](https://ieeexplore.ieee.org/abstract/document/7898501) * run time時才計算出與目前SM中的TBs最適合的TB作為候選人 * **input-dependent** is not working * Improving GPGPU resource utilization through alternative thread block scheduling * [link](https://ieeexplore.ieee.org/abstract/document/6835937?casa_token=jQrF9KO5vLMAAAAA:1_H9ANBV8mdo4iuC_Nz5bmcPPx4aKsjx4l15Y_1WnhdE8RBbm2hvXO6kdiSmRKZfK2NCJIoG_AU) * workload type * TB increase, performance increase * TB increase, performance increase then saturate * TB increase, performance decrease * TB increase, performance increase then decrease * core activity * ACTIVE * INACTIVE * Idle * Mem_stall * Core_stall * * SM裡的TBs數目會影響performance,最大未必會最好 * <img src="https://i.imgur.com/udazh1W.png" width="40%" height="40%" /> * consecutive CTAs are assigned to the same cores ### analyze data locality * LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs ### warp scheduler * A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization ### multiple streams scheduler * Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications * Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels * Increasing GPU throughput using kernel interleaved thread block scheduling ### graph * Practical Optimal Cache Replacement for Graphs * 用transpose(CSC<->CSR)快速找到cache中下一次被存取時間最久的cache line * 適用traverse graph的應用(in/out) ## Time: 1/21/2022 ~ 1/26/2022 * goal * sssp_spmv, parboil: spmv, * node to warp mapping * data driven ## Time: 1/12/2022 ~ 1/20/2022 * progress * pagerank 5 graph testing * pagerank spmv * color max min ## Time: 10/21/2021 ~ 10/28/2021 * child thread block * change program? * warp formation * can do? * TB formation ## Time: 10/18/2021 ~ 10/25/2021 * 了解CSB * pagerank 問題 * 壓縮sparse matrix * cuSha ## Time: 10/14/2021 ~ 10/21/2021 * ~~color攤開第三層~~ * sssp跑NY * color跑大圖G3 * 改其他benchmark * ~~pagerank結果寫到spread sheet上~~ * pagerank問題 * atomicAdd ? * ITER拿掉 * 每個kernel看perfect memory / cache.nonflush * momory bound問題? * outcome * color-max * 有快一點 * memory access減少,但miss rate上升 * 攤開node value會更慢,memory access會再會再減少,但miss ra會再te * graph很balance * page-rank * 慢很多 * memory access減少,但miss rate上升 * graph很不balance ## Issue * race condition (sssp : min) * possible solution : atomic or synchronization * 兩層 for loop 不能攤開 ## Time: 10/7/2021 ~ 10/14/2021 * 每個kernel印出remap後的cycle * remap的跑在GPU上看看 (debug比較快) * 解決branch divergence(sort 每個vertex的edge數) * gpgpu-sim_simulations/benchmarks/src/cuda/rodinia/3.1/cuda/bfs/ ## Time: 9/24/2021 ~ 9/30/2021 看memory coaleasing有沒有出問題 memory bound是bound在哪邊 參考data reorder(no-coaleased) ## Time: 9/6/2021 - 9/12/2021 林孟賢:加上ideal cache的miss rate, 個別SM的cycle數(為了判斷 load balance), 每個kernel的miss rate 席致寧:group裡加上mst, 跑benchmark ## Time: 8/23/2021 - 9/5/2021 林孟賢:實做paver 席致寧:跑bechmark ## Time: 8/6/2021 - 8/12/2021 * embeddingbag 修改好embeddingbag cache access patern of embedding with 2 keys ``` -3221225472 57 => input(indices) -3221241536 1 -3221241472 7909 => weightFeat(weight) -3221241408 1 -3221241344 7909 => weightFeat(weight) -3221241152 1 -3221241088 7445 => weightFeat(weight) -3221241280 1 -3221241216 7445 => weightFeat(weight) -3221225536 1 -3221225600 57 => input(indices) -3221225664 1 -3221225728 57 -3221225792 1 -3221225856 57 -3221225920 1 -3221225984 57 ``` * 結論 * threads random access(跟input有關) weightFeat, 且locality會隨著dictionary增大而消失 * MIS * 印addr access number * shared/not shared histogram * 觀察 * [spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1340093626 ) * 在shared和not shared的locality都很高的情況下,我們的cache分段方法沒有優勢。 * 如果跑在多個SM上能保留shared而讓,not shared的locality消失,那我們的方法或許會奏效。但我們的shared是根據inter kernel的locality去找的,在intra kernel是否有效還很難說 * 如果我們的方法(PCG)可以自由的hit, pending hit, sector miss, 那在4/6分之下可以達到和LRU差不多的結果 * Terminology * ideal: 只考慮某個cache line被**存取**幾次 * normal: 只考慮某個cache line被hit(包含第一次miss)幾次,不考慮pending hit和sector miss等 * shared: 可能表示**我們標的pc**或**高access次數的pc**所access到的cache line * not shared: 與shared相對 * both: shared與not shared都有access到的cache line * Outcome * ideal與LRU的比較 * kernel 2 * ideal, miss rate 0.243 <img src="https://i.imgur.com/l9ygFRx.png" width="50%" height="50%" /> * LRU, miss rate 0.287 <img src="https://i.imgur.com/lkn0XIW.png" width="50%" height="50%" /> * 結論 * LRU已經很接近ideal(total miss rate 0.54/0.48) * 相比於shared, not shared的cache line反倒是沒有很好的keep住 * bypass l1 cache * 比較慢 * high frequency ps has high priority to stay in cache(with LRU) * 比較慢 * ideal算法 * block addr要改成addr * sector miss要考慮進去 * miss rate算法 * tag_array算出來的是read的miss rate(剛好是我們要的) * 個別MISS或SCTOR_MISS與gpgpu-sim的有落差,但總和一樣(所以在我們的case下沒差) ## Time: 7/30/2021 - 8/5/2021 * cache切成兩部份 * shared(80%) * not shared(20%) * replacement policy * shared * LIFO * LRU * not shared * LRU * compared with ideal * cta scheduler reverse --- * cache line加上counter, 紀錄hit次數 * set l1 to tag_array * print cache access * s1 * 1 cta * 1 warp * sector miss要考慮 * future work * print cta access status ## Time: 7/22/2021 - 7/29/2021 * 將call/assign by reference/value實做出來 * 在vec-add上實做,結果符合預期 * ![](https://i.imgur.com/6h0zdO4.png) * ![](https://i.imgur.com/uN6IPIJ.png) ## Time: 7/19/2021 - 7/26/2021 * 7/16 * c的基本語法ok * bear沒有試成功,所以目前無法編譯需要make的cuda code * 7/19 * 開始做data member的部分 * 7/20 * data member的部分大致可以 * 發現member function有問題 * 7/21 * 修改data member的data structure以符合member function的操作 * 開始做member function * 7/22 * member function基本的case可以 * 發現member function在找data member時,若遇上inheritance會有問題 * 找到inherited的class * 待辦 * **declare** to **definition** (getDefinition) * 從全部視為pass by reference改成pass by reference/value * alias(變數的傳遞) * 考慮control flow(if, else等) ## Time: 7/12/2021 - 7/19/2021 ### 1. Successes Last Week <!-- * What you managed/plan to accomplish this week - hopefully in line with goals - if not that's okay too. --> * 席致寧 * 7/9 ~ 7/16 * [[筆記] build and run gpgpu-sim](https://hackmd.io/WsOKV8MmTfS1-Y5gyrS2UQ?view) * 7/16 ~ 7/23 * c_Kernel從constant memory改成global memory * for loop 改成用 multiple stream * [[筆記] streams & memory](https://hackmd.io/WTA8Vz_8Slyveitd7JDYMw?view) * 7/23 ~ 7/30 * [BuildMI()](https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class) : extend machine instruction * [[筆記]memory](https://hackmd.io/MFf9C4AHRU2z6oy6X9DRXg) * [[筆記]LLVM](https://hackmd.io/HKv_OZvXSmWvkLXHTu6BaA) * 7/30 ~ 8/6 * [NVPTXInstrInfo.td](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td) * 8/6 ~ 8/13 * compiling LLVM * 8/13 ~ 8/20 * [P-OPT: Practical Optimal Cache Replacement for Graph Analytics Architectures](https://hackmd.io/iCDyMqTNQ8a1DzwtSSV4Jg) * [Internal of LLVM](https://hackmd.io/bwlGi-ntTxCF1pTvB0N-YA) ### 2. Progress and Problems Last Week <!-- * What have you done last week ? * What problems encountered last week ? * What caused you to be slow on a particular task? What roadblocks did you encounter? --> * 跟**李睦樂**討論之後覺得理論上做的出來,但實做細節與clang API是否支援等現實層面的問題還在摸索中 * 限制條件為若有些kernel的parameter是runtime時才決定的 * 譬如:C = A+B的C與A在runtime時才得知是指到同一個obj,那我們在compile時就不會知道A的值是非read-only的 * 目前的想法為每一個變數與obj設置counter,當寫入時counter+1,最後看哪些變數符合條件 <font color= #DCDCDC> * 為每一個**變數**配置一個資料結構用來紀錄**write次數**, **類型(normal, array, pointer, reference)**, **指標(指到其他人的資料結構)** * 最後根據每一個變數**自己的資料結構**和**指到的資料結構**去判斷是否為constant * 超過2就非constant * 需要trace **call by value**, **call by reference**等實做方法(在找出哪些是我們要的變數上遇到了問題,因為 **=左右** 都可能存在不只一個變數) * 判斷 **= 右邊**的變數的level情況 * *: level-1 * &: level+1 * 二元運算子: 留下小的 * 用post order traversal的方法搭配stack找出level小的與 **= 左邊**的變數關聯 * 搭配children()用法 * http://web.mit.edu/freebsd/head/contrib/llvm/tools/clang/lib/AST/StmtPrinter.cpp </font> * write 指令包括 assign, compound assign, inc/dec * 被 write 的 variable 為 **=左邊** 的第一個variable(可能會有問題) <!--* 用bear -vvvv make這調指令以獲得compile_commands.json--> * 在sp和mst, **bear** 和 **compiledb** 無法取得compile_commands.json ### 3. Not Resolved Problems <!--* The checking list of problems you have not resolved yet.--> * 在找出哪些是我們要的變數上遇到了問題,因為 **=左右** 都可能存在不只一個變數 ### 4. Goals for next week * 席致寧 * shaer.cc (/src/gpgpu-sim) * issue_warp * 在TB裡面建立一個array,蒐集ld.global的address,找不同TB裡的交集 * warp_inst_t (/src/abstract_hardware_model.cc) * class warp_inst_t() * 加array來存address * memory_space_t (/src/abstract_hardware_model.cc) * is_global() * 林孟賢 * assign一律考慮為pass by reference,並run在不同cuda benchmark裡,確保cuda code可以run成功,並找出可能的問題 * vecadd: ok * ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables test_case/vector-add.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread * ![](https://i.imgur.com/0vceiiJ.png) * pathfinder: * ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables pathfinder.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread * ![](https://i.imgur.com/h0Owq9L.png)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password
    or
    Sign in via Google Sign in via Facebook Sign in via X(Twitter) Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    By signing in, you agree to our terms of service.

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully