---
# System prepended metadata

title: Persistent Caching
tags: [GPU]

---

# Persistent Caching
###### tags: `GPU`
###### members: @林孟賢 @席致寧
[paper](https://www.overleaf.com/3161447213kchpscspbmxk)
[proposal](https://docs.google.com/document/d/1LPxwv7mzae4n1qHnJnDLL_UtjeJf0ngaChi5H_2DfHM/edit?usp=sharing)
[report](https://drive.google.com/file/d/152gvtqASZaRv1jkt5Guig8FHHtJiFg0R/view?usp=sharing)
[spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0)
[slide](https://docs.google.com/presentation/d/1Thz_tu13iYqtAW99xmj1XgWEpikkIbobYvLYku2fkGw/edit?usp=sharing)
[oral test](https://docs.google.com/presentation/d/12lGzoFUw9xnw112_kMmgaDO5RW0ygAmvc0cVbPRxzM4/edit?usp=sharing)
[reference ppaer](https://hackmd.io/867BaaD-QIe1ytWs7FXYpQ)
[DAC paper bullet point](https://hackmd.io/ES09vnUQT-KgcvPX5h2fVw?both)

## Idea
![](https://i.imgur.com/vIoz63Q.png)

![](https://i.imgur.com/oMXxn4J.png)

<!-- ![](https://i.imgur.com/CycLrc0.png)-->

![](https://i.imgur.com/kcoBzo5.png) 

![](https://i.imgur.com/bxPxGry.png)

* spare register([reference paper](https://ieeexplore.ieee.org/abstract/document/6835970))
    * Due to resource constraints there may be unallocated registers in a GPU.
    * If a thread warp has terminated, then registers allocated to the warp are free.

* register file access
    * ![](https://i.imgur.com/9QIySp9.png)
    * new version of gpgpu-sim contain **Operand collector** which contain a cross-bar, and can access all register file
    * ![](https://i.imgur.com/pdAiXtc.png)


* compiler assist
    * programming interface like openMP
    * need user to specify the independent for loop
    * need compiler to calculate the usage of register numbers in each iteration.
    * **到達定植分析**可以找出**循環不便變量**

* runtime
    * in runtime we need a table to record the index of the for loop in each thread

### Compare to Software Solution

additional space required by software solution : 
* `length[]` 
    * 紀錄一個warp裡面每條thread所需要的iteration 
    * additional space : O(V)
* `start[]`
    * 紀錄一個warp裡面每條thread的起始位置
    * additional space : O(V)
* `origin_thread[]`
    * 紀錄每條邊來自於那一條thread
    * additional space : O(E)

additional space required by hardware solution :
* `index_table`
    * 紀錄一個warp裡面每條thread的start 跟 end
    * additional space : O(V)

1. hardware solution是rum time生成table, 當某條warp完成運算後即可release掉空間，所以實際上花的空間不會那麼多
2. software solution 會有 preprocessing kernel overhead
3. software solution要先將data(start, length, ...)存在global memory，hardware solution則是extend hardware component來存run time生成的table
4. software solution的memory access次數會變多

## Related paper list
* hardware
    *   Graph-Waving architecture: Efficient execution of graph applications on GPUs
        *    map each node to a warp instead of thread
        *    need to change the user’s code
        *    need to eliminate duplicate scaler instruction
    *   Dynamic Thread Block Launch: A Lightweight Execution Mechanism toSupport Irregular Applications on GPUs
        *    use dynamic launch “thread block” to replace dynamic launch “kernel”, thus alleviate the overhead of the “kernel” and increase parallelism
        *    not fully utilized the SIMD lane when degree of the node is small
    *   GraphPEG: Accelerating Graph Processing on GPUs
        *    use “hardware function” to collect needed edges to the “shared memory”, then process on the collected edge
        *    solve all the aforemention problem, but add many additional hardwares
    *    Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists
    *    Active Thread Compaction for GPU Path Tracing

*   software
    *   gunrock
    *   Accelerating CUDA Graph Algorithms at Maximum Warp
        *    [paper](https://stanford-ppl.github.io/website/papers/ppopp070a-hong.pdf)
    *   Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths
        *    [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6877269)
    *   Scalable GPU Graph Traversal
        * [paper](https://mgarland.org/files/papers/gpubfs.pdf)    
*   benchmark
    *   Pannotia: Understanding Irregular GPGPU Graph Applications
        *   [paper](https://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf) 

## problem of equivalence software solution
*    atomic
*    

## solve active/inactive
![](https://i.imgur.com/bRnCHU5.gif)

## graph
[pannotia](https://github.com/pannotia/pannotia/wiki)
[datasets](https://www.cc.gatech.edu/dimacs10/archive/coauthor.shtml)

## PTX
### SPMV
* original code:
```
cvta.to.global.u64 %rd2, %rd13;
cvta.to.global.u64 %rd17, %rd12;
cvta.to.global.u64 %rd18, %rd10;
mul.wide.s32 %rd19, %r15, 4;            #move
add.s64 %rd26, %rd17, %rd19;            #move
add.s64 %rd25, %rd18, %rd19;            #move

BB0_3:
.pragma "nounroll";
ld.global.u32 %r14, [%rd26];           # PC=0x100
mul.wide.s32 %rd20, %r14, 4;
add.s64 %rd21, %rd2, %rd20;
ld.global.f32 %f4, [%rd21];
ld.global.f32 %f5, [%rd25];
mul.f32 %f6, %f5, %f4;
atom.shared.add.f32 %f7, [%r3], %f6;
add.s64 %rd26, %rd26, 4;
add.s64 %rd25, %rd25, 4;
add.s32 %r15, %r15, 1;
setp.lt.s32     %p3, %r15, %r5;
@%p3 bra BB0_3;                        # PC=0x158
```
* modified code
```
cvta.to.global.u64 %rd2, %rd13;
cvta.to.global.u64 %rd17, %rd12;
cvta.to.global.u64 %rd18, %rd10;       # PC=0x0e0
add.s32 %r1, %r5, %r15;                # PC=0x0e8
BB0_3:
.pragma "nounroll";
mul.wide.s32 %rd19, %r15, 4;           # PC=0x0f0
add.s64 %rd26, %rd17, %rd19;
add.s64 %rd25, %rd18, %rd19;
ld.global.u32 %r14, [%rd26];
mul.wide.s32 %rd20, %r14, 4;
add.s64 %rd21, %rd2, %rd20;
ld.global.f32 %f4, [%rd21];
ld.global.f32 %f5, [%rd25];
mul.f32 %f6, %f5, %f4;
atom.shared.add.f32 %f7, [%r3], %f6;
@%p3 bra BB0_3;                        # PC=0x140
```

### Pagerank
* original code : 

```c=
#pragma unroll 1
for (int edge = start; edge < end; edge++) {
    int nid = col[edge];
    atomicAdd(&page_rank2[nid], page_rank1[tid] / (float)(end - start));
}
```

```=
mul.wide.s32 %rd16, %r16, 4;
add.s64 %rd19, %rd14, %rd16;            //0x0e8 create table

BB0_5:
.pragma "nounroll";
ld.global.u32 %r14, [%rd19];            //0x0f0 col[edge]
mul.wide.s32 %rd17, %r14, 4;
add.s64 %rd18, %rd2, %rd17;
ld.global.f32 %f2, [%rd3];
div.rn.f32 %f3, %f2, %f1;
atom.global.add.f32 %f4, [%rd18], %f3;
add.s64 %rd19, %rd19, 4;               //0x120 edge++
add.s32 %r16, %r16, 1;                 //0x128 loop counter
setp.lt.s32     %p4, %r16, %r15;       //0x130 loop counter < iteration
@%p4 bra BB0_5;                        //0x138 branch
```
* instruction找到0x0e8就換方式執行(還沒建表格) : 

```=
sub.s32 %r13, %r15, %r16;               //PC = 0x0d0
cvt.rn.f32.s32%f1, %r13;                //PC = 0x0d8

BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd16, %r16, 4;            //PC = 0x0e0 r16 : index,從table拿
add.s64 %rd19, %rd14, %rd16;            //PC = 0x0e8 rd14 : base address
ld.global.u32 %r14, [%rd19];            
mul.wide.s32 %rd17, %r14, 4;
add.s64 %rd18, %rd2, %rd17;
ld.global.f32 %f2, [%rd3];
div.rn.f32 %f3, %f2, %f1;
atom.global.add.f32 %f4, [%rd18], %f3;  //PC = 0x118
add.s32 %r16, %r16, 1;                 
setp.lt.s32     %p4, %r16, %r15;       
@%p4 bra BB0_5;                         //PC = 0x130
```

* %r16 : start, %r15 : end, 建表格
* 根據表格抓rd19的值
* gpu-sim要得到的資訊 ：
    * start index
    * end index
    * last codeblock PC (第11行)
    * exit PC (第15行)

```=
                                        //PC = 0x0c8 
sub.s32 %r13, %r15, %r16;               //PC = 0x0d0 
cvt.rn.f32.s32%f1, %r13;                //PC = 0x0d8 create table

add.s32 %r1, %r15, %r16;                //PC = 0x0e0 set index
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd16, %r16, 4;            //PC = 0x0e8 r16 : index, 從table拿
add.s64 %rd19, %rd14, %rd16;            //PC = 0x0f0 rd14 : base address
ld.global.u32 %r14, [%rd19];            //PC = 0x0f8 rd19 redirect???           
mul.wide.s32 %rd17, %r14, 4;            //PC = 0x100
add.s64 %rd18, %rd2, %rd17;             //PC = 0x108
ld.global.f32 %f2, [%rd3];              //PC = 0x110 rd3 redirect???
div.rn.f32 %f3, %f2, %f1;               //PC = 0x118
atom.global.add.f32 %f4, [%rd18], %f3;  //PC = 0x120                    
@%p4 bra BB0_5;                         //PC = 0x128
                                        //PC = 0x130
```
read reg: 
write reg: rd16, rd19, r14, rd17, rd18, f2, f3, f4

### Pagerank_spmv
```
cvta.to.global.u64 %rd2, %rd12;
cvta.to.global.u64 %rd17, %rd10;
cvta.to.global.u64 %rd18, %rd11;
mul.wide.s32 %rd19, %r15, 4;            #move
add.s64 %rd26, %rd17, %rd19;            #move
add.s64 %rd25, %rd18, %rd19;            #move

add.s32 %r1, %r4, %r15;                #add

BB2_3:
.pragma "nounroll";
ld.global.u32 %r14, [%rd26];            #PC=0x4e0
mul.wide.s32 %rd20, %r14, 4;            #PC=0x4e8
add.s64 %rd21, %rd2, %rd20;             #PC=0x4f0
ld.global.f32 %f4, [%rd21];             #PC=0x4f8
ld.global.f32 %f5, [%rd25];             #PC=0x500
mul.f32 %f6, %f5, %f4;                  #PC=0x508
atom.shared.add.f32 %f7, [%r5], %f6;    #PC=0x510
add.s64 %rd26, %rd26, 4;                #PC=0x518 delete
add.s64 %rd25, %rd25, 4;                #PC=0x520 delete
add.s32 %r15, %r15, 1;                  #PC=0x528 delete
setp.lt.s32     %p3, %r15, %r4;         #PC=0x530 delete
@%p3 bra BB2_3;                         #PC=0x538

ld.shared.f32 %f10, [%r5];

BB2_5:
cvta.to.global.u64 %rd22, %rd13;
shl.b64 %rd23, %rd1, 2;
add.s64 %rd24, %rd22, %rd23;
ld.global.f32 %f8, [%rd24];
add.f32 %f9, %f10, %f8;
st.global.f32 [%rd24], %f9;

BB2_6:
ret;
```
#### parallel
```
cvta.to.global.u64 %rd2, %rd12;
cvta.to.global.u64 %rd17, %rd10;
cvta.to.global.u64 %rd18, %rd11;        #PC=0x4c0
add.s32 %r1, %r4, %r15;                 #PC=0x4c8 add

BB2_3:
.pragma "nounroll";

mul.wide.s32 %rd19, %r15, 4;            #PC=0x4d0 move
add.s64 %rd26, %rd17, %rd19;            #PC=0x4d8 move
add.s64 %rd25, %rd18, %rd19;            #PC=0x4e0 move

ld.global.u32 %r14, [%rd26];            #PC=0x4e8
mul.wide.s32 %rd20, %r14, 4;            #PC=0x4f0
add.s64 %rd21, %rd2, %rd20;             #PC=0x4f8
ld.global.f32 %f4, [%rd21];             #PC=0x500
ld.global.f32 %f5, [%rd25];             #PC=0x508
mul.f32 %f6, %f5, %f4;                  #PC=0x510
atom.shared.add.f32 %f7, [%r5], %f6;    #PC=0x518

@%p3 bra BB2_3;                         #PC=0x520

ld.shared.f32 %f10, [%r5];

BB2_5:
cvta.to.global.u64 %rd22, %rd13;
shl.b64 %rd23, %rd1, 2;
add.s64 %rd24, %rd22, %rd23;
ld.global.f32 %f8, [%rd24];
add.f32 %f9, %f10, %f8;
st.global.f32 [%rd24], %f9;

BB2_6:
ret;
```

### BC
```
                                        //PC=0x0d0 
BB0_4:
setp.ge.s32     %p4, %r22, %r21;        //PC=0x0d8 
@%p4 bra BB0_11;                        //PC=0x0e0

cvta.to.global.u64 %rd18, %rd11;        //PC=0x0e8
cvta.to.global.u64 %rd19, %rd9;         //PC=0x0f0
add.s32 %r5, %r13, 1;                   //PC=0x0f8
add.s64 %rd2, %rd18, %rd14;             //PC=0x100

cvta.to.global.u64 %rd4, %rd12;         //PC=0x108 create table

add.s32 %r1, %r21, %r22;                //PC=0x110 extend set index
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd21, %r22, 4;            //PC=0x118 move, set register
add.s64 %rd27, %rd19, %rd21;            //PC=0x120 move
ld.global.u32 %r7, [%rd27];             //PC=0x128
mul.wide.s32 %rd23, %r7, 4;             //PC=0x130
add.s64 %rd6, %rd13, %rd23;             //PC=0x138
ld.global.u32 %r23, [%rd6];             //PC=0x140
setp.gt.s32     %p5, %r23, -1;          //PC=0x148
@%p5 bra BB0_8;                         //PC=0x150

mov.u32 %r19, 1;                        //PC=0x158
st.global.u32 [%rd4], %r19;             //PC=0x160 *cont = 1;
add.s32 %r20, %r13, 1;                  //PC=0x168
st.global.u32 [%rd6], %r20;             //PC=0x170 d[w] = dist + 1;
mov.u32 %r23, %r5;                      //PC=0x178

BB0_8:
setp.ne.s32     %p6, %r23, %r5;         //PC=0x180 break point r23: d[w], r5:
@%p6 bra BB0_10;                        //PC=0x188

add.s64 %rd26, %rd18, %rd23;            //PC=0x190
ld.global.f32 %f1, [%rd2];              //PC=0x198
atom.global.add.f32 %f2, [%rd26], %f1;  //PC=0x1a0

BB0_10:
add.s32 %r22, %r22, 1;                    //delete
add.s64 %rd27, %rd27, 4;                  //delete
setp.lt.s32     %p7, %r22, %r21;          //delete
@%p7 bra BB0_6;                           //PC=0x1a8 finish or not / destroy table

BB0_11:                                    
ret;                                      //PC=0x1b0   
```

### BC_WL
#### original
```
cvta.to.global.u64 %rd19, %rd12;
cvta.to.global.u64 %rd20, %rd10;
add.s32 %r6, %r14, 1;
add.s64 %rd2, %rd19, %rd18;         #PC=0x100
mul.wide.s32 %rd22, %r23, 4;        #PC=0x108 move
add.s64 %rd28, %rd20, %rd22;        #PC=0x110 move
cvta.to.global.u64 %rd4, %rd13;     #PC=0x118
cvta.to.global.u64 %rd23, %rd11;    #PC=0x120

add.s32 %r1, %r22, %r23;            #PC=0x128 add

BB0_5:
.pragma "nounroll";
ld.global.u32 %r8, [%rd28];         #PC=0x130
mul.wide.s32 %rd24, %r8, 4;
add.s64 %rd6, %rd23, %rd24;
ld.global.u32 %r24, [%rd6];
setp.gt.s32     %p4, %r24, -1;
@%p4 bra BB0_7;

mov.u32 %r20, 1;
st.global.u32 [%rd4], %r20;
add.s32 %r21, %r14, 1;
st.global.u32 [%rd6], %r21;
mov.u32 %r24, %r6;

BB0_7:
setp.ne.s32     %p5, %r24, %r6;
@%p5 bra BB0_9;

add.s64 %rd27, %rd19, %rd24;
ld.global.f32 %f1, [%rd2];
atom.global.add.f32 %f2, [%rd27], %f1;

BB0_9:
add.s32 %r23, %r23, 1;                    #remove
add.s64 %rd28, %rd28, 4;                  #remove
setp.lt.s32     %p6, %r23, %r22;          #remove
@%p6 bra BB0_5;
```
#### modified
```
cvta.to.global.u64 %rd19, %rd12;
cvta.to.global.u64 %rd20, %rd10;
add.s32 %r6, %r14, 1;                   #PC=0x100
add.s64 %rd2, %rd19, %rd18;             #PC=0x108
cvta.to.global.u64 %rd4, %rd13;         #PC=0x110
cvta.to.global.u64 %rd23, %rd11;        #PC=0x118

add.s32 %r1, %r22, %r23;                #PC=0x120
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd22, %r23, 4;            #PC=0x128
add.s64 %rd28, %rd20, %rd22;
ld.global.u32 %r8, [%rd28];
mul.wide.s32 %rd24, %r8, 4;
add.s64 %rd6, %rd23, %rd24;
ld.global.u32 %r24, [%rd6];
setp.gt.s32     %p4, %r24, -1;
@%p4 bra BB0_7;

mov.u32 %r20, 1;
st.global.u32 [%rd4], %r20;
add.s32 %r21, %r14, 1;
st.global.u32 [%rd6], %r21;
mov.u32 %r24, %r6;

BB0_7:
setp.ne.s32     %p5, %r24, %r6;
@%p5 bra BB0_9;

add.s64 %rd27, %rd19, %rd24;
ld.global.f32 %f1, [%rd2];
atom.global.add.f32 %f2, [%rd27], %f1;

BB0_9:
@%p6 bra BB0_5;                          #PC=0x1b8
```

### Color
#### original
original
```
BB0_4:
mov.u32 %r33, -1;
setp.ge.s32     %p4, %r2, %r30;
@%p4 bra BB0_9;

add.s32 %r5, %r30, -1;
mov.u32 %r33, -1;
cvta.to.global.u64 %rd17, %rd7;
mul.wide.s32 %rd18, %r2, 4;       //remove
add.s64 %rd28, %rd17, %rd18;      //remove
cvta.to.global.u64 %rd3, %rd10;
cvta.to.global.u64 %rd22, %rd8;
mov.u32 %r31, %r2;                //create table

add.s32 %r1, %r31, %r30;          //add, set index to table
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd18, %r31, 4;      //move
add.s64 %rd28, %rd17, %rd18;      //move
ld.global.u32 %r25, [%rd28];      //PC=0x130
mul.wide.s32 %rd20, %r25, 4;
add.s64 %rd21, %rd12, %rd20;
ld.global.u32 %r26, [%rd21];
setp.eq.s32     %p5, %r26, -1;
setp.ne.s32     %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;

BB0_7:
mov.u32 %r27, 1;
st.global.u32 [%rd3], %r27;
ld.global.u32 %r28, [%rd28];
mul.wide.s32 %rd23, %r28, 4;
add.s64 %rd24, %rd22, %rd23;
ld.global.u32 %r29, [%rd24];
max.s32 %r33, %r29, %r33;

BB0_8:                               
add.s64 %rd28, %rd28, 4;            //remove
add.s32 %r31, %r31, 1;              //remove
setp.lt.s32     %p8, %r31, %r30;    //remove r31 : index, r30 : end
@%p8 bra BB0_6;

BB0_9:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd13;
st.global.u32 [%rd27], %r33;

BB0_10:
ret;
```

modified
```
BB0_4:
mov.u32 %r33, -1;
setp.ge.s32     %p4, %r2, %r30;
@%p4 bra BB0_9;

add.s32 %r5, %r30, -1;
mov.u32 %r33, -1;
cvta.to.global.u64 %rd17, %rd7;
cvta.to.global.u64 %rd3, %rd10;
cvta.to.global.u64 %rd22, %rd8;
mov.u32 %r31, %r2;                //PC=0x118 create table

add.s32 %r1, %r31, %r30;          //PC=0x120 add, set index to table
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd18, %r31, 4;      //PC=0x128 move
add.s64 %rd28, %rd17, %rd18;      //PC=0x130 move
ld.global.u32 %r25, [%rd28];      //PC=0x138     
mul.wide.s32 %rd20, %r25, 4;      //PC=0x140
add.s64 %rd21, %rd12, %rd20;      //PC=0x148
ld.global.u32 %r26, [%rd21];      //PC=0x150
setp.eq.s32     %p5, %r26, -1;    //PC=0x158
setp.ne.s32     %p6, %r2, %r5;    //PC=0x160
and.pred %p7, %p5, %p6;           //PC=0x168
@!%p7 bra BB0_8;                  //PC=0x170
bra.uni BB0_7;                    //PC=0x178

BB0_7:
mov.u32 %r27, 1;                  //PC=0x180
st.global.u32 [%rd3], %r27;       //PC=0x188
ld.global.u32 %r28, [%rd28];      //PC=0x190
mul.wide.s32 %rd23, %r28, 4;      //PC=0x198
add.s64 %rd24, %rd22, %rd23;      //PC=0x1a0
ld.global.u32 %r29, [%rd24];      //PC=0x1a8
max.s32 %r33, %r29, %r33;         //PC=0x1b0

BB0_8:
@%p8 bra BB0_6;                   //PC=0x1b8

BB0_9:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd13;
st.global.u32 [%rd27], %r33;

BB0_10:
ret;
```
#### parallel
original
```
add.s32 %r6, %r32, -1;
cvta.to.global.u64 %rd18, %rd8;
mul.wide.s32 %rd19, %r2, 4;         //moved
add.s64 %rd29, %rd18, %rd19;        //moved
cvta.to.global.u64 %rd4, %rd11;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r33, %r2;                  //

add.s32 %r1, %r32, %r33;            //PC=, add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd19, %r33, 4;        //PC=, move
add.s64 %rd29, %rd18, %rd19;        //PC=, move
ld.global.u32 %r22, [%rd29];        //PC=0x158
mul.wide.s32 %rd21, %r22, 4;        //60
add.s64 %rd22, %rd13, %rd21;        //68
ld.global.u32 %r23, [%rd22];        //70
setp.eq.s32     %p5, %r23, -1;      //78
setp.ne.s32     %p6, %r2, %r6;      //80
and.pred %p7, %p5, %p6;             //88
@!%p7 bra BB0_8;                    //90
bra.uni BB0_7;                      //98

BB0_7:
mov.u32 %r24, 1;                    //a0
st.global.u32 [%rd4], %r24;         //a8
ld.global.u32 %r25, [%rd29];        //b0
mul.wide.s32 %rd24, %r25, 4;        //b8
add.s64 %rd25, %rd23, %rd24;        //c0
ld.global.u32 %r26, [%rd25];        //PC=0x1c8
atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d0

BB0_8:
add.s32 %r33, %r33, 1;              //PC=0x1d8, remove
add.s64 %rd29, %rd29, 4;            //PC=0x1e0, remove
setp.lt.s32     %p8, %r33, %r32;    //PC=0x1e8, remove
@%p8 bra BB0_6;                     //PC=0x1f0

ld.shared.u32 %r34, [%r5];          //PC=0x1f8

BB0_10:
cvta.to.global.u64 %rd26, %rd12;
add.s64 %rd28, %rd26, %rd17;
st.global.u32 [%rd28], %r34;

BB0_11:
ret;
```

modified

```
add.s32 %r6, %r32, -1;
cvta.to.global.u64 %rd18, %rd8;
cvta.to.global.u64 %rd4, %rd11;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r33, %r2;                  //PC=0x140

add.s32 %r1, %r32, %r33;            //PC=0x148, add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd19, %r33, 4;        //PC=0x150, move
add.s64 %rd29, %rd18, %rd19;        //PC=0x158, move
ld.global.u32 %r22, [%rd29];        //PC=0x160
mul.wide.s32 %rd21, %r22, 4;        //PC=0x168
add.s64 %rd22, %rd13, %rd21;        //PC=0x170
ld.global.u32 %r23, [%rd22];        //PC=0x178
setp.eq.s32     %p5, %r23, -1;      //PC=0x180
setp.ne.s32     %p6, %r2, %r6;      //PC=0x188
and.pred %p7, %p5, %p6;             //PC=0x190
@!%p7 bra BB0_8;                    //PC=0x198
bra.uni BB0_7;                      //PC=0x1a0

BB0_7:
mov.u32 %r24, 1;                    //PC=0x1a8
st.global.u32 [%rd4], %r24;         //PC=0x1b0
ld.global.u32 %r25, [%rd29];        //PC=0x1b8
mul.wide.s32 %rd24, %r25, 4;        //PC=0x1c0
add.s64 %rd25, %rd23, %rd24;        //PC=0x1c8
ld.global.u32 %r26, [%rd25];        //PC=0x1d0
atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d8

BB0_8:
@%p8 bra BB0_6;                     //PC=0x1e0

ld.shared.u32 %r34, [%r5];          //PC=0x1e8

BB0_10:
cvta.to.global.u64 %rd26, %rd12;
add.s64 %rd28, %rd26, %rd17;
st.global.u32 [%rd28], %r34;

BB0_11:
ret;
```
### Color_wl
original
```
add.s32 %r7, %r33, -1;
cvta.to.global.u64 %rd18, %rd8;
mul.wide.s32 %rd19, %r3, 4;             #move
add.s64 %rd29, %rd18, %rd19;            #move
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd10;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r34, %r3;

add.s32 %r1, %r33, %r34;                #add

BB0_5:
.pragma "nounroll";
ld.global.u32 %r23, [%rd29];            #PC=0x348
mul.wide.s32 %rd21, %r23, 4;
add.s64 %rd22, %rd20, %rd21;
ld.global.u32 %r24, [%rd22];
setp.eq.s32     %p4, %r24, -1;
setp.ne.s32     %p5, %r3, %r7;
and.pred %p6, %p4, %p5;
@!%p6 bra BB1_7;
bra.uni BB1_6;

BB1_6:
mov.u32 %r25, 1;
st.global.u32 [%rd3], %r25;
ld.global.u32 %r26, [%rd29];
mul.wide.s32 %rd24, %r26, 4;
add.s64 %rd25, %rd23, %rd24;
ld.global.u32 %r27, [%rd25];
atom.shared.max.s32 %r32, [%r6], %r27;

BB1_7:
add.s32 %r34, %r34, 1;                  #remove
add.s64 %rd29, %rd29, 4;                #remove
setp.lt.s32     %p7, %r34, %r33;        #remove
@%p7 bra BB1_5;
```
modified
```
add.s32 %r7, %r33, -1;
cvta.to.global.u64 %rd18, %rd8;
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd10;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r34, %r3;                          #PC=0x140

add.s32 %r1, %r33, %r34;                    #PC=0x148
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd19, %r34, 4;                #PC=0x150
add.s64 %rd29, %rd18, %rd19;
ld.global.u32 %r23, [%rd29];
mul.wide.s32 %rd21, %r23, 4;
add.s64 %rd22, %rd20, %rd21;
ld.global.u32 %r24, [%rd22];
setp.eq.s32     %p4, %r24, -1;
setp.ne.s32     %p5, %r3, %r7;
and.pred %p6, %p4, %p5;
@!%p6 bra BB0_7;
bra.uni BB0_6;

BB0_6:
mov.u32 %r25, 1;
st.global.u32 [%rd3], %r25;
ld.global.u32 %r26, [%rd29];
mul.wide.s32 %rd24, %r26, 4;
add.s64 %rd25, %rd23, %rd24;
ld.global.u32 %r27, [%rd25];
atom.shared.max.s32 %r32, [%r6], %r27;

BB0_7:
@%p7 bra BB0_5;                            #PC=0x1e0
```

### color max min
### origin
```
add.s32 %r5, %r52, -1;
cvta.to.global.u64 %rd21, %rd6;
mul.wide.s32 %rd22, %r2, 4;               #move
add.s64 %rd37, %rd21, %rd22;              #move
cvta.to.global.u64 %rd26, %rd9;
cvta.to.global.u64 %rd27, %rd7;
mov.u32 %r53, %r2;                        #PC=0x178

add.s32 %r1, %r52, %r53;                  #add

BB0_6:
.pragma "nounroll";
ld.global.u32 %r27, [%rd37];              #PC=0x180
mul.wide.s32 %rd24, %r27, 4;
add.s64 %rd25, %rd12, %rd24;
ld.global.u32 %r28, [%rd25];
setp.eq.s32     %p5, %r28, -1;
setp.ne.s32     %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;

BB0_7:
mov.u32 %r29, 1;
st.global.u32 [%rd26], %r29;
ld.global.u32 %r30, [%rd37];
mul.wide.s32 %rd28, %r30, 4;
add.s64 %rd29, %rd27, %rd28;
ld.global.u32 %r31, [%rd29];
atom.shared.max.s32 %r36, [%r24], %r31;
ld.global.u32 %r37, [%rd37];
mul.wide.s32 %rd30, %r37, 4;
add.s64 %rd31, %rd27, %rd30;
ld.global.u32 %r38, [%rd31];
atom.shared.min.s32 %r41, [%r26], %r38;

BB0_8:
add.s32 %r53, %r53, 1;                     #delete
add.s64 %rd37, %rd37, 4;                   #delete
setp.lt.s32     %p8, %r53, %r52;           #delete
@%p8 bra BB0_6;                            #PC=0x240

ld.shared.u32 %r55, [%r24];
ld.shared.u32 %r54, [%r26];
```

### parallel
```
add.s32 %r5, %r52, -1;
cvta.to.global.u64 %rd21, %rd6;

cvta.to.global.u64 %rd26, %rd9;
cvta.to.global.u64 %rd27, %rd7;
mov.u32 %r53, %r2;                        #PC=0x168

add.s32 %r1, %r52, %r53;                  #PC=0x170 add

BB0_6:
.pragma "nounroll";

mul.wide.s32 %rd22, %r53, 4;              #PC=0x178 move
add.s64 %rd37, %rd21, %rd22;              #PC=0x180 move

ld.global.u32 %r27, [%rd37];              #PC=0x
mul.wide.s32 %rd24, %r27, 4;
add.s64 %rd25, %rd12, %rd24;
ld.global.u32 %r28, [%rd25];
setp.eq.s32     %p5, %r28, -1;
setp.ne.s32     %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;

BB0_7:
mov.u32 %r29, 1;
st.global.u32 [%rd26], %r29;
ld.global.u32 %r30, [%rd37];
mul.wide.s32 %rd28, %r30, 4;
add.s64 %rd29, %rd27, %rd28;
ld.global.u32 %r31, [%rd29];
atom.shared.max.s32 %r36, [%r24], %r31;
ld.global.u32 %r37, [%rd37];
mul.wide.s32 %rd30, %r37, 4;
add.s64 %rd31, %rd27, %rd30;
ld.global.u32 %r38, [%rd31];
atom.shared.min.s32 %r41, [%r26], %r38;

BB0_8:
@%p8 bra BB0_6;                            #PC=0x230

ld.shared.u32 %r55, [%r24];
ld.shared.u32 %r54, [%r26];
```

### mis
#### origin
```
cvta.to.global.u64 %rd4, %rd10;
cvta.to.global.u64 %rd19, %rd9;
mul.wide.s32 %rd20, %r29, 4;            #move
add.s64 %rd28, %rd19, %rd20;            #move

BB1_6:
.pragma "nounroll";
ld.global.u32 %r9, [%rd28];             #PC=0x210
mul.wide.s32 %rd21, %r9, 4;
add.s64 %rd22, %rd1, %rd21;
ld.global.u32 %r25, [%rd22];
setp.ne.s32     %p5, %r25, -1;
@%p5 bra BB1_8;

add.s64 %rd24, %rd4, %rd21;
ld.global.u32 %r26, [%rd24];
atom.shared.min.s32 %r27, [%r6], %r26;

BB1_8:
add.s32 %r29, %r29, 1;                    #delete
add.s64 %rd28, %rd28, 4;                  #delete
setp.lt.s32     %p6, %r29, %r28;          #delete
@%p6 bra BB1_6;

ld.shared.u32 %r30, [%r6];

BB1_10:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd18;
st.global.u32 [%rd27], %r30;

BB1_11:
ret;
```
#### modified
```
cvta.to.global.u64 %rd4, %rd10;
cvta.to.global.u64 %rd19, %rd9;         #PC=0x1f8

add.s32 %r1, %r28, %r29;                #PC=0x200 add 
BB1_6:
.pragma "nounroll";
mul.wide.s32 %rd20, %r29, 4;            #PC=0x208 move
add.s64 %rd28, %rd19, %rd20;            #PC=0x210 move
ld.global.u32 %r9, [%rd28];             #PC=0x218
mul.wide.s32 %rd21, %r9, 4;             #PC=0x220
add.s64 %rd22, %rd1, %rd21;             #PC=0x228
ld.global.u32 %r25, [%rd22];            #PC=0x230
setp.ne.s32     %p5, %r25, -1;          #PC=0x238
@%p5 bra BB1_8;                         #PC=0x240

add.s64 %rd24, %rd4, %rd21;             #PC=0x248
ld.global.u32 %r26, [%rd24];            #PC=0x250
atom.shared.min.s32 %r27, [%r6], %r26;  #PC=0x258

BB1_8:
@%p6 bra BB1_6;                         #PC=0x260

ld.shared.u32 %r30, [%r6];              #PC=0x268

BB1_10:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd18;
st.global.u32 [%rd27], %r30;

BB1_11:
ret;
```
### mis_wl
#### origin
```
cvta.to.global.u64 %rd2, %rd10;
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd9;         #PC=0x1f8
mul.wide.s32 %rd21, %r30, 4;            #PC=0x200 move
add.s64 %rd29, %rd20, %rd21;            #PC=0x208 move

add.s32 %r1, %r29, %r30;                #add

BB1_5:
.pragma "nounroll";
ld.global.u32 %r10, [%rd29];            #PC=0x210
mul.wide.s32 %rd22, %r10, 4;
add.s64 %rd23, %rd3, %rd22;
ld.global.u32 %r26, [%rd23];
setp.ne.s32     %p4, %r26, -1;
@%p4 bra BB1_7;

add.s64 %rd25, %rd2, %rd22;
ld.global.u32 %r27, [%rd25];
atom.shared.min.s32 %r28, [%r7], %r27;

BB1_7:
add.s32 %r30, %r30, 1;                  #remove
add.s64 %rd29, %rd29, 4;                #remove
setp.lt.s32     %p5, %r30, %r29;        #remove
@%p5 bra BB1_5;
```
#### modified
```
cvta.to.global.u64 %rd2, %rd10;         #PC=0x1e8
cvta.to.global.u64 %rd3, %rd11;         #PC=0x1f0
cvta.to.global.u64 %rd20, %rd9;         #PC=0x1f8

add.s32 %r1, %r29, %r30;                #PC=0x200
BB1_5:
.pragma "nounroll";
mul.wide.s32 %rd21, %r30, 4;            #PC=0x208
add.s64 %rd29, %rd20, %rd21;
ld.global.u32 %r10, [%rd29];
mul.wide.s32 %rd22, %r10, 4;
add.s64 %rd23, %rd3, %rd22;
ld.global.u32 %r26, [%rd23];
setp.ne.s32     %p4, %r26, -1;
@%p4 bra BB1_7;

add.s64 %rd25, %rd2, %rd22;
ld.global.u32 %r27, [%rd25];
atom.shared.min.s32 %r28, [%r7], %r27;

BB1_7:
@%p5 bra BB1_5;                         #PC=0x260
```

### sssp
#### origin
```
cvta.to.global.u64 %rd19, %rd11;
cvta.to.global.u64 %rd20, %rd10;
mul.wide.s32 %rd21, %r24, 4;            #move
add.s64 %rd28, %rd20, %rd21;            #move
add.s64 %rd27, %rd19, %rd21;            #move

BB0_3:
.pragma "nounroll";
ld.global.u32 %r19, [%rd28];            #PC=0x100
mul.wide.s32 %rd22, %r19, 4;
add.s64 %rd23, %rd17, %rd22;
ld.global.u32 %r20, [%rd23];
ld.global.u32 %r21, [%rd27];
add.s32 %r22, %r20, %r21;
atom.shared.min.s32 %r23, [%r6], %r22;
add.s64 %rd28, %rd28, 4;                #delete
add.s64 %rd27, %rd27, 4;                #delete
add.s32 %r24, %r24, 1;                  #delete
setp.lt.s32     %p3, %r24, %r4;         #delete
@%p3 bra BB0_3;

ld.shared.u32 %r25, [%r6];

BB0_5:
cvta.to.global.u64 %rd24, %rd13;
shl.b64 %rd25, %rd1, 2;
add.s64 %rd26, %rd24, %rd25;
st.global.u32 [%rd26], %r25;

BB0_6:
ret;
```
#### modified
```
cvta.to.global.u64 %rd19, %rd11;    
cvta.to.global.u64 %rd20, %rd10;        #PC=0x0e0

add.s32 %r1, %r4, %r24;                 #PC=0x0e8 add 
BB0_3:
.pragma "nounroll";
mul.wide.s32 %rd21, %r24, 4;            #PC=0x0f0 move
add.s64 %rd28, %rd20, %rd21;            #PC=0x0f8 move
add.s64 %rd27, %rd19, %rd21;            #PC=0x100 move

ld.global.u32 %r19, [%rd28];            #PC=0x108
mul.wide.s32 %rd22, %r19, 4;            #PC=0x110
add.s64 %rd23, %rd17, %rd22;            #PC=0x118
ld.global.u32 %r20, [%rd23];
ld.global.u32 %r21, [%rd27];
add.s32 %r22, %r20, %r21;
atom.shared.min.s32 %r23, [%r6], %r22;

@%p3 bra BB0_3;                         #PC=0x140

ld.shared.u32 %r25, [%r6];

BB0_5:
cvta.to.global.u64 %rd24, %rd13;
shl.b64 %rd25, %rd1, 2;
add.s64 %rd26, %rd24, %rd25;
st.global.u32 [%rd26], %r25;

BB0_6:
ret;
```
---

## outcome
[spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0)
### normal
* pagerank
    * speedup : 2.02x
    * iterations : 26.6%
    * memory access number: 62.9%
* BFS
    * speedup : 1.08x
    * iterations : 84.9%
    * memory access number: 41.4%
* Color Max(pagerank graph)
    * speedup : 2.32x
    * iterations : 26.6%
    * memory access number: 15.2%
* MIS(pagerank graph)
    * speedup : 1.91x
    * iterations : 26.6%
    * memory access number: 21.6%
* SSSP
    * speedup : 1.97x
    * iterations : 26.6%
    * memory access number: 13.1%
### bypass l1 cache
* Color Max
    * speedup : 1.49x
    * iterations : 99.9%
    * memory access number: 66.2%

### working list
* BFS
    * gpgpu-sim's bug in global atomic increment
    * might because atom's outcome will not shown immediately
    * wl code has bug
    
    * speedup : x
    * iterations : %
    * memory access number: %
---

## Run Benchmarks 
[experiment spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1338197220)
[教授spread sheet](https://docs.google.com/spreadsheets/d/14ibl-NsZcrZzOVNC3gcKF7Ow2tnZt8DKzSlhPU888ak/edit#gid=283140300)
![](https://i.imgur.com/sPtCltQ.png)

### Loop Unroll
```cpp=
#pragma unroll 1
for(int i=start; i<end; i++){
    ...
}
```
* `#pragma unroll N`會告訴compiler要unroll幾層loop
* N=1只會unroll一層,即**turn off** loop unroll

### atomic operation
* instruciton.cc
    * atom_impl
        * thread->m_last_dram_callback.function_wt
* cuda-sim.cc
    * ptx_exec_inst
        * inst.add_callback(lane_id, last_callback().function_wt, last_callback().instruction, this, threads, true /*atomic*/);
* abstract_hardware_model.cc
    * do_atomic
        * cb.function_wt(cb.instruction, cb.thread, cb.threads);
* abstract_hardware_model.h
    * add_callback
    * dram_callback_t

### 1. gpgpusim.config 
```
#lemonsien
-gpgpu_perfect_mem 0
-gpgpu_concurrent_kernel_sm 1
-scheduler_kind 2
-partition_method 1

...
...

# high level architecture configuration
-gpgpu_n_clusters 30 

...
...

# Device Limits
-gpgpu_kernel_launch_latency 0

```
-cache allocate policy : 改成on miss, hit reserved才會正確
* 否則都是當成miss

-scheduler_kind
* 0:origin
* 1:paver
* 2:paver2

-partition_method
* 0:BCS
* 1:k-way
* 2:recursive(not working)
### 2. Run
```
#pannotia/fw
./fw ../../../../data_dirs/pannotia/fw/data/256_16384.gr &> output_memstall  &
```
```
#pannotia/bc
./bc ../../../../data_dirs/pannotia/bc/data/1k_128k.gr &> output_paver_30groups_MST_ideal &
```
```
#pannotia/color_max
./color_max ../../../../data_dirs/pannotia/color_max/data/ecology1.graph 1 &> output_paver_30groups_MST_ideal &
```
```
#pannotai/sssp
./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 1 &> output_paver_30groups_MST_ideal &
```
```
#pannotia/pagerank
./pagerank ../../../../data_dirs/pannotia/pagerank/data/coAuthorsDBLP.graph 1 &> output_paver2 &
```
```
#pannotia/sssp
./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 0 &> output_memstall2 &
```
```
#lonestar/mst
./run &> output_paver_30groups_MST_ideal &
```
```
#lonestar/sssp
./sssp ../../../../../data_dirs/cuda/lonestargpu-2.0/inputs/USA-road-d.FLA.gr &> output_memstall2 &
```
```
#mis
./mis data/ecology1.graph 1 &> output_memstall &
```

## Git
http://blog.gogojimmy.net/2012/02/29/git-scenario/


```
git commit -am "comments"
git push -u origin branch_name
git pull origin branch_name                         
```


#### Log in 
```
git config --global user.name "<使用者名字>"
git config --global user.email "<電子信箱>"
```
check configure:
```
git config --list
or 
cat ~/.gitconfig
```
#### 看狀態
```
git status
```
#### 看歷史commit
```
git log
```
#### 加入新(untracked)檔案
```
git add file_name
```
一次加全部檔案：
```
git add .
```
#### 存檔, 會紀錄改過的檔案的訊息
```
git commit -m "commit訊息"
```
先git add .後略過編輯器提交 commit
```
git commit -am "commit訊息" 
```

#### Branch

新增branch
```
git branch new_branch
```
查看有哪些branch
```
git branch
```
切換branch
```
git checkout target_branch
```
#### merge branch_name到目前的branch
```
git merge branch_name
```

## Perfect Memory
* access memory
    * 不access memory，"**去memory**" 馬上接 "**回cluster**"

![](https://i.imgur.com/5LjCpZQ.png)

![](https://i.imgur.com/ObHo7W6.png)


### memory stall
* kinds
    * memory_cycle
        * memory access not done
    * shared_cycle
    * constant_cycle
    * texture_cycle

* memory priority
    * ICNT_RC_FAIL = BK_CONF > COAL_STALL

NO_RC_FAIL = 0,
BK_CONF,
MSHR_RC_FAIL,
ICNT_RC_FAIL,
COAL_STALL,
TLB_STALL,
DATA_PORT_STALL,
WB_ICNT_RC_FAIL,
WB_CACHE_RSRV_FAIL,
N_MEM_STAGE_STALL_TYPE


## 相關論文
### irregular benchmark
* Graph-Waving architecture: Efficient execution of graph applications on GPUs
    * [link](https://www.sciencedirect.com/science/article/pii/S0743731520303889)
    * node to warp mapping, eliminate duplicate execution, shrink the warp size.
* Fine-Grained Management of Thread Blocks for Irregular Applications
     * [link](https://ieeexplore.ieee.org/abstract/document/8988776?casa_token=mPwWuVZ8uekAAAAA:g3-72tUqQEa0RrzLppSWgs1erUiF5kwMhDruh5wyMgWfSJos8Q9pVCEiALDTIbiXQi0t7yPwGj4) 2019
     * irregular application(graph)的狀態會隨著時間而改變(因為少數node關聯度很高)
     * 用前1個kernel的TB去預測下一個kernel的TB的狀態
     * 及時追蹤每一個TB的現在的狀態
     * 狀態包括Load-Store unit stall，Floating-Point Unit的使用度
     * 根據這些狀態去調整SM中TB的數目，允許preemption
     * gain 17% and 13% improvements of throughput over previously proposed static and dynamic scheduling strategies, respectively
* Dynamic Thread Block Launch: A Lightweight Execution Mechanism toSupport Irregular Applications on GPUs
     * [link](https://dl.acm.org/doi/epdf/10.1145/2872887.2750393) 2015
     * 為了加速irregular application，這篇論文讓thread可以產生child thread block
     * 譬如：有些graph的每個thread會用for loop去traverse某個node的neighbor，每個node的neighbor不同時，會很分散。若是create一個child kernel或child thread block去concurrent執行所有的neighbor會很有用
     * 1.21x speedup over the original ﬂat implementation and average 1.40x over the implementations using device-kernel launch functionality
* A quantitative study of irregular programs on GPUs
    * [link](https://ieeexplore.ieee.org/abstract/document/6402918)
* Pannotia: Understanding irregular GPGPU graph applications
    * [link](https://ieeexplore.ieee.org/abstract/document/6704684)
* Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription
    * [link](https://ieeexplore.ieee.org/document/9139797)
    * working sets of irregular applications can often be divided into cold and hot data structures.
        * cold allocations are accessed seldom and sparsely
        * hot data structures have dense sequential access
    * propose a dynamic page placement strategy for irregular general purpose applications differentiate between hot and cold allocations
    * performance improvement of 22% to 78%
* Optimizing non-coalesced memory access for irregular applications with GPU computing
    * [link](https://link.springer.com/content/pdf/10.1631/FITEE.1900262.pdf) 2020
    * software solution with no hardware extensions to solve dynamic irregularities
    * data reordering
        * eliminate indirect memory access by copying to a new location and reordering
    * Index redirection
    * reduce overhead
        * for loop拿到kernel外面，run multiple streams of kernels(1 for count, others for remap)
    * For loop-carried applications, reordered data are cached for the next iteration or other kernels to eliminate possible redundant data reorganization
    * 7.0%–7.17% overall performance improvement
* Managing DRAM Latency Divergence in Irregular GPGPU Applications
    * [link](https://www.cs.utah.edu/~rajeev/pubs/sc14.pdf) 2014
    * different requests from a warp are serviced together, or in quick succession, without interference from other warps
    * warp-aware memory controllers
    * boost performance by 10.1%
* Graph Processing on GPUs: A Survey
    * [link](https://www.dcs.warwick.ac.uk/~liganghe/papers/ACM-Computing-Surveys-2017.pdf) 2018
    * graph processing
        * data layout
        * memory access pattern
        * workload mapping
        * GPU programming
    * new research challenges and opportunities
        * Graph processing on hybrid systems 
        * Graph processing on new GPU architecture
        * Dynamic graph processing on GPUs
            * the graph structure can be updated frequently in runtime
            * dynamically maintain balanced workload mapping with the rapid changes of graphs
        * Machine-learning applications
* Anatomy of GPU Memory System for Multi-Application Execution
    * [link](https://www.cs.utexas.edu/users/skeckler/pubs/MEMSYS_2015_Anatomy.pdf) 2015
    * study the memory system of GPUs in a concurrently executing **multi-application** environment
    * **L2-MPKI** and **attained bandwidth** are the two knobs that can be used to drive memory scheduling decisions for achieving better performance
* Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads
    * [link](https://arxiv.org/pdf/2007.07131.pdf)
    * 1.33x and 13% improvement in performance and energy savings respectively
* On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
    * [link](https://dl.acm.org/doi/pdf/10.1145/1950365.1950408)
* On-GPU Thread-Data Remapping for Branch Divergence Reduction
    * [link](https://dl.acm.org/doi/10.1145/3242089)
* Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
    * [link](https://ieeexplore.ieee.org/abstract/document/7349653)
    * use compiler to transform the original code to the template
* Spare register aware prefetching for graph algorithms on GPUs
    * [link](https://ieeexplore.ieee.org/abstract/document/6835970)
    * use spare register to store the prefetch edge

### memory access divergence
* DaCache: Memory Divergence-Aware GPU Cache Management

### pre-launch (inter-kernel)
* BlockMaestro: Enabling Programmer-Transparent Task-based Execution in GPU Systems
    * [link](https://ieeexplore.ieee.org/document/9499869)
    * 用bipartite graph找出有dependency(locality)的TBs，並讓後面的TB pre-launch以減少launch time
    * 用編譯器就能解決，所以不用改code
    * **input-dependent** is not working
### inter-kernel locality
* Inter-kernel reuse-aware thread block scheduling
    * group TBs
    * 下一個kernel的group與上一個kernel對齊
    * reverse
    * work stealing，一次一個
### intra-kernel locality
* PAVER: Locality Graph-Based Thread Block Scheduling for GPUs
    * [link](https://dl.acm.org/doi/fullHtml/10.1145/3451164) 2021
    * 用graph theory將TBs的locality都找出來
    * 用graph partition將graph切成groups，每個group分給一個SM
    * work stealing，偷不只一個以維持locality
    * **input-dependent** is not working
    * [note](https://hackmd.io/q-vSikPLSdmNqHPz1GwQwA?view)
    * performance speedup by 29%, 49.1%, and 41.2%
* Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling
    * [link](https://ieeexplore.ieee.org/abstract/document/7898501)
    * run time時才計算出與目前SM中的TBs最適合的TB作為候選人
    * **input-dependent** is not working
* Improving GPGPU resource utilization through alternative thread block scheduling
    * [link](https://ieeexplore.ieee.org/abstract/document/6835937?casa_token=jQrF9KO5vLMAAAAA:1_H9ANBV8mdo4iuC_Nz5bmcPPx4aKsjx4l15Y_1WnhdE8RBbm2hvXO6kdiSmRKZfK2NCJIoG_AU)
    * workload type
        * TB increase, performance increase
        * TB increase, performance increase then saturate
        * TB increase, performance decrease
        * TB increase, performance increase then decrease
    * core activity
        * ACTIVE
        * INACTIVE
            * Idle
            * Mem_stall
            * Core_stall
            * 
    * SM裡的TBs數目會影響performance，最大未必會最好
    * <img src="https://i.imgur.com/udazh1W.png" width="40%" height="40%" />
    * consecutive CTAs are assigned to the same cores

### analyze data locality
* LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs
        
### warp scheduler
* A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization        
        
### multiple streams scheduler
* Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications
* Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels
* Increasing GPU throughput using kernel interleaved thread block scheduling

### graph
* Practical Optimal Cache Replacement for Graphs
    * 用transpose(CSC<->CSR)快速找到cache中下一次被存取時間最久的cache line
    * 適用traverse graph的應用(in/out)
## Time: 1/21/2022 ~ 1/26/2022
* goal
    * sssp_spmv, parboil: spmv,
    * node to warp mapping
    * data driven
## Time: 1/12/2022 ~ 1/20/2022
* progress
    * pagerank 5 graph testing
    * pagerank spmv
    * color max min

## Time: 10/21/2021 ~ 10/28/2021
* child thread block
    * change program? 
* warp formation
    * can do?
* TB formation


## Time: 10/18/2021 ~ 10/25/2021
* 了解CSB
* pagerank 問題
* 壓縮sparse matrix
* cuSha


## Time: 10/14/2021 ~ 10/21/2021
* ~~color攤開第三層~~
* sssp跑NY
* color跑大圖G3
* 改其他benchmark
* ~~pagerank結果寫到spread sheet上~~ 
* pagerank問題
    * atomicAdd ?
    * ITER拿掉
    * 每個kernel看perfect memory / cache.nonflush
* momory bound問題？
* outcome
    * color-max
        * 有快一點
        * memory access減少，但miss rate上升
        * 攤開node value會更慢，memory access會再會再減少，但miss ra會再te
        * graph很balance
    * page-rank
        * 慢很多
        * memory access減少，但miss rate上升
        * graph很不balance

## Issue
* race condition (sssp : min)
    * possible solution : atomic or synchronization
* 兩層 for loop 不能攤開

## Time: 10/7/2021 ~ 10/14/2021
* 每個kernel印出remap後的cycle
* remap的跑在GPU上看看 (debug比較快）
* 解決branch divergence(sort 每個vertex的edge數）
* gpgpu-sim_simulations/benchmarks/src/cuda/rodinia/3.1/cuda/bfs/


## Time: 9/24/2021 ~ 9/30/2021
看memory coaleasing有沒有出問題
memory bound是bound在哪邊
參考data reorder(no-coaleased)


## Time: 9/6/2021 - 9/12/2021
林孟賢：加上ideal cache的miss rate, 個別SM的cycle數(為了判斷 load balance), 每個kernel的miss rate
席致寧：group裡加上mst, 跑benchmark


## Time: 8/23/2021 - 9/5/2021    
林孟賢：實做paver
席致寧：跑bechmark


## Time: 8/6/2021 - 8/12/2021    
* embeddingbag
    修改好embeddingbag 
    cache access patern of embedding with 2 keys
    ```    
    -3221225472 57 => input(indices)
    -3221241536 1
    -3221241472 7909 => weightFeat(weight)
    -3221241408 1
    -3221241344 7909 => weightFeat(weight)
    -3221241152 1
    -3221241088 7445 => weightFeat(weight)
    -3221241280 1
    -3221241216 7445 => weightFeat(weight)
    -3221225536 1
    -3221225600 57 => input(indices)
    -3221225664 1
    -3221225728 57
    -3221225792 1
    -3221225856 57
    -3221225920 1
    -3221225984 57
    ```
    * 結論
        * threads random access(跟input有關) weightFeat, 且locality會隨著dictionary增大而消失
* MIS
    * 印addr access number
    * shared/not shared histogram
    * 觀察
        * [spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1340093626 )
        * 在shared和not shared的locality都很高的情況下，我們的cache分段方法沒有優勢。
        * 如果跑在多個SM上能保留shared而讓，not shared的locality消失，那我們的方法或許會奏效。但我們的shared是根據inter kernel的locality去找的，在intra kernel是否有效還很難說
        * 如果我們的方法(PCG)可以自由的hit, pending hit, sector miss, 那在4/6分之下可以達到和LRU差不多的結果
    * Terminology
        * ideal: 只考慮某個cache line被**存取**幾次
        * normal: 只考慮某個cache line被hit（包含第一次miss）幾次，不考慮pending hit和sector miss等
        * shared: 可能表示**我們標的pc**或**高access次數的pc**所access到的cache line
        * not shared: 與shared相對
        * both: shared與not shared都有access到的cache line
    * Outcome
        * ideal與LRU的比較
            * kernel 2
                * ideal, miss rate 0.243
                <img src="https://i.imgur.com/l9ygFRx.png" width="50%" height="50%" />
                * LRU, miss rate 0.287
                <img src="https://i.imgur.com/lkn0XIW.png" width="50%" height="50%" />
        * 結論
            * LRU已經很接近ideal(total miss rate 0.54/0.48)
            * 相比於shared, not shared的cache line反倒是沒有很好的keep住
        * bypass l1 cache
            * 比較慢
        * high frequency ps has high priority to stay in cache(with LRU)
            * 比較慢
        

    * ideal算法
        * block addr要改成addr
        * sector miss要考慮進去
    * miss rate算法
        * tag_array算出來的是read的miss rate(剛好是我們要的)
        * 個別MISS或SCTOR_MISS與gpgpu-sim的有落差，但總和一樣(所以在我們的case下沒差)

## Time: 7/30/2021 - 8/5/2021
* cache切成兩部份
    * shared(80%)
    * not shared(20%)
* replacement policy
    * shared
        * LIFO
        * LRU
    * not shared
        * LRU
* compared with ideal
* cta scheduler reverse
---
* cache line加上counter, 紀錄hit次數
    * set l1 to tag_array
    * print cache access
        * s1
        * 1 cta
        * 1 warp
    * sector miss要考慮
* future work
    * print cta access status

## Time: 7/22/2021 - 7/29/2021
* 將call/assign by reference/value實做出來
* 在vec-add上實做，結果符合預期
* ![](https://i.imgur.com/6h0zdO4.png)
* ![](https://i.imgur.com/uN6IPIJ.png)

## Time: 7/19/2021 - 7/26/2021
* 7/16
    * c的基本語法ok
    * bear沒有試成功，所以目前無法編譯需要make的cuda code
* 7/19
    * 開始做data member的部分
* 7/20
    * data member的部分大致可以
    * 發現member function有問題
* 7/21
    * 修改data member的data structure以符合member function的操作
    * 開始做member function
* 7/22
    * member function基本的case可以
    * 發現member function在找data member時，若遇上inheritance會有問題
    * 找到inherited的class
* 待辦
    * **declare** to **definition** (getDefinition)
    * 從全部視為pass by reference改成pass by reference/value
    * alias(變數的傳遞)
    * 考慮control flow(if, else等)

## Time: 7/12/2021 - 7/19/2021

### 1. Successes Last Week
<!--
* What you managed/plan to accomplish this week - hopefully in line with goals - if not that's okay too.
-->

* 席致寧
    * 7/9 ~ 7/16
        * [[筆記] build and run gpgpu-sim](https://hackmd.io/WsOKV8MmTfS1-Y5gyrS2UQ?view)
    * 7/16 ~ 7/23
        * c_Kernel從constant memory改成global memory
        * for loop 改成用 multiple stream
        * [[筆記] streams & memory](https://hackmd.io/WTA8Vz_8Slyveitd7JDYMw?view)

    * 7/23 ~ 7/30
        * [BuildMI()](https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class) : extend machine instruction
        * [[筆記]memory](https://hackmd.io/MFf9C4AHRU2z6oy6X9DRXg)
        * [[筆記]LLVM](https://hackmd.io/HKv_OZvXSmWvkLXHTu6BaA)
    * 7/30 ~ 8/6
        * [NVPTXInstrInfo.td](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td)
    * 8/6 ~ 8/13 
        * compiling LLVM
    * 8/13 ~ 8/20
        * [P-OPT: Practical Optimal Cache Replacement for Graph Analytics Architectures](https://hackmd.io/iCDyMqTNQ8a1DzwtSSV4Jg)
        * [Internal of LLVM](https://hackmd.io/bwlGi-ntTxCF1pTvB0N-YA)
### 2. Progress and Problems Last Week
<!-- 
* What have you done last week ?
* What problems encountered last week ?
* What caused you to be slow on a particular task? What roadblocks did you encounter?
-->
* 跟**李睦樂**討論之後覺得理論上做的出來，但實做細節與clang API是否支援等現實層面的問題還在摸索中
* 限制條件為若有些kernel的parameter是runtime時才決定的
    * 譬如：C = A+B的C與A在runtime時才得知是指到同一個obj，那我們在compile時就不會知道A的值是非read-only的
* 目前的想法為每一個變數與obj設置counter，當寫入時counter+1，最後看哪些變數符合條件
    <font color= #DCDCDC>
    * 為每一個**變數**配置一個資料結構用來紀錄**write次數**, **類型(normal, array, pointer, reference)**, **指標(指到其他人的資料結構)**
    * 最後根據每一個變數**自己的資料結構**和**指到的資料結構**去判斷是否為constant
        * 超過2就非constant
    * 需要trace **call by value**, **call by reference**等實做方法(在找出哪些是我們要的變數上遇到了問題，因為 **=左右** 都可能存在不只一個變數)
        * 判斷 **= 右邊**的變數的level情況
        * *: level-1
        * &: level+1
        * 二元運算子: 留下小的
        * 用post order traversal的方法搭配stack找出level小的與 **= 左邊**的變數關聯
        * 搭配children()用法
        * http://web.mit.edu/freebsd/head/contrib/llvm/tools/clang/lib/AST/StmtPrinter.cpp

    </font>
* write 指令包括 assign, compound assign, inc/dec
* 被 write 的 variable 為 **=左邊** 的第一個variable(可能會有問題)
<!--* 用bear -vvvv make這調指令以獲得compile_commands.json-->
* 在sp和mst, **bear** 和 **compiledb** 無法取得compile_commands.json
### 3. Not Resolved Problems
<!--* The checking list of problems you have not resolved yet.-->
* 在找出哪些是我們要的變數上遇到了問題，因為 **=左右** 都可能存在不只一個變數

### 4. Goals for next week

* 席致寧
    
    * shaer.cc (/src/gpgpu-sim)
        * issue_warp
        * 在TB裡面建立一個array,蒐集ld.global的address,找不同TB裡的交集
    * warp_inst_t (/src/abstract_hardware_model.cc)
        * class warp_inst_t()
        * 加array來存address
    * memory_space_t (/src/abstract_hardware_model.cc)
        * is_global() 

* 林孟賢
    * assign一律考慮為pass by reference，並run在不同cuda benchmark裡，確保cuda code可以run成功，並找出可能的問題
    * vecadd: ok
        * ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables test_case/vector-add.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
        * ![](https://i.imgur.com/0vceiiJ.png)
    * pathfinder: 
        * ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables pathfinder.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
        * ![](https://i.imgur.com/h0Owq9L.png)