# Persistent Caching
###### tags: `GPU`
###### members: @林孟賢 @席致寧
[paper](https://www.overleaf.com/3161447213kchpscspbmxk)
[proposal](https://docs.google.com/document/d/1LPxwv7mzae4n1qHnJnDLL_UtjeJf0ngaChi5H_2DfHM/edit?usp=sharing)
[report](https://drive.google.com/file/d/152gvtqASZaRv1jkt5Guig8FHHtJiFg0R/view?usp=sharing)
[spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0)
[slide](https://docs.google.com/presentation/d/1Thz_tu13iYqtAW99xmj1XgWEpikkIbobYvLYku2fkGw/edit?usp=sharing)
[oral test](https://docs.google.com/presentation/d/12lGzoFUw9xnw112_kMmgaDO5RW0ygAmvc0cVbPRxzM4/edit?usp=sharing)
[reference paper](https://hackmd.io/867BaaD-QIe1ytWs7FXYpQ)
[DAC paper bullet point](https://hackmd.io/ES09vnUQT-KgcvPX5h2fVw?both)
## Idea




* spare register ([reference paper](https://ieeexplore.ieee.org/abstract/document/6835970))
    * Due to resource constraints, there may be unallocated registers in a GPU.
    * If a thread warp has terminated, the registers allocated to that warp are free.
* register file access
    * newer versions of gpgpu-sim contain an **operand collector**, which includes a crossbar and can access every register file bank
* compiler assist
    * a programming interface similar to OpenMP
    * requires the user to mark the independent for loop
    * requires the compiler to compute the number of registers used in each iteration
    * **reaching-definition analysis** can identify **loop-invariant variables**
* runtime
    * at runtime we need a table that records each thread's index in the for loop
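The transformation the modified PTX listings below apply can be illustrated on the host side: the pointer-increment form carries loop state in induction registers, while the index form recomputes each address from a single loop index, which is all the runtime table has to track per thread. A minimal C++ sketch (names and data are illustrative, not taken from the benchmarks):

```cpp
#include <cassert>
#include <vector>

// Pointer-increment form: loop state lives in an induction pointer
// that must be updated every iteration (like the add.s64 on %rd26).
float sum_ptr_form(const std::vector<float>& v, int start, int end) {
    float acc = 0.0f;
    const float* p = v.data() + start;   // induction pointer
    for (int i = start; i < end; ++i) {
        acc += *p;
        ++p;                              // per-iteration pointer update
    }
    return acc;
}

// Index form: the address is recomputed from the loop index alone,
// so a hardware table only needs (start, end, index) per thread.
float sum_index_form(const std::vector<float>& v, int start, int end) {
    float acc = 0.0f;
    for (int i = start; i < end; ++i)
        acc += v[i];                      // address derived from i
    return acc;
}
```

Both forms compute the same sum; the second is the shape the hardware table can drive directly.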
### Compare to Software Solution
additional space required by the software solution:
* `length[]`
    * records the number of iterations each thread in a warp needs
    * additional space : O(V)
* `start[]`
    * records the starting position of each thread in a warp
    * additional space : O(V)
* `origin_thread[]`
    * records which thread each edge comes from
    * additional space : O(E)

additional space required by the hardware solution:
* `index_table`
    * records the start and end of each thread in a warp
    * additional space : O(V)

1. The hardware solution generates the table at run time; once a warp finishes its computation the space can be released, so the actual space used is much smaller.
2. The software solution incurs preprocessing-kernel overhead.
3. The software solution must first store its data (start, length, ...) in global memory, whereas the hardware solution extends a hardware component to hold the table generated at run time.
4. The software solution performs more memory accesses.
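As a host-side illustration of the software solution's preprocessing kernel, the sketch below derives the three helper arrays listed above from CSR row offsets. The array names follow the notes; `row_offsets` and the struct are assumptions made for the sketch:

```cpp
#include <cassert>
#include <vector>

// The three arrays the software solution must precompute and keep
// in global memory.  Names follow the notes; layout is illustrative.
struct SoftwareTables {
    std::vector<int> length;         // iterations per thread, O(V)
    std::vector<int> start;          // start offset per thread, O(V)
    std::vector<int> origin_thread;  // owning thread of each edge, O(E)
};

// Build the tables from CSR row offsets (row_offsets has V+1 entries).
SoftwareTables build_tables(const std::vector<int>& row_offsets) {
    int V = static_cast<int>(row_offsets.size()) - 1;
    SoftwareTables t;
    t.length.resize(V);
    t.start.resize(V);
    t.origin_thread.resize(row_offsets.back());
    for (int v = 0; v < V; ++v) {
        t.start[v] = row_offsets[v];
        t.length[v] = row_offsets[v + 1] - row_offsets[v];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            t.origin_thread[e] = v;   // edge e belongs to thread v
    }
    return t;
}
```

This is the extra pass (and the extra global-memory traffic) that the hardware `index_table` avoids.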
## Related paper list
* hardware
    * Graph-Waving architecture: Efficient execution of graph applications on GPUs
        * maps each node to a warp instead of a thread
        * requires changes to the user's code
        * needs to eliminate duplicate scalar instructions
    * Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
        * dynamically launches "thread blocks" instead of dynamically launching "kernels", alleviating kernel-launch overhead and increasing parallelism
        * does not fully utilize the SIMD lanes when a node's degree is small
    * GraphPEG: Accelerating Graph Processing on GPUs
        * uses a "hardware function" to collect the needed edges into "shared memory", then processes the collected edges
        * solves all of the aforementioned problems, but adds a lot of extra hardware
    * Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists
    * Active Thread Compaction for GPU Path Tracing
* software
    * gunrock
    * Accelerating CUDA Graph Algorithms at Maximum Warp
        * [paper](https://stanford-ppl.github.io/website/papers/ppopp070a-hong.pdf)
    * Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths
        * [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6877269)
    * Scalable GPU Graph Traversal
        * [paper](https://mgarland.org/files/papers/gpubfs.pdf)
* benchmark
    * Pannotia: Understanding Irregular GPGPU Graph Applications
        * [paper](https://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf)
## problems with the equivalent software solution
* atomic
## solve active/inactive

## graph
[pannotia](https://github.com/pannotia/pannotia/wiki)
[datasets](https://www.cc.gatech.edu/dimacs10/archive/coauthor.shtml)
## PTX
### SPMV
* original code:
```
cvta.to.global.u64 %rd2, %rd13;
cvta.to.global.u64 %rd17, %rd12;
cvta.to.global.u64 %rd18, %rd10;
mul.wide.s32 %rd19, %r15, 4; #move
add.s64 %rd26, %rd17, %rd19; #move
add.s64 %rd25, %rd18, %rd19; #move
BB0_3:
.pragma "nounroll";
ld.global.u32 %r14, [%rd26]; # PC=0x100
mul.wide.s32 %rd20, %r14, 4;
add.s64 %rd21, %rd2, %rd20;
ld.global.f32 %f4, [%rd21];
ld.global.f32 %f5, [%rd25];
mul.f32 %f6, %f5, %f4;
atom.shared.add.f32 %f7, [%r3], %f6;
add.s64 %rd26, %rd26, 4;
add.s64 %rd25, %rd25, 4;
add.s32 %r15, %r15, 1;
setp.lt.s32 %p3, %r15, %r5;
@%p3 bra BB0_3; # PC=0x158
```
* modified code
```
cvta.to.global.u64 %rd2, %rd13;
cvta.to.global.u64 %rd17, %rd12;
cvta.to.global.u64 %rd18, %rd10; # PC=0x0e0
add.s32 %r1, %r5, %r15; # PC=0x0e8
BB0_3:
.pragma "nounroll";
mul.wide.s32 %rd19, %r15, 4; # PC=0x0f0
add.s64 %rd26, %rd17, %rd19;
add.s64 %rd25, %rd18, %rd19;
ld.global.u32 %r14, [%rd26];
mul.wide.s32 %rd20, %r14, 4;
add.s64 %rd21, %rd2, %rd20;
ld.global.f32 %f4, [%rd21];
ld.global.f32 %f5, [%rd25];
mul.f32 %f6, %f5, %f4;
atom.shared.add.f32 %f7, [%r3], %f6;
@%p3 bra BB0_3; # PC=0x140
```
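For reference, the loop above is the inner loop of a CSR sparse matrix-vector product. A hedged single-threaded C++ reconstruction (variable names are assumptions, and the `atom.shared.add` accumulation is replaced by a plain add for the sketch):

```cpp
#include <cassert>
#include <vector>

// One row of CSR SpMV: acc = sum over the row's edges of
// val[e] * x[col[e]].  Mirrors the BB0_3 loop: load col[e],
// load x[col[e]], load val[e], multiply, accumulate.
float spmv_row(const std::vector<int>& col,
               const std::vector<float>& val,
               const std::vector<float>& x,
               int start, int end) {
    float acc = 0.0f;
    for (int e = start; e < end; ++e)
        acc += val[e] * x[col[e]];
    return acc;
}
```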
### Pagerank
* original code :
```c=
#pragma unroll 1
for (int edge = start; edge < end; edge++) {
int nid = col[edge];
atomicAdd(&page_rank2[nid], page_rank1[tid] / (float)(end - start));
}
```
```=
mul.wide.s32 %rd16, %r16, 4;
add.s64 %rd19, %rd14, %rd16; //0x0e8 create table
BB0_5:
.pragma "nounroll";
ld.global.u32 %r14, [%rd19]; //0x0f0 col[edge]
mul.wide.s32 %rd17, %r14, 4;
add.s64 %rd18, %rd2, %rd17;
ld.global.f32 %f2, [%rd3];
div.rn.f32 %f3, %f2, %f1;
atom.global.add.f32 %f4, [%rd18], %f3;
add.s64 %rd19, %rd19, 4; //0x120 edge++
add.s32 %r16, %r16, 1; //0x128 loop counter
setp.lt.s32 %p4, %r16, %r15; //0x130 loop counter < iteration
@%p4 bra BB0_5; //0x138 branch
```
* when execution reaches the instruction at 0x0e8, switch the execution mode (the table has not been built yet) :
```=
sub.s32 %r13, %r15, %r16; //PC = 0x0d0
cvt.rn.f32.s32%f1, %r13; //PC = 0x0d8
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd16, %r16, 4; //PC = 0x0e0 r16 : index, taken from the table
add.s64 %rd19, %rd14, %rd16; //PC = 0x0e8 rd14 : base address
ld.global.u32 %r14, [%rd19];
mul.wide.s32 %rd17, %r14, 4;
add.s64 %rd18, %rd2, %rd17;
ld.global.f32 %f2, [%rd3];
div.rn.f32 %f3, %f2, %f1;
atom.global.add.f32 %f4, [%rd18], %f3; //PC = 0x118
add.s32 %r16, %r16, 1;
setp.lt.s32 %p4, %r16, %r15;
@%p4 bra BB0_5; //PC = 0x130
```
* %r16 : start, %r15 : end, build the table
* fetch the value of rd19 according to the table
* information gpgpu-sim needs :
    * start index
    * end index
    * last codeblock PC (line 11)
    * exit PC (line 15)
```=
//PC = 0x0c8
sub.s32 %r13, %r15, %r16; //PC = 0x0d0
cvt.rn.f32.s32%f1, %r13; //PC = 0x0d8 create table
add.s32 %r1, %r15, %r16; //PC = 0x0e0 set index
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd16, %r16, 4; //PC = 0x0e8 r16 : index, taken from the table
add.s64 %rd19, %rd14, %rd16; //PC = 0x0f0 rd14 : base address
ld.global.u32 %r14, [%rd19]; //PC = 0x0f8 rd19 redirect???
mul.wide.s32 %rd17, %r14, 4; //PC = 0x100
add.s64 %rd18, %rd2, %rd17; //PC = 0x108
ld.global.f32 %f2, [%rd3]; //PC = 0x110 rd3 redirect???
div.rn.f32 %f3, %f2, %f1; //PC = 0x118
atom.global.add.f32 %f4, [%rd18], %f3; //PC = 0x120
@%p4 bra BB0_5; //PC = 0x128
//PC = 0x130
```
read reg:
write reg: rd16, rd19, r14, rd17, rd18, f2, f3, f4
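The table-driven control flow sketched above can be modeled behaviourally: per lane, the table holds the current index, the end bound, the loop's last PC, and the exit PC, replacing the deleted add/setp/bra sequence. A toy single-lane model (the 8-byte instruction pitch and field names are assumptions, not the gpgpu-sim implementation):

```cpp
#include <cassert>

// Per-lane entry of the hypothetical hardware index table.
struct IndexTableEntry {
    int index;     // current loop index (start on table creation)
    int end;       // loop bound
    int last_pc;   // PC of the last instruction in the loop body
    int exit_pc;   // PC of the first instruction after the loop
};

// Returns the next PC for a lane that just executed `pc`.  At the
// body's last PC the table advances the index and decides whether
// to branch back, standing in for the removed add/setp/bra.
int next_pc(IndexTableEntry& e, int pc, int loop_head_pc) {
    if (pc != e.last_pc)
        return pc + 8;                 // assumed fixed 8-byte encoding
    ++e.index;                         // hardware-side index update
    return (e.index < e.end) ? loop_head_pc : e.exit_pc;
}
```

When the index reaches `end`, the entry can be destroyed and its space released, which is the footprint advantage noted in the comparison above.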
### Pagerank_spmv
```
cvta.to.global.u64 %rd2, %rd12;
cvta.to.global.u64 %rd17, %rd10;
cvta.to.global.u64 %rd18, %rd11;
mul.wide.s32 %rd19, %r15, 4; #move
add.s64 %rd26, %rd17, %rd19; #move
add.s64 %rd25, %rd18, %rd19; #move
add.s32 %r1, %r4, %r15; #add
BB2_3:
.pragma "nounroll";
ld.global.u32 %r14, [%rd26]; #PC=0x4e0
mul.wide.s32 %rd20, %r14, 4; #PC=0x4e8
add.s64 %rd21, %rd2, %rd20; #PC=0x4f0
ld.global.f32 %f4, [%rd21]; #PC=0x4f8
ld.global.f32 %f5, [%rd25]; #PC=0x500
mul.f32 %f6, %f5, %f4; #PC=0x508
atom.shared.add.f32 %f7, [%r5], %f6; #PC=0x510
add.s64 %rd26, %rd26, 4; #PC=0x518 delete
add.s64 %rd25, %rd25, 4; #PC=0x520 delete
add.s32 %r15, %r15, 1; #PC=0x528 delete
setp.lt.s32 %p3, %r15, %r4; #PC=0x530 delete
@%p3 bra BB2_3; #PC=0x538
ld.shared.f32 %f10, [%r5];
BB2_5:
cvta.to.global.u64 %rd22, %rd13;
shl.b64 %rd23, %rd1, 2;
add.s64 %rd24, %rd22, %rd23;
ld.global.f32 %f8, [%rd24];
add.f32 %f9, %f10, %f8;
st.global.f32 [%rd24], %f9;
BB2_6:
ret;
```
#### parallel
```
cvta.to.global.u64 %rd2, %rd12;
cvta.to.global.u64 %rd17, %rd10;
cvta.to.global.u64 %rd18, %rd11; #PC=0x4c0
add.s32 %r1, %r4, %r15; #PC=0x4c8 add
BB2_3:
.pragma "nounroll";
mul.wide.s32 %rd19, %r15, 4; #PC=0x4d0 move
add.s64 %rd26, %rd17, %rd19; #PC=0x4d8 move
add.s64 %rd25, %rd18, %rd19; #PC=0x4e0 move
ld.global.u32 %r14, [%rd26]; #PC=0x4e8
mul.wide.s32 %rd20, %r14, 4; #PC=0x4f0
add.s64 %rd21, %rd2, %rd20; #PC=0x4f8
ld.global.f32 %f4, [%rd21]; #PC=0x500
ld.global.f32 %f5, [%rd25]; #PC=0x508
mul.f32 %f6, %f5, %f4; #PC=0x510
atom.shared.add.f32 %f7, [%r5], %f6; #PC=0x518
@%p3 bra BB2_3; #PC=0x520
ld.shared.f32 %f10, [%r5];
BB2_5:
cvta.to.global.u64 %rd22, %rd13;
shl.b64 %rd23, %rd1, 2;
add.s64 %rd24, %rd22, %rd23;
ld.global.f32 %f8, [%rd24];
add.f32 %f9, %f10, %f8;
st.global.f32 [%rd24], %f9;
BB2_6:
ret;
```
### BC
```
//PC=0x0d0
BB0_4:
setp.ge.s32 %p4, %r22, %r21; //PC=0x0d8
@%p4 bra BB0_11; //PC=0x0e0
cvta.to.global.u64 %rd18, %rd11; //PC=0x0e8
cvta.to.global.u64 %rd19, %rd9; //PC=0x0f0
add.s32 %r5, %r13, 1; //PC=0x0f8
add.s64 %rd2, %rd18, %rd14; //PC=0x100
cvta.to.global.u64 %rd4, %rd12; //PC=0x108 create table
add.s32 %r1, %r21, %r22; //PC=0x110 extend set index
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd21, %r22, 4; //PC=0x118 move, set register
add.s64 %rd27, %rd19, %rd21; //PC=0x120 move
ld.global.u32 %r7, [%rd27]; //PC=0x128
mul.wide.s32 %rd23, %r7, 4; //PC=0x130
add.s64 %rd6, %rd13, %rd23; //PC=0x138
ld.global.u32 %r23, [%rd6]; //PC=0x140
setp.gt.s32 %p5, %r23, -1; //PC=0x148
@%p5 bra BB0_8; //PC=0x150
mov.u32 %r19, 1; //PC=0x158
st.global.u32 [%rd4], %r19; //PC=0x160 *cont = 1;
add.s32 %r20, %r13, 1; //PC=0x168
st.global.u32 [%rd6], %r20; //PC=0x170 d[w] = dist + 1;
mov.u32 %r23, %r5; //PC=0x178
BB0_8:
setp.ne.s32 %p6, %r23, %r5; //PC=0x180 break point r23: d[w], r5:
@%p6 bra BB0_10; //PC=0x188
add.s64 %rd26, %rd18, %rd23; //PC=0x190
ld.global.f32 %f1, [%rd2]; //PC=0x198
atom.global.add.f32 %f2, [%rd26], %f1; //PC=0x1a0
BB0_10:
add.s32 %r22, %r22, 1; //delete
add.s64 %rd27, %rd27, 4; //delete
setp.lt.s32 %p7, %r22, %r21; //delete
@%p7 bra BB0_6; //PC=0x1a8 finish or not / destroy table
BB0_11:
ret; //PC=0x1b0
```
### BC_WL
#### original
```
cvta.to.global.u64 %rd19, %rd12;
cvta.to.global.u64 %rd20, %rd10;
add.s32 %r6, %r14, 1;
add.s64 %rd2, %rd19, %rd18; #PC=0x100
mul.wide.s32 %rd22, %r23, 4; #PC=0x108 move
add.s64 %rd28, %rd20, %rd22; #PC=0x110 move
cvta.to.global.u64 %rd4, %rd13; #PC=0x118
cvta.to.global.u64 %rd23, %rd11; #PC=0x120
add.s32 %r1, %r22, %r23; #PC=0x128 add
BB0_5:
.pragma "nounroll";
ld.global.u32 %r8, [%rd28]; #PC=0x130
mul.wide.s32 %rd24, %r8, 4;
add.s64 %rd6, %rd23, %rd24;
ld.global.u32 %r24, [%rd6];
setp.gt.s32 %p4, %r24, -1;
@%p4 bra BB0_7;
mov.u32 %r20, 1;
st.global.u32 [%rd4], %r20;
add.s32 %r21, %r14, 1;
st.global.u32 [%rd6], %r21;
mov.u32 %r24, %r6;
BB0_7:
setp.ne.s32 %p5, %r24, %r6;
@%p5 bra BB0_9;
add.s64 %rd27, %rd19, %rd24;
ld.global.f32 %f1, [%rd2];
atom.global.add.f32 %f2, [%rd27], %f1;
BB0_9:
add.s32 %r23, %r23, 1; #remove
add.s64 %rd28, %rd28, 4; #remove
setp.lt.s32 %p6, %r23, %r22; #remove
@%p6 bra BB0_5;
```
#### modified
```
cvta.to.global.u64 %rd19, %rd12;
cvta.to.global.u64 %rd20, %rd10;
add.s32 %r6, %r14, 1; #PC=0x100
add.s64 %rd2, %rd19, %rd18; #PC=0x108
cvta.to.global.u64 %rd4, %rd13; #PC=0x110
cvta.to.global.u64 %rd23, %rd11; #PC=0x118
add.s32 %r1, %r22, %r23; #PC=0x120
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd22, %r23, 4; #PC=0x128
add.s64 %rd28, %rd20, %rd22;
ld.global.u32 %r8, [%rd28];
mul.wide.s32 %rd24, %r8, 4;
add.s64 %rd6, %rd23, %rd24;
ld.global.u32 %r24, [%rd6];
setp.gt.s32 %p4, %r24, -1;
@%p4 bra BB0_7;
mov.u32 %r20, 1;
st.global.u32 [%rd4], %r20;
add.s32 %r21, %r14, 1;
st.global.u32 [%rd6], %r21;
mov.u32 %r24, %r6;
BB0_7:
setp.ne.s32 %p5, %r24, %r6;
@%p5 bra BB0_9;
add.s64 %rd27, %rd19, %rd24;
ld.global.f32 %f1, [%rd2];
atom.global.add.f32 %f2, [%rd27], %f1;
BB0_9:
@%p6 bra BB0_5; #PC=0x1b8
```
### Color
#### original
```
BB0_4:
mov.u32 %r33, -1;
setp.ge.s32 %p4, %r2, %r30;
@%p4 bra BB0_9;
add.s32 %r5, %r30, -1;
mov.u32 %r33, -1;
cvta.to.global.u64 %rd17, %rd7;
mul.wide.s32 %rd18, %r2, 4; //remove
add.s64 %rd28, %rd17, %rd18; //remove
cvta.to.global.u64 %rd3, %rd10;
cvta.to.global.u64 %rd22, %rd8;
mov.u32 %r31, %r2; //create table
add.s32 %r1, %r31, %r30; //add, set index to table
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd18, %r31, 4; //move
add.s64 %rd28, %rd17, %rd18; //move
ld.global.u32 %r25, [%rd28]; //PC=0x130
mul.wide.s32 %rd20, %r25, 4;
add.s64 %rd21, %rd12, %rd20;
ld.global.u32 %r26, [%rd21];
setp.eq.s32 %p5, %r26, -1;
setp.ne.s32 %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;
BB0_7:
mov.u32 %r27, 1;
st.global.u32 [%rd3], %r27;
ld.global.u32 %r28, [%rd28];
mul.wide.s32 %rd23, %r28, 4;
add.s64 %rd24, %rd22, %rd23;
ld.global.u32 %r29, [%rd24];
max.s32 %r33, %r29, %r33;
BB0_8:
add.s64 %rd28, %rd28, 4; //remove
add.s32 %r31, %r31, 1; //remove
setp.lt.s32 %p8, %r31, %r30; //remove r31 : index, r30 : end
@%p8 bra BB0_6;
BB0_9:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd13;
st.global.u32 [%rd27], %r33;
BB0_10:
ret;
```
#### modified
```
BB0_4:
mov.u32 %r33, -1;
setp.ge.s32 %p4, %r2, %r30;
@%p4 bra BB0_9;
add.s32 %r5, %r30, -1;
mov.u32 %r33, -1;
cvta.to.global.u64 %rd17, %rd7;
cvta.to.global.u64 %rd3, %rd10;
cvta.to.global.u64 %rd22, %rd8;
mov.u32 %r31, %r2; //PC=0x118 create table
add.s32 %r1, %r31, %r30; //PC=0x120 add, set index to table
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd18, %r31, 4; //PC=0x128 move
add.s64 %rd28, %rd17, %rd18; //PC=0x130 move
ld.global.u32 %r25, [%rd28]; //PC=0x138
mul.wide.s32 %rd20, %r25, 4; //PC=0x140
add.s64 %rd21, %rd12, %rd20; //PC=0x148
ld.global.u32 %r26, [%rd21]; //PC=0x150
setp.eq.s32 %p5, %r26, -1; //PC=0x158
setp.ne.s32 %p6, %r2, %r5; //PC=0x160
and.pred %p7, %p5, %p6; //PC=0x168
@!%p7 bra BB0_8; //PC=0x170
bra.uni BB0_7; //PC=0x178
BB0_7:
mov.u32 %r27, 1; //PC=0x180
st.global.u32 [%rd3], %r27; //PC=0x188
ld.global.u32 %r28, [%rd28]; //PC=0x190
mul.wide.s32 %rd23, %r28, 4; //PC=0x198
add.s64 %rd24, %rd22, %rd23; //PC=0x1a0
ld.global.u32 %r29, [%rd24]; //PC=0x1a8
max.s32 %r33, %r29, %r33; //PC=0x1b0
BB0_8:
@%p8 bra BB0_6; //PC=0x1b8
BB0_9:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd13;
st.global.u32 [%rd27], %r33;
BB0_10:
ret;
```
#### parallel
original
```
add.s32 %r6, %r32, -1;
cvta.to.global.u64 %rd18, %rd8;
mul.wide.s32 %rd19, %r2, 4; //moved
add.s64 %rd29, %rd18, %rd19; //moved
cvta.to.global.u64 %rd4, %rd11;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r33, %r2; //
add.s32 %r1, %r32, %r33; //PC=, add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd19, %r33, 4; //PC=, move
add.s64 %rd29, %rd18, %rd19; //PC=, move
ld.global.u32 %r22, [%rd29]; //PC=0x158
mul.wide.s32 %rd21, %r22, 4; //60
add.s64 %rd22, %rd13, %rd21; //68
ld.global.u32 %r23, [%rd22]; //70
setp.eq.s32 %p5, %r23, -1; //78
setp.ne.s32 %p6, %r2, %r6; //80
and.pred %p7, %p5, %p6; //88
@!%p7 bra BB0_8; //90
bra.uni BB0_7; //98
BB0_7:
mov.u32 %r24, 1; //a0
st.global.u32 [%rd4], %r24; //a8
ld.global.u32 %r25, [%rd29]; //b0
mul.wide.s32 %rd24, %r25, 4; //b8
add.s64 %rd25, %rd23, %rd24; //c0
ld.global.u32 %r26, [%rd25]; //PC=0x1c8
atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d0
BB0_8:
add.s32 %r33, %r33, 1; //PC=0x1d8, remove
add.s64 %rd29, %rd29, 4; //PC=0x1e0, remove
setp.lt.s32 %p8, %r33, %r32; //PC=0x1e8, remove
@%p8 bra BB0_6; //PC=0x1f0
ld.shared.u32 %r34, [%r5]; //PC=0x1f8
BB0_10:
cvta.to.global.u64 %rd26, %rd12;
add.s64 %rd28, %rd26, %rd17;
st.global.u32 [%rd28], %r34;
BB0_11:
ret;
```
modified
```
add.s32 %r6, %r32, -1;
cvta.to.global.u64 %rd18, %rd8;
cvta.to.global.u64 %rd4, %rd11;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r33, %r2; //PC=0x140
add.s32 %r1, %r32, %r33; //PC=0x148, add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd19, %r33, 4; //PC=0x150, move
add.s64 %rd29, %rd18, %rd19; //PC=0x158, move
ld.global.u32 %r22, [%rd29]; //PC=0x160
mul.wide.s32 %rd21, %r22, 4; //PC=0x168
add.s64 %rd22, %rd13, %rd21; //PC=0x170
ld.global.u32 %r23, [%rd22]; //PC=0x178
setp.eq.s32 %p5, %r23, -1; //PC=0x180
setp.ne.s32 %p6, %r2, %r6; //PC=0x188
and.pred %p7, %p5, %p6; //PC=0x190
@!%p7 bra BB0_8; //PC=0x198
bra.uni BB0_7; //PC=0x1a0
BB0_7:
mov.u32 %r24, 1; //PC=0x1a8
st.global.u32 [%rd4], %r24; //PC=0x1b0
ld.global.u32 %r25, [%rd29]; //PC=0x1b8
mul.wide.s32 %rd24, %r25, 4; //PC=0x1c0
add.s64 %rd25, %rd23, %rd24; //PC=0x1c8
ld.global.u32 %r26, [%rd25]; //PC=0x1d0
atom.shared.max.s32 %r31, [%r5], %r26;//PC=0x1d8
BB0_8:
@%p8 bra BB0_6; //PC=0x1e0
ld.shared.u32 %r34, [%r5]; //PC=0x1e8
BB0_10:
cvta.to.global.u64 %rd26, %rd12;
add.s64 %rd28, %rd26, %rd17;
st.global.u32 [%rd28], %r34;
BB0_11:
ret;
```
### Color_wl
#### original
```
add.s32 %r7, %r33, -1;
cvta.to.global.u64 %rd18, %rd8;
mul.wide.s32 %rd19, %r3, 4; #move
add.s64 %rd29, %rd18, %rd19; #move
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd10;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r34, %r3;
add.s32 %r1, %r33, %r34; #add
BB1_5:
.pragma "nounroll";
ld.global.u32 %r23, [%rd29]; #PC=0x348
mul.wide.s32 %rd21, %r23, 4;
add.s64 %rd22, %rd20, %rd21;
ld.global.u32 %r24, [%rd22];
setp.eq.s32 %p4, %r24, -1;
setp.ne.s32 %p5, %r3, %r7;
and.pred %p6, %p4, %p5;
@!%p6 bra BB1_7;
bra.uni BB1_6;
BB1_6:
mov.u32 %r25, 1;
st.global.u32 [%rd3], %r25;
ld.global.u32 %r26, [%rd29];
mul.wide.s32 %rd24, %r26, 4;
add.s64 %rd25, %rd23, %rd24;
ld.global.u32 %r27, [%rd25];
atom.shared.max.s32 %r32, [%r6], %r27;
BB1_7:
add.s32 %r34, %r34, 1; #remove
add.s64 %rd29, %rd29, 4; #remove
setp.lt.s32 %p7, %r34, %r33; #remove
@%p7 bra BB1_5;
```
#### modified
```
add.s32 %r7, %r33, -1;
cvta.to.global.u64 %rd18, %rd8;
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd10;
cvta.to.global.u64 %rd23, %rd9;
mov.u32 %r34, %r3; #PC=0x140
add.s32 %r1, %r33, %r34; #PC=0x148
BB0_5:
.pragma "nounroll";
mul.wide.s32 %rd19, %r34, 4; #PC=0x150
add.s64 %rd29, %rd18, %rd19;
ld.global.u32 %r23, [%rd29];
mul.wide.s32 %rd21, %r23, 4;
add.s64 %rd22, %rd20, %rd21;
ld.global.u32 %r24, [%rd22];
setp.eq.s32 %p4, %r24, -1;
setp.ne.s32 %p5, %r3, %r7;
and.pred %p6, %p4, %p5;
@!%p6 bra BB0_7;
bra.uni BB0_6;
BB0_6:
mov.u32 %r25, 1;
st.global.u32 [%rd3], %r25;
ld.global.u32 %r26, [%rd29];
mul.wide.s32 %rd24, %r26, 4;
add.s64 %rd25, %rd23, %rd24;
ld.global.u32 %r27, [%rd25];
atom.shared.max.s32 %r32, [%r6], %r27;
BB0_7:
@%p7 bra BB0_5; #PC=0x1e0
```
### color max min
#### origin
```
add.s32 %r5, %r52, -1;
cvta.to.global.u64 %rd21, %rd6;
mul.wide.s32 %rd22, %r2, 4; #move
add.s64 %rd37, %rd21, %rd22; #move
cvta.to.global.u64 %rd26, %rd9;
cvta.to.global.u64 %rd27, %rd7;
mov.u32 %r53, %r2; #PC=0x178
add.s32 %r1, %r52, %r53; #add
BB0_6:
.pragma "nounroll";
ld.global.u32 %r27, [%rd37]; #PC=0x180
mul.wide.s32 %rd24, %r27, 4;
add.s64 %rd25, %rd12, %rd24;
ld.global.u32 %r28, [%rd25];
setp.eq.s32 %p5, %r28, -1;
setp.ne.s32 %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;
BB0_7:
mov.u32 %r29, 1;
st.global.u32 [%rd26], %r29;
ld.global.u32 %r30, [%rd37];
mul.wide.s32 %rd28, %r30, 4;
add.s64 %rd29, %rd27, %rd28;
ld.global.u32 %r31, [%rd29];
atom.shared.max.s32 %r36, [%r24], %r31;
ld.global.u32 %r37, [%rd37];
mul.wide.s32 %rd30, %r37, 4;
add.s64 %rd31, %rd27, %rd30;
ld.global.u32 %r38, [%rd31];
atom.shared.min.s32 %r41, [%r26], %r38;
BB0_8:
add.s32 %r53, %r53, 1; #delete
add.s64 %rd37, %rd37, 4; #delete
setp.lt.s32 %p8, %r53, %r52; #delete
@%p8 bra BB0_6; #PC=0x240
ld.shared.u32 %r55, [%r24];
ld.shared.u32 %r54, [%r26];
```
#### parallel
```
add.s32 %r5, %r52, -1;
cvta.to.global.u64 %rd21, %rd6;
cvta.to.global.u64 %rd26, %rd9;
cvta.to.global.u64 %rd27, %rd7;
mov.u32 %r53, %r2; #PC=0x168
add.s32 %r1, %r52, %r53; #PC=0x170 add
BB0_6:
.pragma "nounroll";
mul.wide.s32 %rd22, %r53, 4; #PC=0x178 move
add.s64 %rd37, %rd21, %rd22; #PC=0x180 move
ld.global.u32 %r27, [%rd37]; #PC=0x
mul.wide.s32 %rd24, %r27, 4;
add.s64 %rd25, %rd12, %rd24;
ld.global.u32 %r28, [%rd25];
setp.eq.s32 %p5, %r28, -1;
setp.ne.s32 %p6, %r2, %r5;
and.pred %p7, %p5, %p6;
@!%p7 bra BB0_8;
bra.uni BB0_7;
BB0_7:
mov.u32 %r29, 1;
st.global.u32 [%rd26], %r29;
ld.global.u32 %r30, [%rd37];
mul.wide.s32 %rd28, %r30, 4;
add.s64 %rd29, %rd27, %rd28;
ld.global.u32 %r31, [%rd29];
atom.shared.max.s32 %r36, [%r24], %r31;
ld.global.u32 %r37, [%rd37];
mul.wide.s32 %rd30, %r37, 4;
add.s64 %rd31, %rd27, %rd30;
ld.global.u32 %r38, [%rd31];
atom.shared.min.s32 %r41, [%r26], %r38;
BB0_8:
@%p8 bra BB0_6; #PC=0x230
ld.shared.u32 %r55, [%r24];
ld.shared.u32 %r54, [%r26];
```
### mis
#### origin
```
cvta.to.global.u64 %rd4, %rd10;
cvta.to.global.u64 %rd19, %rd9;
mul.wide.s32 %rd20, %r29, 4; #move
add.s64 %rd28, %rd19, %rd20; #move
BB1_6:
.pragma "nounroll";
ld.global.u32 %r9, [%rd28]; #PC=0x210
mul.wide.s32 %rd21, %r9, 4;
add.s64 %rd22, %rd1, %rd21;
ld.global.u32 %r25, [%rd22];
setp.ne.s32 %p5, %r25, -1;
@%p5 bra BB1_8;
add.s64 %rd24, %rd4, %rd21;
ld.global.u32 %r26, [%rd24];
atom.shared.min.s32 %r27, [%r6], %r26;
BB1_8:
add.s32 %r29, %r29, 1; #delete
add.s64 %rd28, %rd28, 4; #delete
setp.lt.s32 %p6, %r29, %r28; #delete
@%p6 bra BB1_6;
ld.shared.u32 %r30, [%r6];
BB1_10:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd18;
st.global.u32 [%rd27], %r30;
BB1_11:
ret;
```
#### modified
```
cvta.to.global.u64 %rd4, %rd10;
cvta.to.global.u64 %rd19, %rd9; #PC=0x1f8
add.s32 %r1, %r28, %r29; #PC=0x200 add
BB1_6:
.pragma "nounroll";
mul.wide.s32 %rd20, %r29, 4; #PC=0x208 move
add.s64 %rd28, %rd19, %rd20; #PC=0x210 move
ld.global.u32 %r9, [%rd28]; #PC=0x218
mul.wide.s32 %rd21, %r9, 4; #PC=0x220
add.s64 %rd22, %rd1, %rd21; #PC=0x228
ld.global.u32 %r25, [%rd22]; #PC=0x230
setp.ne.s32 %p5, %r25, -1; #PC=0x238
@%p5 bra BB1_8; #PC=0x240
add.s64 %rd24, %rd4, %rd21; #PC=0x248
ld.global.u32 %r26, [%rd24]; #PC=0x250
atom.shared.min.s32 %r27, [%r6], %r26; #PC=0x258
BB1_8:
@%p6 bra BB1_6; #PC=0x260
ld.shared.u32 %r30, [%r6]; #PC=0x268
BB1_10:
cvta.to.global.u64 %rd25, %rd11;
add.s64 %rd27, %rd25, %rd18;
st.global.u32 [%rd27], %r30;
BB1_11:
ret;
```
### mis_wl
#### origin
```
cvta.to.global.u64 %rd2, %rd10;
cvta.to.global.u64 %rd3, %rd11;
cvta.to.global.u64 %rd20, %rd9; #PC=0x1f8
mul.wide.s32 %rd21, %r30, 4; #PC=0x200 move
add.s64 %rd29, %rd20, %rd21; #PC=0x208 move
add.s32 %r1, %r29, %r30; #add
BB1_5:
.pragma "nounroll";
ld.global.u32 %r10, [%rd29]; #PC=0x210
mul.wide.s32 %rd22, %r10, 4;
add.s64 %rd23, %rd3, %rd22;
ld.global.u32 %r26, [%rd23];
setp.ne.s32 %p4, %r26, -1;
@%p4 bra BB1_7;
add.s64 %rd25, %rd2, %rd22;
ld.global.u32 %r27, [%rd25];
atom.shared.min.s32 %r28, [%r7], %r27;
BB1_7:
add.s32 %r30, %r30, 1; #remove
add.s64 %rd29, %rd29, 4; #remove
setp.lt.s32 %p5, %r30, %r29; #remove
@%p5 bra BB1_5;
```
#### modified
```
cvta.to.global.u64 %rd2, %rd10; #PC=0x1e8
cvta.to.global.u64 %rd3, %rd11; #PC=0x1f0
cvta.to.global.u64 %rd20, %rd9; #PC=0x1f8
add.s32 %r1, %r29, %r30; #PC=0x200
BB1_5:
.pragma "nounroll";
mul.wide.s32 %rd21, %r30, 4; #PC=0x208
add.s64 %rd29, %rd20, %rd21;
ld.global.u32 %r10, [%rd29];
mul.wide.s32 %rd22, %r10, 4;
add.s64 %rd23, %rd3, %rd22;
ld.global.u32 %r26, [%rd23];
setp.ne.s32 %p4, %r26, -1;
@%p4 bra BB1_7;
add.s64 %rd25, %rd2, %rd22;
ld.global.u32 %r27, [%rd25];
atom.shared.min.s32 %r28, [%r7], %r27;
BB1_7:
@%p5 bra BB1_5; #PC=0x260
```
### sssp
#### origin
```
cvta.to.global.u64 %rd19, %rd11;
cvta.to.global.u64 %rd20, %rd10;
mul.wide.s32 %rd21, %r24, 4; #move
add.s64 %rd28, %rd20, %rd21; #move
add.s64 %rd27, %rd19, %rd21; #move
BB0_3:
.pragma "nounroll";
ld.global.u32 %r19, [%rd28]; #PC=0x100
mul.wide.s32 %rd22, %r19, 4;
add.s64 %rd23, %rd17, %rd22;
ld.global.u32 %r20, [%rd23];
ld.global.u32 %r21, [%rd27];
add.s32 %r22, %r20, %r21;
atom.shared.min.s32 %r23, [%r6], %r22;
add.s64 %rd28, %rd28, 4; #delete
add.s64 %rd27, %rd27, 4; #delete
add.s32 %r24, %r24, 1; #delete
setp.lt.s32 %p3, %r24, %r4; #delete
@%p3 bra BB0_3;
ld.shared.u32 %r25, [%r6];
BB0_5:
cvta.to.global.u64 %rd24, %rd13;
shl.b64 %rd25, %rd1, 2;
add.s64 %rd26, %rd24, %rd25;
st.global.u32 [%rd26], %r25;
BB0_6:
ret;
```
#### modified
```
cvta.to.global.u64 %rd19, %rd11;
cvta.to.global.u64 %rd20, %rd10; #PC=0x0e0
add.s32 %r1, %r4, %r24; #PC=0x0e8 add
BB0_3:
.pragma "nounroll";
mul.wide.s32 %rd21, %r24, 4; #PC=0x0f0 move
add.s64 %rd28, %rd20, %rd21; #PC=0x0f8 move
add.s64 %rd27, %rd19, %rd21; #PC=0x100 move
ld.global.u32 %r19, [%rd28]; #PC=0x108
mul.wide.s32 %rd22, %r19, 4; #PC=0x110
add.s64 %rd23, %rd17, %rd22; #PC=0x118
ld.global.u32 %r20, [%rd23];
ld.global.u32 %r21, [%rd27];
add.s32 %r22, %r20, %r21;
atom.shared.min.s32 %r23, [%r6], %r22;
@%p3 bra BB0_3; #PC=0x140
ld.shared.u32 %r25, [%r6];
BB0_5:
cvta.to.global.u64 %rd24, %rd13;
shl.b64 %rd25, %rd1, 2;
add.s64 %rd26, %rd24, %rd25;
st.global.u32 [%rd26], %r25;
BB0_6:
ret;
```
---
## outcome
[spreadsheet](https://docs.google.com/spreadsheets/d/197LhQm2fDLeWsN5sECN_cid8bA1nde6RrRPaRqrWZp8/edit#gid=0)
### normal
* pagerank
* speedup : 2.02x
* iterations : 26.6%
* memory access number: 62.9%
* BFS
* speedup : 1.08x
* iterations : 84.9%
* memory access number: 41.4%
* Color Max(pagerank graph)
* speedup : 2.32x
* iterations : 26.6%
* memory access number: 15.2%
* MIS(pagerank graph)
* speedup : 1.91x
* iterations : 26.6%
* memory access number: 21.6%
* SSSP
* speedup : 1.97x
* iterations : 26.6%
* memory access number: 13.1%
### bypass l1 cache
* Color Max
* speedup : 1.49x
* iterations : 99.9%
* memory access number: 66.2%
### working list
* BFS
* gpgpu-sim bug in global atomic increment
* possibly because an atomic's result is not visible immediately
* the wl code has a bug
* speedup : x
* iterations : %
* memory access number: %
---
## Run Benchmarks
[experiment spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1338197220)
[professor's spreadsheet](https://docs.google.com/spreadsheets/d/14ibl-NsZcrZzOVNC3gcKF7Ow2tnZt8DKzSlhPU888ak/edit#gid=283140300)

### Loop Unroll
```cpp=
#pragma unroll 1
for(int i=start; i<end; i++){
...
}
```
* `#pragma unroll N` tells the compiler how many copies of the loop body to unroll
* N=1 keeps a single copy, i.e. it **turns off** loop unrolling
### atomic operation
* instructions.cc
    * atom_impl
    * thread->m_last_dram_callback.function_wt
* cuda-sim.cc
    * ptx_exec_inst
    * inst.add_callback(lane_id, last_callback().function_wt, last_callback().instruction, this, threads, true /*atomic*/);
* abstract_hardware_model.cc
    * do_atomic
    * cb.function_wt(cb.instruction, cb.thread, cb.threads);
* abstract_hardware_model.h
    * add_callback
    * dram_callback_t
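A rough stand-in for the callback chain above: `atom_impl` records a function pointer and its arguments in a `dram_callback_t`, and `do_atomic` later invokes it when the memory access completes. The sketch below mimics that flow with simplified types (the real gpgpu-sim signatures differ):

```cpp
#include <cassert>

// Simplified dram_callback_t: a function pointer plus its argument.
// In gpgpu-sim the record carries the instruction and thread instead.
struct dram_callback_sketch {
    void (*function)(int*);   // stand-in for function_wt
    int* data;                // stand-in for instruction/thread args
};

// The deferred atomic body that function_wt would point at.
void atomic_add_one(int* p) { *p += 1; }

// add_callback registers the record; do_atomic fires it later,
// which is why an atomic's result is not visible immediately.
int run_deferred_atomic(int initial) {
    int cell = initial;
    dram_callback_sketch cb{atomic_add_one, &cell};  // add_callback
    cb.function(cb.data);                            // do_atomic
    return cell;
}
```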
### 1. gpgpusim.config
```
#lemonsien
-gpgpu_perfect_mem 0
-gpgpu_concurrent_kernel_sm 1
-scheduler_kind 2
-partition_method 1
...
...
# high level architecture configuration
-gpgpu_n_clusters 30
...
...
# Device Limits
-gpgpu_kernel_launch_latency 0
```
* cache allocation policy : must be changed to on-miss for hit-reserved to be handled correctly
    * otherwise every access is treated as a miss
* -scheduler_kind
    * 0 : origin
    * 1 : paver
    * 2 : paver2
* -partition_method
    * 0 : BCS
    * 1 : k-way
    * 2 : recursive (not working)
### 2. Run
```
#pannotia/fw
./fw ../../../../data_dirs/pannotia/fw/data/256_16384.gr &> output_memstall &
```
```
#pannotia/bc
./bc ../../../../data_dirs/pannotia/bc/data/1k_128k.gr &> output_paver_30groups_MST_ideal &
```
```
#pannotia/color_max
./color_max ../../../../data_dirs/pannotia/color_max/data/ecology1.graph 1 &> output_paver_30groups_MST_ideal &
```
```
#pannotia/sssp
./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 1 &> output_paver_30groups_MST_ideal &
```
```
#pannotia/pagerank
./pagerank ../../../../data_dirs/pannotia/pagerank/data/coAuthorsDBLP.graph 1 &> output_paver2 &
```
```
#pannotia/sssp
./sssp ../../../../data_dirs/pannotia/sssp/data/USA-road-d.NW.gr 0 &> output_memstall2 &
```
```
#lonestar/mst
./run &> output_paver_30groups_MST_ideal &
```
```
#lonestar/sssp
./sssp ../../../../../data_dirs/cuda/lonestargpu-2.0/inputs/USA-road-d.FLA.gr &> output_memstall2 &
```
```
#mis
./mis data/ecology1.graph 1 &> output_memstall &
```
## Git
http://blog.gogojimmy.net/2012/02/29/git-scenario/
```
git commit -am "comments"
git push -u origin branch_name
git pull origin branch_name
```
#### Log in
```
git config --global user.name "<user name>"
git config --global user.email "<email>"
```
check configure:
```
git config --list
or
cat ~/.gitconfig
```
#### Check status
```
git status
```
#### View commit history
```
git log
```
#### Add new (untracked) files
```
git add file_name
```
Add all files at once:
```
git add .
```
#### Commit (records a message for the changed files)
```
git commit -m "commit message"
```
After `git add .`, skip the editor and commit in one step:
```
git commit -am "commit message"
```
#### Branch
Create a new branch:
```
git branch new_branch
```
List existing branches:
```
git branch
```
Switch branches:
```
git checkout target_branch
```
#### Merge branch_name into the current branch
```
git merge branch_name
```
## Perfect Memory
* memory access
    * memory is not actually accessed: "**go to memory**" is immediately followed by "**return to cluster**"


### memory stall
* kinds
    * memory_cycle
        * memory access not done
    * shared_cycle
    * constant_cycle
    * texture_cycle
* memory priority
    * ICNT_RC_FAIL = BK_CONF > COAL_STALL
* stall-type enum values
    * NO_RC_FAIL = 0
    * BK_CONF
    * MSHR_RC_FAIL
    * ICNT_RC_FAIL
    * COAL_STALL
    * TLB_STALL
    * DATA_PORT_STALL
    * WB_ICNT_RC_FAIL
    * WB_CACHE_RSRV_FAIL
    * N_MEM_STAGE_STALL_TYPE
## Related papers
### irregular benchmark
* Graph-Waving architecture: Efficient execution of graph applications on GPUs
* [link](https://www.sciencedirect.com/science/article/pii/S0743731520303889)
* node to warp mapping, eliminate duplicate execution, shrink the warp size.
* Fine-Grained Management of Thread Blocks for Irregular Applications
* [link](https://ieeexplore.ieee.org/abstract/document/8988776?casa_token=mPwWuVZ8uekAAAAA:g3-72tUqQEa0RrzLppSWgs1erUiF5kwMhDruh5wyMgWfSJos8Q9pVCEiALDTIbiXQi0t7yPwGj4) 2019
* irregular application(graph)的狀態會隨著時間而改變(因為少數node關聯度很高)
* 用前1個kernel的TB去預測下一個kernel的TB的狀態
* 及時追蹤每一個TB的現在的狀態
* 狀態包括Load-Store unit stall,Floating-Point Unit的使用度
* 根據這些狀態去調整SM中TB的數目,允許preemption
* gain 17% and 13% improvements of throughput over previously proposed static and dynamic scheduling strategies, respectively
* Dynamic Thread Block Launch: A Lightweight Execution Mechanism toSupport Irregular Applications on GPUs
* [link](https://dl.acm.org/doi/epdf/10.1145/2872887.2750393) 2015
* 為了加速irregular application,這篇論文讓thread可以產生child thread block
* 譬如:有些graph的每個thread會用for loop去traverse某個node的neighbor,每個node的neighbor不同時,會很分散。若是create一個child kernel或child thread block去concurrent執行所有的neighbor會很有用
* 1.21x speedup over the original flat implementation and average 1.40x over the implementations using device-kernel launch functionality
* A quantitative study of irregular programs on GPUs
* [link](https://ieeexplore.ieee.org/abstract/document/6402918)
* Pannotia: Understanding irregular GPGPU graph applications
* [link](https://ieeexplore.ieee.org/abstract/document/6704684)
* Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription
* [link](https://ieeexplore.ieee.org/document/9139797)
* working sets of irregular applications can often be divided into cold and hot data structures.
* cold allocations are accessed seldom and sparsely
* hot data structures have dense sequential access
* propose a dynamic page placement strategy for irregular general purpose applications differentiate between hot and cold allocations
* performance improvement of 22% to 78%
* Optimizing non-coalesced memory access for irregular applications with GPU computing
* [link](https://link.springer.com/content/pdf/10.1631/FITEE.1900262.pdf) 2020
* software solution with no hardware extensions to solve dynamic irregularities
* data reordering
* eliminate indirect memory access by copying to a new location and reordering
* Index redirection
* reduce overhead
* for loop拿到kernel外面,run multiple streams of kernels(1 for count, others for remap)
* For loop-carried applications, reordered data are cached for the next iteration or other kernels to eliminate possible redundant data reorganization
* 7.0%–7.17% overall performance improvement
* Managing DRAM Latency Divergence in Irregular GPGPU Applications
* [link](https://www.cs.utah.edu/~rajeev/pubs/sc14.pdf) 2014
* different requests from a warp are serviced together, or in quick succession, without interference from other warps
* warp-aware memory controllers
* boost performance by 10.1%
* Graph Processing on GPUs: A Survey
* [link](https://www.dcs.warwick.ac.uk/~liganghe/papers/ACM-Computing-Surveys-2017.pdf) 2018
* graph processing
* data layout
* memory access pattern
* workload mapping
* GPU programming
* new research challenges and opportunities
* Graph processing on hybrid systems
* Graph processing on new GPU architecture
* Dynamic graph processing on GPUs
* the graph structure can be updated frequently in runtime
* dynamically maintain balanced workload mapping with the rapid changes of graphs
* Machine-learning applications
* Anatomy of GPU Memory System for Multi-Application Execution
* [link](https://www.cs.utexas.edu/users/skeckler/pubs/MEMSYS_2015_Anatomy.pdf) 2015
* study the memory system of GPUs in a concurrently executing **multi-application** environment
* **L2-MPKI** and **attained bandwidth** are the two knobs that can be used to drive memory scheduling decisions for achieving better performance
* Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads
* [link](https://arxiv.org/pdf/2007.07131.pdf)
* 1.33x and 13% improvement in performance and energy savings respectively
* On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
* [link](https://dl.acm.org/doi/pdf/10.1145/1950365.1950408)
* On-GPU Thread-Data Remapping for Branch Divergence Reduction
* [link](https://dl.acm.org/doi/10.1145/3242089)
* Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
* [link](https://ieeexplore.ieee.org/abstract/document/7349653)
* use compiler to transform the original code to the template
* Spare register aware prefetching for graph algorithms on GPUs
* [link](https://ieeexplore.ieee.org/abstract/document/6835970)
* uses spare registers to store prefetched edges
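The data reordering / index redirection idea summarized above (the non-coalesced-memory-access paper) can be sketched on the host side. This is only an illustrative sketch, not the paper's GPU implementation; the function and variable names are made up:

```python
# Illustrative sketch of data reordering: the kernel's indirect access
# out[i] = f(data[idx[i]]) is non-coalesced; gathering data[idx[i]] into a
# contiguous buffer once turns it into a unit-stride (coalesced) read.

def reorder(data, idx):
    """Gather data[idx[i]] into a new contiguous array (one pass)."""
    return [data[j] for j in idx]

data = [10, 20, 30, 40, 50]
idx = [4, 0, 2, 2]                 # scattered, input-dependent indices
contig = reorder(data, idx)        # [50, 10, 30, 30]

# Index redirection: after reordering, iteration i reads contig[i] directly.
out = [x * 2 for x in contig]      # [100, 20, 60, 60]
```

For loop-carried applications, `contig` would be cached and reused by later iterations or kernels, as the paper notes.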
### memory access divergence
* DaCache: Memory Divergence-Aware GPU Cache Management
### pre-launch (inter-kernel)
* BlockMaestro: Enabling Programmer-Transparent Task-based Execution in GPU Systems
* [link](https://ieeexplore.ieee.org/document/9499869)
* uses a bipartite graph to find TBs with dependencies (locality), and pre-launches the later TBs to reduce launch time
* can be done entirely by the compiler, so no code changes are needed
* does not work for **input-dependent** cases
### inter-kernel locality
* Inter-kernel reuse-aware thread block scheduling
* group TBs
* align each group of the next kernel with the corresponding group of the previous kernel
* reverse
* work stealing, one TB at a time
### intra-kernel locality
* PAVER: Locality Graph-Based Thread Block Scheduling for GPUs
* [link](https://dl.acm.org/doi/fullHtml/10.1145/3451164) 2021
* uses graph theory to extract the locality among TBs
* uses graph partitioning to split the graph into groups; each group is assigned to one SM
* work stealing steals more than one TB at a time to preserve locality
* does not work for **input-dependent** cases
* [note](https://hackmd.io/q-vSikPLSdmNqHPz1GwQwA?view)
* performance speedup by 29%, 49.1%, and 41.2%
* Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling
* [link](https://ieeexplore.ieee.org/abstract/document/7898501)
* at run time, computes the TB that best matches the TBs currently on the SM as the candidate
* does not work for **input-dependent** cases
* Improving GPGPU resource utilization through alternative thread block scheduling
* [link](https://ieeexplore.ieee.org/abstract/document/6835937?casa_token=jQrF9KO5vLMAAAAA:1_H9ANBV8mdo4iuC_Nz5bmcPPx4aKsjx4l15Y_1WnhdE8RBbm2hvXO6kdiSmRKZfK2NCJIoG_AU)
* workload type
* TB increase, performance increase
* TB increase, performance increase then saturate
* TB increase, performance decrease
* TB increase, performance increase then decrease
* core activity
* ACTIVE
* INACTIVE
* Idle
* Mem_stall
* Core_stall
* the number of TBs in an SM affects performance; the maximum is not necessarily the best
* <img src="https://i.imgur.com/udazh1W.png" width="40%" height="40%" />
* consecutive CTAs are assigned to the same cores
### analyze data locality
* LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs
### warp scheduler
* A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization
### multiple streams scheduler
* Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications
* Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels
* Increasing GPU throughput using kernel interleaved thread block scheduling
### graph
* Practical Optimal Cache Replacement for Graphs
* uses the transpose (CSC<->CSR) to quickly find the cache line whose next access is farthest in the future
* suited to graph-traversal applications (in/out edges)
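The optimal replacement that the paper above approximates is Belady's MIN: evict the line whose next access is farthest in the future. A trace-driven sketch for a fully associative cache (illustrative only; the paper derives next-access times cheaply from the CSC<->CSR transpose instead of scanning the trace as done here):

```python
def belady_misses(trace, capacity):
    """Count misses under optimal (Belady MIN) replacement,
    fully associative cache holding `capacity` lines."""
    n = len(trace)
    # next_use[i]: index of the next access to trace[i]'s line (inf if none).
    next_use = [float("inf")] * n
    last = {}
    for i in range(n - 1, -1, -1):
        next_use[i] = last.get(trace[i], float("inf"))
        last[trace[i]] = i
    cache, nxt, misses = set(), {}, 0
    for i, line in enumerate(trace):
        if line not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the resident line re-used farthest in the future.
                victim = max(cache, key=lambda l: nxt[l])
                cache.remove(victim)
            cache.add(line)
        nxt[line] = next_use[i]  # refresh this line's next-use time
    return misses

print(belady_misses(["A", "B", "C", "A", "B", "A"], 2))  # 4
```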
## Time: 1/21/2022 ~ 1/26/2022
* goal
* sssp_spmv, parboil: spmv
* node to warp mapping
* data driven
## Time: 1/12/2022 ~ 1/20/2022
* progress
* pagerank 5 graph testing
* pagerank spmv
* color max min
## Time: 10/21/2021 ~ 10/28/2021
* child thread block
* change program?
* warp formation
* can do?
* TB formation
## Time: 10/18/2021 ~ 10/25/2021
* understand CSB
* pagerank issues
* compress the sparse matrix
* cuSha
## Time: 10/14/2021 ~ 10/21/2021
* ~~unroll the third level of color~~
* run sssp on NY
* run color on the large graph G3
* port other benchmarks
* ~~write the pagerank results to the spreadsheet~~
* pagerank issues
* atomicAdd ?
* remove ITER
* check perfect memory / cache.nonflush for each kernel
* memory bound issue?
* outcome
* color-max
* slightly faster
* fewer memory accesses, but higher miss rate
* unrolling the node values is even slower; memory accesses decrease further, but the miss rate rises further
* the graph is well balanced
* page-rank
* much slower
* fewer memory accesses, but higher miss rate
* the graph is very unbalanced
## Issue
* race condition (sssp : min)
* possible solution : atomic or synchronization
* the two nested for loops cannot be unrolled
## Time: 10/7/2021 ~ 10/14/2021
* print the post-remap cycle count for each kernel
* run the remap on the GPU (faster for debugging)
* resolve branch divergence (sort vertices by edge count)
* gpgpu-sim_simulations/benchmarks/src/cuda/rodinia/3.1/cuda/bfs/
## Time: 9/24/2021 ~ 9/30/2021
check whether memory coalescing is broken
identify where the memory bound actually is
refer to data reorder (non-coalesced)
## Time: 9/6/2021 - 9/12/2021
林孟賢: add the ideal-cache miss rate, per-SM cycle counts (to judge load balance), and per-kernel miss rates
席致寧: add mst to the group, run benchmarks
## Time: 8/23/2021 - 9/5/2021
林孟賢: implement PAVER
席致寧: run benchmarks
## Time: 8/6/2021 - 8/12/2021
* embeddingbag
fixed embeddingbag
cache access pattern of embedding with 2 keys
```
-3221225472 57 => input(indices)
-3221241536 1
-3221241472 7909 => weightFeat(weight)
-3221241408 1
-3221241344 7909 => weightFeat(weight)
-3221241152 1
-3221241088 7445 => weightFeat(weight)
-3221241280 1
-3221241216 7445 => weightFeat(weight)
-3221225536 1
-3221225600 57 => input(indices)
-3221225664 1
-3221225728 57
-3221225792 1
-3221225856 57
-3221225920 1
-3221225984 57
```
* Conclusion
* threads access weightFeat randomly (input-dependent), and the locality disappears as the dictionary grows
* MIS
* print address access counts
* shared/not shared histogram
* Observations
* [spread sheet](https://docs.google.com/spreadsheets/d/12SWUiwHGojaicbE4O3QcscHBK07e-__A8j6FICQJRYQ/edit#gid=1340093626 )
* When locality is high for both shared and not-shared data, our cache partitioning approach has no advantage.
* If running on multiple SMs preserves the shared locality while the not-shared locality disappears, our approach may work. But our shared set is derived from inter-kernel locality, so whether it is effective intra-kernel is hard to say.
* If our method (PCG) can freely count hits, pending hits, and sector misses, then under a 4/6 split it can achieve results comparable to LRU.
* Terminology
* ideal: only counts how many times a cache line is **accessed**
* normal: only counts how many times a cache line is hit (including the first miss), ignoring pending hits, sector misses, etc.
* shared: cache lines accessed by **our tagged PCs** or **high-access-count PCs**
* not shared: the complement of shared
* both: cache lines accessed by both shared and not shared
* Outcome
* comparison between ideal and LRU
* kernel 2
* ideal, miss rate 0.243
<img src="https://i.imgur.com/l9ygFRx.png" width="50%" height="50%" />
* LRU, miss rate 0.287
<img src="https://i.imgur.com/lkn0XIW.png" width="50%" height="50%" />
* Conclusion
* LRU is already close to ideal (total miss rate 0.54 vs 0.48)
* compared with shared, the not-shared cache lines are actually not kept well
* bypass l1 cache
* slower
* high-frequency PCs get high priority to stay in the cache (with LRU)
* slower
* ideal algorithm
* block addr should be changed to addr
* sector misses must be taken into account
* miss-rate computation
* tag_array computes the read miss rate (which is exactly what we want)
* individual MISS or SECTOR_MISS counts differ from gpgpu-sim's, but the totals match (so it makes no difference in our case)
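Our reading of the "ideal" metric above, as a quick sketch: rank cache lines by access count, assume the top `capacity` lines stay resident for the whole kernel (one cold miss each), and count every other access as a miss. Illustrative only; it ignores the addr-vs-block-addr and sector-miss corrections noted above:

```python
from collections import Counter

def ideal_miss_rate(trace, capacity):
    """Frequency-based 'ideal' bound: only considers how many
    times each cache line is accessed, as defined in the notes."""
    counts = Counter(trace)
    # Keep the `capacity` most-accessed lines resident for the whole run.
    kept = {line for line, _ in counts.most_common(capacity)}
    misses = len(kept)  # one cold miss per resident line
    misses += sum(c for line, c in counts.items() if line not in kept)
    return misses / len(trace)

# A is accessed 4x, B 3x, C once; with capacity 2 we keep A and B.
print(ideal_miss_rate(["A"] * 4 + ["B"] * 3 + ["C"], 2))  # 3/8 = 0.375
```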
## Time: 7/30/2021 - 8/5/2021
* split the cache into two parts
* shared (80%)
* not shared (20%)
* replacement policy
* shared
* LIFO
* LRU
* not shared
* LRU
* compared with ideal
* cta scheduler reverse
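The way-partition plan above can be prototyped as a single set split into a shared region (80% of the ways, LIFO or LRU) and a not-shared region (20%, LRU). A minimal sketch; the real change would live in gpgpu-sim's tag_array, and the class name here is made up:

```python
class PartitionedSet:
    """One cache set split into shared and not-shared regions."""
    def __init__(self, ways, shared_frac=0.8, shared_policy="LRU"):
        self.shared_ways = max(1, int(ways * shared_frac))
        self.other_ways = ways - self.shared_ways
        self.shared, self.other = [], []     # last element = most recent
        self.shared_policy = shared_policy

    def access(self, line, is_shared):
        """Return True on hit, False on miss (the line is then inserted)."""
        region, cap = ((self.shared, self.shared_ways) if is_shared
                       else (self.other, self.other_ways))
        if line in region:
            if not (is_shared and self.shared_policy == "LIFO"):
                region.remove(line)          # LRU: refresh recency
                region.append(line)
            return True
        if len(region) >= cap:
            if is_shared and self.shared_policy == "LIFO":
                region.pop()                 # LIFO evicts the newest line
            else:
                region.pop(0)                # LRU evicts the oldest line
        region.append(line)
        return False
```

A 10-way set yields an 8/2 split, matching the 80/20 partition above; replaying the same trace through both policies gives the "compared with ideal" numbers.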
---
* add a counter to each cache line to record hit counts
* set l1 to tag_array
* print cache access
* s1
* 1 cta
* 1 warp
* sector misses must be considered
* future work
* print cta access status
## Time: 7/22/2021 - 7/29/2021
* implemented call/assign by reference/value
* implemented on vec-add; the results match expectations
## Time: 7/19/2021 - 7/26/2021
* 7/16
* basic C syntax OK
* bear did not work, so CUDA code that requires make cannot be compiled yet
* 7/19
* started on the data-member part
* 7/20
* the data-member part mostly works
* found problems with member functions
* 7/21
* restructured the data-member data structure to fit the member-function operations
* started on member functions
* 7/22
* basic member-function cases work
* found that member functions fail to locate data members under inheritance
* locate the inherited class
* TODO
* **declare** to **definition** (getDefinition)
* change from treating everything as pass by reference to distinguishing pass by reference/value
* alias (propagation of variables)
* consider control flow (if, else, etc.)
## Time: 7/12/2021 - 7/19/2021
### 1. Successes Last Week
<!--
* What you managed/plan to accomplish this week - hopefully in line with goals - if not that's okay too.
-->
* 席致寧
* 7/9 ~ 7/16
* [[筆記] build and run gpgpu-sim](https://hackmd.io/WsOKV8MmTfS1-Y5gyrS2UQ?view)
* 7/16 ~ 7/23
* change c_Kernel from constant memory to global memory
* change the for loop to multiple streams
* [[筆記] streams & memory](https://hackmd.io/WTA8Vz_8Slyveitd7JDYMw?view)
* 7/23 ~ 7/30
* [BuildMI()](https://llvm.org/docs/CodeGenerator.html#the-machineinstr-class) : extend machine instruction
* [[筆記]memory](https://hackmd.io/MFf9C4AHRU2z6oy6X9DRXg)
* [[筆記]LLVM](https://hackmd.io/HKv_OZvXSmWvkLXHTu6BaA)
* 7/30 ~ 8/6
* [NVPTXInstrInfo.td](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td)
* 8/6 ~ 8/13
* compiling LLVM
* 8/13 ~ 8/20
* [P-OPT: Practical Optimal Cache Replacement for Graph Analytics Architectures](https://hackmd.io/iCDyMqTNQ8a1DzwtSSV4Jg)
* [Internal of LLVM](https://hackmd.io/bwlGi-ntTxCF1pTvB0N-YA)
### 2. Progress and Problems Last Week
<!--
* What have you done last week ?
* What problems encountered last week ?
* What caused you to be slow on a particular task? What roadblocks did you encounter?
-->
* After discussing with **李睦樂**, we think it is theoretically feasible, but practical issues such as implementation details and whether the clang API supports them are still being explored
* the limitation is that some kernel parameters are only determined at runtime
* for example, in C = A+B, only at runtime do we learn that C and A point to the same object, so at compile time we cannot know that A's value is not read-only
* current idea: give each variable and object a counter, increment it on every write, and finally check which variables qualify
<font color= #DCDCDC>
* allocate for each **variable** a record of its **write count**, **kind (normal, array, pointer, reference)**, and a **pointer (to another variable's record)**
* finally, use each variable's **own record** and the **record it points to** to decide whether it is constant
* a count above 2 means non-constant
* need to trace **call by value**, **call by reference**, etc. (we hit problems identifying the variables we want, because more than one variable may appear on **either side of =**)
* determine the level of each variable on the **right-hand side of =**
* *: level - 1
* &: level + 1
* binary operator: keep the smaller level
* use post-order traversal with a stack to associate the lowest-level variable with the variable on the **left-hand side of =**
* used together with children()
* http://web.mit.edu/freebsd/head/contrib/llvm/tools/clang/lib/AST/StmtPrinter.cpp
</font>
* write instructions include assign, compound assign, and inc/dec
* the written variable is the first variable on the **left-hand side of =** (this may be problematic)
<!--* 用bear -vvvv make這調指令以獲得compile_commands.json-->
* on sp and mst, **bear** and **compiledb** cannot obtain compile_commands.json
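The level rule in the greyed-out notes above (`*` lowers the level by 1, `&` raises it by 1, a binary operator keeps the smaller) can be checked on a toy expression tree. The notes propose a post-order traversal with a stack over the Clang AST; plain recursion below is equivalent, and the tuple-based tree is a made-up stand-in for the AST:

```python
# Toy check of the level rule for variables on the right-hand side of '=':
# '*' (deref) lowers the level by 1, '&' (address-of) raises it by 1,
# and a binary operator keeps the smaller of its operands' levels.

def level(node):
    """Post-order walk computing the indirection level of an expression."""
    kind = node[0]
    if kind == "var":
        return 0
    if kind == "deref":                  # *expr
        return level(node[1]) - 1
    if kind == "addrof":                 # &expr
        return level(node[1]) + 1
    if kind == "binop":                  # a op b: keep the smaller level
        return min(level(node[1]), level(node[2]))
    raise ValueError(f"unknown node kind: {kind}")

# p = &x     -> RHS level 1
print(level(("addrof", ("var", "x"))))                          # 1
# y = *p + z -> RHS level min(-1, 0) == -1, so *p is the association target
print(level(("binop", ("deref", ("var", "p")), ("var", "z"))))  # -1
```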
### 3. Not Resolved Problems
<!--* The checking list of problems you have not resolved yet.-->
* we hit problems identifying the variables we want, because more than one variable may appear on **either side of =**
### 4. Goals for next week
* 席致寧
* shader.cc (/src/gpgpu-sim)
* issue_warp
* build an array inside each TB to collect ld.global addresses, then find the intersection across different TBs
* warp_inst_t (/src/abstract_hardware_model.cc)
* class warp_inst_t()
* add an array to store the addresses
* memory_space_t (/src/abstract_hardware_model.cc)
* is_global()
* 林孟賢
* treat every assign as pass by reference, run it on different CUDA benchmarks to make sure the CUDA code runs successfully, and identify possible problems
* vecadd: ok
* ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables test_case/vector-add.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
* pathfinder:
* ~/test/llvm-project-orig/llvm-project/build/bin/get-constant-variables pathfinder.cu -- clang++ --cuda-gpu-arch=sm_75 -L /usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
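The goal above (collect ld.global addresses per TB, then intersect across TBs) can be prototyped outside the simulator first. Illustrative sketch; in gpgpu-sim the addresses would come from warp_inst_t, and the 128-byte line size is an assumption:

```python
def tb_shared_lines(tb_a_addrs, tb_b_addrs, line_size=128):
    """Intersect the cache lines touched by two TBs' ld.global addresses."""
    lines_a = {addr // line_size for addr in tb_a_addrs}
    lines_b = {addr // line_size for addr in tb_b_addrs}
    return lines_a & lines_b

tb0 = [0x1000, 0x1004, 0x2000]
tb1 = [0x1040, 0x3000]             # 0x1040 falls in the same line as 0x1000
print(sorted(tb_shared_lines(tb0, tb1)))   # [32]
```

A large intersection suggests the two TBs share locality and should be scheduled on the same SM.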