## Team Ordbogen Notes

We are working on a specialized 1.58-bit kernel to realize the full potential of the joint research between teams at Ordbogen A/S and the University of Southern Denmark: [BitNet b1.58 Reloaded](https://arxiv.org/pdf/2407.09527). In short: by employing a quantization-aware training scheme in which the weight matrix can only take on the ternary values {-1, 0, 1}, we can reduce the matrix multiplication in linear layers to simple additions. Further, each weight, originally a 16-bit value, can be represented with 2 bits, allowing us to pack them in sets of four into a single INT8. This reduces the memory footprint of linear layers by 75%, but requires unpacking the weights in an online manner during computation within the kernel (see the packing sketch under Code sketches at the end of these notes).

## Oct 29

### First focus points

Implement new funny stuff and investigate the potential of NVIDIA Nsight Compute.

- Peter has started refactoring the current kernel
- Setting up NVIDIA Nsight Compute both locally and on our own dev servers
- Initial profiling

## Oct 30

- Analyzing kernels with Nsight
- Got a basic test and validation framework up and running
- Got Nsight Compute and Nsight Systems up and running on a server with an NVIDIA A2 GPU
- Nsight Compute reports:
  - **66%** compute throughput and **66%** memory throughput.
  - Estimates a **63%** speedup with efficient memory access
  - Estimates a **28%** speedup with coalesced loads
  - Estimates a **17%** speedup with fused instructions
- Iteration 1: Load weights as 32-bit integers instead of 8-bit. **2%** faster than the original.
- Iteration 2: Read two fp16 values at a time. **19%** faster than the original. (Iterations 1-2 are illustrated in the vectorized-load sketch at the end of these notes.)
- Iteration 3: Changed the block size to 512 to maximize occupancy. **22%** faster than the original.
- Iteration 4: Cache 16 fp16 values in shared memory. **33%** faster than the original. (See the shared-memory sketch at the end of these notes.)
- Currently at **79%** compute throughput and **30%** memory throughput.

Example output from the test app:

![example](https://hackmd.io/_uploads/ry-kpAlZ1e.png)

## Oct 31

### TODOs

- Iteration 5: Cache 64 fp16 values in shared memory, as well as the weights. This ought to reduce the number of memory loads.
- Iteration 6: Try to utilize tensor cores (see the WMMA sketch at the end of these notes).
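
## Code sketches

To make the packing scheme above concrete, here is a minimal sketch of the 2-bit representation and of how the ternary "multiplication" collapses to add/subtract/skip. This is illustrative code, not our actual kernel: the encoding (-1 → 0b00, 0 → 0b01, +1 → 0b10) and the names `pack_ternary`, `unpack_ternary`, and `ternary_mac` are assumptions for the example.

```cuda
#include <cstdio>
#include <cstdint>

// Assumed encoding (the kernel's actual bit layout may differ):
// -1 -> 0b00, 0 -> 0b01, +1 -> 0b10; four 2-bit codes per byte.

// Pack four ternary weights {-1, 0, +1} into one byte.
uint8_t pack_ternary(const int8_t w[4]) {
    uint8_t packed = 0;
    for (int i = 0; i < 4; ++i)
        packed |= static_cast<uint8_t>(w[i] + 1) << (2 * i);
    return packed;
}

// Unpack weight i (0..3) from a packed byte.
__host__ __device__ inline int unpack_ternary(uint8_t packed, int i) {
    return static_cast<int>((packed >> (2 * i)) & 0x3) - 1;
}

// Multiplying by a ternary weight is just add / subtract / skip, which is
// why the matmul in a linear layer reduces to additions.
__host__ __device__ inline float ternary_mac(float acc, int w, float x) {
    if (w == 1)  return acc + x;
    if (w == -1) return acc - x;
    return acc;  // w == 0
}

int main() {
    const int8_t w[4] = {-1, 0, +1, +1};
    const float  x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    uint8_t packed = pack_ternary(w);

    float acc = 0.0f;
    for (int i = 0; i < 4; ++i)
        acc = ternary_mac(acc, unpack_ternary(packed, i), x[i]);

    printf("dot = %.1f (expected -1 + 0 + 3 + 4 = 6.0)\n", acc);
    return 0;
}
```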
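
Iterations 1-2 widen the loads: one 32-bit word carries 16 packed ternary weights, and activations are read as `half2` (two fp16 values per load). The fragment below is a hypothetical GEMV-style sketch of that access pattern, not our actual kernel; the name `ternary_gemv_vectorized` and the layout assumptions (row-major packed weights, `n` divisible by 16, `y` zero-initialized) are ours for the example.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Hypothetical sketch of iterations 1-2: weights are fetched as 32-bit words
// (16 ternary weights each) and activations as half2 (two fp16 per load).
// Assumes row-major packed weights, n % 16 == 0, and y zero-initialized.
__global__ void ternary_gemv_vectorized(const uint32_t* __restrict__ w_packed,
                                        const half2* __restrict__ x,
                                        float* __restrict__ y,
                                        int n) {
    int row = blockIdx.x;  // one block per output row
    float acc = 0.0f;

    for (int word = threadIdx.x; word < n / 16; word += blockDim.x) {
        uint32_t packed = w_packed[row * (n / 16) + word];  // 16 weights
        for (int pair = 0; pair < 8; ++pair) {              // 8 half2 = 16 cols
            half2 xv = x[word * 8 + pair];
            int w0 = (int)((packed >> (4 * pair))     & 0x3) - 1;
            int w1 = (int)((packed >> (4 * pair + 2)) & 0x3) - 1;
            acc += w0 * __low2float(xv) + w1 * __high2float(xv);
        }
    }

    // Partial sums from all threads in the block end up in y[row]; a real
    // kernel would use a shared-memory reduction instead of atomics.
    atomicAdd(&y[row], acc);
}
```

With this shape, iteration 3's block-size change is just the launch configuration, e.g. `ternary_gemv_vectorized<<<rows, 512>>>(w, x, y, n);`.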
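
Iteration 4 stages activations in shared memory so every thread in a block can reuse them without re-reading global memory. Again a hedged sketch with hypothetical names and layout; the tile size of 16 matches the note above.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

#define TILE 16  // matches the 16 fp16 values cached per step in iteration 4

// Hypothetical sketch: one thread per output row, activations staged in
// shared memory in tiles of 16 fp16 values. Assumes n % TILE == 0.
__global__ void ternary_gemv_smem(const uint8_t* __restrict__ w_packed,
                                  const half* __restrict__ x,
                                  float* __restrict__ y,
                                  int m, int n) {
    __shared__ half x_tile[TILE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int tile = 0; tile < n; tile += TILE) {
        // Cooperative load: the first TILE threads stage the activations.
        if (threadIdx.x < TILE) x_tile[threadIdx.x] = x[tile + threadIdx.x];
        __syncthreads();

        if (row < m) {
            for (int j = 0; j < TILE; ++j) {
                int col = tile + j;
                uint8_t byte = w_packed[row * (n / 4) + col / 4];
                int w = (int)((byte >> (2 * (col & 3))) & 0x3) - 1;
                acc += w * __half2float(x_tile[j]);
            }
        }
        __syncthreads();  // don't refill the tile while others still read it
    }
    if (row < m) y[row] = acc;
}
```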
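
Iteration 6 is still open. For reference, the standard `nvcuda::wmma` pattern for one 16×16×16 fp16 tile looks like the fragment below (launched with at least one warp, e.g. `<<<1, 32>>>`). The ternary weights would first have to be unpacked into the fp16 `b` operand; whether that pays off is exactly what the TODO is about.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Minimal WMMA fragment: C (16x16, fp32) += A (16x16, fp16) * B (16x16, fp16).
// One warp cooperatively owns the three fragments.
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```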