## Team Ordbogen Notes
We are working on a specialized 1.58-bit kernel to realize the full potential of the joint research between teams at Ordbogen A/S and the University of Southern Denmark, [BitNet b1.58 Reloaded](https://arxiv.org/pdf/2407.09527). In short: by employing a quantization-aware training scheme in which the weight matrix can only take on the ternary values {-1, 0, 1}, we can reduce the matrix multiplication in Linear layers to simple additions. Further, each originally 16-bit weight can be represented with 2 bits, allowing us to pack four weights into a single INT8 value. This reduces the memory footprint of Linear layers by 75%, but requires unpacking the weights on the fly during computation inside the kernel.
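As a rough illustration of the packing scheme described above (a sketch only: the 2-bit code assignment and the helper names `pack_ternary4` / `unpack_ternary4` are made up for this note, not taken from the actual kernel):

```cuda
#include <cstdint>

// Map a ternary weight {-1, 0, 1} to a 2-bit code {0b00, 0b01, 0b10} and back.
// (The exact code assignment is an assumption made for this sketch.)
static inline uint8_t encode_ternary(int8_t w) { return static_cast<uint8_t>(w + 1); }
static inline int8_t  decode_ternary(uint8_t c) { return static_cast<int8_t>(c) - 1; }

// Pack four ternary weights into one byte: w[0] in bits 0-1, w[1] in bits 2-3, ...
uint8_t pack_ternary4(const int8_t w[4]) {
    uint8_t packed = 0;
    for (int i = 0; i < 4; ++i)
        packed |= encode_ternary(w[i]) << (2 * i);
    return packed;
}

// Unpack one byte back into four ternary weights; in the kernel this happens
// on the fly, right before the add/subtract against the activations.
void unpack_ternary4(uint8_t packed, int8_t w[4]) {
    for (int i = 0; i < 4; ++i)
        w[i] = decode_ternary((packed >> (2 * i)) & 0x3);
}
```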
## Oct 29
### First focus points
Implement new kernel ideas and investigate the potential of NVIDIA Nsight Compute.
- Peter has started refactoring the current kernel
- Setting up NVIDIA Nsight Compute both locally and on our own dev servers - initial profiling
## Oct 30
- Analyzing kernels with Nsight Compute
- Got basic test and validation framework up and running
- Got Nsight Compute and Nsight Systems up and running on a server with an NVIDIA A2 GPU
- Nsight Compute reports:
- **66%** compute throughput and **66%** memory throughput.
  - Estimates **63%** speedup with efficient memory access
- Estimates **28%** speedup with coalesced loads
- Estimates **17%** speedup with fused instructions
- Iteration 1: Load weights as 32-bit integers instead of 8-bit. **2%** faster than orig.
- Iteration 2: Read two fp16 values at a time. **19%** faster than orig.
- Iteration 3: Changed block size to 512 to maximize occupancy. **22%** faster than orig.
- Iteration 4: Cache 16 fp16 values in shared memory. **33%** faster than orig. (A rough sketch of the ideas behind iterations 1, 2 and 4 follows after this list.)
- Currently at **79%** compute throughput and **30%** memory throughput.
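A rough sketch of the direction iterations 1, 2 and 4 point in (illustrative only: the kernel name, the GEMV formulation y = W·x, the tile size, and the assumption that K is a multiple of 64 are made up for this note, not taken from the actual kernel):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Ternary-weight GEMV sketch: y = W * x, where W is M x K with ternary entries
// packed 16 per uint32_t, and x, y are fp16. Each thread computes one output row.
// Assumes K is a multiple of 64 and that the weight rows are 4-byte aligned.
__global__ void ternary_gemv_kernel(const uint32_t* __restrict__ w_packed, // M x (K/16)
                                    const half*     __restrict__ x,        // K activations
                                    half*           __restrict__ y,        // M outputs
                                    int M, int K) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;

    // Iteration 4 idea: stage a tile of activations in shared memory so all
    // threads in the block reuse the same global loads.
    constexpr int TILE = 64;
    __shared__ __align__(4) half x_tile[TILE];

    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Iteration 2 idea: read two fp16 at a time (half2) when filling the tile.
        for (int i = threadIdx.x * 2; i < TILE; i += blockDim.x * 2) {
            *reinterpret_cast<half2*>(&x_tile[i]) =
                *reinterpret_cast<const half2*>(&x[k0 + i]);
        }
        __syncthreads();

        if (row < M) {
            // Iteration 1 idea: one 32-bit load yields 16 packed ternary weights.
            for (int k = 0; k < TILE; k += 16) {
                uint32_t packed = w_packed[(size_t)row * (K / 16) + (k0 + k) / 16];
                for (int j = 0; j < 16; ++j) {
                    int code = (packed >> (2 * j)) & 0x3;      // 2-bit code
                    float xv = __half2float(x_tile[k + j]);
                    // Ternary weights turn the multiply into add / subtract / skip.
                    if (code == 2)      acc += xv;             // weight +1
                    else if (code == 0) acc -= xv;             // weight -1
                                                               // code 1 -> weight 0
                }
            }
        }
        __syncthreads();
    }
    if (row < M) y[row] = __float2half(acc);
}

// Iteration 3: launch with 512 threads per block, e.g.:
// ternary_gemv_kernel<<<(M + 511) / 512, 512>>>(w_packed, x, y, M, K);
```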
Example output from test app:

## Oct 31
### TODOs
- Iteration 5: Cache 64 fp16 values in shared memory, along with the weights. This ought to reduce the number of global memory loads.
- Iteration 6: Try to utilize tensor cores (see the sketch below).
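For iteration 6, one possible starting point is the CUDA `wmma` API. This sketch assumes the packed ternary weights are first unpacked into an fp16 tile; whether that pays off compared to the add-only approach is exactly what the experiment should answer.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Minimal wmma usage: one warp multiplies a 16x16 fp16 tile of (unpacked) weights
// with a 16x16 fp16 tile of activations and accumulates into fp32.
// a/b point to the input tiles, c to the output tile; lda/ldb/ldc are leading dims.
__global__ void wmma_tile_sketch(const half* a, const half* b, float* c,
                                 int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, lda);   // weights, unpacked to fp16 beforehand
    wmma::load_matrix_sync(b_frag, b, ldb);   // activations
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}

// Launch with one warp per 16x16 output tile, e.g.:
// wmma_tile_sketch<<<1, 32>>>(a_dev, b_dev, c_dev, 16, 16, 16);
```

Requires compiling for sm_70 or newer; the A2 is sm_86, so tensor cores are available there.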