# CSE141L Lab 3 Caching Optimizations Worksheet 1

Name: __________________________

Student ID: ____________________

# Instructions

* Complete this worksheet while reading/working through the lab write-up. The worksheet doesn't make sense without the lab.
* The point values are listed for each question. Altering the size of the cells will cost you 1 point.
* The write-up portion of the lab is 30% of your total points for the lab, as shown in the lab's README.md.

## Cache and dataset characteristics

#### P1 (4pt)

Find the dimensions (number of data elements) of the following tensors/vectors used in `fc_layer_t::activate` for the cifar100 dataset and fill in the following table.

| Tensor/Vector       | Number of Data Elements               |
|---------------------|---------------------------------------|
| `in`                | _____________________________________ |
| `out`               | _____________________________________ |
| `weights`           | _____________________________________ |
| `activation_input`  | _____________________________________ |

#### P2 (4pt)

Calculate the size (in bytes) of the following tensors/vectors used in `fc_layer_t::activate` for the cifar100 dataset and fill in the following table.

| Tensor/Vector       | Size in Bytes                         |
|---------------------|---------------------------------------|
| `in`                | _____________________________________ |
| `out`               | _____________________________________ |
| `weights`           | _____________________________________ |
| `activation_input`  | _____________________________________ |

#### P3 (4pt)

How much of each of these data structures used in `fc_layer_t::activate()` will fit in the L1 and L2 caches? (Note: the cache sizes for our machine are L1 dCache: 32 KB, L2: 256 KB, L3: 8 MB.)
| Tensor              | % that'll fit in L1 | % that'll fit in L2 |
|---------------------|---------------------|---------------------|
| `in`                | ___________________ | ___________________ |
| `out`               | ___________________ | ___________________ |
| `weights`           | ___________________ | ___________________ |
| `activation_input`  | ___________________ | ___________________ |

## Understanding `tensor_t`

Given `tensor_t<double> foo(tdsize(4,3,5,7))`, answer the following (a `double` is 8 bytes). (Hint: look at the lecture slides.)

#### P1 (1pt)

How many elements are there in `foo`?

#### P2 (1pt)

What is the linear index of element (1,1,1,1)?

#### P3 (1pt)

How far apart are elements that differ by 1 in each dimension?

| Dim. | Distance in bytes | Distance in linear index |
|------|-------------------|--------------------------|
| x    |                   |                          |
| y    |                   |                          |
| z    |                   |                          |
| b    |                   |                          |

## Tier 1: Reordering and Tiling Loops in `fc_layer_t::activate`

#### P1 (3pt)

Fill out the following table. Report the misses per instruction (MPI) using the performance counters (there should be a column for "MPI" in the reported data when running with L1/2/3.cfg).

| Cache Level | MPI - Base | MPI - Loop Reordering | MPI - Tiling |
|-------------|------------|-----------------------|--------------|
| L1          |            |                       |              |
| L2          |            |                       |              |
| L3          |            |                       |              |

#### P2 (4pt)

Change the order of the loops from `b i n` to `b n i` in `fc_layer_t::activate` and report the speedup.

Speedup after loop reordering: _______________

#### P3 (4pt)

Block the loop `n` in `fc_layer_t::activate` with the tile sizes 1, 2, 4, 8, and 16, and fill out the table below.
| Dataset  | Step size | Blocked implementation time            | Speedup vs. step size == 1 |
|----------|-----------|----------------------------------------|----------------------------|
| cifar100 | 1         | ______________________________________ | __________________________ |
| cifar100 | 2         | ______________________________________ | __________________________ |
| cifar100 | 4         | ______________________________________ | __________________________ |
| cifar100 | 8         | ______________________________________ | __________________________ |
| cifar100 | 16        | ______________________________________ | __________________________ |

#### P4 (4pt)

In a single line graph, plot the speedup against the different block sizes for blocking the loop `n` in `fc_layer_t::activate`. Block size is the independent variable.

```
Your graph here
```

#### P5 (4pt)

Consider the block size that gave the maximum speedup in the previous question (P4) and fill out the following.

1. Base implementation time: _____________________________________
2. Implementation time of your optimized solution: _____________________________________
3. Base implementation L1 miss rate: _____________________________________
4. Your fastest solution's L1 miss rate: _____________________________________

#### P6 (3pt)

Insert the memory access patterns (take screenshots from Moneta) for the loop orders b-i-n, b-n-i, and nn-b-n-i. Do this for the `weights` tensor, and pass the runtime options that set scale to 4 and reps to 1 in config.env. The dataset should be cifar100 (which should be the default). Leave the cache lines and block size fields as they are, but set the max accesses to 2 million. The file `opt_cnn.cpp` contains an example of where and what to put to run Moneta; please take a look at it.

```
memory access pattern with loop order b-i-n
```

```
memory access pattern with loop order b-n-i
```

```
memory access pattern with loop order nn-b-n-i
```