# [Memory and Storage Systems] Assignment II: Valgrind & PyTorch Profiler

* Student name: 楊承翰
* Student ID: 310512072
* Department: Institute of Electrical and Control Engineering (M.S. program)

---

## Q1. Memcheck

All errors observed from the log:

### 1. illegal writes

```
==20306== Invalid write of size 4 ### Error: illegal writes
==20306==    at 0x1091C0: main (memleak.c:49)
==20306==  Address 0x4a96068 is 0 bytes after a block of size 40 alloc'd
==20306==    at 0x484884F: malloc (vg_replace_malloc.c:393)
==20306==    by 0x10919E: main (memleak.c:46)
==20306==
==20306== Invalid read of size 4 ### Error: illegal writes
==20306==    at 0x1091ED: main (memleak.c:54)
==20306==  Address 0x4a96068 is 0 bytes after a block of size 40 alloc'd
==20306==    at 0x484884F: malloc (vg_replace_malloc.c:393)
==20306==    by 0x10919E: main (memleak.c:46)
==20306==
```

The first message indicates that the program attempted to **write 4 bytes to an invalid memory location**, and the second that it attempted to **read 4 bytes from the same location**. The faulting address `0x4a96068` lies 0 bytes past the end of a 40-byte block allocated with `malloc` at line 46 of `memleak.c` (the values `0x1091C0` and `0x1091ED` are the code addresses of the offending instructions in `main`, not the data addresses being accessed). In other words, the accesses at lines 49 and 54 run exactly one element past the end of the heap array — a classic off-by-one error, such as indexing `arr[10]` in an `int` array of 10 elements.

### 2. use of uninitialised values

```
==20306== Conditional jump or move depends on uninitialised value(s) ### Error: Use of uninitialised values
==20306==    at 0x48E1B56: __vfprintf_internal (vfprintf-internal.c:1516)
==20306==    by 0x48CB81E: printf (printf.c:33)
==20306==    by 0x109214: main (memleak.c:57)
==20306==
==20306== Use of uninitialised value of size 8 ### Error: Use of uninitialised values
==20306==    at 0x48C533B: _itoa_word (_itoa.c:177)
==20306==    by 0x48E0B3D: __vfprintf_internal (vfprintf-internal.c:1516)
==20306==    by 0x48CB81E: printf (printf.c:33)
==20306==    by 0x109214: main (memleak.c:57)
==20306==
==20306== Conditional jump or move depends on uninitialised value(s) ### Error: Use of uninitialised values
==20306==    at 0x48C534C: _itoa_word (_itoa.c:177)
==20306==    by 0x48E0B3D: __vfprintf_internal (vfprintf-internal.c:1516)
==20306==    by 0x48CB81E: printf (printf.c:33)
==20306==    by 0x109214: main (memleak.c:57)
==20306==
==20306== Conditional jump or move depends on uninitialised value(s) ### Error: Use of uninitialised values
==20306==    at 0x48E1643: __vfprintf_internal (vfprintf-internal.c:1516)
==20306==    by 0x48CB81E: printf (printf.c:33)
==20306==    by 0x109214: main (memleak.c:57)
==20306==
==20306== Conditional jump or move depends on uninitialised value(s) ### Error: Use of uninitialised values
==20306==    at 0x48E0C85: __vfprintf_internal (vfprintf-internal.c:1516)
==20306==    by 0x48CB81E: printf (printf.c:33)
==20306==    by 0x109214: main (memleak.c:57)
==20306==
```

All of these errors point to line 57 in `memleak.c`. `__vfprintf_internal` is an internal function of the C standard library that is called by `printf`. The errors surface inside `__vfprintf_internal` **because an uninitialised value was passed to `printf` as an argument**.
The second part of the error message shows that another internal function, `_itoa_word`, was invoked to convert an integer value to its string representation for `printf`, and the line "Use of uninitialised value of size 8" implies that **the uninitialised variable is an 8-byte integer**, for example a `long long` in C. To sum up, a `long long` variable was declared but never initialised, and was then printed at line 57 via `printf`, which triggers these errors. Here is example code that produces this error:

```cpp=
#include <stdio.h>

int main(void) {
    long long x;              // declared but never initialised
    printf("x = %lld\n", x);  // Memcheck: use of uninitialised value
    return 0;
}
```

### 3. fishy argument values

```
==20306== Argument 'size' of function malloc has a fishy (possibly negative) value: -40 ### Error: Fishy argument values
==20306==    at 0x484884F: malloc (vg_replace_malloc.c:393)
==20306==    by 0x109220: main (memleak.c:61)
==20306==
```

This error message indicates that **the `size` argument of the `malloc` call at line 61 in `memleak.c` was given a fishy (possibly negative) value**. From the first line of the message we can see that the `size` argument was `-40`, which is not acceptable: `size_t` is unsigned, so `-40` wraps around to an enormous allocation request. The following program triggers the same error:

```cpp=
#include <stdlib.h>

int main(void) {
    // -40 wraps around to a huge unsigned size_t value;
    // Memcheck flags the argument as "fishy"
    int *arr = (int *) malloc(-40);
    free(arr);
    return 0;
}
```

### 4. illegal frees

```
==20306== Invalid free() / delete / delete[] / realloc() ### Error: invalid frees
==20306==    at 0x484B0C4: free (vg_replace_malloc.c:884)
==20306==    by 0x10924A: main (memleak.c:65)
==20306==  Address 0x4a964f0 is 0 bytes inside a block of size 40 free'd
==20306==    at 0x484B0C4: free (vg_replace_malloc.c:884)
==20306==    by 0x10923E: main (memleak.c:64)
==20306==  Block was alloc'd at
==20306==    at 0x484884F: malloc (vg_replace_malloc.c:393)
==20306==    by 0x10922E: main (memleak.c:63)
==20306==
```

According to the Valgrind user manual:

> Memcheck keeps track of the blocks allocated by your program with `malloc/new`, so it can know exactly whether or not the argument to `free/delete` is legitimate or not.

The error message notes that a block of 40 bytes was allocated with `malloc` at line 63 in `memleak.c` and deallocated with `free` at line 64, so the block can no longer be accessed after line 64. However, `free` was called on it again at line 65, which is invalid: the block has already been released, so there is nothing left to free (a double free). Example code demonstrating this situation:

```cpp=
#include <stdlib.h>

int main(void) {
    int *arr = (int *) malloc(40);
    free(arr);  // OK
    free(arr);  // INVALID: double free
    return 0;
}
```

## Q2. Cachegrind

### 1. instruction references

The first section of the two log files:

```
### good:
==15022== I refs:        30,156,318
==15022== I1 misses:     1,080
==15022== LLi misses:    1,069
==15022== I1 miss rate:  0.00%
==15022== LLi miss rate: 0.00%

### bad:
==15024== I refs:        30,156,336
==15024== I1 misses:     1,080
==15024== LLi misses:    1,069
==15024== I1 miss rate:  0.00%
==15024== LLi miss rate: 0.00%
```

We obtained **very similar results** from the analyses of `good` and `bad`.
In the `good` log, the program made 30,156,318 instruction fetches (`I refs`), with only 1,080 first-level instruction cache misses (`I1 misses`) and 1,069 instruction misses at the last-level cache (`LLi misses`), giving a miss rate of approximately 0.00%.

### 2. data references

The second section of the two log files:

```
### good:
==15022== D refs:        14,053,535 (12,039,350 rd + 2,014,185 wr)
==15022== D1 misses:     127,120 (64,038 rd + 63,082 wr)
==15022== LLd misses:    64,333 (1,296 rd + 63,037 wr)
==15022== D1 miss rate:  0.9% (0.5% + 3.1%)
==15022== LLd miss rate: 0.5% (0.0% + 3.1%)

### bad:
==15024== D refs:        14,053,536 (12,039,351 rd + 2,014,185 wr)
==15024== D1 misses:     2,002,121 (1,001,539 rd + 1,000,582 wr)
==15024== LLd misses:    64,333 (1,296 rd + 63,037 wr)
==15024== D1 miss rate:  14.2% (8.3% + 49.7%)
==15024== LLd miss rate: 0.5% (0.0% + 3.1%)
```

In the `good` log, there were 14,053,535 data references (`D refs`) with 127,120 first-level data cache misses (`D1 misses`), comprising 64,038 read misses (`rd`) and 63,082 write misses (`wr`), for a miss rate of 0.9%. There were also 64,333 last-level data cache misses (`LLd misses`), with 1,296 read misses and 63,037 write misses, for a miss rate of 0.5%.

Comparing the two executables, there is a **significant difference in `D1 misses` and hence in `D1 miss rate`** (127,120 vs. 2,002,121 misses, or 0.9% vs. 14.2%). An L1 data cache miss (`D1 miss`) occurs when the CPU cannot find the desired data in the L1 cache; it then searches the next level of the hierarchy, and only if the data is absent there as well is a last-level data cache miss (`LLd miss`) reported. Although the `D1 misses` differ greatly, the `LLd misses` are identical (64,333 in both cases). This indicates that in both programs, most of the recently accessed data still fits in the last-level cache, resulting in low last-level data cache misses.
Nevertheless, compared to the `good` case, **the `bad` case repeatedly fails to find its data in the L1 cache and has to fetch it from the last-level cache instead**, which explains the much higher L1 data miss count. In other words, the `bad` program lacks the locality needed to keep hot data resident in the small L1 cache.

### 3. last level references

The last section of the two log files:

```
### good:
==15022== LL refs:      128,200 (65,118 rd + 63,082 wr)
==15022== LL misses:    65,402 (2,365 rd + 63,037 wr)
==15022== LL miss rate: 0.1% (0.0% + 3.1%)

### bad:
==15024== LL refs:      2,003,201 (1,002,619 rd + 1,000,582 wr)
==15024== LL misses:    65,402 (2,365 rd + 63,037 wr)
==15024== LL miss rate: 0.1% (0.0% + 3.1%)
```

This section reports the last-level cache references (`LL refs`). In the `good` log, there were 128,200 last-level references in total, with 65,402 misses, comprising 2,365 read misses and 63,037 write misses. The `LL miss rate` here is a global miss rate (GMR): the last-level misses divided by the total number of instruction and data references, i.e. 65,402 / (30,156,318 + 14,053,535) ≈ 0.148%.

As mentioned above, last-level references are generated by L1 cache misses. Most of the data that is absent from the L1 cache can still be found in the last-level cache, so the number of last-level misses is identical in both cases (65,402). As a result, **the difference in last-level cache references (`LL refs`) is driven entirely by the difference in L1 misses**.

## Q3. Massif

The snapshot of the output graph:

![](https://hackmd.io/_uploads/rkOIG6lBh.png)

In the analysis, a total of 76 snapshots were taken, and the details of some of them (listed under "Detailed snapshots") can be viewed below. According to the "Detailed snapshots" list, the peak heap usage was observed at the 65th snapshot.
The details around this snapshot:

```
--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 60        225,568          225,568          225,000           568            0
 61        229,576          229,576          229,000           576            0
 62        231,584          231,584          231,000           584            0
 63        235,592          235,592          235,000           592            0
 64        239,600          239,600          239,000           600            0
 65        239,600          239,600          239,000           600            0
... (ignored)
```

At the 65th snapshot, **a total of 239,600 bytes were allocated on the heap**, of which 239,000 bytes were actually requested by the program (`useful-heap`) and the remaining 600 bytes were allocator book-keeping overhead (`extra-heap`).

## Q4. Callgrind

### Q4-1

The snapshot of the function summary:

![](https://hackmd.io/_uploads/ByFL2lbHn.png)

Since we want to find the bottleneck function, the list is sorted in descending order of inclusive time. Judging by function name and hierarchy, the first five entries are all wrappers around the `main` function and contribute almost no self time (close to 0%). Similarly, the function in 6th place (`run_bfs`) took little time by itself. However, the function in 7th place, **`verify_bfs_tree`**, accounts for a non-negligible portion of the execution time.

The call graph of the function:

![](https://hackmd.io/_uploads/BkTFxbbH2.png)

The entire hierarchy accounts for 54.52% of the overall program execution time, and the `verify_bfs_tree` function alone accounts for 48.34%. Hence, we can conclude that `verify_bfs_tree` is the most expensive function in this hierarchy. Furthermore, it was called 64 times in total, making it the most frequently called function in the hierarchy.

### Q4-2

The snapshot of the function summary (in descending order of call count):

![](https://hackmd.io/_uploads/rkx7jJzH3.png)

The functions called most often are `mod_mac_y` and `mod_mac`.
The `mod_mac` function was called by the `mod_mac_y` function, and `mod_mac_y` was in turn called by the `mrg_orig_step` and `mrg_apply_transition` functions. The complete hierarchy can be seen in the call graph:

![](https://hackmd.io/_uploads/rkjznkzrn.png)

## Q5. PyTorch profiler

### Q5-1

The snapshot of the terminal output:

![](https://hackmd.io/_uploads/ByYm5ezHh.png)

Here, the operations are sorted in descending order of self CPU time. The top 3 operations by execution time are `aten::mkldnn_convolution`, `aten::max_pool2d_with_indices`, and `aten::native_batch_norm`. Here is an explanation of these operations:

1. `aten::mkldnn_convolution`: This operator performs the **convolution operation** using the MKL-DNN (Math Kernel Library for Deep Neural Networks) library. MKL-DNN is a library of optimized deep learning primitives, and this operator executes the convolutional layers of the ResNet18 model.
2. `aten::max_pool2d_with_indices`: This operator corresponds to **max pooling with indices**. Max pooling is a common operation in convolutional neural networks (CNNs) that reduces the spatial dimensions of the input by selecting the maximum value within each pooling region.
3. `aten::native_batch_norm`: This operator is PyTorch's native **batch normalization**. Batch normalization is a technique commonly used in deep learning to improve training and convergence: it normalizes its input by subtracting the mean and dividing by the standard deviation, which stabilizes the network's learning process.

### Q5-2

A full-window screenshot of the Chrome trace viewer:

![](https://hackmd.io/_uploads/SJNBYxzB2.png)

The two operators that appeared most frequently were `aten::conv2d` and `aten::convolution`.