# Linux2021 Homework2(bitcpy) contributed by < `Jings1017` > ###### tags: `linux2021` > [GitHub](https://github.com/Jings1017/Bitcpy) ### 1. 解釋程式碼運作原理 ```c= #include <stdint.h> void bitcpy(void *_dest, /* Address of the buffer to write to */ size_t _write, /* Bit offset to start writing to */ const void *_src, /* Address of the buffer to read from */ size_t _read, /* Bit offset to start reading from */ size_t count) { size_t read_lhs = _read & 7; size_t read_rhs = 8 - read_lhs; const uint8_t *source = (const uint8_t *) _src + (_read / 8); size_t write_lhs = _write & 7; size_t write_rhs = 8 - write_lhs; uint8_t *dest = (uint8_t *) _dest + (_write / 8); static const uint8_t read_mask[] = { 0x00, /* == 0 00000000b */ 0x80, /* == 1 10000000b */ 0xC0, /* == 2 11000000b */ 0xE0, /* == 3 11100000b */ 0xF0, /* == 4 11110000b */ 0xF8, /* == 5 11111000b */ 0xFC, /* == 6 11111100b */ 0xFE, /* == 7 11111110b */ 0xFF /* == 8 11111111b */ }; static const uint8_t write_mask[] = { 0xFF, /* == 0 11111111b */ 0x7F, /* == 1 01111111b */ 0x3F, /* == 2 00111111b */ 0x1F, /* == 3 00011111b */ 0x0F, /* == 4 00001111b */ 0x07, /* == 5 00000111b */ 0x03, /* == 6 00000011b */ 0x01, /* == 7 00000001b */ 0x00 /* == 8 00000000b */ }; while (count > 0) { uint8_t data = *source++; size_t bitsize = (count > 8) ? 8 : count; if (read_lhs > 0) { data <<= read_lhs; if (bitsize > read_rhs) data |= (*source >> read_rhs); } if (bitsize < 8) data &= read_mask[bitsize]; uint8_t original = *dest; uint8_t mask = read_mask[write_lhs]; if (bitsize > write_rhs) { /* Cross multiple bytes */ *dest++ = (original & mask) | (data >> write_lhs); original = *dest & write_mask[bitsize - write_rhs]; *dest = original | (data << write_rhs); } else { // Since write_lhs + bitsize is never >= 8, no out-of-bound access. mask |= write_mask[bitsize - write_lhs]; *dest++ = (original & mask) | (data >> write_lhs); } count -= bitsize; } } ``` 先從參數看起 : * `_dest` : 表示欲貼上的起始位址 * `_write` : _dest 之偏移量 * `_src` : 表示欲複製的起始位址 * `_read` : _src 的偏移量 * `_count` : 表示欲複製的位元數 :::info 在程式中,策略是從 `_src` 中一次搬動一個 byte 到 `_dest` 中,若 bit offset 非 8 的倍數時,則需要做處理。 處理方式又分為讀和寫,讀的時候使用 `read_lhs` 及 `read_rhs`,寫的時候則使用 `write_lhs` 及 `write_rhs`。 以讀為例,`read_lhs` 表示左半不需讀取的 bit 數,而 `read_rhs` 表示右半不需讀取的 bit 數,寫的部分同理故省略之。 ::: #### bitcopy 實作 ``` ... data = *source++; bitsize = (_count > 8) ? 8 : _count; if (read_lhs > 0) { data <<= read_lhs; if (bitsize > read_rhs) data |= (*source >> read_rhs); } ... ``` 進入 while 迴圈時,先把 8-bit 的資料複製到 `data` 中,然後分兩種情形 * `read_lhs` 位移為 0 : 直接把 `source` 的 8 bit 給 data * `read_lhs` 位移非 0 : 非從 `source` 第一個 bit 開始複製,以下圖表示 假設 `_read_lhs` 為 3,即從第三個位元開始複製,則 `read_lhs = 3; read_rhs = 5`。而目前 `data` 是從 `source` 第 0 位元開始,所以必須把 `data` 往左移動 3 格,如圖左下方所示。而下一個 `source` (因為先前已做了 `data=*source++` ) 則往右移動 5 格,如圖右下方所示。最後再將處理好的左右兩部分做 `|` 運算,即可得到我們要的前 8 bit 。 ```graphviz digraph { node [shape=none, fontcolor=black, fontsize=10.5, width=5, fixedsize=true]; byte0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15"> <tr> <td port="port0">1</td> <td port="port1">1</td> <td port="port2">0</td> <td port="port3" bgcolor="#ffcce5">0</td> <td port="port4" bgcolor="#ffcce5">1</td> <td port="port5" bgcolor="#ffcce5">0</td> <td port="port6" bgcolor="#ffcce5">1</td> <td port="port7" bgcolor="#ffcce5">1</td> </tr> </table>> ]; byte1 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15"> <tr> <td port="port0" bgcolor="#ffcce5">1</td> <td port="port1" bgcolor="#ffcce5">1</td> <td port="port2" bgcolor="#ffcce5">1</td> <td port="port3">1</td> <td port="port4">0</td> <td port="port5">1</td> <td port="port6">0</td> <td port="port7">1</td> </tr> </table>> ]; node [shape=plaintext, fontcolor=black, fontsize=20, width=1]; "data"->byte0[style=invis] "*source"->byte1[style=invis] { rank=same; byte0; byte1;} } ``` ```graphviz digraph { node [shape=none, fontcolor=black, fontsize=10.5, width=5, fixedsize=true]; byte0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15"> <tr> <td port="port0" bgcolor="#ffcce5">0</td> <td port="port1" bgcolor="#ffcce5">1</td> <td port="port2" bgcolor="#ffcce5">0</td> <td port="port3" bgcolor="#ffcce5">1</td> <td port="port4" bgcolor="#ffcce5">1</td> <td port="port5" >0</td> <td port="port6" >0</td> <td port="port7" >0</td> </tr> </table>> ]; byte1 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15"> <tr> <td port="port0">0</td> <td port="port1">0</td> <td port="port2">0</td> <td port="port3">0</td> <td port="port4">0</td> <td port="port5" bgcolor="#ffcce5">1</td> <td port="port6" bgcolor="#ffcce5">1</td> <td port="port7" bgcolor="#ffcce5">1</td> </tr> </table>> ]; node [shape=plaintext, fontcolor=black, fontsize=20, width=1]; "data<<=read_lhs"->byte0[style=invis] "*source>>read_rhs"->byte1[style=invis] { rank=same; byte0; byte1;} } ``` ```graphviz digraph { node [shape=none, fontcolor=black, fontsize=10.5, width=5, fixedsize=true]; byte0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="10"> <tr> <td port="port0" bgcolor="#ffcce5">0</td> <td port="port1" bgcolor="#ffcce5">1</td> <td port="port2" bgcolor="#ffcce5">0</td> <td port="port3" bgcolor="#ffcce5">1</td> <td port="port4" bgcolor="#ffcce5">1</td> <td port="port5" bgcolor="#ffcce5">1</td> <td port="port6" bgcolor="#ffcce5">1</td> <td port="port7" bgcolor="#ffcce5">1</td> </tr> </table>> ]; node [shape=plaintext, fontcolor=black, fontsize=15, width=1]; "data |= (*source>>read_rhs)"->byte0[style=invis] { rank=same; byte0; } } ``` ``` ... if (bitsize < 8) data &= read_mask[bitsize]; ... ``` 如果目前還需再讀取的 bit 數小於 8 時,這時候就要透過 mask 來將多餘的部分設成 0 再來是處理寫到 `dest` 的部分 ``` uint8_t original = *dest; uint8_t mask = read_mask[write_lhs]; if (bitsize > write_rhs) { ... } else { ... } ``` 目前的 `data` 為經過讀取時處理後的資料,在此直接給 `dest`。 而 `mask` 則用來保留左邊 bit 和清除右邊 bit ,因為若有位移的話,不能影響到左邊 bit 在此舉一實例做為參考: 假設 `_write = 3` ,則得到 `write_lhs = 3; write_rhs = 5`。 如果該次需要寫入的位元超過 `write_lhs` ( 即 `bitsize > write_rhs` ), 代表接下來要貼上的 8 位元有跨單位 ,故需要做以下處理。 ``` /* Cross multiple bytes */ *dest++ = (original & mask) | (data >> write_lhs); original = *dest & write_mask[bitsize - write_rhs]; *dest = original | (data << write_rhs); ``` 分述如下: * 將前 5 bit 寫入 `dest` 中,其中不能汙染到前 3 bit,且需將 `dest` 往下移動一格 * 把下一格 `dest` 的右邊 5 個 bit 保留好存到 `original` ( 其中左邊 3 個清成 0 ) * 把 `original` 往左移 5 bit,所以目前的前三 bit 就是上次 data 還沒複製到 dest 的資料。透過 | 把第二步驟留下的左三 bit 填滿 示意圖如下 : 粉紅色為貼上 data 的部分 ```graphviz digraph { graph [rankdir = TB] node [shape=plaintext, fontsize=40]; "" -> "before" -> "data" ->"after" ->"lhs/rhs"[fontcolor=black, color=white]; node [shape=none, fontcolor=black, fontsize=25, width=4.75, fixedsize=true]; byte0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17"> <tr> <td port="port0">1</td> <td port="port1">1</td> <td port="port2">0</td> <td port="port3">0</td> <td port="port4">1</td> <td port="port5">0</td> <td port="port6">1</td> <td port="port7">1</td> </tr> </table>> ]; byte1 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17"> <tr> <td port="port0">0</td> <td port="port1">1</td> <td port="port2">0</td> <td port="port3">0</td> <td port="port4">1</td> <td port="port5">0</td> <td port="port6">0</td> <td port="port7">1</td> </tr> </table>> ]; byte3 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17"> <tr> <td port="port0">1</td> <td port="port1">1</td> <td port="port2">0</td> <td port="port3" bgcolor="#ffcce5">0</td> <td port="port4" bgcolor="#ffcce5">1</td> <td port="port5" bgcolor="#ffcce5">0</td> <td port="port6" bgcolor="#ffcce5">1</td> <td port="port7" bgcolor="#ffcce5">1</td> </tr> </table>> ]; byte4 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17"> <tr> <td port="port0" bgcolor="#ffcce5">1</td> <td port="port1" bgcolor="#ffcce5">1</td> <td port="port2" bgcolor="#ffcce5">1</td> <td port="port3">0</td> <td port="port4">1</td> <td port="port5">0</td> <td port="port6">0</td> <td port="port7">1</td> </tr> </table>> ]; t0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17"> <tr> <td port="port3" bgcolor="#ffcce5">0</td> <td port="port4" bgcolor="#ffcce5">1</td> <td port="port5" bgcolor="#ffcce5">0</td> <td port="port6" bgcolor="#ffcce5">1</td> <td port="port7" bgcolor="#ffcce5">1</td> </tr> </table>> ]; t1 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17"> <tr> <td port="port0" bgcolor="#ffcce5">1</td> <td port="port1" bgcolor="#ffcce5">1</td> <td port="port2" bgcolor="#ffcce5">1</td> </tr> </table>> ]; bitsize [shape=record label="bitsize", color="white", fontcolor="black",fontsize=35]; "bitsize" -> byte0:port3 [color=black]; "bitsize" -> byte1:port2 [color=black]; t0:port3 -> byte0:port3 [color=none]; t1:port0 -> byte1:port0 [color=none]; byte3:port3 -> t0:port3 [color=none]; byte4:port0 -> t1:port0 [color=none]; write_lhs [shape=record label="write_lhs", color="white", fontcolor="black",fontsize=30]; "write_lhs" -> byte3:port3 [color=black arrowhead=none]; dest1 [shape=record label="dest(1)", color="white", fontcolor="black",fontsize=30]; "dest1" -> byte0:port0 [color=none]; dest2 [shape=record label="dest(2)", color="white", fontcolor="black",fontsize=30]; "dest2" -> byte1:port4 [color=none]; { rank=same; "before"; byte0; byte1 } { rank=same; "data"; t0; t1; } { rank=same; "lhs/rhs"; write_lhs;} { rank=same; "after"; byte3; byte4; } } ``` 上圖之詳細過程如下 : * 左半 ```graphviz digraph { node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; a [label="1|1|0|0|1|0|1|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "&" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; b[label="1|1|1|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "=" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; c[label="1|1|0|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "dest(1)"->a[style=invis] "mask"->b[style=invis] "original & mask"->c[style=invis] { rank=same; a; "&"; b; "="; c} } ``` ```graphviz digraph{ node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; d[label="1|1|0|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "|" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; e[label="0|0|0|0|1|0|1|1",color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "=" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; f[label="1|1|0|0|1|0|1|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=none, fontcolor=black, fontsize=20, width=1, fixedsize=true]; byte0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="7"> <tr> <td port="port0">1</td> <td port="port1">1</td> <td port="port2">0</td> <td port="port3" bgcolor="#ffcce5">0</td> <td port="port4" bgcolor="#ffcce5">1</td> <td port="port5" bgcolor="#ffcce5">0</td> <td port="port6" bgcolor="#ffcce5">1</td> <td port="port7" bgcolor="#ffcce5">1</td> </tr> </table>> ]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "data>>write_lhs"->e[style=invis] "original & mask"->d[style=invis] f->byte0[style=invis] "="->"=>"[style=invis] { rank=same; d; "|"; e; "="; f} } ``` * 右半 ```graphviz digraph { node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; a [label="0|1|0|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "&" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; b[label="0|0|0|1|1|1|1|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "=" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; c[label="0|0|0|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "dest(2)"->a[style=invis] "mask"->b[style=invis] "original"->c[style=invis] { rank=same; a; "&"; b; "="; c} } ``` ```graphviz digraph{ node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; d[label="0|0|0|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "|" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; e[label="1|1|1|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "=" node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true]; f[label="1|1|1|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20]; node [shape=none, fontcolor=black, fontsize=20, width=1, fixedsize=true]; byte0 [ shape=none label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="7"> <tr> <td port="port0" bgcolor="#ffcce5">1</td> <td port="port1" bgcolor="#ffcce5">1</td> <td port="port2" bgcolor="#ffcce5">1</td> <td port="port3" >0</td> <td port="port4" >1</td> <td port="port5" >0</td> <td port="port6" >0</td> <td port="port7" >1</td> </tr> </table>> ]; node [shape=plaintext, fontcolor=black, fontsize=30, width=1]; "data<<write_rhs"->e[style=invis] "original"->d[style=invis] f->byte0[style=invis] "="->"=>"[style=invis] { rank=same; d; "|"; e; "="; f} } ``` 而第二種情況則不需考慮跨單位的處理 ``` *dest++ = (original & mask) | (data >> write_lhs); ``` `(original & mask)` 讓左邊保持原本的狀態,右邊清成 0,接著透過 | 把右邊清掉的位元填上 `data`。 ``` count -= bitsize; ``` 最後再更新還需寫多少 bit 即可。 ### 2. 說明其改進空間 由於過多的 branch miss 會造成整體效率下降,所以可以減少程式中不必要的 if 條件式之使用,便可減少分支的產生,以利提升程式效能。 這段 `bitcpy` 讀取的過程中,因為最後都能夠用 `read_mask` 將 `data` 中不需要的部分剔除,因此在這之前其實不需要再去調整 data 要讀取的 range 有多大、以及有沒有跨越到下一個 byte,直接將完整一個 byte 的大小讀取出來,最後再用 `read_mask` 留下需要的部分即可,因此程式可以改成如下,將所有 if 條件式移除: ```diff= size_t bitsize = (count > 8) ? 8 : count; - if (read_lhs > 0) { data <<= read_lhs; - if (bitsize > read_rhs) data |= (*source >> read_rhs); - } - if (bitsize < 8) data &= read_mask[bitsize]; ... ``` 此外,程式中另外宣告的兩個 `mask`,在使用時需要來回查看其內容,使用上有些許不便,額外宣告這兩個陣列也會增加記憶體 stack 空間的使用,應該有更好的實作方式,觀察後發現兩陣列的內容有規律性,其實透過 bitwise 的操作就可以直接完成了,取代方式如下: ```c read_mask[i] == 0xFF << (8 - i) write_mask[i] == 0xFF >> i ``` 但是實驗結果發現,以上取代方式反而會導致時間成本上的增加。 所以參考 [bakudr18](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/SkS-Y_lX_#%E6%94%B9%E5%96%84-bitcpy-%E6%95%88%E8%83%BD) 同學的實做方式,將 `mask` 的部份改成 ```c= #define READMASK(x) ((uint8_t)(~0U) << (8 - (x))) #define WRITEMASK(x) (~READMASK(x)) ``` 又在 `/ 8` 的部份可以換成 `>> 3`,以提升運算速度 ```diff= size_t read_lhs = _read & 7; size_t read_rhs = 8 - read_lhs; - const uint8_t *source = (const uint8_t *) _src + (_read / 8); + const uint8_t *source = (const uint8_t *) _src + (_read >> 3); size_t write_lhs = _write & 7; size_t write_rhs = 8 - write_lhs; - uint8_t *dest = (uint8_t *) _dest + (_write / 8); + uint8_t *dest = (uint8_t *) _dest + (_write >> 3); ``` 再來,可以減少變數的宣告, `orginal` 其實可以直接用 `*dest` 取代,並不影響結果。 ```diff= - uint8_t original = *dest; uint8_t mask = read_mask[write_lhs]; if (bitsize > write_rhs) { /* Cross multiple bytes */ - *dest++ = (original & mask) | (data >> write_lhs); + *dest++ = (*dest & mask) | (data >> write_lhs); - original = *dest & write_mask[bitsize - write_rhs]; + *dest = *dest & write_mask[bitsize - write_rhs]; - *dest = original | (data << write_rhs); + *dest = *dest | (data << write_rhs); } else { // Since write_lhs + bitsize is never >= 8, no out-of-bound access. mask |= write_mask[bitsize - write_lhs]; - *dest++ = (original & mask) | (data >> write_lhs); + *dest++ = (*dest & mask) | (data >> write_lhs); } ``` 為了提升運算效能,可以將 if 換成 bitwise ,避免不必要的 branch miss 因此將 `bitsize = (count > 8) ? 8 : count` 換成 bitwise 運算 想法是利用 `min(a,b)` 之 bitwise 運算,其中 `b` 帶入 `8` 即 `min(a,8)` ,符合預期結果 並參考 [先前一對一討論筆記](https://hackmd.io/CEe0nVwSQgOwHg5rVt32HQ) 分析出 min(a,b) :::info 已知 `A xor B xor A = B` 並考慮 `a-b` 是否為正, 針對 `a-b` 最左 bit 進行判斷,若為 `1` 即表示 `a<b`,否則表示 `a>b` 寫成程式碼可表示成可用 `(a-b)&(1<<31)` 再將此結果右移 31 位到最右 bit,即 `(((a-b) & (1<<31))>>63)` 此時 結果為 `1` or `0` 加上負號以二補數表示即 `111..111` or `000...000` 再與 `(a^b)` 做 `&` 運算, 其結果為 `(a^b)&(111...111) = (a^b)` or `(a^b)&(000...000) = 0` 最後再與 `b` 做 `^` 運算,可得 `b^(a^b) = a` or `b^0=b`, 因此可判斷出 `min(a,b)` * 相對應的表格如下 | operation | a < b | a > b | | -------- | -------- | -------- | | a-b | 1xx..xxx | 0xx..xxx | | (a-b)&(1<<31) | 100..000 | 000...000 | | ((a-b) & (1<<31)) >>63 | 000..001 | 000..000 | | -(((a-b) & (1<<31))>>63) | 111..111 | 000..000 | | (a ^ b) & -(((a-b) & (1<<31))>>63) | a^b | 000..000| | b ^ ((a ^ b) & -(((a-b) & (1<<31))>>63)|a|b| ::: 可得以下結果 ```diff= - size_t bitsize = (count > 8) ? 8 : count; + size_t bitsize = 8^((count^8)&(-(((count-8)&(1<<31))>>63))); ``` 將上述整理之後,`my_bitcpy` 程式碼如下 ```c= void v1_bitcpy(void *_dest, /* Address of the buffer to write to */ size_t _write, /* Bit offset to start writing to */ const void *_src, /* Address of the buffer to read from */ size_t _read, /* Bit offset to start reading from */ size_t count) { size_t bitsize; uint8_t data, original, mask; size_t read_lhs = _read & 7; size_t read_rhs = 8 - read_lhs; const uint8_t *source = (const uint8_t *) _src + (_read >>3); size_t write_lhs = _write & 7; size_t write_rhs = 8 - write_lhs; uint8_t *dest = (uint8_t *) _dest + (_write >>3); static const uint8_t MASK[9] = { 255, 127, 63, 31, 15, 7, 3, 1, 0 }; while (count > 0) { // read data bitsize = 8^((count^8)&(-(((count-8)&(1<<31))>>63))); data = (*source++ << read_lhs) | (*source >> read_rhs); data &= ~MASK[bitsize]; // write data mask = ~MASK[write_lhs]; *dest++ = (*dest & mask) | (data >> write_lhs); *dest &= MASK[bitsize - write_rhs]; *dest |= (data << write_rhs); count -= bitsize; } } ``` `main` 程式碼如下 ```c= int main() { struct timespec t1,t2; FILE *fp = fopen("comp_data","w"); for (int size = 8; size < 8000; size++) { memset(&output[0], 0x00, sizeof(output)); clock_gettime(CLOCK_REALTIME, &t1); bitcpy(&output[0], 2, &input[0], 4, size); clock_gettime(CLOCK_REALTIME, &t2); printf( "%lu\n", t2.tv_nsec - t1.tv_nsec); fprintf(fp, "%d\t", size); fprintf(fp, "%lu\t", t2.tv_nsec - t1.tv_nsec); memset(&output[0], 0x00, sizeof(output)); clock_gettime(CLOCK_REALTIME, &t1); my_bitcpy(&output[0], 2, &input[0], 4, size); clock_gettime(CLOCK_REALTIME, &t2); printf( "%lu\n", t2.tv_nsec - t1.tv_nsec); fprintf(fp, "%lu\n", t2.tv_nsec - t1.tv_nsec); } fclose(fp); return 0; } ``` 使用 `perf` 測量分支數量,參考自 [Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) ``` $ sudo perf record \ -e branch-misses:u,branch-instructions:u \ -F 10000 ./bitcopy $ sudo perf report ``` 原始版本的測量結果如下 ``` Available samples 579 branch-misses:u 804 branch-instructions:u ``` 改寫後的版本測量如下 ``` Available samples 384 branch-misses:u 582 branch-instructions:u ``` 可從結果看出明顯減少許多分支指令 再使用 `perf` 測量 cache miss 等資訊 ``` $ sudo perf stat --repeat 5 \ -e cache-misses,cache-references,\ instructions,cycles ./bitcopy ``` 原先的版本 ``` Performance counter stats for './bitcopy' (5 runs): 269,893 cache-misses # 11.976 % of all cache refs ( +- 2.50% ) 2,253,672 cache-references ( +- 1.70% ) 355,270,927 instructions # 1.71 insn per cycle ( +- 0.06% ) 207,961,598 cycles ( +- 2.99% ) 0.4051 +- 0.0508 seconds time elapsed ( +- 12.54% ) ``` 改良後的版本 ``` Performance counter stats for './bitcopy' (5 runs): 201,930 cache-misses # 8.442 % of all cache refs ( +- 3.01% ) 2,391,934 cache-references ( +- 12.01% ) 356,046,206 instructions # 1.87 insn per cycle ( +- 0.04% ) 190,125,028 cycles ( +- 5.07% ) 0.1727 +- 0.0147 seconds time elapsed ( +- 8.52% ) ``` 做以上改善之後,利用 gnuplot 製圖觀察與原先程式碼在時間成本上的比較。 ``` $ gnuplot myplot.gp $ eog result.png ``` ![](https://i.imgur.com/mjMklHp.png) 然而,原實做方法必須對 `count > 8` 的資料拆成三段進行處理,但在參考同學的共筆之後,發現可以使用一個迴圈及一個暫存器來處理這一個部份,最後再針對剩餘部份複製即可,這裡須注意剩餘部份是否橫跨兩個 byte ,處理方式如最後的 if 判斷,額外做複製。 ```c= #define READMASK(x) ((uint8_t)(~0U) << (8 - (x))) #define WRITEMASK(x) (~READMASK(x)) void v2_bitcpy(void *_dest, /* Address of the buffer to write to */ size_t _write, /* Bit offset to start writing to */ const void *_src, /* Address of the buffer to read from */ size_t _read, /* Bit offset to start reading from */ size_t count) { size_t read_lhs = _read & 7; size_t read_rhs = 8 - read_lhs; const uint8_t *source = (const uint8_t *) _src + (_read >>3); size_t write_lhs = _write & 7; size_t write_rhs = 8 - write_lhs; uint8_t *dest = (uint8_t *) _dest + (_write >>3); uint8_t data; /* copy until count < 8 bits */ for (size_t bytecount = count >> 3; bytecount > 0; bytecount--) { data = (*source++ << read_lhs) | (*source >> read_rhs); *dest++ = (*dest & READMASK(write_lhs)) | (data >> write_lhs); *dest = (*dest >> write_lhs) | (data << write_rhs); } count &= 7; /* copy the remaining count */ data = ((*source++ << read_lhs) | (*source >> read_rhs)) & READMASK(count); *dest++ = (*dest & READMASK(write_lhs)) | ((data & READMASK(count)) >> write_lhs); if (count > write_rhs) *dest = (*dest & WRITEMASK(count - write_rhs)) | (data << write_rhs); } ``` 利用 gnuplot 製圖結果如下 ![](https://i.imgur.com/rtUY1Vq.png) ### 3. 問題 - [x] 效能圖干擾因素之修正 因為測量時間短,用機率統計(CDF)進行後處理,排除干擾 另外 cache , timer(microsecond) 也有影響 可參考 [Linux 效能分析的提示](https://hackmd.io/@sysprog/linux2021-fibdrv#-Linux-%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E6%8F%90%E7%A4%BA) - [X] 改良後卻沒有顯著改善之解決方法 ~~經過實驗證實,主要是將 mask 用 bitwise 運算代替造成的,若不改寫此部份,製圖後可看出其他改寫部份對時間成本的改善~~ 目前已做以下更改,時間成本上有獲得改善,但仍未找出原方法造成時間成本上升之原因 ```diff= - data &= 0xFF << (8-bitsize); + data &= READMASK(bitsize); - mask = 0xFF << (8 - write_lhs); + mask = READMASK(write_lhs); - *dest &= (0xFF >> (bitsize - write_rhs)); + *dest &= WRITEMASK(bitsize - write_rhs); ``` ### 4. 補充 - [ ] [lib/bitmap.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/bitmap.c?h=v5.12-rc2#n292) ```cpp=292 void __bitmap_replace(unsigned long *dst, const unsigned long *old, const unsigned long *new, const unsigned long *mask, unsigned int nbits) { unsigned int k; unsigned int nr = BITS_TO_LONGS(nbits); for (k = 0; k < nr; k++) dst[k] = (old[k] & ~mask[k]) | (new[k] & mask[k]); } EXPORT_SYMBOL(__bitmap_replace); ``` - [ ] [drivers/video/fbdev/amifb.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2#n2602) ```cpp=2597 /* * Unaligned forward bit copy using 32-bit or 64-bit memory accesses */ static void bitcpy(unsigned long *dst, int dst_idx, const unsigned long *src, int src_idx, u32 n) ... ``` - [ ] [linux/include/linux/bitmap.h](https://elixir.bootlin.com/linux/latest/source/include/linux/bitmap.h#L245) 中的 `bitmap_copy` 及 `bitmap_copy_clear_tail` 採逐 bit 進行資料複製,但此兩函式皆不會保留 byte 內沒有複製到的 bit。 ```c=245 static inline void bitmap_copy(unsigned long *dst, const unsigned long *src, unsigned int nbits) { unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); memcpy(dst, src, len); } /* * Copy bitmap and clear tail bits in last word. */ static inline void bitmap_copy_clear_tail(unsigned long *dst, const unsigned long *src, unsigned int nbits) { bitmap_copy(dst, src, nbits); if (nbits % BITS_PER_LONG) dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits); } ```