# Linux2021 Homework2(bitcpy)
contributed by < `Jings1017` >
###### tags: `linux2021`
> [GitHub](https://github.com/Jings1017/Bitcpy)
### 1. 解釋程式碼運作原理
```c=
#include <stdint.h>
void bitcpy(void *_dest, /* Address of the buffer to write to */
size_t _write, /* Bit offset to start writing to */
const void *_src, /* Address of the buffer to read from */
size_t _read, /* Bit offset to start reading from */
size_t count)
{
size_t read_lhs = _read & 7;
size_t read_rhs = 8 - read_lhs;
const uint8_t *source = (const uint8_t *) _src + (_read / 8);
size_t write_lhs = _write & 7;
size_t write_rhs = 8 - write_lhs;
uint8_t *dest = (uint8_t *) _dest + (_write / 8);
static const uint8_t read_mask[] = {
0x00, /* == 0 00000000b */
0x80, /* == 1 10000000b */
0xC0, /* == 2 11000000b */
0xE0, /* == 3 11100000b */
0xF0, /* == 4 11110000b */
0xF8, /* == 5 11111000b */
0xFC, /* == 6 11111100b */
0xFE, /* == 7 11111110b */
0xFF /* == 8 11111111b */
};
static const uint8_t write_mask[] = {
0xFF, /* == 0 11111111b */
0x7F, /* == 1 01111111b */
0x3F, /* == 2 00111111b */
0x1F, /* == 3 00011111b */
0x0F, /* == 4 00001111b */
0x07, /* == 5 00000111b */
0x03, /* == 6 00000011b */
0x01, /* == 7 00000001b */
0x00 /* == 8 00000000b */
};
while (count > 0) {
uint8_t data = *source++;
size_t bitsize = (count > 8) ? 8 : count;
if (read_lhs > 0) {
data <<= read_lhs;
if (bitsize > read_rhs)
data |= (*source >> read_rhs);
}
if (bitsize < 8)
data &= read_mask[bitsize];
uint8_t original = *dest;
uint8_t mask = read_mask[write_lhs];
if (bitsize > write_rhs) {
/* Cross multiple bytes */
*dest++ = (original & mask) | (data >> write_lhs);
original = *dest & write_mask[bitsize - write_rhs];
*dest = original | (data << write_rhs);
} else {
// Since write_lhs + bitsize is never >= 8, no out-of-bound access.
mask |= write_mask[bitsize - write_lhs];
*dest++ = (original & mask) | (data >> write_lhs);
}
count -= bitsize;
}
}
```
先從參數看起 :
* `_dest` : 表示欲貼上的起始位址
* `_write` : _dest 之偏移量
* `_src` : 表示欲複製的起始位址
* `_read` : _src 的偏移量
* `_count` : 表示欲複製的位元數
:::info
在程式中,策略是從 `_src` 中一次搬動一個 byte 到 `_dest` 中,若 bit offset 非 8 的倍數時,則需要做處理。
處理方式又分為讀和寫,讀的時候使用 `read_lhs` 及 `read_rhs`,寫的時候則使用 `write_lhs` 及 `write_rhs`。
以讀為例,`read_lhs` 表示左半不需讀取的 bit 數,而 `read_rhs` 表示右半不需讀取的 bit 數,寫的部分同理故省略之。
:::
#### bitcopy 實作
```
...
data = *source++;
bitsize = (_count > 8) ? 8 : _count;
if (read_lhs > 0) {
data <<= read_lhs;
if (bitsize > read_rhs)
data |= (*source >> read_rhs);
}
...
```
進入 while 迴圈時,先把 8-bit 的資料複製到 `data` 中,然後分兩種情形
* `read_lhs` 位移為 0 : 直接把 `source` 的 8 bit 給 data
* `read_lhs` 位移非 0 : 非從 `source` 第一個 bit 開始複製,以下圖表示
假設 `_read_lhs` 為 3,即從第三個位元開始複製,則 `read_lhs = 3; read_rhs = 5`。而目前 `data` 是從 `source` 第 0 位元開始,所以必須把 `data` 往左移動 3 格,如圖左下方所示。而下一個 `source` (因為先前已做了 `data=*source++` ) 則往右移動 5 格,如圖右下方所示。最後再將處理好的左右兩部分做 `|` 運算,即可得到我們要的前 8 bit 。
```graphviz
digraph {
node [shape=none, fontcolor=black, fontsize=10.5, width=5, fixedsize=true];
byte0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15">
<tr>
<td port="port0">1</td>
<td port="port1">1</td>
<td port="port2">0</td>
<td port="port3" bgcolor="#ffcce5">0</td>
<td port="port4" bgcolor="#ffcce5">1</td>
<td port="port5" bgcolor="#ffcce5">0</td>
<td port="port6" bgcolor="#ffcce5">1</td>
<td port="port7" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
byte1 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15">
<tr>
<td port="port0" bgcolor="#ffcce5">1</td>
<td port="port1" bgcolor="#ffcce5">1</td>
<td port="port2" bgcolor="#ffcce5">1</td>
<td port="port3">1</td>
<td port="port4">0</td>
<td port="port5">1</td>
<td port="port6">0</td>
<td port="port7">1</td>
</tr>
</table>>
];
node [shape=plaintext, fontcolor=black, fontsize=20, width=1];
"data"->byte0[style=invis]
"*source"->byte1[style=invis]
{ rank=same; byte0; byte1;}
}
```
```graphviz
digraph {
node [shape=none, fontcolor=black, fontsize=10.5, width=5, fixedsize=true];
byte0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15">
<tr>
<td port="port0" bgcolor="#ffcce5">0</td>
<td port="port1" bgcolor="#ffcce5">1</td>
<td port="port2" bgcolor="#ffcce5">0</td>
<td port="port3" bgcolor="#ffcce5">1</td>
<td port="port4" bgcolor="#ffcce5">1</td>
<td port="port5" >0</td>
<td port="port6" >0</td>
<td port="port7" >0</td>
</tr>
</table>>
];
byte1 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="15">
<tr>
<td port="port0">0</td>
<td port="port1">0</td>
<td port="port2">0</td>
<td port="port3">0</td>
<td port="port4">0</td>
<td port="port5" bgcolor="#ffcce5">1</td>
<td port="port6" bgcolor="#ffcce5">1</td>
<td port="port7" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
node [shape=plaintext, fontcolor=black, fontsize=20, width=1];
"data<<=read_lhs"->byte0[style=invis]
"*source>>read_rhs"->byte1[style=invis]
{ rank=same; byte0; byte1;}
}
```
```graphviz
digraph {
node [shape=none, fontcolor=black, fontsize=10.5, width=5, fixedsize=true];
byte0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="10">
<tr>
<td port="port0" bgcolor="#ffcce5">0</td>
<td port="port1" bgcolor="#ffcce5">1</td>
<td port="port2" bgcolor="#ffcce5">0</td>
<td port="port3" bgcolor="#ffcce5">1</td>
<td port="port4" bgcolor="#ffcce5">1</td>
<td port="port5" bgcolor="#ffcce5">1</td>
<td port="port6" bgcolor="#ffcce5">1</td>
<td port="port7" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
node [shape=plaintext, fontcolor=black, fontsize=15, width=1];
"data |= (*source>>read_rhs)"->byte0[style=invis]
{ rank=same; byte0; }
}
```
```
...
if (bitsize < 8)
data &= read_mask[bitsize];
...
```
如果目前還需再讀取的 bit 數小於 8 時,這時候就要透過 mask 來將多餘的部分設成 0
再來是處理寫到 `dest` 的部分
```
uint8_t original = *dest;
uint8_t mask = read_mask[write_lhs];
if (bitsize > write_rhs) {
...
} else {
...
}
```
目前的 `data` 為經過讀取時處理後的資料,在此直接給 `dest`。
而 `mask` 則用來保留左邊 bit 和清除右邊 bit ,因為若有位移的話,不能影響到左邊 bit
在此舉一實例做為參考:
假設 `_write = 3` ,則得到 `write_lhs = 3; write_rhs = 5`。
如果該次需要寫入的位元超過 `write_lhs` ( 即 `bitsize > write_rhs` ), 代表接下來要貼上的 8 位元有跨單位 ,故需要做以下處理。
```
/* Cross multiple bytes */
*dest++ = (original & mask) | (data >> write_lhs);
original = *dest & write_mask[bitsize - write_rhs];
*dest = original | (data << write_rhs);
```
分述如下:
* 將前 5 bit 寫入 `dest` 中,其中不能汙染到前 3 bit,且需將 `dest` 往下移動一格
* 把下一格 `dest` 的右邊 5 個 bit 保留好存到 `original` ( 其中左邊 3 個清成 0 )
* 把 `original` 往左移 5 bit,所以目前的前三 bit 就是上次 data 還沒複製到 dest 的資料。透過 | 把第二步驟留下的左三 bit 填滿
示意圖如下 :
粉紅色為貼上 data 的部分
```graphviz
digraph {
graph [rankdir = TB]
node [shape=plaintext, fontsize=40];
"" -> "before" -> "data" ->"after" ->"lhs/rhs"[fontcolor=black, color=white];
node [shape=none, fontcolor=black, fontsize=25, width=4.75, fixedsize=true];
byte0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17">
<tr>
<td port="port0">1</td>
<td port="port1">1</td>
<td port="port2">0</td>
<td port="port3">0</td>
<td port="port4">1</td>
<td port="port5">0</td>
<td port="port6">1</td>
<td port="port7">1</td>
</tr>
</table>>
];
byte1 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17">
<tr>
<td port="port0">0</td>
<td port="port1">1</td>
<td port="port2">0</td>
<td port="port3">0</td>
<td port="port4">1</td>
<td port="port5">0</td>
<td port="port6">0</td>
<td port="port7">1</td>
</tr>
</table>>
];
byte3 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17">
<tr>
<td port="port0">1</td>
<td port="port1">1</td>
<td port="port2">0</td>
<td port="port3" bgcolor="#ffcce5">0</td>
<td port="port4" bgcolor="#ffcce5">1</td>
<td port="port5" bgcolor="#ffcce5">0</td>
<td port="port6" bgcolor="#ffcce5">1</td>
<td port="port7" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
byte4 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17">
<tr>
<td port="port0" bgcolor="#ffcce5">1</td>
<td port="port1" bgcolor="#ffcce5">1</td>
<td port="port2" bgcolor="#ffcce5">1</td>
<td port="port3">0</td>
<td port="port4">1</td>
<td port="port5">0</td>
<td port="port6">0</td>
<td port="port7">1</td>
</tr>
</table>>
];
t0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17">
<tr>
<td port="port3" bgcolor="#ffcce5">0</td>
<td port="port4" bgcolor="#ffcce5">1</td>
<td port="port5" bgcolor="#ffcce5">0</td>
<td port="port6" bgcolor="#ffcce5">1</td>
<td port="port7" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
t1 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="17">
<tr>
<td port="port0" bgcolor="#ffcce5">1</td>
<td port="port1" bgcolor="#ffcce5">1</td>
<td port="port2" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
bitsize [shape=record label="bitsize", color="white", fontcolor="black",fontsize=35];
"bitsize" -> byte0:port3 [color=black];
"bitsize" -> byte1:port2 [color=black];
t0:port3 -> byte0:port3 [color=none];
t1:port0 -> byte1:port0 [color=none];
byte3:port3 -> t0:port3 [color=none];
byte4:port0 -> t1:port0 [color=none];
write_lhs [shape=record label="write_lhs", color="white", fontcolor="black",fontsize=30];
"write_lhs" -> byte3:port3 [color=black arrowhead=none];
dest1 [shape=record label="dest(1)", color="white", fontcolor="black",fontsize=30];
"dest1" -> byte0:port0 [color=none];
dest2 [shape=record label="dest(2)", color="white", fontcolor="black",fontsize=30];
"dest2" -> byte1:port4 [color=none];
{ rank=same; "before"; byte0; byte1 }
{ rank=same; "data"; t0; t1; }
{ rank=same; "lhs/rhs"; write_lhs;}
{ rank=same; "after"; byte3; byte4; }
}
```
上圖之詳細過程如下 :
* 左半
```graphviz
digraph {
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
a [label="1|1|0|0|1|0|1|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"&"
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
b[label="1|1|1|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"="
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
c[label="1|1|0|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"dest(1)"->a[style=invis]
"mask"->b[style=invis]
"original & mask"->c[style=invis]
{ rank=same; a; "&"; b; "="; c}
}
```
```graphviz
digraph{
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
d[label="1|1|0|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"|"
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
e[label="0|0|0|0|1|0|1|1",color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"="
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
f[label="1|1|0|0|1|0|1|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=none, fontcolor=black, fontsize=20, width=1, fixedsize=true];
byte0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="7">
<tr>
<td port="port0">1</td>
<td port="port1">1</td>
<td port="port2">0</td>
<td port="port3" bgcolor="#ffcce5">0</td>
<td port="port4" bgcolor="#ffcce5">1</td>
<td port="port5" bgcolor="#ffcce5">0</td>
<td port="port6" bgcolor="#ffcce5">1</td>
<td port="port7" bgcolor="#ffcce5">1</td>
</tr>
</table>>
];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"data>>write_lhs"->e[style=invis]
"original & mask"->d[style=invis]
f->byte0[style=invis]
"="->"=>"[style=invis]
{ rank=same; d; "|"; e; "="; f}
}
```
* 右半
```graphviz
digraph {
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
a [label="0|1|0|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"&"
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
b[label="0|0|0|1|1|1|1|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"="
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
c[label="0|0|0|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"dest(2)"->a[style=invis]
"mask"->b[style=invis]
"original"->c[style=invis]
{ rank=same; a; "&"; b; "="; c}
}
```
```graphviz
digraph{
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
d[label="0|0|0|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"|"
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
e[label="1|1|1|0|0|0|0|0", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"="
node [shape=record, fontcolor=black, fontsize=14, width=3, fixedsize=true];
f[label="1|1|1|0|1|0|0|1", color=black, fillcolor=white, style=filled, fontsize=20];
node [shape=none, fontcolor=black, fontsize=20, width=1, fixedsize=true];
byte0 [
shape=none
label=<<table border="0" cellborder="1" cellspacing="0" cellpadding="7">
<tr>
<td port="port0" bgcolor="#ffcce5">1</td>
<td port="port1" bgcolor="#ffcce5">1</td>
<td port="port2" bgcolor="#ffcce5">1</td>
<td port="port3" >0</td>
<td port="port4" >1</td>
<td port="port5" >0</td>
<td port="port6" >0</td>
<td port="port7" >1</td>
</tr>
</table>>
];
node [shape=plaintext, fontcolor=black, fontsize=30, width=1];
"data<<write_rhs"->e[style=invis]
"original"->d[style=invis]
f->byte0[style=invis]
"="->"=>"[style=invis]
{ rank=same; d; "|"; e; "="; f}
}
```
而第二種情況則不需考慮跨單位的處理
```
*dest++ = (original & mask) | (data >> write_lhs);
```
`(original & mask)` 讓左邊保持原本的狀態,右邊清成 0,接著透過 | 把右邊清掉的位元填上 `data`。
```
count -= bitsize;
```
最後再更新還需寫多少 bit 即可。
### 2. 說明其改進空間
由於過多的 branch miss 會造成整體效率下降,所以可以減少程式中不必要的 if 條件式之使用,便可減少分支的產生,以利提升程式效能。
這段 `bitcpy` 讀取的過程中,因為最後都能夠用 `read_mask` 將 `data` 中不需要的部分剔除,因此在這之前其實不需要再去調整 data 要讀取的 range 有多大、以及有沒有跨越到下一個 byte,直接將完整一個 byte 的大小讀取出來,最後再用 `read_mask` 留下需要的部分即可,因此程式可以改成如下,將所有 if 條件式移除:
```diff=
size_t bitsize = (count > 8) ? 8 : count;
- if (read_lhs > 0) {
data <<= read_lhs;
- if (bitsize > read_rhs)
data |= (*source >> read_rhs);
- }
- if (bitsize < 8)
data &= read_mask[bitsize];
...
```
此外,程式中另外宣告的兩個 `mask`,在使用時需要來回查看其內容,使用上有些許不便,額外宣告這兩個陣列也會增加記憶體 stack 空間的使用,應該有更好的實作方式,觀察後發現兩陣列的內容有規律性,其實透過 bitwise 的操作就可以直接完成了,取代方式如下:
```c
read_mask[i] == 0xFF << (8 - i)
write_mask[i] == 0xFF >> i
```
但是實驗結果發現,以上取代方式反而會導致時間成本上的增加。
所以參考 [bakudr18](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/SkS-Y_lX_#%E6%94%B9%E5%96%84-bitcpy-%E6%95%88%E8%83%BD) 同學的實做方式,將 `mask` 的部份改成
```c=
#define READMASK(x) ((uint8_t)(~0U) << (8 - (x)))
#define WRITEMASK(x) (~READMASK(x))
```
又在 `/ 8` 的部份可以換成 `>> 3`,以提升運算速度
```diff=
size_t read_lhs = _read & 7;
size_t read_rhs = 8 - read_lhs;
- const uint8_t *source = (const uint8_t *) _src + (_read / 8);
+ const uint8_t *source = (const uint8_t *) _src + (_read >> 3);
size_t write_lhs = _write & 7;
size_t write_rhs = 8 - write_lhs;
- uint8_t *dest = (uint8_t *) _dest + (_write / 8);
+ uint8_t *dest = (uint8_t *) _dest + (_write >> 3);
```
再來,可以減少變數的宣告, `orginal` 其實可以直接用 `*dest` 取代,並不影響結果。
```diff=
- uint8_t original = *dest;
uint8_t mask = read_mask[write_lhs];
if (bitsize > write_rhs) {
/* Cross multiple bytes */
- *dest++ = (original & mask) | (data >> write_lhs);
+ *dest++ = (*dest & mask) | (data >> write_lhs);
- original = *dest & write_mask[bitsize - write_rhs];
+ *dest = *dest & write_mask[bitsize - write_rhs];
- *dest = original | (data << write_rhs);
+ *dest = *dest | (data << write_rhs);
} else {
// Since write_lhs + bitsize is never >= 8, no out-of-bound access.
mask |= write_mask[bitsize - write_lhs];
- *dest++ = (original & mask) | (data >> write_lhs);
+ *dest++ = (*dest & mask) | (data >> write_lhs);
}
```
為了提升運算效能,可以將 if 換成 bitwise ,避免不必要的 branch miss
因此將 `bitsize = (count > 8) ? 8 : count` 換成 bitwise 運算
想法是利用 `min(a,b)` 之 bitwise 運算,其中 `b` 帶入 `8`
即 `min(a,8)` ,符合預期結果
並參考 [先前一對一討論筆記](https://hackmd.io/CEe0nVwSQgOwHg5rVt32HQ) 分析出 min(a,b)
:::info
已知 `A xor B xor A = B`
並考慮 `a-b` 是否為正,
針對 `a-b` 最左 bit 進行判斷,若為 `1` 即表示 `a<b`,否則表示 `a>b`
寫成程式碼可表示成可用 `(a-b)&(1<<31)`
再將此結果右移 31 位到最右 bit,即 `(((a-b) & (1<<31))>>63)`
此時 結果為 `1` or `0`
加上負號以二補數表示即 `111..111` or `000...000`
再與 `(a^b)` 做 `&` 運算,
其結果為 `(a^b)&(111...111) = (a^b)` or `(a^b)&(000...000) = 0`
最後再與 `b` 做 `^` 運算,可得 `b^(a^b) = a` or `b^0=b`,
因此可判斷出 `min(a,b)`
* 相對應的表格如下
| operation | a < b | a > b |
| -------- | -------- | -------- |
| a-b | 1xx..xxx | 0xx..xxx |
| (a-b)&(1<<31) | 100..000 | 000...000 |
| ((a-b) & (1<<31)) >>63 | 000..001 | 000..000 |
| -(((a-b) & (1<<31))>>63) | 111..111 | 000..000 |
| (a ^ b) & -(((a-b) & (1<<31))>>63) | a^b | 000..000|
| b ^ ((a ^ b) & -(((a-b) & (1<<31))>>63)|a|b|
:::
可得以下結果
```diff=
- size_t bitsize = (count > 8) ? 8 : count;
+ size_t bitsize = 8^((count^8)&(-(((count-8)&(1<<31))>>63)));
```
將上述整理之後,`my_bitcpy` 程式碼如下
```c=
void v1_bitcpy(void *_dest, /* Address of the buffer to write to */
size_t _write, /* Bit offset to start writing to */
const void *_src, /* Address of the buffer to read from */
size_t _read, /* Bit offset to start reading from */
size_t count)
{
size_t bitsize;
uint8_t data, original, mask;
size_t read_lhs = _read & 7;
size_t read_rhs = 8 - read_lhs;
const uint8_t *source = (const uint8_t *) _src + (_read >>3);
size_t write_lhs = _write & 7;
size_t write_rhs = 8 - write_lhs;
uint8_t *dest = (uint8_t *) _dest + (_write >>3);
static const uint8_t MASK[9] = { 255, 127, 63, 31, 15, 7, 3, 1, 0 };
while (count > 0) {
// read data
bitsize = 8^((count^8)&(-(((count-8)&(1<<31))>>63)));
data = (*source++ << read_lhs) | (*source >> read_rhs);
data &= ~MASK[bitsize];
// write data
mask = ~MASK[write_lhs];
*dest++ = (*dest & mask) | (data >> write_lhs);
*dest &= MASK[bitsize - write_rhs];
*dest |= (data << write_rhs);
count -= bitsize;
}
}
```
`main` 程式碼如下
```c=
int main()
{
struct timespec t1,t2;
FILE *fp = fopen("comp_data","w");
for (int size = 8; size < 8000; size++) {
memset(&output[0], 0x00, sizeof(output));
clock_gettime(CLOCK_REALTIME, &t1);
bitcpy(&output[0], 2, &input[0], 4, size);
clock_gettime(CLOCK_REALTIME, &t2);
printf( "%lu\n", t2.tv_nsec - t1.tv_nsec);
fprintf(fp, "%d\t", size);
fprintf(fp, "%lu\t", t2.tv_nsec - t1.tv_nsec);
memset(&output[0], 0x00, sizeof(output));
clock_gettime(CLOCK_REALTIME, &t1);
my_bitcpy(&output[0], 2, &input[0], 4, size);
clock_gettime(CLOCK_REALTIME, &t2);
printf( "%lu\n", t2.tv_nsec - t1.tv_nsec);
fprintf(fp, "%lu\n", t2.tv_nsec - t1.tv_nsec);
}
fclose(fp);
return 0;
}
```
使用 `perf` 測量分支數量,參考自 [Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)
```
$ sudo perf record \
-e branch-misses:u,branch-instructions:u \
-F 10000 ./bitcopy
$ sudo perf report
```
原始版本的測量結果如下
```
Available samples
579 branch-misses:u
804 branch-instructions:u
```
改寫後的版本測量如下
```
Available samples
384 branch-misses:u
582 branch-instructions:u
```
可從結果看出明顯減少許多分支指令
再使用 `perf` 測量 cache miss 等資訊
```
$ sudo perf stat --repeat 5 \
-e cache-misses,cache-references,\
instructions,cycles ./bitcopy
```
原先的版本
```
Performance counter stats for './bitcopy' (5 runs):
269,893 cache-misses # 11.976 % of all cache refs ( +- 2.50% )
2,253,672 cache-references ( +- 1.70% )
355,270,927 instructions # 1.71 insn per cycle ( +- 0.06% )
207,961,598 cycles ( +- 2.99% )
0.4051 +- 0.0508 seconds time elapsed ( +- 12.54% )
```
改良後的版本
```
Performance counter stats for './bitcopy' (5 runs):
201,930 cache-misses # 8.442 % of all cache refs ( +- 3.01% )
2,391,934 cache-references ( +- 12.01% )
356,046,206 instructions # 1.87 insn per cycle ( +- 0.04% )
190,125,028 cycles ( +- 5.07% )
0.1727 +- 0.0147 seconds time elapsed ( +- 8.52% )
```
做以上改善之後,利用 gnuplot 製圖觀察與原先程式碼在時間成本上的比較。
```
$ gnuplot myplot.gp
$ eog result.png
```
![](https://i.imgur.com/mjMklHp.png)
然而,原實做方法必須對 `count > 8` 的資料拆成三段進行處理,但在參考同學的共筆之後,發現可以使用一個迴圈及一個暫存器來處理這一個部份,最後再針對剩餘部份複製即可,這裡須注意剩餘部份是否橫跨兩個 byte ,處理方式如最後的 if 判斷,額外做複製。
```c=
#define READMASK(x) ((uint8_t)(~0U) << (8 - (x)))
#define WRITEMASK(x) (~READMASK(x))
void v2_bitcpy(void *_dest, /* Address of the buffer to write to */
size_t _write, /* Bit offset to start writing to */
const void *_src, /* Address of the buffer to read from */
size_t _read, /* Bit offset to start reading from */
size_t count)
{
size_t read_lhs = _read & 7;
size_t read_rhs = 8 - read_lhs;
const uint8_t *source = (const uint8_t *) _src + (_read >>3);
size_t write_lhs = _write & 7;
size_t write_rhs = 8 - write_lhs;
uint8_t *dest = (uint8_t *) _dest + (_write >>3);
uint8_t data;
/* copy until count < 8 bits */
for (size_t bytecount = count >> 3; bytecount > 0; bytecount--) {
data = (*source++ << read_lhs) | (*source >> read_rhs);
*dest++ = (*dest & READMASK(write_lhs)) | (data >> write_lhs);
*dest = (*dest >> write_lhs) | (data << write_rhs);
}
count &= 7;
/* copy the remaining count */
data = ((*source++ << read_lhs) | (*source >> read_rhs)) & READMASK(count);
*dest++ = (*dest & READMASK(write_lhs)) | ((data & READMASK(count)) >> write_lhs);
if (count > write_rhs)
*dest = (*dest & WRITEMASK(count - write_rhs)) | (data << write_rhs);
}
```
利用 gnuplot 製圖結果如下
![](https://i.imgur.com/rtUY1Vq.png)
### 3. 問題
- [x] 效能圖干擾因素之修正
因為測量時間短,用機率統計(CDF)進行後處理,排除干擾
另外 cache , timer(microsecond) 也有影響
可參考 [Linux 效能分析的提示](https://hackmd.io/@sysprog/linux2021-fibdrv#-Linux-%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E6%8F%90%E7%A4%BA)
- [X] 改良後卻沒有顯著改善之解決方法
~~經過實驗證實,主要是將 mask 用 bitwise 運算代替造成的,若不改寫此部份,製圖後可看出其他改寫部份對時間成本的改善~~
目前已做以下更改,時間成本上有獲得改善,但仍未找出原方法造成時間成本上升之原因
```diff=
- data &= 0xFF << (8-bitsize);
+ data &= READMASK(bitsize);
- mask = 0xFF << (8 - write_lhs);
+ mask = READMASK(write_lhs);
- *dest &= (0xFF >> (bitsize - write_rhs));
+ *dest &= WRITEMASK(bitsize - write_rhs);
```
### 4. 補充
- [ ] [lib/bitmap.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/bitmap.c?h=v5.12-rc2#n292)
```cpp=292
void __bitmap_replace(unsigned long *dst,
const unsigned long *old, const unsigned long *new,
const unsigned long *mask, unsigned int nbits)
{
unsigned int k;
unsigned int nr = BITS_TO_LONGS(nbits);
for (k = 0; k < nr; k++)
dst[k] = (old[k] & ~mask[k]) | (new[k] & mask[k]);
}
EXPORT_SYMBOL(__bitmap_replace);
```
- [ ] [drivers/video/fbdev/amifb.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2#n2602)
```cpp=2597
/*
* Unaligned forward bit copy using 32-bit or 64-bit memory accesses
*/
static void bitcpy(unsigned long *dst, int dst_idx, const unsigned long *src,
int src_idx, u32 n)
...
```
- [ ] [linux/include/linux/bitmap.h](https://elixir.bootlin.com/linux/latest/source/include/linux/bitmap.h#L245) 中的 `bitmap_copy` 及 `bitmap_copy_clear_tail`
採逐 bit 進行資料複製,但此兩函式皆不會保留 byte 內沒有複製到的 bit。
```c=245
static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
unsigned int nbits)
{
unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
memcpy(dst, src, len);
}
/*
* Copy bitmap and clear tail bits in last word.
*/
static inline void bitmap_copy_clear_tail(unsigned long *dst,
const unsigned long *src, unsigned int nbits)
{
bitmap_copy(dst, src, nbits);
if (nbits % BITS_PER_LONG)
dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
}
```