contributed by < linD026
>
linux2021
1
2
3
4
1
對於如何得到環形鏈結串列的中間節點,重點在於第 5 行的判斷。
這其實很好理解,和我之前看的 YouTube: If Programming Was An Anime 有關 (雖然裡面主要講的是 floyd's tortoise and hare,但對可以讓利用速度快慢所造成的差異性理解更清楚)。
fast
是 slow
的兩倍走訪節點數,就能確保在達成終止條件 if (fast->next == list || fast->next->next == list)
時 slow
在中間節點。此函式為 merge sort 重新連接的步驟,函式輸入 lhs (left), rhs (right), head (result) 。
此函式為 merge sort 主要遞迴程式。
list_cut_position
把 q
的左半部(前半段)節點移動到 left
上。left
和 q
分別進行原函式遞迴,之後再利用 list_merge
合併到 sorted
。q
,把排序完畢的鏈結串列存回去 q
。我認為結構可以改成:
把 head
和 node
的結構所包含的資訊分開,讓兩者持有自己類別的資訊。
因此關於相關的函式操作需更改成下列形式:
4,598,645 bytes allocated
3,945,661 bytes allocated
完整實作可見: main.c
完整實作可見: list.h
Reference
完整實作可見: analysis.c
temp note :
根據註解此 mergesort 會盡可能地以 2:1 合併,因此當有兩個 大小的串列會合併成 大小的串列來達到避免 cache thrashing 直到 個元素至 cache 。
再者雖然 mergesort 沒有 fully-eager bottom-up mergesort 好,但用較少的比較 ( ),因此只要在一般情況下跟 L1 合適就比較快。
L1 為 CPU 的 cache 。可以下
getconf -a | grep CACHE
查看。
解釋分離與合併時的數量是如何符合最佳化。
說明對於數量是如掌控與操作。
會讓 tail 往右側移一位到 head 的位置。
關於 bit 如何操作請看 第 158 行。
假設在 list_sort()
裡的 b 先排,則 tail 會在 a 上 :
最後再把剩下的 pending 合併排序即可。
Reference
tree sort vs. merge sort
uint16_t
單位,且由以下程式碼給出:-fstack-usage
得到,中間數字為 stack usage (單位: Byte)。valgrind -v --leak-check=full
。size
。gprof
。關於完整 analysis.txt
資料請點擊函式標題。
tree-sort.c | list_sort.c | |
---|---|---|
total heap usage | 10,003 allocs 3 frees 248,664 bytes allocated |
10,003 allocs 3 frees 248,664 bytes allocated |
main function | 192 dynamic (tree_sort) | 128 static (list_sort) |
可以看出雖然 list-sort cmpu16
高達 130915357 比其他函式呼叫的多,但在從時間花費上可以看出其所造成的成本並不比 tree sort 其他函式呼叫累積起來的成本高。
然而上述資料所呈現的記憶體使用量沒有執行時期的紀錄,因此我開一個執行緒利用 #include <sys/resource.h>
裡的 struct rusage
來記錄每 1ms 記憶體的使用量,但這次節點數為 10000 、次數 100 次:
dtusage.h
。為何不用 valgrind –tool=massif 呢?
沒想到 valgrind 有這功能因此試了一下,只貼上記憶體圖形化與 stack 使用率較為突出的部分:
註:tree-sort.c 有較高的 stack 使用量,然而
list-sort.c
除了 n = 1 比較高外都皆為 480 。關於此問題請看 n = 1 stack 使用量大問題。
經反覆測試與檢測初始化所造成的成本發現, n = 1 會提高的原因是因為 analysis.c 的 linked list 是先配置完記憶體後,進入數測量次數的迴圈再走訪一次設定亂數。
會這麼做的原因是因為程式比較好寫,不須寫例外狀況(次數為一)。因此第一次會因為呼叫 list_add
與設定鑑測資料(如時間)而導致 n = 1成本變高。
以下是函式簡化後:
可以看出 stack 被壓縮到先前運行中段時的成本 480 。
Reference
tim.tv_sec
and tim.tv_nsec
?2
考慮函式 func 接受一個 16 位元無號整數 ,並回傳小於或等於 的 power-of-2 (漢字可寫作為 2 的冪)
N |= N >> 1;
先省略 4 到 5 行,因為這裡還用不到。
return (N + 1) >> 1;
但這就有問題了,如果是 65535 (1111_1111_1111_1111) 預期結果會是什麼?
如果是以平常寫程式的經驗推論,會是在 3 到 6 行都是與輸入一樣後,加 1 使其 overflow 為 [1]_0000_0000_0000 再 >> 1
。
應該等於 0 吧。
然而實測卻是這樣:
可能會覺得 printf 的型態 ( %d ) 是 int 非 uint16_t ( unsigned short ),但這並不能解釋傳輸的值為
.. 1000 0000 0000 0000
。 ( 已補上各個型態輸出 )
因此閱讀了 C11 N2310 ISO/IEC 9899:20172x(E) 的 6.3 Conversions 後有了些想法:
以下若非特別註明皆引述自 C11 N2310 ISO/IEC 9899:20172x(E)。
1 Several operators convert operand values from one type to another automatically. This subclause specifies the result required from such an implicit conversion, as well as those that result from a cast operation (an explicit conversion).
數個 operators 會自動地對 operand values 進行型態轉換。
— The rank of a signed integer type shall be greater than the rank of any signed integer type with less precision.
— The rank of long long int shall be greater than the rank of long int, which shall be greater than the rank of int, which shall be greater than the rank of short int, which shall be greater than the rank of signed char.
— The rank of any unsigned integer type shall equal the rank of the corresponding signed integer type, if any.
對各個型態的 integer 之間的 rank 有明確規範。
1 When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type)
一個 integer type 再轉換到另一個 integer type 時,若可表示則不變。
1 Many operators that expect operands of arithmetic type cause conversions and yield result types in a similar way. The purpose is to determine a common real type for the operands and result. For the specified operands, each operand is converted, without change of type domain, to a type whose corresponding real type is the common real type. Unless explicitly stated otherwise, the common real type is also the corresponding real type of the result, whose type domain is the type domain of the operands if they are the same, and complex otherwise. This pattern is called the usual arithmetic conversions :
…
Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
5 The type of an integer constant is the first of the corresponding list in which its value can be represented.
可以知道 1 會是 int 型態,亦即 int32_t 。
型態為 uint16_t
的變數 N
會因為 1 轉成有號整數 (int32_t
) 型態,再做 right shift 。
0000 0000 0000 0001 _ 0000 0000 0000 0000
5 The result of >> is right-shifted bit positions. If has an unsigned type or if has asigned type and a nonnegative value, the value of the result is the integral part of the quotient of . If E1 has a signed type and a negative value, the resulting value is implementation-defined.
0000 0000 0000 0000 _ 1000 0000 0000 0000 => 32768
- kernel api - is_power_of_2
bool is_power_of_2(unsigned long n)Determine whether some value is a power of two, where zero is not considered a power of two.
確認 n 不為零,且在與自己減一過後的值 AND 後為零的話回傳 true 。
(1000 & 0111) == 0 => true
(1111 & 1110) == 0 => false
1UL 為 unsigned long 型的 1 。
會區分 32 與 64 位元。
而繼續追 fls64
會查到是在 arch
目錄下:
arch/alpha/include/asm/bitops.h, line 369 (as a function)
arch/alpha/include/asm/bitops.h, line 376 (as a function)
arch/powerpc/include/asm/bitops.h, line 235 (as a function)
arch/s390/include/asm/bitops.h, line 395 (as a function)
arch/x86/include/asm/bitops.h, line 366 (as a function)
可以看到會因各個架構而有不同定義,在 x86 底下: Linux kernel source tree 。
如果查看 alpha 會是:
Built-in Function: int __builtin_clzl (unsigned long)
Similar to__builtin_clz
, except the argument type is unsigned long.
而 __builtin_clz
則是計算從 most significant bit 開始有幾個 0 ,亦即在到第一個 1 之前有幾個 0 。
roundup_pow_of_two
fls_long
(fls64) 會是 64 - 7 = 57 。rounddown_pow_of_two
fls_long
(fls64) 會是 64 - 7 = 57 。在記憶體分配的 buddy memory allcation 原理:
There are various forms of the buddy system; those in which each block is subdivided into two smaller blocks are the simplest and most common variety. Every memory block in this system has an order, where the order is an integer ranging from 0 to a specified upper limit. The size of a block of order n is proportional to , so that the blocks are exactly twice the size of blocks that are one order lower. Power-of-two block sizes make address computation simple, because all buddies are aligned on memory address boundaries that are powers of two. When a larger block is split, it is divided into two smaller blocks, and each smaller block becomes a unique buddy to the other. A split block can only be merged with its unique buddy block, which then reforms the larger block they were split from.
可以知道記憶體會被切分成大小為 的 block ,且是以其作單位去分配給行程等使用。
wikipedia 給了一個很好理解的例子請去查看 budd memory allocation - example 。
如果查 國家教育研究院的雙語詞彙、學術名詞暨辭書資訊網 也可以查到說明
伙伴系統 buddy system
2003年6月 資訊與通信術語辭典
名詞解釋:
資料結構的一種儲存器管理系統。系統中儲存器的模塊總是劃分成2的冪次。如果一個模塊比存儲要求大得多,就將它分成兩個規模都是2的冪次的模塊,稱此兩模塊為〝伙伴〞。如果兩個伙伴均為可用,可再重新組合成一個較大的模塊,因為它們都是2的冪次,所以容易計算出模塊的伙伴的位址,而後確定是否可以組合成大模塊。
但不可能每個去要記憶體的行程都剛好使用 大小,因此在 Linux 核心內部有 slab allocator 去管理剩下空閒的記憶體。
在其中,如何知曉資料身處在哪個 block 此 is_power_of_2
等函式就顯得重要許多。
Just like any good idea, this approach introduces new complications into a system as well. For example, how much memory should one dedicate to the pool of memory that serves specialized requests of a given size, as opposed to the general pool? One particular allocator, the slab allocator by uber-engineer Jeff Bonwick (which was designed for use in the Solaris kernel), handles this issue in a rather nice way [B94].
Specifically,** when the kernel boots up, it allocates a number of object caches for kernel objects that are likely to be requested frequently (such as locks, file-system inodes, etc.)**; the object caches thus are each segregated free lists of a given size and serve memory allocation and free requests quickly. When a given cache is running low on free space, it requests some slabs of memory from a more general memory allocator (the total amount requested being a multiple of the page size and the object in question).
完整請看 YouTube : SL[AUO]B: Kernel memory allocator design and philosophy
YouTube : Linux 核心設計:記憶體管理 (二) (2019-04-14)
註:此標示僅為簡介,若要完整知道請全部看完三部共 8 小時 46 分鐘左右。
mm/slab_common.c
mm/slob.c
在 mm/slab_common.c
第 556 行和 mm/slob.c
第 483 行說道 ,是為了確保在 kmalloc caches 是對齊。
create_boot_cache
在 mm/slab.c
, mm/slab_common.c
和 mm/slub.c
被呼叫。__do_kmalloc_node
則在 mm/slab.c
, mm/slob.c
被呼叫。而關於 .. guarantee natural alignment for kmalloc ..
這行註解,可以從 lwn 的一篇文章看起 Alignment guarantees for kmalloc() :
In particular, Babka wanted to discuss when kmalloc() should return objects with their "natural alignment". What that term means was not actually defined until near the end of the session; presumably nobody will object to a slight bit of reordering at this point. Natural alignment for an object means that its beginning address is a multiple of each of up to three different quantities: ARCH_KMALLOC_MINALIGN (eight bytes, by default), the cache-line size (for larger objects), and the size of the object itself when that size is a power of two. The actual required alignment will generally be the least common multiple of the three.
可以看到對於 natural alignment 的說明,然而此篇文章主要是在探討 kmalloc 的意外狀況:
Most of the time, Babka said, kmalloc() already returns naturally aligned objects for a power-of-two allocation size; that result falls out of how the slab pages are laid out. But there are exceptions: when SLUB debugging is enabled or when the SLOB allocator is used. Few people worry much about SLOB, but the SLUB debugging exception can lead to problems for unsuspecting developers.
有了對 natural alignment 定義後,可以再看 natural alignement 的大略描述,引自 Red Hat Developer blog :
Natural alignment describes the practice in which the address of the data type is a multiple of the size of the data type. Using natural alignment allows the processor to avoid doing multiple memory operations to access a single value. This natural alignment has a cost, and it can lead to larger data structures.
To improve the speed of memory accesses and processor performance, each memory access on modern processors deals with a larger group of bits than a single eight bit byte, typically, 64 bits (eight bytes) or 128 bits (16 bytes).
For example, a double precision floating point value is made up of eight bytes (b0 through b7) and we assume the memory width is eight bytes. If the double precision floating point number is naturally aligned, the entire value starting at address+0 can be read with a single memory access (as can be seen in the diagram below.)
However, if the floating point number is not aligned (unaligned), it straddles two words in memory and two read must be performed (one read for b0 through b3 and another read for b4 through b7). Additionally, the processor needs to paste the bits from the two reads into a single 64-bit value which also takes time.
可以看到是為了對 CPU 讀取數值時的次數等最佳化。
此篇內容也有提到如果進行了 Alignment 可能會因為原先要儲存的數值小於對齊的需求,因而有 padding 來填充使其符合大小。
可以利用 gcc 的 -Wpadded 來檢查是否有 padding 。
文中也給出了很好的例子來告訴我們再知曉其操作後如何 re-order 這些資料:
在 x86_64 下。
詳細請點此連結。
Reference
額外閱讀
在第 1713 行,檢測要配置給 CPU 的記憶體大小是否對齊。
Reference
3
_write
與 _read
是為了確保 bit 在做完複製後期結果為符合我們預期 (即想要的偏移量)。可以看到在第 4 到第 8 行對讀取資料的位置進行了修正:bitsize > write_rhs
分成了跨位元組與否的複製。uint8_t original = *dest;
儲存原先資訊uint8_t mask = read_mask[write_lhs];
設置 mask 為寫入的資料偏移位置做處理original & mask
(清除要放入位置的資料)data >> write_lhs
(調整 data 到我們要放入的位置)*dest++ = (original & mask) | (data >> write_lhs);
寫入original = *dest & write_mask[bitsize - write_rhs];
重新利用 original
為下一個位元組要放入的資料的 mask 。*dest = original | (data << write_rhs);
寫入example:
write in _ _ _ _ _ _ 1 1 , 1 1 _ _ _ _ _ _
write_rhs = 2
write_lhs = 6
引述自 Stéphane Marchesin - Linux Graphics Drivers: an Introduction :
4.2 Framebuffer operations
The framebuffer operations structure is how non-modesetting framebuffer callbacksare set. Different callbacks can be set depending on what functionality you wish to implement, like fills, copies, or cursor handling. By filling struct fb_ops callbacks, one can implement the following functions:Copy data from area to another
而 fb_copyarea
是被包在 fb_ops
結構中,在 drivers/video/fbdev/amifb.c 初始為以下:
amifb_copyarea
則一樣在 drivers/video/fbdev/amifb.c 中:
其中 copy_one_line
使用到 bitcpy
:
是以 byte 單位。
引自 Oracle® Linux 6Porting Guide
You can use the
copy_from_user()
andcopy_to_user()
functions to move data between kernel space and user space. Alternatively, when moving one, two, four, or eight bytes of data, you can use eitherput_user()
andget_user()
oraccess_ok()
to validate the user-space address followed by either__put_user()
or__get_user()
.If user programs require direct access to device memory, you can use the mmap() system call, which maps device memory directly into user space. For example, the X server uses mmap() to write to video adapter memory and PCI devices usually memory map their control registers to improve performance. A limitation is that the area being mapped has to be a multiple of PAGE_SIZE and start at a physical memory address that is also a multiple of PAGE_SIZE.
查閱資料時看到此函式,覺得有趣也上放來。
在 linux-kernel-labs.github.io 的 memory mapping 有看到
Enable the PG_reserved bit of each page with SetPageReserved(). Clear the bit with ClearPageReserved() before freeing the memory.
其中 SetPageReserved()
為 #define SetPageReserved(page) set_bit(PG_reserved, &(page)->flags)
。
而此作用在 linux-kernel-labs.github.io - memory mapping 是
Since the pages are mapped to user space, they might be swapped out. To avoid this we must set the PG_reserved bit on the page. Enabling is done using SetPageReserved() while reseting it (which must be done before freeing the memory) is done with ClearPageReserved():
因此去查了原始碼竟然是用 inline assembly 實作,是為了確保是 atomic 層級的操作
Reference
4
// thing, will put int to the box
// control all the thing
// __cstr_node is box
// pool is the place store those boxes
// hash => index_0 index_1, index -> linked list
// pool => all the node
// memory level
因為函式個別解釋會太過瑣碎,因此我以 test_cstr
為起始講解:
CSTR_BUFFER
cstr_buffer
為主的物件。CSTR_STACK_SIZE
大小為 128 Bytes 儲存在 stack 的小字串。__cstr_data
結構 ( aka cstring ),用來儲存資訊,是 cstr_buffer
唯一指向的物件。cstr_cat
type
非 CSTR_ONSTACK
或是但超出 CSTR_STACK_SIZE
則進入 cstr_cat2
。cstr_cat2
則會開始處理到中字串與大字串的範疇。if (*str == 0)
為判斷加上的字串是否在 CSTR_STACK_SIZE
層級內加完。s->cstr[i] = 0;
則是為了確保不要在進入 cstr_cat2
使接下來讀取字串產生問題。cstr_cat2
if (sa + sb < CSTR_INTERNING_SIZE)
是在判斷是否是在 CSTR_INTERNING 層級。cstr_interning
內。sb->str
。hash_blob
__cstr_ctx
裡儲存的 hash
中符合 cstr
的字串。struct __cstr_interning *si
中的 hash
尋找符合的字串。hash_blob
的 uint32_t hash
值除以 si->size - 1
的餘數為 index
,hash[index]
中開始尋找。
hash & (si->size - 1)
, 是因為有確保每次更新 si->size
都為二的倍數。pool
新增新的 node
並在 hash[index]
標記。
疑慮,是否有確保
n = &si->pool->node[si->index++];
在大於INTERNING_POOL_SIZE
時能正常運作。
cstring
型態。expand
hash
的大小不夠或初次呼叫時,會呼叫此函式。hash
大小會直接是 HASH_START_SIZE 16
。unsigned new_size = si->size * 2;
,此操作能確保 size
為 2 的次方。insert_node
一一複製原先的 hash所擁有的資料。5.44 Built-in functions for atomic memory access -
__sync_sub_and_fetch
至此,cstr_cat
的操作才算完成。
//TODO release close
接下來的函式是在針對單一字串進行讀取使所用的函式與巨集。
type
有不同的對應。CSTR_PERMANENT
或 CSTR_INTERNING
會直接回傳。CSTR_ONSTACK
會進行 cstr_clone
。
cstr_clone
則是在 s->hash_size 小於 CSTR_INTERNING_SIZE時,把 cstring s
記入到 hash
裡。關於 hash ,之前在看論壇時有人有推薦可以用 uthash 這個東西。
此並非是 C 的函式庫,只是個標頭檔因此使用起來很方便。
Add, find and delete are normally constant-time operations. This is influenced by your key domain and the hash function.
原始碼
蠻妙的是它是系列相關的專案, 包含了 utstring 。
atomic
and __sync
atomic
and __sync
__sync
的實作在某些硬體上根本沒有,官方也推薦新的程式碼該用別的These functions are implemented in terms of the ‘__atomic’ builtins (see __atomic Builtins). They should not be used for new code which should use the ‘__atomic’ builtins instead.
cstr_cat
- 字串完整性cstr_interning
函式亦即對 hash
操作時作保護,但對於字串操控還有 cstr_cat
這個函式。
test_cstr
沒辦法檢測對 string interning
的操作,因此我改成如下:
lock
,因此直接在另設個 lock
在呼叫時包起來。
ref
的數值大小。cstr_grab
和 cstr_release
對 ref
進行加減的實驗。ref: 3, type 0
ref: 0, type 32750
type
會是 32750 是因為保存其的cstring s
已被釋放掉。
ref
在增加之前已被減為零。
cat
一樣方式處理,但須加上 sleep()
:
關於老師所說
這是什麼可怕的解法?
是可以運用 reader and writer
的方法來處理。
設一個減值的 lock
擋住 release
但不擋住增值操作,之後再設一個 semaphore
來記錄目前增值有幾個。當那個 semaphore
為零代表增值操作已經完成, lock
才解除讓減值操作。
但當時候只為了區分兩種操作,為了讓實驗方便操作所以才直接設 sleep()
。
ref
在個執行緒保持一致性對函式 lock 即可。ref
值為何則須以分開增減值操作。
ref
為零的狀態且只可讀取。雖然有對共享值作如 __sync_sub_and_fetch
、 __sync_add_and_fetch
等保護,但在讀取數值與執行順序上卻沒有保證,關於這概念可以看 並行程式設計: 執行順序 - Sequential Consistency。
簡單來說,存到 cache 要作增減的值必須要保持在最新的狀態,而當其他執行緒要去存取時也要能夠看到前者的更新資訊。
Reference
相關閱讀