contributed by <linD026>
linux2021
For a small string, the free space remaining is recorded in `space_left`, and the characters themselves live in the `filler` byte array. Strings longer than `LARGE_STRING_LEN` (here 256) are stored on the heap. For heap strings, the string is reached through the `data` pointer and the length is recorded in `size`; the maximum length is bounded by the number of bits allocated to the field, here 54 bits, so up to 2^54 - 1 bytes. `capacity` has to cover the range that `size` can express, so it is stored as a power-of-two exponent; 6 bits are enough for that. The last four bits are the flags `is_ptr`, `is_large_string`, `flag2` and `flag3`. Strings longer than `LARGE_STRING_LEN` are stored on the heap.
A large string on the heap is prefixed with an `int`-typed `refcnt` (reference count), so care is needed when accessing the string. `xs_capacity` is recovered as a power of two when read out (by bit-shifting `1` to the left). `__builtin_clz` yields the number of leading 0-bits of the value; subtracting that from 32 (32 because the type here is `uint32_t`) gives the position, counted from the left, of the first set bit, and subtracting one more gives the exponent.
`xs_allocate_data` takes a `bool reallocate` that decides between growing the existing allocation and allocating new memory. When `len` qualifies as a large string, one can see that the allocation skips the first four bytes (reserved for the reference count). `const void *p` is the string to store; how it is stored is decided inside `xs_allocate_data`. `_Static_assert` is a check added in C11 that detects errors at compile time.
The constant expression is evaluated at compile time and compared to zero. If it compares equal to zero, a compile-time error occurs and the compiler must display message (if provided) as part of the error message (except that characters not in basic source character set aren't required to be displayed).
Otherwise, if expression does not equal zero, nothing happens; no code is emitted.
//TODO why this form: ((void) (struct{_Static_assert(); int dummy;}){1}), function())
`char **data` is the data to be copied; it is copied for `xs *x`, and only the pointer held by `char **data` is redirected to the copied content. The function does not operate on an `xs` structure; it works directly on the raw data and the copy destination. The unused `flag2` of the `xs` structure is renamed `sharing`, marking that the data a copied object uses is not its own, and `flag3` is renamed `source`, marking that a source object's data is being used by other objects. `capacity` is copied directly, and the large-string/small-string distinction then decides how the length is handled. With this, `xs_cow_lazy_copy` becomes redundant.
`orig` records the original pointer's position, while `dataptr` and `slen` point to the remaining wanted substring and its length; writing the result back through `orig` completes the trim. The `byte / 8` and `byte % 8` computations are improved to `byte >> 3` and `byte & 7`.
Again, `xs_cow_lazy_copy` is redundant.

Reference
For copy on write I define two flags: `sharing` says that the string is shared, i.e. the data belongs to someone else, while `source` says that the string is a source that others draw on.
When a string operation hits a CoW string, the two macros `xs_cow_write_trim` and `xs_cow_write_concat` handle it.
The drawback is that operating on a copied string requires knowing which string is its source, so that operations such as `xs_dec_refcnt` can be applied to that source.
Reference
Full source code.
The difference between copying after a plain `malloc` and copying after `xs_new`: the `malloc` version creates a pointer, `malloc`s a buffer, then `memcpy`s into it, while copying an `xs` mainly goes through `xs_cow_copy`. The memory allocation inside `xs_new` is estimated as a power of two (the capacity), so a request that is not itself a power of two is not served as-is and the next power of two is allocated instead. As for `realloc`, the drawback is that even when no extra memory is needed it still hands out double the space each time, leaving a chunk of memory of equal size unused.
A `reclaim` mechanism is therefore also needed. The `source` flag, made redundant by the reference count, is renamed `reclaim`. When the string storage of an `xs *x` is a pointer and the object is not a CoW shadow copy, the allocation can be adjusted down to an appropriately sized space (264), but the surplus memory (272) is not released immediately; it is first placed in the extra-heap, and later allocations can take memory from the extra-heap.
The following measures the performance difference when strings are stored on the stack (`opt_store`) versus on the heap (`normal_store`). Both the stack and the heap version perform the following operations:
With small strings, both the stack-stored and the heap-stored strings fit directly in the cache.
At the L1 level it is clear that the two show no significant gap:
Two arrays of `NUM` strings are used, namely `char *small[NUM] = {0};` and `xs small[NUM];`. For the step that prints a small string, the `NUM/2`-th small string of the array is chosen.
After amplifying the characteristics of stack and heap storage, L1 does show a difference:
In cache-misses, cache-references, instructions, cycles and ref-cycles, the stack-based implementation beats the heap-based one. This shows spatial locality: accessing data that is stored close together performs better. Unless stated otherwise, the quotes below are taken from stackoverflow - Is accessing data in the heap faster than from the stack?
For stack-based data, the stack-pointer-register-relative address can also be calculated and hardcoded at compile time. Then the stack-pointer-register may be adjusted by the total size of function arguments, local variables, return addresses and saved CPU registers as the function is entered and returns (i.e. at runtime). Adding more stack-based variables will just change the total size used to adjust the stack-pointer-register, rather than having an increasingly detrimental effect.
Because the absolute virtual address, or a segment- or stack-pointer-register-relative address can be calculated at compile time for global and stack based data, runtime access is very fast.
If successive lines of your source code list global variables, they'll be arranged in adjacent memory locations (albeit with possible padding for alignment purposes). The same is true for stack-based variables listed in the same function.
The stack is tied to the executing function; data stored in a function's own stack is read only by that function. For heap-based data, a runtime heap allocation library must consult and update its internal data structures to track which parts of the block(s) aka pool(s) of heap memory it manages are associated with specific pointers the library has provided to the application, until the application frees or deletes the memory. If there is insufficient virtual address space for heap memory, it may need to call an OS function like `sbrk` to request more memory (Linux may also call `mmap` to create backing memory for large memory requests, then unmap that memory on `free`/`delete`).
For the heap access, both the pointer and the heap memory must be in registers for the data to be accessible (so there's more demand on CPU caches, and at scale - more cache misses/faulting overheads).
With heap hosted data, the program has to access the data via a runtime-determined pointer holding the virtual memory address on the heap, sometimes with an offset from the pointer to a specific data member applied at runtime. That may take a little longer on some architectures.
The costs of keeping data on the stack versus on the heap differ greatly in allocation, in access, and in storage form.
In particular, the cost of maintaining storage in non-contiguous memory (the heap) is indeed higher than for the stack.
For a detailed comparison, the answers under that stackoverflow question also mention articles comparing the performance of the two:
Reference
This integrates week 2's string interning with week 3's `xs`.
The node structure managed by the hash table is changed so that the interned strings are controlled inside the hash table. `cstr_interning` returns the `xs` owned by the node recorded in the hash table, while `xs_new` makes a copy of it.
Only the hash table holds the original data. For interning, a newly built node is created first and its memory is then reclaimed down to an appropriate size. Everything outside the hash table is a copy, so the preparatory work for an operation only has to deal with the pointer and with the reference count of the original data. `xs_new` is used to create the data inside the hash table, and the original object is then released.
macro:
Data in the hash table is never deleted. For hash table data:
It is created with `xs_new`; the original data inside the hash table corresponds to `cstr_interning`. Operations go through `xs_cow_write_trim`. Small strings can be handled directly with operations such as `xs_trim`, so they are operated on in place. When `xs_concat` grows a string into a medium or large string, it is automatically written into the hash table, and the `xs` the user gets back is a copy. If an `xs` object is enlarged with `xs_grow`, the object is not recorded into the hash table. `xs_tmp` can be replaced with creation through `xs_new_self` if you want to release the object yourself and do not want it stored.
Full source code: linD026
DESCRIPTION
fork() creates a new process by duplicating the calling process. The new process is referred to as the child process. The calling process is referred to as the parent process. The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. Memory writes, file mappings (mmap(2)), and unmappings (munmap(2)) performed by one of the processes do not affect the other.
With vfork(), the parent process is suspended, and the child process uses the address space of the parent. Because vfork() does not use copy-on-write, if the child process changes any pages of the parent's address space, the altered pages will be visible to the parent once it resumes.
Quoted from: Operating System Concepts 10th Edition Asia Edition - page 400
Historic description
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller's data space, often needlessly, since usually immediately afterward an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent's memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables were held in a register.
Quoted from: Linux Programmer's Manual - vfork()
Reference
Related reading
When a process reads a page, a specific page table is consulted, depending on the hardware architecture.
Each process has a pointer (mm_struct→pgd) to its own Page Global Directory (PGD) which is a physical page frame. This frame contains an array of type pgd_t which is an architecture specific type defined in <asm/page.h>. The page tables are loaded differently depending on the architecture. On the x86, the process page table is loaded by copying mm_struct→pgd into the cr3 register which has the side effect of flushing the TLB. In fact this is how the function __flush_tlb() is implemented in the architecture dependent code.
kernel.org - Chapter 3 Page Table Management
When multiple processes share one page frame, CoW is used:
Operating system support in distributed system: this slide deck mainly covers threads in distributed systems
As the comment in the generic `page.h` shows, this is used for NOMMU architectures.
mm_struct
vm_area_struct / vmas
There is a single chain of vmas attached to the mm_struct that describes the entire address space, sorted by address. This chain also can be collected into a tree for faster lookup.
address_space
address_space structure is the anchor for all mappings related to the file object. It contains links to every vma mapping this file, as well as links to every physical page containing data from the file.
The only vmas not attached to an address_space are the ones describing the stack and the ones describing the bss space for the application and each shared library. These are called the ’anonymous’ vmas and are backed directly by swap space.
page
All physical pages in the system have a structure that contains all the metadata associated with the page.
`mmap` allocates a stretch of memory; the region is represented by a new or rewritten vma, linked to the task's `mm_struct` and `address_space`. The page table is not touched at the time of the `mmap` call. Three locks are involved:
a read/write semaphore - `mm_struct`
`page_table_lock` - `mm_struct`
`i_shared_lock` (`i_shared_sem`) - `address_space`
sharing PTE Page
The material above is somewhat dated, however, and no longer matches the current state.
On the evolution of pte_chain:
I therefore also looked up this slide deck to analyse how `struct page` is handled under CoW:
overall structure diagram of the vma
vma data lookup
a brief look at `struct page`
`anon_vma` manages CoW pages
`anon_vma` in action
`anon_vma` in action
kernel.org - mm Concepts overview
The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call.
Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.
How a page is handled under CoW.
kernel.org - Chapter 4 Process Address Space
`old_page` first has its anonymous-pages state handled, and the `anon_vma` ensures that a data lookup will not find the parent or siblings.
The `mm_struct` structure manages multiple `vm_area_struct`s, and the `vm_area_struct`s are themselves kept in a data structure (an `rbtree`). A `vm_area_struct`'s mapping can be an anonymous mapping. From a `vm_area_struct`, further structures such as `struct inode` and the `address_space` mapping (priority tree - `i_mmap`, radix tree - `page_tree`) lead on to `struct page`. The pages themselves are stored in a radix tree kept in the `address_space` (convenient for reverse lookup when a page is operated on directly). Functions such as `do_cp_page`, together with locks, carry out the CoW handling (/mm/memory.c). `establish_pte` builds the page table entry to improve efficiency.
Note: the implementation has since changed again:
The kernel's radix tree is, he said, a great data structure, but it has far fewer users than one might expect. Instead, various kernel subsystems have implemented their own data structures to solve the same problems. He tried to fix that by converting some of those subsystems and found that the task was quite a bit harder than it should be. The problem, he concluded, is that the API for radix trees is bad; it doesn't fit the actual use cases in the kernel.
an array of pointers indexed by an `unsigned long`
Part of the issue is that the "tree" terminology is confusing in this case. A radix tree isn't really like the classic trees that one finds in data-structure texts. Addition of an item to a tree has been called "insertion" for decades, for example, but an "insert" doesn't really describe what happens with a radix tree, especially if an item with the given key is already present there. Radix trees also support concepts like "exception entries" that users find scary just because of the naming that was used.
Further reading
Finally, the implementation keeps being improved, while the surrounding information can hardly keep up (for example, there is very little information on the xarray that appears to replace the radix tree in struct address_space).
Just finding correct information is hard enough.
One can start from today's code, but without knowledge of how the overall architecture evolved and changed, it is genuinely hard to follow.
Reference
Related reading (viewable with an lwn.net subscription)
// TODO