2021q1 Homework5 (quiz5)

contributed by < linD026 >

tags: `linux2021`

2021 年第 5 週隨堂測驗題目

0 進度

測驗 1

1 解釋程式碼以及效能分析

解釋程式碼

STRUCT_BODY

此 marco 是 on_heap 狀態下，儲存的資料結構。
結構組成是 8 bytes ( 64 bits ) + pointer ( 8 bytes ) 。
其中前者利用 54 bits 紀錄記憶體使用量 – size
1 bit 來判斷是否為動態記憶體配置 – on_heap
和用 6 bits 紀錄分配的記憶體大小的指數 – capacity
其他三個與 on_heap 相同性質的 flag – flag1, flag2, flag3
隨後是指向其資料的指標 – ptr

#define STRUCT_BODY(type)                                                  \
    struct {                                                               \
        size_t size : 54, on_heap : 1, capacity : 6, flag1 : 1, flag2 : 1, \
            flag3 : 1;                                                     \
        type *ptr;                                                         \
    }

NEXT_POWER_OF_2
- 利用
  $2^{n}$ 在二進制只會有一個 bit 為 1 的性質判斷。
- 譬如
  $s = 4$ ，
  $4_{10} = 0100_{2}$ ，!(
  $0100_{2}$ &
  $0011_{2}$ ) 為 true ，結果為 s 。
- 而當判斷式為否時，則利用 __builtin_clzl 找 the number of leading 0-bits in x後，在與 64 相減得到下一個
  $2^{n}$ 。
- 會是與 64 相減的是因為 __builtin_clzl 回傳的是 unsigned long 型態。而一般 uint64_t 為 typedef unsigned long uint64_t 。
  
  Built-in Function: int builtin_clz (unsigned int x)
  Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
  Built-in Function: int builtin_clzl (unsigned long)
  Similar to __builtin_clz, except the argument type is unsigned long.
```
#define NEXT_POWER_OF_2(s) \
    !(s & (s - 1)) ? s : (size_t) 1 << (64 - __builtin_clzl(s))
```
nonnull attribute
```
#define NON_NULL __attribute__((nonnull))
```
nonnull (arg-index, ...)
The nonnull attribute specifies that some function parameters should be non-null pointers.

v

此為 vector 的宣告。
t 為儲存資料的型態，s 為初始化時儲存資料的數量， name 此 vector 結構的名稱。
最後 ... 為不定變數。
第一個 struct 是先以 _Static_assert 判斷 s 是否過大。
union 則是除存於 heap 和 stack 的資料狀態。

#define v(t, s, name, ...)                                              \
    (void) ((struct {                                                   \
        _Static_assert(s <= 8, "it is too big");                        \
        int dummy;                                                      \
    }){1});                                                             \
    union {                                                             \
        STRUCT_BODY(t);                                                 \
        struct {                                                        \
            size_t filler;                                              \
            t buf[NEXT_POWER_OF_2(s)];                                  \
        };                                                              \
    } name __attribute__((cleanup(vec_free))) = {.buf = {__VA_ARGS__}}; \
    name.size = sizeof((__typeof__(name.buf[0])[]){0, __VA_ARGS__}) /   \
                    sizeof(name.buf[0]) -                               \
                1;                                                      \
    name.capacity = sizeof(name.buf) / sizeof(name.buf[0])

cleanup

cleanup (cleanup_function)
The cleanup attribute runs a function when the variable goes out of scope. This attribute can only be applied to auto function scope variables; it may not be applied to parameters or variables with static storage duration. The function must take one parameter, a pointer to a type compatible with the variable. The return value of the function (if any) is ignored.

If -fexceptions is enabled, then cleanup_function is run during the stack unwinding that happens during the processing of the exception. Note that the cleanup attribute does not allow the exception to be caught, only to perform an action. It is undefined what happens if cleanup_function does not return normally.

gcc 提供了 cleanup attribute ，讓當變數超出 scope 時，自動跑 cleanup 裡的函式。
須注意這不能應用在 parameters or variables 是靜態儲存的情況。

Variadic Macros

This kind of macro is called variadic. When the macro is invoked, all the tokens in its argument list after the last named argument (this macro has none), including any commas, become the variable argument. This sequence of tokens replaces the identifier __VA_ARGS__ in the macro body wherever it appears.
例子：

#define eprintf(...) fprintf (stderr, __VA_ARGS__)
eprintf ("%s:%d: ", input_file, lineno)
     →  fprintf (stderr, "%s:%d: ", input_file, lineno)

At least one argument 問題 - standard C

… if you left the variable argument empty, you would have gotten a syntax error, because there would have been an extra comma after the format string.

This formulation looks more descriptive, but historically it was less flexible: you had to supply at least one argument after the format string. In standard C, you could not omit the comma separating the named argument from the variable arguments.
(Note that this restriction has been lifted in C++2a, and never existed in GNU C; see below.)

解決方法

First, in GNU CPP, and in C++ beginning in C++2a, you are allowed to leave the variable argument out entirely:
```
eprintf ("success!\n")
     → fprintf(stderr, "success!\n", );
```
Second, C++2a introduces the __VA_OPT__ function macro. This macro may only appear in the definition of a variadic macro. If the variable argument has any tokens, then a __VA_OPT__ invocation expands to its argument; but if the variable argument does not have any tokens, the __VA_OPT__ expands to nothing:
```
#define eprintf(format, ...) \
  fprintf (stderr, format __VA_OPT__(,) __VA_ARGS__)
```
__VA_OPT__ is also available in GNU C and GNU C++.
Historically, GNU CPP has also had another extension to handle the trailing comma: the ‘##’ token paste operator has a special meaning when placed between a comma and a variable argument. Despite the introduction of __VA_OPT__, this extension remains supported in GNU CPP, for backward compatibility. If you write
```
#define eprintf(format, ...) fprintf (stderr, format, ##__VA_ARGS__)
```

Reference

效能分析

初步測試 - stack to heap

先對 vector 進行 push back ，測量對於記憶體的分配是否符合預期。
可以看到記憶體分配從 stack 到 heap 的轉換並沒有達到預想的原始大小 2 倍的分配。

// analysis
v(int, 3, vec_int, 1, 2, 3);
int i = 3;
for (;i < 1000;i++)
    vec_push_back(vec_int, i);
    
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1        146,056              136              128             8            0
  2        148,671              264              256             8            0
  3        151,615              520              512             8            0
  4        157,439            1,032            1,024             8            0
  5        169,023            2,056            2,048             8            0
  6        192,127            4,104            4,096             8            0
  7        235,969            4,104            4,096             8            0
99.81% (4,096B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.81% (4,096B) 0x109474: __vec_push_back (self_vector.c:121)
| ->99.81% (4,096B) 0x10A627: main (self_vector.c:211)
|   
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

在進一步顯示 capacity 的大小在從 stack 轉換至 heap 時，是如何改變的：

capacity(vec1)=4
capacity(vec1)=32
...
capacity(vec1)=64
...
capacity(vec1)=128
...

可以看出這是因為在 __vec_push_back 函式中，對於 stack to heap 的操作， capacity 是以 capacity(v) + 1 狀態更新。
因此次狀況會是：

i n i t a i l c a p a c i t y = 2;

capacity =

2^{c a p a c i t y} + 1;

c a p a c i t y = 4 + 1 = 5 = 2^{5} = 32;

if (v->size == capacity) {
    void *tmp =
        malloc(elemsize * (size_t) 1 << (v->capacity = capacity + 1));
    memcpy(tmp, v->buf, elemsize * v->size);
    v->ptr = tmp;
    v->on_heap = 1;
    memcpy(&v->ptr[v->size++ * elemsize], e, elemsize);
}

雖然不影響操作結果，但對於記憶體增長的曲線較不符合預期，因此如果不想遽增可以改成：

malloc(elemsize * (size_t) 1 << (v->capacity = ilog2(capacity) + 1));

CPU 使用率

 Performance counter stats for './m_self' (10000 runs):

               719      cache-misses              #    1.740 % of all cache refs      ( +-  1.08% )
            4,1345      cache-references                                              ( +-  0.05% )
           76,6800      instructions              #    0.97  insn per cycle           ( +-  0.02% )
           78,7666      cycles                                                        ( +-  0.06% )

       0.000313471 +- 0.000000765 seconds time elapsed  ( +-  0.24% )

2 限制與改進

FBVector - reuse previously-allocated memory

在 FBvector.md 中提及對於記憶體增長的倍數是 power of 2 的弊端，是因為先前的配置過得記憶體大小必定小於增長過後的大小，因此這不會使 vector 重新使用之前的記憶體。
導致最後必定會無法使用 deallocated chunks 。

The initial HP implementation by Stepanov used a growth factor of 2; i.e., whenever you'd push_back into a vector without there being room, it would double the current capacity. This was not a good choice: it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any of its previously-allocated memory.

1 + 2^^1 + 2^^2 + 2^^3... + 2^^n = 2^^(n+1) - 1
This means that any new chunk requested will be larger than all previously used chunks combined, so the vector must crawl forward in memory; it can't move back to its deallocated chunks.

而在後續也有提及任何小於 2 的數字可以保正在某個 reallocation 次數之後，可以重新使用原先記憶體。

But any number smaller than 2 guarantees that you'll at some point be able to reuse the previous chunks. For instance, choosing 1.5 as the factor allows memory reuse after 4 reallocations; 1.45 allows memory reuse after 3 reallocations; and 1.3 allows reuse after only 2 reallocations.

當然這是在假設 allocator 會儘量使用先前記憶體的情形，但這不影響我們避免最糟的選擇。

Of course, the above makes a number of simplifying assumptions about how the memory allocator works, but definitely you don't want to choose the theoretically absolute worst growth factor. fbvector uses a growth factor of 1.5. That does not impede good performance at small sizes because of the way fbvector cooperates with jemalloc (below).

factor - 1.5 更改

#define FACTOR 1.5
#define CHUNK_SIZE 4
#define vec_capacity(v) \
    ((v.on_heap) ? (size_t) (CHUNK_SIZE * pow(FACTOR, v.capacity)) : sizeof(v.buf) / sizeof(v.buf[0]))
static inline float ilog_factor(float n) /* log1.5(n) = log2(n)/log2(1.5)*/
{
    return ceilf(log2f(n)/log2f(FACTOR));
}

__vec_push_back

...
if (v->on_heap) {
    if (v->size == capacity)
        v->ptr = realloc(v->ptr, elemsize * CHUNK_SIZE * pow(FACTOR, (++v->capacity)));
    memcpy(&v->ptr[v->size++ * elemsize], e, elemsize);
} else {
    if (v->size == capacity) {
        v->capacity = 1;
        void *tmp =
            malloc(elemsize * (size_t)(CHUNK_SIZE * pow(FACTOR, (v->capacity))));
        memcpy(tmp, v->buf, elemsize * v->size);
        v->ptr = tmp;
        v->on_heap = 1;
        memcpy(&v->ptr[v->size++ * elemsize], e, elemsize);
    } else
        memcpy(&v->buf[v->size++ * elemsize], e, elemsize);
}
...

__vec_reserve

...
if (n > capacity) {
    float new_cap = ilog_factor(n/CHUNK_SIZE);
    new_cap = (new_cap < 1) ? 1 : new_cap;
    if (v->on_heap) {
        v->ptr = realloc(v->ptr,
                         elemsize * (size_t)(CHUNK_SIZE * pow(FACTOR, v->capacity = new_cap)));
    } else {
        void *tmp =
            malloc(elemsize * (size_t)(CHUNK_SIZE * pow(FACTOR, v->capacity = new_cap)));
        memcpy(tmp, v->buf, elemsize * v->size);
        v->ptr = tmp;
        v->on_heap = 1;
    }
}
}
...

v(int, 3, vec_int, 1, 2, 3);
int i = 3;
for (;i < 10;i++) {
    vec_push_back(vec_int, i);
    printf("capacity(vec1)=%zu, %d\n", vec_capacity(vec_int), vec_int.capacity);
}
vec_reserve(vec_int, 15);
printf("capacity(vec1)=%zu, %d\n", vec_capacity(vec_int), vec_int.capacity);

-----------------------
capacity(vec1)=4, 4
capacity(vec1)=6, 1
capacity(vec1)=6, 1
capacity(vec1)=9, 2
capacity(vec1)=9, 2
capacity(vec1)=9, 2
capacity(vec1)=13, 3
capacity(vec1)=20, 4

reallocate address 測試

從下方結果顯示出實際情況並不完全如 FBVector 所假設。
這原因可以從此看得出原因。

void reuse_test(void) {
    v(uint64_t, 3, test, 1, 2, 3);
    printf("%10s    %10s    %10s    %10s\n", "i", "capacity", "realsize", "address");
    printf("%10d    %10d    %10ld    %10p\n", 0, test.capacity, vec_capacity(test), &vec_pos(test, 1));
    for (int i = 1;i <= 20;i++) {
        //vec_reserve(test, vec_capacity(test) * 2);
        vec_reserve(test, (size_t)(CHUNK_SIZE * pow(FACTOR, i)));
        printf("%10d    %10d    %10ld    %10p\n", i, test.capacity, vec_capacity(test), &vec_pos(test, 1));
    }
}

$ make test
./m_vector
         i      capacity      realsize       address
         0             4             4    0x7fff6c164e20
         1             1             6    0x55889e4696b8
         2             2             9    0x55889e4696b8
         3             3            13    0x55889e4696b8
         4             4            20    0x55889e4696b8
         5             5            30    0x55889e4696b8
         6             6            45    0x55889e4696b8
         7             7            68    0x55889e4696b8
         8             8           102    0x55889e4696b8
         9             9           153    0x55889e4696b8
        10            10           230    0x55889e4696b8
        11            11           345    0x55889e4696b8
        12            12           518    0x55889e4696b8
        13            13           778    0x55889e4696b8
        14            14          1167    0x55889e4696b8
        15            15          1751    0x55889e4696b8
        16            16          2627    0x55889e4696b8
        17            17          3941    0x55889e4696b8
        18            18          5911    0x55889e4696b8
        19            19          8867    0x55889e4696b8
        20            20         13301    0x55889e4696b8
        21            21         19951    0x7f2ab4d96018
        22            22         29927    0x7f2ab4d5b018
        23            23         44890    0x7f2ab4d5b018
        24            24         67336    0x7f2ab4cd7018
        25            25        101004    0x7f2ab4cd7018
        26            26        151507    0x7f2ab4baf018
        27            27        227260    0x7f2ab4baf018
        28            28        340890    0x7f2ab4915018
        29            30        767004    0x7f2ab433a018
        30            30        767004    0x7f2ab433a018

-------------------------------------------------------------------------------
$ make self
cc -o m_self self_vector.c -g -lm
$ make test_self
./m_self
         i      capacity      realsize       address
         0             4             4    0x7ffe86ce1c00
         1             3             8    0x560605a6a6b8
         2             4            16    0x560605a6a6b8
         3             5            32    0x560605a6a6b8
         4             6            64    0x560605a6a6b8
         5             7           128    0x560605a6a6b8
         6             8           256    0x560605a6a6b8
         7             9           512    0x560605a6a6b8
         8            10          1024    0x560605a6a6b8
         9            11          2048    0x560605a6a6b8
        10            12          4096    0x560605a6a6b8
        11            13          8192    0x560605a6a6b8
        12            14         16384    0x560605a6a6b8
        13            15         32768    0x7f9dbbd78018
        14            16         65536    0x7f9dbbcf7018
        15            17        131072    0x7f9dbbbf6018
        16            18        262144    0x7f9dbb9f5018
        17            19        524288    0x7f9dbb5f4018
        18            20       1048576    0x7f9dbadf3018
        19            21       2097152    0x7f9db9df2018
        20            22       4194304    0x7f9db7df1018
        21            23       8388608    0x7f9db3df0018
        22            24      16777216    0x7f9dabdef018
        23            25      33554432    0x7f9d9bdee018
        24            26      67108864    0x7f9d7bded018
        25            27     134217728    0x7f9d3bdec018
        26            28     268435456    0x7f9cbbdeb018
        27            29     536870912    0x7f9bbbdea018
        28            30    1073741824    0x7f99bbde9018
        29            31    2147483648    0x7f95bbde8018
        30            32    4294967296    0x7f8dbbde7018

FBVector - autiumatically detect - memory chunk

如果在細部探討 allocator 對於 memory 的操作，可以看到 vector 對於記憶體分配：

r e a l s i z e (c u r r e n t s i z e) = c h u n c k s i z e \times f a c t o r^{⌈ l o g_{f a c t o r}^{c u r r e n t s i z e} ⌉}

從上述式子得知，記憶體是以倍數增長，因此會有多餘的記憶體。

As discussed above, std::vector also needs to (re)allocate in quanta. The next quantum is usually defined in terms of the current size times the infamous growth constant. Because of this setup, std::vector has some slack memory at the end much like an allocated block has some slack memory at the end.

因此 fbvector 正是藉由 jemalloc 用以自動檢測記憶體大小並且 reallocation 的方式避免。

It doesn't take a rocket surgeon to figure out that an allocator- aware std::vector would be a marriage made in heaven: the vector could directly request blocks of "perfect" size from the allocator so there would be virtually no slack in the allocator. Also, the entire growth strategy could be adjusted to work perfectly with allocator's own block growth strategy. That's exactly what fbvector does - it automatically detects the use of jemalloc and adjusts its reallocation strategy accordingly.

以下為 folly 對此操作的程式碼：

`goodMallocSize()`

inline size_t goodMallocSize(size_t minSize) noexcept {
  if (minSize == 0) {
    return 0;
  }

  if (!canNallocx()) {
    // No nallocx - no smarts
    return minSize;
  }

  // nallocx returns 0 if minSize can't succeed, but 0 is not actually
  // a goodMallocSize if you want minSize
  auto rv = nallocx(minSize, 0);
  return rv ? rv : minSize;
}

`jemallocMinInPlaceExpandable`

// We always request "good" sizes for allocation, so jemalloc can
// never grow in place small blocks; they're already occupied to the
// brim.  Blocks larger than or equal to 4096 bytes can in fact be
// expanded in place, and this constant reflects that.
static const size_t jemallocMinInPlaceExpandable = 4096;

大小為 4096 的原因：

Virtually all modern allocators allocate memory in fixed-size quanta that are chosen to minimize management overhead while at the same time offering good coverage at low slack. For example, an allocator may choose blocks of doubling size (32, 64, 128, <t_co>, …) up to 4096, and then blocks of size multiples of a page up until 1MB, and then 512KB increments and so on.

`computePushBackCapacity()` - factor 1.5

當 capacity() 大於 4096 後，以 1.5 倍成長：

return (capacity() * 3 + 1) / 2;

size_type computePushBackCapacity() const {
if (capacity() == 0) {
  return std::max(64 / sizeof(T), size_type(1));
}
if (capacity() < folly::jemallocMinInPlaceExpandable / sizeof(T)) {
  return capacity() * 2;
}
if (capacity() > 4096 * 32 / sizeof(T)) {
  return capacity() * 2;
}
return (capacity() * 3 + 1) / 2;
}

`xallocx()` - shrink

jemalloc 提供的 xallocx :

The xallocx() function resizes the allocation at ptr in place to be at least size bytes, and returns the real size of the allocation. If extra is non-zero, an attempt is made to resize the allocation to be at least (size + extra) bytes, though inability to allocate the extra byte(s) will not by itself result in failure to resize. Behavior is undefined if size is 0, or if (size + extra > SIZE_T_MAX).

emplace_back_aux()

folly::FBVector 的 push_back 實作也是此函式

利用 jemalloc 裡的 xallocx reallocate vector ，
如果失敗了則使用 allocate => memory copy => deallocate cycle 。

size_type byte_sz =
    folly::goodMallocSize(computePushBackCapacity() * sizeof(T));
if (usingStdAllocator && usingJEMalloc() &&
    ((impl_.z_ - impl_.b_) * sizeof(T) >=
     folly::jemallocMinInPlaceExpandable)) {
  // Try to reserve in place.
  // Ask xallocx to allocate in place at least size()+1 and at most sz
  //  space.
  // xallocx will allocate as much as possible within that range, which
  //  is the best possible outcome: if sz space is available, take it all,
  //  otherwise take as much as possible. If nothing is available, then
  //  fail.
  // In this fashion, we never relocate if there is a possibility of
  //  expanding in place, and we never reallocate by less than the desired
  //  amount unless we cannot expand further. Hence we will not reallocate
  //  sub-optimally twice in a row (modulo the blocking memory being freed).
  size_type lower = folly::goodMallocSize(sizeof(T) + size() * sizeof(T));
  size_type upper = byte_sz;
  size_type extra = upper - lower;

  void* p = impl_.b_;
  size_t actual;

  if ((actual = xallocx(p, lower, extra, 0)) >= lower) {
    impl_.z_ = impl_.b_ + actual / sizeof(T);
    M_construct(impl_.e_, std::forward<Args>(args)...);
    ++impl_.e_;
    return;
  }
}
// reallocation failed
...

延伸閱讀

stackoverflow - push_back vs emplace_back

For emplace_back constructor A (int x_arg) will be called. And for push_back A (int x_arg) is called first and move A (A &&rhs) is called afterwards.
Proposed Wording for Placement Insert

限制

`realloc()`

關於 clib-base 的 realloc ，我們並不能如期知道裡面對於記憶體操作的實作機制是 in-place 或 alloc-memcpy-dealloc 。這也是為何先前 address 測試時，不符合預期的原因，

FBVector - in place reallocation

But wait, there's more. Many memory allocators do not support in- place reallocation, although most of them could. This comes from the now notorious design of realloc() to opaquely perform either in-place reallocation or an allocate-memcpy-deallocate cycle. Such lack of control subsequently forced all clib-based allocator designs to avoid in-place reallocation, and that includes C++'s new and std::allocator. This is a major loss of efficiency because an in-place reallocation, being very cheap, may mean a much less aggressive growth strategy. In turn that means less slack memory and faster reallocations.

而關於此解， FBVector 是利用了自己的 allocator jemalloc 中的 xallocx 。

folly::Malloc - `smartRealloc()`

/**
 * This function tries to reallocate a buffer of which only the first
 * currentSize bytes are used. The problem with using realloc is that
 * if currentSize is relatively small _and_ if realloc decides it
 * needs to move the memory chunk to a new buffer, then realloc ends
 * up copying data that is not used. It's generally not a win to try
 * to hook in to realloc() behavior to avoid copies - at least in
 * jemalloc, realloc() almost always ends up doing a copy, because
 * there is little fragmentation / slack space to take advantage of.
 */
FOLLY_MALLOC_CHECKED_MALLOC FOLLY_NOINLINE inline void* smartRealloc(
    void* p,
    const size_t currentSize,
    const size_t currentCapacity,
    const size_t newCapacity) {
  assert(p);
  assert(currentSize <= currentCapacity &&
         currentCapacity < newCapacity);

  auto const slack = currentCapacity - currentSize;
  if (slack * 2 > currentSize) {
    // Too much slack, malloc-copy-free cycle:
    auto const result = checkedMalloc(newCapacity);
    std::memcpy(result, p, currentSize);
    free(p);
    return result;
  }
  // If there's not too much slack, we realloc in hope of coalescing
  return checkedRealloc(p, newCapacity);
}

///////////////////////////////////////////////////////////////////////////////

inline void* checkedRealloc(void* ptr, size_t size) {
  void* p = realloc(ptr, size);
  if (!p) {
    throw_exception<std::bad_alloc>();
  }
  return p;
}

capacity -
$f a c t o r^{c a p a c t i y}$ 計算

此 vector 是以 6 bits 儲存 capacity ，因此其能所表達的數值範圍就會比較小。

power of 1.5 儲存空間小於 power of 2
FBVector 是直接儲存數值，而非儲存指數。因此在計算
${1.5}^{n}$ 倍數時，可以以 (capacity() * 3 + 1) / 2; 方式增長。

ceilf 改進

ceilf 是求數值的 upper bound integer ，而我以用 modff 直接去的浮點數的整數型態來節省成本。

#define FACTOR 1.5
#define CHUNK_SIZE 4
static inline float ilog_factor(float n) /* log1.5(n) = log2(n)/log2(1.5)*/
{
  return ceilf(log2f(n) / log2f(FACTOR));
}

static inline float modff_factor(float n) {
  float integer;
  modff((log2f(n) / log2f(FACTOR)), &integer);
  return integer + 1;
}

function
folly::Malloc - oodMallocSize
folly::Malloc - jemallocMinInPlaceExpandable
folly::Malloc - smartRealloc
folly::FBVector - in place
folly::FBVector - computePushBackCapacity
folly::FBVector - emplace_back_aux
folly::FBVector - shrink_to_fit

log2f
Fast log2(float x) implementation C++
Fastest implementation of log2(int) and log2(float)

3 vector - insert 和 erase

insert

原先是打算使用 stdarg.h 的不定變數來存取資料，但在思考如何從 arg_list 型態的 args 變數利用 va_arg 函式提取資料時發現，我不知道型態是什麼。
起初我是打算利用已知資料大小來設個 char type_[elemsize]; 或是 typedef struct {char dummy; char tmp[elemsize - 1];} type_; 來儲存資料。但兩者型態都不能夠運用在 va_arg 或宣告函式當中。

此為 struct error ，另一個 char 型態忘了紀錄，但我記得是不能用在 va_arg 函式上的錯誤。

vector.c: In function ‘__vec_insert’:
vector.c:121:24: error: variable-sized object may not be initialized
  121 |                 struct type_ e = va_arg(args, struct type_);
      |

後來查了一下，發現網路上對此的相關的說明甚少，目前比較與此問題較相關為此篇 stackoverflow 文章：
stackoverflow - va_list with unknown type names - only the byte size is known!
但仍舊沒有也解決方案，因此最後我改變儲存形式以 array 型態傳入。

在設計 insert 函式時有須注意：

不定變數的傳遞
不定變數的個數對記憶體大小的影響，如 insert 過後是否在 stack 或 heap
記憶體讀取是以 char ( 1 byte ) 為單位乘以 elemsize

#define VAR_ARG_COUNT(v, ...)\
(sizeof((__typeof__(v.buf[0])[]){0, __VA_ARGS__}) /sizeof(v.buf[0]) - 1)       

#define vec_insert(v, pos, ...)\
    __vec_insert(&v, pos, vec_elemsize(v), vec_capacity(v), VAR_ARG_COUNT(v, __VA_ARGS__), &(__typeof__(v.buf[0])[VAR_ARG_COUNT(v, __VA_ARGS__)]){__VA_ARGS__})

NON_NULL void __vec_insert(void *restrict vec,
                                    size_t pos,
                                    size_t elemsize,
                                    size_t capacity,
                                    size_t arg_count,
                                    ...)
{
    union {
        STRUCT_BODY(char);
        struct {
            size_t filler;
            char buf[];
        };
    } *v = vec;  

    if (pos >= 0 && pos <= v->size) {
        va_list args;
        va_start(args, arg_count);
        char *e_list = (char *)va_arg(args, void *restrict);
        va_end(args);
        
        while (v->size + arg_count >= capacity) {
            if (!v->on_heap) {
                v->capacity = 1;
                void *tmp =
                    malloc(elemsize * (size_t)(CHUNK_SIZE * pow(FACTOR, (v->capacity))));
                memcpy(tmp, v->buf, elemsize * v->size);
                v->ptr = tmp;
                v->on_heap = 1;
            }
            else {
                v->ptr = realloc(v->ptr, elemsize * CHUNK_SIZE * pow(FACTOR, (++v->capacity)));
            }
            capacity = (size_t) (CHUNK_SIZE * pow(FACTOR, v->capacity));
        }

        char *ptr = NULL;
        if (v->on_heap) {
            if (v->size + arg_count >= capacity) 
                v->ptr = realloc(v->ptr, elemsize * CHUNK_SIZE * pow(FACTOR, (++v->capacity)));
            ptr = v->ptr;
        }
        else
            ptr = v->buf;
            
        size_t after_pos = pos;
        char *after_pos_ptr = (char *) &ptr[after_pos * elemsize];
        void *after_pos_buf = malloc(elemsize * (size_t)(v->size - pos));
        memcpy(after_pos_buf, after_pos_ptr, elemsize * (size_t)(v->size - pos));

        for (int i = 0;i < arg_count;i++) {
            memcpy(&ptr[after_pos++ * elemsize], &e_list[i * elemsize], elemsize);
        }
        memcpy(&ptr[after_pos * elemsize], after_pos_buf, elemsize * ((size_t)v->size - pos));
        free(after_pos_buf);
        v->size = v->size + arg_count;
    }
}

v(float, 4, test, 1.0, 2.0, 3.0, 4);
vec_insert(test, 3, 5, 6, 7, 8, 9, 10, 11, 12);
display(test);
-------------------------------------------------------------------------------
$ make test
./m_vector
1.00 2.00 3.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 4.00 heap

可以從程式碼得知此操作的成本極高，有多個 memory copy 與 allocate 操作，但目前還沒找到比較好的實作手段。
比較可能實行的方法大概是改變儲存結構，使其為 linked list 的方式。

erase

利用不定變數達到 function overloading 的部份概念（因為是利用 parameter 數量來判斷，沒辦法達成不同型態但同一個數量的 function overloading ）。
當傳入數值為 1 個時， erase 那個位置即可，而若輸入兩個變數則會是範圍 erase 。

#define vec_erase(v, ...)\
    __vec_erase(&v, vec_elemsize(v), vec_capacity(v), VAR_ARG_COUNT(v, __VA_ARGS__), __VA_ARGS__);

NON_NULL void __vec_erase(void *restrict vec,
                                    size_t elemsize,
                                    size_t capacity,
                                    size_t arg_count,
                                    ...)
{
    union {
        STRUCT_BODY(char);
        struct {
            size_t filler;
            char buf[];
        };
    } *v = vec;  

    va_list args;
    va_start(args, arg_count);
    
    char *ptr = NULL;
    if (v->on_heap)
        ptr = v->ptr;
    else
        ptr = v->buf;
    
    if (arg_count > 1) {
        size_t start = va_arg(args, size_t);
        size_t end = va_arg(args, size_t);
        if (start >= 0 && end <= v->size && v->size - (end - start) > 0) {
            memmove(&ptr[elemsize * start], &ptr[elemsize * (end)], elemsize * ((size_t)(v->size - (end - start))));
            v->size = v->size - (end - start);
        }
    }
    else {
        size_t pos = va_arg(args, size_t);
        if (pos >= 0 && pos <= v->size && v->size > 0) {
            memmove(&ptr[elemsize * pos], &ptr[elemsize * (pos + 1)], elemsize * ((size_t)(v->size - pos)));
            v->size--;
        }
    }
    va_end(args);
}

v(float, 4, test, 1.0, 2.0, 3.0, 4);
vec_insert(test, 3, 5, 6, 7, 8, 9, 10, 11, 12);
display(test);
vec_erase(test, 0, 4);
display(test);
vec_erase(test, 7);
display(test);
vec_erase(test, 11);
display(test);
-------------------------------------------------------------------------------
$ make test
./m_vector
1.00 2.00 3.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 4.00 heap
6.00 7.00 8.00 9.00 10.00 11.00 12.00 4.00 heap
6.00 7.00 8.00 9.00 10.00 11.00 12.00 heap
6.00 7.00 8.00 9.00 10.00 11.00 12.00 heap

Reference

測驗 2

1 解釋程式碼

coroutine

struct

struct cr {
    void *label;
    int status;
    void *local; /* private local storage */
};

label

/* Helper macros to generate unique labels */
#define __cr_line3(name, line) _cr_##name##line
#define __cr_line2(name, line) __cr_line3(name, line)
#define __cr_line(name) __cr_line2(name, __LINE__)

struct cr {
    void *label;
    int status;
    void *local; /* private local storage */
};

#define cr_label(o, stat)                                   \
    do {                                                    \
        (o)->status = (stat);                               \
        __cr_line(label) : (o)->label = &&__cr_line(label); \
    } while (0)

action from label

#define cr_begin(o)                     \
    do {                                \
        if ((o)->status == CR_FINISHED) \
            return;                     \
        if ((o)->label)                 \
            goto *(o)->label;           \
    } while (0)
    
#define cr_end(o) cr_label(o, CR_FINISHED)
    
#define cr_wait(o, cond)         \
    do {                         \
        cr_label(o, CR_BLOCKED); \
        if (!(cond))             \
            return;              \
    } while (0)

#define cr_exit(o, stat)   \
    do {                   \
        cr_label(o, stat); \
        return;            \
    } while (0)

system call

#define cr_sys(o, call)                                                     \
    cr_wait(o, (errno = 0) || !(((call) == -1) &&                           \
                                (errno == EAGAIN || errno == EWOULDBLOCK || \
                                 errno == EINPROGRESS || errno == EINTR)))

queue

#define cr_queue(T, size) \
    struct {              \
        T buf[size];      \
        size_t r, w;      \
    }
#define cr_queue_init() \
    {                   \
        .r = 0, .w = 0  \
    }
#define cr_queue_len(q) (sizeof((q)->buf) / sizeof((q)->buf[0]))
#define cr_queue_cap(q) ((q)->w - (q)->r)
#define cr_queue_empty(q) ((q)->w == (q)->r)
#define cr_queue_full(q) (cr_queue_cap(q) == cr_queue_len(q))

#define cr_queue_push(q, el) \
    (!cr_queue_full(q) && ((q)->buf[(q)->w++ % cr_queue_len(q)] = (el), 1))
#define cr_queue_pop(q) \
    (cr_queue_empty(q) ? NULL : &(q)->buf[(q)->r++ % cr_queue_len(q)])

base network function


static int nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

main func

    char *host = argv[1];
    int port = atoi(argv[2]);

    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr =
            {
                .s_addr = inet_addr(host),
            },
        .sin_port = htons(port),
    };
    connect(fd, (struct sockaddr *) &addr, sizeof(struct sockaddr_in));

    select(fd + 1, &fds, NULL, NULL, NULL);

AP function

static void socket_read_loop(struct cr *o, int fd)
{
    static uint8_t b;
    static int r;
    cr_begin(o);
    for (;;) {
        cr_sys(o, r = recv(fd, &b, 1, 0));
        if (r == 0)
            cr_exit(o, 1);
        cr_sys(o, write(STDOUT_FILENO, &b, 1));
    }
    cr_end(o);
}

static void socket_write_loop(struct cr *o, int fd, byte_queue_t *in)
{
    static uint8_t *b;
    cr_begin(o);
    for (;;) {
        cr_wait(o, !cr_queue_empty(in));
	    b = cr_queue_pop(in);
        cr_sys(o, send(fd, b, 1, 0));
    }
    cr_end(o);
}

static void stdin_loop(struct cr *o, byte_queue_t *out)
{
    static uint8_t b;
    static int r;
    cr_begin(o);
    for (;;) {
        cr_sys(o, r = read(STDIN_FILENO, &b, 1));
        if (r == 0) {
            cr_wait(o, cr_queue_empty(out));
            cr_exit(o, 1);
        }
        cr_wait(o, !cr_queue_full(out));
        cr_queue_push(out, b);
    }
    cr_end(o);
}

2 限制與改進

Coroutine in C

原先對於 coroutine 相關操作如 struct cr 結構等，都是需要使用者自己加入至 function 當中。
這很容易會因為太多參數需要顧慮，造成有可能出錯。
因此我以用 EV3 - Coroutines in C 的形式改進。使得在撰寫函式時，不需要顧忌 coroutine 的相關參數等。
而且，在 EV3 - Coroutines in C 裡，他的函式宣告：

#define CORO_DEFINE(name) int coro_##name( co_t *co_p )

無法自設函式參數，而在這之上我利用不定變數做出能夠改進。

#define cr_function(func_name, ...)                                            \
  void cr_##func_name(struct cr *o, __VA_ARGS__)
#define cr_call(func_name, o, ...) cr_##func_name(&(o), __VA_ARGS__)

完整程式碼

GitHub - linD026 - coroutine.h

test coroutine

#include <stdio.h>
#include "coroutine.h"

// void test(struct cr*o, int state, int par1)
cr_function(test, int state, int par1) {
  cr_begin();
  //cr_wait(cond, final)
  cr_wait((state % 2 == 1), cr_restart());
  printf("0 %d %d\n", state, par1);
  //cr_end(final)
  cr_end(cr_restart());
}

cr_function(test1, int state, int par1, int par2) {
  cr_begin();
  cr_wait((state % 2 == 0), cr_restart());
  printf("1 %d %d %d\n", state, par1, par2);
  cr_end(cr_restart());
}

int main(void) {
  struct cr cor1 = cr_init();
  struct cr cor2 = cr_init();
  for (int i = 1;i <= 10;i++) {
    //test(cor1, i, i)
    cr_call(test, cor1, i, i);
    cr_call(test1, cor2, i, i, i);
  }
}

% make test  
./test_coroutine
0 1 1
1 2 2 2
0 3 3
1 4 4 4
0 5 5
1 6 6 6
0 7 7
1 8 8 8
0 9 9
1 10 10 10

Reference

EV3 - Coroutines in C

EV3 子計畫 Coroutines in C - yield 和 restart

#define cr_restart(final)                                                      \
  do {                                                                         \
    final;                                                                     \
    (o)->label = NULL;                                                         \
    (o)->status = CR_BLOCKED;                                                  \
    return;                                                                    \
  } while (0)

#define cr_yield(final)                                                        \
  do {                                                                         \
    cr_label(CR_BLOCKED);                                                      \
    final;                                                                     \
    return;                                                                    \
  } while (0)

Reference

EV3 - Coroutines in C

2021q1 Homework5 (quiz5)

tags: linux2021

0 進度

測驗 1

1 解釋程式碼以及效能分析

解釋程式碼

cleanup

Variadic Macros

At least one argument 問題 - standard C

解決方法

效能分析

初步測試 - stack to heap

CPU 使用率

2 限制與改進

FBVector - reuse previously-allocated memory

factor - 1.5 更改

reallocate address 測試

FBVector - autiumatically detect - memory chunk

goodMallocSize()

jemallocMinInPlaceExpandable

computePushBackCapacity() - factor 1.5

xallocx() - shrink

emplace_back_aux()

限制

realloc()

FBVector - in place reallocation

folly::Malloc - smartRealloc()

capacity - factorcapactiy 計算

ceilf 改進

3 vector - insert 和 erase

insert

erase

測驗 2

1 解釋程式碼

coroutine

system call

queue

base network function

AP function

2 限制與改進

Coroutine in C

test coroutine

EV3 子計畫 Coroutines in C - yield 和 restart

Read more

Linux 核心採納 Rust 的狀況

[Linux 核心 Copy On Write 實作機制](https://hackmd.io/@linD026/Linux-kernel-COW-content)

Linux 核心 Read-Copy Update 筆記整理

[Linux 核心 Copy On Write - Memory Region](https://hackmd.io/@linD026/Linux-kernel-COW-content)

tags: `linux2021`

`goodMallocSize()`

`jemallocMinInPlaceExpandable`

`computePushBackCapacity()` - factor 1.5

`xallocx()` - shrink

`realloc()`

folly::Malloc - `smartRealloc()`

capacity -
$f a c t o r^{c a p a c t i y}$ 計算