並行程式設計: 實作輕量級的 Mutex Lock

貢獻者: RinHizakura, jserv

Futex

在傳統的同步機制之實作中，相關操作往往都需要對作業系統核心中的物件 (object) 進行存取。這意味著無論 task A 是進入或離開 critical section 時，都勢必要進入作業系統核心中檢查競爭是否發生，即便過程中實際上是不存在競爭的。如此一來，這些看似不必要的系統呼叫就導致了巨大的成本。在這樣的問題下，Futex (fast userspace mutex) 被開發者們提出。Futex 可用於實作使用者空間之同步機制，如 mutex、condition variable 等介面都可以藉此實作。

操作 Futex 的方式，是藉由一個使用者空間的 32 位元的變數 (注意在 64 位元平台 Futex 的單元仍是 32 位元，術語為 futex word)，和 Linux 核心中的 wait queue 互動。要同步的任務間會共享該變數：執行緒間的同步可直接指定某個變數位址即可，若涉及跨行程的同步，就要在共享記憶體中建立。

當任務嘗試進入或退出 critical section 時，先以 atomic operation 方式嘗試更動 Futex 的狀態，接著查看 Futex 的狀態是否與欲更動結果一致。若一致，則顯示沒有競爭發生，不需再費時做高代價的系統呼叫;反之表示競爭發生，只有此時才執行 futex(2) 系統呼叫去完成相應處理(等待或者喚醒):作業系統核心根據 futex 要求的操作將任務從 wait queue 中取出喚醒，或者放入 wait queue 中等待。

在〈Linux 核心設計: 不僅是個執行單元的 Process 提及 futex〉，我們藉此實作 mutex lock，可見:

系統呼叫介面

完整的 Futex 一部分仰賴在 userspace 的記憶體空間與相應的 atomic operation，這部份會在後續以測驗題為案例直接說明。而另一部份是當競爭發生時，需要透過 futex(2) 去等待或者喚醒對應的任務。futex 的介面如下:

 long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
                    const struct timespec *timeout,   /* or: uint32_t val2 */
                    uint32_t *uaddr2, uint32_t val3);

由於在 Linux 上 futex 沒有直接的 C 語言函式介面，因此需藉由 syscall 來進行。而參數包含以下:

uaddr: 即 futex word
futex_op: Futex 操作的對應編碼
val, timeout, uaddr2, val3 與 futex_op 有關，根據 futex_op 的差異決定變數是否使用，並可能會有不同意義

futex_op 有數種，其分成 operation 和 option 二個部分，可通過查看各自對應的 bits 得知。operation 表示具體的 futex 操作，而 option 則是可以對該操作進行細部的行為調整。在 futex(2) 文件對此更詳盡的解釋。

為了充分支援 pthread_cond 的實作，特別是廣播 (broadcast) 操作，futex(2) 引入 requeue 功能，並藉由 futex 系統呼叫提供相關操作介面，其中包括一對相互匹配的操作 futex_wait_requeue_pi 和 futex_requeue。

在多個臨界區段 (critical section，簡稱 CS) 之間，使用 mutex 保證互斥性，但無法滿足不同臨界區段之間的前後順序。因此，我們可使條件變數（condition variable，簡稱 condvar 或 cond）在一個臨界區段 A 中等待另一臨界區段 B 的訊號 (signal)。

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

condvar 不僅是互斥的同步機制，還要一種方式等待其他執行緒進行某些操作。當我們需要在臨界區段 A 中等待臨界區段 B 先完成某些操作時，可用 condvar。

一個 condvar 必須與一個 mutex 配對使用，因此等待 condvar 的過程實際上涉及二次等待：

等待 condvar 的訊號（通常由其他執行緒的發出)
等待所屬的 mutex lock 的釋放

這也意味著等待的執行緒需要兩次等待 (wait) 和兩次喚醒 (wake)。

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

發送 condvar 訊號（signal）會喚醒等待在 condvar 上的一個執行緒，該執行緒接著將等待轉移至所屬的 mutex lock。使用 broadcast 的方法則是喚醒所有等待在 condvar 上的執行緒。

為了避免浪費喚醒操作所導致的執行緒切換，可將等待在二個佇列之間的執行緒縮減為一次呼叫，以減少競爭 mutex lock 的次數。這就是 futex 提供的 requeue 功能。

被 requeue 的等待者（即呼叫 futex_wait_requeue_pi 阻塞的執行緒），必須等待所屬的 condvar 的 mutex lock 的釋放。當被呼叫的執行緒從 futex_wait_requeue_pi 呼叫中被喚醒，並不意味著它已持有 (acquire) 所屬的 condvar 的 mutex lock，它需要去競爭該 lock。

requeue 在此過程中將 condvar 和 mutex 鎖視為 futex1 和 futex2。

以下提供 3 個 futex 操作的封裝。

`futex_wait`

/* Atomically check if '*futex == value', and if so, go to sleep */
static inline void futex_wait(atomic int *futex, int value)
{
    syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, value, NULL);
}

FUTEX_WAIT_PRIVATE 的 operation 部分即 FUTEX_WAIT，表示呼叫者會被作業系統核心加入到 wait queue 中等待，直到 futex 的值變得與給定的 value 不同。FUTEX_PRIVATE 則是 option 部分，表示 futex 僅由執行緒之間共享，不可用來同步多個 process。

注意到當使用 FUTEX_WAIT，在作業系統核心將該執行緒放進 wait queue 以 suspend 之以前，會先檢查 futex word 是否等於給定的 val，若不相同則該系統呼叫會直接返回 EWOULDBLOCK/EAGAIN。

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

出處: Futexes Are Tricky

如上圖場景，Thread₃ 在讀取 futex word 後被打斷，期間 Thread₁ 已更新 futex word 的值，則等到 thread₃ 試圖用原本的讀到的值呼叫 FUTEX_WAIT 時，這會返回並伴隨著 error EWOULDBLOCK。

`futex_wake`

/* Wake up 'limit' threads currently waiting on 'futex' */
static inline void futex_wake(atomic int *futex, int limit)
{
    syscall(SYS_futex, futex, FUTEX_WAKE_PRIVATE, limit);
}

FUTEX_WAKE_PRIVATE 可分為 FUTEX_WAKE 和 FUTEX_PRIVATE 兩部分。FUTEX_PRIVATE 前面已經說明過就不多贅述，而 FUTEX_WAKE 會喚醒至多 limit 個的等待者。

`futex_requeue`

/* Wake up 'limit' waiters, and re-queue the rest onto a different futex */
static inline void futex_requeue(atomic int *futex,
                                 int limit,
                                 atomic int *other)
{
    syscall(SYS_futex, futex, FUTEX_REQUEUE_PRIVATE, limit, INT_MAX, other);
}

FUTEX_REQUEUE_PRIVATE 對應到 FUTEX_REQUEUE operation。該操作會喚醒最多 limit 個等待者，但和 FUTEX_WAKE 不同的地方是，若 futex 中所有的等待者超過 limit 個，則剩下的等待者皆會被喚醒並改在 other 這個 futex word 對應的 wait queue 中等待，函式中的 INT_MAX 表示可以從 futex 移到 other 的最大數量。

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

FUTEX_REQUEUE 可說是 FUTEX_WAKE 的 superset。FUTEX_REQUEUE 是為了解決 FUTEX_WAKE 可能引起的Thundering herd problem (驚群效應)。考慮以下多個執行緒同時運行以下邏輯的場景:

lock(A)
while (!check_value(V)) {
  unlock(A);
  block_on(B);
  lock(A);
};
unlock(A);

若 futex 是經由 FUTEX_WAKE 喚醒，則所有在等待 B 的執行緒都會被喚醒，接著他們會一起嘗試去獲得 A，但是我們知道最後只能有其中一個執行緒能夠獲得 A，於是其他執行緒好不容易醒來，卻要繼續等 A，這就意味著二次 futex 的呼叫。在此之上，若執行緒之間是跨處理器的，還要加上 cache coherence 的成本。而 FUTEX_REQUEUE 應對的方式就是直接在一次的 futex 呼叫中直接讓醒來後也得不到 A 的執行緒直接去等 A 就好。

上述的場景其實就是 mutex + condition variable 很經典的使用模式，後續我們可在這些 lock 的實作中看到更為具體的 futex 使用方式。

實作 Mutex lock

程式碼

mutex_trylock

static bool mutex_trylock(mutex_t *mutex)
{
    int state = load(&mutex->state, relaxed);
    if (state & MUTEX_LOCKED)
        return false;

    state = fetch_or(&mutex->state, MUTEX_LOCKED, relaxed);
    if (state & MUTEX_LOCKED)
        return false;

    thread_fence(&mutex->state, acquire);
    return true;
}

mutex_trylock 嘗試持有 mutex，若成功則返回 true，反之返回 false。

程式碼先用 atomic_load 去確認 mutex 是否為 lock 狀態，若已被鎖住則直接返回 false 即可。否則的話，嘗試去對 mutex 進行 atomic_fetch_or_explicit。這會將 state or 上 bitflag MUTEX_LOCKED，並且返回該操作之前的 state 值。則我們可藉此確認進行上鎖時，是否已被其他執行緒鎖住，若是，則此執行緒的上鎖操作失敗。反之，只有確保整個 atomic operation 是從沒有 MUTEX_LOCKED bitflag 到有的狀態，才算是成功。

spin_lock() 要解決 cache line bouncing 問題。

static inline void mutex_lock(mutex_t *mutex)
{
#define MUTEX_SPINS 128
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex_trylock(mutex))
            return;
        spin_hint();
    }

    int state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);

    while (state & MUTEX_LOCKED) {
        futex_wait(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING);
        state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    }

    thread_fence(&mutex->state, acquire);
}

mutex_lock 會嘗試去持有 mutex。

最開始的 for 迴圈先試著在開始的一小段時間中，用不斷 mutex_trylock 的方式去持有 lock。在 lock 大部分狀況下其實都不需要和別人競爭的場景，或是持有 lock 的人通常使用一陣子就會返還的情形下，這作法就不必透過系統呼叫進到 Linux 核心。

若 for 迴圈這一段時間都得不到 lock，改成藉 atomic_exchange_explicit 將 state 更改成 MUTEX_LOCKED | MUTEX_SLEEPING 的方式來持有。其中額外的 MUTEX_SLEEPING 表示執行緒不僅要試著 lock mutex，且若它未能即時得到，會用 futex_wait 把自己加入到 wait queue 中。因此日後要 unlock 時就需要辨別該 flag 以 futex_wake 才能正確喚醒之。

附帶一提，spin_hint 的作用是什麼呢? 查看程式碼的定義如下:

#if defined(__i386__) || defined(__x86_64__)
#define spin_hint() __builtin_ia32_pause()
#elif defined(__aarch64__)
#define spin_hint() __asm__ __volatile__("isb\n")
#else
#define spin_hint() ((void) 0)
#endif

在同步機制中，經常需要以 spin-wait loop 方式檢測 lock 的狀態，以嘗試持有 lock。但這些頻繁的檢測將導致 pipeline 上都是對 lock 狀態的讀操作，那麼當另一個執行緒對 lock 進行寫操作時，因為相依關係 pipeline 就必須重排，這導致性能的下降和更多的耗電。

現代處理器事實上相當重視此問題，以至於指令集架構中甚至提出相關的指令。以 x86 為例，具備 PAUSE 指令，在 gcc 中則可呼叫內建函式 __builtin_ia32_pause。

mutex_unlock

static inline void mutex_unlock(mutex_t *mutex)
{
    int state = exchange(&mutex->state, 0, release);
    if (state & MUTEX_SLEEPING)
        futex_wake(&mutex->state, 1);  // FFFF
}

綜前所述，mutex_unlock 就不難理解，即重置 state 並在必要時呼叫 futex_wake。若 lock 狀態是上鎖但是沒有等待者(!(state & MUTEX_SLEEPING))，就可跳過 futex_wake 來節省時間。

實作 Condition variable

cond_wait

static inline void cond_wait(cond_t *cond, mutex_t *mutex)
{
    int seq = load(&cond->seq, relaxed);

    mutex_unlock(mutex);

#define COND_SPINS 128
    for (int i = 0; i < COND_SPINS; ++i) {
        if (load(&cond->seq, relaxed) != seq) {
            mutex_lock(mutex);
            return;
        }
        spin_hint();
    }

    futex_wait(&cond->seq, seq);

    mutex_lock(mutex);

    fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed);
}

cond_wait 會暫停目前的任務直到被通知。其實作會監看 seq 的改變，實際上這在 cond_signal 或者 cond_broadcast 時可能發生。這邊我們也可見到用上 mutex 實作的類似技巧，來提昇避免 futex 系統呼叫的可能性。若嘗試 COND_SPINS 次後仍然失敗，則退而透過 futex_wait 等待，並在等待到 condition variable 後 acquire lock。

注意最後的地方額外為 state 添加 MUTEX_SLEEPING flag。為何？這其實與 futex_requeue 的行為有關。我們知道 futex_requeue 會將執行緒從等待 cond->seq 的 wait queue 放到等待 mutex->state 的 wait queue，然而這表示日後要喚醒等待 mutex->state 的執行緒時，就勢必需要呼叫 futex_wake 才能成功喚醒。為了確保之後做 mutex_unlock 都能呼叫到 futex_wake，才必須額外加上 MUTEX_SLEEPING 這個 flag。

測試和驗證

main.c 模擬的同步場景：給定 16 個工作節點

n_{0}

n_{15}

，他們彼此間具備從屬關係的:

n_{k}

的親代節點是

n_{k - 1}

。每個節點需要都等待親代節點就緒 (ready) 方可進行下一階段的工作 (

n_{0}

沒有親代節點，因此無此限制)。

另一方面，這 16 個節點共享一個 clock，clock 從 tick 1 開始累計，節點在每個 tick 都有必須完成的事：在偶數 tick 時想像成是完成階段任務，可改變自身狀態為就緒並通知其子節點繼續後續任務，而在奇數 tick 時則推動時間讓 tick += 1。

struct clock {
    mutex_t mutex;
    cond_t cond;
    int ticks;
};

/* A node in a computation graph */
struct node {
    struct clock *clock;
    struct node *parent;
    mutex_t mutex;
    cond_t cond;
    bool ready;
};

在設定的場景下，可知我們需要同步若干變數:

clock 中的 ticks: 其被所有的節點共享。因為每個執行緒都先根據 tick 的進展決定是否繼續下一步驟，因此需要 condition variable + mutex 機制去通知此事件，且通知應該是用 broadcast 方式讓所有執行緒得知
node 中的 ready: 節點
$n_{k}$ 要確認
$n_{k - 1}$ 是 ready 才能進行該 tick 的工作，反過來說是親代節點就緒時，可主動通知這件事。此時同樣需要 condition variable + mutex 機制，但以 signal 方式通知子節點就好

main

int main(void)
{
    struct clock clock;
    clock_init(&clock);

#define N_NODES 16
    struct node nodes[N_NODES];
    node_init(&clock, NULL, &nodes[0]);
    for (int i = 1; i < N_NODES; ++i)
        node_init(&clock, &nodes[i - 1], &nodes[i]);

    pthread_t threads[N_NODES];
    for (int i = 0; i < N_NODES; ++i) {
        if (pthread_create(&threads[i], NULL, thread_func, &nodes[i]) != 0)
            return EXIT_FAILURE;
    }

    clock_tick(&clock);
    clock_wait(&clock, 1 << N_NODES);
    clock_stop(&clock);

    for (int i = 0; i < N_NODES; ++i) {
        if (pthread_join(threads[i], NULL) != 0)
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

回到測試程式的主體，node_init 將節點間的從屬關係建立，然後就為各個節點建立執行緒進行上述場景的工作，詳細可以查看 thread_func 下的實作。

main 執行緒呼叫第一個 clock_tick 來讓 tick 變為 1，這樣其他執行緒就可以開始根據 tick 逐步進行。而這裡的 clock_wait 會一直等待 tick 到 1 << N_NODES 之後再用 clock_stop 來讓節點的執行緒得以結束。

實作 Priority-inheritance mutex

Priority inversion

前面已然探討數種 futex operation，並展示如何據此實作各種同步機制。然而我們實際所面對的並行程式場景其實遠比想像複雜。例如若僅憑前面提到的 futex operation，我們可能面臨 Priority inversion 問題。

Priority inversion 是指具高優先級的任務在 blocking wait 一個目前被低優先級任務持有的 lock 時，由於一個中度優先級的任務此時不斷搶占低優先級任務的 CPU，結果導致低優先級任務沒辦法即時釋放 lock，高優先級就間接被低優先級的任務牽制住。

而 Priority inheritance 以作為解決 Priority inversion 的一種方式。簡單來說，當高優先級任務被低優先級任務持有的鎖 blocking 時，此時可以將低優先級任務的優先級暫時提升到跟該高優先級任務一樣的程度，這樣任何中間級別的任務就沒辦法間接限制住高優先級的任務。

考量效率，Priority inheritance 必須是 transitive 的。這意味著若一個高優先級任務被一個較低優先級任務持有的鎖 block 住，而低優先級任務本身又被另一個中等優先級任務持有的 lock block 住，以此類推形成一整個 chain。則 chain 中的所有任務之優先級都會提高到與高優先級的任務相同。

Priority-inheritance Futex (PI-futex)

對 user 來說，若欲使用 futex 來解決上述問題，與前面介紹的 futex operation 在 futex word 的數值選擇上具有較大的彈性不同，PI-futex 必須遵守一些協議來和作業系統核心正確的互動:

若 lock 不被任何任務擁有，其值為 0
若 lock 被某個任務擁有，lock 值為該任務的 thread id (TID, gettid(2))
若 lock 被持有且有其他任務在 wait queue 中在等待之，lock 的值會是 FUTEX_WAITERS | TID (FUTEX_WAITERS 是一個特殊 bitflag)

當 futex 被某任務持有且 wait queue 中有其他 waiter 時，futex 在作業系統核心中就會關聯到一個 RT-mutex 來實作，後者是作業系統核心達成 Priority inheritance 的關鍵設施。PI-futex 就是以 RT-mutex 為基礎成立的。

建立 Priority inversion 的場景

解決問題前，我們需要重現 Priority inversion 中提到的場景即可。為此，我們建立三個優先級各異的執行緒:

static void *thread_low(void *arg)
{
    /* The low priority thread takes the lock shared with high priority
     * thread until it is finish. */
    struct ctx *ctx = (struct ctx *) arg;

    mutex_lock(&ctx->m0);
    sleep(1);
    mutex_unlock(&ctx->m0);

    return NULL;
}

其中 thread_low 是其中最低優先級的，他會試圖 lock mutex m0，然後等到 sleep 1 秒之後再試圖進行釋放 lock 的行為。

static atomic bool thread_stop = false;
static void *thread_mid(void *arg)
{
    /* The middle priority thread is very busy, so it may
     * block the high priority thread if priority inversion. */
    struct ctx *ctx = (struct ctx *) arg;

    while (!load(&thread_stop, relaxed))
        ;

    return NULL;
}

而 thread_mid 是中間優先級的任務，它的用途是「裝忙」。一旦啟動後它會佔據 CPU，直到 thread_stop 被 main 執行緒為 true 之後才結束。

static void *thread_high(void *arg)
{
    struct ctx *ctx = (struct ctx *) arg;

    mutex_lock(&ctx->m0);
    /* If 'h' is printed, it means the high priority
     * thread obtain the lock because the low priority thread
     * releases it by priority inherit. */
    if (load(&thread_stop, relaxed) == false)
        printf("h");
    mutex_unlock(&ctx->m0);

    return NULL;
}

thread_high 則是最高優先級的任務，它和 thread_low 需要相同的 mutex m0。這裡的 if 判斷式是判斷 thread_high 拿到 lock 時是否是我們讓 thread_middle 結束的時候。若是表示發生 priority inversion，因為若單純考慮優先級，thread_high 理應在此之前就完成了。否則的話單字 'h' 應該會被顯示出來。

    ...
    /* Create the low priority thread, it comes first so the low priority thread
     * has more chance to get lock in prior. */
    TRY(pthread_create_with_prio(&low_t, &attr, thread_low, &ctx, 10));
    /* Create the middle priority thread */
    TRY(pthread_create_with_prio(&mid_t, &attr, thread_mid, &ctx, 20));
    /* Create the high priority thread */
    TRY(pthread_create_with_prio(&high_t, &attr, thread_high, &ctx, 30));

    sleep(2);
    store(&thread_stop, true, relaxed);
    ...

在 main 執行緒，我們按照 low

\to

middle

\to

high 的順序建立三個執行緒。在建立後，等待兩秒再去設置 thread_stop 讓 middle 執行緒有機會釋放 CPU 資源。且假設:

執行緒從 pthread_create 被呼叫到建立完成的時間是可以忽略的很短時間
對於 low/high 執行緒，一旦執行緒被建立，在極短的時間間隔內就會試圖做 mutex_lock

那麼可以簡單分析時序如下:

T0:
- low 先被建立並持有 m0，接著做 sleep，預計在 T1 時醒來
- middle 接著被建立，獲得 CPU 並在其 while loop 中「裝忙」
- high 最後建立，因為優先級原因搶佔 middle，但因為嘗試獲取 m0 失敗，無法繼續向下執行，讓出 CPU 給 middle
T1:
- low 睡醒，但由於 middle 的優先級高於 low，無法持有 CPU 而不能釋放 m0
- middle 獲得 CPU 資源，high 偶爾會搶佔，但因為始終無法持有 m0 而不能有任何進展
T2: main 執行緒此時設定 thread_stop
- middle 得以離開 while，釋放 CPU 資源
- 因為 middle 結束了，low 終於有機會可以進展以釋放 m0，然後 high 也終於得以繼續執行

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

此處時序分析不嚴謹，因為假設中的「時間很短可忽略」實際上並不是真的能完全忽略，且也簡化作業系統核心的排程方式，主要是方便解釋 priority inversion 的現象

在上面的時序中我們可以看到，其實預期有較高優先權的 high 應該要在 T0 就可以立刻做完了，但卻因為 priority inversion，high 必須等到 T2 時設置 thread_stop 讓 middle 結束，才能夠真正進行。

完整的測試程式碼可見 pi-test/main.c。若我們用原本測驗題提供的 mutex 實作來執行這個測試環境，可以發現確實沒辦法得到字母 h 被印出的結果 (注意可能有少數例外)。反之，我們可用已提供 PI-mutex 的 glibc 實作 (即 pthread 系列 API) 來驗證成功避免 priority inversion 的狀況為何，在這種方式下，預期可看到 'h' 成功輸出終端機畫面。

最後注意執行測試場景的細節:

必須透過 sudo 執行，因為牽涉對執行緒優先權的調整，需要 root 權限
必須使用 taskset 將程式鎖定在單一 CPU 上，否則多 CPU 資源的情況下，就沒辦法讓 middle 佔據唯一的 CPU。

基於 futex 實作 PI-mutex

在 PI-futex 一節，我們已針對 PI-futex 的使用規範進行了簡單的介紹。而 futex(2) 中對於如何基於 futex 以實作 PI-mutex 其實有更充足的說明，我們可以按照手冊上的敘述來進行。

`mutex_lock_pi`

參照手冊的敘述:

This operation is used after an attempt to acquire the lock via an atomic user-mode instruction failed because the futex word has a nonzero value—specifically, because it contained the (PID-namespace-specific) TID of the lock owner.

static bool mutex_trylock_pi(mutex_t *mutex)
{
    /* TODO: We have FUTEX_TRYLOCK_PI which enable for special
     * trylock in kernel, but it should be fine to just try at
     * userspace now. */
    pid_t zero = 0;
    pid_t tid = gettid();

    /* Try to obtain the lock if it is not contended */
    if (cmpxchg(&mutex->state, &zero, tid))
        return true;

    thread_fence(&mutex->state, acquire);
    return false;
}

static inline void mutex_lock_pi(mutex_t *mutex)
{
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex->trylock(mutex))
            return;
        spin_hint();
    }

    /* Since timeout is set as NULL, so we block until the lock is obtain. */
    futex_lock_pi(&mutex->state, NULL);

    thread_fence(&mutex->state, acquire);
}

因此，我們在 trylock 裡實作在使用者空間透過 atomic operation 來持有 lock 的行為。成功持有 lock 的具體過程是嘗試完成「確認原本的 futex word 是 0」及「寫入 tid」。而 lock 會重複數次的 trylock，若始終無法持有 lock，再藉由 futex_lock_pi 改進入作業系統核心來等待，這也允許讓作業系統核心進行 priority-inheritance。

後續改進空間:

futex_lock_pi 第二個參數的 NULL 對應到 system call 的 timeout，NULL 表示不停等待直到持有 lock 為止，提供 timeout 以適時的回到使用者空間是否可能獲得更多好處?
針對 Priority-inheritance，futex 系統呼叫額外提供 FUTEX_TRYLOCK_PI
- 這是因為雖然嘗試持有 lock 可以透過 atomic operation 在使用者空間進行就好，但作業系統核心中存在額外的資訊，比如 FUTEX_WAITERS 和 FUTEX_OWNER_DIED 的狀態，可以提供 race-free 的方式來持有 lock

`mutex_unlock_pi`

參照手冊的敘述:

This is called when the user-space value at uaddr cannot be changed atomically from a TID (of the owner) to 0.

static inline void mutex_unlock_pi(mutex_t *mutex)
{
    pid_t tid = gettid();

    if (cmpxchg(&mutex->state, &tid, 0))
        return;

    futex_unlock_pi(&mutex->state);
}

根據說明，我們先嘗試在使用者空間持有 lock，若失敗再經由 futex_unlock_pi 從核心去釋放 lock。

將這些 API 組合後，再此執行上述 priority inversion 的場景，若實作成功，預期可見字母 'h' 的輸出! 詳細實作可見 mutex.h。

效能測試和解讀

skinny-mutex 專案提供 perf.c 檔案來測試該專案本身的實作與 POSIX Threads 實作的區別。先不探討該專案的目的與實作細節，這邊我們也可學習該測量的手法並且應用到本測驗題的 futex-based lock 中，藉此評測正確性與相關效能。

`contention` 測試手法

完整的測試程式碼可見 perf-test/main.c。以下簡述這裡所建立的情境: 我們建立

n

個執行緒和

n + 1

個 mutex。想像這

n + 1

個 mutex 是一個環狀結構，即第

i

個 mutex 的下一個 mutex 是第

(i + 1)

個。而每個執行緒最初持有一個 mutex，然後會嘗試去取下一個，當下一個成功取得，則釋放原本持有的 mutex，這樣完成一個循環後，繼續嘗試持有再下一個，如此往復一定次數。

這樣測試的效果是: 每個時刻最多只會有一個執行緒能夠持有二個 mutex，進而進行下個循環，且該執行緒也是該時刻僅有的一個。而等到該執行緒釋放其中一個 mutex 之後，才能再有另一個執行緒能夠有新進展。藉此可達到執行緒之間彼此競爭 mutex 之場景。

正確性部份，結合 ThreadSanitizer 執行該測試程式碼，搭配測驗題提供的 mutex，初步沒有檢驗到有 race condition 的錯誤。

效能比較

藉由上述測試程式，可比較這

n + 1

個 mutex 是使用 glibc 的實作，或者本測驗題的實作能提供更好的效能。實驗方式為重複 100 次上述的測試手法，並記錄每一次完成所需要的時間。為了方便檢視，將這 100 次的時間從小到大排序，以 scatter plot 方式展示。以下為初步的測試結果。

藍點的 "futex" 即本實作，可看到其大部分表現優於 glibc 的實作，表示在 lock 的競爭上可能提供成本更小的持有和釋放方式。雖然最差情況也可能與 pthread 實作接近(見最右側藍點)。某種程度上，考慮以下原因，這可能是可預期的結果:

glibc 的 mutex 需考量普遍性和應用場景的多樣性，因此行為上勢必更加複雜
pthread_mutex_t 結構為了普遍性需更多位元來儲存相關資訊，可能對 cache 不友善
本實作較為簡單，在複雜的同步狀況下是否仍舊能保持正確/維持高效率

TODO: 引入 MutexShootout

並行程式設計: 實作輕量級的 Mutex Lock

Futex

系統呼叫介面

futex_wait

futex_wake

futex_requeue

實作 Mutex lock

實作 Condition variable

測試和驗證

實作 Priority-inheritance mutex

Priority inversion

Priority-inheritance Futex (PI-futex)

建立 Priority inversion 的場景

基於 futex 實作 PI-mutex

mutex_lock_pi

mutex_unlock_pi

效能測試和解讀

contention 測試手法

效能比較

參考資料

Read more

單一指令處理器 (OISC)

從 CPU cache coherence 談 Linux spinlock 可擴展能力議題

淺談 Microkernel 設計和真實世界中的應用

並行程式設計: 概念

`futex_wait`

`futex_wake`

`futex_requeue`

`mutex_lock_pi`

`mutex_unlock_pi`

`contention` 測試手法