
The paper I am presenting today is PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications.
The authors are Zhang Yang, Aiqing Zhang, and Zeyao Mo.
This paper was published in Tsinghua Science and Technology in 2021.

My presentation today is divided into five parts.

First is the introduction to this paper.
This paper focuses on the widespread use of Distributed Shared Memory (DSM) to mitigate the ever-widening processing-memory gap, and on how it interacts with NUMA.
Failing to accommodate the NUMA effect can significantly degrade application performance, especially on today's multi-core platforms with tens to hundreds of cores. However, traditional approaches such as first-touch and memory policies fall short with respect to false page-sharing, fragmentation, or ease of use.
This paper proposes a partitioned shared memory approach that lets multithreaded applications achieve full NUMA-awareness with only minor code changes, together with an accompanying NUMA-aware heap manager that eliminates false page-sharing and minimizes fragmentation.

So what is NUMA?
NUMA (Non-Uniform Memory Access) is a memory architecture designed for multiprocessor computers, in which memory access time depends on the location of the memory relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory.

This section analyzes the pitfalls of OS-based NUMA-awareness approaches. Parallel scientific applications are often built upon the MPI+Thread model and exhibit Bulk Synchronous Parallel (BSP) execution patterns similar to Fig. 1.
The application executes in locksteps, and each lockstep often consists of an inter-thread exchange phase and a thread-local owner-computation phase. The combination of these two phases across multiple locksteps can make the memory access pattern highly irregular, as in multi-block applications and sweeps.

First, since OS-based approaches can only work with fixed-size pages while applications usually work with variable-size blocks, fragmentation can become significant.
Second, to mitigate this fragmentation, heap managers kick in and share pages among threads.
Sharing pages is not a problem within a single CPU, but on a machine with multiple CPUs (NUMA nodes) it causes threads to reach across nodes and access non-local memory.

Part (a) of the figure shows the scheme just described.
The authors propose a "partitioned shared memory" approach as a remedy to the pitfalls of OS-based NUMA-awareness approaches. The approach is a co-design involving the applications, the runtime system, and the underlying OS.
The main idea of Partitioned Shared Memory (PSM) is shown in Fig. 2. The threads on a compute node divide the global view of node memory provided by the operating system into a number of thread-local memory regions (TLMs), and each thread's local memory is bound to the NUMA node on which that thread runs. As shown in Fig. 2b, each thread can allocate memory blocks in any thread's TLM, so shared memory is preserved. However, the allocating thread knows exactly which TLM a block comes from and can exploit this fact to adapt to the NUMA architecture.
But the scheme in (b) is very cumbersome for developers,
so the authors provide an API that lets developers write code in the style of (a) while achieving the effect of (b).
--------
The partitioned shared-memory software stack illustrated in Fig. 2c addresses the remaining pitfalls of OS-based approaches.
First, unlike OS-based approaches, it works at the runtime level and serves memory as variable-size blocks instead of fixed-size pages, thus reducing fragmentation.
Second, since pages can be shared among threads on the same NUMA node but never among threads on different NUMA nodes, false page-sharing is eliminated while fragmentation stays low.
Third, the approach remains easy to use despite achieving full NUMA-awareness in complex real-world applications, since the only changes to application code are the memory allocation calls, where the caller tells the allocator on which thread the block should reside. Lastly, the proposed approach is portable, since it only requires the OS to ensure that a page is bound to the specified NUMA node.

To support the partitioned shared-memory abstraction, eliminate false page-sharing, reduce fragmentation, and mitigate the overhead caused by page allocation and binding,
PsmArena provides two basic memory management APIs.
These APIs are very low-level, similar to the malloc and free APIs in the C standard library. Thus PsmArena can serve memory requests both directly from the application and indirectly through standard library calls (such as C++ containers).
The PsmArena heap manager addresses the following challenges: first, it always returns blocks from the correct locations; second, it eliminates false page-sharing and reduces fragmentation; third, the implementation is thread-safe and scales to tens to hundreds of cores.

TCMalloc is a heap manager optimized for multicore systems that achieves low fragmentation through a carefully designed segregated-storage scheme.
However, TCMalloc is not location-aware and suffers from both false page-sharing and returning remote blocks.
For example, TCMalloc treats the global shared memory uniformly and migrates memory blocks back and forth between per-thread caches and central free-lists; thus, unexpected remote blocks can frequently be introduced.
The authors extend the TCMalloc design as shown in Fig. 3. Instead of treating the heap as a whole, they divide it into several independent NUMA node heaps, each of which manages the memory blocks belonging to one NUMA node in the same way as TCMalloc.
Threads are bound to cores. This design eliminates false page-sharing and reduces fragmentation by leveraging TCMalloc's advanced segregated-storage scheme, and thus achieves NUMA-awareness. It also improves scalability, since except for the location-aware page allocator, which allocates pages on a given NUMA node, all locks are local to a NUMA node heap.

So when the program requests memory, the request first goes to the request dispatcher.
The dispatcher then determines which NUMA node the requesting thread belongs to
and serves the request from the corresponding location.

This is the test environment.
The machine uses the CPU model shown here;
each CPU has 8 cores,
and 32 CPUs are used in total,
for a total of 256 cores.

First, the authors tested with the program shown on the left;
the results are in the tables on the right.
In the top table, the left column is the number of threads, and the three columns on the right show, for each of the different algorithms, the amount of memory placed on the various NUMA nodes.
In the bottom table, the left column is the number of threads, and the remaining columns show the time taken to write the memory.

They then ran other test configurations as well, comparing the running times of the 2D and 3D workloads under the different algorithms.

To address the drawbacks of traditional OS-based NUMA-awareness approaches and to help multithreaded applications,
this paper proposes a partitioned shared-memory approach, which consists of an application-level conceptual abstraction of partitioned shared memory and an accompanying NUMA-aware multithreaded heap manager called PsmArena.
PsmArena manages memory at block granularity, which eliminates false page-sharing and minimizes fragmentation.