
# shared directory

The [**shared** directory](https://github.com/google/ghost-userspace/tree/main/shared) implements a library for inter-process communication (IPC). In ghost-userspace, the agent threads run isolated inside a single process address space. To communicate with the agents efficiently, this directory provides classes that implement IPC over shared memory. The directory is roughly divided into three components.

||
|-|
| shmem.(cc\|h) |
| prio_table.(cc\|h) |
| fd_server.(cc\|h) |

These components are bundled and built as a single library named **shared** ([build rule](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/BUILD#L923-L944)).

# Shared memory layout

ghOSt performs IPC through shared memory. Each shared memory region has the following layout.

![](https://hackmd.io/_uploads/BJ8c1Qrfa.jpg)

Every shared memory region reserves an area whose size is ++a multiple of the huge page size++ (that is, a multiple of 2 MB). The start of the area is reserved for a header that stores metadata. The size of the reserved header area is [`kHeaderReservedBytes(=4096)`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/shmem.h#L94). [`InternalHeader`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/shmem.cc#L49-L61), which sits at the very beginning of the region, is defined as follows. The figure above marks the part that corresponds to each member.

```cpp
// This
// currently occupies the first page of every mapping (from offset zero).
struct GhostShmem::InternalHeader {
  int64_t header_version;
  size_t mapping_size;
  size_t header_size;
  size_t client_size;
  std::atomic<bool> ready, finished;
  pid_t owning_pid;
  int64_t client_version;
};
```

# shmem.h

shmem.h implements the [`GhostShmem`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/shmem.h#L33-L95) class. This class is ++an abstraction of a shared memory region++: it can create a region hosted by the current process, or connect to a region hosted by another process.

`GhostShmem` is defined as follows.

```cpp
class GhostShmem {
 public:
  GhostShmem() {}

  // The current process creates and hosts the shared memory.
  // The region is named "/memfd:ghost-shmem-{name}".
  // name must be a string that is unique on the host side.
  GhostShmem(int64_t client_version, const char* name, size_t size);
  ~GhostShmem();

  // Connects to the shared memory hosted by the process given by pid.
  // The region named "/memfd:ghost-shmem-{name}" is attached.
  bool Attach(int64_t client_version, const char* name, pid_t pid);

  // Called by clients when they are ready for remote connections to proceed.
  // REQUIRES: Must be called.
  void MarkReady();

  // A raw byte mapping into the hosted shared memory region.
  inline char* bytes() { return static_cast<char*>(data_); }

  // This is the client usable bytes addressable via bytes(). It will be at
  // least as large as requested at time of construction.
  size_t size();

  // This includes internal overheads and roundings on the mapping.
  size_t absolute_size() const { return map_size_; }
  inline const void* absolute_start() const { return shmem_; }

  // The process that owns the shmem region.
  pid_t Owner() const;

  // Internal overheads that clients may optimize passed mapping sizes against.
  // This is useful as it represents the padding that should be considered if
  // trying to optimally pack against the huge-page backing.
  static size_t OverHeadbytes() { return kHeaderReservedBytes; }

  GhostShmem(const GhostShmem&) = delete;
  GhostShmem(GhostShmem&&) = delete;

  // Static member function that creates a shared memory region named
  // "/memfd:ghost-shmem-blob-{n}". n is incremented on every call, so the
  // names never collide.
  static GhostShmem* GetShmemBlob(size_t size);

 private:
  struct InternalHeader;

  void WaitForReady();

  static int memfd_create(const char* name, unsigned int flags) {
    return syscall(__NR_memfd_create, name, flags);
  }

  void CreateShmem(int64_t client_version, const char* suffix, size_t size);
  bool ConnectShmem(int64_t client_version, const char* suffix, pid_t pid);

  void* shmem_ = nullptr;           // Pointer to the shared memory
  size_t map_size_;                 // Size of the mapping (rounded up as described above)
  int memfd_ = -1;                  // File descriptor of the shared memory
  InternalHeader* hdr_ = nullptr;   // Pointer to the header
  void* data_;                      // Pointer to the data area

  static int OpenGhostShmemFd(const char* suffix, pid_t pid);

  static constexpr int kHeaderReservedBytes = 4096;  // PAGE_SIZE
};
```

Each shared memory region must have a name that is unique within the process (perhaps because the names would otherwise collide in procfs?). The static member function [`GetShmemBlob`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/shmem.cc#L212-L222) avoids such name collisions. Several constructors are defined; the one that takes parameters also creates a shared memory region hosted by the current process.

:::success
You will probably rarely use this class directly. Use the PrioTable class defined in prio_table.h instead.
:::

# prio_table.h

prio_table.h implements the [`PrioTable`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L75-L127) class. This class uses `GhostShmem` and places the following data in the data area of the shared memory.

![F5DE32AA-49D2-4EA3-87C9-22BF65BA6C08](https://hackmd.io/_uploads/rypj_PmY6.jpg)

## ghost_shmem_hdr

```cpp
struct ghost_shmem_hdr {
  uint16_t version;
  uint16_t hdrlen;  // Size of this header (== sizeof(struct ghost_shmem_hdr))
  uint32_t maplen;  // Size of the area used by PrioTable (<= GhostShmem::size())
  uint32_t si_num;  // Number of elements in the sched_item array
  uint32_t si_off;  // Offset to the sched_item array
  uint32_t wc_num;  // Number of elements in the work_class array
  uint32_t wc_off;  // Offset to the work_class array
  uint32_t st_cap;  // Capacity of the stream
  // (st_cap is one of the StreamCapacity values.)
  uint32_t st_off;  // Offset to the stream
} ABSL_CACHELINE_ALIGNED;
```

This header summarizes the layout of the data area of the shared memory.

## sched_item

```cpp
struct sched_item {
  uint32_t sid;       // Number that identifies this sched_item within the PrioTable
  uint32_t wcid;      // Number of the work_class this sched_item belongs to
  uint64_t gpid;      // Thread identifier (gtid)
  uint32_t flags;     /* schedulable attributes */
  seqcount_t seqcount;
  uint64_t deadline;  // Deadline (ns)
} ABSL_CACHELINE_ALIGNED;
```

One `sched_item` is prepared for each thread that is subject to scheduling.

|||
|-|-|
| sid | Index of this element in the PrioTable array |
| wcid | Index of this thread's WC (Work Class) |
| gpid | gtid of this thread |
| flags | Described below |
| seqcount | Used to synchronize accesses to the members of sched_item |
| deadline | Deadline of this task |

### `flags`

Currently the only flag set in this field is [`SCHED_ITEM_RUNNABLE`](https://github.com/google/ghost-userspace/blob/main/shared/prio_table.h#L59). ++Setting this flag makes the task eligible for scheduling++. Conversely, ++clearing this flag keeps the task from being scheduled++.

### [`seqcount`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L19-L30)

The `sched_item` struct has a member of type [`seqcount_t`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L31). This member is used to read and write the fields of a `sched_item` consistently. `seqcount_t` is an alias of the `seqcount` struct, which is implemented as follows.

```cpp
struct seqcount {
  std::atomic<uint32_t> seqnum;

  uint32_t write_begin();
  std::pair<bool, uint32_t> try_write_begin();
  void write_end(uint32_t begin);
  uint32_t read_begin() const;
  bool read_end(uint32_t begin) const;

  constexpr static int kLocked = 1;
};
typedef struct seqcount seqcount_t;
```

The table below summarizes how each member function is used.

|||
|-|-|
| [`write_begin()`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L133-L144) | Called before writing the associated data. When the call returns, the caller holds the write lock. The return value is the latest value of `seqnum`; pass it to `write_end()`. |
| [`write_end(begin)`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L154-L157) | Called when the write operation is finished. |
| `read_begin()` | Called before a read operation. Remember the return value and pass it to `read_end()`. |
| `read_end(begin)` | Called when the read operation is finished. It checks whether a write happened elsewhere during the read and returns whether the read succeeded. |

In practice, the functions are used as follows.

```cpp
// si is assumed to point to a valid entry, e.g. one obtained from
// PrioTable::sched_item().
struct sched_item *si;
uint32_t begin, flags, seq;
bool success = false;

// ☆ Write access example
seq = si->seqcount.write_begin();
si->flags |= SCHED_ITEM_RUNNABLE;  // perform the writes to si->...
si->seqcount.write_end(seq);

// ☆ Read access example: retry until no writer interfered
while (!success) {
  begin = si->seqcount.read_begin();
  flags = si->flags;  // perform the reads of si->...
  success = si->seqcount.read_end(begin);
}
```

## [`work_class`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L65-L71)

The next data placed in the region is an array of `work_class` structs. This array is read-only after initialization. ++It identifies what kind of task a `sched_item` is++: each `sched_item` object uses its `wcid` field to tell the scheduler which Work Class it belongs to, for example whether it is a one-shot task or a repeating task.

```cpp
struct work_class {
  uint32_t id;        /* unique identifier for this work_class */
  uint32_t flags;     // What kind of task this work_class is (OneShot? Repeating?)
  uint32_t qos;       // QoS
  uint64_t exectime;  // Execution time (ns)
  uint64_t period;    // Execution period (ns)
} ABSL_CACHELINE_ALIGNED;
```

QoS specifies how important the task is. The interpretation depends on the scheduler; in the EDF scheduler implementation, a task with a larger qos is treated as having higher priority.

The flags that can be used in `flags` are defined below. They describe whether a task runs periodically or only once, but the exact meaning of these flags depends on the scheduler implementation.

```cpp
#define WORK_CLASS_ONESHOT (1U << 0)
#define WORK_CLASS_REPEATING (1U << 1)
```

:::info
For example, the EDF scheduler keeps tasks of the REPEATING class in the kQueued or kOnCpu state at all times (unless they enter the kBlocked state).
:::

## stream

The last data placed in the region is the [`PrioTable::stream`](https://github.com/google/ghost-userspace/blob/9ca0a1fb6ed88f0c4b0b40a5a35502938efa567f/shared/prio_table.h#L107-L110) struct. It is defined as follows.

```cpp
struct stream {
  std::atomic<int> scrape_all;
  std::atomic<int> entries[];
};
```

## PrioTable implementation

```cpp
class PrioTable {
 public:
  // Candidate values for the stream size.
  // Only primes are offered. Perhaps this makes hash collisions less likely?
  // ⇢ I doubt it helps much (a power of two would arguably make the
  //   modulo faster).
  enum class StreamCapacity :
      uint32_t {
    kStreamCapacity11 = 11,
    kStreamCapacity19 = 19,
    kStreamCapacity31 = 31,
    kStreamCapacity43 = 43,
    kStreamCapacity53 = 53,
    kStreamCapacity67 = 67,
    kStreamCapacity83 = 83,
    kStreamCapacity97 = 97,
  };

  PrioTable() {}

  // Constructor that creates a PrioTable in the current process.
  // num_items: number of sched_items
  // num_classes: number of work_classes
  // stream_capacity: number of entries in the stream
  PrioTable(uint32_t num_items, uint32_t num_classes,
            StreamCapacity stream_capacity);
  ~PrioTable();

  // Attaches this object to the PrioTable of another (remote) process.
  bool Attach(pid_t remote);

  // Returns a pointer to the i-th sched_item.
  struct sched_item* sched_item(int i) const;
  struct work_class* work_class(int i) const;

  inline struct ghost_shmem_hdr* hdr() const { return hdr_; }

  inline int NumSchedItems() { return hdr()->si_num; }
  inline int NumWorkClasses() { return hdr()->wc_num; }

  // Definition of the stream.
  struct stream {
    std::atomic<int> scrape_all;
    std::atomic<int> entries[];
  };

  // Return values of NextUpdatedIndex().
  // kStreamNoEntries: no item has been updated.
  // kStreamOverflow: the entries overflowed; in this case every item is
  //                  considered updated.
  static constexpr int kStreamNoEntries = -1;
  static constexpr int kStreamOverflow = -2;

  void MarkUpdatedIndex(int idx, int num_retries);
  int NextUpdatedIndex();

  pid_t Owner() const { return shmem_ ? shmem_->Owner() : 0; }

  PrioTable(const PrioTable&) = delete;
  PrioTable(PrioTable&&) = delete;

 private:
  std::unique_ptr<GhostShmem> shmem_;
  struct ghost_shmem_hdr* hdr_ = nullptr;

  static constexpr int kStreamFreeEntry = std::numeric_limits<uint32_t>::max();
  struct stream* stream();
};
```

### ☆ sched_item(sidx)

Member function that returns a pointer to the sidx-th sched_item.

### MarkUpdatedIndex

Member function to call after modifying a sched_item. `idx` is the index of the sched_item, and `num_retries` is the number of retries when searching for a free slot in the stream (the stream is an open-addressing hash table).

```cpp
void PrioTable::MarkUpdatedIndex(int idx, int num_retries) {
  // Prepare the pointers we need.
  struct stream* s = stream();
  std::atomic<int>* scrape_all = &s->scrape_all;
  std::atomic<int>* entries = s->entries;

  // Already in overflow? Ensure we are covered by a scrape_all pass.
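  if (scrape_all->load(std::memory_order_relaxed) > 0) {
    scrape_all->fetch_add(1, std::memory_order_release);
    return;
  }

  // Probe up to num_retries + 1 slots for a free entry, in the order
  // idx ⇢ idx+1 ⇢ idx+2 ⇢ ... ⇢ idx+num_retries (mod the stream capacity).
  for (int i = 0; i < num_retries + 1; i++) {
    std::atomic<int>* cell = &entries[(idx + i) % hdr()->st_cap];
    int expected = kStreamFreeEntry;

    // Done if the entry is free and we managed to write idx into it.
    // (Writing idx into the entry is what claims ownership of it.)
    if (cell->load(std::memory_order_relaxed) == kStreamFreeEntry &&
        cell->compare_exchange_weak(expected, idx, std::memory_order_release,
                                    std::memory_order_relaxed)) {
      return;
    }
  }

  // No free entry was found within num_retries + 1 probes, so increment
  // scrape_all to signal an overflow.
  scrape_all->fetch_add(1, std::memory_order_release);
}
```

### NextUpdatedIndex

Scans the contents that have arrived in the stream, starting from the head. (The order probably does not matter much?)

## Structure of the stream

The stream can be viewed as an array of 32-bit integers (each declared as `std::atomic<int>`). The first element is scrape_all, followed by entries. Because memory ordering matters for all of them, they are defined with `std::atomic`.

The size of the entries array is implemented as ++a prime number++. Each element of entries holds either the index of a sched_item or kStreamFreeEntry. In the class listing above, kStreamFreeEntry is initialized from `std::numeric_limits<uint32_t>::max()`; stored in an `int`, that value reads as -1 on typical two's-complement platforms.

How scrape_all is used is a little tricky. A value greater than 0 means overflow. When an overflow occurs, every sched_item is treated as updated.

Reading the code that uses the stream may make this clearer.

## Usage (rough notes)

The basic way to use PrioTable is as follows.

1. Get a pointer to the sched_item with PrioTable::sched_item(sidx).
2. Write to the sched_item.
3. Call MarkUpdatedIndex(sidx, m) (a value such as 3 is fine for m).

## Notes

* At most one PrioTable exists per process.
* The name of the shared memory is kPrioTableShmemName = "priotable".
* To attach to a remote PrioTable, specifying the pid is enough.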
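
## Sketch: behavior of the stream

The stream protocol described under MarkUpdatedIndex and in the stream structure section can be tried out with a small stand-alone model. The following is a minimal sketch, not the ghOSt implementation. `MiniStream`, `Mark`, `Next`, and the constants `kFree`, `kNoEntries`, and `kOverflow` are hypothetical names introduced only for illustration. The consumer side (`Next`) is an assumption about how a scan from the head could work with a single consumer, because the real `NextUpdatedIndex` is not reproduced in this note.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for kStreamFreeEntry, kStreamNoEntries, kStreamOverflow.
constexpr int kFree = -1;
constexpr int kNoEntries = -1;
constexpr int kOverflow = -2;

// Simplified in-process model of the stream (no shared memory involved).
template <std::size_t Capacity>
struct MiniStream {
  std::atomic<int> scrape_all{0};
  std::array<std::atomic<int>, Capacity> entries;

  MiniStream() {
    for (std::atomic<int>& e : entries) {
      e.store(kFree, std::memory_order_relaxed);
    }
  }

  // Producer side: record that item idx changed (idx must be >= 0).
  void Mark(int idx, int num_retries) {
    // Already overflowed: the pending full scan covers this update too.
    if (scrape_all.load(std::memory_order_relaxed) > 0) {
      scrape_all.fetch_add(1, std::memory_order_release);
      return;
    }
    // Linear probing over num_retries + 1 slots, starting at idx.
    for (int i = 0; i < num_retries + 1; i++) {
      std::atomic<int>& cell =
          entries[static_cast<std::size_t>(idx + i) % Capacity];
      int expected = kFree;
      // The strong variant is used here so a free slot is never skipped
      // because of a spurious failure.
      if (cell.compare_exchange_strong(expected, idx,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        return;
      }
    }
    // No free slot found: signal overflow.
    scrape_all.fetch_add(1, std::memory_order_release);
  }

  // Consumer side (assumes a single consumer): report an overflow first,
  // otherwise return and free the first occupied slot found from the head.
  int Next() {
    if (scrape_all.exchange(0, std::memory_order_acquire) > 0) {
      return kOverflow;
    }
    for (std::atomic<int>& cell : entries) {
      int v = cell.load(std::memory_order_acquire);
      if (v != kFree) {
        cell.store(kFree, std::memory_order_release);
        return v;
      }
    }
    return kNoEntries;
  }
};
```

With `MiniStream<11> s;`, calling `s.Mark(5, 0)` twice leaves no room for the second update (zero retries, and slot 5 is already taken), so the next `s.Next()` reports `kOverflow`, and the call after that still returns 5 from the occupied slot. This mirrors the rule above that an overflow means every item should be treated as updated.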