# Members added to task_struct

## gtid

Task identifier for ghOSt. In the Linux kernel a tid is limited to 22 bits, so the remaining bits are used to carry ghOSt-specific information.

## inhibit_task_msgs

Used to **suppress message delivery for a task**. While this member is 0, message delivery is not suppressed. It is normally set to 0 or 1, but when a task has DEPARTED it holds the ABI of the enclave instead. The ABI stored at departure matters when the task later returns to its original enclave. For details, see ghost_prep_task, which runs when the task enters the ghOSt policy again.

# Members added to rq

An instance of ghost_rq is added to rq. ghost_rq is described [below](#ghost_rq).

# gtid_t

# ghost_status_word

Data structure of a status word. One instance is allocated per thread, and the kernel exposes it to the agent through shared memory. It gathers the thread state, CPU information, and similar data.

```c
struct ghost_status_word {
    uint32_t barrier;
    uint32_t flags;
    uint64_t gtid;
    int64_t switch_time;
    uint64_t runtime;
} __attribute__((__aligned__(32)));
```

## flags

flags holds values that describe the state of the status word, such as whether it is in use or no longer used. The available flags are defined as macros named GHOST_SW_F_*. Some flags are used only in the status word of an agent; they are summarized below.

| Flag | Meaning |
|-|-|
| **GHOST_SW_CPU_AVAIL** | Indicates whether the rq holds threads of a class with higher priority than the ghOSt class (CFS threads, for example). Checking this flag therefore tells you whether a ghOSt task can currently use that CPU. |
| **GHOST_SW_BOOST_PRIO** | Set while the priority of the agent thread is boosted. ghOSt threads are normally scheduled at the lowest priority, so CFS threads and others preempt them. The agent thread, however, can be scheduled in the ghost_agent class, in which case it runs without being preempted by CFS threads. The flag is set in that situation. |

For the priority boost of the agent thread, see the figure below.

![CA35F275-C1F3-4FF3-8395-2E6D0ECE3D5F](https://hackmd.io/_uploads/ryiErsNNT.jpg)

# ghost_sw_region

```c
struct ghost_sw_region {
    struct list_head list;                 /* ghost_enclave glue */
    uint32_t alloc_scan_start;             /* allocator starts scan here */
    struct ghost_sw_region_header *header; /* pointer to vmalloc memory */
    size_t mmap_sz;                        /* size of mmapped region */
    struct ghost_enclave *enclave;
};
```

Data structure that manages a region in which status words are placed.
It is linked into a list owned by ghost_enclave, and the object is freed when the enclave is destroyed.
++This structure itself is not exposed to the agent++; it exists only in kernel space.
**The memory that header points to is where the *status words* are actually placed.**

# ghost_sw_region_header

```c
struct ghost_sw_region_header {
    uint32_t version;   // (= GHOST_SW_REGION_VERSION)
    uint32_t id;        // number of the SW region (0, 1, ...)
    uint32_t numa_node;
    uint32_t start;     // offset from the address of this struct to where the SWs are placed
    uint32_t capacity;  // number of SWs this region can hold
    uint32_t available; // number of SWs in this region that are still unused
} __ghost_cacheline_aligned;
#define GHOST_SW_REGION_VERSION 0
```

The region that holds the *status words* is called the **SW region** in these notes; this structure sits at the head of that region.
An SW region is created by calling functions such as `ghost_create_sw_region`.

# ghost_sw_info

Data structure that bundles the information about an SW (the ID of its SW region and its index in the table).

```c
struct ghost_sw_info {
    uint32_t id;    /* status_word region id */
    uint32_t index; /* index into the status_word array */
};
```

# [sched_ghost_entity](https://github.com/google/ghost-kernel/blob/b948b132880c21c187eae8b0d2bf83b4d3eb80fc/include/linux/sched.h#L592-L646)

```c
struct sched_ghost_entity {
    struct list_head run_list;
    ktime_t last_runnable_at;

    /* The following fields are protected by 'task_rq(p)->lock' */

    // @dst_q: pointer to the queue this task is associated with.
    // Messages about this task are sent to dst_q.
    struct ghost_queue *dst_q;
    struct ghost_status_word *status_word;
    struct ghost_enclave *enclave;

    /* See ghost_destroy_enclave() */
    int __agent_free_cpu_cmd;

    // Flags that record a state change of the task.
    // They are read in ghost_produce_prev_msgs, which is called right before
    // pick_next_task. For example, when the task goes to sleep, blocked_task
    // is set to true inside sched_class.dequeue_task. The TASK_BLOCKED message
    // is not sent at that point; by design, messages are sent together right
    // before the context switch.
    uint blocked_task : 1;
    uint yield_task : 1;
    uint new_task : 1;

    // Set to true when this task is an agent.
    // Set when the agent is registered through sched_setscheduler.
    uint agent : 1;

    // Appears to be used inside _select_task_rq_ghost.
    //
    /*
     * Locking of 'twi' is awkward:
     * 1. wake_up_new_task: both select_task_rq() and task_woken_ghost()
     *    are called with 'pi->lock' held.
     * 2. ttwu_do_activate: both select_task_rq() and task_woken_ghost()
     *    are called with 'pi->lock' held when called via ttwu_queue()
     *    (i.e. not a remote wakeup).
     * 3. ttwu_do_activate: only 'rq->lock' is held when called via
     *    sched_ttwu_pending (i.e. indirectly via ttwu_queue_remote).
     *
     * (1) and (2) are easy because 'p->pi_lock' is held across both
     * select_task_rq() and task_woken_ghost().
     *
     * (3) is tricky because 'p->pi_lock' is held when select_task_rq()
     * is called on the waker's cpu while 'rq->lock' is held when
     * task_woken_ghost() is called on the remote cpu. We rely on the
     * following constraints:
     * a. Once a task is woken up there cannot be another wakeup until
     *    it gets oncpu and blocks (thus another wakeup cannot happen
     *    until task_woken_ghost() has been called).
     * b. flush_smp_call_function_queue()->llist_del_all() pairs with
     *    __ttwu_queue_wakelist()->llist_add() to guarantee visiblity
     *    of changes made to 'p->ghost.twi' on the waker's cpu when
     *    ttwu_do_activate() is called on the remote cpu.
     */
    struct {
        int last_ran_cpu;
        int wake_up_cpu;
        int waker_cpu;
        bool skip_ttwu_queue;
    } twi; /* twi = task_wakeup_info */

    struct list_head task_list;
    struct rcu_head rcu;
};
```

## new_task

This member is set to true inside the \_\_ghost_prep_task function.

:::warning
For details on what happens when a task or an agent enters ghOSt, see [here](https://hackmd.io/MP636W0TQ5mqp-prlhTVkw?view)
:::

# [ghost_rq](https://github.com/google/ghost-kernel/blob/b948b132880c21c187eae8b0d2bf83b4d3eb80fc/kernel/sched/sched.h#L107-L144)

This structure is used as a field of rq.

```c
struct ghost_rq {
    // Pointer to the task_struct of the agent that belongs to this CPU.
    struct task_struct *agent;

    // Memory barrier.
    // The comment inside __ghost_wake_agent_on explains how this field
    // is used.
    uint32_t agent_barrier;

    // True while the agent is blocked inside ghost_run().
    // When this member is true, the agent has given up the CPU voluntarily.
    bool blocked_in_run;
    [...]

    // Set to true when the agent should be woken up.
    // The ghost_agent class has higher priority than any other sched_class,
    // so it is consulted first at pick_next_task time. This member decides
    // whether the agent should be returned at that point.
    bool agent_should_wake;
    [...]
    // must_resched is set to true when a reschedule request must be issued
    // to rq->curr immediately. Writing must_resched directly requires
    // holding rq->lock, which is inefficient, so the set_must_resched member
    // was introduced to make this cheaper. set_must_resched is read and
    // written atomically, and its value is propagated to must_resched early
    // in PNT (pick_next_task).
    bool set_must_resched;
    bool must_resched;
    [...]

    // Flags passed in from ghost_run().
    int run_flags;

    // Incremented every time a message arrives.
    // Corresponds to the value written as A_seq in the paper.
    uint64_t cpu_seqnum;
    [...]

    // Task that is returned when pick_next_task is called.
    struct task_struct *latched_task;

    // Currently unused.
    long switchto_count;

    int64_t rendezvous;
};
```

## blocked_in_run & agent_on_rq & agent_should_wake

![CFD84152-C2A9-44E2-9DCC-F6E5D927203C.jpeg](https://hackmd.io/_uploads/Skg0LjQX6.jpg)

## tasks

The list of tasks used by ghOSt looks roughly like this.

![](https://hackmd.io/_uploads/BJvCokpW6.jpg)

## rendezvous

A field that matters for sync-groups.

| Value | Meaning |
| --- | --- |
| = 0 | Nothing is participating in a Sync Group? |
| < 0 | The Sync Group commit is in progress |
| > 0 | The Sync Group commit completed successfully |

This field is implemented as a signed 64-bit integer and is handled in the following format.

![5C1B7F4D-DA18-4617-A8D9-DC862ECA3ACC](https://hackmd.io/_uploads/SkqiJyGST.jpg)

### Related macros

Note: the figure above shows the CPU field as 11 bits wide, but it becomes 14 bits depending on the number of CPUs in the system. The macro definitions related to this format are shown below.

```c
#if (CONFIG_NR_CPUS < 2048)
#define SG_COOKIE_CPU_BITS 11
#else
#define SG_COOKIE_CPU_BITS 14
#endif
#define SG_COOKIE_CPU_SHIFT (63 - SG_COOKIE_CPU_BITS)
```

```c
#define GHOST_NO_RENDEZVOUS 0
#define GHOST_POISONED_RENDEZVOUS (1LL << 51)
```

### sync_group_cookie

**A per-CPU variable that manages the value placed in the counter field.** The counter identifies a Sync Group commit. Below is the implementation of ghost_sync_group_cookie, the function that reads this per-CPU variable. It increments the variable and then returns the new value.

```c
int64_t ghost_sync_group_cookie(void)
{
    int64_t val;
    ...
    val = __this_cpu_inc_return(sync_group_cookie);
    ...
    return val;
}
```

The variable is initialized as follows.

```c
void __init init_sched_ghost_class(void)
{
    int64_t cpu;

    for_each_possible_cpu(cpu)
        per_cpu(sync_group_cookie, cpu) = cpu << SG_COOKIE_CPU_SHIFT;
}
```

The 64-bit integer returned by ghost_sync_group_cookie therefore has the following layout (counter increases as 1, 2, 3, ...).

![45D3C37F-F36F-4D15-AF1D-6219A55C54AA](https://hackmd.io/_uploads/BkumxfMrp.jpg)

## set_must_resched

**A field used to set must_resched to true** in pnt_prologue, which runs just before PNT (pick_next_task). Setting must_resched directly requires a lock, whereas this field needs no lock and is therefore cheaper.

Look at the implementation of pnt_prologue ([source code](https://github.com/google/ghost-kernel/blob/edd5f9490d82df24c16f90a62f7be05c6c389867/kernel/sched/ghost.c#L8485-L8493)). It checks set_must_resched and, if it is true, clears it and writes true to must_resched.

```c
static void pnt_prologue(struct rq *rq, struct task_struct *prev,
                         struct rq_flags *rf)
{
    ...
    if (READ_ONCE(rq->ghost.set_must_resched)) {
        WRITE_ONCE(rq->ghost.set_must_resched, false);
        rq->ghost.must_resched = true;
    }
    ...
}
```

## must_resched

A field that requests that rescheduling definitely take place.

## check_prev_preemption

A field that reports whether prev was preempted in PNT.
**When this field is true, a TASK_PREEMPT message is produced at context-switch time.**

```c
static void prepare_task_switch(struct rq *rq, struct task_struct *prev,
                                struct task_struct *next)
{
    ...
    if (rq->ghost.check_prev_preemption) {
        rq->ghost.check_prev_preemption = false; // reset
        ghost_task_preempted(rq, prev);          // produce the message
        ghost_wake_agent_of(prev);               // wake the agent
    }
    ...
}
```

The field is set to true inside pnt_prologue, which runs early in PNT.

```c
static void pnt_prologue(struct rq *rq, struct task_struct *prev,
                         struct rq_flags *rf)
{
    // Produce the messages for voluntary state changes of prev,
    // for example BLOCKED or YIELD.
    // Return value:
    // - true:  prev was preempted
    // - false: prev was not preempted
    rq->ghost.check_prev_preemption = ghost_produce_prev_msgs(rq, prev);
    ...
}
```

## ignore_prev_preemption

Set to true to cancel the effect of check_prev_preemption.
pnt_prologue, which is called early in PNT, handles it as follows.

```c
static void pnt_prologue(struct rq *rq, struct task_struct *prev,
                         struct rq_flags *rf)
{
    ...
    if (unlikely(rq->ghost.ignore_prev_preemption)) {
        // If ignore_prev_preemption is set, clear check_prev_preemption
        // as well.
        rq->ghost.check_prev_preemption = false;
        rq->ghost.ignore_prev_preemption = false;
    }
    ...
}
```

# [ghost_queue](https://github.com/google/ghost-kernel/blob/b948b132880c21c187eae8b0d2bf83b4d3eb80fc/kernel/sched/ghost.c#L2008-L2037)

```c
struct ghost_queue {
    [...]
    struct queue_notifier *notifier;
    [...]
};
```

## notifier

Holds the list of agents to wake up when a new message is pushed to the queue. Its type is the queue_notifier structure, which has the following fields.

```c
struct queue_notifier {
    struct ghost_agent_wakeup winfo[GHOST_MAX_WINFO];
    struct rcu_head rcu;
    unsigned int wnum;
};
```

This field is configured through the GHOST_IOC_CONFIG_QUEUE_WAKEUP ioctl, so see that implementation for details.

# ghost_ring

```c
struct ghost_ring {
    _ghost_ring_index_t head;
    _ghost_ring_index_t tail;
    _ghost_ring_index_t overflow;
    struct ghost_msg msgs[0]; /* array of size 'header->nelems' */
};
```

## head

The position where the next message will be added.

## tail

The position from which the next message will be taken.

## overflow

A counter incremented when the ring is full and a message cannot be added.

# ghost_msg

Data structure of the messages placed in the ghost_ring.msgs buffer. It is a variable-length structure, and payload stores the additional information of the message.

```c
struct ghost_msg {
    uint16_t type;       /* message type */
    uint16_t length;     /* length of this message including payload */
    uint32_t seqnum;     /* sequence number for this msg source */
    uint32_t payload[0]; /* variable length payload */
};
```

![image](https://hackmd.io/_uploads/BkKwA_iVa.png)

## type

A field that identifies the kind of message. The values it can take are defined by the following enum.

```c
/* ghost message types */
enum {
    /* misc msgs */
    MSG_NOP = 0,

    /* task messages */
    MSG_TASK_DEAD = _MSG_TASK_FIRST,
    MSG_TASK_BLOCKED,
    MSG_TASK_WAKEUP,
    MSG_TASK_NEW,
    MSG_TASK_PREEMPT,
    MSG_TASK_YIELD,
    MSG_TASK_DEPARTED,
    MSG_TASK_SWITCHTO,
    MSG_TASK_AFFINITY_CHANGED,
    MSG_TASK_ON_CPU,
    MSG_TASK_PRIORITY_CHANGED,
    MSG_TASK_LATCH_FAILURE,

    /* cpu messages */
    MSG_CPU_TICK = _MSG_CPU_FIRST,
    MSG_CPU_TIMER_EXPIRED,
    MSG_CPU_NOT_IDLE, /* requested via run_flags: NEED_CPU_NOT_IDLE */
    MSG_CPU_AVAILABLE,
    MSG_CPU_BUSY,
    MSG_CPU_AGENT_BLOCKED,
    MSG_CPU_AGENT_WAKEUP,
};
```

## payload

The data placed in payload is defined by the structures named **ghost_msg_XXX**, introduced next.

Ignoring the variable-length part, the size of ghost_msg is 8 bytes. This size is called the **slot size**, and ++messages are placed at multiples of the slot size++ (in other words, the size of every message is a multiple of 8 bytes). The head and tail of ghost_ring are also managed in units of the slot size, so a message is accessed as ghost_ring.msgs[ghost_ring.tail].

When the kernel sends a message, it goes through a function named **task_deliver_msg_XXX**. One such function is defined per message type. Taking the TASK_NEW message as an example, the call graph looks like this.

```mermaid
flowchart BT
_switched_from_ghost --> _ghost_task_new
ghost_task_new --> _ghost_task_new --> task_deliver_msg_task_new
generate_task_new --> task_deliver_msg_task_new --> produce_for_task --> __produce_for_task --> ghost_bpf_msg_send
__produce_for_task --> _produce
```

## task_deliver_msg_XXX

Functions that push a task-related message onto the message queue. XXX stands for the kind of message.

```c
static void task_deliver_msg_XXX(struct rq *rq, struct task_struct *p)
{
    struct bpf_ghost_msg *msg = this_cpu_ghost_msg();
    struct ghost_msg_payload_XXX *payload = &msg->XXXt;
    struct task_struct *parent;

    // Processing shared by all task_deliver_msg_XXX functions:
    // * Increment the barrier of the task.
    //   This happens whether or not the message is actually sent.
    // * Check whether the message needs to be sent at all.
    //   If not, __task_deliver_common returns -1 and we return early.
    if (__task_deliver_common(rq, p))
        return;

    msg->type = MSG_TASK_XXX;
    payload->gtid = gtid_of(p);

    /* Set the payload members (the implementation differs per message type) */

    // Send the message.
    produce_for_task(p, msg);
}
```

## MSG_TASK_ON_CPU

```mermaid
flowchart BT
prepare_task_switch --> ghost_task_got_oncpu --> task_deliver_msg_on_cpu --> produce_for_task
```

This message is produced on the call path from prepare_task_switch, which runs right before a context switch. It is produced as a state change of the task chosen as next in that context switch. Sending it is controlled by the run flags: unless the run flags include
SEND_TASK_ON_CPU, the message is not sent.

## MSG_CPU_TICK

A message produced in the `sched_class->task_tick` processing (that is, a message generated on every timer interrupt). The `task_tick` handler of ghOSt looks like this:

```c
static void _task_tick_ghost(struct rq *rq, struct task_struct *p, int queued)
{
    [ ... ]

    // Update vruntime and similar accounting.
    __update_curr_ghost(rq, false);

    // Send the CPU_TICK message.
    // The return value of cpu_deliver_msg_tick tells whether the message
    // was sent to the agent. When the eBPF program handles it completely,
    // the agent is not woken.
    if (cpu_deliver_msg_tick(rq))
        // The message was sent to the agent, so wake the target agent.
        ghost_wake_agent_on(agent_target_cpu(rq));
}
```

The payload is the following.

```c
struct ghost_msg_payload_cpu_tick {
    int cpu;
};
```

## MSG_CPU_NOT_IDLE

```mermaid
flowchart BT
prepare_task_switch --> ghost_need_cpu_not_idle --> cpu_deliver_msg_not_idle --> produce_for_agent --> __produce_for_task
```

In terms of call sites, it is produced right before a context switch.
It appears to be delivered when the CPU stops running the idle task.
It also appears to be requested through the run flags (this part is not yet well understood).

## Understanding the switchto chain

:::danger
❗❗❗❗❗❗
The switchto chain does not seem to be used at the moment? No code that increments switchto_count exists anywhere. The from_switchto value of every message is therefore guaranteed to be 0.
:::

The switchto chain shows up repeatedly when reading the TASK-related messages.
The TASK_PREEMPT, TASK_YIELD, TASK_BLOCKED, and TASK_DEPARTED messages carry a member named **from_switchto**. Its value is set as follows.

```c
payload->from_switchto = ghost_in_switchto(rq);
```

The ghost_in_switchto function used here is implemented as follows.

```c
/*
 * When called from pick_next_task() context returns 'true' if 'rq->cpu'
 * is exiting switchto and 'false' otherwise (e.g. when producing the
 * TASK_BLOCKED/TASK_YIELD/TASK_PREEMPT msgs).
 *
 * When called outside pick_next_task() context returns 'true' if 'rq->cpu'
 * is currently in a switchto chain and 'false' otherwise (e.g. when producing
 * TASK_DEPARTED msg for an oncpu ghost task).
 *
 * Technically this could be split into two APIs one for 'switchto_count < 0'
 * and another for 'switchto_count > 0' but that feels like overkill.
 */
static bool ghost_in_switchto(struct rq *rq)
{
    return rq->ghost.switchto_count ?
        true : false;
}
```

To summarize the comment:

* The meaning of the return value depends on where the function is called from.
    * Inside pick_next_task
        * true: the CPU is about to leave the switchto chain.
        * false: the CPU is not leaving a switchto chain.
    * Outside pick_next_task
        * true: the CPU is in the middle of a switchto chain.
        * false: the CPU is not in the middle of a switchto chain.

The switchto chain is tracked by the rq.ghost.switchto_count member, which is of type long. The rest of this section follows the places where this member is used. First, as the comment says, the member can be in one of three states: negative, zero, or positive.

The pnt_prologue function, called right before pick_next_task, contains the following processing. Until pnt_prologue is called, the member apparently holds a non-negative value.

```c
static void pnt_prologue(struct rq *rq, struct task_struct *prev,
                         struct rq_flags *rf)
{
    [ ... ]

    /* a negative 'switchto_count' indicates end of the chain */
    rq->ghost.switchto_count = -rq->ghost.switchto_count;
    WARN_ON_ONCE(rq->ghost.switchto_count > 0);
```

The prepare_task_switch function, called right before a context switch, ends with the following processing.

```c
static void prepare_task_switch(struct rq *rq, struct task_struct *prev,
                                struct task_struct *next)
{
    [ ... ]

done:
    [ ... ]

    /*
     * The last task in the chain scheduled (blocked/yielded/preempted).
     */
    if (rq->ghost.switchto_count < 0)
        rq->ghost.switchto_count = 0;
}
```

A value of 0 means that the switchto chain has ended.

# ghost_abi

A structure that seems to bundle the ghOSt-related functions. Instances of this structure are gathered into one place at link time.

```c
struct ghost_abi {
    int version;
    int (*abi_init)(const struct ghost_abi *abi);
    struct ghost_enclave *
        (*create_enclave)(const struct ghost_abi *abi,
                          struct kernfs_node *dir, ulong id,
                          char *cmd_extra);
    void (*enclave_release)(struct kref *k);
    void (*enclave_add_cpu)(struct ghost_enclave *e, int cpu);
    void (*group_dead)(pid_t tgid);
    int (*setscheduler)(struct ghost_enclave *e, struct task_struct *p,
                        struct rq *rq, const struct sched_attr *attr,
                        int *reset_on_fork);
    int (*fork)(struct ghost_enclave *e, struct task_struct *p);
    void (*cleanup_fork)(struct ghost_enclave *e, struct task_struct *p);
    void (*wait_for_rendezvous)(struct rq *rq);
    void (*pnt_prologue)(struct rq *rq, struct task_struct *prev,
                         struct rq_flags *rf);
    void (*prepare_task_switch)(struct rq *rq, struct task_struct *prev,
                                struct task_struct
                                *next);
    void (*tick)(struct ghost_enclave *e, struct rq *rq);
    void (*switchto)(struct rq *rq, struct task_struct *prev,
                     struct task_struct *next, int switchto_flags);
    void (*commit_greedy_txn)(int cpu);
    void (*copy_process_epilogue)(struct task_struct *p);
    void (*cpu_idle)(struct rq *rq);
    void (*timerfd_triggered)(int cpu, uint64_t type, uint64_t cookie);
    int (*bpf_wake_agent)(int cpu);

    // Called from the BPF helper function bpf_ghost_run_gtid.
    int (*bpf_run_gtid)(s64 gtid, u32 task_barrier, int run_flags, int cpu);

    int (*bpf_resched_cpu)(int cpu, u64 seqnum); /* DEPRECATED as of ABI 79. */
    int (*bpf_resched_cpu2)(int cpu, int flags);
    bool (*ghost_sched_is_valid_access)(int off, int size,
                                        enum bpf_access_type type,
                                        const struct bpf_prog *prog,
                                        struct bpf_insn_access_aux *info);
    bool (*ghost_msg_is_valid_access)(int off, int size,
                                      enum bpf_access_type type,
                                      const struct bpf_prog *prog,
                                      struct bpf_insn_access_aux *info);
    bool (*ghost_select_rq_is_valid_access)(
        int off, int size, enum bpf_access_type type,
        const struct bpf_prog *prog, struct bpf_insn_access_aux *info);
    int (*bpf_link_attach)(const union bpf_attr *attr,
                           struct bpf_prog *prog, int ea_type, int ea_abi);

    /* ghost_agent_sched_class callbacks */
    struct task_struct *(*pick_next_ghost_agent)(struct rq *rq);

    /* ghost_sched_class callbacks */
    void (*update_curr)(struct rq *rq);
    void (*prio_changed)(struct rq *rq, struct task_struct *p, int old);
    void (*switched_to)(struct rq *rq, struct task_struct *p);
    void (*switched_from)(struct rq *rq, struct task_struct *p);
    void (*task_dead)(struct task_struct *p);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    struct task_struct *(*pick_next_task)(struct rq *rq);
    void
    (*check_preempt_curr)(struct rq *rq, struct task_struct *p,
                          int wake_flags);
    void (*yield_task)(struct rq *rq);
#ifdef CONFIG_SMP
    int (*balance)(struct rq *rq, struct task_struct *prev,
                   struct rq_flags *rf);
    int (*select_task_rq)(struct task_struct *p, int cpu, int wake_flags);
    void (*task_woken)(struct rq *rq, struct task_struct *p);
    void (*set_cpus_allowed)(struct task_struct *p,
                             const struct cpumask *newmask, u32 flags);
#endif
};
```

As with sched_class, the ABIs are embedded by the linker script (lds) as follows.

```c
#ifdef CONFIG_SCHED_CLASS_GHOST
#define GHOST_ABI_RODATA            \
    STRUCT_ALIGN();                 \
    __begin_ghost_abi = .;          \
    *(.rodata.ghost_abi)            \
    __end_ghost_abi = .;
#else
#define GHOST_ABI_RODATA
#endif
```

An instance of ghost_abi is created with the **DEFINE_GHOST_ABI** macro. The implementation declares exactly one.

```c
DEFINE_GHOST_ABI(current_abi) = {
    .version = GHOST_VERSION, // 83
    .abi_init = abi_init,
    .create_enclave = create_enclave,
    .enclave_release = enclave_release,
    .enclave_add_cpu = ___enclave_add_cpu,
    .group_dead = _ghost_group_dead,
    .setscheduler = _ghost_setscheduler,
    .fork = _ghost_sched_fork,
    .cleanup_fork = _ghost_sched_cleanup_fork,
    .wait_for_rendezvous = wait_for_rendezvous,
    .pnt_prologue = pnt_prologue,
    .prepare_task_switch = prepare_task_switch,
    .tick = tick_handler,
    .switchto = _ghost_switchto,
    .commit_greedy_txn = _commit_greedy_txn,
    .copy_process_epilogue = ghost_initialize_status_word,
    .cpu_idle = cpu_idle,
    .timerfd_triggered = _ghost_timerfd_triggered,
    .bpf_wake_agent = bpf_wake_agent,
    .bpf_run_gtid = bpf_run_gtid,
    .bpf_resched_cpu2 = bpf_resched_cpu2,
    .ghost_sched_is_valid_access = ghost_sched_is_valid_access,
    .ghost_msg_is_valid_access = ghost_msg_is_valid_access,
    .ghost_select_rq_is_valid_access = ghost_select_rq_is_valid_access,
    .bpf_link_attach = _ghost_bpf_link_attach,

    /* ghost_agent_sched_class callbacks */
    .pick_next_ghost_agent = pick_next_ghost_agent,

    /* ghost_sched_class callbacks */
    .update_curr = _update_curr_ghost,
    .prio_changed = _prio_changed_ghost,
    .switched_to = _switched_to_ghost,
    .switched_from =
        _switched_from_ghost,
    .task_dead = _task_dead_ghost,
    .dequeue_task = _dequeue_task_ghost,
    .put_prev_task = _put_prev_task_ghost,
    .enqueue_task = _enqueue_task_ghost,
    .set_next_task = _set_next_task_ghost,
    .task_tick = _task_tick_ghost,
    .pick_next_task = _pick_next_task_ghost,
    .check_preempt_curr = _check_preempt_curr_ghost,
    .yield_task = _yield_task_ghost,
#ifdef CONFIG_SMP
    .balance = _balance_ghost,
    .select_task_rq = _select_task_rq_ghost,
    .task_woken = _task_woken_ghost,
    .set_cpus_allowed = _set_cpus_allowed_ghost,
#endif
};
```

## pnt_prologue

In the eBPF context, a callback function invoked immediately before the `__schedule --> pick_next_task` call. Inside this function, the messages that the previous task had pending are produced.

# ghost_enclave

* An enclave manages a group of CPUs.
* The cpus member holds the information about the CPUs under its control.

```c
/*
 * ghost_enclave is a container for the agents, queues and sw_regions
 * that express the scheduling policy for a set of CPUs.
 */
struct ghost_enclave {
    const struct ghost_abi *abi;

    /*
     * 'lock' serializes mutation of 'sw_region_list' as well as
     * allocation and freeing of status words within a region.
     *
     * 'lock' also serializes mutation of 'def_q'.
     *
     * 'lock' requires the irqsave variant of spin_lock because
     * it is called in code paths with the 'rq->lock' held and
     * interrupts disabled.
     */
    spinlock_t lock;
    struct kref kref;
    struct list_head sw_region_list;
    struct ghost_cpu_data **cpu_data;
    struct cpumask cpus;
    struct ghost_queue *def_q; /* default queue */

    struct list_head inhibited_task_list;
    struct list_head task_list; /* all non-agent tasks in the enclave */
    unsigned long nr_tasks;
    struct work_struct task_reaper;
    struct enclave_work ew; /* to defer work while holding locks */
    struct work_struct enclave_actual_release; /* work for enclave_release */

    /*
     * max_unscheduled: How long a task can be runnable, but unscheduled,
     * before the kernel thinks the enclave failed and queues the
     * enclave_destroyer.
     */
    ktime_t max_unscheduled;
    struct work_struct enclave_destroyer;

    bool switchto_disabled;
    bool wake_on_waker_cpu;
    bool commit_at_tick;
    bool deliver_agent_runnability;
    bool deliver_cpu_availability;
    bool deliver_ticks;
    bool live_dangerously;

    unsigned long id;
    int is_dying;
    bool agent_online; /* userspace says agent can schedule. */
    struct kernfs_node *enclave_dir;
    kuid_t uid;
    kgid_t gid;

    /*
     * A non-zero value of 'ephemeral_tgid' indicates that the enclave
     * will be automatically destroyed when the thread group (aka process)
     * indicated by 'ephemeral_tgid' dies.
     */
    pid_t ephemeral_tgid;
    struct list_head ephemeral_list;

#ifdef CONFIG_BPF
    struct bpf_prog *bpf_pnt;
    struct bpf_prog *bpf_msg_send;
    struct bpf_prog *bpf_select_rq;
#endif
};
```

# ghost_cpu_data

Its content is simply the `ghost_txn` introduced next.

```c
struct ghost_cpu_data {
    struct ghost_txn txn;
} __ghost_page_aligned;
```

# ghost_txn

Moved to the page on transactions.
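
As a recap of the `rendezvous` cookie layout from the `sync_group_cookie` section above, here is a minimal user-space sketch of how the CPU number and the counter share one signed 64-bit value. It assumes the `CONFIG_NR_CPUS < 2048` case (11 CPU bits). The helper names `sg_cookie_init`, `sg_cookie_next`, `sg_cookie_cpu`, and `sg_cookie_counter` are hypothetical and do not exist in the ghOSt kernel; they only imitate what `init_sched_ghost_class` and `ghost_sync_group_cookie` are shown to do.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors the CONFIG_NR_CPUS < 2048 branch of the macros above. */
#define SG_COOKIE_CPU_BITS  11
#define SG_COOKIE_CPU_SHIFT (63 - SG_COOKIE_CPU_BITS)

/* Hypothetical helper: initial per-CPU value, as in init_sched_ghost_class(). */
static inline int64_t sg_cookie_init(int64_t cpu)
{
    /* The CPU number must fit in the CPU field so the sign bit stays clear. */
    assert(cpu >= 0 && cpu < ((int64_t)1 << SG_COOKIE_CPU_BITS));
    return cpu << SG_COOKIE_CPU_SHIFT;
}

/*
 * Hypothetical helper: imitates ghost_sync_group_cookie(), which increments
 * the per-CPU variable first and then returns the new value.
 */
static inline int64_t sg_cookie_next(int64_t *cookie)
{
    return ++*cookie;
}

/* Hypothetical helper: extract the CPU field (upper bits below the sign bit). */
static inline int64_t sg_cookie_cpu(int64_t cookie)
{
    return cookie >> SG_COOKIE_CPU_SHIFT;
}

/* Hypothetical helper: extract the counter field (low SG_COOKIE_CPU_SHIFT bits). */
static inline int64_t sg_cookie_counter(int64_t cookie)
{
    return cookie & (((int64_t)1 << SG_COOKIE_CPU_SHIFT) - 1);
}
```

Because the counter starts at 0 and is incremented before it is returned, every cookie handed out is strictly positive, even on CPU 0, which is consistent with `GHOST_NO_RENDEZVOUS` being 0 and negative values being reserved for a commit in progress.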