# Nomad

## Glossary

### Job

Defines the desired state of the work we want to run. We only tell Nomad how we want the job to run, for example how often it runs, which command it executes, and what the restart policy is, and Nomad makes sure the job meets those requirements. A job consists of one or more task groups.

> A Job is a specification provided by users that declares a workload for Nomad. A Job is a form of desired state. The responsibility of Nomad is to make sure the actual state matches the user desired state.

### Task Group

Tasks in the same task group run together on the same client node and are never split up. For example, if you need to ship logs through something like Fluentd, you would put the log shipper and the application in one task group.

> A task group is the unit of scheduling, meaning the entire group must run on the same client node and cannot be split.

### Driver

Describes what actually runs a task, for example Docker or Java.

### Task

The smallest unit of work, executed by a driver. This is where you define which driver to use, its configuration, how many resources it needs, and so on.

### Client

The workers that actually run jobs. Each client runs a Nomad agent (the foreman), which checks whether the servers have handed out new work.

### Allocation??

> An Allocation is a mapping between a task group in a job and a client node. A single job may have hundreds or thousands of task groups, meaning an equivalent number of allocations must exist to map the work to client machines. Allocations are created by the Nomad servers as part of scheduling decisions made during an evaluation.

An instance of a job (more precisely, of one of its task groups) placed on a client node. A bit like a Docker container relative to its image?

### Evaluation??

> Evaluations are the mechanism by which Nomad makes scheduling decisions. When either the desired state (jobs) or actual state (clients) changes, Nomad creates a new evaluation to determine if any actions must be taken. An evaluation may result in changes to allocations if necessary.

> Anytime a job is updated, Nomad creates an evaluation to determine what actions need to take place.


Feels like a kind of state? Running a job shows an evaluation being created and then completing:

```shell
$ nomad job run example.nomad
==> Monitoring evaluation "13ebb66d"
    Evaluation triggered by job "example"
    Allocation "883269bf" created: node "e42d6f19", group "cache"
    Evaluation within deployment: "b0a84e74"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "13ebb66d" finished with status "complete"
```

### Server

Nomad's bosses, who manage everything. Servers replicate data among themselves and elect one big boss (the leader).

> Nomad servers are the brains of the cluster. There is a cluster of servers per region and they manage all jobs and clients, run evaluations, and create task allocations. The servers replicate data between each other and perform leader election to ensure high availability.

## Quick Start

### Creating a Cluster

Bring up one server and two clients. First write the config files:

- `$ vim server.hcl`

  ```
  log_level = "DEBUG"
  data_dir = "/tmp/server1"

  server {
    enabled = true
    bootstrap_expect = 1 # a single server is enough for this local test
  }
  ```

- `$ vim client1.hcl`, `$ vim client2.hcl`

  ```
  # client1.hcl
  # For client2.hcl use "/tmp/client2", "client2", and http = 5657.
  log_level = "DEBUG"
  data_dir = "/tmp/client1"
  name = "client1"

  client {
    enabled = true
    servers = ["127.0.0.1:4647"]
  }

  ports {
    http = 5656
  }
  ```

- Start the nodes. Each agent stays in the foreground, so run each command in its own terminal:

  ```shell
  $ nomad agent -config server.hcl
  $ nomad agent -config client1.hcl
  $ nomad agent -config client2.hcl
  ```

### Defining a Job

- `$ vim hello.nomad`

  ```
  job "hello" {
    datacenters = ["dc1"]

    group "example" {
      task "server" {
        # Run the task as a Docker container
        driver = "docker"

        config {
          image = "hashicorp/http-echo"
          args = ["-listen", ":5678", "-text", "hello"]
        }

        resources {
          memory = 128 # MiB

          network {
            mbits = 10

            # Reserve a fixed port on the client node
            port "http" {
              static = 5678
            }
          }
        }
      }
    }
  }
  ```

- Run the job:

  ```shell
  $ nomad run hello.nomad
  ```

- Check the job:

  ```shell
  $ nomad job status
  ID     Type     Priority  Status   Submit Date
  hello  service  50        running  2020-07-23T12:22:08+08:00

  $ nomad job status hello
  ID            = hello
  Name          = hello
  Submit Date   = 2020-07-23T12:22:08+08:00
  Type          = service
  Priority      = 50
  Datacenters   = dc1
  Namespace     = default
  Status        = running
  Periodic      = false
  Parameterized = false

  Summary
  Task Group  Queued  Starting  Running  Failed  Complete  Lost
  example     0       0         1        1       13        0

  Latest Deployment
  ID          = f7b6bd26
  Status      = successful
  Description = Deployment completed successfully

  Deployed
  Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
  example     1        1       1        0          2020-07-23T12:32:22+08:00

  Allocations
  ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
  ef92c6bb  09592e86  example     27       run      running  13m17s ago  12m59s ago
  ```

- Check the alloc:

  ```shell
  $ nomad alloc status ef92c6bb
  ID                  = ef92c6bb-fcd4-54a5-4253-536e33f0d2a1
  Eval ID             = 0302ca06
  Name                = hello.example[0]
  Node ID             = 09592e86
  Node Name           = client1
  Job ID              = hello
  Job Version         = 27
  Client Status       = running
  Client Description  = Tasks are running
  Desired Status      = run
  Desired Description = <none>
  Created             = 20m13s ago
  Modified            = 19m55s ago
  Deployment ID       = f7b6bd26
  Deployment Health   = healthy

  Task "server" is "running"
  Task Resources
  CPU        Memory           Disk     Addresses
  0/100 MHz  788 KiB/128 MiB  300 MiB  http: 172.30.200.189:5678

  Task Events:
  Started At     = 2020-07-23T04:22:12Z
  Finished At    = N/A
  Total Restarts = 0
  Last Restart   = N/A

  Recent Events:
  Time                       Type        Description
  2020-07-23T12:22:12+08:00  Started     Task started by client
  2020-07-23T12:22:08+08:00  Driver      Downloading image
  2020-07-23T12:22:08+08:00  Task Setup  Building Task Directory
  2020-07-23T12:22:08+08:00  Received    Task received by client
  ```

- Connect to the address listed under Addresses to see the result:

  ```shell
  $ curl http://172.30.200.189:5678/
  hello
  ```

- Check the logs:

  ```shell
  $ nomad alloc logs ef92c6bb
  2020/07/23 04:37:14 172.30.200.189:5678 172.17.0.1:38550 "GET / HTTP/1.1" 200 6 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" 6.738µs
  2020/07/23 04:37:14 172.30.200.189:5678 172.17.0.1:38550 "GET /favicon.ico HTTP/1.1" 200 6 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" 13.932µs
  ```

- Stop the job:

  ```shell
  $ nomad stop hello
  ```

## Cluster Class

![](https://i.imgur.com/YBxNFxm.png)

## References

- [Nomad Getting Started / HashiCorp Learn](https://learn.hashicorp.com/nomad/getting-started/install)
- [Glossary / Nomad](https://www.nomadproject.io/docs/internals/architecture)

## Questions

1. We previously asked SRE to scale up to ten consumers. Is changing [count](https://gitlab.kkinternal.com/kkbox/nomad-example/blob/master/example-erb/cron-batch-job-example/cron-batch-job-example.nomad.erb#L31) all it takes?
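
For reference while answering the question above, here is a minimal sketch of what scaling through `count` might look like, assuming the consumers are defined as a single task group. The job, group, and task names and the image are hypothetical placeholders, not taken from the linked repository.

```hcl
job "consumer-example" {          # hypothetical job name
  datacenters = ["dc1"]

  group "consumer" {              # hypothetical group name
    # count is the number of instances (allocations) of this task group
    # that Nomad should keep running; it defaults to 1 when omitted.
    count = 10

    task "consume" {              # hypothetical task name
      driver = "docker"

      config {
        image = "example/consumer" # placeholder image, an assumption
      }

      resources {
        memory = 128
      }
    }
  }
}
```

Under that assumption, resubmitting the job file should trigger a new evaluation, and the servers would place the extra allocations on clients with enough free resources. Whether SRE changed anything else in our case remains the open question.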