# Installation Guides

###### tags: `Installation`

## Check List

- [ ] OS
- [ ] Docker
- [ ] Kubernetes
- [ ] Dynamic volume provisioner - NFS
- [ ] Kubeflow

## OS & Machines

+ Ubuntu 18.04

1. Create a bootable USB stick
    1. Download the Ubuntu 18.04 server image from [http://releases.ubuntu.com/18.04.4/?_ga=2.18215993.1854190319.1596973908-1484461180.1596601256](http://releases.ubuntu.com/18.04.4/?_ga=2.18215993.1854190319.1596973908-1484461180.1596601256)
    2. Follow the steps in [https://ubuntu.com/tutorials/create-a-usb-stick-on-windows#1-overview](https://ubuntu.com/tutorials/create-a-usb-stick-on-windows#1-overview)
2. Install the OS
    1. Press F11 during startup to select the boot device.
    2. Follow the OS installation wizard.
3. Network: Ubuntu uses netplan to configure the network

    ```bash
    vim /etc/netplan/50-cloud-init.yaml
    ```

    ```yaml
    network:
      ethernets:
        enp4s0f3:
          addresses: [public ip/24]
          gateway4: 140.114.91.254
      version: 2
    ```

    ```bash
    netplan try
    netplan apply
    ```

4. InfiniBand (if needed)

    InfiniBand is an interconnect widely used in supercomputers. It features high throughput and low latency. We use InfiniBand cards manufactured by Mellanox.

    * Driver and system setup

      Download the latest MLNX_OFED driver ([driver download link](http://www.mellanox.com/page/mlnx_ofed_matrix?mtag=linux_sw_drivers)), decompress it, and cd into the directory.

      Before the installation proceeds, install some required libraries first.

      ```bash
      apt-get install tcl tk
      ```

      Run the install script. This might take some time.

      ```bash
      ./mlnxofedinstall
      ```

      If required, unload the old driver and load the new one.

      ```bash
      modprobe -rv ib_isert rpcrdma ib_srpt
      /etc/init.d/openibd restart
      ```

      Start opensm (the InfiniBand subnet manager).

      ```bash
      systemctl enable opensmd --now
      ```

      Start opensmd on only one node of the cluster. If multiple nodes start opensmd at the same time, it may cause problems.
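To confirm that only one node ends up running the subnet manager, a quick check (a minimal sketch assuming passwordless SSH; `node1`–`node3` are placeholder hostnames) is:

```bash
# Placeholder hostnames; substitute your own cluster nodes
for host in node1 node2 node3; do
  echo -n "$host: "
  ssh "$host" systemctl is-active opensmd
done
# Expect "active" on exactly one node and "inactive" on the others
```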
Check InfiniBand status.

```bash
ibstat
# State: Active
# Physical state: LinkUp
ibstatus
```

## Docker

#### Command
```bash
$ wget https://lsalab.cs.nthu.edu.tw/~riya/deploy/docker/install_docker.sh
$ sudo sh ./install_docker.sh
```

#### Script
```bash
#!/bin/sh
apt-get update -y && apt-get upgrade -y
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update -y
# find the version that you want to install
apt-cache madison docker-ce
sudo apt-get install docker-ce=5:19.03.12~3-0~ubuntu-bionic docker-ce-cli=5:19.03.12~3-0~ubuntu-bionic containerd.io -y
```

## Nvidia Driver

https://hackmd.io/5K7EIPWgRqynexhxJ03mDQ

## Kubernetes

### All Nodes

#### Command
```bash
$ wget https://lsalab.cs.nthu.edu.tw/~riya/deploy/kubernetes/install_k8s.sh
$ sudo sh ./install_k8s.sh
```

#### Script
```bash
#!/bin/sh
# kubelet requires swap to be disabled
swapoff -a
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
## To solve the problem of coredns crashloopbackoff
sed -i -e 's/nameserver 127.0.1.1/nameserver 8.8.8.8 8.8.4.4/i' /etc/resolv.conf
apt-get update -y #&& apt-get upgrade -y
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo deb http://apt.kubernetes.io/ kubernetes-xenial main | cat > /etc/apt/sources.list.d/kubernetes.list
apt-get update -y
apt-get install -y kubelet=1.14.10-00 kubeadm=1.14.10-00 kubectl=1.14.10-00
```

---

### Master Node

#### Command
```bash
$ sh -c "$(wget https://lsalab.cs.nthu.edu.tw/~riya/deploy/kubernetes/install_k8s_master.sh -O -)"
```

#### Script
```bash
echo "...Start master..."
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

echo "...Config kubeconfig..."
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
export KUBECONFIG=$HOME/.kube/config

echo "...Install Flannel..."
sudo sysctl net.bridge.bridge-nf-call-iptables=1
kubectl apply -f https://lsalab.cs.nthu.edu.tw/~riya/deploy/kubernetes/kube-flannel.yaml
kubectl get pods --all-namespaces
```

---

### Worker Nodes

+ Get the token to join the cluster (on the **Master** node)
```bash
$ kubeadm token create --print-join-command
```
You will get a command like the one below; run it on each worker node:
```bash
$ kubeadm join 192.168.132.144:6443 --token 42vyim.j4opt5tzazn425rj --discovery-token-ca-cert-hash sha256:49f781054a16861e48a121ea36df5cb0b72f489dc5d972e705f528774cbe6cd8
```
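After the join command has been run on every worker, a quick way to confirm the cluster is healthy (run these on the master node) is:

```bash
$ kubectl get nodes -o wide
# Every node should eventually report STATUS "Ready"
$ kubectl get pods -n kube-system
# kube-proxy, coredns, and the flannel pods should all be "Running"
```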
---

### Troubleshooting

+ **Change the hostname**
```bash
$ hostnamectl set-hostname 'new-hostname'
```
+ **Control plane node isolation**
  By default, your cluster will not schedule Pods on the control-plane node for security reasons. If you want to be able to schedule Pods on the control-plane node, for example for a single-machine Kubernetes cluster for development, run:
```bash
$ kubectl taint nodes --all node-role.kubernetes.io/master-
```
+ **Unable to connect to the server** or **error: Error loading config file**
```bash
$ mkdir -p $HOME/.kube
$ sudo cp /etc/kubernetes/admin.conf ~/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
$ export KUBECONFIG=$HOME/.kube/config
```

---

## Dynamic volume provisioner - NFS

Kubeflow needs volumes to store data; here we use NFS (choose a provisioner according to your needs).

+ Install the NFS server on the K8s master node
+ Install NFS common on the K8s worker nodes
+ Use K8s to create PVs for Kubeflow to use

```bash
# Run on the master node; the NFS server is set up on the K8s master node
# Replace xxx.xxx.xxx.xxx with your own internal IP
$ sudo apt-get update && sudo apt-get install -y nfs-server
$ sudo mkdir -p /nfs-share/kubeflow
# You can create multiple PV directories under `/nfs-share` for Kubeflow storage
$ cd /nfs-share
$ echo "/nfs-share xxx.xxx.xxx.xxx(rw,sync,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports
$ sudo /etc/init.d/nfs-kernel-server restart
# Check
$ showmount -e xxx.xxx.xxx.xxx

# Run on each worker node
$ sudo apt-get update && sudo apt-get install -y nfs-common
$ showmount -e xxx.xxx.xxx.xxx
$ mkdir -p /nfs-share/kubeflow/
$ sudo mount -t nfs xxx.xxx.xxx.xxx:/nfs-share/kubeflow/ /nfs-share/kubeflow/
# Check
$ mount | grep addr
$ sudo vim /etc/fstab
# Add the following entry so the share is mounted automatically at boot
xxx.xxx.xxx.xxx:/nfs-share/kubeflow/ /nfs-share/kubeflow/ nfs defaults 0 0
```

Manage it through K8s:

+ Create the Service Account `nfs-client-provisioner` and grant it the permissions the NFS provisioner needs
+ Deploy the NFS provisioner so that it can
    + create volumes in the NFS share
    + create PVs
    + tell the PVC that the PV is ready so the two get bound to each other
+ Create a StorageClass that uses the deployed NFS provisioner
+ Set the newly created NFS StorageClass as the default StorageClass

```bash
$ cd $HOME
$ wget https://lsalab.cs.nthu.edu.tw/~riya/deploy/deploy_nfs.tar.gz
$ tar zxvf deploy_nfs.tar.gz
$ cd deploy_nfs/

# Set the subject of the RBAC objects to your own namespace where the provisioner is being deployed
# Set the variables
$ NS=$(kubectl config get-contexts|grep -e "^\*" |awk '{print $5}')
$ NAMESPACE=${NS:-nfs}
# Replace the namespace name
$ sed -i "s/namespace:.*/namespace: $NAMESPACE/g" ./rbac.yaml ./deployment.yaml
# Replace with your own internal IP
$ export NFS_ADDRESS='xxx.xxx.xxx.xxx'
$ export NFS_DIR='\/nfs-share\/kubeflow'
$ export PROVISIONER_NAME="nthu.laslab.com\/nfs"
$ kubectl create namespace ${NAMESPACE}
$ sed -i'' "s/\${PROVISIONER_NAME}/$PROVISIONER_NAME/g" ./class.yaml ./deployment.yaml
$ sed -i'' "s/\${NFS_ADDRESS}/$NFS_ADDRESS/g" ./deployment.yaml
$ sed -i'' "s/\${NFS_DIR}/$NFS_DIR/g" ./deployment.yaml
$ kubectl apply -f ./rbac.yaml
$ kubectl apply -f ./deployment.yaml
$ kubectl apply -f ./class.yaml
$ kubectl patch storageclass managed-nfs-storage -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```

Test:

```bash
$ kubectl apply -f test-claim.yaml
$ kubectl apply -f test-pod.yaml
# check the status
$ ls /nfs-share/kubeflow/
$ kubectl get pod
```
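For reference, `test-claim.yaml` is typically just a small PVC against the `managed-nfs-storage` class. A minimal sketch (the actual file shipped in deploy_nfs.tar.gz may differ) looks like:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
```

If the provisioner works, a directory for the claim appears under `/nfs-share/kubeflow/` and the PVC switches to `Bound` (`kubectl get pvc`).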
## Kubeflow

**Note:** This section describes how to install Kubeflow in an existing Kubernetes cluster on Ubuntu hosts.

### Before You Start

+ If you need to use **GPU**s for computation:
    + Each node must have the `Nvidia driver`, `CUDA`, and `Nvidia Docker` installed
    + The K8s cluster must have the `nvidia device plugin` installed
+ Setting up Kubeflow requires a **Dynamic volume provisioner**
    + Volumes are used to store data; here we use NFS (choose according to your needs)

### Install the kfctl tool

```bash
$ wget https://lsalab.cs.nthu.edu.tw/~riya/deploy/kubernetes/kfctl_v1.0.2-0-ga476281_linux.tar.gz
# https://github.com/kubeflow/kfctl/releases/download/v1.0.2/kfctl_v1.0.2-0-ga476281_linux.tar.gz
$ tar zxvf kfctl_v1.0.2-0-ga476281_linux.tar.gz
$ sudo mv ./kfctl /usr/local/bin/kfctl
```

### Choose one of the two installation modes

+ Install Kubeflow in an existing Kubernetes cluster
+ Multi-user, auth-enabled Kubeflow with kfctl_istio_dex

### **Install Kubeflow in existing Kubernetes cluster**

#### Set environment variables

**Note: set these as environment variables, because the yaml configuration is generated from them.**

```bash
# The name of the directory that stores the configuration files generated during the Kubeflow installation
# For example, the deployment name could be 'my-kubeflow' or 'kf-test'.
$ export KF_NAME=<your choice of name for the Kubeflow deployment>
# The path of the base directory that holds the Kubeflow project
$ export BASE_DIR=<path to a base directory>
# The absolute path of the directory where Kubeflow is deployed
$ export KF_DIR=${BASE_DIR}/${KF_NAME}
$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}

$ export CONFIG_URI=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml
$ kfctl apply -V -f ${CONFIG_URI}
```

#### Check
```bash
$ kubectl -n kubeflow get all
$ kubectl port-forward svc/istio-ingressgateway -n istio-system --address xxx.xxx.xxx.xxx 8081:80
```

---

### **Multi-user, auth-enabled Kubeflow with kfctl_istio_dex**

#### **Note:**

+ **Disabling istio installation:** if Istio is already installed in the k8s cluster, you can choose not to install `istio-crds` and `istio-install` from kfctl_istio_dex.v1.0.2.yaml
+ **Istio configuration for trustworthy JWTs:** this configuration applies to Istio v1.3.1 with SDS enabled
```bash
$ sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
# Add the following under spec/containers/command
    - --feature-gates=TokenRequest=true
    - --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
    - --service-account-issuer=kubernetes.default.svc
    - --service-account-api-audiences=api
```
+ **Default password in static file configuration for Dex:**
    + The kfctl_istio_dex.v1.0.2.yaml configuration file includes a default [staticPasswords](https://github.com/dexidp/dex/blob/0f8c4db9f61476a8f80e60f5950992149a1cc0cb/examples/config-dev.yaml#L91-L95) user (see the sketch below)
        + email: `admin@kubeflow.org`
        + password: `12341234`
    + You should change this configuration or replace it with a [Dex connector](https://github.com/dexidp/dex#connectors)
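For reference, the static user entry in that file looks roughly like this (a sketch following the Dex example config linked above; the exact values in kfctl_istio_dex.v1.0.2.yaml will differ):

```yaml
staticPasswords:
- email: admin@kubeflow.org
  hash: <bcrypt hash of the password, e.g. of "12341234">
  username: admin
  userID: <any unique ID string>
```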
#### Set environment variables

**Note: set these as environment variables, because the yaml configuration is generated from them.**

```bash
# The name of the directory that stores the configuration files generated during the Kubeflow installation
# For example, the deployment name could be 'my-kubeflow' or 'kf-test'.
$ export KF_NAME=<your choice of name for the Kubeflow deployment>
# The path of the base directory that holds the Kubeflow project
$ export BASE_DIR=<path to a base directory>
# The absolute path of the directory where Kubeflow is deployed
$ export KF_DIR=${BASE_DIR}/${KF_NAME}
$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}

$ export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_istio_dex.v1.0.2.yaml"
$ wget -O kfctl_istio_dex.yaml $CONFIG_URI
$ export CONFIG_FILE=${KF_DIR}/kfctl_istio_dex.yaml

# The default user account and password are admin@kubeflow.org:12341234
# Change them through the configuration file
# Password hashing: https://passwordhashing.com/BCrypt
# If clusterName is missing from metadata in the config file, add it manually
# The clusterName can be found in ~/.kube/config
$ vim $CONFIG_FILE

$ sudo kfctl apply -V -f ${CONFIG_FILE}
```

![](https://i.imgur.com/jxGnkeI.png)

#### Check
```bash
$ kubectl -n kubeflow get all
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8081:80
```

#### Add users
```bash
# Get the configuration file currently used by dex
$ kubectl get configmap dex -n auth -o jsonpath='{.data.config\.yaml}' > dex-config.yaml
# The default user account and password are admin@kubeflow.org:12341234
# Change them through the configuration file
# Password hashing: https://passwordhashing.com/BCrypt
# Add the new users
$ vim dex-config.yaml
# Update the ConfigMap
$ kubectl create configmap dex --from-file=config.yaml=dex-config.yaml -n auth --dry-run -oyaml | kubectl apply -f -
# Restart dex
$ kubectl rollout restart deployment dex -n auth
# If rollout restart is not available in your version, delete the dex pod in the auth namespace directly so the deployment starts a new one
```

![](https://i.imgur.com/Sz7xuQc.png)
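If you prefer to generate the bcrypt hash locally instead of using the website mentioned above, one option (a sketch assuming the Python `bcrypt` package; `mypassword` is a placeholder) is:

```bash
# Requires the bcrypt package: pip3 install bcrypt
$ python3 -c 'import bcrypt; print(bcrypt.hashpw(b"mypassword", bcrypt.gensalt(rounds=10)).decode())'
```

Paste the resulting hash into the `hash` field of the new user's staticPasswords entry in `dex-config.yaml`.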