---
title: 'Digital Server Note'
disqus: hackmd
---
# Digital Server Note
---
**Update:** 2024/04/01 (ROC year 113)
---
[TOC]
---
# 兆軒科技 (Vendor Partner)
| Engineer | Email | Mobile |
| --------------- | ------------------- | ------------ |
| Tony (primary contact) | tony@joinet.com.tw | 0936-132-299 |
| Denny (lead engineer) | denny@joinet.com.tw | |
| Mike | mike@joinet.com.tw | |
# Remote Access (AnyDesk)
* **iMDL3 (140.114.93.63):** 1598132329
* **iMDL4 (140.114.93.64):** 298409107
* **iMDL5 (140.114.93.65):** 360415744
# System Information
## iMDL3 (140.114.93.63)
* CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
* GPU
| PCIE | Model | RAM |
| ---- | ----------- | ---- |
| 0 | RTX A6000 | 48GB |
| 1 | Tesla V100 | 32GB |
| 2 | Tesla V100S | 32GB |
| 3 | RTX 2080ti | 11GB |
| 4 | RTX 2080ti | 11GB |
| 5 | RTX 2080ti | 11GB |
| 6 | RTX 2080ti | 11GB |
| 7 | RTX 2080ti | 11GB |
## iMDL4 (140.114.93.64)
* CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
* GPU
| PCIE | Model | RAM |
| ---- | ----------- | ---- |
| 0 | RTX 2080ti | 11GB |
| 1 | RTX 2080ti | 11GB |
| 2 | RTX 2080ti | 11GB |
| 3 | RTX 2080ti | 11GB |
| 4 | RTX 2080ti | 11GB |
| 5 | RTX 2080ti | 11GB |
| 6 | RTX 2080ti | 11GB |
| 7 | RTX 2080ti | 11GB |
## iMDL5 (140.114.93.65)
* CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
* GPU
| PCIE | Model | RAM |
| ---- | ----------- | ---- |
| 0 | RTX 3090 | 24GB |
| 1 | RTX 3090 | 24GB |
| 2 | RTX 3090 | 24GB |
| 3 | RTX 3090 | 24GB |
| 4 | RTX TITAN | 24GB |
| 5 | RTX TITAN | 24GB |
| 6 | RTX TITAN | 24GB |
| 7 | RTX TITAN | 24GB |
---
# System Update
## System Packages
```gherkin=
sudo apt update
sudo apt upgrade
```
## GPU Driver
* Because our servers mix GPUs from different generations, :warning: <font color="#f00">updating the GPU driver with the method commonly found online</font> :warning: is buggy.
```gherkin=
sudo apt install nvidia-driver-XXX
```
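* Update from the GUI
* Open Software & Updates.

* Do not install versions labeled open-kernel or tested. The server versions have an update cycle and support period one year longer than the regular versions, so choose them for stability.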

# File System
## iMDL3 (140.114.93.63) (NFS Server)
```gherkin=
/ root
+- home # <SSD> <system> do not touch
+- home1 # <SSD> <RAID10> <nfs> default NFS home after login (imdl3)
| +- dataset
| +- user1
| +- user2
| +- ...
```
## iMDL4 (140.114.93.64)
```gherkin=
/ root
+- home # <SSD> <system> do not touch
+- home1 # <SSD> <RAID10> <nfs> default NFS home after login (imdl3)
| +- dataset # <nfs> dataset storage
| +- user1
| +- user2
| +- ...
+- dataset # <SSD> <RAID0> <local> dataset storage
| +- dir_a
| +- dir_b
| +- ...
```
## iMDL5 (140.114.93.65)
```gherkin=
/ root
+- home # <SSD> <system> do not touch
+- home1 # <SSD> <RAID10> <nfs> default NFS home after login (imdl3)
| +- dataset # <nfs> dataset storage
| +- user1
| +- user2
| +- ...
+- dataset # <SSD> <RAID0> <local> dataset storage
| +- dir_a
| +- dir_b
| +- ...
```
Use the following command to check disk usage.
```gherkin=
df -h
```
# SSH
:warning: <font color="#f00">First step after reinstalling the system</font> :warning:
## Install Packages
```gherkin=
sudo apt-get update
sudo apt-get install openssh-server
```
## Start the SSH Service
```gherkin=
sudo systemctl start ssh
```
## Start the SSH Service Automatically at Boot
```gherkin=
sudo systemctl enable ssh
```
## Check the SSH Service Status
```gherkin=
sudo systemctl status ssh
```

---
# Uncomplicated Firewall (UFW)
## List the IPs Allowed by UFW
```gherkin=
sudo ufw status numbered
```

## Add an IP to the UFW Whitelist
```gherkin=
sudo ufw allow from <IP>
```
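As a rough sketch of how the whitelist command could be applied to several hosts at once, the helper below only prints the commands so they can be reviewed before running them. The function name `ufw_allow_cmds` is hypothetical, and whitelisting the three lab servers for each other is an assumption, not something this note confirms.

```shell
# Print (do not run) one "ufw allow" command per IP address.
# Usage: ufw_allow_cmds <ip> [<ip> ...]
ufw_allow_cmds() {
  for ip in "$@"; do
    echo "sudo ufw allow from $ip"
  done
}

# Assumption: the three servers should be able to reach each other.
ufw_allow_cmds 140.114.93.63 140.114.93.64 140.114.93.65
```

Printing first and pasting afterwards keeps a typo in an address from silently opening the firewall to the wrong host.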

## Remove an IP from the UFW Whitelist
```gherkin=
sudo ufw delete <rule_number>
```

## Apply Changes
```gherkin=
sudo ufw enable
```

## Other
```gherkin=
sudo ufw help
```

---
# RAID
:warning: <font color="#f00">If you are unsure how to configure this, ask Tony to send an engineer</font> :warning:
## Hard RAID (Only on iMDL3)
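* Only .63 (iMDL3), which acts as the NFS server, has a hardware RAID card installed. It must be configured in the BIOS.
* Unless a <font color="#f00">major incident</font> occurs (system reinstall or disk failure), it is normally left untouched.
* If it needs to be reconfigured and you do not know how, engineers from <font color="#f00">兆軒科技</font> can handle this part.
* The current configuration is RAID-10.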
## Soft RAID
### Step 1. Build the RAID Array
```gherkin=
sudo mdadm --verbose --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
```
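* **create + <RAID name>**: the array device name, usually /dev/md{?}
* **level=[?]**: RAID level (e.g. 0, 1, 5, 10). The soft RAID on our servers is mostly used for dataset storage, where losing data is acceptable, so we usually build RAID-0 for the best performance.
* **raid-devices=[?] + disk name**: the number of disks in the RAID, followed by their disk names.
* Command to look up disk names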
```gherkin=
sudo blkid
```
* Or look them up in the GUI
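The `--raid-devices` count has to match the number of disks listed after it. A minimal sketch, assuming a hypothetical helper named `build_mdadm_cmd`, derives the count from the disk list and prints the command for review instead of running it:

```shell
# Print (do not run) the mdadm command for a RAID-0 array.
# Usage: build_mdadm_cmd <array> <disk> [<disk> ...]
build_mdadm_cmd() {
  array=$1
  shift
  # After the shift, $# is the number of member disks,
  # so --raid-devices always matches the list that follows.
  echo "sudo mdadm --verbose --create $array --level=0 --raid-devices=$# $*"
}

build_mdadm_cmd /dev/md0 /dev/sdb /dev/sdc /dev/sdd
```

With the three disks from Step 1 this prints the same command shown there.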

### Step 2. Format the RAID
* Via command
```gherkin=
sudo mkfs.ext4 /dev/md0
```
* Via the GUI (suggested)

### Step 3. Check the RAID Status
```gherkin=
sudo mdadm --detail /dev/md0
```

### Step 4. Mount the RAID on a Target Directory
* Create the directory to mount the RAID on
* For example, create dataset under the root directory
```gherkin=
sudo mkdir /dataset
```
* Change its permissions so that everyone can access it
```gherkin=
sudo chmod 777 -R /dataset
sudo setfacl -R -m d:u::rwx /dataset
sudo setfacl -R -m d:g::rwx /dataset
sudo setfacl -R -m d:o::rwx /dataset
```
* Look up the RAID UUID
* Via command
```gherkin=
sudo blkid
```

* Via the GUI

* Edit the system file (to mount automatically at boot)
```gherkin=
sudo vim /etc/fstab
```

* Mount manually (without rebooting)
```gherkin=
sudo mount -a
```
* Check that the mount succeeded
```gherkin=
df -h
```
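The `/etc/fstab` entry added in Step 4 is not shown in this note. As a hedged sketch, the hypothetical helper `fstab_line` prints a line in the standard six-field format for an ext4 volume identified by UUID. The UUID below is a placeholder, and the `defaults 0 2` options are an assumption rather than the servers' actual settings.

```shell
# Print the /etc/fstab line for an ext4 volume identified by UUID.
# Usage: fstab_line <uuid> <mount_point>
fstab_line() {
  # Fields: device, mount point, filesystem type, options, dump, fsck pass
  printf 'UUID=%s %s ext4 defaults 0 2\n' "$1" "$2"
}

# Placeholder UUID: substitute the value that `sudo blkid` reports for /dev/md0.
fstab_line 00000000-0000-0000-0000-000000000000 /dataset
```

Adding `nofail` to the options (`defaults,nofail`) lets the machine finish booting even if the array is missing, which may suit a RAID-0 volume whose data is disposable.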

---
# NFS (Server)
:warning: <font color="#f00">If you are unsure how to configure this, ask Tony to send an engineer</font> :warning:
## Install Packages
```gherkin=
sudo apt update
sudo apt install nfs-kernel-server
```
## Create the NFS Directory
```gherkin=
sudo mkdir /home1
```
## Edit the System File
```gherkin=
sudo vim /etc/exports
```

imdl4 and imdl5 are aliases for 140.114.93.64 and 140.114.93.65. Use the following command to view or change them.
```gherkin=
sudo vim /etc/hosts
```
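The contents of `/etc/exports` are not reproduced in this note. A minimal sketch of what the entry for `/home1` might look like is generated below by a hypothetical helper, `exports_line`. The option set `rw,sync,no_subtree_check` is a common choice and an assumption here, not the servers' confirmed configuration.

```shell
# Print one /etc/exports line that shares a directory with the given clients.
# Usage: exports_line <dir> <client> [<client> ...]
exports_line() {
  dir=$1
  shift
  line=$dir
  for client in "$@"; do
    # rw: read/write access, sync: commit writes before replying,
    # no_subtree_check: disable subtree checking
    line="$line $client(rw,sync,no_subtree_check)"
  done
  echo "$line"
}

exports_line /home1 imdl4 imdl5
```

There must be no space between a client name and its option list; a space there would apply the options to all hosts instead of that client.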

## Apply Changes
```gherkin=
sudo exportfs -a
```
## Start the NFS Service
```gherkin=
sudo systemctl start nfs-kernel-server
```
## Start the NFS Service Automatically at Boot
```gherkin=
sudo systemctl enable nfs-kernel-server
```
## Check the NFS Status
```gherkin=
sudo systemctl status nfs-kernel-server
```

---
# NFS (Client)
:warning: <font color="#f00">If you are unsure how to configure this, ask Tony to send an engineer</font> :warning:
## Install Packages
```gherkin=
sudo apt-get update
sudo apt-get install nfs-common
```
## Create the NFS Directory (Same Name as on the NFS Server)
```gherkin=
sudo mkdir /home1
```
## Edit the System File (Mount Automatically at Boot)
```gherkin=
sudo vim /etc/fstab
```

imdl3 is an alias for 140.114.93.63. Use the following command to view or change it.
```gherkin=
sudo vim /etc/hosts
```
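The client-side `/etc/fstab` entry is not shown in this note. As a sketch under that caveat, the hypothetical helper `nfs_fstab_line` prints an NFS mount line in the standard `server:/export mountpoint nfs options dump pass` format. The `defaults 0 0` fields are an assumption.

```shell
# Print the /etc/fstab line that mounts an NFS export.
# Usage: nfs_fstab_line <server> <exported_dir> <mount_point>
nfs_fstab_line() {
  # dump and fsck pass are 0 because fsck does not apply to network filesystems
  printf '%s:%s %s nfs defaults 0 0\n' "$1" "$2" "$3"
}

# imdl3 is the /etc/hosts alias of the NFS server described above.
nfs_fstab_line imdl3 /home1 /home1
```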
## Mount Manually (Without Rebooting)
```gherkin=
sudo mount -a
```
## Check That the Mount Succeeded
```gherkin=
df -h
```

---
# NIS (Server)
:warning: <font color="#f00">If you are unsure how to configure this, ask Tony to send an engineer</font> :warning:
## Install Packages
```gherkin=
sudo apt update
sudo apt install nis
```
## Configure the NIS System Files
* Set whether this machine is an NIS server
```gherkin=
sudo vim /etc/default/nis
```

```gherkin=
sudo vim /etc/yp.conf
```

* Configure NIS security (optional)
```gherkin=
sudo vim /etc/ypserv.securenets
```

## Initialize the NIS Configuration
```gherkin=
sudo ypinit -m
```
## Start the NIS Service
```gherkin=
sudo systemctl start rpcbind
sudo systemctl start ypserv
```
## Start the NIS Service Automatically at Boot
```gherkin=
sudo systemctl enable rpcbind
sudo systemctl enable ypserv
```
## Check the NIS Status
```gherkin=
sudo systemctl status ypserv
```

---
# NIS (Client)
:warning: <font color="#f00">If you are unsure how to configure this, ask Tony to send an engineer</font> :warning:
## Install Packages
```gherkin=
sudo apt update
sudo apt install nis
```
## Configure the NIS System Files
* Set whether this machine is an NIS server
```gherkin=
sudo vim /etc/default/nis
```

```gherkin=
sudo vim /etc/yp.conf
```

* Configure NIS security (optional)
```gherkin=
sudo vim /etc/ypserv.securenets
```
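The contents of the client's `/etc/yp.conf` are not shown in this note. A minimal sketch, assuming the client is pointed at a single server with the `ypserver` directive, is printed by the hypothetical helper `yp_conf_line`. Using imdl3 as the NIS server is inferred from the account-creation section, not stated outright.

```shell
# Print the /etc/yp.conf line that points a client at one NIS server.
# Usage: yp_conf_line <server>
yp_conf_line() {
  printf 'ypserver %s\n' "$1"
}

# Assumption: imdl3 (the /etc/hosts alias used in this note) is the NIS server.
yp_conf_line imdl3
```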

## Initialize the NIS Configuration
```gherkin=
sudo ypinit -m
```
## Start the NIS Service
```gherkin=
sudo systemctl start ypbind
```
## Check Whether the NIS Client Is Bound to the Server
```gherkin=
sudo systemctl status ypbind
```

## Check the NIS Status
```gherkin=
sudo systemctl status ypbind
```

# NIS Account Create (Only on iMDL3)
:warning: <font color="#f00">Do not use Ubuntu's native method of creating new accounts</font> :warning:
```gherkin=
sudo useradd -u 1013 newbie
```
:warning: <font color="#f00">It is best to specify the uid for every new account</font> :warning:
```gherkin=
sudo cat /etc/passwd
```

The numbers in red are the uids currently in use; the uid of the next new account should be the largest <font color="#f00">uid + 1</font>.
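Rather than reading `/etc/passwd` by eye, the next uid can be computed. The sketch below uses a hypothetical helper, `next_uid`. It assumes regular accounts start at uid 1000 and skips 65534, which Ubuntu reserves for `nobody`.

```shell
# Print the next free uid: the largest regular uid in a passwd-format file, plus one.
# Usage: next_uid [passwd_file]   (defaults to /etc/passwd)
next_uid() {
  # Field 3 of each colon-separated line is the uid.
  # Assumption: regular accounts start at 1000; 65534 (nobody) is ignored.
  awk -F: 'BEGIN { max = 999 }
           $3 >= 1000 && $3 < 65534 && $3 > max { max = $3 }
           END { print max + 1 }' "${1:-/etc/passwd}"
}

next_uid
```

Run it on iMDL3, where the accounts are created, and pass the result to `--uid`.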
## Create an NIS Account
* Set the home directory /home1/{username} and the uid
```gherkin=
sudo adduser --home /home1/newbie --uid 1013 newbie
```

* Run the command that applies the change
```gherkin=
sudo make -C /var/yp
```
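After rebuilding the NIS maps, it can be worth confirming that the account has the expected uid and home directory. A minimal sketch, assuming a hypothetical helper named `check_account` that reads passwd-format lines on standard input:

```shell
# Succeed when the account <name> has uid <uid> and home directory /home1/<name>.
# Usage: <passwd-format lines> | check_account <name> <uid>
check_account() {
  awk -F: -v n="$1" -v u="$2" '
    $1 == n { found = 1; ok = ($3 == u && $6 == "/home1/" n) }
    END { exit !(found && ok) }'
}

# getent queries the configured account databases; newbie and 1013 are the
# example values used in this note.
if getent passwd newbie | check_account newbie 1013; then
  echo "newbie looks correct"
else
  echo "newbie is missing or has an unexpected uid or home directory"
fi
```

Running the same check on iMDL4 and iMDL5 also shows whether the clients see the new account.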

## Log In to the New Account and Set Up the Anaconda Environment
```gherkin=
vim .bashrc
```
Append the following code at the end of the file.
```gherkin=
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home1/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home1/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home1/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home1/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate newbie
```
## Log In Again and Confirm You Are in the Base Environment
* Before

* After

## Create a Personal Default Environment
```gherkin=
conda create -n newbie python=3.8
```
## Log In Again and Confirm You Are in Your Personal Default Environment
* Before

* After

## Delete an NIS Account
* Delete the specified account.
```gherkin=
sudo userdel newbie
```
* Delete the specified account and its home directory. :warning: <font color="#f00">Use with caution</font> :warning:
```gherkin=
sudo userdel -r newbie
```
* Run the command that applies the change
```gherkin=
sudo make -C /var/yp
```

---
# Clean Cache
Sometimes users terminate programs abnormally, which makes swap usage balloon and ties up memory.
If the system feels sluggish, check SWP with nvitop.
```gherkin=
nvitop
```

Use the following script to clear the cache. :warning: <font color="#f00">Make sure nobody is using the server first</font> :warning:
```gherkin=
sudo /home1/clear.sh
```
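The contents of `/home1/clear.sh` are not shown in this note. As a sketch only, the hypothetical helper `swap_used_kb` reports how much swap is in use, and the trailing comments list the kind of commands such a script commonly runs. Those commands are an assumption about the script, not a copy of it.

```shell
# Print the swap currently in use, in kB, from a meminfo-format file.
# Usage: swap_used_kb [meminfo_file]   (defaults to /proc/meminfo)
swap_used_kb() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END { print total - free }' "${1:-/proc/meminfo}"
}

echo "swap in use: $(swap_used_kb) kB"

# Assumption: a cache-clearing script typically runs something like the
# following as root, and only while the machine is idle:
#   sync && echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
#   swapoff -a && swapon -a                     # pull swapped pages back into RAM, then re-enable swap
```

Checking the number before and after running the script shows whether it had any effect.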
---
# Useful Tool
## Nvitop (System Resource Monitor)

```gherkin=
pip install nvitop
```
## Anaconda with Mamba

https://github.com/mamba-org/mamba?tab=readme-ov-file
Native conda can be slow at installing or searching for packages; use mamba in place of the conda command.
```gherkin=
conda install mamba -c conda-forge
```
After installing mamba, simply replace conda with mamba in every command.
* Before
```gherkin=
conda install <package> -c <channel>
```
* After
```gherkin=
mamba install <package> -c <channel>
```
---
# Anaconda3 Tutorial (Basic)
## Check Anaconda3
```gherkin=
conda --version
```

## Create New Environment
```gherkin=
conda create -n <env_name> python=<version>
conda create -n jack python=3.7.9
```

* Type "y"

## List Environment
```gherkin=
conda env list
```

## Activate Environment
```gherkin=
conda activate <env_name>
conda activate jack
```

* If you want to activate your environment automatically every time you log in
* Add "**conda activate <env_name>**" in "**~/.bashrc**"
* You can use command:
```gherkin=
echo "conda activate <env_name>" >> ~/.bashrc
echo "conda activate jack" >> ~/.bashrc
```
* You will see the command added in the last line of **.bashrc**

* **<font color="#f00">Do not modify the colored code; it is the environment setup for Anaconda3</font>**
## Install Packages
* Use conda command
```gherkin=
conda install <pkg_name>
```
* Conda packages can be easily searched
```gherkin=
conda search <pkg_name>
conda search cudnn
```

```gherkin=
conda install cudnn=<version>=<build>
conda install cudnn=7.6.5
# or conda install cudnn=7.6.5=cuda10.0_0 <for specific node>
```
* Or use the pip command as usual (the --user flag is not needed)
```gherkin=
pip install <pkg_name>
```
* Neither the conda nor the pip command will change the system's configuration <font color="#f00">as long as</font> you are in your environment.
## List Packages
```gherkin=
conda list
```

## Upgrade Packages
* Upgrade all packages
```gherkin=
conda upgrade --all
```
* Upgrade specific package
```gherkin=
conda upgrade <pkg_name>
conda upgrade cudnn
```

## Remove Packages
```gherkin=
conda uninstall <pkg_name>
conda uninstall cudnn
```

## Deactivate Environment
```gherkin=
conda deactivate
```

## Remove Environment
```gherkin=
conda env remove -n <env_name>
conda env remove -n jack
```

---
# Anaconda3 Tutorial (Advanced)
## Create Environment from File
* This is useful and important because you can easily manage, duplicate, and share your environment. Typically, you will find the **.yml** file in GitHub repositories, which is the setup file for the Anaconda3 environment.
Use the command to export your environment.
```gherkin=
conda env export > environment.yml
```
The contents of the file may look like this:
```gherkin=
name: jack
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - cudatoolkit=10.0.130=0
  - cudnn=7.6.5=cuda10.0_0
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.2.1=hf484d3e_1007
  - libgcc-ng=9.1.0=hdf63c60_0
  - libllvm10=10.0.1=hbcb73fb_5
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - llvmlite=0.35.0=py36h612dafd_4
  - pip=21.0.1=py36h06a4308_0
  - python=3.6.9=h265db76_0
  - pip:
    - keras-applications==1.0.8
    - keras-preprocessing==1.1.2
    - numpy==1.19.5
    - opt-einsum==3.3.0
    - pandas==1.1.5
    - scipy==1.5.4
    - tensorboard==1.15.0
    - tensorflow-compression==1.3
    - tensorflow-estimator==1.15.1
    - tensorflow-gpu==1.15.0
# prefix: /home/user/anaconda3/envs/
```
You can use the command to set up the environment.
```gherkin=
conda env create -f <filename>.yml
conda env create -f env.yml
```
**Flag explanations**:
* **name**: environment name
* **channels**: the channels searched for packages
* **defaults**: source provided by <font color="#f00">official team</font> (more robust)
* **conda-forge**: source provided by <font color="#f00"> community</font> (more powerful)
* **dependencies**: conda packages
* **pip**: the packages under the pip flag will be installed by pip rather than conda. <font color="#f00">The pip flag is optional</font>. GitHub repositories usually include a **requirements.txt** file that lists the pip packages the project requires. The contents of the file may look like this:
```gherkin=
keras-applications==1.0.8
keras-preprocessing==1.1.2
numpy==1.19.5
opt-einsum==3.3.0
pandas==1.1.5
scipy==1.5.4
tensorboard==1.15.0
tensorflow-compression==1.3
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.0
```
You can use the command to install pip packages after setting up the environment.
```gherkin=
pip install -r requirements.txt
```
* **prefix**: environment path (better to comment it out)
---
### CUDA Toolkit Installation
Sometimes, different projects require different versions of PyTorch. However, each PyTorch version has its own required packages such as torchvision, torchaudio, pytorch-cuda, or drivers such as cudatoolkit and cudnn. These are usually managed by the server manager and installed system-wide for all users. Upgrading or downgrading PyTorch packages can have catastrophic impacts on all users. Thanks to Anaconda3, users can change the PyTorch and CUDA version under a virtual environment.
The CUDA version reported from <font color="#f00">**nvidia-smi**</font> refers to the highest version supported by that driver. Do not install a version higher than the system supports.
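The comparison between the wanted toolkit version and the driver's maximum can be scripted. This is a minimal sketch with a hypothetical helper, `version_le`, which relies on `sort -V` from GNU coreutils. The value of `driver_max` is a made-up example, not a reading from these servers.

```shell
# Succeed when version $1 is less than or equal to version $2.
# Usage: version_le <version_a> <version_b>
version_le() {
  # sort -V compares version strings component by component (GNU coreutils),
  # so the smaller of the two versions comes out on the first line.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

driver_max=12.2   # hypothetical: the CUDA version shown in the nvidia-smi header
wanted=11.1.1     # the cudatoolkit version you intend to install

if version_le "$wanted" "$driver_max"; then
  echo "cudatoolkit $wanted is supported by the driver"
else
  echo "cudatoolkit $wanted is newer than the driver supports"
fi
```

A plain string comparison would rank 11.10 below 11.8, which is why a version-aware sort is used here.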

Use conda search to find a version that meets the requirements.
```gherkin=
conda search cudatoolkit -c <channel>
# <channel> = conda-forge, bioconda, anaconda, pytorch
conda install -c conda-forge cudatoolkit=11.1.1
```
Usually cudnn will be installed along with it, and the installation of cudnn can be skipped in the next step.

---
### CUDNN Installation (skip if installed)
After the CUDA toolkit is installed, you need to install the corresponding cuDNN library.

Use conda search to find a version that meets the requirements.
```gherkin=
conda search cudnn -c <channel>
# <channel> = conda-forge, bioconda, anaconda, pytorch
```
## Install Specific PyTorch Version
The PyTorch version can really affect the performance. Perhaps it's time to upgrade to the new PyTorch version on our server.
https://wandb.ai/gladiator/PyTorch%202.0%20Benchmarks%20v2/reports/Is-PyTorch-2-0-Faster-Than-PyTorch-1-13---VmlldzozNDA2MDQz
Before installing, make sure to check the Python version.

### PyTorch Installation
**Newest version**:
* Go to https://pytorch.org/ and search for the required PyTorch version

* Remember to activate your environment
* Copy & paste the command
**Previous version**:
* Go to https://pytorch.org/get-started/previous-versions/ and search for the required PyTorch version


...
* Remember to activate your environment
* Copy & paste the command
**Check version**:
```gherkin=
python3
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
exit()
```
