# Remove and Rejoin a Node on a Proxmox Cluster

## Scenario

If the IP address or hostname of every node in the cluster was changed directly, the nodes lose contact with each other. If `/etc/pve/corosync.conf` then cannot be modified at all (the cluster file system is read-only without quorum), follow the steps below to recover.

## 1. Remove all other nodes in the cluster from corosync

```bash!
# 1. SSH into the first node
$ ssh root@p1

# 2. Start the pve filesystem in local mode:
$ systemctl stop corosync pve-cluster
$ pmxcfs -l

# 3. Write the corosync configuration to a newly created file:
## Keep only this node's own entry in the nodelist
## Increment totem.config_version by 1
$ nano correct_corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: p1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.37.50
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: topgun
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

# 4. Copy the new corosync configuration over the previous one
$ cp correct_corosync.conf /etc/pve/corosync.conf
$ cp correct_corosync.conf /etc/corosync/corosync.conf

# 5. Exit the local mode pve filesystem:
$ killall pmxcfs

# 6. Restart the services:
$ systemctl start pve-cluster corosync

# 7. Remove the directories of the other cluster nodes
$ rm -rf /etc/pve/nodes/{p2,p3,p4,p5}
```

## 2. Separate a Node Without Reinstalling

```bash!
# 1. First, stop the corosync and pve-cluster services on the node:
$ systemctl stop pve-cluster corosync

# 2. Start the cluster file system again in local mode:
$ pmxcfs -l
[main] notice: resolved node name 'p2' to '192.168.37.51' for default node IP address
[main] notice: forcing local mode (although corosync.conf exists)

# 3. Delete the corosync configuration files:
$ rm /etc/pve/corosync.conf && \
  rm -r /etc/corosync/*

# 4. You can now start the file system again as a normal service:
$ killall pmxcfs
$ systemctl start pve-cluster

# 5. The node is now separated from the cluster. You can delete it
#    from any remaining node of the cluster with:
$ pvecm delnode oldnode

# 6. If the command fails due to a loss of quorum in the remaining node,
#    you can set the expected votes to 1 as a workaround:
$ pvecm expected 1

# 7. Then repeat the pvecm delnode command.

# 8. Now switch back to the separated node and delete all the remaining
#    cluster files on it. This ensures that the node can be added to
#    another cluster again without problems.
$ rm /var/lib/corosync/*

# 9. The configuration files of the other nodes are still in the cluster
#    file system, so you may want to clean those up too. After making
#    absolutely sure that you have the correct node name, you can remove
#    the entire directory recursively from /etc/pve/nodes/NODENAME
## Back up first
$ cp -r /etc/pve/nodes /root
$ rm -rf /etc/pve/nodes/*

# 10. Check that the network interface configuration is as expected
$ cat /etc/network/interfaces

# 11. Check that /etc/hosts is correct
$ cat /etc/hosts

# 12. Reboot the node
$ reboot
```

## 3. Rejoin the Node to the Proxmox Cluster

```bash!
# 1. Confirm that there are no failed services
$ systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.
```

### Join Node to Cluster via GUI

Log in to the web interface on an existing cluster node. Under **Datacenter** → **Cluster**, click the **Join Information** button at the top. Then click the **Copy Information** button. Alternatively, copy the string from the Information field manually.

![image](https://hackmd.io/_uploads/ry5B6WHU0.png)

![image](https://hackmd.io/_uploads/BkBwT-HI0.png)

Next, log in to the web interface on the node you want to add. Under **Datacenter** → **Cluster**, click **Join Cluster**. Fill in the Information field with the Join Information text you copied earlier. Most settings required for joining the cluster will be filled out automatically. For security reasons, the cluster password has to be entered manually.

![image](https://hackmd.io/_uploads/HJx2a-SLC.png)

Put the virtual machine configuration files back into their directory:

```bash!
$ cp /root/nodes/$(hostname)/qemu-server/* /etc/pve/nodes/$(hostname)/qemu-server
```

> If the hostname was changed, copy the configuration files of the old node into the directory of the new hostname. For example, if the hostname changed from p2 to p3, the command becomes:
> ```!
> cp /root/nodes/p2/qemu-server/* /etc/pve/nodes/$(hostname)/qemu-server
> ```
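
The copy step above can be wrapped in a small helper that takes the old node name as a parameter and refuses to overwrite configs already present in the destination. This is a minimal sketch, not a Proxmox tool: `restore_vm_configs` and its arguments are hypothetical names, and it assumes the backup made earlier with `cp -r /etc/pve/nodes /root` and that VM configs are named `<vmid>.conf`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Proxmox): copy backed-up VM configs into
# a qemu-server directory without overwriting files that already exist there.
#   $1 = backup root (assumed: /root/nodes)
#   $2 = node name the backup was taken under (the OLD hostname if it changed)
#   $3 = destination qemu-server directory
restore_vm_configs() {
  local backup_root=$1 old_node=$2 dest=$3
  local src="$backup_root/$old_node/qemu-server"
  local conf name

  [ -d "$src" ]  || { echo "missing backup directory: $src" >&2; return 1; }
  [ -d "$dest" ] || { echo "missing destination directory: $dest" >&2; return 1; }

  for conf in "$src"/*.conf; do
    [ -e "$conf" ] || continue   # the glob matched nothing
    name=$(basename "$conf")
    if [ -e "$dest/$name" ]; then
      echo "skip $name (already exists in $dest)"
    else
      cp "$conf" "$dest/$name" && echo "restored $name"
    fi
  done
}

# Example, run on the node that was renamed from p2:
#   restore_vm_configs /root/nodes p2 "/etc/pve/nodes/$(hostname)/qemu-server"
```

If the hostname did not change, pass `$(hostname)` as the second argument instead of the old name.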