---
date-created: 20230606
alias: ""
tags:
  - k8s
  - public
  - script
  - bash
description: |
  ""
version: 0.0.1
author: "Tiriyon Continnuum"
---

# OKD/OpenShift Node service overload alerting script

When systemd has around 128k services in its list, it hangs and the system freezes. Due to a known bug in OKD 4.7, systemd does not clear its list, and service unit entries remain even after the services are stopped or terminated. The list eventually overflows and freezes the node, as described above.

This script is meant to run periodically from crontab. It monitors the size of the systemd unit list on every node and uses `mutt` to send an alert email to the relevant personnel.

```bash=
#!/bin/bash

clusters=(
  "/path/to/cluster1/auth/kubeconfig"
  "/path/to/cluster2/auth/kubeconfig"
  "/path/to/cluster3/auth/kubeconfig"
  "/path/to/cluster4/auth/kubeconfig"
)

export PATH="/usr/local/bin:$PATH"
OC_PATH="/usr/local/bin/oc"
MUTT_PATH="/usr/bin/mutt"

# Alert threshold, matching the 100,000 figure used in the alert text
threshold=100000

logfile_node_ips="/home/user/adirectory/nodeconnected.log"
log_file="/home/user/adirectory/serviceCountOverClusters.log"

for cluster in "${clusters[@]}"; do
  echo "Running for cluster: $cluster"

  # Set KUBECONFIG to the cluster's configuration file
  export KUBECONFIG="$cluster"

  # Extract the cluster name from the kubeconfig path
  cluster_name=$(echo "$cluster" | awk -F'/' '{print $(NF-2)}')

  # Reset the map so nodes from the previous cluster are not checked again
  unset node_ips
  declare -A node_ips

  # Get node name and IP
  while read -r line; do
    # Skip empty input (for example when oc returns nothing)
    [[ -z "$line" ]] && continue
    name=$(echo "$line" | awk '{print $1}')
    # Column 7 of `oc get nodes -o wide`; in the default layout this is
    # EXTERNAL-IP and column 6 is INTERNAL-IP, so adjust to your cluster
    ip=$(echo "$line" | awk '{print $7}')
    node_ips["$name"]=$ip
    # Echo the entry and append it to the node log
    echo "$(date +"%Y-%m-%d %H:%M:%S") - $name is in list with ip: $ip" | tee -a "$logfile_node_ips"
  done <<< "$($OC_PATH get nodes -o wide | awk 'NR>1{print}')"

  # Log nodes and their unit statuses
  for name in "${!node_ips[@]}"; do
    ip="${node_ips[$name]}"

    # Execute the commands via ssh on each node
    sysoutput=$(ssh -q -o "StrictHostKeyChecking no" core@"$ip" "sudo systemctl list-units --all")
    inactive=$(echo "$sysoutput" | grep -wc "inactive")
    active=$(echo "$sysoutput" | grep -wc "active")
    total=$((inactive + active))

    # Log the output for each node
    log_entry="$(date +"%Y-%m-%d %H:%M:%S") - $name: Number of inactive svcs: $inactive, Number of active svcs: $active, Total number of svcs: $total"
    echo "$log_entry" >> "$log_file"

    # Check if total services exceed the threshold for alert
    if (( total > threshold )); then
      # Create the HTML formatted alert email with gruvbox theme
      temp_alert_file="/tmp/temp_alert.html"
      alert_title="<span class='title'>Node $name in cluster $cluster_name has reached over 100,000 services and requires manual intervention</span>"

      # The heredoc body and its closing EOF must start at column 0
      cat >"$temp_alert_file" <<EOF
<html>
<head>
<style>
body { background-color: #3c3836; color: #ebdbb2; font-family: monospace; font-size: 16px; padding: 20px; }
.title-box { background-color: #d79921; border: 2px solid #d65d0e; padding: 10px; color: #fbf1c7; font-weight: bold; text-align: center; }
.alert-body { background-color: #3c3836; padding: 10px; }
.title { background-color: #d79921; padding: 5px; }
p { margin-bottom: 10px; }
</style>
</head>
<body>
<div class="title-box">$alert_title</div>
<div class="alert-body">
<p>Cluster name: $cluster_name</p>
<p>Number of active services: $active</p>
<p>Number of inactive services: $inactive</p>
<p>Instructions to resolve:</p>
<p>SSH to node: core@$ip</p>
<p>Run command: systemctl daemon-reload</p>
</div>
</body>
</html>
EOF

      # Send the alert email to the specified email addresses
      mail_subject="Alert: Node $name in cluster $cluster_name has exceeded 100,000 services"
      $MUTT_PATH -e "set content_type=text/html" -s "$mail_subject" user@organization.com < "$temp_alert_file"

      # Clean up temporary alert file
      rm "$temp_alert_file"
    fi
  done
done
```

## Email template

![](https://hackmd.io/_uploads/S1E9iY2Ln.png)

# Collaboration

This note is open to global collaboration; feel free to add and change stuff.
I can always revert if required.
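
# Scheduling sketch

The note says the script is meant to run periodically from crontab but does not show an entry. Below is a minimal sketch of one. The script path `/home/user/adirectory/node-service-alert.sh`, the 15-minute interval, and the `cron.log` file are all assumptions made for illustration; none of them come from the original note.

```bash
#!/bin/bash
# Hypothetical script location and log file; adjust both to your setup.
SCRIPT="/home/user/adirectory/node-service-alert.sh"
CRON_LOG="/home/user/adirectory/cron.log"

# Run every 15 minutes (an assumed interval) and keep stdout/stderr in a log.
CRON_LINE="*/15 * * * * $SCRIPT >> $CRON_LOG 2>&1"

# Only print the entry here. To install it for the current user, one option is:
#   ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
echo "$CRON_LINE"
```

Printing the line rather than installing it keeps the sketch safe to run; pick an interval short enough that a node cannot climb from the alert threshold to the hang point between two runs.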
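
# Unit counting sketch

The script counts units by grepping for the words "active" and "inactive" anywhere in each line, which can also match unit descriptions. As a possible alternative, the sketch below counts on the ACTIVE column only. It assumes the input comes from `systemctl list-units --all --no-legend --plain`, where the third column is the ACTIVE state. The helper names `count_units` and `over_threshold` and the sample unit names are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# Read `systemctl list-units --all --no-legend --plain` output on stdin and
# print "<active> <inactive> <total>". Column 3 is the ACTIVE state.
count_units() {
    awk '
        $3 == "active"   { active++ }
        $3 == "inactive" { inactive++ }
        END { printf "%d %d %d\n", active, inactive, active + inactive }
    '
}

# Succeed (exit 0) when the total in $1 exceeds the limit in $2.
# The default of 100000 mirrors the figure used in the alert text.
over_threshold() {
    [ "$1" -gt "${2:-100000}" ]
}

# Demo on canned, made-up input so no cluster or ssh access is needed.
sample='kubelet.service loaded active running Kubelet
run-r1a2b3.scope loaded inactive dead Transient scope
run-r4c5d6.scope loaded inactive dead Transient scope'

# Split the three counts into $1, $2 and $3.
set -- $(printf '%s\n' "$sample" | count_units)
echo "active=$1 inactive=$2 total=$3"

if over_threshold "$3"; then
    echo "alert"
else
    echo "ok"
fi
```

On a real node the input would come from something like `ssh core@"$ip" "sudo systemctl list-units --all --no-legend --plain" | count_units`, keeping the rest of the script's logging and alerting unchanged.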