# 162697767

## Issue
- SCG Connectivity can't be enabled; no master PSNT is shown on the Access Key tab.

## Cause
- VXM can't gather host details from the hosts' Platform Service.

## Troubleshooting
- The hosts' Platform Service can't connect to iDRAC.
- Network connections via iSM failed:
  - Can't ping the iDRAC internal IP: 169.254.0.1
  - iSM Status: Running (Limited functionality)
- This is because the network connection between the host and iDRAC over iSM/vmk1 failed.

## Plan
- Restore the iDRAC network connection on iSM/vmk1.
- Restore the Platform Service.
- Ensure VXM can gather the hosts' hardware information via the Platform Service.
- Ensure the redis cache has the hosts' detail keys.

## What has been done
- Can't ping the iDRAC internal IP; this is why the Platform Service failed to work.
  - For some reason, 169.254.0.1 is not reachable.
  - No specific firewall rule or security setting blocks this connection.
- Checked and found that the iSM status on iDRAC is shown as:
  - iSM Status: Running (Limited functionality)
  - This is the reason why the connection fails.
- Followed KB 000042093, the workaround given by PSE-19686 on VXEE-11377:
  - Restarted iDRAC and all related components.
  - This makes the vmk1-to-iDRAC connection work properly.
  - This brings the Platform Service back to normal operation.
- Now VXM is able to connect to all the hosts' Platform Service.
- The redis cache has been filled with new hardware keys.

### New issue found
- No Access Key tab is shown when enabling SCG Connectivity.

What has been done:
- Follow KB: https://www.dell.com/support/kbdoc/en-us/000201895
- Delete the hosts' universal keys.

## The idea
- For any SCG Connectivity to work, ensure cluster health is OK first.
- This means everything from iDRAC to iSM to the Platform Service to do-host and do-cluster is working properly (see the health-check sketch below).
- Note: something being in a "running" state does not always mean it is working properly.
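Before touching SCG, that chain can be walked top-down with a few host-side checks. This is a minimal sketch assembled from the commands used in the session log below; the `status` action on the dcism watchdog script is an assumption and may vary by VxRail release:

```
# 1. Host <-> iDRAC link over iSM (vmk1): the iDRAC internal IP must answer.
vmkping -I vmk1 169.254.0.1

# 2. iSM network monitor on the host (status action assumed).
/etc/init.d/dcism-netmon-watchdog status

# 3. Platform Service health: expect "iSM is ready" / "Platform initialized",
#    not BrokenPipeError tracebacks.
tail -n 30 /var/log/platform_svc.log
```

The VXM-side check (per-host hardware keys in the redis cache) is shown later in this case.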
## Putty Log

### Failed to ping 169.254.0.1
```
[root@MH-ESX-02:~] vmkping -I vmk1 169.254.0.1
PING 169.254.0.1 (169.254.0.1): 56 data bytes
sendto() failed (Host is down)
```

### Even the network looks OK
```
[root@MH-ESX-02:~] esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway          Interface
10.63.104.0      255.255.255.0    Local Subnet     vmk4
10.63.110.0      255.255.255.0    Local Subnet     vmk3
169.254.0.0      255.255.255.0    Local Subnet     vmk1
172.23.109.0     255.255.255.0    10.63.110.254    vmk3
10.63.100.0      255.255.252.0    Local Subnet     vmk2
default          0.0.0.0          10.63.103.254    vmk2
```
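A route to 169.254.0.0/24 on vmk1 only proves the route is configured; it does not prove the USB NIC behind it is up or that the iDRAC answers. Two quick follow-up checks, using commands that appear later in this session:

```
# vmk1 should be listed on the "iDRAC Network" port group; if it is missing,
# iSM has torn the interface down.
esxcfg-vmknic -l | grep vmk1

# Force the ping out through vmk1 specifically.
vmkping -I vmk1 169.254.0.1
```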
### Platform Service status
The log shows that pservice fails to connect to iDRAC:
```
[root@MH-ESX-02:~] tail -n 30 /var/log/platform_svc.log
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - Traceback (most recent call last):
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 65, in handle_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - hdlr(params, conn)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 94, in __call__
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - return self.handle_msg(msg, conn)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 88, in handle_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - conn.send_msg(status, resp)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/network.py", line 132, in send_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - sent_bytes = self.sock.send(b)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - BrokenPipeError: [Errno 32] Broken pipe
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - Exception dispatching message: ('platform', {'subaction': 'get'})
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - Traceback (most recent call last):
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 65, in handle_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - hdlr(params, conn)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 94, in __call__
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - return self.handle_msg(msg, conn)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 88, in handle_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - conn.send_msg(status, resp)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/network.py", line 132, in send_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - sent_bytes = self.sock.send(b)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - BrokenPipeError: [Errno 32] Broken pipe
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR -
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - During handling of the above exception, another exception occurred:
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR -
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - Traceback (most recent call last):
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/dispatch.py", line 68, in handle_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - conn.send_msg(STATUS_INTERNAL_FAILURE, str(ex))
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - File "/opt/vxrail/bin/service/network.py", line 132, in send_msg
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - sent_bytes = self.sock.send(b)
2023-02-21T03:56:20Z platform_svc: [esxcli-plugin-service-thread] ERROR - BrokenPipeError: [Errno 32] Broken pipe
```

### iDRAC status is still OK
```
[root@MH-ESX-02:~] /opt/vxrail/tools/ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 6.00
IPMI Version              : 2.0
Manufacturer ID           : 674
Manufacturer Name         : DELL Inc
Product ID                : 256 (0x0100)
Product Name              : Unknown (0x100)
Device Available          : yes
Provides Device SDRs      : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    Bridge
    Chassis Device
Aux Firmware Rev Info     :
    0x00
    0x23
    0x1e
    0x00
```
```
[root@MH-ESX-02:~] /opt/vxrail/tools/ipmitool lan print
Set in Progress         : Set Complete
Auth Type Support       : MD5
Auth Type Enable        : Callback : MD5
                        : User     : MD5
                        : Operator : MD5
                        : Admin    : MD5
                        : OEM      :
IP Address Source       : Static Address
IP Address              : 10.63.103.225
Subnet Mask             : 255.255.252.0
MAC Address             : b4:45:06:e8:b5:9f
SNMP Community String   : rymansnmpro
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 2.0 seconds
Default Gateway IP      : 10.63.103.254
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Cipher Suite Priv Max   : Xaaaaaaaaaaaaaa
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
Bad Password Threshold  : Not Available
```

### Recovered iSM connectivity
- A new hardware key has been created in the redis cache.
```
mh-vxm-01:/var/log # kubectl exec -it $(kubectl get pods -o name | grep cache | cut -d '/' -f 2) -- redis-cli -n 1
Defaulted container "cacheservice" out of: cacheservice, dependency (init)
127.0.0.1:6379[1]> keys *
1) "GSMWXT3:is_linzhi"
2) "pre_fetch_task"
3) "GSMWXT3:mh-esx-01.internal.rymanhealthcare.co.nz:hardware.model"
4) "HSMWXT3:is_linzhi"
127.0.0.1:6379[1]>
mh-vxm-01:/var/log # # NOW WE HAVE NODE 01 IN REDIS CACHE
```
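The interactive redis-cli session above can be collapsed into a one-shot check, which is convenient to re-run after each host is fixed (same kubectl and redis-cli invocation as above; the `'*hardware.model*'` pattern is simply a glob over the key names seen in this cluster):

```
# List only the per-host hardware keys in cache DB 1.
kubectl exec $(kubectl get pods -o name | grep cache | cut -d '/' -f 2) \
    -- redis-cli -n 1 keys '*hardware.model*'
```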
### After the iDRAC restart: stopped dcism, vmk1 is gone
```
[root@MH-ESX-02:~] vmkping -I vmk1 169.254.0.1
PING 169.254.0.1 (169.254.0.1): 56 data bytes
sendto() failed (Host is down)
[root@MH-ESX-02:~] less /var/run/log/vxps_main.log
[root@MH-ESX-02:~] /etc/init.d/sfcbd-watchdog stop
sfcbd-init[2536935]: args ('stop')
sfcbd-init[2536935]: Getting Exclusive access, please wait...
sfcbd-init[2536935]: Exclusive access granted.
sfcbd-init[2536935]: Request to stop sfcbd-watchdog, pid 2536935
sfcbd-init[2536935]: Invoked kill 2102166
sfcbd-init[2536935]: stop sfcbd process completed.
[root@MH-ESX-02:~] /etc/init.d/dcism-netmon-watchdog stop
[root@MH-ESX-02:~] esxcfg-vmknic -l
Interface  Port Group/DVPort/Opaque Network  IP Family  IP Address                Netmask        Broadcast      MAC Address        MTU   TSO MSS  Enabled  Type               NetStack
vmk2       11                                IPv4       10.63.103.227             255.255.252.0  10.63.103.255  00:50:56:64:e9:76  1500  65535    true     STATIC             defaultTcpipStack
vmk2       11                                IPv6       fe80::250:56ff:fe64:e976  64                            00:50:56:64:e9:76  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk0       4                                 IPv4       N/A                       N/A            N/A            00:50:56:61:d5:f9  1500  65535    true     NONE               defaultTcpipStack
vmk0       4                                 IPv6       fe80::250:56ff:fe61:d5f9  64                            00:50:56:61:d5:f9  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk3       19                                IPv4       10.63.110.232             255.255.255.0  10.63.110.255  00:50:56:61:15:26  1500  65535    true     STATIC             defaultTcpipStack
vmk3       19                                IPv6       fe80::250:56ff:fe61:1526  64                            00:50:56:61:15:26  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk4       3                                 IPv4       10.63.104.232             255.255.255.0  10.63.104.255  00:50:56:67:79:4e  9000  65535    true     STATIC             defaultTcpipStack
vmk4       3                                 IPv6       fe80::250:56ff:fe67:794e  64                            00:50:56:67:79:4e  9000  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk5       11                                IPv4       10.63.105.232             255.255.255.0  10.63.105.255  00:50:56:61:6a:14  9000  65535    true     STATIC             vmotion
vmk5       11                                IPv6       fe80::250:56ff:fe61:6a14  64                            00:50:56:61:6a:14  9000  65535    true     STATIC, PREFERRED  vmotion
```
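The capture jumps straight from the stop commands to vmk1 being back; presumably the services were started again with the `start` counterparts of the same init scripts. A minimal sketch of that step (the `start` actions are an assumption based on standard ESXi init-script conventions, not shown in the session):

```
# Assumed restart of the services stopped above; vmk1 (the iSM USB NIC)
# should be recreated once dcism comes back up.
/etc/init.d/dcism-netmon-watchdog start
/etc/init.d/sfcbd-watchdog start

# Verify the iDRAC internal IP answers again over vmk1.
vmkping -I vmk1 169.254.0.1
```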
### Restarted dcism: vmk1 came back
```
[root@MH-ESX-02:~] esxcfg-vmknic -l
Interface  Port Group/DVPort/Opaque Network  IP Family  IP Address                Netmask        Broadcast      MAC Address        MTU   TSO MSS  Enabled  Type               NetStack
vmk2       11                                IPv4       10.63.103.227             255.255.252.0  10.63.103.255  00:50:56:64:e9:76  1500  65535    true     STATIC             defaultTcpipStack
vmk2       11                                IPv6       fe80::250:56ff:fe64:e976  64                            00:50:56:64:e9:76  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk0       4                                 IPv4       N/A                       N/A            N/A            00:50:56:61:d5:f9  1500  65535    true     NONE               defaultTcpipStack
vmk0       4                                 IPv6       fe80::250:56ff:fe61:d5f9  64                            00:50:56:61:d5:f9  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk3       19                                IPv4       10.63.110.232             255.255.255.0  10.63.110.255  00:50:56:61:15:26  1500  65535    true     STATIC             defaultTcpipStack
vmk3       19                                IPv6       fe80::250:56ff:fe61:1526  64                            00:50:56:61:15:26  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk4       3                                 IPv4       10.63.104.232             255.255.255.0  10.63.104.255  00:50:56:67:79:4e  9000  65535    true     STATIC             defaultTcpipStack
vmk4       3                                 IPv6       fe80::250:56ff:fe67:794e  64                            00:50:56:67:79:4e  9000  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk1       iDRAC Network                     IPv4       169.254.0.2               255.255.255.0  169.254.0.255  00:50:56:6f:45:47  1500  65535    true     STATIC             defaultTcpipStack
vmk1       iDRAC Network                     IPv6       fe80::250:56ff:fe6f:4547  64                            00:50:56:6f:45:47  1500  65535    true     STATIC, PREFERRED  defaultTcpipStack
vmk5       11                                IPv4       10.63.105.232             255.255.255.0  10.63.105.255  00:50:56:61:6a:14  9000  65535    true     STATIC             vmotion
vmk5       11                                IPv6       fe80::250:56ff:fe61:6a14  64                            00:50:56:61:6a:14  9000  65535    true     STATIC, PREFERRED  vmotion
```

### Platform Service started and working normally
platform_svc.log:
```
2023-02-21T04:37:54Z platform_svc: [MainThread] INFO - ---------- Platform Service -----------------
2023-02-21T04:37:54Z platform_svc: [MainThread] INFO - ism address is: [169.254.0.1]
2023-02-21T04:37:54Z platform_svc: [MainThread] INFO - Registering "tasks"
2023-02-21T04:37:55Z platform_svc: [MainThread] INFO - iDracOpt. timestamp: 1676954274
2023-02-21T04:37:55Z platform_svc: [MainThread] INFO - Registering "platform"
2023-02-21T04:37:55Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Waiting for iSM ready
2023-02-21T04:37:55Z platform_svc: [MainThread] INFO - Setting up listener
2023-02-21T04:37:55Z platform_svc: [esxcli-plugin-service-thread] INFO - Starting endpoints
2023-02-21T04:37:55Z platform_svc: [esxcli-plugin-service-thread] INFO - Starting network loop
2023-02-21T04:37:55Z platform_svc: [esxcli-plugin-service-thread] INFO - Start network loop
2023-02-21T04:37:55Z platform_svc: [esxcli-plugin-service-thread] INFO - Backend started
2023-02-21T04:37:55Z platform_svc: [ThreadPoolExecutor-0_0] INFO - USB nic is ready
2023-02-21T04:37:55Z platform_svc: [ThreadPoolExecutor-0_0] INFO - iSM is ready
2023-02-21T04:37:55Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Found service acccount in slot 15
2023-02-21T04:37:55Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Refresh Password
2023-02-21T04:37:56Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Enable the account
2023-02-21T04:37:56Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Setting up the public key for the service account
2023-02-21T04:38:01Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Service account setup successfully
2023-02-21T04:38:01Z platform_svc: [MainThread] INFO - the Idrac is ready.
2023-02-21T04:38:01Z platform_svc: [MainThread] INFO - Add sticky bit for PS config file
2023-02-21T04:38:01Z platform_svc: [MainThread] INFO - started the restful API service.
2023-02-21T04:38:04Z platform_svc: [ThreadPoolExecutor-0_0] INFO - The updateable firmware info collected
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - PTAgent Account setup done
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - ConfigSetting: agent_backend='auto'
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Detected Agent Backend: 'none'
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Bypass the firmware inventory initialization
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Start PSAPI service thread
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - resistered api list: /node/discoveryinfo$ /node/info$ /node/apis$ /node/fw$ /node/fwupgrade/staging$ /node/fwupgrade/apply$ /node/idrac/reset$ /node/inventory/hard_refresh$ /node/tasks$ /node/tasks/(?P<taskid>[\w-]+)$ /node/agent/status$
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Registering "dell.idrac_account"
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Registering "techsuprep"
2023-02-21T04:38:05Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Registering "dell.esxi_tmp_account"
2023-02-21T04:38:08Z platform_svc: [ThreadPoolExecutor-0_0] INFO - Platform initialized
```

### Redis cache after fixing both hosts
```
mh-vxm-01:/var/log # kubectl exec -it $(kubectl get pods -o name | grep cache | cut -d '/' -f 2) -- redis-cli -n 1
Defaulted container "cacheservice" out of: cacheservice, dependency (init)
127.0.0.1:6379[1]> keys *
1) "pre_fetch_task"
2) "GSMWXT3:mh-esx-01.internal.rymanhealthcare.co.nz:hardware.model"
3) "day2:cluster:node:cache"
4) "HSMWXT3:is_linzhi"
5) "GSMWXT3:is_linzhi"
6) "day2:cluster:appliance:cache"
7) "HSMWXT3:mh-esx-02.internal.rymanhealthcare.co.nz:hardware.model"
8) "day2:index:moid"
127.0.0.1:6379[1]>
mh-vxm-01:/var/log # # NOW WE HAVE BOTH HOSTS (01 AND 02) IN THE REDIS CACHE
```
### ESE: missing the Access Key tab
- Something is wrong with the hosts' universal keys.
```
mh-vxm-01:/var/log # tail -f /var/log/microservice_log/short.term.log | grep -E 'dell-ese|rcs-service'
"2023-02-21 04:42:44,871" microservice.dell-ese "2023-02-21T04:42:44.192483171Z stderr F 8 2023-02-21 04:42:44,192 CP Server Thread-9 INFO cherrypy.access.140315519420784 LN: 283 ::ffff:172.28.175.102 - - [21/Feb/2023:04:42:44] ""GET /ese/status HTTP/1.1"" 200 23 """" ""kube-probe/1.23"""
"2023-02-21 04:42:44,871" microservice.dell-ese "2023-02-21T04:42:44.192484743Z stdout F ::ffff:172.28.175.102 - - [21/Feb/2023:04:42:44] ""GET /ese/status HTTP/1.1"" 200 23 """" ""kube-probe/1.23"""
"2023-02-21 04:43:00,860" microservice.rcs-service "2023-02-21T04:43:00.185602377Z stderr F 2023-02-21 04:43:00,182 [INFO] <ThreadPoolExecutor-0_5:139719195203328> commonservice.py value_from_configservice() (30): get key ese_state from configservice value = unconfigured"
"2023-02-21 04:43:04,859" microservice.rcs-service "2023-02-21T04:43:04.163401412Z stderr F 2023-02-21 04:43:04,161 [INFO] <ThreadPoolExecutor-0_6:139718708688640> universal_key_service.py get_universal_key() (162): the request :HSMWXT3 CLUSTER"
```
Follow the KB:
- https://www.dell.com/support/kbdoc/en-us/000201895
- Delete the hosts' universal keys, then re-check the rcs-service log (see the sketch below).
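The key-deletion steps themselves are in the KB above. After applying it, the same log tail can confirm that rcs-service serves the universal-key requests and that `ese_state` moves on from "unconfigured" once SCG Connectivity is enabled (log path and grep targets taken from the session above):

```
# Watch ESE / RCS activity while re-running the SCG Connectivity enablement.
tail -f /var/log/microservice_log/short.term.log \
    | grep -E 'universal_key_service|ese_state|dell-ese|rcs-service'
```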