Debug
=====

```bash=
aicli -U 0.0.0.0:8090 list hosts
aicli -U 0.0.0.0:8090 list events clustername
aicli -U 0.0.0.0:8090 info cluster clustername
```

Debug BF2
=========

- https://github.com/bn222/dpu-tools?tab=readme-ov-file#bluefield-tools

```bash=
sudo podman run --pull always --replace --pid host --network \
    host --user 0 --name bf -dit --privileged -v /dev:/dev quay.io/bnemeth/bf
sudo podman exec -it bf /bin/bash
# python listbf
# python get_mode
```

- `mstconfig -e -d <pci address> q` shows the current state and the next-boot state. Verify that the current state has been properly set to NIC mode (see the sketch after this list): https://github.com/openshift/sriov-network-operator/blob/master/bindata/scripts/bf2-switch-mode.sh#L71
- `minicom --color on --baudrate 115200 --device /dev/rshim0/console`
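The NIC-mode check mentioned above could look like the following. This is a sketch only: the `INTERNAL_CPU_*` field names are the knobs the linked bf2-switch-mode.sh manipulates, and the PCI address is a placeholder.

```bash
# Placeholder PCI address; find the real one with: lspci | grep -i bluefield
pci="0000:03:00.0"
# "-e" prints default/current/next-boot values for every field, so this
# shows both what is active now and what will apply after the next reboot.
mstconfig -e -d "$pci" q | grep -i INTERNAL_CPU
```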
Links
=====

- switch config tool: https://gitlab.cee.redhat.com/nst/hw-enablement/nhe/-/blob/master/lab/switch/apply_vlans.py?ref_type=heads

TASKS
=====

- [SPRINT](https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=15601&projectKey=NHE&quickFilter=124602#)
- [tft] [running the flow test on the Intel IPU](https://app.slack.com/client/E030G10V24F/C06QXSN6VA5)
    - [details](https://app.slack.com/client/E030G10V24F/C06QXSN6VA5)
- [tft] [Traffic flow backlog](https://app.slack.com/client/E030G10V24F/C06QXSN6VA5)
    - [JIRA epic](https://issues.redhat.com/browse/NHE-357)
    - fix up the evaluator for the Netperf test and the HTTP test
- [watch demo recordings](https://redhat-internal.slack.com/archives/C06QXSN6VA5/p1718725370474769)
    - https://drive.google.com/drive/folders/1Kyz2-y7fczXEoMRk0GvOdd5YUApB--_1
    - https://www.youtube.com/watch?v=znWbFZ0ax3U
- [Add learnings to our confluence page so that the next person has an easier time ramping up](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
    - https://spaces.redhat.com/pages/viewpage.action?spaceKey=NHE&title=Network+Hardware+Enablement
    - https://spaces.redhat.com/display/NHE/NHE+Onboarding+Guide
- [jira bot](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
    - We have a Jira bot (written by Vrinda) that automatically replaces "blocker requested" with "blocker rejected". Recently I had a Jira issue that had the request set, but the bot didn't unset it, so something is wrong there. Look into that and make sure it works correctly.
- [SAST scan issues](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
    - https://issues.redhat.com/browse/OCPBUGS-28529
    - https://issues.redhat.com/browse/OCPBUGS-28523
- [increase the density of our clusters](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
    - https://issues.redhat.com/browse/NHE-772
- BF-3 verification:
    - [epic](https://issues.redhat.com/browse/NHE-1133)
    - [BF3 report](https://docs.google.com/document/d/1Pa1KnhVxcsQ0KgCxQdCXV1DUChdiIKaFaTNZ1Wt4eeU/edit)
    - [notes from William](https://docs.google.com/document/d/16KIgRtrslAmj-Jjv_3Z5NtTrvgDaGvliJS1IPzPK9hQ/edit)
    - [dpu-tools](https://github.com/bn222/dpu-tools/blob/main/rhel_on_bf2.md)
    - William uses machine 222, Thomas uses 223

[OCPBUGS-30549](https://issues.redhat.com/browse/OCPBUGS-30549)
========

# Notes

- [sriov-cni feature](https://github.com/openshift/sriov-cni/commit/c241dcb4367c14578b57dd139079bde1f6e997c2)
    - file `/var/lib/cni/bin/sriov`
- [cluster list](https://docs.google.com/spreadsheets/d/1lXvcodJ8dmc_hcp0hzbPDU8t6-hCnAlEWFRdM2r_n0Q/edit#gid=0)
- [doc installing the SR-IOV operator](https://docs.openshift.com/container-platform/4.8/networking/hardware_networks/installing-sriov-operator.html)

# Reproduce

## `cluster.yaml`

```yaml=
cat <<EOF > cluster.yaml
clusters:
  - name: "x1cluster"
    api_vip: "192.168.122.99"
    ingress_vip: "192.168.122.101"
    kubeconfig: "/root/kubeconfig.x1cluster"
    version: "4.14.0-nightly"
    network_api_port: "eno12399"
    preconfig:
      - name: "bf_bfb_image"
    postconfig:
      - name: "sriov_network_operator"
        sriov_network_operator_local: True
      - name: "switch_to_nic_mode"
    masters:
      - name: "x1cluster-master-2"
        kind: "vm"
        node: "localhost"
        ip: "192.168.122.2"
      - name: "x1cluster-master-3"
        kind: "vm"
        node: "localhost"
        ip: "192.168.122.3"
      - name: "x1cluster-master-4"
        kind: "vm"
        node: "localhost"
        ip: "192.168.122.4"
    workers:
      - name: "worker-246"
        kind: "physical"
        node: "wsfd-advnetlab246.anl.eng.bos2.dc.redhat.com"
        bmc_user: "root"
        bmc_password: "calvin"
        bmc: "10.26.16.89"
      - name: "worker-247"
        kind: "physical"
        node: "wsfd-advnetlab247.anl.eng.bos2.dc.redhat.com"
        bmc_user: "root"
        bmc_password: "calvin"
        bmc: "10.26.16.91"
EOF
```

## Label nodes

```bash
oc label node worker-246 --overwrite=true feature.node.kubernetes.io/sriov-capable=true
oc label node worker-247 --overwrite=true feature.node.kubernetes.io/sriov-capable=true
```

## pci-realloc

```yaml=
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-pci-realloc-workers
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - pci=realloc
EOF
```

## SriovNetworkNodePolicy

```yaml=
cat <<EOF | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: e810c1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: false
  nicSelector:
    pfNames:
      - ens7f0np0#0-3
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: "true"
  numVfs: 4
  priority: 99
  resourceName: e810c
EOF
```

## Namespace z1

```bash
oc create namespace z1
```

## SriovNetwork

```yaml=
cat <<EOF | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriovnetwork
  namespace: openshift-sriov-network-operator
spec:
  capabilities: |
    { "mac": true, "ips": true }
  ipam: |
    { "type": "whereabouts", "range": "10.31.0.0/30" }
  logLevel: debug
  networkNamespace: z1
  resourceName: e810c
  spoofChk: "off"
  trust: "on"
EOF
```
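Before checking the NetworkAttachmentDefinition, it can help to confirm the node policy actually synced. A sketch, assuming the default `openshift.io/` resource prefix so that `resourceName: e810c` shows up as `openshift.io/e810c` on the node:

```bash
# Did the policy sync? syncStatus should report Succeeded.
oc -n openshift-sriov-network-operator get sriovnetworknodestates \
    worker-246 -o jsonpath='{.status.syncStatus}{"\n"}'
# Are the VFs exposed as an allocatable extended resource?
# Expected to be 4, matching numVfs in the policy above.
oc get node worker-246 \
    -o jsonpath='{.status.allocatable.openshift\.io/e810c}{"\n"}'
```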
## Check NetworkAttachmentDefinition

```bash
oc describe net-attach-def -n z1
```

## Test Pods

```yaml=
cat <<EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: testpod1
  namespace: z1
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriovnetwork
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test-pod
      image: quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
  automountServiceAccountToken: false
  nodeSelector:
    "kubernetes.io/hostname": worker-246
EOF
```

```bash=
# Fetch the rebuilt sriov CNI binaries and keep a copy of the original.
for f in sriov.rebuild.no-announce sriov.rebuild.with-announce sriov.rebuild.with-modified-announce ; do
    curl "https://tynq.net/$f" > ~/"$f"
    chmod +x ~/"$f"
done
if [ ! -f ~/sriov.orig ] ; then
    cp /var/lib/cni/bin/sriov ~/sriov.orig
fi

# Swap in one of the binaries and print checksums to confirm what is active.
x_copy_cni() {
    if [ -n "$1" ] ; then
        cp "$1" /var/lib/cni/bin/sriov
    fi
    sha256sum ~/sriov* /var/lib/cni/bin/sriov
}
x_copy_cni ~/sriov.rebuild.with-announce

# Recreate the test pods: w247 on worker-247, w246-1/w246-2 on worker-246
# (pod.yaml defaults to worker-246).
x_restart() {
    oc delete pod/w247
    oc delete pod/w246-1
    oc delete pod/w246-2
    cat pod.yaml | \
        sed -e 's/worker-246/worker-247/' \
            -e 's/generateName: .*/name: w247/' | \
        oc apply -f -
    sleep 3
    cat pod.yaml | \
        sed -e 's/generateName: .*/name: w246-1/' | \
        oc apply -f -
    sleep 3
    cat pod.yaml | \
        sed -e 's/generateName: .*/name: w246-2/' | \
        oc apply -f -
    oc get all -n z1 -o wide
}
x_restart

# Tcpdump: capture inside w247's pod network namespace on worker-247.
x_tcpdump_start() {
    PID=$(oc debug -q node/worker-247 -- ps aux | grep /hello | awk '{print $2}')
    echo ">>>> pid on worker 247: $PID"
    oc debug -q node/worker-247 -- sh -c 'ip netns attach x '"$PID"' ; ip -n x l ; ip -n x a ; rm -rf /host/tmp/dump'
    oc debug -q node/worker-247 -- sh -c 'ip netns attach x '"$PID"' ; exec ip netns exec x tcpdump -n -i net1 -w /host/tmp/dump'
}
x_tcpdump_start
# CTRL-C to stop the capture.

x_tcpdump_show() {
    oc debug -q node/worker-247 -- sh -c 'cat /host/tmp/dump | xz | base64'
}
x_tcpdump_show

# Run a command, prefixed with a timestamped (and optionally numbered) marker.
cmd() {
    echo "CMD[${cmdnr:+"$cmdnr:"}$(date '+%s.%N')]>> $*"
    "$@"
}

x_run() {
    echo -e '\033[0;31m START #####\033[0m'
    cmd oc rsh pod/w247 ip a
    cmd oc rsh pod/w246-1 ip a
    cmd oc rsh pod/w247 arp -n
    cmd oc rsh pod/w246-1 ping -c 1 10.31.0.1
    cmd oc rsh pod/w247 arp -n
    # Delete w246-1 in the background and watch whether w247's ARP cache
    # gets updated by the announcement from the replacement pod.
    cmd oc delete pod/w246-1 &
    for i in $(seq 1 100) ; do cmdnr="$i" cmd oc rsh pod/w247 arp -n ; done
    cmd oc rsh pod/w246-2 ip a
}
x_run
```
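`x_tcpdump_show` prints the capture as xz-compressed, base64-encoded text. To turn that back into a pcap on the local machine (a sketch; `dump.b64` is a hypothetical file holding the pasted output):

```bash
# Reverse the "cat | xz | base64" pipeline from x_tcpdump_show;
# dump.b64 is a placeholder for wherever the base64 text was saved.
base64 -d dump.b64 | xz -d > dump.pcap
# The test watches for ARP announcements, so filter on ARP.
tcpdump -n -r dump.pcap arp
```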
Deployment History
==================

- 20240517-090000 advnetlab248, 4.15-nightly, FAILED: had the wrong cluster name in the YAML, so I ended up with duplicate master nodes.
- 20240517-100000 advnetlab248, 4.15-nightly, FAILED: an error about being out of disk space, although `df` did not show a full disk. Unclear why. Proceeded to redeploy the machine.
- 20240517-110000 advnetlab248, 4.15-nightly, FAILED: fresh machine. An error early on while updating the virsh network during teardown (about the IP address range). I suspected this might have been due to a patch I used (though it shouldn't matter), so I wanted to test with `main`.
- 20240517-111000 advnetlab248, 4.15-nightly, FAILED: this time the previous error was not hit. CDA exits because network_api_port was wrong (it hit the error after 15 minutes, but the threads kept running for another 30 minutes).
- 20240517-120000 advnetlab248, 4.15-nightly, FAILED: still the wrong network_api_port. I didn't spot the error previously because it was near the beginning of the log.
- 20240517-140000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240517-190000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240517-210000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240519-100000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240520-100000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240521-100000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request" (this was after rebooting the iDRAC and testing it manually)
- 20240521-120000 advnetlab248, 4.15-nightly, FAILED: "RuntimeError: all attempts to login to wsfd-advnetlab247.anl.eng.bos2.dc.redhat.com failed with authentication error"