Debug
=====
```bash=
# aicli (assisted-installer CLI) pointed at the local assisted-service; replace clustername
aicli -U 0.0.0.0:8090 list hosts
aicli -U 0.0.0.0:8090 list events clustername
aicli -U 0.0.0.0:8090 info cluster clustername
```
Debug BF2
=========
- https://github.com/bn222/dpu-tools?tab=readme-ov-file#bluefield-tools
```bash=
sudo podman run --pull always --replace --pid host --network host \
    --user 0 --name bf -dit --privileged -v /dev:/dev quay.io/bnemeth/bf
sudo podman exec -it bf /bin/bash
# inside the container:
#   python listbf    # enumerate attached BlueField cards
#   python get_mode  # query the current DPU/NIC mode
```
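On the host, a quick way to confirm the card is visible on the PCI bus before poking at it from the container (vendor ID 15b3 is Mellanox; the exact device IDs vary by BlueField model):
```bash
# List Mellanox devices; a BF2 should show up as a "BlueField-2" entry
lspci -d 15b3: | grep -i bluefield
```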
- `mstconfig -e -d <pci address> q` shows the current state and the next-boot state. Verify that the current state has actually been set to NIC mode: https://github.com/openshift/sriov-network-operator/blob/master/bindata/scripts/bf2-switch-mode.sh#L71 (see the sketch below this list)
- `minicom --color on --baudrate 115200 --device /dev/rshim0/console`
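A minimal sketch of that query (the PCI address is a placeholder; find yours via `lspci -d 15b3:`). With `-e`, mstconfig also prints the default and current values alongside the next-boot ones, and the `INTERNAL_CPU_*` fields are the ones the linked switch-mode script touches:
```bash
# Placeholder PCI address; substitute the BF2's actual address
mstconfig -e -d 0000:03:00.0 q | grep INTERNAL_CPU
```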
Links
=====
- switch config tool: https://gitlab.cee.redhat.com/nst/hw-enablement/nhe/-/blob/master/lab/switch/apply_vlans.py?ref_type=heads
TASKS
=====
- [SPRINT](https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=15601&projectKey=NHE&quickFilter=124602#)
- [tft] [running the flow test on the Intel IPU](https://app.slack.com/client/E030G10V24F/C06QXSN6VA5)
- [details](https://app.slack.com/client/E030G10V24F/C06QXSN6VA5)
- [tft] [Traffic flow backlog](https://app.slack.com/client/E030G10V24F/C06QXSN6VA5)
- [JIRA epic](https://issues.redhat.com/browse/NHE-357)
  - fix up the evaluator for the Netperf and HTTP tests
- [watch demo recordings](https://redhat-internal.slack.com/archives/C06QXSN6VA5/p1718725370474769)
- https://drive.google.com/drive/folders/1Kyz2-y7fczXEoMRk0GvOdd5YUApB--_1
- https://www.youtube.com/watch?v=znWbFZ0ax3U
- [Add learnings to our confluence page so that the next person has an easier time ramping up](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
- https://spaces.redhat.com/pages/viewpage.action?spaceKey=NHE&title=Network+Hardware+Enablement
- https://spaces.redhat.com/display/NHE/NHE+Onboarding+Guide
- [jira bot](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
  - We have a jira bot (written by vrinda) that automatically replaces "blocker requested" with "blocker rejected". Recently I had a jira with the request set that the bot didn't unset, so something is wrong with it. Look into that and make sure it works correctly.
- [SAST scan issues](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
- https://issues.redhat.com/browse/OCPBUGS-28529
- https://issues.redhat.com/browse/OCPBUGS-28523
- [increase the density of our clusters](https://app.slack.com/client/E030G10V24F/D06MASDKCBY)
- https://issues.redhat.com/browse/NHE-772
- BF-3 Verification:
- [epic](https://issues.redhat.com/browse/NHE-1133)
- [BF3 Report](https://docs.google.com/document/d/1Pa1KnhVxcsQ0KgCxQdCXV1DUChdiIKaFaTNZ1Wt4eeU/edit)
- [Notes from William](https://docs.google.com/document/d/16KIgRtrslAmj-Jjv_3Z5NtTrvgDaGvliJS1IPzPK9hQ/edit)
- [dpu-tools](https://github.com/bn222/dpu-tools/blob/main/rhel_on_bf2.md)
- William uses machine 222, Thomas uses 223
[OCPBUGS-30549](https://issues.redhat.com/browse/OCPBUGS-30549)
========
# Notes
- [sriov-cni feature](https://github.com/openshift/sriov-cni/commit/c241dcb4367c14578b57dd139079bde1f6e997c2)
- file `/var/lib/cni/bin/sriov`
- [cluster list](https://docs.google.com/spreadsheets/d/1lXvcodJ8dmc_hcp0hzbPDU8t6-hCnAlEWFRdM2r_n0Q/edit#gid=0)
- [doc installing SRIOV operator](https://docs.openshift.com/container-platform/4.8/networking/hardware_networks/installing-sriov-operator.html)
# Reproduce
## `cluster.yaml`
```yaml=
cat <<EOF > cluster.yaml
clusters:
- name : "x1cluster"
api_vip: "192.168.122.99"
ingress_vip: "192.168.122.101"
kubeconfig: "/root/kubeconfig.x1cluster"
version: "4.14.0-nightly"
network_api_port: "eno12399"
preconfig:
- name: "bf_bfb_image"
postconfig:
- name: "sriov_network_operator"
sriov_network_operator_local: True
- name: "switch_to_nic_mode"
masters:
- name: "x1cluster-master-2"
kind: "vm"
node: "localhost"
ip: "192.168.122.2"
- name: "x1cluster-master-3"
kind: "vm"
node: "localhost"
ip: "192.168.122.3"
- name: "x1cluster-master-4"
kind: "vm"
node: "localhost"
ip: "192.168.122.4"
workers:
- name: "worker-246"
kind: "physical"
node: "wsfd-advnetlab246.anl.eng.bos2.dc.redhat.com"
bmc_user: "root"
bmc_password: "calvin"
bmc: "10.26.16.89"
- name: "worker-247"
kind: "physical"
node: "wsfd-advnetlab247.anl.eng.bos2.dc.redhat.com"
bmc_user: "root"
bmc_password: "calvin"
bmc: "10.26.16.91"
EOF
```
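A quick sanity check that the generated file parses before kicking off a deployment (assumes python3 with PyYAML on the host):
```bash
python3 -c 'import yaml; yaml.safe_load(open("cluster.yaml")); print("cluster.yaml parses OK")'
```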
## Label nodes
```bash
oc label node worker-246 --overwrite=true feature.node.kubernetes.io/sriov-capable=true
oc label node worker-247 --overwrite=true feature.node.kubernetes.io/sriov-capable=true
```
## pci=realloc
```yaml=
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: worker
name: 99-pci-realloc-workers
spec:
config:
ignition:
version: 3.2.0
kernelArguments:
- pci=realloc
EOF
```
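The MCO rolls the workers through a reboot to apply the kernel argument; something along these lines verifies it landed (node name as used above):
```bash
# Wait for the worker pool to finish rolling out the MachineConfig
oc wait mcp/worker --for=condition=Updated --timeout=30m
# Confirm the argument made it onto the kernel command line
oc debug -q node/worker-246 -- chroot /host cat /proc/cmdline | grep -o 'pci=realloc'
```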
## SriovNetworkNodePolicy
```yaml=
cat <<EOF | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: e810c1
namespace: openshift-sriov-network-operator
spec:
deviceType: netdevice
isRdma: false
nicSelector:
pfNames:
- ens7f0np0#0-3
nodeSelector:
feature.node.kubernetes.io/sriov-capable: "true"
numVfs: 4
priority: 99
resourceName: e810c
EOF
```
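Once the policy syncs, the per-node state object (named after the node) should show the VFs, and the e810c resource should appear as allocatable:
```bash
oc -n openshift-sriov-network-operator get sriovnetworknodestates worker-246 -o yaml
oc get node worker-246 -o jsonpath="{.status.allocatable['openshift\.io/e810c']}"
```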
## Namespace z1
```bash
oc create namespace z1
```
## SriovNetwork
```yaml=
cat <<EOF | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: sriovnetwork
namespace: openshift-sriov-network-operator
spec:
capabilities: |
{
"mac": true,
"ips": true
}
ipam: |
{
"type": "whereabouts",
"range":"10.31.0.0/30"
}
logLevel: debug
networkNamespace: z1
resourceName: e810c
spoofChk: "off"
trust: "on"
EOF
```
## Check NetworkAttachmentDefinition
```bash
oc describe net-attach-def -n z1
```
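Since the /30 range only yields two usable addresses, stale whereabouts reservations can block new pods. Assuming the stock OpenShift whereabouts deployment keeps its IPPool objects in openshift-multus, the reservations can be inspected with:
```bash
oc -n openshift-multus get ippools.whereabouts.cni.cncf.io -o yaml
```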
## Test Pods
```yaml=
cat <<EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
generateName: testpod1
namespace: z1
labels:
env: test
annotations:
k8s.v1.cni.cncf.io/networks: sriovnetwork
spec:
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
containers:
- name: test-pod
image: quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: [ALL]
automountServiceAccountToken: false
nodeSelector:
"kubernetes.io/hostname": worker-246
EOF
```
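Once a test pod is running (e.g. after `x_restart` in the block below), the attached VF and its IP should appear in the multus status annotation:
```bash
oc -n z1 get pod w246-1 -o jsonpath="{.metadata.annotations['k8s\.v1\.cni\.cncf\.io/network-status']}"
```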
```bash=
for f in sriov.rebuild.no-announce sriov.rebuild.with-announce sriov.rebuild.with-modified-announce ; do
curl "https://tynq.net/$f" > ~/"$f" ;
chmod +x ~/"$f"
done
if [ ! -f ~/sriov.orig ] ; then
cp /var/lib/cni/bin/sriov ~/sriov.orig
fi
x_copy_cni() {
if [ -n "$1" ] ; then
cp "$1" /var/lib/cni/bin/sriov
fi
sha256sum ~/sriov* /var/lib/cni/bin/sriov
}
x_copy_cni ~/sriov.rebuild.with-announce
x_restart() {
oc delete pod/w247
oc delete pod/w246-1
oc delete pod/w246-2
cat pod.yaml | \
sed -e 's/worker-246/worker-247/' \
-e 's/generateName: .*/name: w247/' | \
oc apply -f -
sleep 3
cat pod.yaml | \
sed -e 's/generateName: .*/name: w246-1/' | \
oc apply -f -
sleep 3
cat pod.yaml | \
sed -e 's/generateName: .*/name: w246-2/' | \
oc apply -f -
oc get all -n z1 -o wide
}
x_restart
# Tcpdump:
x_tcpdump_start() {
PID=$(oc debug -q node/worker-247 -- ps aux | grep /hello | awk '{print $2}')
echo ">>>> pid on worker 247: $PID"
oc debug -q node/worker-247 -- sh -c 'ip netns attach x '"$PID"' ; ip -n x l ; ip -n x a; rm -rf /host/tmp/dump'
oc debug -q node/worker-247 -- sh -c 'ip netns attach x '"$PID"' ; exec ip netns exec x tcpdump -n -i net1 -w /host/tmp/dump'
}
x_tcpdump_start
# CTRL-C
x_tcpdump_show() {
oc debug -q node/worker-247 -- sh -c 'cat /host/tmp/dump | xz | base64'
}
x_tcpdump_show
cmd() {
echo "CMD[${cmdnr:+"$i:"}$(date '+%s.%N')]>> $*"
"$@"
}
x_run() {
echo -e '\033[0;31m START #####\033[0m'
cmd oc rsh pod/w247 ip a
cmd oc rsh pod/w246-1 ip a
cmd oc rsh pod/w247 arp -n
cmd oc rsh pod/w246-1 ping -c 1 10.31.0.1
cmd oc rsh pod/w247 arp -n
cmd oc delete pod/w246-1 &
for i in `seq 1 100` ; do cmdnr="$i" cmd oc rsh pod/w247 arp -n; done
cmd oc rsh pod/w246-2 ip a
}
x_run
```
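To read the capture locally, reverse the `xz | base64` pipeline from `x_tcpdump_show` (file names here are arbitrary; paste the base64 output into `dump.b64` first):
```bash
base64 -d dump.b64 | xz -d > dump.pcap
tcpdump -nr dump.pcap
```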
Deployment History
==================
- 20240517-090000 advnetlab248, 4.15-nightly, FAILED: had the wrong cluster name in the yaml, so I ended up with duplicate master nodes
- 20240517-100000 advnetlab248, 4.15-nightly, FAILED: an error about being out of disk space, although df did not show the disk as full. Unclear why. Proceeded to redeploy the machine.
- 20240517-110000 advnetlab248, 4.15-nightly, FAILED: fresh machine. An error early on while updating the virsh network during teardown (about the IP address range). I suspected this may have been due to a patch I used (though it shouldn't matter), so I wanted to test with `main`.
- 20240517-111000 advnetlab248, 4.15-nightly, FAILED: this time the previous error is not hit. CDA exits because network_api_port was wrong (it hit the error after 15 minutes, but the threads kept running for another 30 minutes).
- 20240517-120000 advnetlab248, 4.15-nightly, FAILED: still the wrong network_api_port. I didn't spot the error previously because it was somewhere at the beginning of the log.
- 20240517-140000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240517-190000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240517-210000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240519-100000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240520-100000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request"
- 20240521-100000 advnetlab248, 4.15-nightly, FAILED: "Got exception from future HTTP Error 400: Bad Request" (this was after rebooting idrac and testing it manually)
- 20240521-120000 advnetlab248, 4.15-nightly, FAILED: "RuntimeError: all attempts to login to wsfd-advnetlab247.anl.eng.bos2.dc.redhat.com failed with authentication error"