# FBC OPM cache digest problem

I did some testing with one of our Operator Catalogs that failed to start with the `cache requires rebuild` error. To test, I created a pod as follows, based on the pod that OLM would normally create, but sanitized a bit:

```yaml
kind: Pod
apiVersion: v1
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
  name: ibm-operator-catalog-chris3
  namespace: openshift-marketplace
spec:
  nodeSelector:
    kubernetes.io/os: linux
  restartPolicy: Always
  serviceAccountName: redhat-operators
  priority: 0
  schedulerName: default-scheduler
  enableServiceLinks: true
  terminationGracePeriodSeconds: 30
  preemptionPolicy: PreemptLowerPriority
  securityContext:
    seLinuxOptions:
      level: 's0:c15,c10'
  containers:
    - resources:
        requests:
          cpu: 10m
          memory: 50Mi
      terminationMessagePath: /dev/termination-log
      name: registry-server
      securityContext:
        capabilities:
          drop:
            - MKNOD
        readOnlyRootFilesystem: false
      ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
      imagePullPolicy: Always
      command: [ "/bin/sh", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      # command: [ "/bin/opm" ]
      # args: [ "serve", "/configs" ]
      terminationMessagePolicy: FallbackToLogsOnError
      image: 'icr.io/cpopen/ibm-operator-catalog@sha256:a3f35ce30b9470ac44eba2898192de0f819fbf9c5993e2c389b547f734c5bc98'
  serviceAccount: ibm-operator-catalog
  dnsPolicy: ClusterFirst
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/memory-pressure
      operator: Exists
      effect: NoSchedule
```

Then I exec into the pod and run some commands (in the Console, just open a terminal).

**4.13:** This fails, showing that the cache digest is different:

```
sh-4.4$ /bin/opm serve /configs --cache-dir=/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0001] cache requires rebuild: cache reports digest as "f7002ba0f3181b19", but computed digest is "e3b1caa360a78035"
sh-4.4$ /bin/opm serve /configs --cache-dir=/tmp/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0000] cache requires rebuild: cache reports digest as "19c47e8aa8053480", but computed digest is "887db2be85abc546"
sh-4.4$ /bin/opm serve /configs --cache-dir=/tmp/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0000] cache requires rebuild: cache reports digest as "19c47e8aa8053480", but computed digest is "887db2be85abc546"

# Try copying the /cache directory to see if there is something odd... Nope. Same result as the first.
sh-4.4$ mkdir /tmp/test
sh-4.4$ cp -r /cache /tmp/test
sh-4.4$ /bin/opm serve /configs --cache-dir /tmp/test/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0001] cache requires rebuild: cache reports digest as "f7002ba0f3181b19", but computed digest is "e3b1caa360a78035"
```

**4.14:** This works, so the cache matches:

```
sh-4.4$ /bin/opm serve /configs --cache-dir=/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
INFO[0002] serving registry configs=/configs port=50051
INFO[0002] stopped caching cpu profile data address="localhost:6060"
```

This fails:

```
sh-4.4$ /bin/opm serve /configs --cache-dir=/tmp/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0000] cache requires rebuild: cache reports digest as "19c47e8aa8053480", but computed digest is "ca5ae87034dd5252"
```

This tells me that the correct cache value is `19c47e8aa8053480`, and that it is being generated incorrectly when the exact same pod and container image runs on 4.13.32 vs. 4.14.

Examining the [opm code](https://github.com/operator-framework/operator-registry/blob/4c69b4af4437039212dd83fbe049fa30efb0b2cc/pkg/cache/json.go#L181-L219) shows how the hash is calculated (a sketch of the scheme follows the list):

1. Consistent-tar the `/configs` directory.
2. Compute a 64-bit FNV-1a hash of that stream.
3. Consistent-tar the `/cache/cache` directory.
4. Feed it into the same hash.
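To make that scheme concrete, here is a minimal sketch in Go. This is not opm's actual code: the lexical walk order, the skipping of non-regular files, and the choice of which tar header fields to normalize are my assumptions for illustration, and opm's real consistent-tar logic is exactly the part under suspicion here.

```go
// digestsketch.go: a minimal sketch of the digest scheme described above,
// NOT opm's implementation. Normalization choices are assumptions.
package main

import (
	"archive/tar"
	"fmt"
	"hash"
	"hash/fnv"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// tarDirInto streams a deterministic tar of dir into h. filepath.WalkDir
// visits entries in lexical order, and we zero every header field that
// could vary between environments, so only relative paths, modes, sizes,
// and file contents should influence the hash.
func tarDirInto(h hash.Hash, dir string) error {
	tw := tar.NewWriter(h)
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		info, err := d.Info()
		if err != nil || !info.Mode().IsRegular() {
			return err // skip directories, symlinks, etc. in this sketch
		}
		rel, err := filepath.Rel(dir, path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = rel
		// Normalize fields that commonly differ between hosts/filesystems.
		hdr.ModTime = time.Unix(0, 0)
		hdr.AccessTime, hdr.ChangeTime = time.Time{}, time.Time{}
		hdr.Uid, hdr.Gid = 0, 0
		hdr.Uname, hdr.Gname = "", ""
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if err != nil {
		return err
	}
	return tw.Close()
}

func main() {
	// 64-bit FNV-1a, matching the 16-hex-digit digests in the logs above.
	h := fnv.New64a()
	for _, dir := range []string{"/configs", "/cache/cache"} {
		if err := tarDirInto(h, dir); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	fmt.Printf("%016x\n", h.Sum64())
}
```

If a sketch like this computes the same value on a 4.13 node and a 4.14 node while opm does not, the environment-dependent input must be one of the fields normalized away here.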
Try regenerating the cache inside the running pod: `/bin/opm serve /configs --cache-dir /tmp/gen --cache-only`

The regenerated digest matches the computed digest now:

```
cat gen/digest
e3b1caa360a78035
```

And serving it up works too:

```
/bin/opm serve /configs --cache-dir /tmp/gen
INFO[0000] starting pprof endpoint address="localhost:6060"
INFO[0001] serving registry configs=/configs port=50051
INFO[0001] stopped caching cpu profile data address="localhost:6060"
```

So, what's different about the two directories? These commands demonstrate that the total byte count across all the files is the same in both cache directories:

```
cd /tmp/gen/cache
ls > /tmp/genfiles.txt
tr -s '\n' '\000' < /tmp/genfiles.txt | wc -c --files0-from=- | awk '/total/ {print $1}'
580962291

cd /cache/cache
ls > /tmp/cachefiles.txt
tr -s '\n' '\000' < /tmp/cachefiles.txt | wc -c --files0-from=- | awk '/total/ {print $1}'
580962291
```

I also examined the `/configs` directories from both the 4.13 and 4.14 running containers and got the same result in each:

```
ls -ltR | grep "^\-" | wc -l
4858
```

**Summary**: The cache tarball, or the digest computed from that tarball, that opm generates at image build time is slightly different from what opm generates inside the running image, but only when run on certain versions of OCP. To fully debug the problem, I would need to build a variant of the code that opm uses to see what's going on under the covers. It feels like the tar step is stumbling over the filesystem overlays.
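As a first cut at that "variant of the code", here is a hypothetical debug harness (not part of opm): it walks a directory and prints, one diff-friendly line per entry, the metadata that typically feeds a tar header plus an FNV-1a hash of regular-file contents. Capturing its output on a 4.13.32 node and a 4.14 node and diffing the two files should point at whichever attribute the overlay filesystem reports differently.

```go
// tarwalk.go: hypothetical debug harness (not part of opm). Prints one
// line per entry so the output can be diffed across OCP versions.
package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: tarwalk <dir>")
		os.Exit(1)
	}
	err := filepath.WalkDir(os.Args[1], func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		// Hash regular-file contents so content differences show up too.
		sum := "-"
		if info.Mode().IsRegular() {
			f, err := os.Open(path)
			if err != nil {
				return err
			}
			defer f.Close()
			h := fnv.New64a()
			if _, err := io.Copy(h, f); err != nil {
				return err
			}
			sum = fmt.Sprintf("%016x", h.Sum64())
		}
		// Nanosecond mtime, in case timestamp granularity differs per filesystem.
		fmt.Printf("%s mode=%v size=%d mtime=%s content=%s\n",
			path, info.Mode(), info.Size(),
			info.ModTime().UTC().Format(time.RFC3339Nano), sum)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

For example (file names hypothetical): run `go run tarwalk.go /configs > configs-4.13.txt` in the 4.13 pod, do the same in the 4.14 pod, and `diff` the two captures. If every line matches but the digests still differ, the divergence is in walk ordering or tar header encoding rather than in the file metadata itself.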