# FBC OPM cache digest problem
I did some testing with one of our Operator Catalogs that failed to start with the `cache requires rebuild` error.
To test, I created a pod as follows, which is based on the pod that OLM would normally create, but sanitized a bit:
```yaml
kind: Pod
apiVersion: v1
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
  name: ibm-operator-catalog-chris3
  namespace: openshift-marketplace
spec:
  nodeSelector:
    kubernetes.io/os: linux
  restartPolicy: Always
  serviceAccountName: redhat-operators
  priority: 0
  schedulerName: default-scheduler
  enableServiceLinks: true
  terminationGracePeriodSeconds: 30
  preemptionPolicy: PreemptLowerPriority
  securityContext:
    seLinuxOptions:
      level: 's0:c15,c10'
  containers:
    - resources:
        requests:
          cpu: 10m
          memory: 50Mi
      terminationMessagePath: /dev/termination-log
      name: registry-server
      securityContext:
        capabilities:
          drop:
            - MKNOD
        readOnlyRootFilesystem: false
      ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
      imagePullPolicy: Always
      # Keep the container alive so opm can be run manually; the normal
      # catalog-serving command is commented out below.
      command: [ "/bin/sh", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      # command: [ "/bin/opm" ]
      # args: [ "serve", "/configs" ]
      terminationMessagePolicy: FallbackToLogsOnError
      image: 'icr.io/cpopen/ibm-operator-catalog@sha256:a3f35ce30b9470ac44eba2898192de0f819fbf9c5993e2c389b547f734c5bc98'
  serviceAccount: ibm-operator-catalog
  dnsPolicy: ClusterFirst
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/memory-pressure
      operator: Exists
      effect: NoSchedule
```
Then I exec into the pod and run some commands (in the OpenShift Console, just open a terminal on the pod):
**4.13:**
This fails, showing that the computed digest no longer matches the stored cache digest:
```
sh-4.4$ /bin/opm serve /configs --cache-dir=/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0001] cache requires rebuild: cache reports digest as "f7002ba0f3181b19", but computed digest is "e3b1caa360a78035"
sh-4.4$ /bin/opm serve /configs --cache-dir=/tmp/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0000] cache requires rebuild: cache reports digest as "19c47e8aa8053480", but computed digest is "887db2be85abc546"
sh-4.4$ /bin/opm serve /configs --cache-dir=/tmp/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0000] cache requires rebuild: cache reports digest as "19c47e8aa8053480", but computed digest is "887db2be85abc546"
# Try copying the /cache directory to see if there is something odd... Nope. Same result as the first.
sh-4.4$ mkdir /tmp/test
sh-4.4$ cp -r /cache /tmp/test
sh-4.4$ /bin/opm serve /configs --cache-dir /tmp/test/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0001] cache requires rebuild: cache reports digest as "f7002ba0f3181b19", but computed digest is "e3b1caa360a78035"
```
**4.14:**
This works, so the cache matches:
```
sh-4.4$ /bin/opm serve /configs --cache-dir=/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
INFO[0002] serving registry configs=/configs port=50051
INFO[0002] stopped caching cpu profile data address="localhost:6060"
```
This fails:
```
sh-4.4$ /bin/opm serve /configs --cache-dir=/tmp/cache
INFO[0000] starting pprof endpoint address="localhost:6060"
FATA[0000] cache requires rebuild: cache reports digest as "19c47e8aa8053480", but computed digest is "ca5ae87034dd5252"
```
This tells me that the stored cache digest `19c47e8aa8053480` is correct, and that the computed digest is being generated incorrectly when the exact same pod and container image runs on 4.13.32 vs. 4.14.
Examining the [opm code](https://github.com/operator-framework/operator-registry/blob/4c69b4af4437039212dd83fbe049fa30efb0b2cc/pkg/cache/json.go#L181-L219), opm calculates the digest as follows:
1. consistent-tar the `/configs` directory
2. feed the tar stream into a 64-bit FNV-1a hash
3. consistent-tar the `/cache/cache` directory
4. feed that stream into the same hash
Next, try regenerating the cache inside the running container:
`/bin/opm serve /configs --cache-dir /tmp/gen --cache-only`
The regenerated digest matches the digest that was computed at startup on 4.13:
```
cat gen/digest
e3b1caa360a78035
```
And serving it up works too:
```
/bin/opm serve /configs --cache-dir /tmp/gen
INFO[0000] starting pprof endpoint address="localhost:6060"
INFO[0001] serving registry configs=/configs port=50051
INFO[0001] stopped caching cpu profile data address="localhost:6060"
```
So, what's different about the two directories? The commands below show that the total byte count across all files is the same in both:
```
cd /tmp/gen/cache
ls > /tmp/genfiles.txt
tr -s '\n' '\000' < /tmp/genfiles.txt | wc -c --files0-from=- | awk '/total/ {print $1}'
580962291
cd /cache/cache
ls > /tmp/cachefiles.txt
tr -s '\n' '\000' < /tmp/cachefiles.txt | wc -c --files0-from=- | awk '/total/ {print $1}'
580962291
```
I also examined the `/configs` directories in the running 4.13 and 4.14 containers; both contain the same number of files:
```
ls -ltR | grep "^\-" | wc -l
4858
```
**Summary**:
The consistent tarball (and therefore the digest derived from it) that opm computes at serve time differs slightly from the one computed when the cache was built into the image, but only on certain versions of OCP. To fully debug the problem, I would need to build a variant of the code that opm uses to see what's going on under the covers. It feels like the tar operation is stumbling over the filesystem overlays.