# Measuring the impact of a shared model cache on Quay pods

## Hypothesis

The suspected cause of the most recent outage is that during a rollout (config or image change), new Quay pods start up with an empty cache and have to hit the database for every request.

**Null Hypothesis**: There will be no difference in startup cache hit rate between pods using a shared model cache and pods using a local model cache.

**Alternative Hypothesis**: Pods using a shared model cache will have a higher cache hit rate on startup than pods using a local model cache.

## Experiment Setup

1. Start with a Quay `Deployment` with `replicas: 1`.
2. Run push/pull scripts for 10 min to generate load and populate the cache.
3. Record cache metrics.
4. Set `replicas: 5` to add new pods.
5. Run push/pull scripts for 10 min to generate load.
6. Record cache metrics.

## Results

### Local Cache

| Pod | Initial cache hit rate | Stabilized cache hit rate (10m) | Notes |
|---|---|---|---|
| `memcached-quay-app-787fb4d9f7-b6skr` | 56.4% | 71.6% | Starting pod when `replicas: 1`. |
| `memcached-quay-app-787fb4d9f7-dts8h` | 52.8% | 67.6% | New pod when `replicas: 5`. |
| `memcached-quay-app-787fb4d9f7-tzlpr` | 37.5% | 54.0% | New pod when `replicas: 5`. |
| `memcached-quay-app-787fb4d9f7-xndrf` | 25.0% | 25.0% | New pod when `replicas: 5`. Logged 3 misses and 1 hit during the entire pod lifetime. |
| `memcached-quay-app-787fb4d9f7-xwhz9` | N/A | N/A | New pod when `replicas: 5`. Zero data points; the pod never logged traffic. |

### Shared Model Cache

| Pod | Initial cache hit rate | Stabilized cache hit rate (10m) | Notes |
|---|---|---|---|
| `redis-quay-app-85d6875f96-fcb68` | 65.7% | 66.4% | Starting pod when `replicas: 1`. |
| `redis-quay-app-85d6875f96-dg6f8` | 90.9% | 74.4% | New pod when `replicas: 5`. |
| `redis-quay-app-85d6875f96-j8z2b` | 100.0% | 97.7% | New pod when `replicas: 5`. |
| `redis-quay-app-85d6875f96-lfmmr` | 66.6% | 73.2% | New pod when `replicas: 5`. Spiked to a peak of 84.4%, then dropped. |
| `redis-quay-app-85d6875f96-2hb2x` | N/A | N/A | New pod when `replicas: 5`. Zero data points; the pod never logged traffic. |

### Notes While Conducting Experiment

#### Prometheus Queries

**Average cache hit rate for all pods (Memcached):**

```
sum(quay_model_cache_total{host=~"memcached-quay-app-.*", type="hit"}) / sum(quay_model_cache_total{host=~"memcached-quay-app-.*"})
```

**Average cache hit rate for all pods (Redis):**

```
sum(quay_model_cache_total{host=~"redis-quay-app-.*", type="hit"}) / sum(quay_model_cache_total{host=~"redis-quay-app-.*"})
```

**Average cache hit rate, grouped by pod (Memcached):**

```
sum(quay_model_cache_total{host=~"memcached-quay-app-.*", type="hit"}) by (host) / sum(quay_model_cache_total{host=~"memcached-quay-app-.*"}) by (host)
```

**Average cache hit rate, grouped by pod (Redis):**

```
sum(quay_model_cache_total{host=~"redis-quay-app-.*", type="hit"}) by (host) / sum(quay_model_cache_total{host=~"redis-quay-app-.*"}) by (host)
```

**Total cache misses, grouped by pod (Memcached):**

```
sum(quay_model_cache_total{host=~"memcached-quay-app-.*", type="miss"}) by (host)
```

**Total cache misses, grouped by pod (Redis):**

```
sum(quay_model_cache_total{host=~"redis-quay-app-.*", type="miss"}) by (host)
```

## Conclusion

After reviewing the results, we reject the null hypothesis and accept the alternative hypothesis. In this experiment, new pods that started up in the Quay `Deployment` had noticeably higher cache hit rates when they used a shared model cache. This holds both for the initial cache hit rate on startup, which reached 90% or even 100%, and for the stabilized hit rate after ten minutes.

#### Caveats

- Only one type of traffic was used for testing (pushes/pulls).
- Differences in push/pull times were not monitored, but appeared similar during observation.
- Results aren't very fine-grained because we only export Prometheus metrics every 30 seconds (`PROMETHEUS_PUSH_INTERVAL_SECONDS`).
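
The hit rates reported in the Results tables are hits divided by total cache lookups, which is what the Prometheus queries compute from the `type="hit"` and `type="miss"` series. As a rough illustration, the Python sketch below reproduces that arithmetic for the pod that logged 3 misses and 1 hit, then averages the initial hit rates of the new pods in each table. The `hit_rate` helper and the variable names are hypothetical and not part of Quay; the numbers are copied from the tables, and pods with no data points are left out.

```python
from statistics import mean
from typing import Optional


def hit_rate(hits: int, misses: int) -> Optional[float]:
    """Return hits / (hits + misses) as a percentage, or None if there was no traffic."""
    total = hits + misses
    if total == 0:
        return None  # e.g. a pod that never logged any cache lookups
    return 100.0 * hits / total


# Initial cache hit rates (%) of the pods added at `replicas: 5`,
# copied from the Results tables. N/A pods are omitted.
local_new_pods = [52.8, 37.5, 25.0]
shared_new_pods = [90.9, 100.0, 66.6]

print(hit_rate(hits=1, misses=3))        # 25.0
print(round(mean(local_new_pods), 1))    # 38.4
print(round(mean(shared_new_pods), 1))   # 85.8
```

With only three new pods reporting traffic per configuration, these averages are a quick summary of the tables and not a statistical test.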