# Measuring the impact of a shared model cache on Quay pods
## Hypothesis
The suspected cause of the most recent outage is that during a rollout (config or image change), new Quay pods start with an empty cache and must hit the database for every request.
**Null Hypothesis**: There will be no difference in startup cache hit rate between pods using a shared model cache and pods using a local model cache.
**Alternative Hypothesis**: Pods using a shared model cache will have a higher cache hit rate on startup than pods using a local model cache.
## Experiment Setup
1. Start with a Quay `Deployment` at `replicas: 1`.
2. Run push/pull scripts for 10 minutes to generate load and populate the cache.
3. Record cache metrics.
4. Set `replicas: 5` to add new pods.
5. Run push/pull scripts for another 10 minutes to generate load.
6. Record cache metrics.
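
The steps above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used: the Deployment name `memcached-quay-app` is inferred from the pod names in the results, while the `quay` namespace and the `./push_pull.sh` load script (including its `--minutes` flag) are hypothetical placeholders. By default it only prints the commands, and recording metrics (steps 3 and 6) is left as a manual step.

```python
import subprocess

def build_plan(deployment, namespace, load_cmd, replicas=5):
    """Return the experiment as an ordered list of (description, argv) steps."""
    scale = ["kubectl", "-n", namespace, "scale", f"deployment/{deployment}"]
    return [
        ("start with one replica", scale + ["--replicas=1"]),
        ("warm-up load, then record metrics", load_cmd),
        (f"scale out to {replicas} replicas", scale + [f"--replicas={replicas}"]),
        ("load against new pods, then record metrics", load_cmd),
    ]

def run_plan(plan, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for description, argv in plan:
        print(f"# {description}\n{' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, check=True)

# Hypothetical namespace and load script; replace with real values.
plan = build_plan("memcached-quay-app", "quay", ["./push_pull.sh", "--minutes", "10"])
run_plan(plan)
```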
## Results
### Local Cache
| Pod | Initial cache hit rate | Stabilized cache hit rate (10m) | Notes |
|---|---|---|---|
| `memcached-quay-app-787fb4d9f7-b6skr` | 56.4% | 71.6% | Starting pod when `replicas: 1`. |
| `memcached-quay-app-787fb4d9f7-dts8h` | 52.8% | 67.6% | New pod when `replicas: 5`. |
| `memcached-quay-app-787fb4d9f7-tzlpr` | 37.5% | 54.0% | New pod when `replicas: 5`. |
| `memcached-quay-app-787fb4d9f7-xndrf` | 25.0% | 25.0% | New pod when `replicas: 5`. Logs 3 misses, 1 hit during entire pod lifetime. |
| `memcached-quay-app-787fb4d9f7-xwhz9` | N/A | N/A | New pod when `replicas: 5`. Zero data points; the pod never logged traffic. |
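
The hit rates in this table are ratios of cumulative hit and miss counts, so a pod with very little traffic produces a coarse figure and a pod with none produces no figure at all. A minimal sketch of that arithmetic, assuming hits and misses are the only `type` values counted (the function name is illustrative):

```python
def hit_rate(hits, misses):
    """Cache hit rate as a percentage, or None when there was no traffic."""
    total = hits + misses
    if total == 0:
        return None  # corresponds to the N/A row: no data points
    return 100.0 * hits / total

print(hit_rate(1, 3))  # 25.0, matching the pod that logged 1 hit and 3 misses
print(hit_rate(0, 0))  # None
```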
### Shared Model Cache
| Pod | Initial cache hit rate | Stabilized cache hit rate (10m) | Notes |
|---|---|---|---|
| `redis-quay-app-85d6875f96-fcb68` | 65.7% | 66.4% | Starting pod when `replicas: 1`. |
| `redis-quay-app-85d6875f96-dg6f8` | 90.9% | 74.4% | New pod when `replicas: 5`. |
| `redis-quay-app-85d6875f96-j8z2b` | 100.0% | 97.7% | New pod when `replicas: 5`. |
| `redis-quay-app-85d6875f96-lfmmr` | 66.6% | 73.2% | New pod when `replicas: 5`. Spikes to a peak of 84.4%, then drops. |
| `redis-quay-app-85d6875f96-2hb2x` | N/A | N/A | New pod when `replicas: 5`. Zero data points; the pod never logged traffic. |
### Notes While Conducting Experiment
#### Prometheus Queries
**Average cache hit rate for all pods (Memcached):**
```
sum(quay_model_cache_total{host=~"memcached-quay-app-.*", type="hit"})/sum(quay_model_cache_total{host=~"memcached-quay-app-.*"})
```
**Average cache hit rate for all pods (Redis):**
```
sum(quay_model_cache_total{host=~"redis-quay-app-.*", type="hit"})/sum(quay_model_cache_total{host=~"redis-quay-app-.*"})
```
**Average cache hit rate, grouped by pod (Memcached):**
```
sum(quay_model_cache_total{host=~"memcached-quay-app-.*", type="hit"}) by (host)/sum(quay_model_cache_total{host=~"memcached-quay-app-.*"}) by (host)
```
**Average cache hit rate, grouped by pod (Redis):**
```
sum(quay_model_cache_total{host=~"redis-quay-app-.*", type="hit"}) by (host)/sum(quay_model_cache_total{host=~"redis-quay-app-.*"}) by (host)
```
**Total cache misses, grouped by pod (Memcached):**
```
sum(quay_model_cache_total{host=~"memcached-quay-app-.*", type="miss"}) by (host)
```
**Total cache misses, grouped by pod (Redis):**
```
sum(quay_model_cache_total{host=~"redis-quay-app-.*", type="miss"}) by (host)
```
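
The per-pod queries differ only in the `host` prefix, so they can be generated and run against the Prometheus HTTP API (`/api/v1/query`) instead of being pasted by hand. The sketch below is an illustration under assumptions: the server URL is a placeholder, and the helper names are not part of Quay.

```python
import json
import urllib.parse
import urllib.request

def hit_rate_query(prefix):
    """PromQL for per-pod hit rate, mirroring the grouped queries above."""
    selector = f'quay_model_cache_total{{host=~"{prefix}-.*"'
    return (f'sum({selector}, type="hit"}}) by (host)'
            f'/sum({selector}}}) by (host)')

def parse_vector(payload):
    """Map host label -> hit rate (percent) from an instant-query response."""
    return {
        sample["metric"]["host"]: 100.0 * float(sample["value"][1])
        for sample in payload["data"]["result"]
    }

def fetch_hit_rates(base_url, prefix):
    """Run the query against a Prometheus server and return per-pod rates."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode(
        {"query": hit_rate_query(prefix)})
    with urllib.request.urlopen(url) as resp:
        return parse_vector(json.load(resp))

# Usage (placeholder URL, not from the experiment):
# fetch_hit_rates("http://prometheus.example:9090", "redis-quay-app")
```

Pods that never logged traffic have no series for this metric, so they are simply absent from the returned mapping.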
## Conclusion
After reviewing the results, we reject the null hypothesis in favor of the alternative hypothesis. In this experiment, new pods added to the Quay `Deployment` showed higher cache hit rates when using a shared model cache. This holds for both the initial hit rate on startup (66.6% to 100.0% with the shared cache, versus 25.0% to 52.8% with the local cache) and the stabilized hit rate after ten minutes (73.2% to 97.7%, versus 25.0% to 67.6%).
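
With only three reporting new pods per configuration, one way to quantify the comparison is an exact one-sided permutation test on the initial hit rates from the results tables. This is a sketch of one possible check, not an analysis the experiment performed; the function name is illustrative, and treating each pod as an independent sample is an assumption.

```python
from itertools import combinations
from statistics import mean

# Initial hit rates (%) of new pods, copied from the results tables.
local = [52.8, 37.5, 25.0]
shared = [90.9, 100.0, 66.6]

def permutation_p_value(treatment, control):
    """One-sided exact p-value for mean(treatment) > mean(control)."""
    pooled = treatment + control
    observed = mean(treatment) - mean(control)
    count = total = 0
    # Enumerate every way to relabel the pooled values into two groups.
    for idx in combinations(range(len(pooled)), len(treatment)):
        group = [pooled[i] for i in idx]
        rest = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if mean(group) - mean(rest) >= observed - 1e-9:
            count += 1
    return count / total

print(permutation_p_value(shared, local))  # 0.05
```

Every shared-cache value exceeds every local-cache value, so only 1 of the 20 possible splits is as extreme as the observed one. A p-value of 0.05 is the smallest this sample size can produce, which is worth keeping in mind when weighing the result.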
### Caveats
- Only a certain type of traffic was used for testing (pushes/pulls).
- Push/pull times were not measured, but appeared similar during observation.
- Results aren't very fine-grained because we only export Prometheus metrics every 30 seconds (`PROMETHEUS_PUSH_INTERVAL_SECONDS`).