# Content app performance problems
## What is the problem:
Pulp 3's content app architecture appears to be inefficient, using large amounts of CPU to serve far fewer concurrent requests than Pulp 2 was capable of. The difference is highly variable but roughly one order of magnitude (10x) worse in terms of both throughput and CPU usage.
There are some obvious causes that can be mitigated without much effort; however, a complete solution will likely require a more comprehensive effort.
https://pulp.plan.io/issues/8180
### Causes:
* Pulp 2 serves files (symlinks) directly from disk through the webserver, while Pulp 3 makes multiple database roundtrips per request
* Pulp 3 serves binary data through the WSGI process, and it flows back through the reverse proxy
* Pulp 3 is using some blocking database IO in the async code!
* The default configuration doesn't seem to be properly tuned; modifying certain httpd and gunicorn parameters has positive impacts.
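The blocking-IO cause above can be illustrated with a small sketch (generic asyncio, not Pulp's actual code): `time.sleep()` stands in for a synchronous database query made from an async handler.

```python
# A minimal sketch of why blocking calls inside async handlers hurt:
# time.sleep() stands in for a synchronous ORM query.
import asyncio
import time

def blocking_db_lookup(artifact_id):
    time.sleep(0.2)  # stand-in for a synchronous database query
    return {"id": artifact_id}

async def handler_blocking(artifact_id):
    # BAD: blocks the event loop; concurrent requests serialize
    return blocking_db_lookup(artifact_id)

async def handler_offloaded(artifact_id):
    # BETTER: run the blocking call in a thread so the loop stays free
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_db_lookup, artifact_id)

async def serve(handler, n=5):
    # time n "concurrent" requests through the given handler
    start = time.perf_counter()
    await asyncio.gather(*(handler(i) for i in range(n)))
    return time.perf_counter() - start

t_blocking = asyncio.run(serve(handler_blocking))    # ~n * 0.2 s, serialized
t_offloaded = asyncio.run(serve(handler_offloaded))  # ~0.2 s, overlapped
print(f"blocking: {t_blocking:.2f}s, offloaded: {t_offloaded:.2f}s")
```

Five concurrent requests take ~1 s when the handler blocks the loop, but ~0.2 s when the blocking work is offloaded to a thread.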
### Mitigations:
* "Easy"
* Avoid using blocking code in async code
* Use more gunicorn workers
* Other Apache/Nginx tunables?
* "Hard"
* Caching
* Use a different strategy for serving content
A related concern is that some users such as RHUI would like Pulp to be capable of serving content even when offline for upgrades. A "read-only mode".
https://pulp.plan.io/issues/7810
This is something we should keep in mind as we design solutions; we can look to these users to see whether they have already solved this problem (and how), or which solutions might also fulfill their needs.
## Test Design
High level goals:
1. Identify the most pressing bottlenecks in the content app, so that we understand where to focus our efforts.
2. Identify the true performance delta between Pulp 2 and 3 in realistic usage scenarios, to judge need for more comprehensive changes vs. small iterative ones.
### System Under Test
We need to develop a standard platform for conducting performance tests and experiments, so that we can accurately judge relative performance differences between Pulp 2/Pulp 3/modified Pulp 3.
#### What things do we measure?
* average throughput (req / sec)
* average latency (ms / req)
* CPU and RAM usage - sysstat
* ..?
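Whatever tool we pick, the numbers we want reduce to something like this (a sketch; the function name is ours, and `statistics.quantiles` needs Python 3.8+):

```python
# Sketch of the metrics any load-testing tool must give us: throughput,
# median latency, and 95th-percentile latency, from per-request timings.
import statistics

def summarize(latencies_ms, wall_time_s):
    return {
        "rps": len(latencies_ms) / wall_time_s,       # avg throughput
        "median_ms": statistics.median(latencies_ms), # avg latency
        # last of 19 cut points with n=20 == 95th percentile
        "p95_ms": statistics.quantiles(latencies_ms, n=20)[-1],
    }

result = summarize([10.0] * 100, 2.0)
print(result)
```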
#### How do we want to measure them?
* [Locust.io](https://locust.io/) load testing framework
* Industry-standard http load-testing tool
* Streamlined, easy to use, lots of performance details & statistics
* Not a "real" yum client, possibly less realistic?
* Jan Hutar's Satellite performance testing docker images
* Real yum clients
* No good way to get responsiveness metrics
Conclusion 2/12/21: Use Locust, try to emulate the workload of real clients as well as we can.
Take RPM repositories, load them into the file plugin as file repositories, and write a Locust client that processes the repository as a yum client would - download the metadata, then X number of RPM packages.
**TODO:**
* should we download full repositories and/or partial updates? How to decide which subsets of packages?
* nail down the test scenarios, needs to be repeatable
* test different hardware profiles?
* optimize httpd/nginx/gunicorn configurations
#### Hardware
Vagrant boxes (VMs):
* pulp3-source-centos7 for pulp3
* pulp_dev for pulp2
CentOS 7, 6 vCPU, 10 GB RAM
#### Configuration
* Gunicorn documentation suggests using (2 × $NUM_CPUS) + 1 workers
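Gunicorn config files are themselves Python, so that suggestion can be expressed directly (a sketch; the port and the aiohttp worker class reflect a typical Pulp 3 content-app deployment, and are assumptions here):

```python
# gunicorn.conf.py -- minimal sketch of the suggested worker count.
import multiprocessing

bind = "127.0.0.1:24816"  # Pulp 3 content app's default port (assumed)
worker_class = "aiohttp.GunicornWebWorker"
# (2 x CPUs) + 1, per the gunicorn documentation's rule of thumb
workers = multiprocessing.cpu_count() * 2 + 1
```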
TODO:
* httpd config options? (Adam Winberg's improvements?) https://pulp.plan.io/issues/8180
* gunicorn workers?
* log level?
* ..?
* Strip unused httpd modules?
### Experiment 0:
#### Call Flow
Client -> Apache -> Gunicorn (wsgi) -> Modified Pulp Code
#### Modified Pulp
**Test:** Modify Pulp content app to do no work other than reply "Hello world" in response to requests (no routing, no database, no files).
Apply the [hello world patch](https://gist.github.com/dralley/c65aa75d34978cb3357ce8301e3718d5) to Pulp.
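For reference, a standalone equivalent of that patch (a sketch, not the gist's exact contents) is just an aiohttp app that answers every request with "Hello world":

```python
# Stand-in for the linked patch: answer everything with "Hello world" --
# no routing logic, no database, no files.
from aiohttp import web

async def hello(request):
    return web.Response(text="Hello world")

app = web.Application()
# catch-all route: every path hits the same handler
app.router.add_route("GET", "/{tail:.*}", hello)

# web.run_app(app, port=24816)  # uncomment to serve standalone
```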
#### Validating the Test Tooling
Do we get consistent results between multiple different testing tools?
Locust is a load testing tool written in Python. It supports two backends for handling http requests, the standard one which uses the "requests" library, and a "fast" one which uses the "geventhttpclient" library.
https://docs.locust.io/en/stable/increase-performance.html
Goose is a clone of Locust written in Rust. It's not as convenient and doesn't support the same featureset quite yet, but it is *very* fast.
https://www.tag1consulting.com/blog/goose-locust-inspired-load-testing-tool-rust
#### Results
| Tool | 10 "users" | 20 "users" | 30 "users" |
|:----------------- |:----------:|:----------:|:----------:|
| Locust (standard) | 2000 | - | - |
| Locust (fast) | 7300 | 7740 | 7650 |
| Goose | 8800 | 9200 | 9450 |
#### Conclusions
1. 20 "users" seemed to drive the most load against the system (on my particular hardware configuration)
2. Locust is likely "fast enough", given that the Pulp 2 benchmark in early testing showed ~1.2K requests / sec max. Using the FastHttpClient is advisable.
* Update - Locust is fine. When using more significant request payloads, there is very little difference.
### Experiment 1:
**Question:** Are Gunicorn or Apache limiting throughput?
#### Test
Stop all the Pulp services, run a standalone Gunicorn "hello world" directly on a separate port, compare to experiment 0 results (which included Apache and same patch for Pulp). All tests with 20 "users".
#### Results
Aiohttp + Gunicorn (Locust): 7,850 (roughly the same as standard config w/ httpd)
Aiohttp + Gunicorn (Goose): 23,000 !
#### Conclusion
1. Both Apache and Gunicorn are performing well within an acceptable range in this test - neither are bottlenecks in terms of raw HTTP performance.
2. With Locust <-> Gunicorn as the System Under Test, the rate-limiting component is actually the client testing framework: Locust tops out at ~7.6K requests / sec.
### Experiment 2:
**Question:** How much of the problem is caused by the binary data streaming back through the ASGI environment?
**Test:** Modify the Pulp content app to stream random files from content storage without going through the database. Track CPU, throughput, and latency. Compare the Pulp setup (behind Apache) against manually running the same Gunicorn configuration on a separate port without Apache (without Pulp services running).
[no database gist](https://gist.github.com/bmbouter/511118447fa78a5fc8d18f9831d117e6)
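A rough analogue of that modification (a sketch, not the gist's exact contents; the storage path is an assumption): serve artifacts by filesystem path alone, skipping every database query. `web.FileResponse` uses `sendfile()` where the platform supports it.

```python
# Stream files straight from content storage, no database lookups.
# CONTENT_ROOT is an assumed storage location, not Pulp's configured one.
import os.path

from aiohttp import web

CONTENT_ROOT = "/var/lib/pulp/media"

async def stream_artifact(request):
    rel = request.match_info["path"]
    full = os.path.normpath(os.path.join(CONTENT_ROOT, rel))
    # reject path traversal outside the storage root
    if not full.startswith(CONTENT_ROOT + os.sep):
        raise web.HTTPForbidden()
    return web.FileResponse(full)  # streams via sendfile() where available

app = web.Application()
app.router.add_get("/content/{path:.*}", stream_artifact)
```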
#### Baseline:
Repositories hosted behind nginx, measured with Locust, 20 "users":
* CentOS repo - 220 RPS
* 0-byte file - 6800 RPS
* 5 MB files - 41 RPS
* (with 80 users: no difference in throughput, latency roughly 4x worse)
Resource consumption:
* ~20% CPU
* 10% memory directly, but entire memory space being used as IO cache
* *No disk I/O whatsoever*. I think all the files were being served directly from memory/cache ^
* 200 MiB/s == ~1.6 Gbps
![](https://i.imgur.com/jocBXhD.png)
#### Pulp
##### CentOS 7, 20 "users"
| Stack | Requests / second | Median latency (ms) | 95th percentile latency (ms) |
| ----------------- |:----------------------:|:-------------------:|:----------------------------:|
| Gunicorn + Apache | ~190 (highly variable) | ~75 | ~220 (highly variable) |
| Gunicorn | ~255 (highly variable) | ~63 | ~160 |
Resource consumption:
* ~90% on all 6 CPUs
* 20% memory directly, but entire memory space being used as IO cache
* ~10-120 MiB/s file IO - cache not so effective?
* 200 MiB/s == ~1.6 Gbps
![](https://i.imgur.com/vwktKwA.png)
![](https://i.imgur.com/3dwKtI6.png)
Before / after content app:
![](https://i.imgur.com/5vjRifF.png)
##### 5 MB files, 20 "users"
| Stack | Requests / second |
| ----------------- |:-----------------:|
| Gunicorn + Apache | |
| Gunicorn | |
##### 0-byte file, 20 "users"
| Stack | Requests / second |
| ----------------- |:-----------------:|
| Gunicorn + Apache | |
| Gunicorn | |
### Experiment 3:
**Question:** What impact does content guard (auth checking) have?
**Test:**
**Answer:** TBD
### Experiment 4:
**Question:** What impact does lack of proper asynchronicity have?
**Test:**
**Answer:** TBD
### Experiment 5:
**Question:** What impact does hardware (SSD, HDD) have?
**Test:**
**Answer:** TBD