# Pulp Content Serving Performance Test

## Goals

1. Characterize the relationship between the sustained requests / sec to the pulp-content app and the number of pulp-content apps, without increases in latency.
2. Identify rate-limiting components in various test scenarios.

## System Under Test

### Physical hardware

All in EU-West-2.

1 load generator - m4.2xlarge instance:

* 8 64-bit CPUs, Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
* 32GB RAM

1 Pulp test machine - m4.2xlarge instance:

* 8 64-bit CPUs, Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
* 32GB RAM

Minio, providing artifact storage, also runs on the system under test.

### Software Under Test

Pulp will be deployed as containers on the "Pulp test machine" hardware above:

* 1 container for pulp-api
* N containers for pulp-content
* 2 containers for pulp-worker
* 1 redis container
* 1 postgresql container

Pulp will be configured to use Minio, which is also deployed on the same machine as Pulp. The artifacts themselves will not actually be served during testing, so Minio is expected to use a minimal amount of hardware resources on the system under test.

The versions under test are: pulpcore 3.22.0 and pulp_rpm 3.18.9.

### Architecture

Components:

* The load generator ([locust](https://locust.io/) with the fasthttp backend)
* The pulp-content processes

```
Locust ---HTTP-Requests---> pulpcore-content
Locust <---302-Redirect--- pulpcore-content
```

Locust will not follow the redirect. Minio is not the software under test; only pulp-content is. The reverse proxy is entirely bypassed because we do not want to test nginx, only Pulp.

### Locust configuration

Locust is deployed with 16 workers on the load generator machine.

The urls from the RHEL9 baseos repository are gathered into a text file using this script:

```
import argparse
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def process_page(url):
    """Recursively walk a directory listing, printing the url of each .rpm file."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    links = soup.find_all("a")
    for link in links:
        # Skip the parent-directory link to avoid walking back up the tree.
        if link.text == "../":
            continue
        link_url = urljoin(url, link["href"])
        if link_url.endswith(".rpm"):
            print(link_url)
        elif "://" in link_url:
            # Any other absolute url is treated as a subdirectory and recursed into.
            process_page(link_url)


parser = argparse.ArgumentParser(
    description="Get the full URLs of .rpm files at a specific url."
)
parser.add_argument("base_url", help="The base url to fetch links from")
args = parser.parse_args()

process_page(args.base_url)
```

Here is the locustfile.py:

```
import random
import re

from locust import FastHttpUser, task

NUM_CONTENT_APPS = 2
START_PORT = 24737
END_PORT = START_PORT + NUM_CONTENT_APPS


class QuickstartUser(FastHttpUser):
    def on_start(self):
        # Load the .rpm urls gathered by the script above, stripping newlines.
        with open("/mnt/locust/urls.txt") as file:
            self.urls = [line.strip() for line in file]

    @task
    def random_rpm(self):
        # Pick a random url and rewrite its port so that requests are spread
        # evenly across the pulp-content processes.
        original_url = random.choice(self.urls)
        new_port = str(random.randint(START_PORT, END_PORT - 1))
        new_url = re.sub(r":\d+", ":" + new_port, original_url)
        self.client.get(new_url, allow_redirects=False)
```

### Metrics

* sustained requests / sec - The number of redirects issued per second.
* latency median - Pulp's typical response time.
* latency 95th percentile, paired with the requests / sec - The response time covering the large majority of requests.
* redis container CPU percentage - An indicator of how heavily loaded Redis is.
* PostgreSQL process count - An indicator of how heavily loaded PostgreSQL is; PostgreSQL forks more processes as it handles more load.
* PostgreSQL average CPU usage across all PostgreSQL processes - An indicator of how heavily loaded PostgreSQL is.
* Number of pulp-content Gunicorn workers, i.e. how many Gunicorn actually forked in practice - Indicates the actual number of processes handling requests.
* Average pulp-content Gunicorn worker CPU, averaged across all the pulp-content Gunicorn workers - An indicator of how heavily loaded the pulp-content processes are.
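The container-level metrics above can be sampled from the Docker CLI while a run is in progress. Below is a minimal sketch of one way to do that; the container names are hypothetical and would need to be adjusted to match the actual compose project.

```
import subprocess

# Hypothetical container name -- adjust to match your compose project.
POSTGRES_CONTAINER = "postgres"


def sample_container_stats():
    """One-shot sample of per-container CPU and memory from `docker stats`."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format",
         "{{.Name}} {{.CPUPerc}} {{.MemUsage}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        name, cpu, mem = line.split(maxsplit=2)
        print(f"{name}: cpu={cpu} mem={mem}")


def count_postgres_processes():
    """Count forked PostgreSQL backends via `docker top` (header line excluded)."""
    out = subprocess.run(
        ["docker", "top", POSTGRES_CONTAINER],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines()) - 1


if __name__ == "__main__":
    sample_container_stats()
    print("postgresql processes:", count_postgres_processes())
```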
## Scenarios

For each scenario, increase the number of Locust users until the pulp-content app no longer shows an increase in req / sec with additional Locust users.

1. 1 pulp-content process (2 gunicorn processes when heavily loaded). Redis caching enabled.
2. 2 pulp-content processes (4 gunicorn processes when heavily loaded). Redis caching enabled.
3. 4 pulp-content processes (8 gunicorn processes when heavily loaded). Redis caching enabled. This is the maximum number of CPUs on the machine.
4. 2 pulp-content processes (4 gunicorn processes when heavily loaded). Redis caching disabled. This is useful for comparing against scenario (2) to determine the impact of caching.

## Caching Enabled Results

1. Each pulp-content gunicorn worker process with exclusive access to an idle 2.50GHz Xeon CPU can serve 625 req / sec.
2. Each pulp-content gunicorn worker process uses a stable 16MB of memory.
3. The pulp-content gunicorn worker is CPU bound, with CPU usage directly correlating to req / sec.
4. A single-container Redis was able to serve about 3K req / sec with 20% CPU in use.
5. PostgreSQL was almost entirely unloaded; it never spawned more than 1 process and never even registered on top during any caching-enabled run.
6. The maximum median response time in almost all runs was 25ms.
7. The maximum 95th percentile response time in almost all runs was 70ms.

## Caching Disabled Results

1. Each pulp-content gunicorn worker process with exclusive access to an idle 2.50GHz Xeon CPU can serve 108 req / sec.
2. Each pulp-content gunicorn worker process uses a stable 16MB of memory.
3. The pulp-content gunicorn worker is roughly 3/4 CPU bound and 1/4 DB bound, with both resources closely correlating to req / sec.
4. PostgreSQL forked additional workers, and they began taking on load.
5. The maximum median response time in almost all runs was 16ms.
6. The maximum 95th percentile response time in almost all runs was 28ms.

## Conclusions

1. Using Redis caching with Pulp should allow a cluster to horizontally scale to meet most, if not all, required req / sec rates (see the sizing sketch after this list).
   * To do so, just add more CPUs for additional pulp-content processes.
   * Make sure your Redis doesn't run out of CPU either. This component could also be horizontally scaled.
   * The response times are well within the 2 second yum/dnf response timeout, which should ensure clients receive their redirects without issue.
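As a rough illustration of the first conclusion, the per-worker throughput figures measured above can be turned into a back-of-the-envelope sizing calculation. This is only a sketch: it assumes throughput scales linearly with worker count, which held up to the 8 CPUs tested here.

```
import math

# Per-worker throughput observed in the results sections above.
REQS_PER_WORKER_CACHED = 625    # req / sec with Redis caching enabled
REQS_PER_WORKER_UNCACHED = 108  # req / sec with caching disabled


def workers_needed(target_rps, cached=True):
    """Gunicorn workers (each needing a dedicated CPU) for a target req / sec."""
    per_worker = REQS_PER_WORKER_CACHED if cached else REQS_PER_WORKER_UNCACHED
    return math.ceil(target_rps / per_worker)


# Example: sustaining 5000 redirects / sec.
print(workers_needed(5000))                # -> 8 workers with caching
print(workers_needed(5000, cached=False))  # -> 47 workers without caching
```

Note that a single Redis container handled about 3K req / sec at only 20% CPU, so Redis is unlikely to be the first bottleneck, but its CPU should still be watched at higher rates.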
### Appendix A: Test Setup

Sync a RHEL9 baseos repository with policy="immediate". To do this:

1. Create a debug CDN cert with [these instructions](https://source.redhat.com/groups/public/release-engineering/release_engineering_rcm_wiki/how_to_generate_cdn_debug_certs).
2. Create and sync the rhel9 baseos repository with these commands:

```
pulp rpm remote create --name rhel9_baseos --url https://cdn.redhat.com/content/dist/rhel9/9/x86_64/baseos/os/ --policy immediate --client-cert @cdn_1056.pem --client-key @cdn_1056.pem --ca-cert @redhat-uep.pem
pulp rpm repository create --name rhel9_baseos --autopublish --remote rhel9_baseos
pulp rpm distribution create --base-path rhel9_baseos --name rhel9_baseos --repository rhel9_baseos
pulp rpm repository sync --name rhel9_baseos
```

### Appendix B: Notes

#### Commands on the system under test

* Install Docker: https://docs.docker.com/engine/install/fedora/#install-using-the-repository
* `sudo dnf install docker-compose git`
* Set up Portainer via docker-compose.
* Set up Pulp via docker-compose.
* Install Minio and configure Pulp to talk to it. Make sure to also create the bucket for Pulp to use manually (see the sketch at the end of this appendix).

#### Commands on the load generator

* Install Docker: https://docs.docker.com/engine/install/fedora/#install-using-the-repository
* `sudo dnf install docker-compose`
* Set up Portainer via docker-compose.
* Set up Locust via docker-compose: https://docs.locust.io/en/stable/running-in-docker.html
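For the manual bucket-creation step under "Commands on the system under test" above, one option is the `minio` Python client. This is a sketch only; the endpoint, credentials, and bucket name are assumptions that must match your Minio deployment and the bucket configured in Pulp's storage settings.

```
from minio import Minio

# Assumed endpoint, credentials, and bucket name -- match these to your
# Minio container and to the bucket name configured in Pulp's settings.
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

if not client.bucket_exists("pulp3"):
    client.make_bucket("pulp3")
```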