
Pulp content serving Performance Test

Goals

  1. Characterize the relationship between sustained requests / sec to the pulp-content app and the number of pulp-content apps without increases in latency.

  2. Identify rate limiting components in various test scenarios.

System Under Test

Physical hardware, all in EU-West-2:

1 Load generator - m4.2xlarge instance:

  • 8 64-bit CPUs, Intel® Xeon® Platinum 8259CL CPU @ 2.50GHz
  • 32 GB RAM

1 Pulp test machine - m4.2xlarge instance:

  • 8 64-bit CPUs, Intel® Xeon® Platinum 8259CL CPU @ 2.50GHz
  • 32 GB RAM

Minio provides artifact storage and also runs on the system under test.

Software Under Test

Pulp will be deployed as containers on the "Pulp Test Machine" hardware above:

  • 1 container for pulp-api
  • N containers for pulp-content
  • 2 containers for pulp-worker
  • 1 redis container
  • 1 postgresql container

Pulp will be configured to use Minio, which is deployed on the same machine as Pulp. The artifacts aren't actually going to be served during testing, so Minio is expected to use a minimal amount of hardware resources on the system under test.
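A minimal sketch of the relevant Pulp settings for this wiring, using django-storages' S3 backend; the endpoint, bucket, and credential values are placeholders, and exact setting names depend on the pulpcore/django-storages versions deployed:

```python
# Sketch of Pulp settings pointing artifact storage at Minio.
# All values below are placeholders, not the ones used in the test.
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_ACCESS_KEY_ID = "minio-access-key"          # placeholder
AWS_SECRET_ACCESS_KEY = "minio-secret-key"      # placeholder
AWS_STORAGE_BUCKET_NAME = "pulp3"               # bucket must be created manually
AWS_S3_ENDPOINT_URL = "http://localhost:9000"   # Minio on the same machine
AWS_DEFAULT_ACL = None

# Content-app caching, toggled between scenarios.
CACHE_ENABLED = True
REDIS_HOST = "localhost"
REDIS_PORT = 6379
```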

The versions under test are pulpcore 3.22.0 and pulp_rpm 3.18.9.

Architecture

Components:

  • The load generator (locust with fasthttp backend)
  • The pulp-content processes

Locust -HTTP-Requests-> pulpcore-content
Locust <-302-Redirect- pulpcore-content

Locust will not follow the redirect. Minio is not the software under test, only pulp-content is.

The reverse proxy is entirely bypassed as we do not want to test nginx, only Pulp.

Locust configuration

Locust is deployed with 16 workers on the load generator machine.

The URLs of the RHEL9 BaseOS packages are gathered into a text file using this script:

import argparse
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def process_page(url):
    # Fetch a directory-listing page and walk its links.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    links = soup.find_all("a")
    for link in links:
        # Skip the parent-directory link to avoid walking back up the tree.
        if link.text == "../":
            continue
        link_url = urljoin(url, link["href"])
        if link_url.endswith(".rpm"):
            # Package found: emit its full URL.
            print(link_url)
        elif "://" in link_url:
            # Anything else that resolved to an absolute URL is treated
            # as a sub-directory and recursed into.
            process_page(link_url)

parser = argparse.ArgumentParser(description='Get the full URLs of .rpm files at a specific url.')
parser.add_argument('base_url', help='The base url to fetch links from')
args = parser.parse_args()

process_page(args.base_url)

Here is the locustfile.py:

import random
import re

from locust import FastHttpUser, task


NUM_CONTENT_APPS = 2

START_PORT = 24737
END_PORT = START_PORT + NUM_CONTENT_APPS


class QuickstartUser(FastHttpUser):

    def on_start(self):
        # splitlines() strips the trailing newlines that readlines()
        # would leave in each URL.
        with open("/mnt/locust/urls.txt") as file:
            self.urls = file.read().splitlines()

    @task
    def random_rpm(self):
        # Pick a random package and rewrite its port so requests are
        # spread evenly across the N pulp-content processes.
        original_url = random.choice(self.urls)
        new_port = str(random.randint(START_PORT, END_PORT - 1))
        new_url = re.sub(r':\d+', ':' + new_port, original_url)
        # Do not follow the 302; Minio is not under test.
        self.client.get(
            new_url,
            allow_redirects=False
        )
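The port-rewriting step above can be exercised on its own. A small sketch (the helper name and example URL are mine, not part of the locustfile):

```python
import re

def rewrite_port(url, new_port):
    # Replace ":<digits>" in the URL with the chosen port -- the same
    # substitution the locustfile performs for every request.
    return re.sub(r':\d+', ':' + str(new_port), url)

print(rewrite_port("http://pulp.example.com:24737/Packages/b/bash.rpm", 24738))
# → http://pulp.example.com:24738/Packages/b/bash.rpm
```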

Metrics

  • sustained requests / sec - The number of redirects issued per second.
  • latency median - Pulp's most typical response time.
  • latency 95th percentile paired with the requests / sec - Pulp's majority-case response time.
  • redis container CPU percentage - An indicator of how heavily loaded Redis is.
  • PostgreSQL process count - An indicator of how heavily loaded PostgreSQL is; it forks more workers when processing more load.
  • PostgreSQL average CPU usage across all PostgreSQL processes - An indicator of how heavily loaded PostgreSQL is.
  • Number of pulp-content Gunicorn workers, i.e. how many processes Gunicorn actually forked - Indicates the actual number of processes handling requests.
  • Average pulp-content Gunicorn worker CPU, averaged across all the pulp-content Gunicorn workers - An indicator of how heavily loaded the pulp-content processes are.
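Most of the container metrics above can be sampled with `docker stats`; a sketch of one way to collect per-container CPU percentages (the helper names are mine):

```python
import subprocess

def parse_stats(output):
    # Each line is "name cpu%", e.g. "pulp_redis_1 20.35%".
    return {name: float(cpu.rstrip("%"))
            for name, cpu in (line.split() for line in output.splitlines())}

def sample_cpu_percentages():
    # One-shot sample of current per-container CPU usage via the
    # Go-template format, which prints one "name cpu%" pair per line.
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}} {{.CPUPerc}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_stats(out)
```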

Scenarios

For each scenario, increase the number of Locust users until the pulp-content app no longer shows increases in req/sec with additional Locust users.
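The stopping rule can be sketched as a small check; the 2% improvement threshold below is my own choice, not anything used in the actual runs:

```python
def has_plateaued(throughputs, tolerance=0.02):
    # throughputs: req/sec measured at successive Locust user counts.
    # The ramp is considered finished when the latest step improved
    # throughput by less than `tolerance` (2% by default).
    if len(throughputs) < 2:
        return False
    prev, last = throughputs[-2], throughputs[-1]
    return last < prev * (1 + tolerance)
```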

  1. 1 pulp-content process (2 gunicorn processes when heavily loaded). Redis caching enabled.
  2. 2 pulp-content processes (4 gunicorn processes when heavily loaded). Redis caching enabled.
  3. 4 pulp-content processes (8 gunicorn processes when heavily loaded). Redis caching enabled. Eight matches the number of CPUs on the machine.
  4. 2 pulp-content processes (4 gunicorn processes when heavily loaded). Redis caching disabled. This is useful for comparing against scenario (2) to determine the impact of caching.

Caching Enabled Results

  1. Each pulp-content gunicorn worker process having exclusive access to an idle 2.50GHz Xeon CPU can serve 625 req / sec.
  2. The pulp-content gunicorn worker processes use a stable 16MB of memory each.
  3. The pulp-content gunicorn worker is CPU bound with CPU directly correlating to req / sec.
  4. A single container Redis was able to serve about 3K req / sec with 20% CPU in use.
  5. PostgreSQL was almost entirely unloaded; it never spawned more than 1 process and never even registered on top during any caching-enabled run.
  6. The maximum median response time in almost all runs was 25ms.
  7. The 95th percentile response time in almost all runs was 70ms.

Caching Disabled Results

  1. Each pulp-content gunicorn worker process having exclusive access to an idle 2.50GHz Xeon CPU can serve 108 req / sec.
  2. The pulp-content gunicorn worker processes use a stable 16MB of memory each.
  3. The pulp-content gunicorn worker is roughly 3/4 CPU bound and 1/4 database bound, with both resources closely correlating to req / sec.
  4. PostgreSQL forked additional worker processes, which began to show load.
  5. The maximum median response time in almost all runs was 16ms.
  6. The 95th percentile response time in almost all runs was 28ms.
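Comparing the per-worker numbers from the two result sections gives the effect of caching on redirect throughput; a quick check:

```python
cached_rps = 625    # per worker, Redis caching enabled
uncached_rps = 108  # per worker, Redis caching disabled
speedup = cached_rps / uncached_rps
print(round(speedup, 1))  # → 5.8, i.e. ~5.8x more redirects per worker with caching
```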

Conclusions

  1. With Redis caching, a Pulp cluster should be able to scale horizontally to meet most if not all req / sec rates.
    • To do so, just add more CPUs for additional pulp-content processes.
    • Make sure your Redis also doesn't run out of CPU; you could horizontally scale this component too.
  2. The response times are well within the 2-second yum/dnf response timeout, so clients should receive their redirects without any issue.

Appendix A: Test Setup

Sync a RHEL9 BaseOS repository with policy="immediate". To do this:

  1. Create a debug CDN cert with these instructions.

  2. Create and sync rhel9 baseos with these commands

pulp rpm remote create --name rhel9_baseos --url https://cdn.redhat.com/content/dist/rhel9/9/x86_64/baseos/os/ --policy immediate --client-cert @cdn_1056.pem --client-key @cdn_1056.pem --ca-cert @redhat-uep.pem
pulp rpm repository create --name rhel9_baseos --autopublish --remote rhel9_baseos
pulp rpm distribution create --base-path rhel9_baseos --name rhel9_baseos --repository rhel9_baseos
pulp rpm repository sync --name rhel9_baseos

Appendix B: Notes

Commands on system under test

Install docker https://docs.docker.com/engine/install/fedora/#install-using-the-repository
sudo dnf install docker-compose git
setup portainer via docker-compose
setup pulp via docker compose
install minio and configure pulp to talk to it
Make sure to create the bucket manually for pulp to use also

Commands on load generator

Install docker https://docs.docker.com/engine/install/fedora/#install-using-the-repository
sudo dnf install docker-compose
setup portainer via docker-compose
setup locust via docker-compose https://docs.locust.io/en/stable/running-in-docker.html
