# Pulp3 Deployment Considerations
# Current Usage
Pulp2 is used for a large deployment that serves content:
* Pulls down content from content sources, e.g. RH or other channels
* Snapshots content on a bi-weekly or weekly cadence
* Multiple PoPs
* Desiring to roll out RHEL 9
* Custom tooling to organize the repos and promotion using Pulp APIs
* Performs some quality checks, e.g. linting, signature checks, etc.
* Copies content between repos
* Uses rsync distributor to a webserver
# Goal
* Desiring the ability to have the content live natively in the cloud
* Having PoPs serve content when they are disconnected from the other PoPs
# Use Cases
## Snapshot Use Cases
* As a user I can …
* Define snapshots of Red Hat CDN content via console.redhat.com
* Easily connect systems to any snapshot with my existing RH credentials
## Point of Presence Use Cases
* As a user I can …
* Launch a point of presence (PoP) which will auto-register with console.redhat.com
* Configure the PoP to sync one or more c.rh.c snapshot repositories
* On Demand - Metadata only, binary data delivered as pull-through cache
* Full Sync - Metadata and binary data synchronized
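The two PoP sync modes above map to the remote's `policy` field. A pulp-cli sketch, assuming a configured CLI and a running Pulp server; the remote names and URL are placeholders, and exact flag spellings may vary by CLI version:

```shell
# metadata only -- binaries are fetched on first client request (placeholder names/URL)
pulp rpm remote create --name pop-ondemand \
  --url https://cdn.example.com/rhel9/baseos/ --policy on_demand

# full sync -- metadata and all binary data downloaded up front
pulp rpm remote create --name pop-full \
  --url https://cdn.example.com/rhel9/baseos/ --policy immediate
```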
# Operator notes
* Images: currently the operator makes use of the following images:
    * operator itself: https://quay.io/repository/pulp/pulp-operator
    * pulp application: https://quay.io/repository/pulp/pulp
    * reverse proxy: https://quay.io/repository/pulp/pulp-web
* You can use other application/webserver images by declaring `image` and `image_web` in the Pulp CR; those images can be built from: https://github.com/pulp/pulp-oci-images/tree/latest/images/pulp/stable
* The operator image itself can be built from the operator repo: https://github.com/pulp/pulp-operator/blob/main/Dockerfile
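A minimal sketch of overriding the images in the Pulp CR. The `apiVersion`, image references, and version fields here are assumptions/placeholders; check the operator docs linked below for the exact schema of your operator release:

```yaml
# hypothetical CR fragment -- verify apiVersion/fields against the operator docs
apiVersion: repo-manager.pulpproject.org/v1beta2
kind: Pulp
metadata:
  name: example-pulp
spec:
  # custom application image built from pulp-oci-images (placeholder)
  image: quay.io/myorg/pulp
  # custom reverse-proxy image (placeholder)
  image_web: quay.io/myorg/pulp-web
```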
* Operator RBAC:
* https://github.com/pulp/pulp-operator/blob/main/bundle/manifests/pulp-operator.clusterserviceversion.yaml#L1122
* Docs: https://docs.pulpproject.org/pulp_operator/
# Architecture
# Questions (Pulp)
* "quality check the package"
* Won't clash with existing package name
* has changelog
* signed with the correct key
* How much third-party content, custom content
* Mostly Red Hat
* What are the primary compose workflows
* Repos are managed as bundles, treated in a sense as immutable snapshots
* Only use newest versions, no "incremental update" with errata
* Is rollback an aspect of the Pulp3 feature set that is useful?
* Yes, but currently the Pulp2 distributor allowed them to publish a point in time. It did take a long time though
* If Pulp3 had an Rsync Exporter (like the Pulp2 rsync distributor) would you use that instead of launching a container based Pulp on the Pop?
* One is a push model, the other is a pull model
* Filesystem export + Rsync, or native Rsync
* What is the high-availability need?
# AI
* paul: Issue discovered: checksums of pulp_rpm repos aren't available for on_demand repos. Need a reproducer reported
* How to reproduce:
* Create a repo
* Create a remote
* Sync the repo with policy `on_demand`
* ...
* bmbouter: discuss with pulpcore if we can prioritize 1817 or 3155
* biggest issue is security - how do we make sure sensitive data is always censored appropriately
* bmbouter: to organize a cost estimate calculator
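The on_demand checksum reproducer steps above can be sketched with pulp-cli; names are placeholders, the fixture URL is a commonly used public Pulp test repo, and a configured CLI plus a running Pulp server are assumed:

```shell
# placeholder names; requires a configured pulp-cli and a running server
pulp rpm repository create --name ondemand-demo
pulp rpm remote create --name ondemand-remote \
  --url https://fixtures.pulpproject.org/rpm-unsigned/ \
  --policy on_demand
pulp rpm repository sync --repository ondemand-demo --remote ondemand-remote
# then inspect package checksums -- for on_demand content they were
# reported as unavailable, which is the bug being reproduced
```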
# Next
# March 20, 2024
Have a PR open to fix replication bug: https://github.com/pulp/pulpcore/pull/5140
# March 6, 2024
Need to prioritize https://github.com/pulp/pulpcore/issues/4637
# Nov 1
* pulpcon is next week Nov 6 - 9, [agenda here](https://hackmd.io/x9ojVHY4RzCr9_pOhMVlBw?view)
* slides for multi-geo pulpcon talk [here](https://hackmd.io/PV6aeDqjT06yn9ed63wwmQ)
* pulpcore 3.40 released
* now contains pulp_file (the plugin was merged into pulpcore)
* will require an update to any plugin for compatibility reasons, e.g. pulp_rpm
* upgrading should be done with a planned outage still, in the future it can be done online
* note the pulp-oci-images now runs the migrations as a separate container
* https://github.com/pulp/pulp-oci-images/blob/latest/images/compose/compose.yml#L34-L43
* what to do with a replica server that has had changes made to it?
* please file a bug on this and we'll look at it
# Oct 18
* Production is going well
* How to store some arbitrary data on a Repository, e.g. notes about a specific package being present in a repository
* recommendation: use the label API on a Repository and use NEVRA as the key and whatever needs to be stored as the value
* Telemetry Update
* Pulp tests are being merged: https://github.com/pulp/pulpcore/pull/4414
* OTEL upstream PR for aiohttp is near merging: https://github.com/open-telemetry/opentelemetry-python-contrib/pull/1800
* after merging ^ OTEL needs to release it, then Pulp needs to "enable" it in pulp-content, document it, and release it
* Pulpcon coming up Nov 6-9th
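The label recommendation above (NEVRA as key, note as value) can be sketched as follows. The endpoint path, credentials, and the label-key character restriction are assumptions, not confirmed API details; the actual update is a PATCH against the repository href:

```python
# Sketch of "store per-package notes on a Repository via pulp_labels".
# Label-key charset restrictions are an assumption -- check your Pulp
# version's validation rules before relying on this.
import json
import re

def nevra_label(nevra: str, note: str) -> dict:
    """Build a pulp_labels entry keyed by a (sanitized) NEVRA."""
    key = re.sub(r"[^A-Za-z0-9_-]", "_", nevra)  # assumed restricted charset
    return {key: note}

labels = {}
labels.update(nevra_label("bash-0:5.1.8-6.el9.x86_64", "pinned pending review"))
payload = {"pulp_labels": labels}
print(json.dumps(payload))
# The update itself would be a PATCH to the repository href, e.g.:
#   requests.patch(f"{BASE}/pulp/api/v3/repositories/rpm/rpm/{UUID}/",
#                  json=payload, auth=(USER, PASS))
```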
# Oct 4
* Went into production
* The replica workflow gets failed tasks and it's difficult to see from the task listing which objects failed to update
* next step: @chida to file the bug as a usability bug, and @dkliban will fix
* Want to have the '/' resource on the pulp service be a dynamic html page
* suggestion: use the reverse proxy to deliver that page
* How to configure authZ that is active directory or LDAP based?
* https://hackmd.io/ED9UpscNSRW86Le3xNzVeg
* https://docs.pulpproject.org/pulpcore/authentication/webserver.html
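One way to implement the reverse-proxy suggestion above: an nginx sketch that serves a static landing page at `/` while passing everything else to Pulp. Server name, paths, and upstream hostnames are placeholders; the ports are the Pulp service defaults:

```nginx
server {
    listen 443 ssl;
    server_name pulp.example.com;  # placeholder

    # serve a hand-maintained landing page at '/'
    location = / {
        root /var/www/pulp-landing;  # placeholder path
        try_files /index.html =404;
    }

    # everything else goes to the Pulp services (default ports)
    location /pulp/api/v3/ { proxy_pass http://pulp-api:24817; }
    location /pulp/content/ { proxy_pass http://pulp-content:24816; }
}
```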
# August 23
* Does Pulp have an LTS?
* No, but we do backport to some recent branches, which are [listed here](https://github.com/pulp/pulpcore/blob/main/template_config.yml#L21)
* How do we configure open telemetry support
* There are some docs here: https://docs.pulpproject.org/pulpcore/components.html#telemetry-support
* Also you can mimic what the otel profile in the dev environment does: https://github.com/pulp/oci_env/tree/main/profiles/opentelemetry_dev
* Replication is working nicely
* How do we enable the RBAC permissions to mimic the upstream ones?
* We currently don't support this, you can only do this with admin currently
* Maybe there is an upstream issue already filed for this?
* https://github.com/pulp/pulpcore/issues/3994
* Can I create an admin user via the API?
* Not currently
* @Chida will file an issue for this request
* Likely switching our Python runtime to 3.9
# August 9
* Clusters are spun up with testing and replication being used
* Having an issue with password rotation of the database
* password changes and Pulp needs to be restarted
* replication updates
* commands have been added to the CLI
* bugfixes for replication have been released, please let us know if anything else isn't working
* Need to revisit the OTEL work soon
# May 3
* Performance testing is showing that S3 object storage with PULP_REDIRECT_TO_OBJECT_STORAGE=False causes high memory and CPU usage relative to a clustered backend solution
* pulp_rpm == 3.20 to release today containing the replication bits. It'll be ready for testing
# April 19
* Experimenting with solutions to timeouts - is it S3 or not?
* Replica support for pulp_rpm should be released by early next week
* Metrics work ongoing
# March 16
* Pulp 3.23 released with replica support
* Metrics Work Ongoing
* What do we know we'll get?
* for pulp-content and pulp-api we'll get response status, url, and latency raw data
* What else would we want?
* for tasking we'll get a 1 second summary of:
* busy/free proportion
* top/sar style resource metrics like cpu usage, ram, network usage within the 1 second summary
* for tasking we'll also get event based metrics:
* task uuid, task start time, stop time, task name
# March 1
* Issue filed: https://github.com/pulp/pulpcore/issues/3621
* pulp-replica, when will it be released?
* goal: to be included in 3.23
* gave overview of domains
* they are interested in using domains; what happens today is lots of content comes in and sometimes it clashes, e.g. with NEVRA. Domains would solve this problem
* updates on the image tag changes that have been made
* use case: get secrets from KMS via a sidecar container that gets the secrets and loads a config map
* open telemetry update
* metrics and tracing are working well for pulp-api
* next step: add support for pulp-worker and pulp-content
* I'll record a youtube video showing off the tracing and metrics for pulp
* next time: let's discuss feedback / ideas on the metrics for Pulp to find what would be useful
# Feb 15
* updated replication demo with labels
* performance and scale testing blog post
* https://pulpproject.org/2023/02/14/rpm-redirect-serving-perf-scale-testing/
* starting on adding metrics
* https://github.com/pulp/pulpcore/issues/3445
# Feb 1
* issue from last week about 0 bytes returned from pulp-content app was legitimately a 0 byte package! So no issue there
* the yum/dnf timeout was increased from 2 seconds to 10 seconds. It was failing at 2 seconds, which for the occasional package was just a little too slow
* pulp-replica demo
* https://youtu.be/ehrd2kawmN0
* talked through the pulp_concurrent setting some
* It's concurrent TCP connections from 1 task
# Jan 18
* Need to have Pulp proxy the data from S3 because the clients can't reach S3, but they can reach Pulp
* suggestion: use the REDIRECT_TO_OBJECT_STORAGE feature
* https://docs.pulpproject.org/pulpcore/configuration/settings.html#redirect-to-object-storage
* Looking to do first prod deployment towards the end of the month
* FYI we provide pytest
* Open Telemetry
* PoC will be started soon
* Demo of CLI-based replication
* https://youtu.be/aIIgrNILNIk
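The REDIRECT_TO_OBJECT_STORAGE suggestion above corresponds to a settings toggle. A sketch of the relevant fragment of the Pulp settings file; the bucket name is a placeholder and the storage-backend lines are the usual S3 configuration, shown only for context:

```python
# /etc/pulp/settings.py (fragment)
# With redirect disabled, pulp-content streams bytes from S3 itself
# instead of issuing redirect URLs that clients cannot reach.
REDIRECT_TO_OBJECT_STORAGE = False
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "example-pulp-bucket"  # placeholder
```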
# Dec 14th
* Not an issue to continue using non-clustered Redis
* Status of deployment
* Integrated with additional orchestration/mush APIs
* Testing against the RHEL 9 release streams
* Open Telemetry
* kick off meeting minutes
* https://discourse.pulpproject.org/t/monitoring-telemetry-working-group/700/9
* Desired questions to answer:
* How much traffic are we getting on this endpoint?
* Is there a significant wait for clients to receive downloads?
* Am I denying requests from clients due to load?
* Cost Analysis
* desire: to have a cost estimator for running a Pulp installation on AWS in terms of infra and network storage + network delivery costs
# Nov 30th
* Issue discovered: checksums of pulp_rpm repos aren't available for on_demand repos. Need a reproducer reported
* Redis issue figured out: It was a clustered Redis install, but Pulp doesn't support clustered Redis
* <discourse link needed>
* Hard to tell when a sync task is on_demand versus immediate.
* known issue https://github.com/pulp/pulpcore/issues/1817
* Identified that FS Exports may be an option for their geo distribution
* problem is: it doesn't deduplicate RPMs and that's a lot...
* Difficult to know when querying a task about what the task is doing. Kind of the only thing to go on is the created resources, which don't get created until the end
* https://github.com/pulp/pulpcore/issues/3155
* idea: allow for querying through tasking back to repository attributes
# Nov 16th
* metrics are needed
* health metrics: requests / second, latency
* capacity metrics:
* interested in https://github.com/pulp/pulpcore/issues/3389
* interested in https://github.com/pulp/pulp-operator/issues/761
* AI: bmbouter to share the open telemetry working group
* k8s health checks
* This would be helpful: https://github.com/pulp/pulpcore/issues/2844
* here's the operator's use of existing health checks: https://github.com/pulp/pulp-oci-images/blob/latest/images/assets/readyz.py
* issues to report:
* AWS ElastiCache is not yet working, needs an issue filed
* Temp directories during RPM sync are getting to ~50 GB. This is unexpected and problematic, needs an issue filed so pulp dev team can try to reproduce
* might be this issue https://github.com/pulp/pulpcore/issues/1936
* multi-geo:
* not yet underway, still focused on getting the main pulp server productionized
* requested feedback on this brainstorm doc
* https://hackmd.io/isQ6Rf73Q56ucscoIbSNyw
* update: zero-downtime working group underway
# Nov 2 Updates
* Updates on the evaluation of the operator? still in progress
* Operator questions:
* Auto-scaling of pods for pulp-operator is not available as of today, but it is possible to scale up and down manually. It would be great to auto-scale pulp-worker pods based on the queue of waiting tasks
* go based docs: https://docs.pulpproject.org/pulp_operator/
* everything is namespaced in pulp-operator https://docs.pulpproject.org/pulp_operator/en/ansible/roles/pulp-api/#role-variables
* https://hackmd.io/SRZmd5L3SMWWyvvjHNE3rQ?view#Tuesday-November-8-User-day-2 talks about pulp-operator k8s deployments
* Chida's Installation Discussion and questions
* worker's heartbeat
* excessive mem usage during sync
* redis caching
# Oct 12th Updates
* What permissions does the operator need?
* https://github.com/pulp/pulp-operator/blob/main/bundle/manifests/pulp-operator.clusterserviceversion.yaml#L1122
* zero-downtime is a concern
* https://discourse.pulpproject.org/t/support-zero-downtime-updates/645
* https://app.element.io/?updated=1.11.5#/room/#pulp:matrix.org
* https://pulpproject.org/help/#chat-to-us
# Sept 28th Updates
* general updates
* Propose we shorten to a 30-minute call every 2 weeks
* [question from pulp operator team] Could you share some more details about the permissions that k8s operators typically require that are not acceptable for your environment?
* it downloads a lot of untrusted assets, but that could be gotten around by pointing to your own registry
* the permissions would need a more specific review
* they mostly use helm charts today
* uncomfortable with the API access to k8s itself because the "deployer" here is not the admin, they are general users
* we should be offering a helm chart
* Documented the dockerfile, see updates here https://docs.pulpproject.org/pulp_oci_images/
* Two upcoming goals (likely):
* combine the single container and the operator images to have one set of technology
* produce an operations manual
* pulpcon coming up Nov 7-12: CFP is open until Monday, we'd welcome any talk about how Pulp3 is being used
* https://discourse.pulpproject.org/t/pulpcon-2022-call-for-proposals/590
* Some not yet posted talks:
* using the operator
* running pulp in containers without an operator
* operations guide for pulp
# Sept 14th Updates
* Using single container to pull RHEL content
* Deployed on AWS and using AWS RDS as the db backend for it
* Having some issues with running on k8s
* Enjoying the ability to associate a repo version with a distribution
* improvement from pulp2
* next step: try to use the pulp_installer roles to build a container
* need identified: desire a dockerfile that we would share
### HCaaS open questions
* privacy - some content is deeply sensitive, questions about which systems are allowed to touch it, which systems are allowed to store it
* third party (potentially licensed) content, e.g. content from VMWare, Nvidia
* reliability - cybersecurity is critical, updates are critical, the infra must be available, SLAs are important
* quality checking (as described earlier) - where would custom, organization-defined quality checks fit into a hosted service model
# AI
* bmbouter to share dockerfile
* Investigate container privileges - running without root
* tiho to setup followup time to explore use cases and operational needs for a SaaS model