# OpenTelemetry Working Group
See also: https://discourse.pulpproject.org/t/monitoring-telemetry-working-group/700/8
## Overview
* Purpose: Brainstorm/design how to make Pulp3 more monitor-able. Prioritize the kinds of monitoring that would be useful to administrators.
* Attendees: Pulp dev-team and any interested Pulp administrators
## Template
```
## YYYY-MM-DD 1300-1330 GMT-5
### Attendees:
### Regrets:
### Agenda:
* Previous AIs:
* (insert topics here)
### Notes
### Action Items:
* add notes to discourse
```
## 2023-05-11 1400-1430 GMT-3
### Attendees: decko, bmbouter, dalley
### Regrets:
### Agenda:
* PR merging Otel dependencies and WSGI instrumentation - Done
* PR for opentelemetry-instrumentation-aiohttp-server. TBD
* We broke pulpcore CI!
* Need to create some kind of exception mechanism to allow whitelisting dependencies???
* https://github.com/pulp/plugin_template/blob/main/templates/github/.ci/scripts/check_requirements.py.j2
### Notes
* We need to add the OpenTelemetry libs as exceptions in the check_requirements script for the CI
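A hedged sketch of what that exception mechanism could look like; the allowlist and helper names are assumptions about check_requirements.py's structure, not its actual code:

```python
# Hypothetical addition to check_requirements.py: exempt OpenTelemetry
# packages from the dependency checks that broke CI.
ALLOWED_PREFIXES = ("opentelemetry-",)  # assumed allowlist, name invented

def is_allowed_exception(package_name: str) -> bool:
    """True if this dependency should skip the usual requirement checks."""
    return package_name.lower().startswith(ALLOWED_PREFIXES)

# ...then, inside the script's existing per-requirement loop (assumed):
#     if is_allowed_exception(name):
#         continue
```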
### Action Items:
* [decko] add notes to discourse
* [decko] Merge Otel oci-env profile
* [decko] Push the opentelemetry aiohttp-server PR
* [bmbouter] Fix the plugin_template's check_requirements.py
* [bmbouter] Add worker telemetry
## 2023-05-04 1300-1330 GMT-4
### Attendees: decko, bmbouter, ggainey, dalley
### Regrets:
### Agenda:
* Previous AIs:
### Notes
* "what does 'done' looks like?"
* complete aiohttp-instr-PR
* base-version of grafana panels done
* make a decision on how the packaging is going to work
* record two demo videos (jaeger/grafana)
* user-documentation
* pulpcore-docs?
* pulp-oci-images-docs?
* pulp-to-collector overview/connection
* "maybe" otel-cfg we're using in oci-env?
* blog-post - can be ultra-specific, if/as desired
* api, pulp-specific instr, aiohttp, task-system
* api is in a great spot
* aiohttp is pretty good, but we really want it released upstream to consume
* nothing for tasking yet - phase-2 perhaps?
* pulp-specific hooks - phase-3?
* decko: need to have solid decision on packaging-discussion
* What do we need to have OpenTelemetry working with Pulp?
* Collector container
* Bunch of dependencies
* opentelemetry-api
* opentelemetry-distro[otlp]
* opentelemetry-exporter-otlp-proto-http
* opentelemetry-exporter-otlp
* opentelemetry-instrumentation-wsgi
* opentelemetry-instrumentation-django
* opentelemetry-semantic-conventions
* opentelemetry-proto
* opentelemetry-sdk
* Instrumenting WSGI entrypoint (wsgi.py); see the sketch after this list
* The Faster Release Cycle...
* "We can undo it later if needed"
* https://github.com/pulp/pulpcore/pull/3632
* we could accept the import of `opentelemetry.instrumentation.wsgi` **today** (it exists)
* the import of `opentelemetry.instrumentation.aiohttp_server` needs work that isn't released in that upstream yet
* what can we do while waiting (if we have to)?
* fork the project? <== NO
* depend on the source-checkout? <== better
* vendor the PR?
* we'll decide when we've done *everything else* we can do
* Decision:
* split aiohttp-mods out of current core-PR
* add trimmed dep-list
* release
* aiohttp-into-core gets its own PR
* requirements.txt relies on src-checkout
* merges once upstream accepts the needed changes
* bring this to pulpcore-mtg next week
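For reference, a minimal sketch of the WSGI-entrypoint instrumentation discussed above, using the released opentelemetry-instrumentation-wsgi middleware (pulpcore's actual wsgi.py wiring may differ):

```python
# wsgi.py sketch: wrap the Django app so each request gets a span plus
# the standard HTTP metrics from the semantic conventions.
from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware

application = OpenTelemetryMiddleware(get_wsgi_application())
```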
### Action Items:
* add notes to discourse
* schedule next mtg: The Road To Merge
## 2023-04-27 1300-1330 GMT-5
### Attendees: decko, bmbouter, dralley, ggainey
### Regrets:
### Agenda:
* Previous AIs:
### Notes
* decko showed off his progress
* discussion around what can we do further?
* can we set things up to pre-load Grafana dashboard from prepared JSON, at start-time
* what are, say, "4 things we want in a Grafana dashboard"?
* content-app
* response-codes over time
* organize by class (400/500/OK)?
* show my 500s only?
* req-latency
* avg
* P95
* P99
* "cost" items
* how many bytes have been served
* 202 is diff than 301
* can we gather metrics **per-domain**?
* where do/can we record that?
* /pulp/content/DOMAIN/content-URL
* upstream (aiohttp) vs downstream (in-pulp)
* can we attach this data as a header to the request, and record that header? (see the sketch after this list)
* proposal: 'launch' w/ response-codes/latency, in pretty "basic" graphs, preloaded into oci-env profile
* discussion: what needs to happen to get aiohttp-instrumentation-PR merged?
* open a new PR from decko's branch w/ orig commits against aiohttp repo?
* (remember, based on https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942)
* https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1714
* discussion (brief) around a "pulp_telemetry" 'shim'
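One way the per-domain question above could be prototyped: a custom aiohttp middleware that parses the domain out of the content path and attaches it to the active span. The attribute name pulp.domain and the exact path layout are illustrative assumptions:

```python
# Illustrative sketch only: record a per-domain attribute on the span.
from aiohttp import web
from opentelemetry import trace

@web.middleware
async def domain_middleware(request, handler):
    # Content URLs look like /pulp/content/<domain>/<path...>
    parts = request.path.split("/")
    if len(parts) > 3 and parts[1:3] == ["pulp", "content"]:
        # "pulp.domain" is a made-up attribute name for this sketch
        trace.get_current_span().set_attribute("pulp.domain", parts[3])
    return await handler(request)
```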
### Action Items:
* add notes to discourse
## 2023-04-20 1300-1330 GMT-5
### Attendees: decko, dralley, ggainey
### Regrets:
### Agenda:
* Previous AIs:
* AI: ggainey to start using oci_env profile PR for this
* https://github.com/pulp/oci_env/pull/98
* no progress to report
* ~~AI: review https://github.com/pulp/pulp-oci-images/pull/469~~
* merged
* Tabled for later investigation:
* AI: decko to add what is missing from the #pulpcore/3632
* AI: [decko] get tests running locally to see why docker-side fails when podman-side succeeds
* still can't figure out why docker "occasionally" fails
### Notes
* see instructions in the profile-readme in the oci-env PR for a how-to
### Action Items:
* add notes to discourse
## 2023-04-13 1300-1330 GMT-5
### Attendees: ggainey, decko
### Regrets:
### Agenda:
*
### Notes
* review/discussion of some test failures
* current kind-of-a-plan for aiohttp-metrics-work
* move tests to pytest, get them running clean (#soon)
* add metrics taking advantage of this fork
* think on what tests we prob should have in addition, write them, get them running clean
* submit otel-aiohttp PR upstream
* continue adding metrics to "our" fork independently
* AI: ggainey to start using oci_env profile PR for this
* https://github.com/pulp/oci_env/pull/98
* AI: [decko] get tests running locally to see why docker-side fails when podman-side succeeds
* AI: review https://github.com/pulp/pulp-oci-images/pull/469
* AI: decko to add what is missing from the #pulpcore/3632
### Action Items:
* [ggainey] to start using oci_env profile PR for this
* [decko] get tests running locally to see why docker-side fails when podman-side succeeds
* [any] review https://github.com/pulp/pulp-oci-images/pull/469
* [decko] to add what is missing from the #pulpcore/3632
* [ggainey] schedule next mtg for next week
* [ggainey] add notes to discourse
## 2023-04-06 1300-1330 GMT-5
### Attendees: decko, dalley, ggainey
### Regrets:
### Agenda:
* Previous AIs:
### Notes
* PRs are in-progress, to be submitted this week
* HMS mtg taught some things we'll prob steal :)
* discussion around a plugin-approach to making otel available
* maybe just hooks in core, that do nothing w/out pulp_otel installed?
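A minimal sketch of that hooks idea; pulp_otel is hypothetical and record_telemetry is an invented hook name:

```python
# In pulpcore: call a hook that silently no-ops unless the (hypothetical)
# pulp_otel package is installed.
try:
    from pulp_otel import record_telemetry  # hypothetical plugin package
except ImportError:
    def record_telemetry(*args, **kwargs):
        pass  # telemetry plugin absent: do nothing
```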
### Action Items:
* add notes to discourse
## 2023-03-31 1330-1400 GMT-5
### Attendees: dalley, decko, ggainey
### Regrets:
### Agenda:
* Previous AIs:
* AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
* There exists a Red Hat Observability CoP!
* https://source.redhat.com/groups/public/program_observability
### Notes
* pulp-content/aiohttp instrumentation demo from decko
* traces working, still trying to get metrics
* things in flight
* aiohttp package w/ instrumentation
* metrics labels
* instrumenting workers
* getting pulp-api metrics, but not from pulp-content, need to understand why
* A Plan:
* finish oci-env profile for otel
* figure out why we're not getting some wsgi-instr labels
* https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md
* get a working aiohttp-instr PR submitted (based on the work of the existing 'abandoned' PR)
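For context while chasing the missing labels, a minimal programmatic metrics pipeline to an OTel collector over OTLP/HTTP looks roughly like this (the oci-env profile presumably does the equivalent via environment variables):

```python
# Sketch: ship metrics to a local collector; the endpoint is the default
# OTLP/HTTP port, adjust for the actual oci-env setup.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
```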
### Action Items:
* add notes to discourse
* ggainey to sched next mtg for next Thurs
## 2023-03-22 1330-1400 GMT-4
### Attendees: decko, jsherrill, dralley, ggainey
### Regrets:
### Agenda:
* jsherrill to show us what his team is doing w/ monitoring/metrics
* NOTES
* might be "some" guidance available
* we're all making this up as we go along
* http status/latency
* msg-latency/error-rate (like tasks?)
* some "analytics" info mixed in
* grafana dashboard to visualize
* started from a template
* export JSON in order to import into app
* PR for visualization changes is currently "exciting"
* having a full-time data visualization expert would be a Good Thing
* discussion around SLOs (uptime/breach rules/alerting)
* "best practices" still "up in the air"?
* there are tests for alerts
* **review your output** - sometimes, there are bugs
* QUESTIONS
* app implements a /metrics endpoint
* gathered metrics thrown into prometheus
* How are metrics produced to make them available to /metrics?
* prometheus-client in go does the heavy lifting (see the sketch after this list)
* AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
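jsherrill's stack is Go, but the same /metrics pattern in Python with prometheus-client would be roughly as follows (the metric name is invented for illustration):

```python
# Rough Python analogue of the Go prometheus-client setup described above.
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("app_requests_total", "Requests served", ["status"])

start_http_server(8000)              # exposes /metrics for Prometheus to scrape
REQUESTS.labels(status="200").inc()  # called wherever a request is handled
```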
### Action Items:
* AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
* add notes to discourse
* schedule mtg for next week
## 2023-03-16 1400-1500 GMT-4
### Attendees: ggainey, dalley, decko, bmbouter
### Regrets:
### Agenda:
* Previous AIs:
* oci-env/otel work to continue to completion
* decko to go from the above, to instrumenting workers
* aiohttp/asgi auto-instrument bakeoff
### Notes
* discussion RE decko's experiences
* worker-trace-example! Woo!
* looking at metrics in grafana - double wooo!
* How do we get actual-services-admins involved in setting up kinds-of metrics/visualizations
* AI: [ggainey] invite jsherrill to come demo their Grafana env for us
* What would be nice:
* Docs written from an Operational perspective
* "Here's a Thing you want to know, here are graphs that will help you answer it"
* Example: "Is pulp serving content correctly? - visualize content-app status codes"
* Next-steps sequence
* finish oci-env profile
* start working on some "standard" graphs
* work on how-to docs
* work on demos
* how can we merge better w/ pulpcore?
* right way to merge new libs to project?
* responding to various installation-scenarios
* discussion
* single-container - s6-svc
* what if users don't want to spin up otel? What happens to the app?
* pulp-otel-enabled variable - default to False
* what does that mean?
* does **not** mean that otel-libs aren't installed (are they direct-deps or not? will be incl in img regardless)
* multiprocess container - there's another svc running
* docs should call out/link to docs RE feature-flip vars for the auto-instr libs
* able to toggle collect-data or not, for various auto-instr pieces
* example: https://opentelemetry-python.readthedocs.io/en/latest/examples/django/README.html#disabling-django-instrumentation
* "direct dependency vs not" discussion
* if it is, you can't *uninstall* it
* not *everything* has to be a hard-dep
* maybe start with not-required
* prioritizing aiohttp server PR might be worthwhile
* acceptance is out of our control
* will take more time to get an aiohttp-lib w/ the support released
* auto-instr pkgs aren't going to include correlation-id-support for pulp's cids
* look at (eg) https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/instrumentation/opentelemetry-instrumentation-wsgi/src/opentelemetry/instrumentation/wsgi/__init__.py#L85
* wsgi and aiohttp should be enhanced this way (see the sketch after this list)
* what's the relationship between trace-id and cid? Can we make them the same?
* spans might end up w/ dup ids? - Prob OK
* need to experiment/investigate
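A sketch of the hook enhancement discussed above: opentelemetry-instrumentation-wsgi already accepts a request_hook callable, which could copy Pulp's Correlation-ID header onto each span. The attribute name pulp.cid is invented for illustration:

```python
# Sketch: surface Pulp's correlation id (cid) on every request span.
from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware

def add_cid(span, environ):
    # WSGI surfaces a Correlation-ID header as HTTP_CORRELATION_ID.
    cid = environ.get("HTTP_CORRELATION_ID")
    if cid:
        span.set_attribute("pulp.cid", cid)  # attribute name invented

application = OpenTelemetryMiddleware(get_wsgi_application(), request_hook=add_cid)
```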
### Action Items:
* AI: [ggainey] invite jsherrill to come demo their Grafana env for us
* add notes to discourse
## 2023-03-09 1400-1500 GMT-5
### Attendees: ggainey, decko, dralley, bmbouter
### Regrets:
### Agenda:
* Previous AIs:
* bmbouter to do little perf-test
* bmbouter/decko to work together to get oci-env to run otel setup
* decko to go from the above, to instrumenting workers
* aiohttp/asgi auto-instrument bakeoff
* ggainey to sched for an hour next Thurs
* (insert topics here)
### Notes
* some data on performance-impact of traces : https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1556
* discussion ensues
* maybe we want to test more?
* if our worker span instrumentation is "feature-flippable", we're implying a direct dependency on otel being packaged
* prob want to discuss at pulpcore mtgs
* oci-env work in progress
* really close to having an otel-env
* work continues
* how's the worker-instrumentation going to work?
* can we get a span that covers creation-to-end?
* dispatch-to-start is a good thing to know
* just span for run-to-complete is "easy"
* can we use the correlation-id as a span-id?
* as opposed to task-uuid?
* BUT - think about dispatching-a-task-group
* what about **metrics** (as opposed to spans)? (see the sketch after this list)
* auto-instrumentation setup has its own metrics
* are there Things we'd like to add to our code specifically?
* for tasking system, almost certainly
* per-worker metric(s)
* fail-rate
* task-throughput
* what happens when workers go-away?
* attach metrics to worker-names?
* missing-worker-events
* interpretation is key
* system-metrics as a whole
* wait-q-size
* waiting-lock-evaluation ("concurrency opportunity")
* ratio tasks/possible_concurrency
* discussion around how workers dispatch themselves
* thinking like an admin
* do I have too much hardware in use?
* not enough?
* how do I know "something is going wrong"?
* "service-level-indicator": how much time does a task wait before start
* "possible concurrency": how many *could* start, assuming enough workers?
* "utilization": what percentage of workers are "typically" busy?
* https://www.brendangregg.com/usemethod.html
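If hand-rolled tasking metrics do get added, the OpenTelemetry metrics API side is small; a sketch in which every instrument name is a placeholder, not an agreed convention:

```python
# Placeholder tasking metrics: names/attributes are not agreed conventions.
from opentelemetry import metrics

meter = metrics.get_meter("pulp.tasking")
tasks_completed = meter.create_counter(
    "pulp.tasks.completed", description="Tasks finished, by outcome and worker"
)
task_wait_time = meter.create_histogram(
    "pulp.tasks.wait_time", unit="s",
    description="Time between task dispatch and start (the SLI above)",
)

# e.g. in the worker when a task finishes (worker_name is whatever
# identifier the worker already carries):
# tasks_completed.add(1, {"outcome": "failed", "worker": worker_name})
```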
### Action Items:
* oci-env/otel work to continue to completion
* decko to go from the above, to instrumenting workers
* aiohttp/asgi auto-instrument bakeoff
* add notes to discourse
* ggainey to schedule next for one week out
## 2023-03-02 1000-1030 GMT-5
### Attendees: ggainey, bmbouter, decko, dralley
### Regrets:
### Agenda:
### Notes
* updates
* experimented using wsgi_autoinstrumentation
* works better than django-autoinstr
* correctly nested/subspanned things like postgres-spans
* is there any reason to use django-auto? (see the sketch after this list)
* look at their issues, maybe there's a known prob
* we don't know of anything we're missing
* maybe compare the two codebases?
* how do **metrics** compare to tracing output?
* how will we add this into our container?
* optional vs non-optional dependencies?
* need to id what the new dependencies are?
* what's the perf-overhead if you're **not** gathering trace-info (if any)
* dralley: there is perf-overhead when tracing
* bmbouter: is there a perf-impact when you're not collecting telemetry-output
* what do **metrics** look like (as opposed to tracing)
* bmbouter has gotten Pulp reporting metrics to an otel-container, and then shipping those to Prometheus
* next step is visualizing in grafana
* what are "the right" metrics?
* discussion around oci-env work/changes to support
* Prio #1: get oci-env profile in place to support otel
* needs mikedep's PR for [oci-images #449](https://github.com/pulp/pulp-oci-images/issues/449) to be merged for our images
* discussion around django-prometheus
* no traces, just metrics
* is this maybe a path to be getting insight/inspiration for metrics?
* https://github.com/korfuri/django-prometheus
* discussion around current-monitoring-use by an actual (large) user of OTel
* detailed metrics-discussion w/ this user on 15-MAR
* bmbouter plans to have a demo available prior
* discussion around asgi-otel-instrumentation
* decko/bmbouter to have a deeper discussion
* links
* django-ai-code: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-django
* wsgi-ai-code: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-wsgi
* django-prometheus: https://github.com/korfuri/django-prometheus
* otel-asgi-instr: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-asgi
* https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942/files#diff-e2884db0811036aea22f73ead6dc004e9a27d8d3a8de9a4696f1b1327030af61R135-R156
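For the wsgi-vs-django comparison above, the two entry points side by side (both imports are real, from the linked repos; in practice you would pick one):

```python
# The two auto-instrumentation styles being compared above.
from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware

# django-auto: patches Django's request handling globally.
DjangoInstrumentor().instrument()

# wsgi-auto: explicitly wraps the WSGI callable instead.
application = OpenTelemetryMiddleware(get_wsgi_application())
```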
### Action Items:
* bmbouter to do little perf-test
* bmbouter/decko to work together to get oci-env to run otel setup
* decko to go from the above, to instrumenting workers
* aiohttp/asgi auto-instrument bakeoff
* ggainey to sched for an hour next Thurs
## 2023-02-23 1000-1030 GMT-5
### Attendees: ggainey, bmbouter, decko
### Regrets:
### Agenda:
* Previous AIs:
* (insert topics here)
### Notes
* bmbouter reports
* got django auto-implementation running in api-workers
* importing into Jaeger
* configured oci-env w/ OT visualization
* Let's think about what we *really* want to get out of this effort?
* AI-all for next mtg
* Notes from user-discussions
* users need to be able to turn it off and on
* practical problems
* have otel be its own oci-env profile
* loads Jaeger as a side-container
* base img needs a way to turn OT off and on
* what if we had an instrumented img?
* leads to combinatorics-fun
* users want their own imgs
* what if there was an "instrument this" env-var? (OTEL_ENABLED; see the sketch after this list)
* otel-pieces installed always, just not always "on" unless asked for via this var
* let cfg run via env-vars
* "Here's the OTEL docs, use their env-vars to control behavior"
* https://www.aspecto.io/blog/opentelemetry-collector-guide/
* discussion of "direct to collector" vs "agent to collector" architectures
* allows batching, allows data-transformation, allows redaction
* open question: how should a pulp dev-env be configured?
* example "interesting" metric : https://github.com/pulp/pulpcore/issues/3389
* let's think about the specific packages that might need "RPM-izing"
* may already be RPM'd in Fedora (maybe RHEL?)
* next step:
* instrumenting workers (tasking-system)
* makes optional-otel-dependency more problematic
* what about aiohttp-server side? (content-svc)
* https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942
* maybe we just manually instrument?
* puts more burden on plugin-writers
* decko maybe picks up aiohttpserver tracing?
* prob going to meet once/wk
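A sketch of the OTEL_ENABLED idea floated above (the variable name and exact wiring are still to be agreed):

```python
# Libraries always installed; instrumentation only activates on request.
import os

if os.getenv("OTEL_ENABLED", "").lower() in ("1", "true", "yes"):
    from opentelemetry.instrumentation.django import DjangoInstrumentor
    DjangoInstrumentor().instrument()
```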
### Action Items:
* add notes to discourse
## 2023-01-11 1100-1130 GMT-5
### Attendees: dalley, ggainey, bmbouter, ipanova, wibbit
### Regrets:
### Agenda:
* Previous AIs:
* ~~AI: [ggainey] add to team agenda for 12-DEC~~
* ~~AI: [ggainey] get this on 3-month planning doc for Q1~~
* ~~AI: [ggainey] to set up another 30-min mtg for 2nd week Jan~~
* 11-JAN 1100-1130 GMT-5
* ~~AI: [dalley] to open issue/feature to get this work started~~
* https://github.com/pulp/pulpcore/issues/3445
* ~~AI: [ggainey] get minutes into Discourse thread~~
* (insert topics here)
### Notes
* action plan:
* get basic infrastructure POC up
* have one telemetry probe in pulp
* where to start?
* content-app would be Really Useful
* aiohttp doesn't have instrumentation yet
* api-app would be easiest place
* django has support
* "easiest first" is prob a good idea
* wibbit: more interested in api-worker monitoring
* biggest bottlenecks are api-calls and repo generation
* CI efforts push even more in that direction
* next steps
* who? and when?
* Grant, Ina, Daniel are interested, any could take it
* "Q1" - we have stakeholders that would like this "now"
* prio-list, dalley/ggainey/ipanova to work together?
* bmbouter/wibbit can help provide insights into "what we want to measure"
* getting POC up and running
* prob not a multi-month process
* need time to research OpenTelemetry docs
* need to understand oci-env service-setup a bit
* understanding export-formats, etc
* "whoever gets to it first" - consolidate what you/we learn into a doc
* maybe good for a Friday programming effort?
### Action Items:
* add notes to discourse
## 2022-12-06 1000-1030 GMT-5
### Attendees: bmbouter, dalley, ggainey
### Regrets:
### Agenda:
* Previous AIs:
* N/A
* Organizational Meeting
* What do we want this group to accomplish?
* How often do we want to meet?
* What should our next meeting look like?
* Where would we like to be in 3 months?
* Comments from woolsgrs on Discourse:
* currently running a large scale deployment on Pulp 2
* now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.
* would like to see:
* Overall status of each deployment, understanding whether it's functioning correctly, and capacity metrics around that
* Able to trigger alerts from these metrics
* Performance metrics as triggers for knowing when to scale up/down
* e.g. number of tasks, tasks waiting, etc.
* Content counts and number of requests to that content
### Notes
* dalley: first step: get a POC up ("follow the tutorial for a django app (apiserver?) and get some basic instrumentation available")
* bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs' requests)
* launch POC as a tech-preview
* bmbouter: can this group give high-level goal instead of being prescriptive?
* get POC up **very** quickly
* ggainey, dalley approve
* dalley: let's get basic infrastructure in place, and **then** start iterating
* TIMEFRAME
* what's more important than this?
* satellite support
* HCaaS
* AH
* operator/container work
* rpm pytest
* similar importance
* other pytest conversion
* do we have a ballpark for "POC as a PR" date?
* "Q1" would be good
* how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
* it "feels like" we have a couple of folk who could be freed up to work on this?
* let's bring this up at the core-team mtg next week?
* or at "sprint" planning?
* AI: [ggainey] add to team agenda for 12-DEC
* AI: [ggainey] get this on 3-month planning doc for Q1
* How often should we meet?
* Proposal: not before 2nd week in Jan
* AI: [ggainey] to set up another 30-min mtg that week
* wing it from there
* Proposal: need an issue/feature "Add basic telemetry support to Pulp3"
* AI: [dalley] to open
### Action Items:
* AI: [ggainey] add to team agenda for 12-DEC
* AI: [ggainey] get this on 3-month planning doc for Q1
* AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
* 11-JAN 1100-1130 GMT-5
* AI: [dalley] to open issue/feature to get this work started
* AI: [ggainey] get minutes into Discourse thread
###### tags: `Minutes`, `OpenTelemetry`