OpenTelemetry Working Group
See also: https://discourse.pulpproject.org/t/monitoring-telemetry-working-group/700/8
Overview
- Purpose: Brainstorm/design how to make Pulp3 more monitor-able. Prioritize the kinds of monitoring that would be useful to administrators.
- Attendees: Pulp dev-team and any interested Pulp administrators
Template
2023-05-11 1400-1430 GMT-3
Attendees: decko, bmbouter, dalley
Regrets:
Agenda:
- PR merging Otel dependencies and WSGI instrumentation - Done
- PR for opentelemetry-instrumentation-aiohttp-server. TBD
- We broke pulpcore CI!
Notes
- We need to add the OpenTelemetry libs as exceptions in the check_requirements check (from plugin_template) that the CI runs
Action Items:
- [decko] add notes to discourse
- [decko] Merge Otel oci-env profile
- [decko] Push the opentelemetry aiohttp-server PR
- [bmbouter] Fix the plugin_template's check_requirements.py
- [bmbouter] Add worker telemetry
2023-05-04 1300-1330 GMT-4
Attendees: decko, bmbouter, ggainey, dalley
Regrets:
Agenda:
Notes
- "what does 'done' look like?"
- complete aiohttp-instr-PR
- base-version of grafana panels done
- make a decision on how the packaging is going to work
- record two demo videos (jaeger/grafana)
- user-documentation
- pulpcore-docs?
- pulp-oci-images-docs?
- pulp-to-collector overview/connection
- "maybe" otel-cfg we're using in oci-env?
- blog-post - can be ultra-specific, if/as desired
- api, pulp-specific instr, aiohttp, task-system
- api is in a great spot
- aiohttp is pretty good, but we really want it released upstream to consume
- nothing for tasking yet - phase-2 perhaps?
- pulp-specific hooks - phase-3?
- decko: need to have solid decision on packaging-discussion
- What do we need to have OpenTelemetry working with Pulp?
- Collector container
- Bunch of dependencies
- opentelemetry-api
- opentelemetry-distro[otlp]
- opentelemetry-exporter-otlp-proto-http
- opentelemetry-exporter-otlp
- opentelemetry-instrumentation-wsgi
- opentelemetry-instrumentation-django
- opentelemetry-semantic-conventions
- opentelemetry-proto
- opentelemetry-sdk
- Instrumenting WSGI entrypoint (wsgi.py)
- The Faster Release Cycle…
- "We can undo it later if needed"
- https://github.com/pulp/pulpcore/pull/3632
- we could accept the import of opentelemetry.instrumentation.wsgi today (it exists)
- the import of opentelemetry.instrumentation.aiohttp_server needs work that isn't released in that upstream yet
- what can we do while waiting (if we have to)?
- fork the project? <== NO
- depend on the source-checkout? <== better
- vendor the PR?
- we'll decide when we've done everything else we can do
- Decision:
- split aiohttp-mods out of current core-PR
- add trimmed dep-list
- release
- aiohttp-into-core gets its own PR
- requirements.txt relies on src-checkout
- merges once upstream accepts the needed changes
- bring this to pulpcore-mtg next week
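A minimal sketch of the wsgi.py instrumentation discussed above, assuming the opentelemetry-instrumentation-wsgi package from the dependency list is installed. The `application` callable is a stand-in for illustration, not Pulp's real entrypoint, and the guarded import is an assumption that keeps the module loadable when the library is absent.

```python
# Stand-in WSGI app for illustration; Pulp's real wsgi.py builds its
# application through Django instead.
def application(environ, start_response):
    body = b"ok"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


try:
    # Middleware provided by opentelemetry-instrumentation-wsgi.
    from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware
except ImportError:
    # Assumption for this sketch: fall back to the plain app if the
    # library is not installed.
    OpenTelemetryMiddleware = None

if OpenTelemetryMiddleware is not None:
    # Wrap the entrypoint so each request is traced.
    application = OpenTelemetryMiddleware(application)
```

The same callable name is kept after wrapping, so the WSGI server configuration does not need to change.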
Action Items:
- add notes to discourse
- schedule next mtg: The Road To Merge
2023-04-27 1300-1330 GMT-5
Attendees: decko, bmbouter, dralley, ggainey
Regrets:
Agenda:
Notes
- decko showed off his progress
- discussion around what can we do further?
- can we set things up to pre-load a Grafana dashboard from prepared JSON at start time?
- what are, say, "4 things we want in a Grafana dashboard"?
- content-app
- response-codes over time
- organize by class (400/500/OK)?
- show my 500s only?
- req-latency
- "cost" items
- how many bytes have been served
- can we gather metrics per-domain?
- where do/can we record that?
- /pulp/content/DOMAIN/content-URL
- upstream (aiohttp) vs downstream (in-pulp)
- can we attach this data as a header to the request, and record that header?
- proposal: 'launch' w/ response/-codes/latency, in pretty "basic" graphs, preloaded into oci-env profile
- discussion: what needs to happen to get aiohttp-instrumentation-PR merged?
- open a new PR from decko's branch w/ orig commits against aiohttp repo?
- discussion (brief) around a "pulp_telemetry" 'shim'
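A rough sketch of how the content-app labels discussed above (response-code class, per-domain) could be derived from a request. It assumes the /pulp/content/DOMAIN/content-URL layout from the notes; the function names and the choice to count 2xx/3xx as "ok" are hypothetical, not existing Pulp code.

```python
from urllib.parse import urlsplit

# Path prefix taken from the notes; real deployments may configure another.
CONTENT_PREFIX = "/pulp/content/"


def status_class(status):
    """Bucket an HTTP status into the classes proposed for the dashboard."""
    if 200 <= status < 400:
        return "ok"  # assumption: redirects count as OK
    if 400 <= status < 500:
        return "4xx"
    if 500 <= status < 600:
        return "5xx"
    return "other"


def domain_from_path(path):
    """Return the first path segment after the content prefix, or None."""
    path = urlsplit(path).path
    if not path.startswith(CONTENT_PREFIX):
        return None
    first = path[len(CONTENT_PREFIX):].split("/", 1)[0]
    return first or None


def metric_labels(path, status):
    """Build the label set to attach to a request metric."""
    return {
        "domain": domain_from_path(path) or "unknown",
        "status_class": status_class(status),
    }
```

Keeping the label values to a small fixed set of classes keeps metric cardinality low, which matters more once a per-domain label is added.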
Action Items:
2023-04-20 1300-1330 GMT-5
Attendees: decko, dralley, ggainey
Regrets:
Agenda:
- Previous AIs:
- AI: ggainey to start using oci_env profile PR for this
- AI: review https://github.com/pulp/pulp-oci-images/pull/469
- Tabled for later investigation:
- AI: decko to add what is missing from the #pulpcore/3632
- AI: [decko] get tests running locally to see why docker-side fails when podman-side succeeds
- still can't figure out why docker "occasionally" fails
Notes
- see instructions in the profile-readme in the oci-env PR for a how-to
Action Items:
2023-04-13 1300-1330 GMT-5
Attendees: ggainey, decko
Regrets:
Agenda:
Notes
- review/discussion of some test failures
- current kind-of-a-plan for aiohttp-metrics-work
- move tests to pytest, get them running clean (#soon)
- add metrics taking advantage of this fork
- think on what tests we prob should have in addition, write them, get them running clean
- submit otel-aiohttp PR upstream
- continue adding metrics to "our" fork independently
- AI: ggainey to start using oci_env profile PR for this
- AI: review https://github.com/pulp/pulp-oci-images/pull/469
- AI: decko to add what is missing from the #pulpcore/3632
Action Items:
- [ggainey] to start using oci_env profile PR for this
- [decko] get tests running locally to see why docker-side fails when podman-side succeeds
- [any] review https://github.com/pulp/pulp-oci-images/pull/469
- [decko] to add what is missing from the #pulpcore/3632
- [ggainey] schedule next mtg for next week
- [ggainey] add notes to discourse
2023-04-06 1300-1330 GMT-5
Attendees: decko, dalley, ggainey
Regrets:
Agenda:
Notes
- PRs are in-progress, to be submitted this week
- HMS mtg taught some things we'll prob steal :)
- discussion around a plugin-approach to making otel available
- maybe just hooks in core, that do nothing w/out pulp_otel installed?
Action Items:
2023-03-31 1330-1400 GMT-5
Attendees: dalley, decko, ggainey
Regrets:
Agenda:
- Previous AIs:
- AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
Notes
- pulp-content/aiohttp instrumentation demo from decko
- traces working, still trying to get metrics
- things in flight
- aiohttp package w/ instrumentation
- metrics labels
- instrumenting workers
- getting pulp-api metrics, but not from pulp-content, need to understand why
- A Plan:
- finish oci-env profile for otel
- figure out why we're not getting some wsgi-instr labels
- get a working aiohttp-instr PR submitted (based on the work of the existing 'abandoned' PR)
Action Items:
- add notes to discourse
- ggainey to sched next mtg for next Thurs
2023-03-22 1330-1400 GMT-4
Attendees: decko, jsherrill, dralley, ggainey
Regrets:
Agenda:
- jsherrill to show us what his team is doing w/ monitoring/metrics
- NOTES
- might be "some" guidance available
- we're all making this up as we go along
- http status/latency
- msg-latency/error-rate (like tasks?)
- some "analytics" info mixed in
- grafana dashboard to visualize
- started from a template
- export JSON in order to import into app
- PR for visualization changes is currently "exciting"
- having a full-time data visualization expert would be a Good Thing
- discussion around SLOs (uptime/breach rules/alerting)
- "best practices" still "up in the air"?
- there are tests for alerts
- review your output - sometimes, there are bugs
- QUESTIONS
- app implements a /metrics endpoint
- gathered metrics thrown into prometheus
- How are metrics produced to make them available to /metrics?
- prometheus-client in go does the heavy lifting
- AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
Action Items:
- AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
- add notes to discourse
- schedule mtg for next week
2023-03-16 1400-1500 GMT-4
Attendees: ggainey, dalley, decko, bmbouter
Regrets:
Agenda:
- Previous AIs:
- oci-env/otel work to continue to completion
- decko to go from the above, to instrumenting workers
- aiohttp/asgi auto-instrument bakeoff
Notes
- discussion RE decko's experiences
- worker-trace-example! Woo!
- looking at metrics in grafana - double wooo!
- How do we get actual-services-admins involved in setting up kinds-of metrics/visualizations
- AI: [ggainey] invite jsherrill to come demo their Grafana env for us
- What would be nice:
- Docs written from an Operational perspective
- "Here's a Thing you want to know, here are graphs that will help you answer it"
- Example: "Is pulp serving content correctly? - visualize content-app status codes"
- Next-steps sequence
- finish oci-env profile
- start working on some "standard" graphs
- work on how-to docs
- work on demos
- how can we merge better w/ pulpcore?
- right way to merge new libs to project?
- responding to various installation-scenarios
- discussion
- single-container - s6-svc
- what if users don't want to spin up otel? What happens to the app?
- pulp-otel-enabled variable - default to False
- what does that mean?
- does not mean that otel-libs aren't installed (are they direct-deps or not? will be incl in img regardless)
- multiprocess container - there's another svc running
- docs should call out/link to docs RE feature-flip vars for the auto-instr libs
- "direct dependency vs not" discussion
- if it is, you can't uninstall it
- not everything has to be a hard-dep
- maybe start with not-required
- prioritizing aiohttp server PR might be worthwhile
- acceptance is out of our control
- will take more time to get an aiohttp-lib w/ the support released
- auto-instr pkgs aren't going to include correlation-id-support for pulp's cids
Action Items:
- AI: [ggainey] invite jsherrill to come demo their Grafana env for us
- add notes to discourse
2023-03-09 1400-1500 GMT-5
Attendees: ggainey, decko, dralley, bmbouter
Regrets:
Agenda:
- Previous AIs:
- bmbouter to do little perf-test
- bmbouter/decko to work together to get oci-env to run otel setup
- decko to go from the above, to instrumenting workers
- aiohttp/asgi auto-instrument bakeoff
- ggainey to sched for an hour next Thurs
- (insert topics here)
Notes
- some data on performance-impact of traces : https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1556
- discussion ensues
- maybe we want to test more?
- if our worker span instrumentation is "feature-flippable", we're implying a direct dependency on otel being packaged
- prob want to discuss at pulpcore mtgs
- oci-env work in progress
- really close to having an otel-env
- work continues
- how's the worker-instrumentation going to work?
- can we get a span that covers creation-to-end?
- dispatch-to-start is a good thing to know
- just span for run-to-complete is "easy"
- can we use the correlation-id as a span-id?
- as opposed to task-uuid?
- BUT - think about dispatching-a-task-group
- what about metrics (as opposed to spans)
- auto-instrumentation setup has its own metrics
- are there Things we'd like to add to our code specifically?
- for tasking system, almost certainly
- per-worker metric(s)
- fail-rate
- task-throughput
- what happens when workers go-away?
- attach metrics to worker-names?
- missing-worker-events
- system-metrics as a whole
- wait-q-size
- waiting-lock-evaluation ("concurrency opportunity")
- ratio tasks/possible_concurrency
- discussion around how workers dispatch themselves
- thinking like an admin
- do I have too much hardware in use?
- not enough?
- how do I know "something is going wrong"?
- "service-level-indicator": how much time does a task wait before start
- "possible concurrency": how many could start, assuming enough workers?
- "utilization": what percentage of workers are "typically" busy?
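A sketch of how the admin-facing indicators above (task wait time before start, worker utilization) could be computed from task timestamps. `TaskRecord` and its field names are hypothetical stand-ins, not Pulp's task model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    """Hypothetical task timestamps, in seconds."""
    dispatched_at: float
    started_at: Optional[float] = None
    finished_at: Optional[float] = None


def wait_seconds(task: TaskRecord) -> Optional[float]:
    """Dispatch-to-start time, or None if the task has not started."""
    if task.started_at is None:
        return None
    return task.started_at - task.dispatched_at


def mean_wait(tasks: List[TaskRecord]) -> Optional[float]:
    """The "service-level-indicator": average wait of started tasks."""
    waits = [w for w in map(wait_seconds, tasks) if w is not None]
    return sum(waits) / len(waits) if waits else None


def utilization(busy_workers: int, total_workers: int) -> float:
    """Fraction of workers that are currently busy."""
    return busy_workers / total_workers if total_workers else 0.0


tasks = [
    TaskRecord(dispatched_at=0.0, started_at=2.0, finished_at=5.0),
    TaskRecord(dispatched_at=1.0, started_at=5.0),
    TaskRecord(dispatched_at=3.0),  # still waiting
]
print(mean_wait(tasks), utilization(3, 4))
```

"Possible concurrency" is left out because it depends on the resource-lock evaluation discussed above, which a sketch like this cannot model.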
Action Items:
- oci-env/otel work to continue to completion
- decko to go from the above, to instrumenting workers
- aiohttp/asgi auto-instrument bakeoff
- add notes to discourse
- ggainey to schedule next for one week out
2023-03-02 1000-1030 GMT-5
Attendees: ggainey, bmbouter, decko, dralley
Regrets:
Agenda:
Notes
- updates
- experimented using wsgi_autoinstrumentation
- works better than django-autoinstr
- correctly nested/subspanned things like postgres-spans
- is there any reason to use django-auto?
- look at their issues, maybe there's a known prob
- we don't know of anything we're missing
- maybe compare the two codebases?
- how do metrics compare to tracing output?
- how will we add this into our container?
- optional vs non-optional dependencies?
- need to id what the new dependencies are?
- what's the perf-overhead if you're not gathering trace-info (if any)
- dralley: there is perf-overhead when tracing
- bmbouter: is there a perf-impact when you're not collecting telemetry-output
- what do metrics look like (as opposed to tracing)
- bmbouter has gotten Pulp reporting metrics to an otel-container, and then shipping those to Prometheus
- next step is visualizing in grafana
- what are "the right" metrics?
- discussion around oci-env work/changes to support
- Prio #1: get oci-env profile in place to support otel
- discussion around django-prometheus
- discussion around current-monitoring-use by an actual (large) user of OTel
- detailed metrics-discussion w/ this user on 15-MAR
- bmbouter plans to have a demo available prior
- discussion around asgi-otel-instrumentation
- decko/bmbouter to do deeper discussion awesome
- links
Action Items:
- bmbouter to do little perf-test
- bmbouter/decko to work together to get oci-env to run otel setup
- decko to go from the above, to instrumenting workers
- aiohttp/asgi auto-instrument bakeoff
- ggainey to sched for an hour next Thurs
2023-02-23 1000-1030 GMT-5
Attendees: ggainey, bmbouter, decko
Regrets:
Agenda:
- Previous AIs:
- (insert topics here)
Notes
- bmbouter reports
- got django auto-instrumentation running in api-workers
- importing into Jaeger
- configured oci-env w/ OT visualization
- Let's think about what we really want to get out of this effort?
- Notes from user-discussions
- users need to be able to turn it off and on
- practical problems
- have otel be its own oci-env profile
- loads Jaeger as a side-container
- base img needs a way to turn OT off and on
- what if we had an instrumented img?
- leads to combinatorics-fun
- users want their own imgs
- what if there was an "instrument this" env-var? (OTEL_ENABLED)
- otel-pieces installed always, just not always "on" unless asked for via this var
- let cfg run via env-vars
- "Here's the OTEL docs, use their env-vars to control behavior"
- https://www.aspecto.io/blog/opentelemetry-collector-guide/
- discussion of "direct to collector" vs "agent to collector" architectures
- allows batching, allows data-transformation, allows redaction
- open question: how should a pulp dev-env be configured?
- example "interesting" metric : https://github.com/pulp/pulpcore/issues/3389
- let's think about the specific packages that might need "RPM-izing"
- may already be RPM'd in Fedora (maybe RHEL?)
- next step:
- instrumenting workers (tasking-system)
- makes optional-otel-dependency more problematic
- what about aiohttp-server side? (content-svc)
- decko maybe picks up aiohttpserver tracing?
- prob going to meet once/wk
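A minimal sketch of the "instrument this" switch proposed above, assuming the variable is named OTEL_ENABLED as in the notes and defaults to off. The helper name and the accepted truthy spellings are assumptions for illustration.

```python
import os

# Assumption: which spellings count as "on".
_TRUTHY = {"1", "true", "yes", "on"}


def otel_enabled(environ=None):
    """Return True only if OTEL_ENABLED is explicitly set to a truthy value."""
    environ = os.environ if environ is None else environ
    return environ.get("OTEL_ENABLED", "").strip().lower() in _TRUTHY


if __name__ == "__main__":
    if otel_enabled():
        print("telemetry on: instrument the app here")
    else:
        print("telemetry off: libraries installed but inactive")
```

Everything beyond the on/off switch (exporter endpoint, service name, and so on) would be left to OpenTelemetry's own environment variables, as suggested in the notes.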
Action Items:
2023-01-11 1100-1130 GMT-5
Attendees: dalley, ggainey, bmbouter, ipanova, wibbit
Regrets:
Agenda:
- Previous AIs:
- AI: [ggainey] add to team agenda for 12-DEC
- AI: [ggainey] get this on 3-month planning doc for Q1
- AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
- AI: [dalley] to open issue/feature to get this work started
- AI: [ggainey] get minutes into Discourse thread
- (insert topics here)
Notes
- action plan:
- get basic infrastructure POC up
- have one telemetry probe in pulp
- where to start?
- content-app would be Really Useful
- aiohttp doesn't have instrumentation yet
- api-app would be easiest place
- "easiest first" is prob a good idea
- wibbit: more interested in api-worker monitoring
- biggest bottlenecks are api-calls and repo generation
- CI efforts push even more in that direction
- next steps
- who? and when?
- Grant, Ina, Daniel are interested, any could take it
- "Q1" - we have stakeholders that would like this "now"
- prio-list, dalley/ggainey/ipanova to work together?
- bmbouter/wibbit can help provide insights into "what we want to measure"
- getting POC up and running
- prob not a multi-month process
- need time to research OpenTelemetry docs
- need to understand oci-env service-setup a bit
- understanding export-formats, etc
- "whoever gets to it first" - consolidate what you/we learn into a doc
- maybe good for a Friday programming effort?
Action Items:
2022-12-06 1000-1030 GMT-5
Attendees: bmbouter, dalley, ggainey
Regrets:
Agenda:
- Previous AIs:
- Organizational Meeting
- What do we want this group to accomplish?
- How often do we want to meet?
- What should our next meeting look like?
- Where would we like to be in 3 months?
- Comments from woolsgrs on Discourse:
- currently running a large scale deployment on Pulp 2
- now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.
- would like to see:
- Overall status of each deployment, to understand it's functioning correctly, and capacity metrics around that.
- Able to trigger alerts from these metrics
- Performance for triggers knowing when to scale up/down
- e.g. no of tasks, task waiting etc.
- Content counts and no. of requests to that content
Notes
- dalley: first step: get a POC of "followed the tutorial for a django app (apiserver?) and get some basic instrumentation available"
- bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs' requests)
- launch POC as a tech-preview
- bmbouter: can this group give high-level goal instead of being prescriptive?
- get POC up very quickly
- ggainey, dalley approve
- dalley: let's get basic infrastructure in place, and then start iterating
- TIMEFRAME
- what's more important than this?
- satellite support
- HCaaS
- AH
- operator/container work
- rpm pytest
- similar importance
- do we have a ballpark for "POC as a PR" date?
- "Q1" would be good
- how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
- it "feels like" we have a couple of folk who could be freed up to work on this?
- let's bring this up at the core-team mtg next week?
- or at "sprint" planning?
- AI: [ggainey] add to team agenda for 12-DEC
- AI: [ggainey] get this on 3-month planning doc for Q1
- How often should we meet?
- Proposal: not before 2nd week in Jan
- AI: [ggainey] to set up another 30-min mtg that week
- wing it from there
- Proposal: need an issue/feature "Add basic telemetry support to Pulp3"
Action Items:
- AI: [ggainey] add to team agenda for 12-DEC
- AI: [ggainey] get this on 3-month planning doc for Q1
- AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
- AI: [dalley] to open issue/feature to get this work started
- AI: [ggainey] get minutes into Discourse thread