# OpenTelemetry Working Group

See also: https://discourse.pulpproject.org/t/monitoring-telemetry-working-group/700/8

## Overview

* Purpose: Brainstorm/design how to make Pulp3 more monitor-able. Prioritize the kinds of monitoring that would be useful to administrators.
* Attendees: Pulp dev-team and any interested Pulp administrators

## Template

```
## YYYY-MM-DD 1300-1330 GMT-5

### Attendees:

### Regrets:

### Agenda:
* Previous AIs:
* (insert topics here)

### Notes

### Action Items:
* add notes to discourse
```

## 2023-05-11 1400-1430 GMT-3

### Attendees: decko, bmbouter, dalley

### Regrets:

### Agenda:
* PR merging Otel dependencies and WSGI instrumentation - Done
* PR for opentelemetry-instrumentation-aiohttp-server - TBD
* We broke pulpcore CI!
    * Need to create some kind of exception mechanism to allow whitelisting dependencies???
    * https://github.com/pulp/plugin_template/blob/main/templates/github/.ci/scripts/check_requirements.py.j2

### Notes
* We need to add the OpenTelemetry libs as an exception in the plugin_template's check_requirements check for the CI

### Action Items:
* [decko] add notes to discourse
* [decko] Merge the Otel oci-env profile
* [decko] Push the opentelemetry aiohttp-server PR
* [bmbouter] Fix the plugin_template's check_requirements.py
* [bmbouter] Add worker telemetry

## 2023-05-04 1300-1330 GMT-4

### Attendees: decko, bmbouter, ggainey, dalley

### Regrets:

### Agenda:
* Previous AIs:

### Notes
* "what does 'done' look like?"
    * complete the aiohttp-instr PR
    * base version of grafana panels done
    * make a decision on how the packaging is going to work
    * record two demo videos (jaeger/grafana)
    * user-documentation
        * pulpcore-docs?
        * pulp-oci-images-docs?
        * pulp-to-collector overview/connection
        * "maybe" the otel-cfg we're using in oci-env?
    * blog post - can be ultra-specific, if/as desired
* api, pulp-specific instr, aiohttp, task-system
    * api is in a great spot
    * aiohttp is pretty good, but we really want it released upstream to consume
    * nothing for tasking yet - phase-2 perhaps?
    * pulp-specific hooks - phase-3?
* decko: need a solid decision on the packaging discussion
* What do we need to have OpenTelemetry working with Pulp?
    * Collector container
    * Bunch of dependencies
        * opentelemetry-api
        * opentelemetry-distro[otlp]
        * opentelemetry-exporter-otlp-proto-http
        * opentelemetry-exporter-otlp
        * opentelemetry-instrumentation-wsgi
        * opentelemetry-instrumentation-django
        * opentelemetry-semantic-conventions
        * opentelemetry-proto
        * opentelemetry-sdk
    * Instrumenting the WSGI entrypoint (wsgi.py) - see the sketch after this meeting's action items
* The Faster Release Cycle...
    * "We can undo it later if needed"
* https://github.com/pulp/pulpcore/pull/3632
    * we could accept the import of `opentelemetry.instrumentation.wsgi` **today** (it exists)
    * the import of `opentelemetry.instrumentation.aiohttp_server` needs work that isn't released upstream yet
    * what can we do while waiting (if we have to)?
        * fork the project? <== NO
        * depend on the source-checkout? <== better
        * vendor the PR?
        * we'll decide when we've done *everything else* we can do
* Decision:
    * split the aiohttp mods out of the current core PR
        * add the trimmed dep-list
        * release
    * aiohttp-into-core gets its own PR
        * requirements.txt relies on the src-checkout
        * merges once upstream accepts the needed changes
    * bring this to the pulpcore mtg next week

### Action Items:
* add notes to discourse
* schedule next mtg: The Road To Merge
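A minimal sketch of the "Instrumenting the WSGI entrypoint (wsgi.py)" item above, assuming the `opentelemetry-instrumentation-wsgi` package from the dependency list; the settings-module path is illustrative and this is not the exact pulpcore change:

```python
# Sketch only: wrap a Django WSGI application so every request produces a
# server span (and WSGI HTTP metrics). Assumes a TracerProvider/MeterProvider
# is configured elsewhere (e.g. by opentelemetry-distro auto-configuration).
import os

from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pulpcore.app.settings")  # illustrative path

application = get_wsgi_application()
application = OpenTelemetryMiddleware(application)  # instrumented entrypoint
```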
## 2023-04-27 1300-1330 GMT-5

### Attendees: decko, bmbouter, dralley, ggainey

### Regrets:

### Agenda:
* Previous AIs:

### Notes
* decko showed off his progress
* discussion around what we can do further
    * can we set things up to pre-load the Grafana dashboard from prepared JSON, at start-time?
    * what are, say, "4 things we want in a Grafana dashboard"?
        * content-app
            * response-codes over time
                * organize by class (400/500/OK)?
                * show my 500s only?
            * req-latency
                * avg
                * P95
                * P99
            * "cost" items
                * how many bytes have been served
                * 202 is diff from 301
    * can we gather metrics **per-domain**?
        * where do/can we record that?
            * /pulp/content/DOMAIN/content-URL
            * upstream (aiohttp) vs downstream (in-pulp)
            * can we attach this data as a header to the request, and record that header?
    * proposal: 'launch' w/ response-codes/latency, in pretty "basic" graphs, preloaded into the oci-env profile
* discussion: what needs to happen to get the aiohttp-instrumentation PR merged?
    * open a new PR from decko's branch w/ orig commits against the aiohttp repo?
    * (remember, based on https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942)
    * https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1714
* discussion (brief) around a "pulp_telemetry" 'shim'

### Action Items:
* add notes to discourse

## 2023-04-20 1300-1330 GMT-5

### Attendees: decko, dralley, ggainey

### Regrets:

### Agenda:
* Previous AIs:
    * AI: ggainey to start using the oci_env profile PR for this
        * https://github.com/pulp/oci_env/pull/98
        * no progress to report
    * ~~AI: review https://github.com/pulp/pulp-oci-images/pull/469~~
        * merged
    * Tabled for later investigation:
        * AI: decko to add what is missing from pulpcore #3632
        * AI: [decko] get tests running locally to see why the docker side fails when the podman side succeeds
            * still can't figure out why docker "occasionally" fails

### Notes
* see the instructions in the profile README in the oci-env PR for a how-to

### Action Items:
* add notes to discourse

## 2023-04-13 1300-1330 GMT-5

### Attendees: ggainey, decko

### Regrets:

### Agenda:

### Notes
* review/discussion of some test failures
* current kind-of-a-plan for the aiohttp-metrics work (see the sketch after this meeting's action items)
    * move tests to pytest, get them running clean (#soon)
    * add metrics taking advantage of this fork
    * think on what tests we prob should have in addition, write them, get them running clean
    * submit the otel-aiohttp PR upstream
    * continue adding metrics to "our" fork independently
* AI: ggainey to start using the oci_env profile PR for this
    * https://github.com/pulp/oci_env/pull/98
* AI: [decko] get tests running locally to see why the docker side fails when the podman side succeeds
* AI: review https://github.com/pulp/pulp-oci-images/pull/469
* AI: decko to add what is missing from pulpcore #3632

### Action Items:
* [ggainey] start using the oci_env profile PR for this
* [decko] get tests running locally to see why the docker side fails when the podman side succeeds
* [any] review https://github.com/pulp/pulp-oci-images/pull/469
* [decko] add what is missing from pulpcore #3632
* [ggainey] schedule next mtg for next week
* [ggainey] add notes to discourse
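Since the upstream `opentelemetry-instrumentation-aiohttp-server` package was not yet released at this point, a hand-rolled aiohttp middleware using only the stable tracing API is one stopgap. A sketch with illustrative names, not the code from the fork/PR discussed above:

```python
# Sketch of manual aiohttp server instrumentation for the content-app while
# waiting on an upstream opentelemetry-instrumentation-aiohttp-server release.
# Attribute names loosely follow the HTTP semantic conventions and are not
# exhaustive; the tracer name is illustrative.
from aiohttp import web
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("pulp.content")


@web.middleware
async def otel_middleware(request, handler):
    # One SERVER span per request.
    with tracer.start_as_current_span(
        f"{request.method} {request.path}", kind=SpanKind.SERVER
    ) as span:
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.target", request.path_qs)
        response = await handler(request)
        span.set_attribute("http.status_code", response.status)
        return response


app = web.Application(middlewares=[otel_middleware])
```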
## 2023-04-06 1300-1330 GMT-5

### Attendees: decko, dalley, ggainey

### Regrets:

### Agenda:
* Previous AIs:

### Notes
* PRs are in progress, to be submitted this week
* the HMS mtg taught some things we'll prob steal :)
* discussion around a plugin approach to making otel available
    * maybe just hooks in core, that do nothing w/out pulp_otel installed?

### Action Items:
* add notes to discourse

## 2023-03-31 1330-1400 GMT-5

### Attendees: dalley, decko, ggainey

### Regrets:

### Agenda:
* Previous AIs:
    * AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
        * There exists a Red Hat Observability CoP!
        * https://source.redhat.com/groups/public/program_observability

### Notes
* pulp-content/aiohttp instrumentation demo from decko
    * traces working, still trying to get metrics
* things in flight
    * aiohttp package w/ instrumentation
    * metrics labels
    * instrumenting workers
    * getting pulp-api metrics, but not from pulp-content; need to understand why
* A Plan:
    * finish the oci-env profile for otel
    * figure out why we're not getting some wsgi-instr labels
        * https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md
    * get a working aiohttp-instr PR submitted (based on the work of the existing 'abandoned' PR)

### Action Items:
* add notes to discourse
* ggainey to sched next mtg for next Thurs

## 2023-03-22 1330-1400 GMT-4

### Attendees: decko, jsherrill, dralley, ggainey

### Regrets:

### Agenda:
* jsherrill to show us what his team is doing w/ monitoring/metrics
* NOTES
    * might be "some" guidance available
        * we're all making this up as we go along
    * http status/latency
    * msg-latency/error-rate (like tasks?)
    * some "analytics" info mixed in
    * grafana dashboard to visualize
        * started from a template
        * export JSON in order to import into the app
        * PR for visualization changes is currently "exciting"
        * having a full-time data-visualization expert would be a Good Thing
    * discussion around SLOs (uptime/breach rules/alerting)
        * "best practices" still "up in the air"?
        * there are tests for alerts
        * **review your output** - sometimes, there are bugs
* QUESTIONS
    * app implements a /metrics endpoint
        * gathered metrics thrown into prometheus
    * How are metrics produced to make them available to /metrics? (see the sketch after this meeting's action items)
        * prometheus-client in go does the heavy lifting
* AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group

### Action Items:
* AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
* add notes to discourse
* schedule mtg for next week
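The pattern described above (the app registers instruments and exposes them on /metrics, and Prometheus scrapes that endpoint) uses the Go prometheus client in jsherrill's app. A rough Python analogue of the same pattern, with illustrative names, purely to show the mechanics rather than what Pulp ultimately ships:

```python
# Sketch: register instruments with prometheus_client and expose them on a
# /metrics endpoint for Prometheus to scrape. Metric names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["status_class"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # serves http://localhost:8000/metrics
    while True:
        with LATENCY.time():  # observe request latency
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(status_class="2xx").inc()
```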
## 2023-03-16 1400-1500 GMT-4

### Attendees: ggainey, dalley, decko, bmbouter

### Regrets:

### Agenda:
* Previous AIs:
    * oci-env/otel work to continue to completion
    * decko to go from the above, to instrumenting workers
    * aiohttp/asgi auto-instrument bakeoff

### Notes
* discussion RE decko's experiences
    * worker-trace example! Woo!
    * looking at metrics in grafana - double wooo!
* How do we get actual-services-admins involved in setting up the kinds of metrics/visualizations?
    * AI: [ggainey] invite jsherrill to come demo their Grafana env for us
* What would be nice:
    * Docs written from an Operational perspective
        * "Here's a Thing you want to know, here are graphs that will help you answer it"
        * Example: "Is pulp serving content correctly? - visualize content-app status codes"
* Next-steps sequence
    * finish the oci-env profile
    * start working on some "standard" graphs
    * work on how-to docs
    * work on demos
* how can we merge better w/ pulpcore?
    * right way to merge new libs into the project?
    * responding to various installation scenarios
        * discussion
            * single-container - s6-svc
                * what if users don't want to spin up otel? What happens to the app?
                * pulp-otel-enabled variable - default to False
                    * what does that mean?
                    * does **not** mean that otel-libs aren't installed (are they direct-deps or not? will be incl in the img regardless)
            * multiprocess container - there's another svc running
    * docs should call out/link to docs RE feature-flip vars for the auto-instr libs
        * able to toggle collect-data or not, for the various auto-instr pieces
        * example: https://opentelemetry-python.readthedocs.io/en/latest/examples/django/README.html#disabling-django-instrumentation
    * "direct dependency vs not" discussion
        * if it is, you can't *uninstall* it
        * not *everything* has to be a hard dep
        * maybe start with not-required
* prioritizing the aiohttp server PR might be worthwhile
    * acceptance is out of our control
    * will take more time to get an aiohttp lib w/ the support released
* auto-instr pkgs aren't going to include correlation-id support for pulp's cids
    * look at (eg) https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/instrumentation/opentelemetry-instrumentation-wsgi/src/opentelemetry/instrumentation/wsgi/__init__.py#L85
    * wsgi and aiohttp should be enhanced this way
    * what's the relationship between trace-id and cid? Can we make them the same?
        * spans might end up w/ dup ids? - prob OK
        * need to experiment/investigate

### Action Items:
* AI: [ggainey] invite jsherrill to come demo their Grafana env for us
* add notes to discourse

## 2023-03-09 1400-1500 GMT-5

### Attendees: ggainey, decko, dralley, bmbouter

### Regrets:

### Agenda:
* Previous AIs:
    * bmbouter to do a little perf-test
    * bmbouter/decko to work together to get oci-env to run the otel setup
    * decko to go from the above, to instrumenting workers
    * aiohttp/asgi auto-instrument bakeoff
    * ggainey to sched for an hour next Thurs
* (insert topics here)

### Notes
* some data on the performance impact of traces: https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1556
    * discussion ensues
    * maybe we want to test more?
    * if our worker span instrumentation is "feature-flippable", we're implying a direct dependency on otel being packaged
        * prob want to discuss at pulpcore mtgs
* oci-env work in progress
    * really close to having an otel-env
    * work continues
* how's the worker instrumentation going to work? (see the sketch after this meeting's action items)
    * can we get a span that covers creation-to-end?
        * dispatch-to-start is a good thing to know
        * just a span for run-to-complete is "easy"
    * can we use the correlation-id as a span-id?
        * as opposed to the task-uuid?
        * BUT - think about dispatching a task-group
* what about **metrics** (as opposed to spans)?
    * the auto-instrumentation setup has its own metrics
    * are there Things we'd like to add to our code specifically?
        * for the tasking system, almost certainly
    * per-worker metric(s)
        * fail-rate
        * task-throughput
        * what happens when workers go away?
            * attach metrics to worker-names?
            * missing-worker events
            * interpretation is key
    * system metrics as a whole
        * wait-q size
        * waiting-lock evaluation ("concurrency opportunity")
        * ratio tasks/possible_concurrency
        * discussion around how workers dispatch themselves
* thinking like an admin
    * do I have too much hardware in use?
    * not enough?
    * how do I know "something is going wrong"?
    * "service-level indicator": how much time does a task wait before starting?
    * "possible concurrency": how many *could* start, assuming enough workers?
    * "utilization": what percentage of workers are "typically" busy?
    * https://www.brendangregg.com/usemethod.html

### Action Items:
* oci-env/otel work to continue to completion
* decko to go from the above, to instrumenting workers
* aiohttp/asgi auto-instrument bakeoff
* add notes to discourse
* ggainey to schedule the next mtg for one week out
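For the worker-instrumentation and tasking-metrics ideas above (a run-to-complete span, the correlation-id question, fail-rate/throughput/wait-time metrics), a sketch using only the OpenTelemetry API. The task object, attribute keys, and instrument names are assumptions for illustration, not pulpcore's actual tasking code, and a TracerProvider/MeterProvider is assumed to be configured elsewhere:

```python
# Sketch only: wrap "run one task" in a span and emit the tasking metrics
# discussed above. Names and the `task` shape are illustrative.
from opentelemetry import metrics, trace

tracer = trace.get_tracer("pulp.tasking")
meter = metrics.get_meter("pulp.tasking")

tasks_completed = meter.create_counter(
    "pulp.tasks.completed", description="Tasks finished, by outcome"
)
task_wait_time = meter.create_histogram(
    "pulp.tasks.wait_time", unit="s", description="Dispatch-to-start wait time"
)


def run_task(task):
    """Run one task inside a span, recording wait time and outcome.

    `task` is a stand-in with dispatched_at/started_at datetimes, a name,
    a correlation_id, and a run() method.
    """
    task_wait_time.record((task.started_at - task.dispatched_at).total_seconds())
    with tracer.start_as_current_span("pulp.task.run") as span:
        span.set_attribute("pulp.task.name", task.name)
        span.set_attribute("pulp.task.correlation_id", task.correlation_id)
        try:
            task.run()
        except Exception:
            tasks_completed.add(1, {"outcome": "failed"})
            raise
        tasks_completed.add(1, {"outcome": "completed"})
```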
## 2023-03-02 1000-1030 GMT-5

### Attendees: ggainey, bmbouter, decko, dralley

### Regrets:

### Agenda:

### Notes
* updates
    * experimented using wsgi auto-instrumentation
        * works better than django auto-instr
        * correctly nested/subspanned things like postgres-spans
        * is there any reason to use django-auto?
            * look at their issues, maybe there's a known prob
            * we don't know of anything we're missing
            * maybe compare the two codebases?
    * how do **metrics** compare to tracing output?
    * how will we add this into our container?
        * optional vs non-optional dependencies?
        * need to id what the new dependencies are
    * what's the perf overhead if you're **not** gathering trace-info (if any)?
        * dralley: there is perf overhead when tracing
        * bmbouter: is there a perf impact when you're not collecting telemetry output?
    * what do **metrics** look like (as opposed to tracing)?
        * bmbouter has gotten Pulp reporting metrics to an otel-container, and then shipping those to Prometheus (see the sketch after this meeting's action items)
        * next step is visualizing in grafana
        * what are "the right" metrics?
* discussion around oci-env work/changes to support this
    * Prio #1: get an oci-env profile in place to support otel
    * needs mikedep's PR for [oci-images #449](https://github.com/pulp/pulp-oci-images/issues/449) to be merged for our images
* discussion around django-prometheus
    * no traces, just metrics
    * is this maybe a path to get insight/inspiration for metrics?
    * https://github.com/korfuri/django-prometheus
* discussion around current monitoring use by an actual (large) user of OTel
    * detailed metrics discussion w/ this user on 15-MAR
    * bmbouter plans to have a demo available prior
* discussion around asgi-otel-instrumentation
    * decko/bmbouter to do a deeper discussion - awesome
* links
    * django-ai-code: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-django
    * wsgi-ai-code: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-wsgi
    * django-prometheus: https://github.com/korfuri/django-prometheus
    * otel-asgi-instr: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-asgi
    * https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942/files#diff-e2884db0811036aea22f73ead6dc004e9a27d8d3a8de9a4696f1b1327030af61R135-R156

### Action Items:
* bmbouter to do a little perf-test
* bmbouter/decko to work together to get oci-env to run the otel setup
* decko to go from the above, to instrumenting workers
* aiohttp/asgi auto-instrument bakeoff
* ggainey to sched for an hour next Thurs
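The "Pulp reports metrics to an otel-container, which ships them on to Prometheus" pipeline mentioned above can be wired up roughly as follows, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages from the earlier dependency list; the collector endpoint and metric names are illustrative:

```python
# Sketch of the app -> OTel collector -> Prometheus pipeline: configure the
# SDK to export metrics over OTLP/HTTP to a collector. Names are illustrative.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4318/v1/metrics")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# Anything recorded through the API now flows to the collector periodically;
# the collector's Prometheus exporter makes it scrapeable from there.
meter = metrics.get_meter("pulp.demo")
requests_served = meter.create_counter("pulp.demo.requests")
requests_served.add(1, {"app": "content"})
```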
## 2023-02-23 1000-1030 GMT-5

### Attendees: ggainey, bmbouter, decko

### Regrets:

### Agenda:
* Previous AIs:
* (insert topics here)

### Notes
* bmbouter reports
    * got the django auto-instrumentation running in api-workers
    * importing into Jaeger
    * configured oci-env w/ OT visualization
* Let's think about what we *really* want to get out of this effort
    * AI-all for next mtg
* Notes from user-discussions
    * users need to be able to turn it off and on
* practical problems
    * have otel be its own oci-env profile
        * loads Jaeger as a side-container
    * the base img needs a way to turn OT off and on
        * what if we had an instrumented img?
            * leads to combinatorics-fun
            * users want their own imgs
        * what if there was an "instrument this" env-var? (OTEL_ENABLED)
            * otel-pieces installed always, just not always "on" unless asked for via this var
    * let cfg run via env-vars
        * "Here's the OTEL docs, use their env-vars to control behavior"
        * https://www.aspecto.io/blog/opentelemetry-collector-guide/
* discussion of "direct to collector" vs "agent to collector" architectures
    * allows batching, allows data-transformation, allows redaction
    * open question: how should a pulp dev-env be configured?
* example "interesting" metric: https://github.com/pulp/pulpcore/issues/3389
* let's think about the specific packages that might need "RPM-izing"
    * may already be RPM'd in Fedora (maybe RHEL?)
* next step:
    * instrumenting workers (tasking-system)
        * makes an optional otel dependency more problematic
    * what about the aiohttp-server side? (content-svc)
        * https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942
        * maybe we just manually instrument?
            * puts more burden on plugin-writers
        * decko maybe picks up aiohttp-server tracing?
* prob going to meet once/wk

### Action Items:
* add notes to discourse

## 2023-01-11 1100-1130 GMT-5

### Attendees: dalley, ggainey, bmbouter, ipanova, wibbit

### Regrets:

### Agenda:
* Previous AIs:
    * ~~AI: [ggainey] add to team agenda for 12-DEC~~
    * ~~AI: [ggainey] get this on 3-month planning doc for Q1~~
    * ~~AI: [ggainey] to set up another 30-min mtg for 2nd week Jan~~
        * 11-JAN 1100-1130 GMT-5
    * ~~AI: [dalley] to open issue/feature to get this work started~~
        * https://github.com/pulp/pulpcore/issues/3445
    * ~~AI: [ggainey] get minutes into Discourse thread~~
* (insert topics here)

### Notes
* action plan:
    * get a basic infrastructure POC up
    * have one telemetry probe in pulp
* where to start?
    * content-app would be Really Useful
        * aiohttp doesn't have instrumentation yet
    * api-app would be the easiest place
        * django has support
        * "easiest first" is prob a good idea
    * wibbit: more interested in api-worker monitoring
        * biggest bottlenecks are api-calls and repo generation
        * CI efforts push even more in that direction
* next steps
    * who? and when?
        * Grant, Ina, Daniel are interested; any could take it
        * "Q1" - we have stakeholders that would like this "now"
        * prio-list, dalley/ggainey/ipanova to work together?
        * bmbouter/wibbit can help provide insights into "what we want to measure"
    * getting the POC up and running
        * prob not a multi-month process
        * need time to research the OpenTelemetry docs
        * need to understand the oci-env service setup a bit
        * understanding export formats, etc
    * "whoever gets to it first" - consolidate what you/we learn into a doc
        * maybe good for a Friday programming effort?

### Action Items:
* add notes to discourse

## 2022-12-06 1000-1030 GMT-5

### Attendees: bmbouter, dalley, ggainey

### Regrets:

### Agenda:
* Previous AIs:
    * N/A
* Organizational Meeting
    * What do we want this group to accomplish?
    * How often do we want to meet?
    * What should our next meeting look like?
    * Where would we like to be in 3 months?
* Comments from woolsgrs on Discourse:
    * currently running a large-scale deployment on Pulp 2
    * now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup
    * would like to see:
        * Overall status of each deployment, understanding that it is functioning correctly, and capacity metrics around that
        * Able to trigger alerts from these metrics
        * Performance triggers for knowing when to scale up/down
            * e.g. no. of tasks, tasks waiting, etc.; content counts and no. of requests to that content

### Notes
* dalley: first step: get a POC of "followed the tutorial for a django app (apiserver?) and get some basic instrumentation available"
* bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs' requests)
    * launch the POC as a tech-preview
* bmbouter: can this group give a high-level goal instead of being prescriptive?
    * get the POC up **very** quickly
    * ggainey, dalley approve
* dalley: let's get basic infrastructure in place, and **then** start iterating
* TIMEFRAME
    * what's more important than this?
        * satellite support
        * HCaaS
        * AH
        * operator/container work
        * rpm pytest
    * similar importance
        * other pytest conversion
    * do we have a ballpark for a "POC as a PR" date?
        * "Q1" would be good
    * how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
        * it "feels like" we have a couple of folk who could be freed up to work on this?
        * let's bring this up at the core-team mtg next week?
            * or at "sprint" planning?
        * AI: [ggainey] add to team agenda for 12-DEC
        * AI: [ggainey] get this on 3-month planning doc for Q1
* How often should we meet?
    * Proposal: not before the 2nd week in Jan
        * AI: [ggainey] to set up another 30-min mtg that week
        * wing it from there
* Proposal: need an issue/feature "Add basic telemetry support to Pulp3"
    * AI: [dalley] to open it

### Action Items:
* AI: [ggainey] add to team agenda for 12-DEC
* AI: [ggainey] get this on 3-month planning doc for Q1
* AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
    * 11-JAN 1100-1130 GMT-5
* AI: [dalley] to open issue/feature to get this work started
* AI: [ggainey] get minutes into Discourse thread

###### tags: `Minutes`, `OpenTelemetry`