OpenTelemetry Working Group

See also: https://discourse.pulpproject.org/t/monitoring-telemetry-working-group/700/8

Overview

  • Purpose: Brainstorm/design how to make Pulp3 more monitor-able. Prioritize the kinds of monitoring that would be useful to administrators.
  • Attendees: Pulp dev-team and any interested Pulp administrators

Template

##  YYYY-MM-DD 1300-1330 GMT-5
### Attendees: 
### Regrets:
### Agenda:
* Previous AIs:
* (insert topics here)
### Notes
### Action Items:
* add notes to discourse

2023-05-11 1400-1430 GMT-3

Attendees: decko, bmbouter, dalley

Regrets:

Agenda:

Notes

  • We need to add the OpenTelemetry libs as an exception in the plugin_template's check_requirements check for the CI
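
A rough sketch of what that exception could look like, assuming check_requirements.py scans requirements files and flags dependencies it does not recognize (the real script in plugin_template may be structured differently):

```python
# Hypothetical sketch: treat the OpenTelemetry packages as an allowed
# exception in a requirements check that otherwise flags new dependencies.
ALLOWED_PREFIXES = ("opentelemetry-",)

def is_exempt(requirement_line: str) -> bool:
    """Return True if this requirement should be skipped by the check."""
    name = requirement_line.split(";")[0]  # drop environment markers
    for sep in ("==", ">=", "<=", "~=", ">", "<"):
        name = name.split(sep)[0]
    return name.strip().lower().startswith(ALLOWED_PREFIXES)
```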

Action Items:

  • [decko] add notes to discourse
  • [decko] Merge Otel oci-env profile
  • [decko] Push the opentelemetry aiohttp-server PR
  • [bmbouter] Fix the plugin_template's check_requirements.py
  • [bmbouter] Add worker telemetry

2023-05-04 1300-1330 GMT-4

Attendees: decko, bmbouter, ggainey, dalley

Regrets:

Agenda:

  • Previous AIs:

Notes

  • "what does 'done' looks like?"
    • complete aiohttp-instr-PR
    • base-version of grafana panels done
    • make a decision on how the packaging is going to work
    • record two demo videos (jaeger/grafana)
    • user-documentation
      • pulpcore-docs?
      • pulp-oci-images-docs?
      • pulp-to-collector overview/connection
      • "maybe" otel-cfg we're using in oci-env?
      • blog-post - can be ultra-specific, if/as desired
  • api, pulp-specific instr, aiohttp, task-system
    • api is in a great spot
    • aiohttp is pretty good, but we really want it released upstream to consume
    • nothing for tasking yet - phase-2 perhaps?
    • pulp-specific hooks - phase-3?
  • decko: need to have solid decision on packaging-discussion
    • What do we need to have OpenTelemetry working with Pulp?
      • Collector container
      • Bunch of dependencies
        • opentelemetry-api
        • opentelemetry-distro[otlp]
        • opentelemetry-exporter-otlp-proto-http
        • opentelemetry-exporter-otlp
        • opentelemetry-instrumentation-wsgi
        • opentelemetry-instrumentation-django
        • opentelemetry-semantic-conventions
        • opentelemetry-proto
        • opentelemetry-sdk
      • Instrumenting the WSGI entrypoint (wsgi.py) (see the sketch after this list)
      • The Faster Release Cycle
        • "We can undo it later if needed"
  • https://github.com/pulp/pulpcore/pull/3632
    • we could accept the import of opentelemetry.instrumentation.wsgi today (it exists)
    • the import of opentelemetry.instrumentation.aiohttp_server needs work that isn't released in that upstream yet
      • what can we do while waiting (if we have to)?
        • fork the project? <== NO
        • depend on the source-checkout? <== better
        • vendor the PR?
      • we'll decide when we've done everything else we can do
  • Decision:
    • split aiohttp-mods out of current core-PR
    • add trimmed dep-list
    • release
    • aiohttp-into-core gets its own PR
      • requirements.txt relies on src-checkout
      • merges once upstream accepts the needed changes
  • bring this to pulpcore-mtg next week
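
For the "Instrumenting the WSGI entrypoint" item above, a minimal sketch of what wrapping pulpcore's wsgi.py with the upstream WSGI instrumentation could look like (the surrounding pulpcore code is assumed, not quoted from the PR):

```python
# Sketch only: wrap the Django WSGI application with the upstream
# OpenTelemetry WSGI middleware so each API request produces a span.
from django.core.wsgi import get_wsgi_application
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware

application = OpenTelemetryMiddleware(get_wsgi_application())
```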

Action Items:

  • add notes to discourse
  • schedule next mtg: The Road To Merge

2023-04-27 1300-1330 GMT-5

Attendees: decko, bmbouter, dralley, ggainey

Regrets:

Agenda:

  • Previous AIs:

Notes

  • decko showed off his progress
  • discussion around what can we do further?
    • can we set things up to pre-load Grafana dashboards from prepared JSON at start-time?
  • what are, say, "4 things we want in a Grafana dashboard"?
    • content-app
      • response-codes over time
        • organize by class (400/500/OK)?
        • show my 500s only?
      • req-latency
        • avg
        • P95
        • P99
      • "cost" items
        • how many bytes have been served
          • a 202 is different from a 301
        • can we gather metrics per-domain?
          • where do/can we record that?
          • /pulp/content/DOMAIN/content-URL
          • upstream (aiohttp) vs downstream (in-pulp)
          • can we attach this data as a header to the request, and record that header?
      • proposal: 'launch' w/ response-codes/latency, in pretty "basic" graphs, preloaded into oci-env profile (see the metrics sketch after this list)
  • discussion: what needs to happen to get aiohttp-instrumentation-PR merged?
  • discussion (brief) around a "pulp_telemetry" 'shim'
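
To make the dashboard wishlist above more concrete, a hedged sketch of how an aiohttp middleware in the content-app could record status codes, latency, and bytes served as OpenTelemetry metrics; the metric names, attribute names, and the domain lookup are illustrative placeholders, not settled conventions:

```python
import time

from aiohttp import web
from opentelemetry import metrics

meter = metrics.get_meter("pulp.content")
requests_total = meter.create_counter("content.requests", description="requests by status code")
request_latency = meter.create_histogram("content.request.duration", unit="s")
bytes_served = meter.create_counter("content.bytes.served", unit="By")

@web.middleware
async def telemetry_middleware(request, handler):
    start = time.monotonic()
    response = await handler(request)
    # Placeholder: a real implementation would derive the domain from the
    # /pulp/content/<DOMAIN>/... path or from a header set inside pulp.
    attrs = {"http.status_code": response.status, "pulp.domain": "default"}
    requests_total.add(1, attrs)
    request_latency.record(time.monotonic() - start, attrs)
    if response.content_length:
        bytes_served.add(response.content_length, attrs)
    return response
```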

Action Items:

  • add notes to discourse

2023-04-20 1300-1330 GMT-5

Attendees: decko, dralley, ggainey

Regrets:

Agenda:

  • Previous AIs:

Notes

  • see instructions in the profile-readme in the oci-env PR for a how-to

Action Items:

  • add notes to discourse

2023-04-13 1300-1330 GMT-5

Attendees: ggainey, decko

Regrets:

Agenda:

Notes

  • review/discussion of some test failures
  • current kind-of-a-plan for aiohttp-metrics-work
    • move tests to pytest, get them running clean (#soon)
    • add metrics taking advantage of this fork
    • think on what tests we prob should have in addition, write them, get them running clean
    • submit otel-aiohttp PR upstream
    • continue adding metrics to "our" fork independently
  • AI: ggainey to start using oci_env profile PR for this
  • AI: review https://github.com/pulp/pulp-oci-images/pull/469
  • AI: decko to add what is missing from the #pulpcore/3632

Action Items:

  • [ggainey] to start using oci_env profile PR for this
  • [decko] get tests running locally to see why docker-side fails when podman-side succeeds
  • [any] review https://github.com/pulp/pulp-oci-images/pull/469
  • [decko] to add what is missing from the #pulpcore/3632
  • [ggainey] schedule next mtg for next week
  • [ggainey]add notes to discourse

2023-04-06 1300-1330 GMT-5

Attendees: decko, dalley, ggainey

Regrets:

Agenda:

  • Previous AIs:

Notes

  • PRs are in-progress, to be submitted this week
  • HMS mtg taught some things we'll prob steal :)
  • discussion around a plugin-approach to making otel available
    • maybe just hooks in core, that do nothing w/out pulp_otel installed?

Action Items:

  • add notes to discourse

2023-03-31 1330-1400 GMT-5

Attendees: dalley, decko, ggainey

Regrets:

Agenda:

Notes

Action Items:

  • add notes to discourse
  • ggainey to sched next mtg for next Thurs

2023-03-22 1330-1400 GMT-4

Attendees: decko, jsherrill, dralley, ggainey

Regrets:

Agenda:

  • jsherrill to show us what his team is doing w/ monitoring/metrics
  • NOTES
    • might be "some" guidance available
    • we're all making this up as we go along
    • http status/latency
    • msg-latency/error-rate (like tasks?)
    • some "analytics" info mixed in
    • grafana dashboard to visualize
      • started from a template
      • export JSON in order to import into app
    • PR for visualization changes is currently "exciting"
    • having a full-time data visualization expert would be a Good Thing
    • discussion around SLOs (uptime/breach rules/alerting)
    • "best practices" still "up in the air"?
    • there are tests for alerts
    • review your output - sometimes, there are bugs
  • QUESTIONS
    • app implements a /metrics endpoint
    • gathered metrics thrown into prometheus
      • How are metrics produced to make them available to /metrics?
        • prometheus-client in go does the heavy lifting (see the sketch after this list)
    • AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
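
For reference, the pattern described (a client library doing the heavy lifting behind a /metrics endpoint) looks roughly like this in Python with prometheus_client; this is an illustration of the shape of their Go setup, not something Pulp ships:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request():
    # Record latency and a per-status counter for each request handled.
    with LATENCY.time():
        ...  # real work happens here
    REQUESTS.labels(status="200").inc()

# Serves the /metrics endpoint in a background thread; Prometheus scrapes it.
start_http_server(8000)
```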

Action Items:

  • AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
  • add notes to discourse
  • schedule mtg for next week

2023-03-16 1400-1500 GMT-4

Attendees: ggainey, dalley, decko, bmbouter

Regrets:

Agenda:

  • Previous AIs:
    • oci-env/otel work to continue to completion
    • decko to go from the above, to instrumenting workers
    • aiohttp/asgi auto-instrument bakeoff

Notes

  • discussion RE decko's experiences
    • worker-trace-example! Woo!
    • looking at metrics in grafana - double wooo!
  • How do we get admins of actual services involved in setting up the kinds of metrics/visualizations?
    • AI: [ggainey] invite jsherrill to come demo their Grafana env for us
  • What would be nice:
    • Docs written from an Operational perspective
      • "Here's a Thing you want to know, here are graphs that will help you answer it"
      • Example: "Is pulp serving content correctly? - visualize content-app status codes"
  • Next-steps sequence
    • finish oci-env profile
    • start working on some "standard" graphs
    • work on how-to docs
    • work on demos
    • how can we merge better w/ pulpcore?
      • right way to merge new libs to project?
      • responding to various installation-scenarios
    • discussion
      • single-container - s6-svc
      • what if users don't want to spin up otel? What happens to the app?
      • pulp-otel-enabled variable - default to False (see the sketch after this list)
        • what does that mean?
        • does not mean that otel-libs aren't installed (are they direct-deps or not? will be incl in img regardless)
      • multiprocess container - there's another svc running
      • docs should call out/link to docs RE feature-flip vars for the auto-instr libs
  • "direct dependency vs not" discussion
    • if it is, you can't uninstall it
    • not everything has to be a hard-dep
    • maybe start with not-required
  • prioritizing aiohttp server PR might be worthwhile
    • acceptance is out of our control
    • will take more time to get an aiohttp-lib w/ the support released
  • auto-instr pkgs aren't going to include correlation-id-support for pulp's cids
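
A sketch of the "off by default, not necessarily a hard dependency" direction discussed above; the flag name and the guarded import are assumptions for illustration:

```python
import os

# Hypothetical flag; default is off, so nothing changes for users who have
# not enabled (or even installed) the OpenTelemetry packages.
if os.getenv("PULP_OTEL_ENABLED", "").lower() in ("1", "true", "yes"):
    try:
        from opentelemetry.instrumentation.django import DjangoInstrumentor
    except ImportError:
        DjangoInstrumentor = None  # otel libs not installed; stay disabled
    if DjangoInstrumentor is not None:
        DjangoInstrumentor().instrument()
```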

Action Items:

  • AI: [ggainey] invite jsherrill to come demo their Grafana env for us
  • add notes to discourse

2023-03-09 1400-1500 GMT-5

Attendees: ggainey, decko, dralley, bmbouter

Regrets:

Agenda:

  • Previous AIs:
    • bmbouter to do little perf-test
    • bmbouter/decko to work together to get oci-env to run otel setup
    • decko to go from the above, to instrumenting workers
    • aiohttp/asgi auto-instrument bakeoff
    • ggainey to sched for an hour next Thurs
  • (insert topics here)

Notes

  • some data on performance-impact of traces : https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1556
    • discussion ensues
    • maybe we want to test more?
  • if our worker span instrumentation is "feature-flippable", we're implying a direct dependency on otel being packaged
    • prob want to discuss at pulpcore mtgs
  • oci-env work in progress
    • really close to having an otel-env
    • work continues
  • how's the worker-instrumentation going to work?
    • can we get a span that covers creation-to-end?
    • dispatch-to-start is a good thing to know
    • just span for run-to-complete is "easy"
    • can we use the correlation-id as a span-id?
      • as opposed to task-uuid?
      • BUT - think about dispatching-a-task-group
  • what about metrics (as opposed to spans)
    • auto-instrumentation setup has its own metrics
    • are there Things we'd like to add to our code specifically?
      • for tasking system, almost certainly
      • per-worker metric(s)
        • fail-rate
        • task-throughput
        • what happens when workers go-away?
          • attach metrics to worker-names?
        • missing-worker-events
          • interpretation is key
        • system-metrics as a whole
          • wait-q-size
          • waiting-lock-evaluation ("concurrency opportunity")
            • ratio tasks/possible_concurrency
            • discussion around how workers dispatch themselves
    • thinking like an admin
      • do I have too much hardware in use?
      • not enough?
      • how do I know "something is going wrong"?
    • "service-level-indicator": how much time does a task wait before start
    • "possible concurrency": how many could start, assuming enough workers?
    • "utilization": what percentage of workers are "typically" busy?

Action Items:

  • oci-env/otel work to continue to completion
  • decko to go from the above, to instrumenting workers
  • aiohttp/asgi auto-instrument bakeoff
  • add notes to discourse
  • ggainey to schedule next for one week out

2023-03-02 1000-1030 GMT-5

Attendees: ggainey, bmbouter, decko, dralley

Regrets:

Agenda:

Notes

Action Items:

  • bmbouter to do little perf-test
  • bmbouter/decko to work together to get oci-env to run otel setup
  • decko to go from the above, to instrumenting workers
  • aiohttp/asgi auto-instrument bakeoff
  • ggainey to sched for an hour next Thurs

2023-02-23 1000-1030 GMT-5

Attendees: ggainey, bmbouter, decko

Regrets:

Agenda:

  • Previous AIs:
  • (insert topics here)

Notes

  • bmbouter reports
    • got django auto-implementation running in api-workers
    • importing into Jaeger
    • configured oci-env w/ OT visualization
  • Let's think about what we really want to get out of this effort?
    • AI-all for next mtg
  • Notes from user-discussions
    • users need to be able to turn it off and on
  • practical problems
    • have otel be its own oci-env profile
      • loads Jaeger as a side-container
    • base img needs a way to turn OT off and on
    • what if we had an instrumented img?
      • leads to combinatorics-fun
      • users want their own imgs
    • what if there was an "instrument this" env-var? (OTEL_ENABLED)
      • otel-pieces installed always, just not always "on" unless asked for via this var
    • let cfg run via env-vars
      • "Here's the OTEL docs, use their env-vars to control behavior"
  • https://www.aspecto.io/blog/opentelemetry-collector-guide/
    • discussion of "direct to collector" vs "agent to collector" architectures
      • allows batching, allows data-transformation, allows redaction
  • open question: how should a pulp dev-env be configured?
  • example "interesting" metric : https://github.com/pulp/pulpcore/issues/3389
  • let's think about the specific packages that might need "RPM-izing"
    • may already be RPM'd in Fedora (maybe RHEL?)
  • next step:
  • prob going to meet once/wk
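
On the "use their env-vars" point, the standard OpenTelemetry environment variables already cover most of the exporter configuration; a profile or container entrypoint could set something like the following (service name and endpoint are examples only):

```python
import os

# Standard OpenTelemetry SDK/auto-instrumentation environment variables.
os.environ.setdefault("OTEL_SERVICE_NAME", "pulp-api")
os.environ.setdefault("OTEL_TRACES_EXPORTER", "otlp")
os.environ.setdefault("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4318")
```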

Action Items:

  • add notes to discourse

2023-01-11 1100-1130 GMT-5

Attendees: dalley, ggainey, bmbouter, ipanova, wibbit

Regrets:

Agenda:

  • Previous AIs:
    • AI: [ggainey] add to team agenda for 12-DEC
    • AI: [ggainey] get this on 3-month planning doc for Q1
    • AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
      • 11-JAN 1100-1130 GMT-5
    • AI: [dalley] to open issue/feature to get this work started
    • AI: [ggainey] get minutes into Discourse thread
  • (insert topics here)

Notes

  • action plan:
    • get basic infrastructure POC up
    • have one telemetry probe in pulp
  • where to start?
    • content-app would be Really Useful
      • aiohttp doesn't have instrumentation yet
    • api-app would be easiest place
      • django has support (see the sketch after this list)
    • "easiest first" is prob a good idea
    • wibbit: more interested in api-worker monitoring
      • biggest bottlenecks are api-calls and repo generation
      • CI efforts push even more in that direction
  • next steps
    • who? and when?
      • Grant, Ina, Daniel are interested, any could take it
    • "Q1" - we have stakeholders that would like this "now"
    • prio-list, dalley/ggainey/ipanova to work together?
    • bmbouter/wibbit can help provide insights into "what we want to measure"
  • getting POC up and running
    • prob not a multi-month process
    • need time to research OpenTelemetry docs
    • need to understand oci-env service-setup a bit
    • understanding export-formats, etc
    • "whoever gets to it first" - consolidate what you/we learn into a doc
    • maybe good for a Friday programming effort?
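
Since the Django/api-app route is called out as the easiest starting point, a minimal sketch of what that first probe could look like (where this would live in pulpcore is still to be decided; the exporter endpoint defaults to a local OTLP collector):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to a local collector/Jaeger via OTLP over HTTP.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Auto-instrument Django so every API request produces a span.
DjangoInstrumentor().instrument()
```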

Action Items:

  • add notes to discourse

2022-12-06 1000-1030 GMT-5

Attendees: bmbouter, dalley, ggainey

Regrets:

Agenda:

  • Previous AIs:
    • N/A
  • Organizational Meeting
    • What do we want this group to accomplish?
    • How often do we want to meet?
    • What should our next meeting look like?
    • Where would we like to be in 3 months?
  • Comments from woolsgrs on Discourse:
    • currently running a large scale deployment on Pulp 2
    • now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.
    • would like to see:
      • Overall status of each deployment, understanding whether it's functioning correctly, and capacity metrics around that.
      • Able to trigger alerts from these metrics
      • Performance for triggers knowing when to scale up/down
        • e.g. no. of tasks, tasks waiting, etc.
      • Content counts and no. of requests to that content

Notes

  • dalley: first step: get a POC of "followed the tutorial for a django app (apiserver?) and get some basic instrumentation available"
  • bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs' requests)
  • launch POC as a tech-preview
  • bmbouter: can this group give high-level goal instead of being prescriptive?
    • get POC up very quickly
    • ggainey, dalley approve
  • dalley: let's get basic infrastructure in place, and then start iterating
  • TIMEFRAME
    • what's more important than this?
      • satellite support
      • HCaaS
      • AH
      • operator/container work
      • rpm pytest
    • similar importance
      • other pytest conversion
    • do we have a ballpark for "POC as a PR" date?
      • "Q1" would be good
      • how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
      • it "feels like" we have a couple of folk who could be freed up to work on this?
      • let's bring this up at the core-team mtg next week?
      • or at "sprint" planning?
      • AI: [ggainey] add to team agenda for 12-DEC
      • AI: [ggainey] get this on 3-month planning doc for Q1
  • How often should we meet?
    • Proposal: not before 2nd week in Jan
      • AI: [ggainey] to set up another 30-min mtg that week
      • wing it from there
  • Proposal: need an issue/feature "Add basic telemetry support to Pulp3"
    • AI: [dalley] to open

Action Items:

  • AI: [ggainey] add to team agenda for 12-DEC
  • AI: [ggainey] get this on 3-month planning doc for Q1
  • AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
    • 11-JAN 1100-1130 GMT-5
  • AI: [dalley] to open issue/feature to get this work started
  • AI: [ggainey] get minutes into Discourse thread
tags: Minutes, OpenTelemetry