Analytics Working Group
Ongoing Notes
- first three mtgs will be one hour
- going forward, 30 min on the half-hour
List of Questions for Every Metric To Be Gathered
- What question will this help us answer?
- What is a specific example of the data to be gathered?
- How will this metric be stored in the database or gathered at runtime?
- Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
- Is this metric Personally-Identifiable-Data?
- What pulpcore version will this be collected with?
- Is this approved/not-approved?
Analytics-proposal Template
Open Questions
- Do we want to compute processes / host also?
2022-12-01
Attendees: bmbouter ppicka mdellweg dkliban wibbit ggainey
Agenda
- Determined the last regularly scheduled meeting, and followup meetings will happen as-needed
- To finalize the tech-debt, we should work on these two issues:
2022-10-20
Attendees: bmbouter ppicka mdellweg dkliban wibbit ggainey
Agenda
- Here’s a new set of graphs to look at accepting from @mdellweg
- Here’s a proposal to collect, summarize, and visualize postgresql version, which would be a new metric. This is going to be the “live coding” part that I do at Pulpcon to add it.
- https://hackmd.io/zJ1dJe8qQtmzr0JiM1jptw
- discussion around "how do we want to summarize"
- e.g., is X.Y.Z really interesting?
- We want to summarize "versions that matter"
- side discussion: format/organization of main visualization page would be A Good Thing
- FYI lots of new docs here, including importing data from the production site
- Should we be limiting summaries to only systems with at least 2 checkins?
- "yes please" is the consensus
- Proposal: Add “summarization” and “visualization” sections to the “proposal template”
2022-08-25
Attendees: ppicka, ggainey, bmbouter
Agenda
- Interesting resources shared with the group from Mozilla's telemetry groups
- Updates
- Next Steps
- bmbouter to fix whatever the issue is with summarization
- bmbouter to add plugin documentation on the processes and checklists this group currently has in hackmds
- bmbouter to add documentation on how to create the local dev environment
- Future meetings
- Telemetry working group will meet next week, and maybe the week after to finalize some process things and celebrate
- After that telemetry working group will suspend for at least 6 weeks
- Working group will resume as new proposals for metrics are proposed
2022-08-18
Attendees: ggainey, dkliban, bmbouters, ipanova, ppicka
Prev AIs
Agenda
- progress made on finalizing POC
- demo time!
- proposal: have "summarizer" delete old content (rather than replace)
- proposal: have "summarizer" only delete data older-than some window (2 weeks?)
AIs
- bmbouter to take up the proposals above
- add X.Y graph for each component
- next steps:
Links
2022-08-11
Attendees: ggainey, dkliban, ppicka, bmbouters, ipanova, wibbit
Prev AIs
Agenda
- discussion around https://github.com/pulp/pulpcore/pull/3032
- def a good idea, prob want this backported to 3.20
- progress update
- lots of progress being made, not baked yet
- lots of interaction w/ duck@osci
- analytics.pulpproject.org has 2 branches, main and dev
- auto-deploys to 2 diff OSCI deployments
- both use LetsEncrypt TLS
- web-process pod, postgres backend
- django-admin enabled for superuser controls
- modification to how payloads are defined
- consolidates client and server definitions of payload
- using Google's "Protocol Buffer" approach (q.v.)
- what about version mismatches?
- ProtocolBuffer is Opinionated - follow their requirements
- next steps
- charting
- summaries
- manage.py cmd, to be called by openshift cron every 24 hrs
- data expiry
AIs
- bmbouters hoping for a tech demo next mtg
Links
2022-07-21
Attendees:
2022-07-14
Attendees: bmbouters, dkliban, ipanova, ppicka, ggainey
- Current State
- PROBLEMS
- summarization isn't working, investigation isn't getting us past whatever the problem is
- server-side-code pagination isn't working
- DNS for analytics-pulpproject-org to be analytics.pulpproject.org would require all of pulpproject.org to be handed over to cloudflare
- reverse-proxy is possible, POC works but is…suboptimal
- OSCI asking why we're not just running this on their openshift instance/platform
- PROPOSAL
- discussion ensues
- reliability/availability? visibility into admin/monitoring?
- health probe/autorestart-pod should work
- proposal: openapi work to auto-generate client/server side of this
- makes available to other projects who might want to do this
2022-06-16
Attendees: ppicka, bmbouter, ipanova, douglas
- currently pulpcore will post only to the dev site, and only if the user has a .dev installation
- some users could have .dev
Action Items
2022-05-26
Attendees: ppicka, bmbouter, ipanova, dkliban, douglas
- In summarizing numbers, in addition to the mean, do we want max and min also?
- Is it time to sign up for the $5 / month plan?
- How do we make the versions graph not so complicated?
- Keep the raw data including the z-version, but also make a graph that aggregates all Z versions into totals and show that
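The two summarization ideas above (mean plus max/min, and collapsing Z versions into X.Y totals) are simple to sketch with the stdlib; names here are illustrative:

```python
from collections import Counter
from statistics import mean

def aggregate_xy(versions):
    """Collapse full X.Y.Z versions into X.Y counts, e.g. 3.20.5 -> 3.20."""
    return Counter(".".join(v.split(".")[:2]) for v in versions)

def summarize(values):
    """Summarize a numeric metric with mean, min, and max."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}

print(aggregate_xy(["3.20.5", "3.20.3", "3.21.0"]))  # Counter({'3.20': 2, '3.21': 1})
print(summarize([2, 4, 9]))
```

The raw per-Z data stays available for the detailed graph; the aggregated counts feed the simpler X.Y graph.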
Action Items
- [bmbouter] Make a graph that aggregates all Z versions into totals and show that x.y counts
- [bmbouter] Revise telemetry PoC to only have it post dev data
- [bmbouter] Check in with RH about them enabling the pay-plan
2022-04-07
Attendees: ppicka, bmbouter, ipanova, dkliban, ggainey, douglas
- quick review of the graphs with the status data
- duplicate data submission
- expiration_time: 30 days
- there should only be one data point from each system because the key is the systemID
- KV - data format
- {SystemID: {all_the_data, ...}}
- summarization process
- only considers the latest data points posted in the last 24 hours
- Are users allowed to download the raw data?
- No because we're telling users that their raw data is only ever retained for 30 days
- Are users allowed to download the summary data?
- The public analytics site will provide the data, we may allow for downloading of the summarized data later
- how to disable this for dev installs
- have a dev URL and analytics site and a production URL and analytics site
- if pulpcore ends in .dev submit to the dev site otherwise the production site
- similar to what home assistant does
- First implementation not planning to handle proxy configs
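The routing rule described above (".dev" installs post to the dev site, everything else to production) is a one-liner; the URLs here are placeholders since the endpoints were still being settled:

```python
# Hypothetical endpoints for the two deployments.
PRODUCTION_URL = "https://analytics.pulpproject.org/"
DEV_URL = "https://dev.analytics.pulpproject.org/"

def target_url(pulpcore_version):
    """Post to the dev site for .dev installs, otherwise to production."""
    if pulpcore_version.endswith(".dev"):
        return DEV_URL
    return PRODUCTION_URL

print(target_url("3.21.0.dev"))  # dev site
print(target_url("3.20.5"))      # production site
```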
2022-03-24
Attendees: ppicka, bmbouter, ipanova, dkliban
- Will we share the raw data, or just the summarized data?
- We'll provide just the summaries publicly
- See the graphs to be produced at the bottom of the https://hackmd.io/@pulp/telemetry_status document
- Proposal: summarize daily and include only 1 data point from each systemID
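The proposal above (daily summarization keeping only the newest point per systemID) could look like this in-memory sketch; a real implementation would query the database:

```python
from datetime import datetime, timedelta, timezone

def latest_per_system(data_points, now=None, window=timedelta(hours=24)):
    """Keep one data point per systemID: the newest one posted in the window."""
    now = now or datetime.now(timezone.utc)
    newest = {}
    for system_id, posted_at, payload in data_points:
        if posted_at < now - window:
            continue  # outside the daily window
        if system_id not in newest or posted_at > newest[system_id][0]:
            newest[system_id] = (posted_at, payload)
    return {sid: payload for sid, (ts, payload) in newest.items()}

now = datetime(2022, 3, 24, 12, tzinfo=timezone.utc)
points = [
    ("abc", now - timedelta(hours=2), {"workers": 4}),
    ("abc", now - timedelta(hours=1), {"workers": 5}),  # newer, wins
    ("xyz", now - timedelta(days=3), {"workers": 1}),   # stale, ignored
]
print(latest_per_system(points, now=now))
```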
2022-03-17
Attendees: ppicka, dfurlong, ggainey, bmbouters, ipanova
- bmbouter revised POC and demo
- Thoughts
- how/where do we log outgoing info?
- into logs? what level?
- into task progress-report?
- into sep file?
- needs discussion
- what's a good TTL for data sent to CloudFlare?
- cloudflare docs : https://api.cloudflare.com/#custom-hostname-for-a-zone-custom-hostname-details
- HomeAssistant has cloudflare-side worker-code receiving data
- How do we build/maintain summary info?
- What if we send as "uuid-timestamp": "data"?
- details are important - but at a high level, what aggregate/historical data are we actually interested in keeping?
- "What question are we answering" needs an additional "How are we going to visualize that information?"
- keep in mind the difference between "monitoring" and "telemetry"
- AI for all: what kinds of ways would we like to summarize/display/graph the existing data proposal ("status")
2022-02-03
Attendees: ppicka, dfurlong, dkliban, ggainey, bmbouters, ipanova
Prev AIs
Agenda
- review /status/ writeup
- alternative proposal approved
- [all]: What do we want to focus on in the following 30-min mtgs?
- example: how do we develop metrics and test them?
- example: how do we let plugins report?
- example: let's talk about status API
AIs
- [ggainey] hackmd to list "things we might want telemetry proposals for", send link to list
- [ggainey] update telemetry-proposal template to include "discussion", "alternative proposal", "RFE suggestions arising from discussion" sections
Links
2022-01-27
Attendees:
Prev AIs
- [bmbouters] make POC race-condition-free, post data, have a read-UI
- [all]: What do we want to focus on in the following 30-min mtgs?
- example: how do we develop metrics and test them?
- example: how do we let plugins report?
- example: let's talk about status API
- [ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
Agenda
- ggainey to report on anything from OCP Telemetry discussions
AIs
Links
2022-01-20
Attendees: bmbouters, dkliban, dfurlong, ppicka, ipanova, ggainey
- Last 1-hr mtg
- future mtgs 30 min at the half-hour
Prev AIs
- [ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
- contact made, pointers received, email dispatched
- [bmbouter] POC against Cloudflare
- migration that creates UUID
- create CF account
- have periodic wsgi that posts UUID
- post progress to discourse
Agenda
- discussion about POC
- discussion around implications of adding tasking-subsystem to Pulp3
- signed up for Cloudflare k/v account (pulp-infra@ email)
- something is "not right yet" - #soon
- bmbouter to engage CF Discourse
- What are all the ways we could communicate this transparency to users?
- How do we make it Really Easy for users to know what's happening and opt-out?
- docs, release notes, discourse announcement
- social media (tweet, etc)
- youtube demo
- work w/ mcorr RE social-media
- log at start up that telemetry reporting is enabled and refer to a setting which should be changed to disable it
- really important for the Users Who Don't Read Anything
- log every time telemetry is sent
- homeassist does this here
- is periodicity configurable?
- "keep simple things simple" - hardcoded
- KISS - keep it simple stupid
- how often is "often enough"?
- what's the most-reasonable time interval, to most users?
- once/day
- can user control "when during the day" it happens?
- think about network-security-rules?
- at initial-migration-time, dispatch "soon" post-setup
- 30 min post-migrations-run (let pulp-install settle down)
- questions about performance (cpu/memory/etc)
- contact operate-first group
- performance/monitoring is separate from telemetry
- but a still really-useful thing to be doing!
- [dfurlong] memory-use/performance changes over time is really useful
- being able to easily-deliver monitoring results back to pulp from users would be great
- What is the list of questions we want to ask for each metric
- metric-acceptance discussion needs to be "somewhere permanent"
- should be a public checklist for answering these questions
- example: "How we decide if something is PII and how can it be sanitized"
- should be able to connect a specific metric to the exact commit when it entered the codebase
- what happens if/when an API being used to collect telemetry changes what is delivered?
- what if PII gets added (e.g.)
- need to have a data-audit process in place
- an example:
- the data reported from the list of status
- What question will this help us answer?
- How many workers are users running?
- What plugins do they run?
- What is a specific example of the data to be gathered?
- How will this metric be stored in the database or gathered at runtime?
- We'll gather the data at runtime. This should not cause unnecessary load on pulp
- Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
- Is this metric Personally-Identifiable-Data?
- Yes, the hostnames, so they need to be redacted
- discussion about kinds-of-data
- what if post fails
- give up, send it tomorrow
- api call-periodicity?
- api call-sequences?
- should be a standard way for a user to request all their data be removed from the public data store
- can there be a standard test-sequence that investigates metric results for "known PII problems" and fails a metric if/as it finds something?
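The startup-notice idea discussed above (log that reporting is enabled and point at the setting that disables it) could be sketched as follows; the setting name is hypothetical, since the real flag had not been decided yet:

```python
import logging

logger = logging.getLogger("pulp.analytics")

# Hypothetical setting name; the real flag was still undecided at this point.
TELEMETRY_ENABLED = True

def log_startup_notice(enabled=TELEMETRY_ENABLED):
    """Log at startup whether telemetry is on and how to turn it off."""
    if enabled:
        msg = ("Telemetry reporting is enabled; set TELEMETRY_ENABLED=False "
               "in your settings to disable it.")
    else:
        msg = "Telemetry reporting is disabled."
    logger.info(msg)
    return msg

print(log_startup_notice())
```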
AIs
- [bmbouters] make POC race-condition-free, post data, have a read-UI
- [all]: What do we want to focus on in the following 30-min mtgs?
- example: how do we develop metrics and test them?
- example: how do we let plugins report?
- example: let's talk about status API
- [ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
Links
2022-01-14
Attendees: bmbouter, ttereshc, ipanova, dkliban, ggainey
Prev AIs
- [ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
- [bmbouter] talk about budget and direct costs with management
- "it's fine, but be selective about which provider we choose"
- [ttereshc] talk to lzap about Foreman telemetry
- done, largely concerned with performance-monitoring
- do we want to collect performance data? or just usage?
- what other Red Hat telemetry services exist that we may want to integrate with?
- see ttereshc's email for more detail ("Foreman Telemetry")
Agenda
- next mtg 20-JAN, 1 hr, then switch to 30 min
- how is a UUID generated?
- per-pulp-system
- ie, one UUID per-clustered-pulp
- "one UUID per-database"
- how/where will it be stored?
- in db - if it doesn't exist, create one
- if it is in the db, use it
- would survive across restores/rebuilds
- multi-node installs/clusters
- same uuid, multiple nodes reporting - can we tell multi-machine architectures?
- how are we going to periodically post?
- single-node is 'easy'
- clusters
- not a separate call-home service
- periodic pulp-task-posting
- everyone puts data into db (somewhere), someone reports it up
- sanitizing data? - leave for "what do we report" later
- "how often" - performance data prob needs to be gathered more often, for example
- "how often do we write into the db?"
- write at service-startup?
- what about heartbeats?
- feature-use needs to happen more often?
- gather use-data from existing tables
- How do we do a daily task?
- wsgi, distributed-lock, dispatch task, record last-update
- wsgi heartbeat, check against last-dispatch, at correct interval start a new one
- database-xact to force ordering?
- even if it's poss for task to dispatch and yet fail to call home - it's ok
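The wsgi-heartbeat scheme above can be sketched with a thread lock standing in for the distributed lock, and a module variable standing in for the recorded last-update row (all names hypothetical):

```python
import threading
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(hours=24)
_lock = threading.Lock()   # stand-in for a distributed lock
_last_dispatch = None      # stand-in for a last-update row in the database

def post_telemetry():
    """Hypothetical task that does the actual POST."""
    print("telemetry posted")

def maybe_dispatch(now=None):
    """Heartbeat hook: dispatch the daily post if the interval has elapsed.

    Every wsgi process can call this; the lock plus the recorded
    last-dispatch time ensure only one of them actually dispatches.
    """
    global _last_dispatch
    now = now or datetime.now(timezone.utc)
    with _lock:
        if _last_dispatch is not None and now - _last_dispatch < INTERVAL:
            return False  # another heartbeat already dispatched today
        _last_dispatch = now
    post_telemetry()
    return True

start = datetime(2022, 1, 14, tzinfo=timezone.utc)
print(maybe_dispatch(now=start))                       # first dispatch
print(maybe_dispatch(now=start + timedelta(hours=1)))  # too soon, skipped
```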
- what kind-of data is our focus?
- what versions of pulp are installed?
- what's "a typical pulp instance"?
- clustered vs not
- do we gather hardware info? (memory, disk usage, cpus?)
- what about feature-usage data?
- configuration - ie, content of pulp/settings.py?
- ONLY NON-SENSITIVE DATA
- def need to think hard about how to sanitize
- monitoring data?
- not a primary objective
- let's not shut the door on it for future opportunity
- monitoring wants UNsanitized data in order to be actionable
- what's at least one service we can POC against?
- cloudflare, amazon, etc
- bmbouters chooses Cloudflare - it uses Free Starter Account! It's Super-Effective!
- specific cost ballpark - $50-100/month at initial start, poss growing as we learn how much data and storage
- how can we provide full-choice to users to opt-out/opt-in
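The per-database UUID scheme above (create on first use, reuse thereafter, shared by all nodes of a cluster) amounts to a get-or-create. In Pulp this would be a Django model plus migration; here is an illustrative stdlib sketch using sqlite:

```python
import sqlite3
import uuid

def get_or_create_system_uuid(conn):
    """Fetch the single per-database system UUID, creating it on first use.

    Storing it in the database means it survives service restarts and is
    shared by every node of a clustered install (one UUID per database).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS analytics_identity (system_id TEXT)"
    )
    row = conn.execute("SELECT system_id FROM analytics_identity").fetchone()
    if row:
        return row[0]
    system_id = str(uuid.uuid4())
    conn.execute("INSERT INTO analytics_identity VALUES (?)", (system_id,))
    conn.commit()
    return system_id

conn = sqlite3.connect(":memory:")
first = get_or_create_system_uuid(conn)
second = get_or_create_system_uuid(conn)
print(first == second)  # same UUID on every call
```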
AIs
- [ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
- [bmbouter] POC against Cloudflare
- migration that creates UUID
- create CF account
- have periodic wsgi that posts UUID
- post progress to discourse
Links
2022-01-06
Attendees: wibbit, ttereshc, dkliban, bmbouters, ggainey, ppicka, ipanova
- first 2/3 mtgs, 1 hr - then shorten to 30, less often
what do we want from today?
- set goals
- where is the data going to go?
- focus on base infrastructure first, then "what data collected and how"
- process for how to change/mutate/morph the kinds-of data being collected
- timeline possibility:
- base infra posted by end-of-January?
- uuid/one-piece-of-data gathered and sent "somewhere"
- maybe not have a date attached? just work on POC?
- maybe just post Goal, and not worry about Date
- focus on base-infra and where data will go as POC, data-details come Later
- example of a telemetry operation in production use : https://www.home-assistant.io/integrations/analytics
- uses CloudFlare to store data
- don't forget about GDPR (and friends) laws
- what do other projects use?
- OpenShift - need to talk to Other Folks
- AI: establish contact with them?
- What about Foreman?
- lzap driving?
- AI: talk to lzap
- Fedora? crash reports, installation?
- Firefox addon may do this?
- may need some digging, does Fedora still do this?
- talk to Red Hat around direct-cost of supporting such a service
- AI: [bmbouters] talk to rchan
- wibbit: where does data go
- assuming data is sufficiently anonymized to be made public?
- yes please
- keeps us honest about anonymizing
- enhances trust/transparency
- cost of distribution/access to the data from the public
- data-outflow vs data-ingress costs
- wibbit: enterprise env can be draconic around security
- infra needs to support multiple pulp-instances hitting a single internal proxy that is the single point-of-contact to the telemetry service?
- two requirements
- clear docs on details of how data posts
- proxy support
- wibbit: data needs to be staged/stageable locally prior to being submitted
- submit-queue that can be paused/investigated
- bmbouter: adds to better user-knowledge/transparency, good idea
- wibbit: allows for admin-internal-consumption
- dkliban: would help manage multi-pulp-installation
- wibbit: Real People didn't raise any major concerns, beyond "we need to know what's being uploaded"
- wibbit: do we need a consistent UUID over time?
- need to be able to identify across upgrades
- change-over-time is really important
- bmbouter: feature should default-to-on
- ipanova: already long talk in foreman-land on this, see discourse
- wibbit: dflt-to-on is ok
- assumption is admins know what they're doing
- would lose any temporal-system info if dflt-to-off
- caveat: dflt-on for new-install vs upgrade?
- when-introduced, to an existing system, is qualitatively diff than new-install
- let's discuss how to do this "very transparently and loudly"
- where will this flag exist?
what do we want by next week?
- AIs
- [ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
- [bmbouter] talk about budget and direct costs with management
- talk to lzap about Foreman telemetry
- Things for next week's agenda:
- how is a UUID generated?
- how/where will it be stored?
- how are we going to periodically post?
- what's at least one service we can POC against?
Links