# Analytics Working Group
## YYYY-MM-DD
### Attendees:
### Prev AIs
### Agenda
### AIs
### Links
## Ongoing Notes
first three mtgs will be one hour
going forward, 30 min on the half-hour
## List of Questions for Every Metric To Be Gathered
What question will this help us answer?
What is a specific example of the data to be gathered?
How will this metric be stored in the database or gathered at runtime?
Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
Is this metric Personally-Identifiable-Data?
What pulpcore version will this be collected with?
Is this approved/not-approved?
## Analytics-proposal Template
# Title
## What question will this help us answer?
## What is a specific example of the data to be gathered?
## How will this metric be stored in the database or gathered at runtime?
## Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
## Is this metric Personally-Identifiable-Data?
### How can we sanitize this output?
## What pulpcore version will this be collected with?
## Alternative proposal(s)
### Option 1
### Option N
## Discussion notes
## Is this approved/not-approved?
## Parking Lot for potential future/RFE work
###### tags: `Analytics`
## Open Questions
Do we want to compute processes / host also?
## 2022-12-01
### Attendees:
bmbouter ppicka mdellweg dkliban wibbit ggainey
### Agenda
Determined the last regularly scheduled meeting, and followup meetings will happen as-needed
To finalize the tech-debt, we should work on these two issues:
## 2022-10-20
### Attendees:
bmbouter ppicka mdellweg dkliban wibbit ggainey
### Agenda
Here’s a new set of graphs to look at accepting from @mdellweg
Here’s a proposal to collect, summarize, and visualize the PostgreSQL version, which would be a new metric. This is going to be the “live coding” part that I do at Pulpcon to add it.
https://hackmd.io/zJ1dJe8qQtmzr0JiM1jptw
discussion around "how do we want to summarize"
e.g., is X.Y.Z really interesting?
We want to summarize "versions that matter"
side discussion: format/organization of main visualization page would be A Good Thing
FYI: lots of new docs here, including importing data from the production site
Should we be limiting summaries to only systems with at least 2 checkins?
"yes please" is the consensus
Proposal: Add a “summarization” and “visualization” sections to the “proposal template”
## 2022-08-25
### Attendees:
ppicka, ggainey, bmbouter
### Agenda
Interesting resources shared with the group from Mozilla's telemetry groups
Updates
Next Steps
bmbouter to fix whatever the issue is with summarization
bmbouter to add plugin documentation on the processes and checklists this group currently has in hackmds
bmbouter to add documentation on how to create the local dev environment
Future meetings
Telemetry working group will meet next week, and maybe the week after to finalize some process things and celebrate
After that telemetry working group will suspend for at least 6 weeks
Working group will resume as new proposals for metrics are proposed
## 2022-08-18
### Attendees:
ggainey, dkliban, bmbouters, ipanova, ppicka
### Prev AIs
### Agenda
progress made on finalizing POC
demo time!
proposal: have "summarizer" delete old content (rather than replace)
proposal: have "summarizer" only delete data older-than some window (2 weeks?)
### AIs
bmbouter to take up the proposals above
add X.Y graph for each component
next steps:
### Links
## 2022-08-11
### Attendees:
ggainey, dkliban, ppicka, bmbouters, ipanova, wibbit
### Prev AIs
### Agenda
discussion around https://github.com/pulp/pulpcore/pull/3032
def a good idea, prob want this backported to 3.20
progress update
lots of progress being made, not baked yet
lots of interaction w/ duck@osci
analytics.pulpproject.org has 2 branches, main and dev
auto-deploys to 2 diff OSCI deployments
both use LetsEncrypt TLS
web-process pod, postgres backend
django-admin enabled for superuser controls
modification to how payloads are defined
consolidates client and server definitions of payload
using Google's "Protocol Buffer" approach (q.v.)
what about version mismatches?
ProtocolBuffer is Opinionated - follow their requirements
next steps
charting
summaries
manage.py cmd, to be called by openshift cron every 24 hrs
data expiry
### AIs
bmbouters hoping for a tech demo next mtg
### Links
## 2022-07-21
### Attendees:
## 2022-07-14
### Attendees:
bmbouters, dkliban, ipanova, ppicka, ggainey
Current State
PROBLEMS
summarization isn't working, investigation isn't getting us past whatever the problem is
server-side-code pagination isn't working
DNS for analytics.pulpproject.org would require handing all of pulpproject.org over to Cloudflare
reverse-proxy is possible, POC works but is … suboptimal
OSCI asking why we're not just running this on their openshift instance/platform
PROPOSAL
discussion ensues
reliability/availability? visibility into admin/monitoring?
health probe/autorestart-pod should work
proposal: openapi work to auto-generate client/server side of this
makes available to other projects who might want to do this
2022-06-16 Attendees: ppicka, bmbouter, ipanova, douglas
currently pulpcore will post only to the dev site, and only if the user has a .dev installation
some users could have .dev
### Action Items
## 2022-05-26
### Attendees:
ppicka, bmbouter, ipanova, dkliban, douglas
In summarizing numbers, in addition to the mean, do we want max and min also?
Is it time to sign up for the $5 / month plan?
How do we make the versions graph not so complicated?
Keep the raw data including the z-version, but also make a graph that aggregates all Z versions into totals and show that
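The "aggregate all Z versions into X.Y totals" idea can be sketched in Python. This is a minimal sketch; the input dict of full-version counts is a hypothetical stand-in for whatever shape the summarizer actually stores.

```python
from collections import Counter

def aggregate_xy(version_counts):
    """Collapse full X.Y.Z version counts into X.Y totals for a simpler graph."""
    totals = Counter()
    for version, count in version_counts.items():
        x_y = ".".join(version.split(".")[:2])  # e.g. "3.18.5" -> "3.18"
        totals[x_y] += count
    return dict(totals)

# Hypothetical raw counts as checkin data might report them:
raw = {"3.18.5": 4, "3.18.6": 6, "3.20.0": 3}
```

The raw data keeps the Z-version detail; the graph built from `aggregate_xy(raw)` shows only the x.y totals.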
### Action Items
[bmbouter] Make a graph that aggregates all Z versions into totals and show that x.y counts
[bmbouter] Revise telemetry PoC to only have it post dev data
[bmbouter] Check in with RH about them enabling the pay-plan
## 2022-04-07
### Attendees:
ppicka, bmbouter, ipanova, dkliban, ggainey, douglas
quick review of the graphs with the status data
duplicate data submission
expiration_time: 30 days
there should only be one data point from each system because the key is the systemID
KV - data format
{SystemID: {all_the_data, ...}}
summarization process
only considers the latest data points posted in the last 24 hours
Are users allowed to download the raw data?
No because we're telling users that their raw data is only ever retained for 30 days
Are users allowed to download the summary data?
The public analytics site will provide the data, we may allow for downloading of the summarized data later
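The summarization rule above (only the latest data point per systemID, and only within the last 24 hours) can be sketched as follows; the tuple fields are illustrative, not the real checkin schema.

```python
from datetime import datetime, timedelta

def latest_per_system(posts, now, window=timedelta(hours=24)):
    """Keep only the newest post from each system_id inside the summarization window.

    posts: iterable of (system_id, posted_at, data) tuples (illustrative shape).
    """
    latest = {}
    for system_id, posted_at, data in posts:
        if now - posted_at > window:
            continue  # older than the 24-hour window: ignore
        if system_id not in latest or posted_at > latest[system_id][0]:
            latest[system_id] = (posted_at, data)
    return {sid: data for sid, (_, data) in latest.items()}
```

Because the KV store is keyed by systemID, duplicate submissions from one system collapse to a single point either way; this function makes the same guarantee explicit for the daily summarizer.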
how to disable this for dev installs
have a dev URL and analytics site and a production URL and analytics site
if pulpcore ends in .dev submit to the dev site otherwise the production site
similar to what home assistant does
First implementation not planning to handle proxy configs
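The dev-vs-production routing above amounts to a suffix check on the pulpcore version string. A minimal sketch, with hypothetical URLs (the notes only say a dev site and a production site exist):

```python
# Hypothetical endpoints; stand-ins for the real dev and production analytics sites.
DEV_URL = "https://dev.analytics.example/post"
PROD_URL = "https://analytics.example/post"

def target_url(pulpcore_version):
    """Route .dev installs to the dev analytics site, everything else to production."""
    if pulpcore_version.endswith(".dev"):
        return DEV_URL
    return PROD_URL
```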
## 2022-03-24
### Attendees:
ppicka, bmbouter, ipanova, dkliban
Will we share the raw data, or just the summarized data?
We'll provide just the summaries publicly
See the graphs to be produced at the bottom of the https://hackmd.io/@pulp/telemetry_status document
Proposal: summarize daily and include only 1 data point from each systemID
## 2022-03-17
### Attendees:
ppicka, dfurlong, ggainey, bmbouters, ipanova
bmbouter revised POC and demo
Thoughts
how/where do we log outgoing info?
into logs? what level?
into task progress-report?
into sep file?
needs discussion
what's a good TTL for data sent to CloudFlare?
cloudflare docs : https://api.cloudflare.com/#custom-hostname-for-a-zone-custom-hostname-details
HomeAssistant has cloudflare-side worker-code receiving data
How do we build/maintain summary info?
What if we send as "uuid-timestamp": "data"?
details are important - but at a high level, what aggregate/historical data are we actually interested in keeping?
"What question are we answering" needs an additional "How are we going to visualize that information?"
keep in mind the difference between "monitoring" and "telemetry"
AI for all: what kinds of ways would we like to summarize/display/graph the existing data proposal ("status")
## 2022-02-03
### Attendees:
ppicka, dfurlong, dkliban, ggainey, bmbouters, ipanova
### Prev AIs
### Agenda
review /status/ writeup
alternative proposal approved
[all]: What do we want to focus on in the following 30-min mtgs?
example: how do we develop metrics and test them?
example: how do we let plugins report?
example: let's talk about status API
### AIs
[ggainey] hackmd to list "things we might want telemetry proposals for", send link to list
[ggainey] update telemetry-proposal template to include "discussion", "alternative proposal", "RFE suggestions arising from discussion" sections
### Links
## 2022-01-27
### Attendees:
### Prev AIs
[bmbouters] make POC race-condition-free, post data, have a read-UI
[all]: What do we want to focus on in the following 30-min mtgs?
example: how do we develop metrics and test them?
example: how do we let plugins report?
example: let's talk about status API
[ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
### Agenda
ggainey to report on anything from OCP Telemetry discussions
### AIs
### Links
## 2022-01-20
### Attendees:
bmbouters, dkliban, dfurlong, ppicka, ipanova, ggainey
Last 1-hr mtg
future mtgs 30 min at the half-hour
### Prev AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
contact made, pointers received, email dispatched
[bmbouter] POC against Cloudflare
migration that creates UUID
create CF account
have periodic wsgi that posts UUID
post progress to discourse
### Agenda
discussion about POC
discussion around implications of adding tasking-subsystem to Pulp3
signed up for Cloudflare k/v account (pulp-infra@ email)
something is "not right yet" - #soon
bmbouter to engage CF Discourse
What are all the ways we could communicate this transparency to users?
How do we make it Really Easy for user to know what's happening and opt-out?
docs, release notes, discourse announcement
social media (tweet, etc)
youtube demo
work w/ mcorr RE social-media
log at start up that telemetry reporting is enabled and refer to a setting which should be changed to disable it
really important for the Users Who Don't Read Anything
log every time telemetry is sent
homeassist does this here
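The two logging behaviors above (announce at startup that reporting is on and point at the disabling setting, then log every send) can be sketched like this; the logger name and the setting name are assumptions, not Pulp's real ones.

```python
import logging

logger = logging.getLogger("pulp.analytics")

ANALYTICS_ENABLED = True  # hypothetical stand-in for the real Pulp setting

def log_startup():
    """At startup, tell Users Who Don't Read Anything that telemetry is on."""
    if ANALYTICS_ENABLED:
        logger.info(
            "Telemetry reporting is enabled; set ANALYTICS_ENABLED=False to disable it."
        )

def log_post(payload_summary):
    """Log every time telemetry is sent, so admins can see exactly what left the system."""
    logger.info("Telemetry payload sent: %s", payload_summary)
```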
is periodicity configurable?
"keep simple things simple" - hardcoded
KISS - keep it simple stupid
how often is "often enough"?
what's the most-reasonable time interval, to most users?
once/day
can user control "when during the day" it happens?
think about network-security-rules?
at initial-migration-time, dispatch "soon" post-setup
30 min post-migrations-run (let pulp-install settle down)
questions about performance (cpu/memory/etc)
contact operate-first group
performance/monitoring is separate from telemetry
but a still really-useful thing to be doing!
[dfurlong] memory-use/performance changes over time is really useful
being able to easily-deliver monitoring results back to pulp from users would be great
What is the list of questions we want to ask for each metric
metric-acceptance discussion needs to be "somewhere permanent"
should be a public checklist for answering these questions
example: "How we decide if something is PII and how can it be sanitized"
should be able to connect a specific metric to the exact commit when it entered the codebase
what happens if/when an API being used to collect telemetry, changes what is delivered?
what if PII gets added (e.g.)
need to have a data-audit process in place
an example:
the data reported from the /status/ API
What question will this help us answer?
How many workers are users running?
What plugins do they run?
What is a specific example of the data to be gathered?
How will this metric be stored in the database or gathered at runtime?
We'll gather the data at runtime. This should not cause unnecessary load on Pulp
Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
Is this metric Personally-Identifiable-Data?
Yes the hostnames, so it needs to be redacted
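A redaction pass like the one called for above could look like this. The field names (`online_workers`, `online_content_apps`, `name`) are modeled on a /status/-style payload but should be treated as assumptions about the exact schema.

```python
def sanitize_status(status):
    """Drop hostname-bearing fields from worker/app records before the payload is sent."""
    cleaned = dict(status)
    for key in ("online_workers", "online_content_apps"):
        if key in cleaned:
            cleaned[key] = [
                # 'name' is assumed to hold the hostname-bearing identifier
                {k: v for k, v in entry.items() if k != "name"}
                for entry in cleaned[key]
            ]
    return cleaned
```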
discussion about kinds-of-data
what if post fails
give up, send it tomorrow
api call-periodicity?
api call-sequences?
should be a standard way for a user to request all their data be removed from the public data store
can there be a standard test-sequence that investigates metric results for "known PII problems" and fails a metric if/as it finds something?
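A standard "known PII problems" check could start as a pattern scan over the serialized payload; the patterns below are illustrative examples (emails and hostname-looking strings), not a complete PII taxonomy.

```python
import re

# Hypothetical patterns for "known PII problems": emails and hostname-ish strings.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),             # email addresses
    re.compile(r"\b[\w-]+\.(?:example|com|org|net)\b"),  # hostname-like strings
]

def find_pii(payload_text):
    """Return all substrings of a serialized payload that match a known PII pattern.

    A metric's test suite could fail the metric whenever this returns anything.
    """
    hits = []
    for pattern in PII_PATTERNS:
        hits.extend(pattern.findall(payload_text))
    return hits
```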
### AIs
[bmbouters] make POC race-condition-free, post data, have a read-UI
[all]: What do we want to focus on in the following 30-min mtgs?
example: how do we develop metrics and test them?
example: how do we let plugins report?
example: let's talk about status API
[ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
### Links
## 2022-01-14
### Attendees:
bmbouter, ttereshc, ipanova, dkliban, ggainey
### Prev AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
[bmbouter] talk about budget and direct costs with management
"it's fine, but be selective about which provider we choose"
[ttereshc] talk to lzap about Foreman telemetry
done, largely concerned with performance-monitoring
do we want to collect performance data? or just usage?
what other Red Hat telemetry services exist that we may want to integrate with/to?
see ttereshc's email for more detail ("Foreman Telemetry")
### Agenda
next mtg 20-JAN, 1 hr, then switch to 30 min
how is a UUID generated?
per-pulp-system
ie, one UUID per-clustered-pulp
"one UUID per-database"
how/where will it be stored?
in db - if it doesn't exist, create one
if it is in the db, use it
would survive across restores/rebuilds
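The "one UUID per-database, create on first use" scheme above can be sketched as a get-or-create. Here a module-level dict stands in for the single-row table a migration would create; in Pulp this would of course be a Django model, not a dict.

```python
import uuid

# In-memory stand-in for the database row that the notes say a migration creates.
_store = {}

def get_system_id():
    """Return the per-database system UUID, creating and persisting it on first use."""
    if "system_id" not in _store:
        _store["system_id"] = str(uuid.uuid4())  # create once, then reuse forever
    return _store["system_id"]
```

Because the value lives in the database, it is stable across restarts, upgrades, and restores, and all nodes of a clustered install share the same id.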
multi-node installs/clusters
same uuid, multiple nodes reporting - can we tell multi-machine architectures?
how are we going to periodically post?
single-node is 'easy'
clusters
not a separate call-home service
periodic pulp-task-posting
everyone puts data into db (somewhere), someone reports it up
sanitizing data? - lv for "what do we report" later
"how often" - performance data prob needs to be gathered more often, for example
"how often do we write into the db?"
write at service-startup?
what about heartbeats?
feature-use needs to happen more often?
gather use-data from existing tables
How do we do a daily task?
wsgi, distributed-lock, dispatch task, record last-update
wsgi heartbeat, check against last-dispatch, at correct interval start a new one
database-xact to force ordering?
even if it's poss for task to dispatch and yet fail to call home - it's ok
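The timing half of the wsgi-heartbeat scheme above can be sketched as a pure check; in a real deployment `last_dispatch` would live in the database and the check would run under a distributed lock / database transaction so only one process dispatches.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(days=1)  # hardcoded once/day, per the KISS discussion

def should_dispatch(last_dispatch, now):
    """Heartbeat check: dispatch a new telemetry task only if the interval has elapsed.

    last_dispatch is None on a fresh install, otherwise the recorded last-update time.
    """
    return last_dispatch is None or now - last_dispatch >= INTERVAL
```

If the task dispatches but the post itself fails, nothing retries until the next interval, which matches the "even if it fails to call home, it's ok" position above.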
what kind-of data is our focus?
what versions of pulp are installed?
what's "a typical pulp instance"?
clustered vs not
do we gather hardware info? (memory, disk usage, cpus?)
what about feature-usage data?
configuration - ie, content of pulp/settings.py?
ONLY NON-SENSITIVE DATA
def need to think hard about how to sanitize
monitoring data?
not a primary objective
let's not shut the door on it for future opportunity
monitoring wants UNsanitized data in order to be actionable
what's at least one service we can POC against?
cloudflare, amazon, etc
bmbouters chooses Cloudflare - it uses Free Starter Account! It's Super-Effective!
specific cost ballpark - $50-100/month at initial start, poss growing as we learn how much data and storage
how can we provide full-choice to users to opt-out/opt-in
### AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
[bmbouter] POC against Cloudflare
migration that creates UUID
create CF account
have periodic wsgi that posts UUID
post progress to discourse
### Links
## 2022-01-06
### Attendees:
wibbit, ttereshc, dkliban, bmbouters, ggainey, ppicka, ipanova
first 2/3 mtgs, 1 hr - then shorten to 30, less often
what do we want from today?
set goals
where is the data going to go?
focus on base infrastructure first, then "what data collected and how"
process for how to change/mutate/morph the kinds-of data being collected
timeline possibility:
base infra posted by end-of-January?
uuid/one-piece-of-data gathered and sent "somewhere"
maybe not have a date attached? just work on POC?
maybe just post Goal, and not worry about Date
focus on base-infra and where data will go as POC, data-details come Later
example of a telemetry operation in production use : https://www.home-assistant.io/integrations/analytics
uses CloudFlare to store data
don't forget about GDPR (and friends) laws
what do other projects use?
OpenShift - need to talk to Other Folks
AI: establish contact with them?
What about Foreman?
lzap driving?
AI: talk to lzap
Fedora? crash reports, installation?
Firefox addon may do this?
may need some digging, does Fedora still do this?
talk to Red Hat around direct-cost of supporting such a service
AI: [bmbouters] talk to rchan
wibbit: where does data go
assuming data is sufficiently anonymized to be made public?
yes please
keeps us honest about anonymizing
enhances trust/transparency
cost of distribution/access to the data from the public
data-outflow vs data-ingress costs
wibbit: enterprise env can be draconic around security
infra needs to support multiple pulp-instances hitting a single internal proxy that is the single point-of-contact to the telemetry service?
two requirements
clear docs on details of how data posts
proxy support
wibbit: data needs to be staged/stageable locally prior to being submitted
submit-queue that can be paused/investigated
bmbouter: adds to better user-knowledge/transparency, good idea
wibbit: allows for admin-internal-consumption
dkliban: would help manage multi-pulp-installation
wibbit: Real People didn't raise any major concerns, beyond "we need to know what's being uploaded"
wibbit: do we need a consistent UUID over time?
need to be able to identify across upgrades
change-over-time is really important
bmbouter: feature should default-to-on
ipanova: already long talk in foreman-land on this, see discourse
wibbit: dflt-to-on is ok
assumption is admins know what they're doing
would lose any temporal-system info if dflt-to-off
caveat: dflt-on for new-install vs upgrade?
when-introduced, to an existing system, is qualitatively diff than new-install
let's discuss how to do this " very transparently and loudly"
where will this flag exist?
what do want by next week?
### AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
[bmbouter] talk about budget and direct costs with management
talk to lzap about Foreman telemetry
Things for next week's agenda:
how is a UUID generated?
how/where will it be stored?
how are we going to periodically post?
what's at least one service we can POC against?
### Links