# Telemetry Working Group

* discourse thread: https://discourse.pulpproject.org/t/proposal-telemetry/259/2
* meeting-notes template

```
## YYYY-MM-DD
### Attendees:
### Prev AIs
### Agenda
### AIs
### Links
```

## Ongoing Notes

* first three mtgs will be one hour
* going forward, 30 min on the half-hour

## List of Questions for Every Metric To Be Gathered

* What question will this help us answer?
* What is a specific example of the data to be gathered?
* How will this metric be stored in the database or gathered at runtime?
* Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
* Is this metric Personally-Identifiable-Data?
* What pulpcore version will this be collected with?
* Is this approved/not-approved?

### Telemetry-proposal Template

* **NOTE**: this is available as the "Telemetry" template in hackmd.io/pulp!
* https://hackmd.io/@pulp/telemetry_template

```
# Title

## What question will this help us answer?

## What is a specific example of the data to be gathered?

## How will this metric be stored in the database or gathered at runtime?

## Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?

## Is this metric Personally-Identifiable-Data?

### How can we sanitize this output?

## What pulpcore version will this be collected with?

## Alternative proposal(s)

### Option 1

### Option N

## Discussion notes

## Is this approved/not-approved?

## Parking Lot for potential future/RFE work

###### tags: `Telemetry`
```

## Open Questions

* Do we want to compute processes / host also?
* Should we configure this to post to analytics.pulpproject.org?
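The second open question is effectively settled in the notes that follow (2022-06-16 and 2022-04-07): route on the version string, posting to a dev analytics site when the pulpcore version ends in `.dev` and to production otherwise, similar to what Home Assistant does. A minimal sketch of that routing, with made-up function and constant names (not Pulp's actual code):

```python
# Illustrative sketch of the .dev-routing decision from the notes.
# The names below are hypothetical; only the two URLs come from the notes.

PRODUCTION_URL = "https://analytics.pulpproject.org/"
DEV_URL = "https://analytics-pulpproject-org.pulpproject.workers.dev/"


def analytics_post_url(pulpcore_version: str) -> str:
    """Pick the analytics endpoint based on the installed pulpcore version."""
    if pulpcore_version.endswith(".dev"):
        # Development installs post only to the dev site.
        return DEV_URL
    return PRODUCTION_URL
```

This keeps the choice implicit in the install itself, so no extra setting is needed to keep dev data out of the production numbers.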
## 2022-06-16

### Attendees: ppicka, bmbouter, ipanova, douglas

* currently pulpcore will post only to the dev site, and only if the user has a .dev installation
  * some users could have .dev installations

### Action Items

* [bmbouter] Make a graph that aggregates all Z versions into totals and shows the x.y counts
* [bmbouter] Put up "coming soon" page
* [bmbouter] Get analytics.pulpproject.org DNS integrated with https://analytics-pulpproject-org.pulpproject.workers.dev/
* [bmbouter] Reset the https://analytics-pulpproject-org.pulpproject.workers.dev/ environment
* [bmbouter] make additional graphs for each expected plugin version posted
* [bmbouter] go through and implement pagination in summary data

## 2022-05-26

### Attendees: ppicka, bmbouter, ipanova, dkliban, douglas

* In summarizing numbers, in addition to the mean, do we want max and min also?
  * not for now
* Is it time to sign up for the $5/month plan?
  * yes
* How do we make the versions graph not so complicated?
  * Keep the raw data including the z-version, but also make a graph that aggregates all Z versions into totals and shows that

### Action Items

* [bmbouter] Make a graph that aggregates all Z versions into totals and shows the x.y counts
* [bmbouter] Revise telemetry PoC to only have it post dev data
* [bmbouter] Check in with RH about them enabling the pay-plan

## 2022-04-07

### Attendees: ppicka, bmbouter, ipanova, dkliban, ggainey, douglas

* quick review of the graphs with the status data
  * https://hackmd.io/@pulp/telemetry_status#Graphs-to-be-produced
* duplicate data submission
  * expiration_time - 30 days
  * there should only be one data point from each system because the key is the systemID
* KV data format
  * `{SystemID: {all_the_data, ...}}`
* summarization process
  * only considers the latest data points posted in the last 24 hours
* Are users allowed to download the raw data?
  * No, because we're telling users that their raw data is only ever retained for 30 days
* Are users allowed to download the summary data?
  * The public analytics site will provide the data; we may allow downloading of the summarized data later
* how to disable this for dev installs
  * have a dev URL and analytics site, and a production URL and analytics site
  * if the pulpcore version ends in .dev, submit to the dev site, otherwise the production site
  * similar to [what Home Assistant does](https://github.com/home-assistant/core/blob/4d72e41a3e88f696d255dc73e4f4e8ec88b1874f/homeassistant/components/analytics/analytics.py#L99)
* First implementation is not planning to handle proxy configs

## 2022-03-24

### Attendees: ppicka, bmbouter, ipanova, dkliban

* Will we share the raw data, or just the summarized data?
  * We'll provide just the summaries publicly
* See the graphs to be produced at the bottom of the https://hackmd.io/@pulp/telemetry_status document
* Proposal: summarize daily and include only 1 data point from each systemID

## 2022-03-17

### Attendees: ppicka, dfurlong, ggainey, bmbouters, ipanova

* bmbouter revised POC and demo
  * https://github.com/pulp/pulpcore/pull/2118/files
  * key/value - "systemid": "telemetry-key:value"
* Thoughts
  * how/where do we log outgoing info?
    * into logs? what level?
    * into task progress-report?
    * into a separate file?
    * needs discussion
  * what's a good TTL for data sent to Cloudflare?
    * Cloudflare docs: https://api.cloudflare.com/#custom-hostname-for-a-zone-custom-hostname-details
    * Home Assistant has Cloudflare-side worker code receiving data
  * How do we build/maintain summary info?
    * What if we send as "uuid-timestamp": "data"?
  * details are important - but at a high level, what aggregate/historical data are we actually interested in keeping?
  * "What question are we answering?" needs an additional "How are we going to visualize that information?"
* keep in mind the difference between "monitoring" and "telemetry"
* AI for all: what kinds of ways would we like to summarize/display/graph the existing data proposal ("status")?

## 2022-02-03

### Attendees: ppicka, dfurlong, dkliban, ggainey, bmbouters, ipanova

### Prev AIs

* ggainey status writeup: https://hackmd.io/@pulp/telemetry_status
  * great discussion ensues

### Agenda

* review /status/ writeup
  * alternative proposal approved
* [all]: What do we want to focus on in the following 30-min mtgs?
  * example: how do we develop metrics and test them?
  * example: how do we let plugins report?
  * example: let's talk about status API

### AIs

* ~~[ggainey] hackmd to list "things we might want telemetry proposals for", send link to list~~
  * https://hackmd.io/@pulp/telemetry_suggestions
* [ALL] everyone adds one line to ^^
* [ggainey] ~~update telemetry-proposal template to include "discussion", "alternative proposal", "RFE suggestions arising from discussion" sections~~

### Links

* https://hackmd.io/@pulp/telemetry_status
* https://hackmd.io/@pulp/telemetry_suggestions

## 2022-01-27

### Attendees:

### Prev AIs

* [bmbouters] make POC race-condition-free, post data, have a read-UI
* [all]: What do we want to focus on in the following 30-min mtgs?
  * example: how do we develop metrics and test them?
  * example: how do we let plugins report?
  * example: let's talk about status API
* [ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
  * https://hackmd.io/@pulp/telemetry_status

### Agenda

* ggainey to report on anything from OCP Telemetry discussions
  * response back from Nick Stielau
  * I have a link to an internal doc on how/what his group is measuring
  * Standing offer to have a 30-min telemetry/metric overview discussion, have not set a date yet
* Pointer to https://www.productled.org/foundations/product-led-growth-metrics for general info (if anyone hasn't seen this before)

### AIs

### Links

## 2022-01-20

### Attendees: bmbouters, dkliban, dfurlong, ppicka, ipanova, ggainey

* Last 1-hr mtg
  * future mtgs 30 min at the half-hour

### Prev AIs

* [ggainey] establish contact with Carl Trieloff RE OpenShift data gathering [gchat]
  * contact made, pointers received, email dispatched
* [bmbouter] POC against Cloudflare
  * migration that creates UUID
    * https://github.com/pulp/pulpcore/pull/2118/files
  * create CF account
    * done, tied to pulp-infra
  * have periodic wsgi that posts UUID
  * post progress to discourse

### Agenda

* discussion about POC
* discussion around implications of adding tasking-subsystem to Pulp3
* signed up for Cloudflare k/v account (pulp-infra@ email)
  * something is "not right yet" - #soon
  * bmbouter to engage CF Discourse
    * https://discord.gg/cloudflaredev
* What are all the ways we could communicate this transparency to users?
* How do we make it Really Easy for users to know what's happening and opt out?
  * docs, release notes, discourse announcement
  * social media (tweet, etc)
  * youtube demo
  * work w/ mcorr RE social media
  * log at startup that telemetry reporting is enabled, and refer to a setting which should be changed to disable it
    * really important for the Users Who Don't Read Anything
  * log every time telemetry is sent
    * Home Assistant does this [here](https://github.com/home-assistant/core/blob/4d72e41a3e88f696d255dc73e4f4e8ec88b1874f/homeassistant/components/analytics/analytics.py#L97)
* is periodicity configurable?
  * "keep simple things simple" - hardcoded
    * KISS - keep it simple, stupid
  * how often is "often enough"?
    * what's the most-reasonable time interval, to most users?
    * once/day
  * can user control "when during the day" it happens?
    * think about network-security rules?
  * at initial-migration time, dispatch "soon" post-setup
    * 30 min post-migrations-run (let pulp-install settle down)
* questions about performance (cpu/memory/etc)
  * contact operate-first group
  * performance/monitoring is separate from telemetry
    * but still a really-useful thing to be doing!
  * [dfurlong] memory-use/performance changes over time is really useful
    * being able to easily deliver monitoring results *back to pulp* from users would be great
* What is the list of questions we want to ask for each metric?
  * metric-acceptance discussion needs to be "somewhere permanent"
  * should be a public checklist for answering these questions
    * example: "How we decide if something is PII and how can it be sanitized"
  * should be able to connect a specific metric to the exact commit when it entered the codebase
  * what happens if/when an API being used to collect telemetry changes what is delivered?
    * what if PII gets added (e.g.)?
    * need to have a data-audit process in place
  * an example: the data reported from the `/status/` API
    * What question will this help us answer?
      * How many workers are users running?
      * What plugins do they run?
    * What is a specific example of the data to be gathered?
      * [example TBD]
    * How will this metric be stored in the database or gathered at runtime?
      * We'll gather the data at runtime. This should not cause unnecessary load on pulp
    * Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
      * No
    * Is this metric Personally-Identifiable-Data?
      * Yes, the hostnames, so they need to be redacted
* discussion about kinds-of-data
  * what if a post fails?
    * give up, send it tomorrow
  * api call-periodicity?
  * api call-sequences?
* should be a standard way for a user to request all their data be removed from the public data store
* can there be a standard test-sequence that investigates metric results for "known PII problems" and fails a metric if/as it finds something?

### AIs

* [bmbouters] make POC race-condition-free, post data, have a read-UI
* [all]: What do we want to focus on in the following 30-min mtgs?
  * example: how do we develop metrics and test them?
  * example: how do we let plugins report?
  * example: let's talk about status API
* [ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions

### Links

## 2022-01-14

### Attendees: bmbouter, ttereshc, ipanova, dkliban, ggainey

### Prev AIs

* [ggainey] establish contact with Carl Trieloff RE OpenShift data gathering [gchat]
  * no progress to report
* [bmbouter] talk about budget and direct costs with management
  * "it's fine, but be selective about which provider we choose"
* [ttereshc] talk to lzap about Foreman telemetry
  * done, largely concerned with performance-monitoring
  * do we want to collect performance data? or just usage?
  * what other Red Hat telemetry services exist that we may want to integrate with?
  * see ttereshc's email for more detail ("Foreman Telemetry")

### Agenda

* next mtg 20-JAN, 1 hr, then switch to 30 min
* how is a UUID generated?
  * per-pulp-system
    * i.e., one UUID per clustered pulp
    * "one UUID per database"
* how/where will it be stored?
  * in db - if it doesn't exist, create one
    * create as a migration
    * if it is in the db, use it
  * would survive across restores/rebuilds
  * multi-node installs/clusters
    * same uuid, multiple nodes reporting - can we tell multi-machine architectures?
* how are we going to periodically post?
  * single-node is 'easy'
  * clusters
    * not a separate call-home service
    * periodic pulp-task posting
    * everyone puts data into the db (somewhere), someone reports it up
  * sanitizing data? - leave for "what do we report" later
  * "how often?" - performance data probably needs to be gathered more often, for example
  * "how often do we write into the db?"
    * write at service startup?
    * what about heartbeats?
    * feature-use needs to happen more often?
    * gather use-data from existing tables
  * How do we do a daily task?
    * wsgi, distributed lock, dispatch task, record last-update
    * wsgi heartbeat, check against last-dispatch, at correct interval start a new one
    * database transaction to force ordering?
    * even if it's possible for a task to dispatch and yet fail to call home - it's ok
* what kind of data is our focus?
  * what versions of pulp are installed?
  * what's "a typical pulp instance"?
    * clustered vs not
  * do we gather hardware info? (memory, disk usage, cpus?)
  * what about feature-*usage* data?
  * configuration - i.e., content of pulp/settings.py?
    * ONLY NON-SENSITIVE DATA
    * definitely need to think hard about how to sanitize
  * monitoring data?
    * not a primary objective
    * let's not shut the door on future opportunity
    * monitoring wants UNsanitized data in order to be actionable
* what's at least one service we can POC against?
  * cloudflare, amazon, etc
  * bmbouters chooses Cloudflare - it uses Free Starter Account! It's Super-Effective!
  * specific cost ballpark - $50-100/month at initial start, possibly growing as we learn how much data and storage we need
* how can we provide full choice to users to opt out/opt in?

### AIs

* [ggainey] establish contact with Carl Trieloff RE OpenShift data gathering [gchat]
* [bmbouter] POC against Cloudflare
  * migration that creates UUID
  * create CF account
  * have periodic wsgi that posts UUID
  * post progress to discourse

### Links

## 2022-01-06

### Attendees: wibbit, ttereshc, dkliban, bmbouters, ggainey, ppicka, ipanova

* first 2/3 mtgs, 1 hr - then shorten to 30 min, less often

#### what do we want from today?

* set goals
* where is the data going to go?
  * focus on base infrastructure first, then "what data collected and how"
* process for how to change/mutate/morph the kinds of data being collected
* timeline possibility:
  * base infra posted by end-of-January?
    * uuid/one piece of data gathered and sent "somewhere"
  * maybe not have a date attached? just work on POC?
    * maybe just post Goal, and not worry about Date
  * focus on base infra and where data will go as POC, data details come Later
* example of a telemetry operation in production use: https://www.home-assistant.io/integrations/analytics
  * uses Cloudflare to store data
* don't forget about GDPR (and friends) laws
* what do other projects use?
  * OpenShift - need to talk to Other Folks
    * AI: establish contact with them?
  * What about Foreman?
    * lzap driving?
    * AI: talk to lzap
  * Fedora? crash reports, installation?
    * Firefox addon may do this?
    * may need some digging, does Fedora still do this?
* talk to Red Hat around direct cost of supporting such a service
  * AI: [bmbouters] talk to rchan
* wibbit: where does data go?
  * assuming data is sufficiently anonymized to be made public?
    * yes please
      * keeps us honest about anonymizing
      * enhances trust/transparency
  * cost of distribution/access to the data from the public
    * data-outflow vs data-ingress costs
* wibbit: enterprise env can be draconic around security
  * infra needs to support multiple pulp instances hitting a single internal proxy that is the single point of contact to the telemetry service?
  * two requirements
    * clear docs on details of how data posts
    * proxy support
* wibbit: data needs to be staged/stageable locally prior to being submitted
  * submit-queue that can be paused/investigated
  * bmbouter: adds to better user knowledge/transparency, good idea
  * wibbit: allows for admin-internal consumption
  * dkliban: would help manage multi-pulp installations
* wibbit: Real People didn't raise any major concerns, beyond "we need to know what's being uploaded"
* wibbit: do we need a consistent UUID over time?
  * need to be able to identify across upgrades
  * change-over-time is really important
* bmbouter: feature should default to on
  * ipanova: already a long talk in foreman-land on this, see discourse
  * wibbit: dflt-to-on is ok
    * assumption is admins know what they're doing
    * would lose any temporal-system info if dflt-to-off
  * caveat: dflt-on for new install vs upgrade?
    * when introduced to an existing system, it is qualitatively diff than a new install
  * let's discuss how to do this "**very** transparently and loudly"
  * where will this flag exist?

#### what do we want by next week?

* AIs
  * [ggainey] establish contact with Carl Trieloff RE OpenShift data gathering [gchat]
  * [bmbouter] talk about budget and direct costs with management
  * talk to lzap about Foreman telemetry
* Things for next week's agenda:
  * how is a UUID generated?
  * how/where will it be stored?
  * how are we going to periodically post?
  * what's at least one service we can POC against?
    * cloudflare, amazon, etc
* first three mtgs will be one hour
  * going forward, 30 min on the half-hour

#### Links

* https://discourse.pulpproject.org/t/proposal-telemetry/259/2
* https://www.home-assistant.io/integrations/analytics#data-storage--processing
* https://www.cloudflare.com/products/workers-kv/
* https://www.home-assistant.io/integrations/analytics
* https://community.theforeman.org/t/foreman-telemetry-api-for-developers/26409

###### tags: `Telemetry`