# Analytics Working Group
## YYYY-MM-DD
### Attendees:
### Prev AIs
### Agenda
### AIs
### Links
## Ongoing Notes
first three mtgs will be one hour
going forward, 30 min on the half-hour
## List of Questions for Every Metric To Be Gathered
What question will this help us answer?
What is a specific example of the data to be gathered?
How will this metric be stored in the database or gathered at runtime?
Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
Is this metric Personally-Identifiable-Data?
What pulpcore version will this be collected with?
Is this approved/not-approved?
## Analytics-proposal Template
# Title
## What question will this help us answer?
## What is a specific example of the data to be gathered?
## How will this metric be stored in the database or gathered at runtime?
## Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
## Is this metric Personally-Identifiable-Data?
### How can we sanitize this output?
## What pulpcore version will this be collected with?
## Alternative proposal(s)
### Option 1
### Option N
## Discussion notes
## Is this approved/not-approved?
## Parking Lot for potential future/RFE work
###### tags: `Analytics`
## Open Questions
Do we want to compute processes / host also?
## 2022-12-01
### Attendees:
bmbouter ppicka mdellweg dkliban wibbit ggainey
### Agenda
Determined the last regularly scheduled meeting, and followup meetings will happen as-needed
To finalize the tech-debt, we should work on these two issues:
## 2022-10-20
### Attendees:
bmbouter ppicka mdellweg dkliban wibbit ggainey
### Agenda
Here’s a new set of graphs to look at accepting from @mdellweg
Here’s a proposal to collect, summarize, and visualize the PostgreSQL version, which would be a new metric. This is going to be the “live coding” part that I do at Pulpcon to add it.
https://hackmd.io/zJ1dJe8qQtmzr0JiM1jptw
discussion around "how do we want to summarize"
e.g., is X.Y.Z really interesting?
We want to summarize "versions that matter"
side discussion: format/organization of main visualization page would be A Good Thing
FYI: lots of new docs here, including importing data from the production site
Should we be limiting summaries to only systems with at least 2 checkins?
"yes please" is the consensus
Proposal: Add a “summarization” and “visualization” sections to the “proposal template”
## 2022-08-25
### Attendees:
ppicka, ggainey, bmbouter
### Agenda
Interesting resources shared with the group from Mozilla's telemetry groups
Updates
Next Steps
bmbouter to fix whatever the issue is with summarization
bmbouter to add plugin documentation on the processes and checklists this group currently has in hackmds
bmbouter to add documentation on how to create the local dev environment
Future meetings
Telemetry working group will meet next week, and maybe the week after to finalize some process things and celebrate
After that telemetry working group will suspend for at least 6 weeks
Working group will resume as new proposals for metrics are proposed
## 2022-08-18
### Attendees:
ggainey, dkliban, bmbouters, ipanova, ppicka
### Prev AIs
### Agenda
progress made on finalizing POC
demo time!
proposal: have "summarizer" delete old content (rather than replace)
proposal: have "summarizer" only delete data older-than some window (2 weeks?)
### AIs
bmbouter to take up the proposals above
add X.Y graph for each component
next steps:
### Links
## 2022-08-11
### Attendees:
ggainey, dkliban, ppicka, bmbouters, ipanova, wibbit
### Prev AIs
### Agenda
discussion around https://github.com/pulp/pulpcore/pull/3032
def a good idea, prob want this backported to 3.20
progress update
lots of progress being made, not baked yet
lots of interaction w/ duck@osci
analytics.pulpproject.org has 2 branches, main and dev
auto-deploys to 2 diff OSCI deployments
both use LetsEncrypt TLS
web-process pod, postgres backend
django-admin enabled for superuser controls
modification to how payloads are defined
consolidates client and server definitions of payload
using Google's "Protocol Buffer" approach (q.v.)
what about version mismatches?
ProtocolBuffer is Opinionated - follow their requirements
next steps
charting
summaries
manage.py cmd, to be called by openshift cron every 24 hrs
data expiry
### AIs
bmbouters hoping for a tech demo next mtg
### Links
## 2022-07-21
### Attendees:
## 2022-07-14
### Attendees:
bmbouters, dkliban, ipanova, ppicka, ggainey
Current State
PROBLEMS
summarization isn't working, investigation isn't getting us past whatever the problem is
server-side-code pagination isn't working
DNS for analytics.pulpproject.org would require handing all of pulpproject.org over to Cloudflare
reverse-proxy is possible, POC works but is … suboptimal
OSCI asking why we're not just running this on their openshift instance/platform
PROPOSAL
discussion ensues
reliability/availability? visibility into admin/monitoring?
health probe/autorestart-pod should work
proposal: openapi work to auto-generate client/server side of this
makes available to other projects who might want to do this
2022-06-16 Attendees: ppicka, bmbouter, ipanova, douglas
currently pulpcore will post only to the dev site, and only if the user has a .dev installation
some users could have .dev
### Action Items
## 2022-05-26
### Attendees:
ppicka, bmbouter, ipanova, dkliban, douglas
In summarizing numbers, in addition to the mean, do we want max and min also?
Is it time to sign up for the $5 / month plan?
How do we make the versions graph not so complicated?
Keep the raw data including the z-version, but also make a graph that aggregates all Z versions into totals and show that
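The "aggregate all Z versions into X.Y totals" idea can be sketched in Python. This is a minimal sketch; the input dict of full-version counts is a hypothetical stand-in for whatever shape the summarizer actually stores.

```python
from collections import Counter

def aggregate_xy(version_counts):
    """Collapse full X.Y.Z version counts into X.Y totals for a simpler graph."""
    totals = Counter()
    for version, count in version_counts.items():
        x_y = ".".join(version.split(".")[:2])  # e.g. "3.18.5" -> "3.18"
        totals[x_y] += count
    return dict(totals)

# Hypothetical raw counts as checkin data might report them:
raw = {"3.18.5": 4, "3.18.6": 6, "3.20.0": 3}
```

The raw data keeps the Z-version detail; the graph built from `aggregate_xy(raw)` shows only the x.y totals.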
### Action Items
[bmbouter] Make a graph that aggregates all Z versions into totals and show that x.y counts
[bmbouter] Revise telemetry PoC to only have it post dev data
[bmbouter] Check in with RH about them enabling the pay-plan
## 2022-04-07
### Attendees:
ppicka, bmbouter, ipanova, dkliban, ggainey, douglas
quick review of the graphs with the status data
duplicate data submission
expiration_time: 30 days
there should only be one data point from each system because the key is the systemID
KV - data format
{SystemID: {all_the_data, ...}}
summarization process
only considers the latest data points posted in the last 24 hours
Are users allowed to download the raw data?
No because we're telling users that their raw data is only ever retained for 30 days
Are users allowed to download the summary data?
The public analytics site will provide the data, we may allow for downloading of the summarized data later
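The summarization rule above (only the latest data point per systemID, and only within the last 24 hours) can be sketched as follows; the tuple fields are illustrative, not the real checkin schema.

```python
from datetime import datetime, timedelta

def latest_per_system(posts, now, window=timedelta(hours=24)):
    """Keep only the newest post from each system_id inside the summarization window.

    posts: iterable of (system_id, posted_at, data) tuples (illustrative shape).
    """
    latest = {}
    for system_id, posted_at, data in posts:
        if now - posted_at > window:
            continue  # older than the 24-hour window: ignore
        if system_id not in latest or posted_at > latest[system_id][0]:
            latest[system_id] = (posted_at, data)
    return {sid: data for sid, (_, data) in latest.items()}
```

Because the KV store is keyed by systemID, duplicate submissions from one system collapse to a single point either way; this function makes the same guarantee explicit for the daily summarizer.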
how to disable this for dev installs
have a dev URL and analytics site and a production URL and analytics site
if pulpcore ends in .dev submit to the dev site otherwise the production site
similar to what home assistant does
First implementation not planning to handle proxy configs
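The dev-vs-production routing above amounts to a suffix check on the pulpcore version string. A minimal sketch, with hypothetical URLs (the notes only say a dev site and a production site exist):

```python
# Hypothetical endpoints; stand-ins for the real dev and production analytics sites.
DEV_URL = "https://dev.analytics.example/post"
PROD_URL = "https://analytics.example/post"

def target_url(pulpcore_version):
    """Route .dev installs to the dev analytics site, everything else to production."""
    if pulpcore_version.endswith(".dev"):
        return DEV_URL
    return PROD_URL
```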
## 2022-03-24
### Attendees:
ppicka, bmbouter, ipanova, dkliban
Will we share the raw data, or just the summarized data?
We'll provide just the summaries publicly
See the graphs to be produced at the bottom of the https://hackmd.io/@pulp/telemetry_status document
Proposal: summarize daily and include only 1 data point from each systemID
## 2022-03-17
### Attendees:
ppicka, dfurlong, ggainey, bmbouters, ipanova
bmbouter revised POC and demo
Thoughts
how/where do we log outgoing info?
into logs? what level?
into task progress-report?
into sep file?
needs discussion
what's a good TTL for data sent to CloudFlare?
cloudflare docs : https://api.cloudflare.com/#custom-hostname-for-a-zone-custom-hostname-details
HomeAssistant has cloudflare-side worker-code receiving data
How do we build/maintain summary info?
What if we send as "uuid-timestamp": "data"?
details are important - but at a high level, what aggregate/historical data are we actually interested in keeping?
"What question are we answering" needs an additional "How are we going to visualize that information?"
keep in mind the difference between "monitoring" and "telemetry"
AI for all: what kinds of ways would we like to summarize/display/graph the existing data proposal ("status")
## 2022-02-03
### Attendees:
ppicka, dfurlong, dkliban, ggainey, bmbouters, ipanova
### Prev AIs
### Agenda
review /status/ writeup
alternative proposal approved
[all]: What do we want to focus on in the following 30-min mtgs?
example: how do we develop metrics and test them?
example: how do we let plugins report?
example: let's talk about status API
### AIs
[ggainey] hackmd to list "things we might want telemetry proposals for", send link to list
[ggainey] update telemetry-proposal template to include "discussion", "alternative proposal", "RFE suggestions arising from discussion" sections
### Links
## 2022-01-27
### Attendees:
### Prev AIs
[bmbouters] make POC race-condition-free, post data, have a read-UI
[all]: What do we want to focus on in the following 30-min mtgs?
example: how do we develop metrics and test them?
example: how do we let plugins report?
example: let's talk about status API
[ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
### Agenda
ggainey to report on anything from OCP Telemetry discussions
### AIs
### Links
## 2022-01-20
### Attendees:
bmbouters, dkliban, dfurlong, ppicka, ipanova, ggainey
Last 1-hr mtg
future mtgs 30 min at the half-hour
### Prev AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
contact made, pointers received, email dispatched
[bmbouter] POC against Cloudflare
migration that creates UUID
create CF account
have periodic wsgi that posts UUID
post progress to discourse
### Agenda
discussion about POC
discussion around implications of adding tasking-subsystem to Pulp3
signed up for Cloudflare k/v account (pulp-infra@ email)
something is "not right yet" - #soon
bmbouter to engage CF Discourse
What are all the ways we could communicate this transparency to users?
How do we make it Really Easy for user to know what's happening and opt-out?
docs, release notes, discourse announcement
social media (tweet, etc)
youtube demo
work w/ mcorr RE social-media
log at start up that telemetry reporting is enabled and refer to a setting which should be changed to disable it
really important for the Users Who Don't Read Anything
log every time telemetry is sent
homeassist does this here
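The two logging behaviors above (announce at startup that reporting is on and point at the disabling setting, then log every send) can be sketched like this; the logger name and the setting name are assumptions, not Pulp's real ones.

```python
import logging

logger = logging.getLogger("pulp.analytics")

ANALYTICS_ENABLED = True  # hypothetical stand-in for the real Pulp setting

def log_startup():
    """At startup, tell Users Who Don't Read Anything that telemetry is on."""
    if ANALYTICS_ENABLED:
        logger.info(
            "Telemetry reporting is enabled; set ANALYTICS_ENABLED=False to disable it."
        )

def log_post(payload_summary):
    """Log every time telemetry is sent, so admins can see exactly what left the system."""
    logger.info("Telemetry payload sent: %s", payload_summary)
```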
is periodicity configurable?
"keep simple things simple" - hardcoded
KISS - keep it simple stupid
how often is "often enough"?
what's the most-reasonable time interval, to most users?
once/day
can user control "when during the day" it happens?
think about network-security-rules?
at initial-migration-time, dispatch "soon" post-setup
30 min post-migrations-run (let pulp-install settle down)
questions about performance (cpu/memory/etc)
contact operate-first group
performance/monitoring is separate from telemetry
but a still really-useful thing to be doing!
[dfurlong] memory-use/performance changes over time is really useful
being able to easily-deliver monitoring results back to pulp from users would be great
What is the list of questions we want to ask for each metric
metric-acceptance discussion needs to be "somewhere permanent"
should be a public checklist for answering these questions
example: "How we decide if something is PII and how can it be sanitized"
should be able to connect a specific metric to the exact commit when it entered the codebase
what happens if/when an API being used to collect telemetry, changes what is delivered?
what if PII gets added (e.g.)
need to have a data-audit process in place
an example:
the data reported from the /status/ API
What question will this help us answer?
How many workers are users running?
What plugins do they run?
What is a specific example of the data to be gathered?
How will this metric be stored in the database or gathered at runtime?
We'll gather the data at runtime. This should not cause unnecessary load on Pulp
Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?
Is this metric Personally-Identifiable-Data?
Yes the hostnames, so it needs to be redacted
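A redaction pass like the one called for above could look like this. The field names (`online_workers`, `online_content_apps`, `name`) are modeled on a /status/-style payload but should be treated as assumptions about the exact schema.

```python
def sanitize_status(status):
    """Drop hostname-bearing fields from worker/app records before the payload is sent."""
    cleaned = dict(status)
    for key in ("online_workers", "online_content_apps"):
        if key in cleaned:
            cleaned[key] = [
                # 'name' is assumed to hold the hostname-bearing identifier
                {k: v for k, v in entry.items() if k != "name"}
                for entry in cleaned[key]
            ]
    return cleaned
```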
discussion about kinds-of-data
what if post fails
give up, send it tomorrow
api call-periodicity?
api call-sequences?
should be a standard way for a user to request all their data be removed from the public data store
can there be a standard test-sequence that investigates metric results for "known PII problems" and fails a metric if/as it finds something?
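A standard "known PII problems" check could start as a pattern scan over the serialized payload; the patterns below are illustrative examples (emails and hostname-looking strings), not a complete PII taxonomy.

```python
import re

# Hypothetical patterns for "known PII problems": emails and hostname-ish strings.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),             # email addresses
    re.compile(r"\b[\w-]+\.(?:example|com|org|net)\b"),  # hostname-like strings
]

def find_pii(payload_text):
    """Return all substrings of a serialized payload that match a known PII pattern.

    A metric's test suite could fail the metric whenever this returns anything.
    """
    hits = []
    for pattern in PII_PATTERNS:
        hits.extend(pattern.findall(payload_text))
    return hits
```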
### AIs
[bmbouters] make POC race-condition-free, post data, have a read-UI
[all]: What do we want to focus on in the following 30-min mtgs?
example: how do we develop metrics and test them?
example: how do we let plugins report?
example: let's talk about status API
[ggainey]: write up "results of pulp /status/ API" as a formal presentation of a metric to the Telemetry Group, answering The List Of Questions
### Links
## 2022-01-14
### Attendees:
bmbouter, ttereshc, ipanova, dkliban, ggainey
### Prev AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
[bmbouter] talk about budget and direct costs with management
"it's fine, but be selective about which provider we choose"
[ttereshc] talk to lzap about Foreman telemetry
done, largely concerned with performance-monitoring
do we want to collect performance data? or just usage?
what other Red Hat telemetry services exist that we may want to integrate with/to?
see ttereshc's email for more detail ("Foreman Telemetry")
### Agenda
next mtg 20-JAN, 1 hr, then switch to 30 min
how is a UUID generated?
per-pulp-system
ie, one UUID per-clustered-pulp
"one UUID per-database"
how/where will it be stored?
in db - if it doesn't exist, create one
if it is in the db, use it
would survive across restores/rebuilds
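The "one UUID per-database, create on first use" scheme above can be sketched as a get-or-create. Here a module-level dict stands in for the single-row table a migration would create; in Pulp this would of course be a Django model, not a dict.

```python
import uuid

# In-memory stand-in for the database row that the notes say a migration creates.
_store = {}

def get_system_id():
    """Return the per-database system UUID, creating and persisting it on first use."""
    if "system_id" not in _store:
        _store["system_id"] = str(uuid.uuid4())  # create once, then reuse forever
    return _store["system_id"]
```

Because the value lives in the database, it is stable across restarts, upgrades, and restores, and all nodes of a clustered install share the same id.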
multi-node installs/clusters
same uuid, multiple nodes reporting - can we tell multi-machine architectures?
how are we going to periodically post?
single-node is 'easy'
clusters
not a separate call-home service
periodic pulp-task-posting
everyone puts data into db (somewhere), someone reports it up
sanitizing data? - lv for "what do we report" later
"how often" - performance data prob needs to be gathered more often, for example
"how often do we write into the db?"
write at service-startup?
what about heartbeats?
feature-use needs to happen more often?
gather use-data from existing tables
How do we do a daily task?
wsgi, distributed-lock, dispatch task, record last-update
wsgi heartbeat, check against last-dispatch, at correct interval start a new one
database-xact to force ordering?
even if it's poss for task to dispatch and yet fail to call home - it's ok
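The timing half of the wsgi-heartbeat scheme above can be sketched as a pure check; in a real deployment `last_dispatch` would live in the database and the check would run under a distributed lock / database transaction so only one process dispatches.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(days=1)  # hardcoded once/day, per the KISS discussion

def should_dispatch(last_dispatch, now):
    """Heartbeat check: dispatch a new telemetry task only if the interval has elapsed.

    last_dispatch is None on a fresh install, otherwise the recorded last-update time.
    """
    return last_dispatch is None or now - last_dispatch >= INTERVAL
```

If the task dispatches but the post itself fails, nothing retries until the next interval, which matches the "even if it fails to call home, it's ok" position above.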
what kind-of data is our focus?
what versions of pulp are installed?
what's "a typical pulp instance"?
clustered vs not
do we gather hardware info? (memory, disk usage, cpus?)
what about feature-usage data?
configuration - ie, content of pulp/settings.py?
ONLY NON-SENSITIVE DATA
def need to think hard about how to sanitize
monitoring data?
not a primary objective
let's not shut the door on it for future opportunity
monitoring wants UNsanitized data in order to be actionable
what's at least one service we can POC against?
cloudflare, amazon, etc
bmbouters chooses Cloudflare - it uses Free Starter Account! It's Super-Effective!
specific cost ballpark - $50-100/month at initial start, poss growing as we learn how much data and storage
how can we provide full-choice to users to opt-out/opt-in
### AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
[bmbouter] POC against Cloudflare
migration that creates UUID
create CF account
have periodic wsgi that posts UUID
post progress to discourse
### Links
## 2022-01-06
### Attendees:
wibbit, ttereshc, dkliban, bmbouters, ggainey, ppicka, ipanova
first 2/3 mtgs, 1 hr - then shorten to 30, less often
what do we want from today?
set goals
where is the data going to go?
focus on base infrastructure first, then "what data collected and how"
process for how to change/mutate/morph the kinds-of data being collected
timeline possibility:
base infra posted by end-of-January?
uuid/one-piece-of-data gathered and sent "somewhere"
maybe not have a date attached? just work on POC?
maybe just post Goal, and not worry about Date
focus on base-infra and where data will go as POC, data-details come Later
example of a telemetry operation in production use : https://www.home-assistant.io/integrations/analytics
uses CloudFlare to store data
don't forget about GDPR (and friends) laws
what do other projects use?
OpenShift - need to talk to Other Folks
AI: establish contact with them?
What about Foreman?
lzap driving?
AI: talk to lzap
Fedora? crash reports, installation?
Firefox addon may do this?
may need some digging, does Fedora still do this?
talk to Red Hat around direct-cost of supporting such a service
AI: [bmbouters] talk to rchan
wibbit: where does data go
assuming data is sufficiently anonymized to be made public?
yes please
keeps us honest about anonymizing
enhances trust/transparency
cost of distribution/access to the data from the public
data-outflow vs data-ingress costs
wibbit: enterprise env can be draconic around security
infra needs to support multiple pulp-instances hitting a single internal proxy that is the single point-of-contact to the telemetry service?
two requirements
clear docs on details of how data posts
proxy support
wibbit: data needs to be staged/stageable locally prior to being submitted
submit-queue that can be paused/investigated
bmbouter: adds to better user-knowledge/transparency, good idea
wibbit: allows for admin-internal-consumption
dkliban: would help manage multi-pulp-installation
wibbit: Real People didn't raise any major concerns, beyond "we need to know what's being uploaded"
wibbit: do we need a consistent UUID over time?
need to be able to identify across upgrades
change-over-time is really important
bmbouter: feature should default-to-on
ipanova: already long talk in foreman-land on this, see discourse
wibbit: dflt-to-on is ok
assumption is admins know what they're doing
would lose any temporal-system info if dflt-to-off
caveat: dflt-on for new-install vs upgrade?
when-introduced, to an existing system, is qualitatively diff than new-install
let's discuss how to do this " very transparently and loudly"
where will this flag exist?
what do want by next week?
### AIs
[ggainey] establish contact with Carl Trieoff RE OpenShift data gathering [gchat]
[bmbouter] talk about budget and direct costs with management
talk to lzap about Foreman telemetry
Things for next week's agenda:
how is a UUID generated?
how/where will it be stored?
how are we going to periodically post?
what's at least one service we can POC against?
### Links