Proposal: gather /status/ telemetry

What question will this help us answer?

The general use is to get a feel for what a "typical" Pulp3 installation looks like.

The most important question being answered is "What specific plugin-versions are in use?" This will start giving us insight into our community's upgrade pace, and let us gauge plugin popularity.

We'll be able to start gauging what a typical/standard/median "size" of a Pulp3 installation is, by learning:

How many content-apps are in use?
How many hosts do these content-apps run on?
How many workers are in use?
How many hosts do the workers run on?
How much disk is available?

Finally, we'll learn how important "redis" is to the community, since /status/ reports on whether it is in use or not.

What is a specific example of the data to be gathered?

$ http :/pulp/api/v3/status/
{
    "database_connection": {
        "connected": true
    },
    "online_content_apps": [
        {
            "last_heartbeat": "2022-01-26T18:58:36.372233Z",
            "name": "49030@pulp3-source-fedora34.padre-fedora.example.com"
        },
        {
            "last_heartbeat": "2022-01-26T18:58:36.532093Z",
            "name": "49031@pulp3-source-fedora34.padre-fedora.example.com"
        }
    ],
    "online_workers": [
        {
            "current_task": null,
            "last_heartbeat": "2022-01-26T18:58:39.618259Z",
            "name": "49007@pulp3-source-fedora34.padre-fedora.example.com",
            "pulp_created": "2022-01-26T15:17:47.865912Z",
            "pulp_href": "/pulp/api/v3/workers/202beb44-4c54-48d9-a3f1-c671c406310e/"
        },
        {
            "current_task": null,
            "last_heartbeat": "2022-01-26T18:58:39.655667Z",
            "name": "49017@pulp3-source-fedora34.padre-fedora.example.com",
            "pulp_created": "2022-01-26T15:17:48.479940Z",
            "pulp_href": "/pulp/api/v3/workers/4d3bfd4c-1bd2-4930-8c7b-15e800bec3e0/"
        }
    ],
    "redis_connection": {
        "connected": true
    },
    "storage": {
        "free": 31851548672,
        "total": 42006183936,
        "used": 7990427648
    },
    "versions": [
        {
            "component": "core",
            "version": "3.18.0.dev"
        },
        {
            "component": "file",
            "version": "1.11.0.dev"
        },
        {
            "component": "rpm",
            "version": "3.18.0.dev"
        },
        {
            "component": "container",
            "version": "2.11.0.dev"
        },
        {
            "component": "deb",
            "version": "2.18.0.dev"
        },
        {
            "component": "certguard",
            "version": "1.6.0.dev"
        },
        {
            "component": "pulp_2to3_migration",
            "version": "0.16.0.dev"
        }
    ]
}

How will this metric be stored (in the database or gathered at runtime)?

Gathered directly from the /status/ endpoint at telemetry-send-time.

Will the gathering and/or storage of this cause unacceptable burden/load on Pulp?

No - /status/ is a very low-impact API.

Is this metric Personally-Identifiable-Data?

YES - the "name" fields in online_content_apps and online_workers include machine-names, which can carry personally-identifiable data. These will need to be sanitized.

How can we sanitize this output?

remove the "name" field (it doesn't teach us anything)
replace "name": <actual-process-id>@<actual-host> with "name": PID@HOST
replace "name": <actual-process-id>@<actual-host> with "name": PID@sha256(actual-host-name)
- this would let us track number-of-unique-hosts, without knowing the hostname
record only number-of-processes/number-of-hosts
- requires a little more processing
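As a rough illustration of the sha256 option above, here is a minimal Python sketch. The helper names `sanitize_name` and `sanitize_entries` are hypothetical (they are not part of pulpcore), and the sketch assumes every "name" has the `<process-id>@<host>` shape shown in the example /status/ output.

```python
import hashlib


def sanitize_name(name: str) -> str:
    """Return 'PID@<sha256 of host>' for a '<process-id>@<host>' name.

    Hypothetical helper. If the name contains no '@', the whole string is
    treated as the host so that nothing is passed through unhashed.
    """
    _, sep, host = name.partition("@")
    if not sep:
        host = name
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return f"PID@{digest}"


def sanitize_entries(entries):
    """Return copies of /status/ entries with only the 'name' field replaced."""
    return [{**entry, "name": sanitize_name(entry["name"])} for entry in entries]
```

Two processes on the same host end up with the same sanitized name, so the number of unique hosts can still be counted without the hostname itself being sent.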

What pulpcore version will this be collected with?

3.19

Discussion

keeping unique-host-info vs counts
- can we tell plugins-per-host? is it useful? is it even possible?
- currently don't/can't do this
  - would allow scaling-control
what about artifact/content "sizes"?
- yes! needs its own proposal - volunteers?

Is this approved/not-approved?

accept alternative proposal:
- Aye: 6
- Nay: 0

Alternative Proposal

Lose db/redis info
- "on" isn't useful - "version" is
- should be their own telemetry
Lose storage
- not very useful
- when connected to object-storage, not very useful
- should also be its own telemetry option
Summary info
- change to count process and hosts
  - see better questions at top, sanitization section

Proposed alternate telemetry data

{
    "online_content_apps": {
        "processes": 2,
        "hosts": 1
    },
    "online_workers": {
        "processes": 2,
        "hosts": 1
    },
    "versions": [
        {
            "component": "core",
            "version": "3.18.0.dev"
        },
        {
            "component": "file",
            "version": "1.11.0.dev"
        },
        {
            "component": "rpm",
            "version": "3.18.0.dev"
        },
        {
            "component": "container",
            "version": "2.11.0.dev"
        },
        {
            "component": "deb",
            "version": "2.18.0.dev"
        },
        {
            "component": "certguard",
            "version": "1.6.0.dev"
        },
        {
            "component": "pulp_2to3_migration",
            "version": "0.16.0.dev"
        }
    ]
}
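
One way the summary payload above could be derived from a /status/ response is sketched below. The function names `count_processes_and_hosts` and `summarize_status` are hypothetical, and the sketch assumes that each "name" is `<process-id>@<host>` and that the response has already been parsed into a Python dict.

```python
def count_processes_and_hosts(entries):
    """Count entries and the distinct hosts found after the '@' in each name."""
    hosts = {entry["name"].partition("@")[2] for entry in entries}
    return {"processes": len(entries), "hosts": len(hosts)}


def summarize_status(status):
    """Build the reduced telemetry payload from a parsed /status/ response.

    Database, redis, and storage fields are intentionally left out; only
    process/host counts and component versions are kept.
    """
    return {
        "online_content_apps": count_processes_and_hosts(
            status.get("online_content_apps", [])
        ),
        "online_workers": count_processes_and_hosts(
            status.get("online_workers", [])
        ),
        "versions": status.get("versions", []),
    }
```

Because only counts leave the installation, no hostname (hashed or otherwise) needs to be transmitted.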

Parking Lot for potential future/RFE work

can we tell plugins-per-host? is it useful? is it even possible?
- currently don't/can't do this
  - would allow scaling-control
Determining clusters solutions
- Pulp instances that are scaled out horizontally: how could that be visualised (is it given away in the status?)
- Unique Pulp instances that form part of a "cluster" from a client's perspective (does that matter? probably not)

tags: `Telemetry`

Graphs to be produced

How many unique systems are there?
- represent as a line graph over time (count of total unique systems)
Versions per component
- For each component, e.g. rpm, certguard, pulpcore
  - use a pie chart to show the version distribution for that component
Bar graph reporting the number of users per component
- Regardless of version
- how to distinguish whether this is a single container installation?
Average hosts
- online_content_app hosts summarized into a single average, and graphed as a timeseries
- online_workers hosts summarized into a single average, and graphed as a timeseries
Average processes
- same as above, only for processes
Average processes / host
- same as above, only for average processes / host
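
The three averages could be computed from a batch of received payloads roughly as follows. This is only a sketch: `average_metrics` is a hypothetical name, the payloads are assumed to follow the proposed alternate telemetry format, and "average processes / host" is interpreted here as the mean of each installation's own processes-per-host ratio, which is one of several possible readings.

```python
from statistics import mean


def average_metrics(payloads, key):
    """Average hosts, processes, and processes-per-host for one payload key.

    `key` is "online_content_apps" or "online_workers". Returns None for a
    metric when there is no data to average.
    """
    hosts = [p[key]["hosts"] for p in payloads]
    processes = [p[key]["processes"] for p in payloads]
    # Skip installations reporting zero hosts to avoid dividing by zero.
    per_host = [
        p[key]["processes"] / p[key]["hosts"]
        for p in payloads
        if p[key]["hosts"] > 0
    ]
    return {
        "avg_hosts": mean(hosts) if hosts else None,
        "avg_processes": mean(processes) if processes else None,
        "avg_processes_per_host": mean(per_host) if per_host else None,
    }
```

Running this once per reporting period for each key would give the points of the timeseries described above.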