Lukas Kalbertodt
# Basic Statistics in Opencast

This document proposes a system that will allow Opencast to gather and report basic statistics, e.g. video view counts. While counting seems like one of the simplest things you can do with software, designing a system that is somewhat resilient, respects users' privacy, and complies with laws is far from trivial.

## Terminology

This document uses the term "event" in its typical English meaning, i.e. an occurrence, something that happens at some point in time. It uses the term "video" to refer to a video in Opencast (often called "event", "episode", or "media package").

## Goals and Non-Goals

### "Basic"

We are not trying to build a full-blown analytics solution; for example, we are not trying to replace Matomo. What we want to provide is simple statistical data that users would expect from a video management system, most notably "view counts" (i.e. how often a video was viewed) and watchtime (total time spent watching a video). A timeline heatmap (what parts of a video were watched most often) is also a feature we want to offer eventually. YouTube, for example, also reports all these metrics.

### Stored in Opencast

Statistical data has to be stored and handled by Opencast, as it is the central hub for all video interactions. If an LMS or Tobira gathered statistics on its own, this data would be skewed, as views/interactions on other platforms wouldn't be counted.

### GDPR compliant without consent banners

Of course Opencast needs to comply with laws like GDPR, ePrivacy, and others. But we want to go a step further and design our system in a way that does not require a cookie/consent banner, for various reasons:

- _Bad UX_: it annoys the user.
- _Data minimalism & user privacy_: not invading a user's privacy is simply the right thing to do. Only collect what we absolutely need to offer a good service.
- _Brand credibility_: Opencast as a self-hostable open source project stands for respecting privacy and data ownership.
- _Better statistics_: a consent banner implies that a user can reject (or ignore) it, meaning that those users are lost in the statistics. Further, privacy-invasive tracking in Opencast might get blocked by browsers or browser extensions, resulting in even more users being unaccounted for. Rejecting cookies or blocking tracking is more common in certain groups (e.g. computer science students) than in others, meaning that statistical data is not even comparable between different videos. By using a privacy-respecting solution without need for consent, we can capture virtually all user interactions.

### Abuse protection

The internet is full of bad actors, in ever increasing numbers, in large part thanks to AI. The bad news is that we cannot protect against all of them: a sufficiently motivated attacker will find a way to overload the system or influence statistical data. What this system aims to do is use reasonable measures to protect against most of the simpler abuses, and to have a reasonable and mostly meaningful definition of a "view".

## Legal status of this proposal

The general techniques and ideas used in this proposed solution have been discussed with our lawyer at elan e.V. and have been considered acceptable. Once the community agrees with this proposal, we will check it again, but it is highly likely that all of this complies with privacy laws (GDPR, ePrivacy, PECR, CCPA).

It should be clear that we are operating close to the edge of what's allowed and what isn't. Therefore, details matter. Strongly helping our case is that we don't track cross-device, cross-app, or over long time periods; we are not trying to build a profile and do not use this data for marketing or targeting. It also helps that companies like Plausible are already operating with the techniques used here.
## Experimentation and implementation timeline

While this proposal is quite detailed and draws on lots of experience from similar existing systems, building a statistics system requires experimentation and real-world data. That way, assumptions can be tested and algorithms can be tweaked. While lots of care has been put into this proposal, I don't expect it to work perfectly out of the box. Some of the assumptions we want to check are:

- Stability/uniqueness of IP addresses (see discussion section)
- Order of raw events for a typical user interaction with a video
- Typical byte ranges of HTTP requests for videos

It would be ideal to quickly test out data gathering and aggregation in production systems, without exposing collected data to users quite yet. The usefulness of test systems is very limited here, as we need real user interactions.

It might also be worth experimenting in Tobira for parts of this. Tobira likely wants some statistics of its own in addition to what Opencast will offer (e.g. page views, global daily unique users), and Tobira might be a better playground for experimentation: implementation is likely faster (smaller scope, ...) and updating Tobira is easier than updating Opencast.

And speaking generally, changes outside of Opencast are necessary, most notably in octoka, the LMS, and Tobira. The Paella player shipped with Opencast also needs to be adjusted to generate raw events.

<br>

## Technical specification

Some general notes:

- Fields/data/types not explicitly marked as optional or nullable are *not nullable*.
- This specification already considers us being able to collect statistics about more than just videos. This is mainly for future-proofing the design and might not be relevant in the near term. If it is possible to add this flexibility later, it can be ignored in the first implementation.
- When referring to an RFC3339 date or datetime, we additionally require:
  - The date to use 4-digit years
  - The timezone to always be UTC (i.e. it always ends in `Z`)
  - The separator between date and time to be `T`

### Overview (Non-normative)

- All data is stored in Opencast
  - _Raw events_: low-level user actions, with timestamp, pseudonymized for the current day, fully anonymized for past days
  - _Aggregated data_: user-friendly data, continuously updated from raw events
- Raw events use session hashes as a privacy-friendly way to detect unique users
- Opencast stores a "threshold date" to inform users since when statistical data was collected
- APIs
  - Registering events
    - *Client push* (`POST /api/stats/client-push`): intended for a user browser to directly register an event (similar to how Matomo works)
    - *Trusted push* (`POST /api/stats/trusted-push`): allows trusted external apps (octoka, Tobira, ...) to push many events at once
  - Reading aggregated data: `GET /api/stats/video/<id>?...`
  - Potentially other utility APIs

### Common types & constants

- `video_timestamp`: unsigned integer, at least 31 bits, specifying a time in a video via the number of milliseconds since the video's start.
- `datetime`: string containing an RFC3339 formatted `date-time`
- `MAX_CLIENT_PUSH_DELAY` (duration): specifies how far in the past raw events are allowed to be when added by a non-trusted source. Recommended value: 15min.
- `ALLOWED_CLOCK_SKEW` (duration): specifies the maximum allowed/expected clock skew between different devices (user client, file server, Opencast). Recommended value: 30s.

### Unique user detection & session hash

#### Intro

For analytics and statistical systems, it is important to be able to distinguish new from returning users, and to see which low-level events stem from the same high-level user interaction.
For Opencast in particular, we need it to:

- Correlate "play", "fetch-file", "seek", and similar video events to one another
- Know how many unique users watched a video and how often it was rewatched
- Limit how refreshes influence the view count

Doing this in a manner that respects user privacy and complies with privacy laws without a consent banner is fairly tricky. We will use a method very similar to what privacy-focused analytics tools like "Plausible" are using. To understand the general approach, see the relevant discussion section below!

#### Specification

The _session hash_ is defined as `HMAC_SHA256(daily_secret, item_id || IP || UA)` where:

- `HMAC_SHA256` is [HMAC](https://en.wikipedia.org/wiki/HMAC) using SHA256
- `daily_secret` is a random secret, rotated/regenerated daily
- `||` is the concatenation operator
- `item_id` is the ID of the item the event involves (e.g. video UUID)
- `IP` is the IP address of the user
- `UA` is the user agent string of the user

*Note*: the hash should operate on the actual bytes of the IP address, not its stringified form.

The `daily_secret` must be rotated each day, ideally during times of lowest activity (e.g. 4am). It must be completely randomly generated and have at least 128 bits. It must _not_ be stored in the database at any point, but distributed to other Opencast nodes via different means (if necessary). Otherwise, careless DB backups could preserve the secret for a long time, breaking this anonymization technique. The old secret must be properly deleted.

See the "alternatives" and "discussion" sections below for more information.

### Raw events

These describe low-level events relevant to statistics. Everything starts as a raw event, as this is the only entry point for statistical data.
Each raw event has the following fields:

- _Timestamp_: UTC timestamp when the event happened (or specifically: started happening)
- _Session_: session hash (see above)
- _Item type_: type of item this applies to, like "video", "series" or "playlist"
- _Item ID_: the ID of the item this applies to, e.g. video UUID
- _Event type_: the type of event that happened (e.g. play, pause, seek, ...), from a list of allowed values.
- _Event payload_: an arbitrary JSON payload that depends on the event type.

Raw events are stored in some kind of DB. In code, there should be an interface layer to be able to use different DB implementations. Initially, only one implementation will exist: storing this data in the main DB that Opencast already uses. This setup allows us to use a time series database in the future, if that is deemed useful.

The item type and event type should both be stored in a compact form, i.e. not as a string, but as an "enum" (each type is assigned a small integer). 8 bits should suffice for the item type, 16 for the event type. The event data JSON should be stored as compactly as possible as well. This usually means using DB built-in types like PostgreSQL's `jsonb`. The session hash should be stored as binary (e.g. `bytea` in PostgreSQL), not as a hexadecimal or base64 string.

The storage should allow the following operations to be executed quickly (usually by using indices):

- Lookup by item ID (potentially together with item type)
- Range queries for `timestamp`

### Event type and data

The following events and their respective event data (payloads) are defined. If no payload is specified, the event data field must be null. Otherwise, the field must be a JSON object with the fields specified below.

- `page-visit`: a page dedicated to this item was opened. For example, Opencast's `/play` route or Tobira's video page (`/v/<id>`) would generate this event, but a page embedding a video among other things would not. Payload:
  - `url` (`string`): URL of the dedicated item page
- `video:play`: user has clicked "play" on a video to start watching. No payload.
- `video:pause`: user has paused video playback. Payload:
  - `at` (`video_timestamp`): when the user paused
- `video:resume`: user has resumed video playback. Payload:
  - `at` (`video_timestamp`): where the user resumed playback
- `video:seek`: user jumped to somewhere in the video. Payload:
  - `to` (`video_timestamp`): time in the video that was jumped to
- `video:watched`: the user has fully watched part of the video. The event timestamp is the "end" time, when the part has already been watched. Payload:
  - `from` (`video_timestamp`)
  - `to` (`video_timestamp`)
- `fetch-file`: a file was (partially) downloaded. The event timestamp is the time when the first request was received. Payload:
  - `elem` (string): (file) element ID, which is the path segment after the video ID.
  - `from` (uint): start of the byte range of what was downloaded. Non-range requests specify 0.
  - `to` (nullable, uint): end of the byte range of what was downloaded. `null` if the request does not specify an end and the file server has no way to check what bytes were actually sent to the client.

Of course, new event types may be added in the future.

#### Notes for frontends

Sending video events to the backend should be buffered. [`sendBeacon`](https://developer.mozilla.org/en-US/docs/Web/API/Navigator/sendBeacon) should be used to send all remaining buffered events when the user closes the page. Note however that `sendBeacon` is not fully reliable, so the buffer duration should be kept reasonably short, to avoid losing too many reports. Events must not be buffered for longer than `MAX_CLIENT_PUSH_DELAY`.

None of the statistics requests should influence the main functionality, i.e. even if requests are super slow, or fail, main functionality must remain.

Frontends should debounce video events to clean up user behavior a bit.
For example, for multiple seek operations in a short amount of time, reporting only the last one is likely a good call (imagine a user trying to jump far forward by pressing the +10s button many times). The same is true for pause/resume actions.

To report the `video:watched` action, frontends need their own small logic, and they should always try to report the largest possible range, or in other words: merge adjacent ranges. For example, instead of reporting one `video:watched` event for every second watched, only one event for each consecutive section watched should be reported. Of course, we still want to report progress somewhat regularly (to avoid losing these events on page close), so `video:watched` events of neighboring ranges will still end up in the DB.

#### Notes for file servers

Requests for consecutive byte ranges of the same file should already be merged, to reduce the number of events stored. Ideally, file servers should report byte ranges of data actually sent to the client instead of the request's `Range` parameters. That way, clients just sending requests and immediately closing the connection wouldn't count, or would at least only count for one packet's worth of bytes.

### Raw event producers

The initial implementation of this should already include basic producers of these raw events, in particular:

- A Paella player plugin that generates all the `video:*` events and sends them via client push
- The built-in OC file server generating `fetch-file` events
- Octoka generating `fetch-file` events

How to implement these is basically described in the previous section in "Notes for frontends"/"Notes for file servers".

### Aggregated data

Raw events are aggregated into user-friendly data. This is the part where the "basic" in this proposal's title applies. We can start simple, as future additions (like a timeline heat map) can derive their information from already stored raw events.
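The range merging described in "Notes for frontends" above could look like this. This is a sketch only; the helper name and the tolerance parameter are illustrative, not part of the spec:

```javascript
// Merge adjacent/overlapping watched ranges ({from, to} in milliseconds)
// so that each consecutively watched section is reported as a single
// `video:watched` event. `toleranceMs` (an assumed tuning knob, not part
// of the spec) lets nearly-adjacent ranges count as one section.
function mergeWatchedRanges(ranges, toleranceMs = 500) {
  const sorted = [...ranges].sort((a, b) => a.from - b.from);
  const merged = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last && r.from <= last.to + toleranceMs) {
      // Extends (or overlaps) the previous section: grow it.
      last.to = Math.max(last.to, r.to);
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}
```

A frontend would run its pending ranges through such a helper before each flush, so that per-second progress collapses into one range per consecutively watched section.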
For videos, we store the following data:

- Number of views
- Watchtime

Both of these metrics are defined below. They are stored in hour granularity (i.e. per hour bucket) _and_ as an all-time total (which is the sum of all hour buckets). Example (for a single video, just showing views):

```
Total: 3419
...
2026-04-27 8am: 0
2026-04-27 9am: 3
2026-04-27 10am: 2
2026-04-27 11am: 5
2026-04-27 12pm: 12
2026-04-27 1pm: 8
...
```

#### Harvesting metadata

While many external applications will simply use the Opencast API to read statistical data directly, some others (e.g. Tobira) rely on having their own cached version of that data. For that purpose, some way of harvesting new statistical data is necessary. Thus, Opencast should store a timestamp for each video specifying when its aggregated statistical data was last updated. To be clear: the existing `updated` timestamp of the video should _not_ be updated when only statistical data changes.

#### Storage details

The exact storage details are up to the implementation: storing a number for each hour since the video was uploaded might be a bit wasteful and mostly store 0s, so some optimization might be appropriate. One option is to store one DB row per day with >0 views, each row storing two arrays of 24 values (one for views, one for watchtime). It is also up to the implementation how to obtain the all-time totals (always calculate on the fly, just cache in memory, store in DB, ...), as long as retrieving these totals is a fast operation.

A time series DB might also be a good choice for this data. However, Opencast statistics should not require admins to set up such a time series database. Storage in the main DB must remain an option. Code should be structured in a way that allows plugging in a time series DB later.

It should also be considered that common operations should be fast. The aggregation itself will regularly update data, so that shouldn't be overly costly.
Further, the API for reading aggregated data will query data for a single video, so an index should likely be added. Finally, harvesting will often run queries of the form "for all videos that had their aggregated data change since $timestamp, give me their totals and the $n last hour buckets".

### Aggregation algorithm

Aggregation of raw events happens continuously in order to keep statistics fairly up to date. However, it does not happen fully "on the fly", i.e. the API handlers ingesting raw events do not perform the aggregation themselves. Instead, there should be a background aggregation task. There are multiple reasons for this:

- Often, high-level metrics are a combination of multiple raw events, so we need to wait for all of them before being able to update the aggregation.
- For abuse detection, we want a longer time buffer in which we can observe unusual activity.
- Delaying statistics updates a bit actually improves user privacy, as a video owner cannot pinpoint the exact second when a video was watched.

The background task regularly "wakes up" and checks the raw events to see if some of them can be aggregated. There is only one such task per Opencast cluster.

Aggregation likely requires certain temporary data to be stored. The DB could be used for that purpose, allowing the task to continue its work without problems after having crashed or been killed. Alternatively, the data could be stored in working memory, with a restart requiring the task to redo all the aggregation work for the day to arrive at the state it was in before. Details are left for the implementation to decide.

#### Views

A number approximating the notion of how often a video was viewed. Of course, this over-simplifies user behavior and necessitates some kind of semi-arbitrary definition of what constitutes a "view". We roughly stick to what YouTube is doing, mostly because that matches user expectations most closely. Compare the reference section on YouTube details.
The following combination of raw events (all with the same session hash and video ID) is counted as one view:

- A `video:play` event
- `video:watched` events summing to at least `min(30s, video_duration * 0.7)` of watch time
- At least one `fetch-file` event

There are some important nuances:

- A `video:watched` event is always associated with the closest `video:play` event that happened before it.
- The `fetch-file` event is generated by another device, so we need to account for clock skew and also count `fetch-file` events that, according to the timestamp, happened slightly before the `video:play` event. But the aggregation algorithm must make sure that each `fetch-file` event is only associated with a single `video:play` event and thus only contributes to one view. Allowed clock skew should be around 30s max.

Examples:

- play: no view
- play, fetch: no view
- play, fetch, watch 35s: 1 view
- play, fetch, watch 10s: no view (too little watchtime)
- play, fetch, watch 10s, fetch, watch 25s: 1 view
- fetch, play, watch 35s: 1 view (allow for some clock skew)
- play, fetch, fetch, watch 35s, play, watch 35s: 2 views, assuming all happens in a short time frame (fetches can be clock-skewed)
- play, fetch, watch 35s, play, watch 35s: 1 view (fetches can be clock-skewed, but there is only a single fetch)
- fetch, ...2 minutes pass..., play, watch 35s: no view (allowed clock skew is limited)
- play, fetch, watch 35s, fetch, ...2 minutes pass..., play, watch 35s: 1 view (allowed clock skew is limited)
- play, fetch, watch 35s, fetch, watch 35s, play: 1 view (`watched` events always belong to the `play` event happening before them)

Notes:

- Browsers might cache video data, meaning a rewatch of a short video might not generate `fetch-file` events and thus not generate views. This kind of cache is typically rare and does not last for long, so the effect is limited.
- We might want to make the `fetch-file` requirement configurable, as not all Opencast nodes are immediately set up with a file server reporting these events (e.g. octoka). This configuration should likely not be a boolean, but a date after which this is a requirement (see "Data completeness" section).
- We are not using the byte range request info of `fetch-file` here yet, as interpreting it to estimate "30s watched" is finicky.

On top of that logic, we limit the number of views a single session hash can generate per day to 4. Further "views" generated by the above logic are ignored if the session hash has already viewed the video 4 times.

#### Watchtime

How much of the video was watched in total. This is video-duration, not real-time duration, i.e. watching at 2x speed means a user can generate 10min of watchtime in 5min.

Watchtime is simply the sum of `video:watched` raw events, with some extra logic:

- Only `video:watched` events are counted that can be associated with at least one `video:play` and one `fetch-file` event (very similar to the view count explained above).
- We cap the time each raw event contributes to 2.5 times the time passed since the last `video:watched` event for the same video and session hash. That prevents a single user from sending "watched 1h of the video" every second. If there is no previous `video:watched` event, the cap is `MAX_CLIENT_PUSH_DELAY * 2.5`. The factor of 2.5 is the maximum reasonable playback speed. Users might rarely watch at higher speeds, but the error due to this is acceptable. In the future, we can potentially use the `fetch-file` ranges to further validate `video:watched` events.

No additional daily watchtime limit per user session is imposed; the above rules already imply a limit of 2.5 * 24h. Users trying to artificially inflate watchtime figures are likely not a big threat, because watchtime, unlike view counts, is unlikely to ever be shown publicly.
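The capping rule above can be sketched as follows. The function name is illustrative, and the constant uses the recommended value of `MAX_CLIENT_PUSH_DELAY`:

```javascript
// Sketch of the watchtime cap: each `video:watched` event may contribute
// at most 2.5x the wall-clock time since the previous `video:watched`
// event of the same video + session hash.
const MAX_CLIENT_PUSH_DELAY_MS = 15 * 60 * 1000; // recommended value: 15min
const MAX_SPEED = 2.5; // maximum reasonable playback speed

// `watchedMs`: video time covered by the event (`to` - `from`).
// `msSincePrevEvent`: wall-clock time since the previous `video:watched`
// event, or null if there is none.
function cappedWatchtime(watchedMs, msSincePrevEvent) {
  const cap = msSincePrevEvent == null
    ? MAX_CLIENT_PUSH_DELAY_MS * MAX_SPEED
    : msSincePrevEvent * MAX_SPEED;
  return Math.min(watchedMs, cap);
}
```

For example, a client claiming "watched 1h" one second after its previous report only contributes 2.5s of watchtime.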
### Data completeness

As we are introducing this feature into software that is already running in production, we have to deal with the fact that existing videos will have incomplete data. Somewhat obviously, before this proposal is implemented, released, and deployed, no statistical data is collected at all. Further, data collection relies on other software (LMS, Tobira, octoka, Paella) implementing some support, which will likely happen at different times. Therefore, there will likely be periods of partial data collection. Finally, in the future, we might update the system to collect more data or use smarter aggregation algorithms.

Applications that present statistical data to users (in particular "all time totals") should show a hint/note if the data for that video is likely incomplete. Otherwise, users will be confused or even displeased (e.g. if their videos show a small view count just due to being uploaded some time ago). Of course, applications probably don't want to explain the full timeline of data collection to the user, but can, for example, just give one date before which data was not collected.

The statistics system in Opencast should be versioned (internally). This can be a simple number. The initial version (i.e. this proposal) is version 1. The version should be bumped whenever a change is made that could influence statistics seen by users to a notable degree. Whenever Opencast starts up, the statistics module writes an entry with its version and the current timestamp into the DB; that is, unless the latest (by timestamp) entry in the DB already matches the current version. That way, admins get a version history showing when a certain version of the statistics system was first deployed.

#### Threshold date

Even with this versioning data, deciding on one date as a threshold for "complete statistical data" is tricky. Therefore, a configuration option needs to be added that allows administrators to choose a date manually.
If the configuration is not set, Opencast will default to the date when version 1 first started up (which is stored in the database, see above). This fallback might be changed in future version of the statistics system. This "threshold date" is exposed in the API and should be shown by frontends to users. It should be understood however, that this is a simplification in most cases and that the timeline of data collection is slightly more complicated. Frontends should therefore use somewhat vague language, e.g. "Statistical data from before 2026-04-28 might be incomplete or completely missing". Frontends can also use heuristics to decide to sometimes not show "all time total" numbers at all for old events, where the numbers might be very misleading. ### API: raw event ingest #### Client push This is similar to most analytics APIs: the client (user browser) directly sends a request to that API, authentication is not required. ``` POST /api/stats/client-push ``` The request must contain a body containing a JSON object with the following fields: - `events`: array of objects with the following fields, which correspond directly to the fields of a raw event with the same name: - `timestamp`: [`datetime`](#common-types) with millisecond precision (i.e. 3 digits after `.`). Example: `"2026-04-27T14:56:38.415Z"` - `item_type`: string - `item_id`: string - `event_type`: string - `event_payload`: optional, `null` or object (Note: we use an outer object instead of the array to future-proof this API.) Upon receiving a request, Opencast performs these steps: - For each item/event in the request body: - Verify: - `item_type` and `event_type` are known - `item_id` is non-empty - `event_payload` fits `event_type`, i.e. required fields are present and have correct type. Unknown field names are allowed, but ignored and must not be stored in the DB. 
    - `event_type` is allowed in client push (currently, only `"fetch-file"` is disallowed)
    - `timestamp` is a valid datetime as specified above and:
      - is not more than `ALLOWED_CLOCK_SKEW` in the future,
      - nor more than `MAX_CLIENT_PUSH_DELAY` in the past.
    - Only cheap checks should be done, certainly nothing that requires DB or file system queries (e.g. checking that `item_id` exists, or that a seek target is <= the video duration).
  - If the item is valid: mark it for insertion into the DB.
  - If the item is malformed: store its index and an error message for reporting.
- Calculate the session hash from the actual IP and user agent string of the request.
  - *Note*: since event timestamps are allowed to diverge from `now()` a bit, it is possible that events end up in the database with a session hash calculated from a secret that does not match the event's timestamp. This only happens around the key rotation time each day, but we lose unique user tracking there anyway, so that's not a problem.
- For all items/events marked for DB insertion:
  - Write a new raw event to the DB, using the above session hash and the values of the request payload for the remaining fields.
- Return 200 with a JSON response object with the following fields:
  - `accepted`: number of events stored in the database
  - `rejected`: array with one item per invalid event in the request.
  Each item is an object with these fields:
    - `index`: index of the invalid item in the request array (0-indexed)
    - `error`: string with a short, developer-focused error message

**Example**

Request (JS):

```js
await fetch("https://opencast.tld/api/stats/client-push", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    // Note: the browser will automatically add the `User-Agent` header
  },
  body: JSON.stringify({
    events: [
      {
        timestamp: "2026-04-27T14:56:38.415Z",
        item_type: "video",
        item_id: "2ea94d36-e5aa-4069-af43-75515772d2c2",
        event_type: "video:play",
        // event_payload might be omitted or set to `null` here
      },
      {
        timestamp: "2026-04-27T14:56:41.123Z",
        item_type: "video",
        item_id: "2ea94d36-e5aa-4069-af43-75515772d2c2",
        event_type: "banana", // unknown type
      },
      {
        timestamp: "2026-04-27T14:56:57.987Z",
        item_type: "video",
        item_id: "2ea94d36-e5aa-4069-af43-75515772d2c2",
        event_type: "video:seek",
        event_payload: {
          to: 42856,
          flux: "compensated", // unknown field
        },
      },
    ],
  })
});
```

This would result in two events being written into `raw_events`, both with the same session hash:

- `video:play` with `null` payload
- `video:seek` with `{ "to": 42856 }` as payload (the unknown field is ignored)

Response:

```json
{
  "accepted": 2,
  "rejected": [
    {
      "index": 1,
      "error": "unknown event_type 'banana'"
    }
  ]
}
```

#### Trusted push

This is intended for servers/nodes that Opencast trusts, like `octoka` (which delivers files). Authentication via a shared secret is required. (Authentication details are not further specified here.) The trusted node can push many events at once, from multiple users.

```
POST /api/stats/trusted-push
```

This endpoint is very similar to the "Client Push" endpoint.
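As a small illustration of the verification rules above, the timestamp window check might look like this (the two constant values are assumptions; the proposal does not fix them):

```javascript
// Sketch of the timestamp window check from client-push verification.
// Both constants are placeholder values, not part of the proposal.
const ALLOWED_CLOCK_SKEW_MS = 30 * 1000;         // assumption: 30 seconds
const MAX_CLIENT_PUSH_DELAY_MS = 60 * 60 * 1000; // assumption: 1 hour

function timestampAcceptable(timestamp, now = Date.now()) {
  const t = Date.parse(timestamp);
  if (Number.isNaN(t)) {
    return false; // not a valid datetime
  }
  return t <= now + ALLOWED_CLOCK_SKEW_MS      // not too far in the future
      && t >= now - MAX_CLIENT_PUSH_DELAY_MS;  // not too far in the past
}
```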
Apart from authentication and the API path, there are the following differences:

- Each object in the request array must contain two additional fields:
  - `addr`: IP address (as string)
  - `ua`: user agent string
- During input verification:
  - All `event_type`s are allowed (including `fetch-file`)
  - The `timestamp` may be arbitrarily far in the past (i.e. `MAX_CLIENT_PUSH_DELAY` is not used)
- Session hash:
  - The session hash is calculated for each event individually, using the provided `addr` and `ua` fields instead of the request metadata (as the request is sent by another node).
  - The rotating daily secret must be acquired atomically once, such that all events use the same secret.

**Example**

Request (JS just for demonstration purposes; in practice likely not sent by JS):

```js
await fetch("https://opencast.tld/api/stats/trusted-push", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    events: [
      {
        timestamp: "2026-04-27T15:29:31.456Z",
        addr: "105.59.238.2",
        ua: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:149.0) Gecko/20100101 Firefox/149.0",
        item_type: "video",
        item_id: "307c7327-d7e4-47b5-b01d-6779e2422f9f",
        event_type: "fetch-file",
        event_payload: {
          elem: "1f6cca9c-0d4c-4bad-ab82-53b6b0514508",
          from: 0,
          to: 262144,
        },
      },
    ],
  })
});
```

### API: read aggregated data

These are APIs to read aggregated statistical data. In the future, this data will be integrated into the new API.

```
GET /api/stats/video/<id>?...
```

Query arguments:

- `granularity` (optional): `total` (default) or `hour`

Responses:

- 200: video exists, JSON body (see below)
- 403: user is not authorized to read statistical data of this video (see below)
- 404: video does not exist

The response body (in case of 200) is a JSON object with the following fields:

- `incompleteBefore` (nullable `datetime`): if the ["data completeness threshold date"](#threshold-date) is after the video creation date, this field contains the data completeness threshold date.
  Otherwise, this field is `null`.

Additionally, there are fields depending on the `granularity` argument. If `granularity` is `total`:

- `views` (uint): all-time view count
- `watchtime` (uint): all-time watch time (in seconds)

If `granularity` is *not* `total`:

- `start`: [`datetime`](#common-types) of the start of the first bucket.
- `views` (array of uints): view count in buckets of length `granularity`
- `watchtime` (array of uints): watch time (in seconds) in buckets of length `granularity`

In the second case, `start` is roughly the video upload date or the start of statistical data, whichever is later. The end is always `now()`, i.e. all available data after `start` is included.

**Authorization**: for now, only users who have `write` access to the video are allowed to read its statistical data. This will likely be refined in the future, by making it configurable and potentially adding a corresponding action to the ACL.

*Note*: in the future, additional features can be added to this API. For example, day granularity can be added. Further, we might want to limit the number of buckets that can be returned. Though even for 10 years of hour buckets, the response will only be around 1 MB _before_ compression. It's likely that statistical data over time is only requested for one video at a time, so that shouldn't be a problem.

#### Examples

```
GET /api/stats/video/2517bc0e-5fcd-4ca2-ba3a-f81f4e5b1604
```

```json
{
  "incompleteBefore": "2026-04-23T14:23:09Z",
  "views": 392,
  "watchtime": 70581
}
```

```
GET /api/stats/video/5fcb2356-91d8-455c-bac3-08393e1f3cb5?granularity=hour
```

```json
{
  "incompleteBefore": null,
  "start": "2026-04-25T08:00:00Z",
  "views": [3, 21, 25, ...],
  "watchtime": [120, 9540, 14283, ...]
}
```

### Additional APIs

Additional APIs should be added when deemed useful during implementation. Ideas:

- API to re-aggregate data for a specific video (or all of them) for a specific time range (or all time). Only for admins.
- API to get the "data completeness threshold date"
- API to get the version of the statistics system

For some of these, it might be useful to *not* put them under `/api`, but under a separate path like `/stats`, in order to not give API stability guarantees. These details are for the implementation to decide.

<br>
<br>

## Future extensions

### Rate limiting and additional abuse detection

The system as specified above only protects against very simple cases of abuse (e.g. a user just refreshing a page a hundred times). While we can never protect against everything, there are some additional easy-to-implement mitigations that Opencast should add. These techniques are applied not only to client pushes, but also to trusted pushes, so that trusted pushers do not have to implement the same protections themselves.

For one, there should be some kind of rate limiting per IP address. While a user can often somewhat change their IP address (restart home router, change MAC address & reconnect to WiFi, ...), this is usually limited, takes some time, and is not easily scriptable. Access to a big pool of IP addresses that can be easily switched between is out of the reach of most even technically inclined users. If the number of raw events from a single IP address over some time spans (e.g. 1s, 30s, and 5min) exceeds a configurable limit, that IP address is temporarily blocked (new raw events from it are ignored) and already stored raw events from that IP address in that time span are deleted.

Further, rate limiting per item (e.g. video) also makes sense. Typically, attackers would want to increase the view numbers of a small number of videos.

Information about rate limiting is not exposed via the API (i.e. the client push API will not reply "ignored, you are rate limited"), in order to make it harder for bad actors to detect such cases.

To perform rate limiting, additional data needs to be stored. This might use the DB or potentially just the working memory of the Opencast node.
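For illustration, a minimal in-memory variant of the per-IP limit could look like this sketch (window sizes and limits are invented example values; in Opencast they are meant to be configurable, and the deletion of already stored raw events is omitted here):

```javascript
// Sketch of an in-memory per-IP sliding-window rate limit.
// All window sizes and limits below are made-up example values.
const WINDOWS = [
  { spanMs: 1_000, limit: 10 },
  { spanMs: 30_000, limit: 100 },
  { spanMs: 300_000, limit: 500 },
];

const eventTimes = new Map(); // IP address -> array of event timestamps (ms)

function allowEvent(addr, now = Date.now()) {
  const times = eventTimes.get(addr) ?? [];
  // Drop timestamps older than the largest window.
  const maxSpan = Math.max(...WINDOWS.map((w) => w.spanMs));
  const recent = times.filter((t) => now - t < maxSpan);
  // Blocked if any window's limit is already reached.
  const blocked = WINDOWS.some(
    (w) => recent.filter((t) => now - t < w.spanMs).length >= w.limit,
  );
  if (!blocked) {
    recent.push(now);
  }
  eventTimes.set(addr, recent);
  return !blocked;
}
```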
While the latter makes the limit slightly less effective, effectively multiplying the actual limits by the number of nodes, it might be easier to implement, and it makes no difference for the many institutions with just one statistics node. Further, it helps with making sure the data is deleted regularly: IP addresses may be temporarily stored (even in clear text) if there are valid reasons, and spam/abuse detection is one of those reasons (see Fathom below). But of course, the addresses must only be stored as long as absolutely necessary and safely deleted afterwards.

Additional detection of attack patterns might be employed beyond that, but of course a sufficiently motivated attacker can imitate natural access patterns and make detection almost impossible. Abuse detection at the sophistication level of YouTube will never be possible for Opencast, as one should expect. The only viable way (I can think of) to get almost abuse-resistant statistics is to not count anonymous views at all, but to require a login that's presumably well-controlled by the university. This is likely not in the interest of most Opencast users.

### Timeline heatmap

Aggregate data that indicates which parts of a video are watched most. This will be some kind of combination of `video:watched` and `video:seek` raw events. Potentially also `video:pause` events, or those could be aggregated into their own metric, to see where users commonly stopped.

The data can be shown to the video owner, which is useful for didactical reasons, among others. We might also show it in the player for every user, which potentially helps them find relevant parts of a video. But if we do that, we have to deal with the feedback-loop problem: users jump to a time because many users jumped to it already, amplifying the data.

### Basic user data

It seems legally permissible to store some coarse data about users with each raw event, like:

- _Device type_: desktop, mobile, other
- _Operating system_: Windows, Linux, Android, iOS, ...
  (just coarse)
- _Browser_ (without version)
- _Location_: very coarse, likely just the country, at most including very big cities

This would allow us to have more demographic information, potentially per video. For video owners, we would probably only show device type and location, if at all. Other than that, this data is mostly useful in aggregate for the whole Opencast installation. Care must be taken to not accidentally store information that is too fine-grained and would allow user identification. This data would only be stored for raw events of the current day and aggregated & discarded on `daily_secret` rotation.

The first three data points can be derived from the UA. The location can be derived from the IP address using open source IP geolocation databases like [IP66](https://ip66.dev).

### Referrers

Some raw events, like `page-visits`, can get a referrer field added, storing where the user came from. Again, we must make sure the stored value does not contain any user-identifiable information. Common practice is to just store the domain, which would already give us plenty of insight. We might consider having some special values, for example for Tobira, to distinguish users coming from its search, from some public page, or from their own bookmarks.

## Alternatives

### Salted hash instead of HMAC

Companies like Plausible use `hash(secret || UII)`; this specification suggests `HMAC(secret, UII)`. These two are essentially equivalent from a security perspective. The reason we use HMAC is that it is the proper tool for the job: a keyed hash function. Concatenating a secret to the input of a hash has some nuances that can lead to problems, like the "length extension attack". In this scenario, I believe all these nuances don't matter and cannot be abused by an attacker (mostly because an attacker does not have access to the produced hashes, except if the DB leaks, which is usually quite time-delayed).
Using HMAC still seemed like the better choice, just to use something with better provable security guarantees. There are potentially two reasons to use `hash(secret || UII)` instead. As HMAC is essentially just a hash function applied twice (plus some minor bit fiddling), calculating it takes twice as long. Further, when judging this proposal from a legal standpoint, lawyers might be more comfortable with something that exactly matches what's used by others, even though the two are equivalent.

### Do not include the item ID in the session hash

By including the item ID (e.g. video ID) in the session hash input, alongside the UII, we make session hashes between different items incomparable. While this retains the ability to know which events (e.g. "press play") for a specific video come from the same user, for two different videos we cannot tell whether two events were produced by the same user or not. This does not appreciably hamper brute force attacks, so its only real purpose is to prevent cross-item correlation. We could omit the item ID from the input, which would mainly give us two new abilities:

- See which videos are commonly watched together (on the same day)
- Know how many unique users Opencast had on a day

I believe it's legally not a problem to omit the item ID (compare Plausible & co.), but including it was chosen to go "above and beyond" in respecting user privacy.

### Where to store the daily secret?

Above, it is specified that the daily secret is fully random and must not be stored in the DB, due to the risk posed by DB backups. This complicates things a bit, unfortunately. If we indeed want to keep everything secure even with DB backups, we have a few choices:

- Store it in memory: safest, but if OC crashes, the secret is lost, meaning that we treat all events after the restart as coming from new users. Not a catastrophe, it just skews the data a bit.
- Store it in NFS or the local file system: survives crashes, and NFS also allows for easy sharing with other nodes, but might also be susceptible to leakage via backups.

Generally, we could assume that only a single OC node needs the secret. This limits the HA capability of OC to some degree. If we want to support multiple nodes, choosing and sharing a secret among nodes is tricky and likely needs some kind of consensus algorithm.

Alternatively, we could instruct admins to exclude certain tables from backups and then use the DB to store the secret. If we then also employ the "hash to ID" rewrite idea (see below) and exclude the current day's events from the backup, we can even use a predictable secret, like `KDF(master_key, date)`.

### Session tracking via unique ID generated in the frontend

Instead of using IP+UA, one might think of generating an ID in the frontend and attaching it to all requests. This is basically impossible without user consent, as it runs into the same legal clauses as tracking cookies, even without using cookies. As soon as you store some kind of tracking identifier (used for functions not explicitly requested by the user) in a cookie, local storage or any equivalent place, consent is required. The only way to make this potentially work legally is to make the session ID completely ephemeral, i.e. it is discarded on page reload and probably on navigation. That might be legally fine, but is then likely less powerful than the IP+UA approach. There are also some technical disadvantages (e.g. attaching the ID to media URLs destroys browser caching), but these are not worth discussing given the above.

### Have a separate "raw event archive"

Instead of keeping raw events inside the main table forever, one could move them into a separate table/storage. Doing that, we could save a bit of storage space, especially once we store basic user data (location, device type, ...) per event, which is deleted for old raw events.
Further, it makes it a touch easier to purge old events (`TRUNCATE TABLE` instead of a timestamp filter). Finally, DB backups might want to skip the pseudonymized events while including the fully anonymized events, which is also easier if the two are separate tables.

### Additional anonymization techniques

On `daily_secret` rotation, we could also replace all hashes by artificial IDs. For example: initialize a counter to 0 and an empty hashmap mapping from session hash to ID. Then iterate through all raw events and check whether the hashmap already contains the session hash. If yes, use that ID. If no, insert the session hash with the current counter value into the hashmap and increment the counter. This maintains the same unique-user property as before but completely gets rid of the hashes. Deleting the old daily secret already properly and fully anonymizes the data, as the hashes are then impossible to brute force. This "hash to ID" rewrite would just be an additional layer of protection, in case the daily secret is accidentally persisted or leaked, or a vulnerability is found in the hash function.

We could also coarsen the timestamps to hour resolution, for example. There might be some tiny remaining risk of re-identifying users by the exact timestamps. Again, I don't think this is necessary (technically or legally), and it would somewhat hinder our ability to use the archive data in the future. But it would be an additional layer of protection.

## Discussion

### Hashes as anonymization or pseudonymization?

At the core of this proposal is the idea of detecting unique users via the hash of the IP address and the user agent string. Both of these data points are "user identifiable information" (UII) and there are strict rules on storing them. Ideally, they aren't stored at all, but in some circumstances it is allowed to store them without consent:

- Importantly, it needs a good reason. Marketing, tracking, and ads are obviously not valid.
  But even statistics like ours are mostly not enough, as they are not a feature explicitly requested by the user. Security or abuse protection is usually valid.
- Storing the data must actually be necessary (i.e. there is no alternative approach to solve the problem).
- The data may only be stored as long as absolutely necessary.

By hashing the UII, do we sidestep this issue? That's tricky. A hash function like SHA256 is mathematically a one-way function, meaning you can only hash some data, but never go from the hash output back to the original data. We encounter problems when this theoretical property meets the real world, most importantly due to low input entropy and brute forcing. Brute forcing is the process of trying a huge number of inputs to see if hashing one of them produces the hash value in question. For SHA256, even consumer GPUs can calculate tens of billions of hashes per second. There are hash functions with tunable computational complexity, which can make brute-forcing extremely slow, but this is not a viable solution here, as Opencast servers would have to calculate this expensive hash function very frequently.

The viability of brute force hinges not only on the hashing speed, but also on the input entropy, or in other words: how many different inputs do we reasonably expect? Assuming IPv4 addresses, that gives us 4 billion, though restricting to residential IP blocks likely excludes a large chunk. User agent strings could be anything, but a few hundred of them would likely cover the vast majority of users. That puts us roughly in the ~trillion (10^12) range, and an invested attacker with some additional information can likely reduce this number quite a bit. Either way, that is definitely in a range that makes a brute-force attack viable, putting users at risk of "re-identification".

Companies like Plausible use a _salt_ to make the hashes more brute-force resistant.
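To put these numbers into perspective, a back-of-the-envelope calculation (all figures are rough assumptions taken from the paragraph above):

```javascript
// Rough brute-force estimate for unsalted hashes of IP + UA.
const ipCandidates = 4e9;     // entire IPv4 address space
const uaCandidates = 300;     // a few hundred common user agent strings
const hashesPerSecond = 2e10; // "tens of billions" of SHA256 hashes/s on a GPU

const inputSpace = ipCandidates * uaCandidates;        // 1.2e12 inputs
const secondsToExhaust = inputSpace / hashesPerSecond; // 60 seconds
```

In other words, without a salt the whole plausible input space falls to a single GPU in about a minute, which is why the rotating secret matters.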
This salt is a random string that is added to the input before hashing, drastically increasing entropy. It is rotated (changed) daily, meaning that after such a rotation, all interactions are again counted as new users. Crucially, after the rotation the old salt is properly deleted, rendering the hashes un-brute-forceable. (Technically, the term "pepper" would be more appropriate, as "salt" usually describes a string that might as well be public, while "pepper" describes a secret added to the input before hashing.) Plausible itself calls the hashes of the current day "pseudonymized" and all older hashes "anonymized".

There are a few documents and reports by the EU and state agencies about hashing and how well it works for anonymization (see the references section). Most of these reports focus on anonymizing study results, for example, so they do not map to our use case exactly. The reports mention all the dangers that can break the "one-way function" assumption, and they also discuss the salt technique. For the most part, they call these techniques merely "pseudonymization", although they don't outright deny that, properly implemented, hashing can constitute proper anonymization.

### IP + UA for unique user detection?

One obvious question arises from the above specification: is the combination of IP address and user agent string actually a good proxy for a unique user? ("Proxy" as in "stand-in", not a network proxy.)

First things first: the user agent string (UA) can be trivially changed by the user, as they can send anything. Therefore, an attacker could absolutely trick the system. Protecting against this kind of abuse happens via different means; this system is for well-behaved users.

#### IP address stability & sharing

The more interesting discussion is about IP addresses and how stably they map to one user on one device.
Broadly, we can put users into three buckets depending on the type of network they use: home network, mobile network, and university network. The University of Bern provided me with the statistical figure that only 2% of video views come from their university network. While this number might be higher for other universities, it is reasonable to assume that lecture content is mostly watched at home, not at the university. Further, it is probably fair to assume that the share of videos watched from a mobile network is also not too large, given that most users still have monthly data limits in their contracts and videos are pretty large.

For home networks, either the home router gets assigned a single IPv4 address and uses NAT for devices on the home network; or the ISP itself also runs NAT and a single home does not even get its own address (carrier-grade NAT, CGNAT). In the first case, IPs sometimes stay assigned for many days or even weeks, while in other cases IPs are reassigned regularly, e.g. daily. For carrier-grade NATs, I have no idea how stable IP addresses are. Suffice to say: in all cases, multiple to many devices share an IP address. The hope is that for a typical home network, the UA is able to sufficiently distinguish between all devices.

Mobile networks are a lot more dynamic: CGNAT is very much used, a large number of devices can share the same address, and addresses are reassigned more often, depending on network load and on the user moving from one cell to another.

University networks vary a lot between different universities. I talked to network people of a few universities to get a rough idea, but this is obviously not universally valid. Sometimes, devices in a university network get their own public IP (especially for wired connections), which is sometimes stable for a very long time. Other devices, especially those on WiFi, have more dynamic internal IPs. Moving from one building to another can often result in a new address.
Those devices sometimes/usually use NAT for internet access, i.e. addresses are shared. Lease duration and address pool size vary between universities (e.g. /21 and /23 nets). Notably, many universities have quite large IP pools (up to /16), but often only a fraction of that is used for assigning dynamic addresses to internet users. Finally, VPN users might get yet another treatment, sometimes with more dynamic address rotation.

If Opencast is hosted at the university itself, devices inside the network might not even use their public IP to connect to it. In that case, Opencast would see an internal IP. But there are also cloud-hosted Opencasts, and some universities have internal NATs, so this behavior cannot be assumed.

And let's not forget about IPv6! Support varies: some universities and home networks offer full dual stack, some don't support IPv6 at all. Some home networks even use "DS-Lite", where IPv4 is tunneled via an IPv6 network. Whether the user connects to the server via IPv4 or IPv6 is based on things that definitely escape my limited network knowledge. While the IPv6 pool is large enough for every single device to have its own permanent address, for privacy reasons, home routers or the clients themselves can randomly rotate addresses.

#### Discussion

All of this sounds pretty chaotic and unreliable in general, which agrees with the common knowledge that one shouldn't conflate an IP address with a single device or user. On the other hand, companies like Plausible claim that the IP+UA combination is actually surprisingly accurate for detecting unique users. And compared to consent-based methods, which lose users due to lack of consent, it is supposedly even more accurate. Of course, it is in the financial interest of those companies to make such claims, so they cannot be blindly trusted. Finding fully trustworthy and reliable data on this is very tricky.
Looking at the network descriptions above again, it seems fair to assume that in most cases, the IP of a non-moving user stays fixed for at least around an hour. And further, that for the expected usage pattern of Opencast, the number of users sharing one IP address is decently limited, which in turn means the user agent has a good chance at distinguishing between them.

It helps to look at what we use unique user detection for and what breaks if the IP+UA assumption breaks:

- Very short (~seconds): "click play" to "download first video chunk"
- Single video (~1 hour): downloading more chunks, user interacting with one video (e.g. seeking, pausing), user reloading the page
- Whole day: user rewatching the same video later, or continuing to watch later

For interactions separated by only seconds, it is highly likely that IP+UA stays fixed. For everything during a single video, I'd estimate there is a decent chance IP+UA stays fixed and we correctly correlate all of these to a single unique user. The last use case is likely one where we falsely classify one user as two unique users, as they will likely suspend their device in between, or change networks. And of course, the estimation can also wrongly identify two users as the same, but my guess would be that this isn't too common either.

What can be said with fairly high confidence is that:

- IP+UA is nowhere near a perfect identifier for a single user, and
- it's definitely better than nothing.

### Why keep old raw events around?

- To review unusual activities, e.g. someone trying to influence the view count of a video via scripts. One can also fix incorrect aggregates by subtracting unusual events.
- To inform future improvements and extensions to Opencast's statistics system.
  The algorithm to count views might still need improvements; we might want to know access/usage patterns to better estimate timeline heatmaps or watch time; and there are things not even on the table now, whose design benefits from having real world data.
- To re-aggregate data with better algorithms in the future.

## References and prior art

### Analytics software

No consent banner required:

- [Plausible](https://plausible.io/)
- [Fathom Analytics](https://usefathom.com/)
- [Ackee](https://docs.ackee.electerious.com/#/)
- [Matomo](https://matomo.org/)

### Consent-free unique user detection

Documentation by companies running this in production:

- Plausible
  - https://plausible.io/data-policy#how-we-count-unique-users-without-cookies
  - https://plausible.io/blog/legal-assessment-gdpr-eprivacy
- Fathom
  - https://usefathom.com/data
  - https://usefathom.com/blog/anonymization
  - https://usefathom.com/legal/compliance
- Ackee
  - https://docs.ackee.electerious.com/#/docs/Anonymization

Related documents:

- [Introduction to the hash function as a personal data pseudonymisation technique (Nov 2019)](https://www.edps.europa.eu/data-protection/our-work/publications/papers/introduction-hash-function-personal-data_en)
- [Opinion 05/2014 on Anonymisation Techniques](https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf)
- [Practice guide to anonymising personal data](https://stiftungdatenschutz.org/fileadmin/Redaktion/Dokumente/Anonymisierung_personenbezogener_Daten/SDS_Practice_Guide_to_Anonymising-Web-EN.pdf)
- [Anonymization of personal data](https://bdi.eu/de/publications/publications/anonymization-of-personal-data)

### YouTube

YouTube publicly shows the view count for each video.
Video owners can see a lot more:

- View count, watch time, and unique viewers as graphs over time
- Audience retention graphs (how many viewers are still watching at a given timestamp)
- Sources: how viewers found the video (search, recommendations, external, ...)
- Audience statistics (country, gender, age)

The aggregated total view count is calculated as follows:

- Requires an intentional start (click)
- Must be by a real human (never mind how they estimate that)
- 30s minimum watch time
- Repeats are counted up to 4-5 times per day

Sources:

- https://www.subsub.io/blog/how-does-youtube-count-views
- https://www.tubics.com/blog/what-counts-as-a-view-on-youtube


    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully