# Metrics Naming Conventions

This article consolidates best practices for metric naming conventions, based on the references [linked here](#Refs).

## Naming constraints

Metric names should:

* Be in lowercase
* Use `_` (underscore) as the word separator
* Contain only letters and numbers (apart from the separator)
* Start with a letter

## Permanent: Renaming metrics is painful and dangerous

You will not do it, or you will suffer the consequences. Once the app starts emitting a metric (even in QA), its name effectively becomes permanent. Think early, think twice!

## Should have a (single-word) application prefix

The prefix should be relevant to the domain the metric belongs to. Client libraries sometimes refer to the prefix as the namespace. For metrics specific to an application, the prefix is usually the application name itself. Sometimes, however, metrics are more generic, like standardized metrics exported by client libraries.

Examples:

* `prometheus_notifications_total` (specific to the Prometheus server)
* `process_cpu_seconds_total` (exported by many client libraries)
* `http_request_duration_seconds` (for all HTTP requests)

## Metric names and attributes exist within a single universe and a single hierarchy

Metric names and attributes MUST be considered within the universe of all existing metric names. When defining new metric names and attributes, consider the prior art of existing standard metrics and metrics from frameworks/libraries.

## Associated metrics SHOULD be nested together in a hierarchy based on their usage

Define a top-level hierarchy for common metric categories: for OS metrics, like CPU and network; for app runtimes, like GC internals. Libraries and frameworks should nest their metrics into a hierarchy as well. This aids discovery and ad hoc comparison, and lets a user find similar metrics given a certain metric.
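The naming constraints and the prefix rule from the earlier sections can be checked mechanically, for example in a unit test or a lint step. The following is a minimal Python sketch, not an official validator: the function names `is_valid_metric_name` and `has_prefix` are hypothetical, and the regular expression additionally assumes that underscores appear singly and never at the end of a name, which the constraints above do not state explicitly.

```python
import re

# Lowercase letters and digits, words separated by single underscores,
# starting with a letter. Rejecting doubled or trailing underscores is an
# assumption of this sketch, not a rule stated in the article.
_METRIC_NAME = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if `name` satisfies the naming constraints."""
    return _METRIC_NAME.fullmatch(name) is not None


def has_prefix(name: str, prefix: str) -> bool:
    """Return True if `name` is valid and starts with `prefix` followed by `_`."""
    return is_valid_metric_name(name) and name.startswith(prefix + "_")
```

With this sketch, `is_valid_metric_name("http_request_duration_seconds")` holds, while names such as `HTTP_requests_total` (uppercase), `2xx_responses_total` (starts with a digit), and `http-requests-total` (wrong separator) are rejected. `has_prefix("prometheus_notifications_total", "prometheus")` confirms that a metric carries the expected application prefix.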
## Metric names SHOULD NOT be pluralized

Metric names should not be pluralized unless the value being recorded represents discrete instances of a countable quantity:

* `system.filesystem.utilization`, `http.server.duration`, and `system.cpu.time` should not be pluralized, even if many data points are recorded.
* `system.paging.faults`, `system.disk.operations`, and `system.network.packets` should be pluralized, even if only a single data point is recorded.

## Should have a suffix describing the unit

> The Prometheus guideline takes precedence over OpenTelemetry here: OpenTelemetry says the opposite but has exceptions for some cases, so the Prometheus definition is more consistent.

The unit suffix should be in plural form. Note that an accumulating count has `total` as a suffix, in addition to the unit if applicable.

* `http_request_duration_seconds`
* `node_memory_usage_bytes`
* `http_requests_total` (for a unit-less accumulating count)
* `process_cpu_seconds_total` (for an accumulating count with unit)
* `foobar_build_info` (for a pseudo-metric that provides metadata about the running binary)
* `data_pipeline_last_record_processed_timestamp_seconds` (for a timestamp that tracks the time of the latest record processed in a data processing pipeline)

## Use `count` instead of pluralization

If the value being recorded is the count of the concepts signified by the namespace, name the metric `count` within that namespace. Example: `system.processes.count`

## Should use base units

For example, use seconds, bytes, and meters, not milliseconds, megabytes, or kilometers. See below for a list of base units.

![](https://hackmd.io/_uploads/HJDuu2q-p.png =512x)

## Refs

* https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/
* https://prometheus.io/docs/practices/naming/#metric-names
* https://github.com/micrometer-metrics/micrometer-docs/blob/8a42694a240fd56e69ff91a7cd3e4a07bc4ad3df/src/docs/concepts/naming.adoc
* https://gist.github.com/fralalonde/de6d5caf849a27bbe6a02bbc5da69269