# quantus telemetry
## external miner telemetry
Architecture and implementation plan for reporting miner telemetry to a Polkadot SDK–compatible telemetry server.
Goals
- Report miner health and mining activity to a telemetry server that accepts Substrate/Polkadot SDK telemetry.
- Associate miner telemetry with the node the miner is working for (multi-node capable).
- Keep the transport and envelopes compatible with upstream telemetry expectations so messages are accepted and parsable by existing servers.
Non-goals
- Replace Prometheus metrics already exposed by the miner.
- Tight coupling to any single node process or embedding the miner inside a node.
Constraints and context
- The miner runs as an external HTTP service.
- The miner may not know which node it is serving unless the node tells it so.
- The upstream telemetry protocol (Polkadot SDK `sc-telemetry`) expects:
- A connection that sends line-delimited JSON messages over (secure) WebSocket.
- Every message to be wrapped in a standard envelope: id (UUID-like string), ts (ms since epoch), payload (JSON object), and optional verbosity.
- A “system.connected” handshake with process metadata on connect, followed by periodic interval snapshots and event messages. Payloads must contain at least “msg”: "<event-name>".
### High-level design
#### 1) Association model (how the miner knows the node)
- Primary: node-provided context (recommended)
  - Extend the external miner API to accept an optional `telemetry_context` object in /mine requests (see the example request at the end of this section):
    - `node_telemetry_id` (string): the node’s own telemetry UUID/id, if available
    - `node_peer_id` (string): libp2p PeerId
    - `node_name` (string): e.g., "quantus-node-01"
    - `node_version` (string): e.g., "1.2.3-abcdef"
    - `chain` (string): chain name
    - `genesis_hash` (string): 0x-prefixed or bare hex
  - The miner caches this context per job_id and uses it to link events with the node.
- Secondary: session registration (alternate option)
- Add a POST /identify endpoint for nodes to register themselves first, returning a miner_session_id to be referenced on subsequent /mine calls.
- Fallback: operator-provided CLI/env
- Allow specifying static association values via CLI if the node cannot provide them:
- --telemetry-node-id, --telemetry-node-peer-id, --telemetry-chain, --telemetry-genesis, --telemetry-node-name, --telemetry-node-version
- These act as defaults when a job does not carry an explicit context.
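For concreteness, a /mine request carrying the primary-option context might look like the following. The `telemetry_context` keys come from this plan; the surrounding request fields are an illustrative subset of the existing MiningRequest, not its full schema:
```json
{
  "job_id": "123e4567-e89b-12d3-a456-426614174000",
  "mining_hash": "a0a0...a0",
  "nonce_start": "0000...000",
  "nonce_end": "ffff...fff",
  "telemetry_context": {
    "node_telemetry_id": "e4a2b4fd-6a74-4e0a-b52a-9f0c6d2e68aa",
    "node_peer_id": "12D3KooW...",
    "node_name": "quantus-node-01",
    "node_version": "1.2.3-abcdef",
    "chain": "Resonance",
    "genesis_hash": "0xabc123..."
  }
}
```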
#### 2) Transport (how messages are sent)
- Implement a small async telemetry client in the miner that speaks the upstream telemetry wire format:
- Connect to one or more endpoints (e.g. wss://telemetry.example.org/submit/ or the Substrate multiaddr form).
- On connect, send a system.connected payload with miner and association info.
- Periodically send system.interval, plus miner-specific events.
- Use a non-blocking channel to buffer outbound events; drop oldest on backpressure.
- Auto-reconnect with jittered exponential backoff; resume interval reporting after reconnect (a sketch of this loop follows the payload examples below).
- Envelope format (compatible with sc-telemetry expectations):
```json
{
"id": "<uuid>", // stable per miner process until restart
"ts": 1716400000000, // timestamp in ms
"verbosity": 0, // optional u8
"payload": { ... } // event-specific data (see below)
}
```
- Handshake payload (example subset):
```json
{
"msg": "system.connected",
"name": "quantus-miner",
"chain": "<chain>",
"genesis_hash": "<hex>",
"implementation": "quantus-miner",
"version": "<semver>",
"authority": false,
"platform": {
"os": "<linux>",
"arch": "<x86_64>"
},
// Association hints (non-standard keys but safe to include)
"linked_node": {
"telemetry_id": "<node_telemetry_id>",
"peer_id": "<node_peer_id>",
"name": "<node_name>",
"version": "<node_version>"
}
}
```
- Interval payload (example subset):
```json
{
"msg": "system.interval",
"uptime_ms": 123456,
"engine": "<cpu-fast|cpu-baseline|cpu-montgomery|...>",
"workers": 8,
"hash_rate": 1234567.8,
"active_jobs": 2,
// Association: derive most-recent or multi-node summary if many jobs present
"linked_node_hint": "<node_telemetry_id>"
}
```
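A minimal sketch of the sender task described above, assuming tokio, futures-util, rand, log, and tokio-tungstenite as dependencies; the function name and structure are illustrative, not the final crate API:
```rust
use std::time::Duration;

use futures_util::SinkExt;
use rand::Rng;
use tokio::sync::mpsc;
use tokio_tungstenite::{connect_async, tungstenite::Message};

// Background sender: drains pre-serialized envelope lines from a bounded
// channel and forwards them as line-delimited JSON over WebSocket,
// reconnecting with jittered exponential backoff.
async fn telemetry_sender(endpoint: String, mut rx: mpsc::Receiver<String>) {
    let mut backoff = Duration::from_secs(1);
    loop {
        match connect_async(endpoint.as_str()).await {
            Ok((mut ws, _response)) => {
                backoff = Duration::from_secs(1); // reset after a successful connect
                loop {
                    match rx.recv().await {
                        // One envelope per WebSocket text frame.
                        Some(line) => {
                            if ws.send(Message::Text(line.into())).await.is_err() {
                                break; // connection lost; fall through to reconnect
                            }
                        }
                        None => return, // all producers dropped; shut down
                    }
                }
            }
            Err(err) => log::warn!("telemetry connect to {endpoint} failed: {err}"),
        }
        // Jittered exponential backoff, capped at 60s, to avoid thundering herd.
        let jitter = Duration::from_millis(rand::thread_rng().gen_range(0..1_000));
        tokio::time::sleep(backoff + jitter).await;
        backoff = (backoff * 2).min(Duration::from_secs(60));
    }
}
```
Producers enqueue with `tx.try_send(line)` and treat an error as an overflow drop; a drop-oldest alternative is sketched under Reliability and performance.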
#### 3) Event catalog (message payloads)
All events use the same envelope. The “payload.msg” identifies the event. Unknown messages are accepted by the server but may not be visualized; they remain useful to downstream consumers.
- system.connected
- Sent once per connection; see handshake payload above.
- system.interval
- Sent every N seconds (e.g., 15s). Aggregated miner stats for quick health checks:
- uptime_ms, engine, workers, hash_rate, active_jobs, cpu/mem load (optional), gpu flags (optional).
- Optional “linked_node_hint” with the node_telemetry_id of the most recent or majority job source.
- mining.job_started
- Sent when /mine is accepted.
- Example payload:
```json
{
"msg": "mining.job_started",
"job_id": "<uuid>",
"mining_hash": "<hex64>",
"distance_threshold": "<decimal string>",
"nonce_start": "<hex128>",
"nonce_end": "<hex128>",
"engine": "<engine-name>",
"workers": <u32>,
"linked_node": { ... } // exact telemetry_context from request if present
}
```
- mining.progress
- Sent on periodic progress updates (coarse-grained; do not spam):
- Example payload:
```json
{
"msg": "mining.progress",
"job_id": "<uuid>",
"hash_count": <u64>, // cumulative job count so far
"job_hash_rate": <f64>, // per-job EMA we already maintain
"elapsed_ms": <u64>, // since job start
"threads": [
// optional sampling of a few threads’ EMA rates and last update
{"id": 0, "rate": <f64>, "last_ms_ago": <u64>},
{"id": 1, "rate": <f64>, "last_ms_ago": <u64>}
],
"engine": "<engine-name>",
"linked_node": { ... }
}
```
- mining.found_candidate
- Sent when a candidate is found; the miner will cancel other threads:
- Example payload:
```json
{
"msg": "mining.found_candidate",
"job_id": "<uuid>",
"nonce": "<hex U512>",
"work": "<hex 64 bytes>",
"distance": "<decimal string U512>",
"origin": "<cpu|gpu-g1|gpu-g2|unknown>",
"engine": "<engine-name>",
"linked_node": { ... }
}
```
- mining.completed | mining.failed | mining.cancelled
- Sent on terminal state transition:
- Example payload:
```json
{
"msg": "mining.completed", // or mining.failed / mining.cancelled
"job_id": "<uuid>",
"hash_count": <u64>,
"elapsed_ms": <u64>,
"engine": "<engine-name>",
"linked_node": { ... }
}
```
- mining.result_served
- Sent when the node successfully fetched the result via GET /result/{job_id}:
- Example payload:
```json
{
"msg": "mining.result_served",
"job_id": "<uuid>",
"engine": "<engine-name>",
"linked_node": { ... }
}
```
### Association strategy and rationale
- The miner should use its own telemetry id in the envelope id field to avoid collisions with the node’s connection.
- To associate miner events with the node, include a “linked_node” object inside payloads whenever possible. This preserves compatibility and avoids breaking upstream server assumptions about per-id session identity.
- If an operator absolutely needs the miner to appear “as the node” in telemetry (discouraged), provide an experimental flag that forces the miner to reuse the node’s telemetry id. This is risky and may conflict with the node’s own connection; keep it off by default and clearly documented as unsupported.
### CLI, configuration, and API changes
#### CLI (miner-cli)
- New flags (env var equivalents in parentheses):
- --telemetry-endpoint <url> [repeatable] (MINER_TELEMETRY_ENDPOINTS)
- --telemetry-verbosity <u8> (MINER_TELEMETRY_VERBOSITY) default: 0
- --telemetry-enabled (MINER_TELEMETRY_ENABLED) default: true if endpoints set
- Optional association defaults (used only when not provided per job):
- --telemetry-node-id <string>
- --telemetry-node-peer-id <string>
- --telemetry-chain <string>
- --telemetry-genesis <hex>
- --telemetry-node-name <string>
- --telemetry-node-version <string>
#### Service configuration (miner-service)
- Extend ServiceConfig to carry telemetry options and optional default association hints.
- Instantiate a Telemetry client when endpoints are provided and enabled.
- On service start:
- Generate a stable miner telemetry id (UUID v4) per process lifetime.
- Emit system.connected once per successful connection to each endpoint.
- Start interval task to emit system.interval every N seconds (e.g., 15s).
- When jobs are added/progress/complete, emit corresponding mining.* messages.
#### API (resonance_miner_api)
- Backward-compatible extension of MiningRequest with a new optional field, `telemetry_context`, whose members are all optional strings (Rust sketch after this list):
  - `node_telemetry_id?: string`
  - `node_peer_id?: string`
  - `node_name?: string`
  - `node_version?: string`
  - `chain?: string`
  - `genesis_hash?: string`
- No changes required for MiningResult.
- If the API crate is shared with the node, bump its version and ensure the node client populates the context when available.
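A minimal Rust sketch of this extension, assuming the API crate already uses serde; the derives and attribute placement are illustrative:
```rust
use serde::{Deserialize, Serialize};

// All fields optional, so partially-informed nodes can still send a context.
#[derive(Clone, Debug, Default, Serialize, Deserialize)]
pub struct TelemetryContext {
    pub node_telemetry_id: Option<String>,
    pub node_peer_id: Option<String>,
    pub node_name: Option<String>,
    pub node_version: Option<String>,
    pub chain: Option<String>,
    pub genesis_hash: Option<String>,
}

// Added to MiningRequest; absent in requests from older clients, so existing
// deserialization keeps working unchanged:
// #[serde(default, skip_serializing_if = "Option::is_none")]
// pub telemetry_context: Option<TelemetryContext>,
```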
#### Wire protocol notes (compatibility)
- Endpoints: typical Substrate telemetry uses secure WebSocket submit path (e.g. wss://telemetry.<domain>/submit/), or a multiaddr form commonly found in chain specs: /dns/telemetry.example.org/tcp/443/x-parity-wss/%2Fsubmit%2F
- Envelope:
- id: identify this miner session (unique per process lifetime).
- ts: milliseconds since epoch (number).
- verbosity: optional u8; 0 is the least verbose.
- payload: object containing at least "msg": "<event-name>" and event-specific keys.
- The server tolerates unknown payload keys; avoid removing standard keys (msg, name, chain, genesis_hash, version, authority) from system.connected.
### Reliability and performance
- Non-blocking bounded channel to decouple hot mining paths from I/O.
- On channel full: drop the oldest messages and increment an internal drop counter, optionally reported in the next interval (see the sketch at the end of this list).
- Reconnect with backoff and jitter to prevent thundering herd.
- Do not block mining or HTTP handlers on telemetry I/O; all I/O is async task–driven.
- Keep progress reporting sparse: use the chunk-based mining updates already in place to throttle mining.progress emission (e.g., at most once every few seconds per job).
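One lock-free way to get the drop-oldest behavior described above is tokio's broadcast channel, which overwrites the oldest queued values and reports to a lagging receiver how many it missed. A minimal consumer sketch, assuming pre-serialized envelope lines:
```rust
use tokio::sync::broadcast;

// Consumer side of a drop-oldest buffer: `Lagged(n)` reports how many
// messages were overwritten, feeding the drop counter that the next
// system.interval can report.
async fn drain(mut rx: broadcast::Receiver<String>, dropped: &mut u64) {
    loop {
        match rx.recv().await {
            Ok(line) => {
                // Hand `line` to the WebSocket writer task.
                let _ = line;
            }
            Err(broadcast::error::RecvError::Lagged(n)) => *dropped += n,
            Err(broadcast::error::RecvError::Closed) => break,
        }
    }
}
```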
### Security and privacy
- Do not send secrets or local addresses by default.
- The linked_node fields are supplied by the node or operator and are treated as public operational metadata.
- Allow disabling telemetry completely; default to disabled unless endpoints are configured.
### Implementation plan (milestones)
#### M1 — Foundations
- Create a new crate: crates/miner-telemetry
- Public API (a skeleton sketch follows this milestone’s list):
- TelemetryConfig { enabled: bool, endpoints: Vec<String>, verbosity: u8 }
- TelemetryHandle (cloneable)
- TelemetryNodeLink { node_telemetry_id, node_peer_id, node_name, node_version, chain, genesis_hash }
- start(config) -> TelemetryHandle
- emit_system_connected(...)
- emit_system_interval(...)
- emit(event_name: &str, payload: serde_json::Value)
- Internals:
- Async WS client(s) with line-delimited JSON writer.
- Backpressure channel + task for encoding/sending.
- Reconnect/backoff loop.
- Add feature flag “telemetry” in workspace defaults; crate depends on tokio, serde_json, tungstenite/async-tungstenite (or integrate sc-telemetry crate if preferred).
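A compile-oriented skeleton of that public surface; the names come from this plan, TelemetryNodeLink and the emit_system_* helpers are elided, and the bodies are placeholders rather than an implementation:
```rust
use serde_json::Value;

#[derive(Clone, Debug)]
pub struct TelemetryConfig {
    pub enabled: bool,
    pub endpoints: Vec<String>,
    pub verbosity: u8,
}

/// Cloneable handle; internally wraps a sender into the worker task.
#[derive(Clone)]
pub struct TelemetryHandle {
    tx: tokio::sync::mpsc::Sender<(String, Value)>,
}

impl TelemetryHandle {
    /// Fire-and-forget: never blocks the caller; drops on backpressure.
    pub fn emit(&self, event_name: &str, payload: Value) {
        let _ = self.tx.try_send((event_name.to_owned(), payload));
    }
}

pub fn start(config: TelemetryConfig) -> TelemetryHandle {
    let (tx, _rx) = tokio::sync::mpsc::channel(1024);
    // Real implementation: honor `config.enabled`, spawn one WS worker per
    // endpoint consuming `_rx`, wrap payloads in envelopes, reconnect, etc.
    let _ = config;
    TelemetryHandle { tx }
}
```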
#### M2 — Miner integration
- Extend crates/miner-service:
- Extend ServiceConfig + CLI plumbing (miner-cli) to pass TelemetryConfig and defaults for association hints.
- On service startup:
- Start Telemetry and send system.connected using default association hints if present.
- Spawn interval task that samples aggregated stats and calls emit_system_interval.
- At job lifecycle points:
- handle_mine_request: accept, capture telemetry_context if present, emit mining.job_started.
- MiningJob.update_from_results: periodically emit mining.progress (throttle).
- On best candidate: emit mining.found_candidate.
- On status change to Completed/Failed/Cancelled: emit terminal event.
- handle_result_request after success: emit mining.result_served.
#### M3 — API extension
- Extend resonance_miner_api::MiningRequest with optional telemetry_context (as above) and bump minor version.
- Keep validation permissive: if telemetry_context is present, parse and store; if absent, use CLI defaults if configured.
#### M4 — Hardening and docs
- Add configuration docs to README and this telemetry.md.
- Add an E2E smoke test against a public telemetry instance or a local mock WebSocket server that captures and asserts envelopes.
- Add logging around connect/reconnect, channel overflow, and event emission.
### Testing strategy
- Unit-test envelope construction (id, ts, payload); see the test sketch below.
- Integration-test the WS client with a local echo/mock server.
- E2E manual test:
- Start miner with --telemetry-endpoint pointing at a dev/substrate telemetry server, with verbosity=0.
- Submit a job with telemetry_context populated.
- Verify system.connected lines arrive; observe mining.* events in server logs or consumer.
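A unit-test sketch for envelope construction; `make_envelope` is a hypothetical helper standing in for whatever the crate ends up exposing:
```rust
use serde_json::{json, Value};

// Hypothetical helper mirroring the envelope format defined earlier.
fn make_envelope(id: &str, ts_ms: u64, payload: Value) -> Value {
    json!({ "id": id, "ts": ts_ms, "verbosity": 0u8, "payload": payload })
}

#[test]
fn envelope_has_required_fields() {
    let env = make_envelope(
        "88d60a1d-38e0-4fdc-b2ed-8d2127cdd1b2",
        1_716_400_123_456,
        json!({ "msg": "system.interval", "uptime_ms": 123 }),
    );
    assert!(env["id"].is_string());
    assert!(env["ts"].is_u64());
    assert_eq!(env["payload"]["msg"], "system.interval");
}
```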
### Operational guidance
- Sample flags:
- --telemetry-endpoint wss://telemetry.res.fm/submit/
- --telemetry-verbosity 0
- --telemetry-node-id <uuid>
- --telemetry-chain Resonance
- --telemetry-genesis 0x...
- Using chain spec endpoints:
- Accept multiaddr strings used in Polkadot SDK chain specs (e.g., /dns/telemetry.res.fm/tcp/443/x-parity-wss/%2Fsubmit%2F); normalize to a WebSocket URL internally, as sketched below.
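A sketch of that normalization covering only the common chain-spec shape (an assumption, not a general multiaddr parser):
```rust
// Converts "/dns/<host>/tcp/<port>/x-parity-wss/<pct-encoded-path>" into a
// WebSocket URL; anything else is left for a fuller parser.
fn normalize_endpoint(addr: &str) -> Option<String> {
    let parts: Vec<&str> = addr.trim_start_matches('/').split('/').collect();
    match parts.as_slice() {
        ["dns", host, "tcp", port, "x-parity-wss", path] => {
            // The path segment is percent-encoded, e.g. "%2Fsubmit%2F" => "/submit/".
            let path = path.replace("%2F", "/").replace("%2f", "/");
            Some(format!("wss://{host}:{port}{path}"))
        }
        _ => None,
    }
}

// normalize_endpoint("/dns/telemetry.res.fm/tcp/443/x-parity-wss/%2Fsubmit%2F")
//   == Some("wss://telemetry.res.fm:443/submit/")
```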
### Risks, alternatives, and future work
- Reusing the node’s telemetry id (the discouraged experimental option above) can cause conflicts; avoid unless you fully control both ends and accept undefined server behavior.
- If you need richer linkage, consider adding a miner-specific “subsystem” tag to payloads for consumer-side correlation.
- If Substrate’s sc-telemetry crate becomes more modular, consider embedding it directly to reduce protocol drift.
- Consider buffering to disk or using a local UDP/syslog sink if WS egress is unreliable in operator environments.
### Appendix: example messages
- Envelope
```json
{
"id": "88d60a1d-38e0-4fdc-b2ed-8d2127cdd1b2",
"ts": 1716400123456,
"verbosity": 0,
"payload": { /* event payloads below */ }
}
```
- system.connected
```json
{
"msg": "system.connected",
"name": "quantus-miner",
"implementation": "quantus-miner",
"version": "0.3.0",
"chain": "Resonance",
"genesis_hash": "0xabc123...",
"authority": false,
"platform": {"os": "linux", "arch": "x86_64"},
"linked_node": {
"telemetry_id": "e4a2b4fd-6a74-4e0a-b52a-9f0c6d2e68aa",
"peer_id": "12D3KooW...",
"name": "resonance-node-01",
"version": "1.7.2"
}
}
```
- mining.job_started
```json
{
"msg": "mining.job_started",
"job_id": "123e4567-e89b-12d3-a456-426614174000",
"mining_hash": "a0a0...a0",
"distance_threshold": "1234567890000000000000",
"nonce_start": "0000...000",
"nonce_end": "ffff...fff",
"engine": "cpu-fast",
"workers": 8,
"linked_node": { "telemetry_id": "e4a2b4fd-..." }
}
```
- mining.progress
```json
{
"msg": "mining.progress",
"job_id": "123e4567-e89b-12d3-a456-426614174000",
"hash_count": 987654321,
"job_hash_rate": 1234567.8,
"elapsed_ms": 15000,
"engine": "cpu-fast",
"linked_node": { "telemetry_id": "e4a2b4fd-..." }
}
```
- mining.found_candidate
```json
{
"msg": "mining.found_candidate",
"job_id": "123e4567-e89b-12d3-a456-426614174000",
"nonce": "01ab...",
"work": "00ff...",
"distance": "42",
"origin": "cpu",
"engine": "cpu-fast",
"linked_node": { "telemetry_id": "e4a2b4fd-..." }
}
```
- mining.completed
```json
{
"msg": "mining.completed",
"job_id": "123e4567-e89b-12d3-a456-426614174000",
"hash_count": 1000456789,
"elapsed_ms": 16234,
"engine": "cpu-fast",
"linked_node": { "telemetry_id": "e4a2b4fd-..." }
}
```
### Summary
This plan equips the external miner with a lightweight telemetry client that speaks a Polkadot SDK–compatible protocol, introduces a clean way to associate miner events with the node via optional request metadata, and keeps the miner independent and robust. It complements the miner’s existing Prometheus metrics with real-time, fleet-wide observability suitable for Substrate telemetry backends.
## node telemetry
This section describes how telemetry reporting and metrics instrumentation are implemented in the Quantus node. It summarizes where telemetry is configured, what is reported, and the structures used for metrics.
Scope:
- Substrate telemetry reporting (sc-telemetry)
- Prometheus metrics (built-in Substrate metrics and custom business metrics)
File references:
- Node service bootstrapping: `chain/node/src/service.rs`
- Chain specifications: `chain/node/src/chain_spec.rs`
- Custom Prometheus metrics: `chain/node/src/prometheus.rs`
- Dependency declarations: `chain/Cargo.toml`, `chain/node/Cargo.toml`
---
### 1) Substrate Telemetry (sc-telemetry)
Substrate’s telemetry framework is used to report node-level events and state to a telemetry backend.
Where it’s configured:
- `chain/node/src/chain_spec.rs`: telemetry endpoints are defined for the Live Testnet and Heisenberg environments via `with_telemetry_endpoints`.
- `chain/node/src/service.rs`: a `TelemetryWorker` and `Telemetry` handle are created when the node is configured with telemetry endpoints. The worker is spawned and its handle is passed to `sc_service::spawn_tasks`, enabling built-in Substrate telemetry emission.
Key code paths:
- `chain/node/src/chain_spec.rs`
- Live Testnet: `TelemetryEndpoints::new(vec![("/dns/telemetry.res.fm/tcp/443/x-parity-wss/%2Fsubmit%2F".to_string(), 0)])`
- Heisenberg: `TelemetryEndpoints::new(vec![("/dns/telemetry.res.fm/tcp/443/x-parity-wss/%2Fsubmit%2F".to_string(), 0)])`
- Development: no telemetry endpoints are configured.
- `chain/node/src/service.rs`
- Telemetry setup occurs conditionally (a conceptual sketch follows this list):
- Endpoints are read from `Configuration.telemetry_endpoints`.
- `TelemetryWorker::new(16)` is created, and `worker.handle().new_telemetry(endpoints)` produces a `Telemetry` handle.
- The worker is spawned as an async task named `"telemetry"`.
- The handle is provided to `sc_service::spawn_tasks` to wire up Substrate’s built-in telemetry.
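For orientation, the wiring follows the standard Substrate node-template pattern, roughly as below (paraphrased, not a verbatim excerpt from service.rs):
```rust
use sc_service::{Configuration, TaskManager};
use sc_telemetry::{Telemetry, TelemetryWorker};

// Conceptual wiring: initialize telemetry only when endpoints are configured,
// then spawn the long-running worker task named "telemetry".
fn init_telemetry(
    config: &Configuration,
    task_manager: &TaskManager,
) -> Result<Option<Telemetry>, sc_telemetry::Error> {
    let worker_and_telemetry = config
        .telemetry_endpoints
        .clone()
        .filter(|endpoints| !endpoints.is_empty())
        .map(|endpoints| -> Result<_, sc_telemetry::Error> {
            let worker = TelemetryWorker::new(16)?;
            let telemetry = worker.handle().new_telemetry(endpoints);
            Ok((worker, telemetry))
        })
        .transpose()?;

    Ok(worker_and_telemetry.map(|(worker, telemetry)| {
        task_manager.spawn_handle().spawn("telemetry", None, worker.run());
        telemetry // later handed to sc_service::spawn_tasks
    }))
}
```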
When it is active:
- Telemetry is enabled only when non-empty telemetry endpoints are configured (i.e., in the Live Testnet and Heisenberg chain specs).
- If no endpoints are provided (e.g., Development), telemetry is not initialized.
What is reported:
- This codebase does not define custom telemetry events or payload shapes.
- With the `Telemetry` handle attached, Substrate’s internal components emit their standard telemetry events (e.g., service lifecycle, block import and finalization updates, networking information, etc.). The exact event set is maintained by upstream Substrate (in the `sc-telemetry` crate and related services).
Message structure:
- Event serialization and transmission are handled internally by `sc-telemetry`. This repository does not implement or override the event schema.
- The configured endpoint uses a secure WebSocket (the `x-parity-wss` multiaddr protocol) and follows Substrate’s telemetry protocol.
Relevant dependency flags:
- `chain/node/Cargo.toml` sets `sc-telemetry.default-features = true` for the node.
- The workspace also includes `sc-telemetry` with `default-features = false`, but the node crate enables defaults explicitly.
---
### 2) Prometheus Metrics
Prometheus metrics are exposed via Substrate’s Prometheus integration when a Prometheus registry is configured in `sc_service::Configuration`.
There are three sources of metrics in this codebase:
A) Built-in Substrate metrics wired by the node
- Transaction pool metrics:
- `chain/node/src/service.rs` passes `config.prometheus_registry()` into `sc_transaction_pool::Builder::with_prometheus(...)`.
- When a registry is present, Substrate’s transaction pool registers its standard metrics.
- Network notification metrics:
- `chain/node/src/service.rs` calls `N::register_notification_metrics(config.prometheus_registry())` during network initialization, enabling standard network metrics if a registry is present.
- Other service and runtime metrics:
- Substrate components (e.g., sync, RPC, etc.) register metrics with the provided registry through `sc_service::spawn_tasks` and related setup.
B) Custom business metrics: `qpow_metrics`
- Defined in `chain/node/src/prometheus.rs` via the `ResonanceBusinessMetrics` helper.
- Metric type: GaugeVec
- Name: `qpow_metrics`
- Help: `QPOW Metrics`
- Labels: `["data_group"]` (single label key)
- This helper registers the GaugeVec with the provided `prometheus::Registry` (if present) and spawns a monitoring task that updates metrics when new blocks are imported.
Update flow:
- The monitoring task (`"monitoring_qpow"`) subscribes to `client.import_notification_stream()`.
- On each imported block, the following runtime API calls are made to retrieve the latest values:
- `get_block_time_sum`
- `get_median_block_time`
- `get_distance_threshold` (U512)
- `get_last_block_time`
- `get_last_block_duration`
- `get_chain_height`
- `get_difficulty` (U512)
- The metric values are written to the GaugeVec using the label values:
- `data_group` ∈ {
- `"chain_height"`
- `"block_time_sum"`
- `"median_block_time"`
- `"distance_threshold"`
- `"difficulty"`
- `"last_block_time"`
- `"last_block_duration"`
}
- Numeric conversion details:
- U512 values (e.g., `distance_threshold`, `difficulty`) must be converted to f64 for gauges.
- For `distance_threshold`: the implementation packs the highest-order 64 bits of the U512 into an f64 (`pack_u512_to_f64`).
- For `difficulty`: the code uses `difficulty.low_u64()` as the gauge value; a conceptual sketch of the registration and update flow follows.
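A conceptual sketch of the registration and update flow (paraphrased; the actual implementation lives in `chain/node/src/prometheus.rs`):
```rust
use prometheus::{GaugeVec, Opts, Registry};

// Register the single GaugeVec described above with one "data_group" label.
fn register_qpow_metrics(registry: &Registry) -> prometheus::Result<GaugeVec> {
    let gauge = GaugeVec::new(Opts::new("qpow_metrics", "QPOW Metrics"), &["data_group"])?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}

// Inside the "monitoring_qpow" task, on each imported block:
// gauge.with_label_values(&["chain_height"]).set(chain_height as f64);
// gauge.with_label_values(&["difficulty"]).set(difficulty.low_u64() as f64); // lossy
```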
C) Transitive metrics crates
- The project depends on substrate and polkadot crates that include metrics support (e.g., `substrate-prometheus-endpoint`, `polkadot-node-metrics`), but this repository doesn’t directly configure them beyond wiring a registry in the node.
How the registry is provided:
- The `prometheus::Registry` originates from `Configuration.prometheus_registry()` which is typically set by CLI flags handled by the Substrate service layer.
- When a registry is present, all sources listed above register their metrics and make them available to the node’s metrics endpoint.
---
### 3) Summary of Exposed Metrics
- Name: `qpow_metrics` (GaugeVec)
- Labels:
- `data_group`: one of:
- `chain_height` → f64(height)
- `block_time_sum` → f64(total_block_time)
- `median_block_time` → f64(median_seconds)
- `distance_threshold` → f64(high 64 bits of U512)
- `difficulty` → f64(low 64 bits of U512)
- `last_block_time` → f64(timestamp_seconds)
- `last_block_duration` → f64(seconds)
- Populated on block import by the `"monitoring_qpow"` task if a Prometheus registry is configured.
Additionally, via Substrate’s integrations:
- Transaction pool metrics (standard Substrate counters/gauges) when `with_prometheus(registry)` is used.
- Network notification metrics when `register_notification_metrics(registry)` is called.
- Other Substrate component metrics when the node is spawned with a registry.
---
### 4) How to Enable/Disable
Telemetry (sc-telemetry):
- Enabled per chain spec by providing non-empty telemetry endpoints.
- Enabled for: Live Testnet (Resonance), Heisenberg.
- Not configured for: Development.
- To disable telemetry for a chain, remove the `with_telemetry_endpoints` call (or pass an empty endpoint list) in its `ChainSpec` builder.
Prometheus metrics:
- Enabled when a `prometheus::Registry` is present in `Configuration`.
- If the registry is not provided (e.g., omitted CLI flags), no metrics registration or updates will occur.
---
### 5) Extending Metrics
To add new custom metrics:
1. Obtain the registry from `Configuration.prometheus_registry()`.
2. Register your metric (Counter, Gauge, Histogram, etc.) with the registry.
3. Spawn a task and/or hook into relevant streams (e.g., import notifications) to update the metric.
4. Ensure any non-f64 values are converted appropriately before setting gauge values.
For example, follow the structure in `chain/node/src/prometheus.rs`:
- Create a registry-backed metric.
- Spawn an essential task to observe events.
- Update metrics within that loop.
---
### 6) Notes and Considerations
- This codebase does not define custom sc-telemetry event types or JSON message schemas; it relies on upstream Substrate.
- The `"telemetry"` and `"monitoring_qpow"` tasks are long-running and tied to the node lifecycle.
- For large integers (U512), avoid lossy conversion where precise representation is required; gauges are f64 and thus inherently lossy. If exact values are needed, consider additional metrics or alternative encodings (e.g., split high/low parts or use counters/histograms where appropriate).
- Avoid registering the same metric multiple times; metrics must be registered once per registry.
---
## Telemetry reporting in the Polkadot SDK
This section explains how telemetry is implemented in the upstream Polkadot SDK, where it is used, which events are reported, and what message structures are sent to telemetry backends.
The SDK uses the Substrate telemetry subsystem (`sc-telemetry`) to:
- Describe a node and its environment to the telemetry server at connection time.
- Emit structured events during normal operation (block import, finality, consensus, tx pool, etc.).
- Transport events over WebSocket to one or more configured telemetry endpoints.
This is a high-level guide intended to help you discover, reason about, and extend telemetry in this codebase.
---
### Architecture overview
- Producer API
- `sc_telemetry::telemetry!` macro is the primary API for reporting events.
- Components carry an `Option<sc_telemetry::TelemetryHandle>`; when `Some`, the macro serializes a message and sends it to the background worker.
- Worker and transport
- `sc_telemetry::TelemetryWorker` receives messages over an internal channel.
- `TelemetryWorker` manages connections to a set of endpoints (`sc_telemetry::TelemetryEndpoints`), one sink per endpoint.
- Each endpoint has a maximum verbosity level. Events are routed only if `event_verbosity <= endpoint_max_verbosity`.
- Transport is implemented with libp2p over WebSocket with a connection timeout and automatic reconnection (with randomized backoff).
- Initialization
- Nodes compose a `sc_telemetry::Telemetry` from `sc_telemetry::TelemetryWorkerHandle::new_telemetry(endpoints)`.
- `Telemetry::start_telemetry(connection_message)` must be called once during service startup.
- The `TelemetryHandle` returned by `Telemetry::handle()` can be cloned and passed across subsystems to report events.
- Configuration
- Chain specs optionally contain `telemetryEndpoints` (list of `(endpoint, verbosity)`).
- If no endpoints are configured, telemetry is inactive and the `TelemetryHandle` is typically `None`.
---
### Message model and schema
All messages sent to the telemetry server share a common envelope:
- Envelope (added by the worker when dispatching):
- `id`: Unique identifier (u64) for the node registration within the process.
- `ts`: RFC 3339 timestamp (string) produced at dispatch time.
- `payload`: The event payload (JSON object).
- Payload (produced by `telemetry!` or connection init):
- `msg`: Event name (string). Examples: `system.connected`, `block.import`, `afg.received_prevote`.
- Additional fields: Event-specific key/value pairs, serialized via `serde_json`.
- Verbosity
- Each event is sent with a verbosity level (u8). Predefined constants:
- `SUBSTRATE_INFO = 0`
- `CONSENSUS_INFO = 1`
- `CONSENSUS_WARN = 4`
- `CONSENSUS_DEBUG = 5`
- `CONSENSUS_TRACE = 9`
- `SUBSTRATE_DEBUG = 9`
- Endpoints receive the message only if their configured max-verbosity is greater than or equal to the event’s verbosity.
---
### Connection handshake payload
When telemetry is started, the worker sends an initial connection payload to each endpoint with `msg = "system.connected"` and the following fields:
- Node identity and runtime:
- `name`: Node name.
- `implementation`: Implementation identifier.
- `version`: Implementation version.
- `config`: Implementation-specific config string (may be empty).
- `chain`: Chain name.
- `genesis_hash`: Genesis block hash (string).
- `authority`: Whether this node is an authority (bool).
- `startup_time`: Milliseconds since UNIX epoch when the node started (string).
- `network_id`: libp2p PeerId (base58).
- Build/target platform:
- `target_os`: Target operating system.
- `target_arch`: Target ISA/architecture.
- `target_env`: Target platform ABI/libc.
- System info (optional):
- `sysinfo` (object):
- `cpu`: CPU model (string, optional).
- `memory`: Total memory in bytes (u64, optional).
- `core_count`: Physical CPU cores (u32, optional).
- `linux_kernel`: Kernel version (string, optional).
- `linux_distro`: Linux distribution (string, optional).
- `is_virtual_machine`: Whether running under a VM (bool, optional).
Note: The worker adds `ts` to the connection payload before sending.
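Assembled, a connection payload might look like the following (illustrative values; field presence varies by build and host, and the hashes/ids are placeholders):
```json
{
  "msg": "system.connected",
  "name": "my-node",
  "implementation": "parity-polkadot",
  "version": "1.0.0",
  "config": "",
  "chain": "Polkadot",
  "genesis_hash": "0x91b1...",
  "authority": false,
  "startup_time": "1716400000000",
  "network_id": "12D3KooW...",
  "target_os": "linux",
  "target_arch": "x86_64",
  "target_env": "gnu",
  "sysinfo": {
    "cpu": "Example CPU Model",
    "memory": 68719476736,
    "core_count": 16,
    "linux_kernel": "6.8.0",
    "linux_distro": "Ubuntu 24.04",
    "is_virtual_machine": false
  },
  "ts": "2024-05-22T18:28:43.123Z"
}
```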
---
### Event catalog
Below is a non-exhaustive (but representative) catalog of telemetry events produced by the SDK, grouped by subsystem. For each event, we include its name (`payload.msg`), typical verbosity, and common fields. Field types are derived from the code and may vary depending on context.
Tip: Treat the key/value pairs as a schema that may grow over time. Always handle unknown fields gracefully.
#### System and client (block import, notifications, metrics)
- block.import (verbosity: SUBSTRATE_INFO)
- When a block is imported as new best (rate-limited during initial sync).
- Fields: `height` (u64), `best` (hash), `origin` (enum-as-string).
- notify.finalized (verbosity: SUBSTRATE_INFO)
- When finality notification is issued.
- Fields: `height` (stringified number), `best` (hash).
- txpool.import (verbosity: SUBSTRATE_INFO)
- Periodic summary after at least one tx import within the last second.
- Fields: `ready` (u64), `future` (u64).
- system.interval (verbosity: SUBSTRATE_INFO)
- Periodic system snapshot; reported in two groups:
- Chain and node usage snapshot:
- Fields: `height` (u64), `best` (hash), `txcount` (u64), `finalized_height` (u64), `finalized_hash` (hash), `used_state_cache_size` (bytes as u64).
- Network snapshot:
- Fields: `peers` (u64), `bandwidth_download` (bytes/sec), `bandwidth_upload` (bytes/sec).
#### Basic authorship
- prepared_block_for_proposing (verbosity: CONSENSUS_INFO)
- After producing a candidate block.
- Fields: `number` (block number), `hash` (hash).
#### Slot-based consensus (framework used by BABE/AURA)
- slots.discarding_proposal_took_too_long (verbosity: CONSENSUS_INFO)
- When a proposal attempt exceeded the allowed time.
- Fields: `slot` (slot number).
- slots.unable_fetching_authorities (verbosity: CONSENSUS_WARN)
- Error fetching authorities for the current slot/parent.
- Fields: `slot` (parent hash), `err` (debug string).
- slots.skipping_proposal_slot (verbosity: CONSENSUS_DEBUG)
- When the node backs off proposal due to sync/network conditions.
- Fields: `authorities_len` (Option<u32> serialized value).
- slots.starting_authorship (verbosity: CONSENSUS_DEBUG)
- Beginning of authorship for a slot.
- Fields: `slot_num` (slot number).
- slots.unable_authoring_block (verbosity: CONSENSUS_WARN)
- Failure to author a block.
- Fields: `slot` (slot number), `err` (debug string).
- slots.pre_sealed_block (verbosity: CONSENSUS_INFO)
- Just before importing the pre-sealed block (post-hash / pre-hash info).
- Fields: `header_num` (block number), `hash_now` (hash), `hash_previously` (hash).
- slots.err_with_block_built_on (verbosity: CONSENSUS_WARN)
- Error while importing the built block.
- Fields: `hash` (parent hash), `err` (debug string).
#### AURA consensus
- aura.checked_and_importing (verbosity: CONSENSUS_TRACE)
- A checked header is being imported.
- Fields: `pre_header` (debug string/structure).
- aura.header_too_far_in_future (verbosity: CONSENSUS_DEBUG)
- AURA header rejected as too far in the future.
- Fields: `hash` (hash), `a` (debug), `b` (debug).
Note: Cumulus (parachain) AURA logic emits similar messages in parachain contexts.
#### BABE consensus
- babe.checked_and_importing (verbosity: CONSENSUS_TRACE)
- A checked header is being imported.
- Fields: `pre_header` (debug string/structure).
- babe.header_too_far_in_future (verbosity: CONSENSUS_DEBUG)
- BABE header rejected as too far in the future.
- Fields: `hash` (hash), `a` (debug), `b` (debug).
#### GRANDPA finality (messages prefixed with afg.*)
Inbound, validation, and gossip:
- afg.received_propose / afg.received_prevote / afg.received_precommit (verbosity: CONSENSUS_INFO)
- Received vote messages from peers (when the validator set is small enough to report).
- Fields: `voter` (stringified AuthorityId), `target_number` (block number), `target_hash` (hash).
- afg.received_commit (verbosity: CONSENSUS_INFO)
- Received commit message.
- Fields: `contains_precommits_signed_by` (array of AuthorityIds as strings), `target_number` (block number), `target_hash` (hash).
- afg.bad_msg_signature / afg.bad_commit_msg_signature / afg.bad_catch_up_msg_signature (verbosity: CONSENSUS_DEBUG)
- Signature verification failures for various message types.
- Fields: `signature` or `id` (AuthorityId), depending on context.
- afg.malformed_compact_commit (verbosity: CONSENSUS_DEBUG)
- Malformed compact commit detected.
- Fields: `precommits_len` (usize), `auth_data_len` (usize), `precommits_is_empty` (bool).
- afg.err_decoding_msg (verbosity: CONSENSUS_DEBUG)
- Error while decoding a gossip message (fields may be empty/placeholders).
Outbound and issuance:
- afg.announcing_blocks_to_voted_peers (verbosity: CONSENSUS_DEBUG)
- Notifying voted peers about blocks.
- Fields: `block` (hash), `round` (round number), `set_id` (set id).
- afg.prevote_issued / afg.precommit_issued (verbosity: CONSENSUS_DEBUG)
- Local votes issued.
- Fields: `round` (round number), `target_number` (block number), `target_hash` (hash).
- afg.commit_issued (verbosity: CONSENSUS_DEBUG)
- Local commit issued.
- Fields: `target_number` (block number), `target_hash` (hash).
Finalization and authority set:
- afg.finalized_blocks_up_to (verbosity: CONSENSUS_INFO)
- Finalizing blocks up to a given point.
- Fields: `number` (block number), `hash` (hash).
- afg.authority_set (verbosity: CONSENSUS_INFO)
- Current authority set context published (e.g., on voter start).
- Fields: `authority_id` (string), `authority_set_id` (set id), `authorities` (array of AuthorityIds as strings).
- afg.generating_new_authority_set (verbosity: CONSENSUS_INFO)
- Transition to a new authority set.
- Fields: `number` (block number), `hash` (hash), `authorities` (array), `set_id` (new set id).
- afg.applying_forced_authority_set_change / afg.applying_scheduled_authority_set_change (verbosity: CONSENSUS_INFO)
- Applying authority set changes.
- Fields: `block` (block number).
- afg.loading_authorities (verbosity: CONSENSUS_DEBUG)
- Loading authorities (e.g., from genesis).
- Fields: `authorities_len` (usize).
---
### Endpoints and configuration
- Endpoints format
- `TelemetryEndpoints` accepts a list of (endpoint, verbosity) pairs.
- Endpoints can be:
- Multiaddresses (e.g., `/dns/telemetry.polkadot.io/tcp/443/x-parity-wss/%2Fsubmit%2F`).
- WebSocket URLs (e.g., `wss://telemetry.polkadot.io/submit/`), which are internally converted to multiaddresses.
- Chain spec
- `telemetryEndpoints` is an optional list of `[endpoint, verbosity]`.
- When present, services initialize the telemetry worker and start reporting.
- Example (conceptual JSON): `"telemetryEndpoints": [["wss://telemetry.polkadot.io/submit/", 0]]`.
- CLI / service layer
- The service builds a `ConnectionMessage` with identity, chain, genesis hash, platform, and optional `sysinfo`, then calls `start_telemetry`.
- The returned `TelemetryHandle` is propagated to subsystems.
---
### Runtime behavior and reliability
- Connection lifecycle
- Each endpoint is dialed independently over WebSocket.
- On (re)connect, the worker sends buffered connection messages (e.g., `system.connected`) and emits connection notifications to subscribers (if any).
- Randomized reconnect delay (10–20 seconds) mitigates thundering-herd effects.
- Backpressure and channel behavior
- Event sending is non-blocking for producers.
- If the channel is full, the producer logs a trace and the message is dropped; the worker keeps running.
- During reconnect, outgoing telemetry messages for that endpoint are discarded to avoid stalling producers.
- Filtering by verbosity
- Before dispatch, the worker checks the event’s verbosity against each endpoint’s max verbosity to decide routing.
---
### Security and privacy notes
- Identity
- `network_id` is the node’s libp2p PeerId (base58).
- `genesis_hash`, `chain`, and `version` identify the network and build.
- System info
- Optional `sysinfo` fields (CPU model, memory, core count, distro, kernel, VM flag) are sent if available on the host system.
- Event content
- Event payloads include only chain data and debugging context relevant to consensus, syncing, and authorship.
- No secrets or private keys are included in telemetry messages.
---
### Extending telemetry
To add new telemetry events in a component:
1. Accept an `Option<TelemetryHandle>` in the component’s constructor or context.
2. Use the `telemetry!(telemetry; VERBOSITY; "your.message.name"; "key1" => value1, "key2" => ?debuggable_value2, ...);` macro where appropriate.
- `=>` serializes a value as JSON.
- `=> ?` stringifies via `Debug`.
3. Choose a sensible verbosity for the event (INFO/WARN/DEBUG/TRACE).
4. Ensure the component is wired with a real handle when telemetry is enabled (otherwise `Option::None` is fine).
Example shape (conceptual, not a literal code snippet): message envelope contains `id`, `ts`, and `payload` where `payload` includes `msg: "your.message.name"` and the provided keys.
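For instance (a conceptual sketch modeled on existing sc-telemetry call sites; the event name and fields are placeholders):
```rust
use sc_telemetry::{telemetry, TelemetryHandle, CONSENSUS_INFO};

// `telemetry` is the Option<TelemetryHandle> wired in at construction time;
// when it is None, the macro expands to a no-op.
fn report_example(telemetry: &Option<TelemetryHandle>, number: u64, hash: &impl std::fmt::Debug) {
    telemetry!(
        telemetry;
        CONSENSUS_INFO;
        "your.message.name";
        "number" => number, // `=>` serializes the value as JSON
        "hash" => ?hash,    // `=> ?` stringifies via Debug
    );
}
```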
---
### Where to look in the codebase
Key crates and areas that implement telemetry reporting and transport:
- Transport and worker:
- `substrate/client/telemetry` (worker, endpoints, transport, macros, connection message).
- Service and initialization:
- `substrate/client/service` (telemetry initialization, txpool telemetry, system interval metrics).
- Client (block import and notifications):
- `substrate/client/service/src/client` (block import, finality notifications).
- Consensus:
- Slots framework: `substrate/client/consensus/slots`.
- AURA: `substrate/client/consensus/aura`.
- BABE: `substrate/client/consensus/babe`.
- GRANDPA: `substrate/client/consensus/grandpa` (authorities, communication, environment, lib).
- Basic authorship:
- `substrate/client/basic-authorship`.
- Cumulus (parachain nodes):
- AURA-related telemetry in `cumulus/client/consensus/aura`.
---
### Summary
- Telemetry is opt-in and activated via chain spec `telemetryEndpoints`.
- `sc-telemetry` provides the worker, transport, and a macro-based reporting API.
- The SDK reports a rich set of events across block import, finality, consensus, slot authorship, and tx pool operations.
- Messages are simple JSON objects wrapped by a common envelope (`id`, `ts`, `payload`), with a `msg` field naming the event and additional event-specific fields.
- Endpoint verbosity controls which events are forwarded, enabling flexible and scalable observability across networks.