# Kafka Design Doc

## Overview & Goal

This document details the proposed strategy for publishing real-time financial market data, specifically prices and order book depth for both Spot and Perpetual markets, from the Pulse service into Apache Kafka. The primary goal is to establish a reliable, scalable, and well-defined data pipeline that can be easily consumed by downstream applications, starting with a new Rust SDK. This design prioritizes clarity, performance, and future extensibility.

## Context

### Data Producer
The Pulse service, which aggregates and processes data from various sources (Exchanges, Blockchains via Fetchers, Solvers, Pricers).

### Data Types
Price and Orderbook, qualified by Market Type (Spot/Perpetual).

### Message Broker
Apache Kafka.

### Consumers
The Ingestor Service, and potentially internal systems like the AI Decision Engine, Backtesting Module, and Monitoring services.

### Infra Overview
Data flows from Exchanges/Blockchains -> Pulse (Data Ingestion & Processing) -> Kafka (New Component) -> Consumers.

### Key Dimension
Data must be differentiated based on market_type (e.g., "spot", "perpetual", "options") as stored in the market_types table.

## Data Types to Publish

The primary data types are:

### Price
Represents the calculated or fetched price of a specific instrument.

### Orderbook
Represents the order book for a specific instrument and exchange.

## Kafka Topic Strategy

We will maintain the Topic per Data Type strategy for primary organization, but the distinction between Spot and Perpetual will be handled within the message payload and the partitioning key.

### Topic 1: pragma-prices-v1
- Purpose: publish all price updates generated by Pulse (both Spot and Perpetual).
- Rationale: keeps related price data together. Consumers needing only Spot or only Perpetual can filter on a field in the message; consumers needing both don't have to subscribe to multiple topics.

### Topic 2: pragma-orderbook-v1
- Purpose: publish all order book depth updates generated by Pulse (both Spot and Perpetual).
- Rationale: as with prices, keeps depth data together while allowing filtering.

Naming convention: remains the same (pragma-{datatype}-v1).

## Partitioning Strategy

To ensure scalability and ordered processing within a specific data feed, including its market type, the partition key must incorporate the market_type.

### Partition Key

A structured string or compact representation combining the core identifiers of a data feed. Recommended format:

`{market_type}-{network_id}-{source_id}-{token_ticker}`

- market_type: the type of market (e.g., spot or perpetual, derived from market_types.name or ID).
- network_id: identifier for the network (e.g., 1, unknown).
- source_id: identifier for the data source (e.g., 10, binance).
- token_ticker: the unique identifier for the asset or contract (e.g., "BTC", from tokens.ticker).

Example keys:

```
perpetual-unknown-paradex-BTC
spot-ethereum-ekubo-ETH
```

### Rationale

#### Ordering
All messages for the exact same instrument feed (e.g., Binance Spot BTC/USD prices) hash to the same partition, guaranteeing order for that specific feed. Crucially, Spot and Perpetual data for the same underlying asset (e.g., BTC Spot vs. BTC Perp) use distinct keys, so the two streams are ordered independently; even when two keys happen to hash to the same partition, each key's messages still arrive in order, because Kafka guarantees ordering within a partition.

#### Parallelism
Different feeds are distributed across partitions for parallel processing.

#### Targeted Consumption
Consumers can logically process partitions knowing the market_type associated with the data within, even before deserializing the full message, if partition assignments are tracked.
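To make the key format concrete, here is a minimal sketch of how Pulse could build the partition key on the producer side. The `MarketType` enum and the `partition_key` function are illustrative assumptions, not existing Pulse code.

```rust
/// Illustrative market-type enum; string values mirror market_types.name.
#[derive(Debug, Clone, Copy)]
enum MarketType {
    Spot,
    Perpetual,
}

impl MarketType {
    fn as_str(self) -> &'static str {
        match self {
            MarketType::Spot => "spot",
            MarketType::Perpetual => "perpetual",
        }
    }
}

/// Builds the `{market_type}-{network_id}-{source_id}-{token_ticker}` key.
/// Kafka's default partitioner hashes this key, so every message for the
/// same feed lands on the same partition and stays ordered.
fn partition_key(
    market_type: MarketType,
    network_id: &str,
    source_id: &str,
    token_ticker: &str,
) -> String {
    format!(
        "{}-{}-{}-{}",
        market_type.as_str(),
        network_id,
        source_id,
        token_ticker
    )
}

fn main() {
    assert_eq!(
        partition_key(MarketType::Perpetual, "unknown", "paradex", "BTC"),
        "perpetual-unknown-paradex-BTC"
    );
}
```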
## Message Schemas

Adding the market_type field to the message payload is essential for downstream consumers. (We should probably use Protobuf here.)

Common fields (present in all messages):

```json
{
  "message_id": "uuid",
  "event_timestamp": "iso8601",
  "publish_timestamp": "iso8601",
  "producer": "pulse",
  "version": "1.1" // Schema version updated
}
```

Schema: pragma-prices-v1

```json
{
  // --- Common Fields ---
  "message_id": "...",
  "event_timestamp": "...",
  "publish_timestamp": "...",
  "producer": "pulse",
  "version": "1.1",

  // --- Price Specific Fields ---
  "data_type": "price",
  "market_type": "spot", // MANDATORY: "spot" or "perpetual"
  "token_ticker": "BTC",
  "source_id": "binance",
  "network_id": "unknown",
  "price": "65432.10",
  "metadata": { /* ... optional ... */ }
}
```

Schema: pragma-orderbook-v1

```json
{
  // --- Common Fields ---
  "message_id": "...",
  "event_timestamp": "...",
  "publish_timestamp": "...",
  "producer": "pulse",
  "version": "1.1",

  // --- Depth Specific Fields ---
  "data_type": "market_depth",
  "market_type": "spot", // MANDATORY: "spot" or "perpetual"
  "token_ticker": "BTC",
  "source_id": "binance",
  "network_id": "unknown",
  "bids": [ ["...", "..."] ],
  "asks": [ ["...", "..."] ],
  "metadata": { /* ... optional ... */ }
}
```

## Data Flow Diagram

![image](https://hackmd.io/_uploads/H1_O-CfA1g.png)

## Future Considerations

(No changes from v1.0 in this section, but the considerations now apply to data including market types.)

- Schema Evolution
- New Data Types
- Error Handling
- Monitoring
- Data Retention

## Things we need to figure out

- Two topics, one for spot and one for perp? -> No spot/perp separation at the topic level.
- Only one consumer for now (the Ingestor), but potentially more later (analytics, monitoring, etc.).
- Producer: Pulse. Which acks setting? IMO there is no need to wait for full confirmation -> wait for the leader's acknowledgement only (see the producer sketch at the end of this document).
- Retry parameter for Pulse? 2 or 3? IMO not even needed; we don't mind losing some data in our case, right? -> 0
- Buffer size for Pulse? -> https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html#buffer-memory
- Batch size? Latency vs. network throughput -> https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html#batch-size
- Compacted topics? -> https://docs.confluent.io/kafka/design/log_compaction.html
- Active partition data in memory for the broker -> ???
- How many partitions? -> AI
- Isolate the critical pieces so one failure doesn't break everything; what counts as critical? -> No

## Implementation Notes for Rust SDK

- The SDK needs to deserialize the market_type field.
- Consumers might want helper functions to filter streams by market_type (a sketch follows at the end of this document).

## Action Items

- Once approved, update/create the internal documentation page.
- Create sub-issues/tasks.

## Other questions

- P90 latency requirements?
- Average batch size?
- Average message size?
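To ground the open producer questions above (acks, retries, buffering, batching), here is a minimal sketch of what those settings might look like from Pulse, assuming the Rust rdkafka crate on top of tokio. Every value is a placeholder to tune, and librdkafka's config names differ from the Java client configs in the Confluent links above (e.g., queue.buffering.max.kbytes rather than buffer.memory).

```rust
// Assumed crates: rdkafka (default tokio-based features) and tokio.
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};

/// Hypothetical producer setup for Pulse; values are placeholders, not a
/// decided configuration.
fn build_producer(brokers: &str) -> FutureProducer {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        // Wait for the partition leader only, not the full ISR.
        .set("acks", "1")
        // No retries: per the list above, dropping a tick is acceptable.
        .set("retries", "0")
        // librdkafka's rough analogue of the Java client's buffer.memory.
        .set("queue.buffering.max.kbytes", "32768")
        // Batch size vs. latency trade-off (Java name: batch.size).
        .set("batch.size", "16384")
        .set("linger.ms", "5")
        .create()
        .expect("invalid producer config")
}

/// Publishes one price message, keyed so the feed keeps its ordering.
async fn publish_price(producer: &FutureProducer, key: &str, payload: &str) {
    let record = FutureRecord::to("pragma-prices-v1").key(key).payload(payload);
    // The send error is deliberately ignored, matching the "acks=1, no
    // retries, losing data is fine" stance above.
    let _ = producer.send(record, Duration::from_secs(0)).await;
}

#[tokio::main]
async fn main() {
    let producer = build_producer("localhost:9092");
    publish_price(&producer, "spot-ethereum-ekubo-ETH", r#"{"price": "..."}"#).await;
}
```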
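And for the Rust SDK notes above, a minimal sketch of deserializing a pragma-prices-v1 message and filtering by market_type, assuming serde and serde_json. The `PriceMessage` struct mirrors the JSON schema earlier in this document; the type and helper names are illustrative, not an agreed SDK surface.

```rust
// Assumed crates: serde (with the derive feature) and serde_json.
use serde::Deserialize;

/// Market types from the schema; serde maps "spot"/"perpetual" onto variants.
#[derive(Debug, Deserialize, PartialEq, Eq, Clone, Copy)]
#[serde(rename_all = "lowercase")]
enum MarketType {
    Spot,
    Perpetual,
}

/// Mirrors the pragma-prices-v1 payload; unknown fields (e.g., metadata)
/// are ignored by serde's default behavior.
#[derive(Debug, Deserialize)]
struct PriceMessage {
    message_id: String,
    event_timestamp: String,
    publish_timestamp: String,
    producer: String,
    version: String,
    data_type: String,
    market_type: MarketType,
    token_ticker: String,
    source_id: String,
    network_id: String,
    price: String, // decimal kept as a string, as in the schema above
}

/// Possible SDK helper: keep only the messages for one market type.
fn filter_market_type(
    messages: impl Iterator<Item = PriceMessage>,
    market_type: MarketType,
) -> impl Iterator<Item = PriceMessage> {
    messages.filter(move |m| m.market_type == market_type)
}

fn main() {
    let raw = r#"{
        "message_id": "00000000-0000-0000-0000-000000000000",
        "event_timestamp": "2024-01-01T00:00:00Z",
        "publish_timestamp": "2024-01-01T00:00:01Z",
        "producer": "pulse",
        "version": "1.1",
        "data_type": "price",
        "market_type": "spot",
        "token_ticker": "BTC",
        "source_id": "binance",
        "network_id": "unknown",
        "price": "65432.10"
    }"#;
    let msg: PriceMessage = serde_json::from_str(raw).expect("valid message");
    let spot_only: Vec<PriceMessage> =
        filter_market_type(std::iter::once(msg), MarketType::Spot).collect();
    assert_eq!(spot_only.len(), 1);
}
```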