# ClickHouse

## Primary keys

* The primary key should be chosen from the columns that are most frequently searched/filtered on
* It cannot be changed after table creation
* Primary keys in ClickHouse are not unique for each row in a table
* If a primary key isn't specified, the ORDER BY clause is used as one
* The primary key of a ClickHouse table determines how the data is sorted when written to disk
* The index created by the primary key does not contain an entry for every row but instead contains an entry for every 8,192 rows (referred to as index granularity)
* ClickHouse uses sparse indices: it sorts by the key columns and stores one index entry for every X rows (the granularity). Indices are kept in memory for fast lookups.
* Query to get metadata about the table:

```
SELECT
    part_type,
    path,
    formatReadableQuantity(rows) AS rows,
    formatReadableSize(data_uncompressed_bytes) AS data_uncompressed_bytes,
    formatReadableSize(data_compressed_bytes) AS data_compressed_bytes,
    formatReadableSize(primary_key_bytes_in_memory) AS primary_key_bytes_in_memory,
    marks,
    formatReadableSize(bytes_on_disk) AS bytes_on_disk
FROM system.parts
WHERE (table = 'hits_UserID_URL') AND (active = 1)
FORMAT Vertical;
```

* If a primary key isn't specified, the ORDER BY key is assumed to be the primary key
* The sort order (ORDER BY) can differ from the primary key. If both are defined, the primary key must be a prefix of the ORDER BY tuple.
* Primary key columns should be ordered by increasing cardinality (lowest-cardinality column first) to get the best performance
* A column should be added to the primary key only if:
  * Lots of queries will be made on the added column
  * Adding the column allows long data ranges to be skipped
* If the primary key indexes do not fit in memory, ClickHouse will not start (or will crash), so keys should be chosen carefully
* Options for creating additional primary indexes:
  * Create two tables for the same data but with different primary keys
  * Use a projection
  * Use a materialized view
  * Define a skipping index
* How do partitions relate to the primary key?
  * Partitions can help query performance, but they are mainly useful for data management, i.e. deciding which data should be discarded or moved to cold storage
  * If query performance is the focus, a good primary key is more important

## ClickHouse architecture

* The table engine determines how and where the data is stored.

### MergeTree

* MergeTree is the default table engine.
* ClickHouse inserts should happen in bulk; even a million rows per second works.
* Single-row inserts are not recommended. If needed, use async inserts.
* Each bulk insert creates a part. A part is stored in its own folder on disk.
* Each column is stored in its own file inside the part folder, in sorted order.
* Over time, parts are merged in the background by ClickHouse, hence the name MergeTree.
* max_bytes_to_merge_at_max_space_in_pool defaults to 150GB; this parameter defines how big a part can get.
* Column files consist of compressed blocks. These blocks let ClickHouse avoid reading an entire column file.
* Blocks use LZ4 compression by default. min_compress_block_size / max_compress_block_size: 64KB-1MB.
* Granule: a logical breakdown of rows inside uncompressed blocks.
* Mark: contains block offsets for each granule.
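To tie the primary-key and MergeTree bullets together, here is a minimal DDL sketch. The `events` table and all column names are hypothetical; the point is that the primary key is a prefix of the ORDER BY tuple, key columns go from low to high cardinality, and the granularity is left at the default 8,192 rows.

```
-- Hypothetical example: primary key is a prefix of ORDER BY,
-- key columns ordered from low to high cardinality.
CREATE TABLE events
(
    tenant     LowCardinality(String),
    event_type LowCardinality(String),
    user_id    UInt64,
    ts         DateTime,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)           -- partitions mainly for data management/retention
PRIMARY KEY (tenant, event_type)    -- must be a prefix of the ORDER BY tuple
ORDER BY (tenant, event_type, user_id, ts)
SETTINGS index_granularity = 8192;  -- one sparse index entry per 8,192 rows
```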
#### Deduplication

* What are the differences between ReplacingMergeTree and CollapsingMergeTree/VersionedCollapsingMergeTree?
  * ReplacingMergeTree is easier to implement, but your queries need to either force a merge before running (using the FINAL clause, which is not typically recommended) or use a timestamp field to retrieve the latest value (using a function like argMax).
  * The collapsing engines use a "sign" column of type Int8 that is set to either 1 or -1. To cancel a row, you insert the same data but change the sign column to -1. This requires you to know the previous state of a row so you can cancel it, but the benefit is that it is easy to find the current state of a row.

## Materialized views

* Materialized views are particularly useful when:
  * Query results contain a small number of rows and/or columns relative to the base table (the table on which the view is defined)
  * Query results require significant processing, including:
    * Analysis of semi-structured data
    * Aggregates that take a long time to calculate
  * The query is on an external table (that is, data sets stored in files in an external stage), which might have slower performance compared to querying native database tables
  * The view's base table does not change frequently

## TTL

* Removing old data: no surprise, you can delete rows or columns after a specified time interval
* Moving data between disks: after a certain amount of time, you can move data between storage volumes - useful for deploying a hot/warm/cold architecture
* Data rollup: roll up your older data into various useful aggregations and computations before deleting it
* The TTL clause can appear after a column definition and/or at the end of the table definition.
* A new TTL rule is not applied to the existing rows in the table. If you want your TTL rule to be applied to the existing rows, you need to materialize it: `ALTER TABLE example1 MATERIALIZE TTL`
* Actual deletion of rows does not happen until the parts containing them are merged.
* Triggering TTL events:
  * The deleting or aggregating of expired rows is not immediate - it only occurs during table merges. If you have a table that's not actively merging (for whatever reason), there are two settings that trigger TTL events:
    * merge_with_ttl_timeout: defaults to 4hrs
    * merge_with_recompression_ttl_timeout: defaults to 4hrs
  * A hacky, non-recommended solution: `OPTIMIZE TABLE example1 FINAL`. OPTIMIZE initializes an unscheduled merge of the parts of your table, and FINAL forces a reoptimization if your table is already a single part.
* Volume-based TTL: `MODIFY TTL trade_date TO VOLUME 'hot_volume'`

## Misc

* A ClickHouse server can handle up to 10k tables and databases combined
* LowCardinality columns shouldn't exceed roughly 10k distinct values
* PARTITION BY should usually be by day/week/month, based on the retention period.
* The limiting factor for a single ClickHouse server is mostly the number of partitions it has; ~1000 is the recommended maximum
* SET limits on queries in terms of the number of rows they can scan, the amount of CPU/RAM a single query can consume, etc.
* Search queries can possibly use NLP functions: https://clickhouse.com/docs/en/sql-reference/functions/nlp-functions

## Logs

* Materialized views keyed on severity levels are helpful. If there are 10k rows with only 1k rows of a particular severity level, CH will save processing 9k rows on a query like this: `SELECT * from log_messages where message LIKE '%INFO%'`.
* Materialized views

## Metrics

- Gorilla codec works well for gauge metric values
- Double delta works well for increasing counters

## Observability misc

- Multi-tenancy controls
  - Ingestion rate limit
  - Max global series per tenant
  - Max global series per metric
  - Max labels per metric
  - Max series per query
- Look at Cortex/Thanos/Observatorium multi-tenant limits
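As a rough ClickHouse-side sketch of the per-query limits mentioned in the Misc and Observability bullets above: the setting names below are real ClickHouse settings, but the values are placeholders, and per-tenant enforcement would normally be done through settings profiles or quotas attached to each tenant's user/role rather than ad-hoc SET statements.

```
-- Placeholder values; attach these via a settings profile/quota per tenant in practice.
SET max_memory_usage   = 10000000000;  -- ~10 GB of RAM per query
SET max_rows_to_read   = 1000000000;   -- cap on rows scanned by a single query
SET max_execution_time = 30;           -- seconds before the query is cancelled
SET max_threads        = 8;            -- cap on CPU threads used by a single query
```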
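Tying back to the Metrics bullets above, a hypothetical metrics table sketch using those codecs (table and column names are made up): Gorilla for gauge values, DoubleDelta for counters and timestamps.

```
-- Hypothetical metrics table illustrating codec choices for time-series data.
CREATE TABLE metrics
(
    ts            DateTime               CODEC(DoubleDelta, LZ4),
    metric_name   LowCardinality(String),
    gauge_value   Float64                CODEC(Gorilla, LZ4),
    counter_value UInt64                 CODEC(DoubleDelta, LZ4)
)
ENGINE = MergeTree
ORDER BY (metric_name, ts);
```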