## Apache Druid on AKS
 
<!-- Put the link to this slide here so people can follow -->
<!-- slide: https://hackmd.io/@y56/SkSUF0kD_ -->
Speaker: Eugene Wang
---
### Background
- Count unique users
- High dimensionality (100k cubes)
- Time-series data (250 days)
- 300M records, 30 GB
<!-- - Apache Druid (cost: 30k NTD/month)
- 30 days in 250 days
- 4.8, ()
- Azure Synapse (cost: 30k NTD/month)
- 30 days in 250 days
- 16s -->
 
---
### MOLAP/Druid features
(**m**ulti-dimensional **o**n**l**ine **a**nalytical **p**rocessing)
- Pre-computed aggregation :smiley:
- Exact distinct counts are not mergeable :cry:
- Much re-aggregation work if records change often :cry:
- Multidimensional indexing (e.g., bitmap index)
- Smaller on-disk size
<!-- (our case, 150G json ==> 30G) -->
<!-- - ?? data redundancy :smiley: :cry:
-->
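
Why "not mergeable" hurts: a rollup that stores only an exact distinct count per chunk cannot be combined into a correct total. A minimal sketch with made-up user IDs (the sets and names are hypothetical, not from the talk's data):

```python
# Hypothetical per-day user ID sets; values are invented for illustration.
day1 = {"alice", "bob", "carol"}
day2 = {"bob", "carol", "dave"}

# A pre-aggregated rollup that keeps only the exact distinct count per day.
count_day1 = len(day1)
count_day2 = len(day2)

# Adding the stored counts double-counts users seen on both days.
summed = count_day1 + count_day2

# The correct answer needs the underlying sets (or a mergeable sketch).
true_distinct = len(day1 | day2)

print(summed, true_distinct)  # 6 4
```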

---
### Apache DataSketches
Most Frequent, Quantiles, ***Distinct Counting***!
- Random uniform hash function
- Theta Sketches
  - KMV (K-th Minimum Value)
- HyperLogLog Sketches
  - Find first set / number of leading zeros
  - Supported by a hardware instruction
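
The KMV idea can be sketched in a few lines: hash every item to a uniform value in [0, 1), keep only the k smallest hashes, and estimate the distinct count from the k-th smallest one. This is a toy illustration, not the Apache DataSketches implementation; the class name, the SHA-256 based hash, and the `user-N` IDs are assumptions made for the example.

```python
import hashlib


def uniform_hash(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) via SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Toy K-th Minimum Value sketch: keeps the k smallest distinct hashes."""

    def __init__(self, k=1024):
        self.k = k
        self.values = set()

    def add(self, item):
        self.values.add(uniform_hash(item))
        if len(self.values) > self.k:
            # Drop the largest hash so only the k smallest remain.
            self.values.remove(max(self.values))

    def merge(self, other):
        """Union of two sketches: k smallest hashes of the combined sets."""
        merged = KMVSketch(min(self.k, other.k))
        merged.values = set(sorted(self.values | other.values)[: merged.k])
        return merged

    def estimate(self):
        if len(self.values) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self.values))
        # Standard KMV estimator: (k - 1) / (k-th smallest hash).
        return (self.k - 1) / max(self.values)


# Two overlapping, hypothetical user populations (10,000 distinct in total).
a, b = KMVSketch(), KMVSketch()
for i in range(0, 6000):
    a.add(f"user-{i}")
for i in range(4000, 10000):
    b.add(f"user-{i}")

union = a.merge(b)
print(round(union.estimate()))  # close to 10000 rather than 12000
```

Unlike exact counts, the two sketches merge into the same sketch that a single pass over all users would have produced, which is what makes pre-aggregation possible.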
  
---
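The leading-zeros idea behind HyperLogLog, as a toy sketch: the first p bits of a hash pick a register, and each register remembers the longest run of leading zeros seen in the remaining bits. This is an illustration under stated assumptions (SHA-256 as the hash, p = 12, the basic estimator with a small-range correction), not the DataSketches implementation. Python's `int.bit_length()` stands in for the count-leading-zeros instruction that native code would use.

```python
import hashlib
import math


def hash64(item: str) -> int:
    """64-bit hash taken from the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")


def rank(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a width-bit word,
    i.e. leading zeros + 1 (width + 1 when w == 0)."""
    return width - w.bit_length() + 1


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = hash64(item)
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        self.registers[idx] = max(self.registers[idx], rank(rest, 64 - self.p))

    def merge(self, other):
        """Register-wise max; assumes both sketches use the same p."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # constant valid for m >= 128
        raw = alpha * self.m**2 / sum(2.0**-r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two overlapping, hypothetical user populations (10,000 distinct in total).
a, b = HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    a.add(f"user-{i}")
for i in range(4000, 10000):
    b.add(f"user-{i}")

merged = a.merge(b)
print(round(merged.estimate()))  # close to 10000
```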

---
### Druid architecture

<!-- - datasource (stream/batch)
- ingestion
- chunk, segment, granularity -->
---
Streaming case: data can be queried while it is still being ingested
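
A unique-user query might be built like the sketch below. It assumes the `druid-datasketches` extension is loaded; the datasource `events`, the column `user_id`, and the interval are hypothetical. The resulting JSON would be POSTed to the Broker's `/druid/v2` endpoint, which this example does not do so that it runs without a cluster.

```python
import json


def unique_users_query(datasource, sketch_column, interval):
    """Build a native Druid timeseries query with a thetaSketch aggregator."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [interval],
        "aggregations": [
            {
                "type": "thetaSketch",
                "name": "unique_users",
                "fieldName": sketch_column,
            }
        ],
    }


# Hypothetical datasource, column, and interval, chosen for illustration only.
query = unique_users_query("events", "user_id", "2021-01-01/2021-01-31")
print(json.dumps(query, indent=2))
```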

---
### Tuning
- JVM heap size (monitor the JVM with Prometheus)
- Off-heap (direct) memory size
- PersistentVolume (PV) size
- Premium SSD
- Number of Historicals/MiddleManagers
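
For the heap and off-heap items, a rough sizing helper can make the trade-off concrete. Druid's documentation gives the minimum direct memory as `druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)`. The numbers below (500 MiB buffers, 7 threads, 2 merge buffers, 8 GiB heap) are hypothetical, and the pod figure is only a lower bound that ignores metaspace, thread stacks, and page cache for segments.

```python
MIB = 1024**2


def min_direct_memory_bytes(buffer_size_bytes, num_threads, num_merge_buffers):
    """Minimum direct (off-heap) memory a Druid process needs for its
    processing buffers, following the formula in the Druid tuning docs."""
    return buffer_size_bytes * (num_merge_buffers + num_threads + 1)


# Hypothetical settings for one Historical pod.
buffer_size = 500 * MIB
needed = min_direct_memory_bytes(buffer_size, num_threads=7, num_merge_buffers=2)

heap = 8 * 1024 * MIB
pod_memory_floor = heap + needed  # lower bound for the pod's memory request

print(f"direct memory >= {needed // MIB} MiB")
print(f"pod memory floor ~ {pod_memory_floor // MIB} MiB")
```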
---
Thank you!
---