# Metrics storage
Valuable-ish sources:
- [How to Choose the Right Database? - MongoDB, Cassandra, MySQL, HBase - Frank Kane - YouTube](https://www.youtube.com/watch?v=v5e_PasMdXc)
- Documentations
- many blog posts which I've since closed and not saved
## Assumptions:
- CAP Theorem priority
1. Consistency
2. ?Availability?
3. ?Partition-Tolerance?
- Scale is not that important
- Doesn't need to be super quick
- Needs to store 3D-ish (probably 4D-ish, even 5D-ish) data nicely
- Needs to handle a lot of data
> Data gain per iteration (at minimum one per hour) = number of users * number of metrics.
Estimating: 2500 users * (100 metrics + intermediate metrics) = 250 000 data points per iteration, and this number will grow because:
- we will have more users
- we will have more metrics
__This leads to at least 250 000 (data points) * 24 (minimal number of iterations per day) * 100 (days of the LEK course) = 600 000 000 data points per edition.__
__When the firing frequency of PEN grows, the number of data points will grow dramatically (3 600 000 000 data points per edition when firing every 10 minutes).__
- The querier (whoever runs queries) needs to be able to easily filter data from a time-dependent chunk
	- for visualising time-changing metrics
	- for reports, which are chunks of metrics
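The volume arithmetic above can be sanity-checked with a quick sketch (numbers taken straight from the estimate; the "+ intermediate metrics" part is ignored, so these are lower bounds):

```python
# Back-of-envelope check of the data-volume estimate.
users = 2500
metrics = 100  # intermediate metrics ignored for the lower bound

per_iteration = users * metrics                   # data points per iteration
per_edition_hourly = per_iteration * 24 * 100     # 24 iterations/day * 100 days
per_edition_10min = per_iteration * 6 * 24 * 100  # firing every 10 minutes

print(per_iteration)       # 250000
print(per_edition_hourly)  # 600000000
print(per_edition_10min)   # 3600000000
```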
## Questions:
- Do we worry about the cost of maintenance?
## Solutions:
1. files + file storage (like S3)
1. Pros:
1. [How to Choose the Right Database? - MongoDB, Cassandra, MySQL, HBase - Frank Kane - YouTube](https://www.youtube.com/watch?v=v5e_PasMdXc) 11:00-13:00
		2. the simplest
		3. so far, all prototyping has been done with files
		4. pandas-friendly
		5. no need for high performance
2. Cons:
		1. scalability is handled by multiplying directories
		2. querying is impossible; we limit ourselves to pandas (Dask) information retrieval. This might make visualising time-changing metrics difficult, but we can handle it in the pipeline.
3. Options:
1. CSV
1. +best Pandas support
2. +Great Expectations compatible (data tests library)
3. +processing framework agnostic
2. Parquet
1. +storage optimised
			2. +efficient (parallelisable)
3. +processing framework agnostic
4. +nested data support
3. HDF5
1. +storage optimised
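A minimal sketch of the file-based option, showing it can still serve the "time-dependent chunk" requirement: one CSV per iteration, named with its timestamp, so time-range filtering happens on filenames before pandas reads anything. The directory layout, file-naming scheme, and column names are illustrative assumptions, not a decided convention.

```python
from pathlib import Path

import pandas as pd

def load_range(data_dir: Path, start: str, end: str) -> pd.DataFrame:
    """Concatenate metric snapshots whose filename timestamp falls in [start, end]."""
    files = sorted(
        f for f in data_dir.glob("metrics_*.csv")
        if start <= f.stem.removeprefix("metrics_") <= end
    )
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Usage: write two snapshots, then load only the 12:00 hour.
store = Path("metrics_store")
store.mkdir(exist_ok=True)
pd.DataFrame({"user_id": [512, 513], "metric1": [0.15, 0.22]}).to_csv(
    store / "metrics_2019-03-27T12-00.csv", index=False)
pd.DataFrame({"user_id": [512, 513], "metric1": [0.17, 0.25]}).to_csv(
    store / "metrics_2019-03-27T13-00.csv", index=False)

df = load_range(store, "2019-03-27T12-00", "2019-03-27T12-59")
print(len(df))  # 2
```

The same idea maps directly onto S3 prefixes instead of directories; Parquet would replace `read_csv` with `read_parquet` and add partitioned writes.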
2. Document
1. Pros:
1. documents (not 2d)
2. scalability
2. Cons:
		1. not SQL (the well-known query language)
		2. harder to use with pandas
		3. Airflow doesn't have an operator for it, but has a hook
		4. not known by the team
3. Options:
1. MongoDB
1. +distributed
2. -probably lot of configuration
2. Google Firestore
1. +google ecosystem
3. Google Datastore
1. +google ecosystem
#### Examples (not that easy to decide):
~~~~json
{
    "timestamp": "2019-03-27 12:32:12:0013",
    "users": {
        "512": {
            "lessons": {
                "1": "done",
                "2": "in-progress",
                ...
            },
            "metric1": 0.15,
            "metric2": 123,
            "metric3": "1:12:13:123120",
            ...
        },
        "513": {
            "lessons": {
                "1": "done",
                "2": "in-progress",
                ...
            },
            "metric1": 0.15,
            "metric2": 123,
            "metric3": "1:12:13:123120",
            ...
        }
    }
}
~~~~
~~~~json
{
    "user_id": 512,
    "timestamps": {
        "2019-03-27 12:32:12:0013": {
            "metric1": 0.15,
            "metric2": 123,
            "metric3": "1:12:13:123120",
            ...
        },
        "2019-03-27 13:32:12:0013": {
            "metric1": 0.15,
            "metric2": 123,
            "metric3": "1:12:13:123120",
            ...
        }
    }
}
~~~~
and many others (at least 8 combinations)...
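Whichever layout wins, the querier's time-range filter is the thing to check it against. A plain-Python sketch, with dicts standing in for documents (no database involved), of that filter over the second, per-user layout; the timestamps and metric values are illustrative:

```python
# Plain dicts stand in for documents; no database involved.
doc = {
    "user_id": 512,
    "timestamps": {
        "2019-03-27 12:32:12": {"metric1": 0.15, "metric2": 123},
        "2019-03-27 13:32:12": {"metric1": 0.17, "metric2": 125},
    },
}

def metrics_in_range(document, start, end):
    # ISO-like timestamp strings sort lexicographically,
    # so string comparison doubles as time comparison.
    return {
        ts: metrics
        for ts, metrics in document["timestamps"].items()
        if start <= ts <= end
    }

hour = metrics_in_range(doc, "2019-03-27 12:00:00", "2019-03-27 12:59:59")
print(list(hour))  # ['2019-03-27 12:32:12']
```

Note the asymmetry: this layout makes "all metrics for one user over time" cheap, while the first layout (one document per iteration) makes "all users at one timestamp" cheap; the choice between the 8+ combinations is really a choice of which query gets to be cheap.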
3. Wide-column
1. Pros:
2. Cons:
	3. Options:
1. Cassandra
			1. -trades off consistency (CAP theorem)
2. -probably lot of configuration
2. HBase
4. Relational
1. Pros:
1. known by team
2. Cons:
		1. only 2D (tabular) data
		2. scalability is harder
3. Options:
1. MySQL
1. +known by team
2. PostgreSQL
1. +views
2. +more pythonic
3. +😎er
4. +new tech
5. -new tech
3. BigQuery
1. +known by team
2. +distributed
3. +scalable
4. -2s delay
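For the relational options, the "only 2D data" con is usually answered with a long/narrow table: one row per (timestamp, user, metric), which flattens the 3D-ish data and makes the time-chunk query a single range scan. A sketch with SQLite standing in for any of the engines above; the schema, table, and column names are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE metrics (
        ts      TEXT,     -- iteration timestamp
        user_id INTEGER,
        metric  TEXT,     -- metric name, e.g. 'metric1'
        value   REAL
    )
""")
con.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("2019-03-27 12:32:12", 512, "metric1", 0.15),
        ("2019-03-27 12:32:12", 513, "metric1", 0.22),
        ("2019-03-27 13:32:12", 512, "metric1", 0.17),
    ],
)

# The querier's time-chunk filter: a single range scan.
hour = con.execute(
    "SELECT user_id, value FROM metrics "
    "WHERE metric = 'metric1' AND ts BETWEEN ? AND ? "
    "ORDER BY user_id",
    ("2019-03-27 12:00:00", "2019-03-27 12:59:59"),
).fetchall()
print(hour)  # [(512, 0.15), (513, 0.22)]
```

With an index on `(metric, ts)` this shape stays cheap to query; the cost is row-count growth (one row per data point, i.e. the full 600M+ per edition).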
5. Search engine
1. Elasticsearch
1. +known by team
2. -not sure if suited for problem
6. Multi-model
1. Amazon DynamoDB
### Dropped:
1. Key-value
1. Redis
1. -not suited for problem
2. Time Series
1. InfluxDB
1. -not suited for problem
2. Prometheus
1. -not suited for problem
3. Graph
1. Neo4j
1. -not suited for problem