# Delta format

---
This presentation covers my findings from
playing with the Delta format (https://delta.io/)
on the train ride back from the NEO event.
It is not a comprehensive overview, just
the things I especially liked.
---
## Intro
Apache Spark is a first-class citizen in the world of big data (to be honest, I don't know of anything else that is proven enough to be a second option).
Delta Lake is a file format for storing and querying large amounts of data efficiently.
Both are open-source projects.
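A minimal sketch of writing and reading a Delta table with PySpark (assumes the `delta-spark` pip package; the path and data are placeholders):
```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# Pulls in the matching Delta jars when installed via `pip install delta-spark`
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# "delta" simply replaces "parquet" as the write/read format
df.write.format("delta").mode("overwrite").save("/tmp/events")
spark.read.format("delta").load("/tmp/events").show()
```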
---
## Still Parquet as the underlying data format
- strongly typed
- immutable
- compressed
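For illustration, "strongly typed" means the schema travels with the files, so types survive a round trip (a sketch; `spark` is a plain SparkSession, the path is an example):
```python
df = spark.createDataFrame([(1, 2.5)], ["id", "score"])
df.write.mode("overwrite").parquet("/tmp/typed")

# The schema (id: long, score: double) is stored in the Parquet files themselves
spark.read.parquet("/tmp/typed").printSchema()
```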
---
## Interesting properties
---
Partitions can be replaced without rewriting all the data

---
Each job is atomic (it either succeeds or leaves the original state unchanged)
```bash
# rename operation (starts)
/table
    /INFORMATION_DATE=2021-10-19
        A-00001.parquet
        A-00002.parquet
    /_tmp
        /INFORMATION_DATE=2021-10-20
            B-00001.parquet
            B-00002.parquet.part
```
```bash
# rename operation (done)
/table
    /INFORMATION_DATE=2021-10-19
        A-00001.parquet
        A-00002.parquet
    /INFORMATION_DATE=2021-10-20
        B-00001.parquet
        B-00002.parquet
```
The same applies to Spark Structured Streaming (micro-batches)
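A hedged sketch of a streaming write to Delta, where each micro-batch commits atomically (the `rate` source and paths are placeholders):
```python
# `spark` is the Delta-enabled session from the intro sketch
stream = spark.readStream.format("rate").load()  # stand-in for a real source

query = (stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/stream_events/_checkpoint")
    .start("/tmp/stream_events"))
```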
---
The transaction log. You can think of it as a Git repository's .git directory.
(shown in the hands-on project)
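On disk it is a `_delta_log` directory next to the data files; each commit adds a zero-padded JSON file (an illustrative layout, file names shortened):
```bash
/table
    /_delta_log
        00000000000000000000.json   # commit 0: initial files + schema
        00000000000000000001.json   # commit 1: e.g. an append or overwrite
    part-00000-....snappy.parquet
    part-00001-....snappy.parquet
```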
---
Time travel (possible thanks to the transaction log)
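Reading an older snapshot is a single read option (the version number and timestamp below are placeholders):
```python
# By commit version...
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events").show()

# ...or by timestamp (resolved against the transaction log)
(spark.read.format("delta")
    .option("timestampAsOf", "2021-10-19")
    .load("/tmp/events")
    .show())
```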

---
Schema evolution
- Schema validation on write (shown in the hands-on)
- Schema evolution for adding columns (see the sketch below)
- Changing the data type of a column (needs a rewrite + overwriteSchema=true)
- Appending a partial DataFrame (a subset of columns)
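A sketch of the two standard write options involved; the DataFrame names are hypothetical:
```python
# `evolved_df` (hypothetical) has all original columns plus a new one;
# mergeSchema lets the append add that column to the table schema
(evolved_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/events"))

# Changing a column's type cannot be done in place: rewrite the table.
# `cast_df` (hypothetical) is the full dataset with the column re-cast
(cast_df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("/tmp/events"))
```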
---
Replacing a single partition
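A sketch using Delta's `replaceWhere` write option, which atomically replaces only the rows matching the predicate (the DataFrame name and path are placeholders):
```python
# `day_df` (hypothetical) holds the corrected data for one day;
# only the matching partition is rewritten, the rest stays untouched
(day_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "INFORMATION_DATE = '2021-10-20'")
    .save("/table"))
```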

---
Thank you
---
{"metaMigratedAt":"2023-06-17T09:44:59.508Z","metaMigratedFrom":"Content","title":"Delta format","breaks":true,"contributors":"[{\"id\":\"2a089c74-c1e2-4106-ab8f-e7197ae3bbe3\",\"add\":1926,\"del\":11}]"}