# Thesis draft 2
- Interpreted vectorization vs code generation
- Both approaches are used across modern database systems, not only in Spark
- Spark SQL (default for Spark) uses code generation
- Interpreted vectorization
- Uses dynamic dispatch to choose the code to execute for a given input, processes data in batches, and enables SIMD vectorization
- Advantages:
- Easier to develop and scale
- Better observability and readability
- Frameworks:
- Databricks Photon:
- [Photon: A Fast Query Engine for Lakehouse Systems](https://www.databricks.com/wp-content/uploads/2022/07/Photon-A-Fast-Query-Engine-for-Lakehouse-Systems.pdf)
- [Apache Spark and Photon receive SIGMOD awards](https://www.databricks.com/blog/2022/06/15/apache-spark-and-photon-receive-sigmod-awards.html)
- [Databricks Sets Official Data Warehousing Performance Record](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)
- [New Performance Improvements in Databricks SQL](https://www.databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html)
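
A minimal sketch of the dispatch-then-batch idea behind interpreted vectorization. It is not how Photon is implemented: the kernel names, the `KERNELS` table and `execute` are hypothetical, and plain Python loops stand in for the tight native loops that a real engine would let the compiler turn into SIMD instructions. The point is that the dispatch cost is paid once per batch, not once per row.

```python
from array import array

# Hypothetical batch kernels: each one processes a whole column batch in one tight loop.
def add_int64(left, right):
    return array("q", (a + b for a, b in zip(left, right)))

def add_float64(left, right):
    return array("d", (a + b for a, b in zip(left, right)))

# Dispatch table keyed by (operation, column typecode).
KERNELS = {
    ("add", "q"): add_int64,
    ("add", "d"): add_float64,
}

def execute(op, left, right):
    """Choose a kernel once per batch, then run it over the whole batch."""
    if left.typecode != right.typecode:
        raise TypeError("mismatched column types")
    kernel = KERNELS[(op, left.typecode)]
    return kernel(left, right)

ints = execute("add", array("q", [1, 2, 3]), array("q", [10, 20, 30]))
floats = execute("add", array("d", [0.5, 1.5]), array("d", [1.0, 2.0]))
```
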
- Code generation:
- Use a compiler to produce code specialized for the query
- Generate the executable code at runtime
- Advantages:
- Better performance for complex expression trees and for pruning unused columns
- Disadvantages:
- Hard to build and debug
- Hard to understand
- Spark SQL: https://www.youtube.com/watch?v=wVs1FZyKXMY
- https://books.japila.pl/spark-sql-internals/whole-stage-code-generation/
- https://www.slideshare.net/databricks/understanding-and-improving-code-generation

- Volcano iterator model:
- Each operator can be thought of as an iterator
- Arbitrary operators can be composed

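
A minimal sketch of the Volcano iterator model described above, assuming a row-at-a-time pull interface. The `Scan`, `Filter` and `Project` classes are illustrative names, not Spark classes, and Python's iterator protocol stands in for the classic open/next/close interface.

```python
class Scan:
    """Leaf operator: yields rows from an in-memory table."""
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            yield row

class Filter:
    """Pulls rows from its child and keeps those matching the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row

class Project:
    """Pulls rows from its child and maps each one to a new value."""
    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def __iter__(self):
        for row in self.child:
            yield self.fn(row)

rows = [{"id": 1, "price": 5}, {"id": 2, "price": 50}, {"id": 3, "price": 500}]

# Operators compose freely because they all expose the same iterator interface.
plan = Project(Filter(Scan(rows), lambda r: r["price"] > 10), lambda r: r["id"])
result = list(plan)
```

Every row crosses one call boundary per operator, which is the per-row overhead that code generation tries to remove.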

- Expression code generation:
- No need to traverse the expression tree at runtime
- The generated code can itself be optimized, which improves performance
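
A minimal sketch contrasting tree interpretation with expression code generation. The tuple-based expression format and the helpers `interpret`, `to_source` and `compile_expr` are assumptions made for illustration. Spark generates Java source, whereas here Python source is generated and compiled with the built-in `exec`.

```python
def interpret(expr, row):
    """Interpreted path: walks the expression tree again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    left = interpret(expr[1], row)
    right = interpret(expr[2], row)
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    raise ValueError(f"unknown node: {kind}")

def to_source(expr):
    """Translate the tree into a flat Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    op = {"add": "+", "mul": "*"}[kind]
    return f"({to_source(expr[1])} {op} {to_source(expr[2])})"

def compile_expr(expr):
    """Generated path: the tree is walked once, at code generation time."""
    source = f"def evaluate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(source, namespace)
    return namespace["evaluate"], source

expr = ("add", ("col", "a"), ("mul", ("col", "b"), ("lit", 2)))
evaluate, source = compile_expr(expr)
row = {"a": 1, "b": 3}
```

Both `interpret(expr, row)` and `evaluate(row)` compute `a + b * 2`, but only the interpreted path pays for the tree walk on each row.
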
- Whole-stage code generation
- Inspired by [Efficiently compiling efficient query plans for modern hardware](https://dl.acm.org/doi/10.14778/2002938.2002940)
- Collapse the entire query into a single operator and generate one function for it
- Keep intermediate data in CPU registers
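
A minimal sketch of the whole-stage idea, assuming a scan, filter and sum pipeline. `generate_stage` is a hypothetical helper, not a Spark API: it emits one fused function in which the running total lives in a local variable, loosely analogous to keeping intermediate data in CPU registers instead of passing rows between operators.

```python
def generate_stage(filter_src, value_src):
    """Fuse scan, filter and sum aggregation into one generated function."""
    source = (
        "def run_stage(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if {filter_src}:\n"
        f"            total += {value_src}\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source at runtime
    return namespace["run_stage"], source

run_stage, source = generate_stage(
    "row['price'] > 10",
    "row['price'] * row['qty']",
)

rows = [
    {"price": 5, "qty": 1},
    {"price": 20, "qty": 2},
    {"price": 30, "qty": 3},
]
total = run_stage(rows)
```

The generated function contains a single loop with no per-operator calls, in contrast to the iterator model where each row passes through one call per operator.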

- Problems:
- The generated code can reach 1 million lines
- The JVM limits a method's bytecode to 64 KB, and HotSpot does not JIT-compile methods whose bytecode exceeds 8 KB
- When this happens, execution falls back to the Volcano iterator model
- Spark execution plan: https://sparkbyexamples.com/spark/spark-execution-plan/
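
A minimal sketch of the fallback behaviour listed under the problems above. The limit, `build_executor` and `interpreted_sum` are hypothetical, and the check measures the size of the generated source text, whereas the JVM limit applies to compiled bytecode. It only illustrates the control flow: generate code, check its size, and fall back to the interpreted path when it is too large.

```python
# Hypothetical limit, modeled on the JVM's 64 KB method size limit.
MAX_GENERATED_BYTES = 64 * 1024

def interpreted_sum(rows, columns):
    """Fallback path: walks the column list for every row."""
    return sum(row[c] for row in rows for c in columns)

def build_executor(columns, limit=MAX_GENERATED_BYTES):
    """Return (mode, function): generated code if small enough, else the fallback."""
    body = " + ".join(f"row[{c!r}]" for c in columns)
    source = f"def generated_sum(rows):\n    return sum({body} for row in rows)\n"
    if len(source.encode()) > limit:
        return "interpreted", lambda rows: interpreted_sum(rows, columns)
    namespace = {}
    exec(source, namespace)
    return "generated", namespace["generated_sum"]

rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
mode_small, run_small = build_executor(["a", "b"])
# An artificially low limit forces the fallback path for demonstration.
mode_large, run_large = build_executor(["a", "b"], limit=16)
```

Both executors return the same result, so the fallback costs performance but not correctness.
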
## Resources
- https://dl.acm.org/doi/pdf/10.1145/2723372.2742797
- https://www.databricks.com/session_na21/enabling-vectorized-engine-in-apache-spark
- https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-tungsten.html
- https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html
- https://www.waitingforcode.com/apache-spark-sql/why-code-generation-apache-spark-sql/read
- https://www.databricks.com/session/a-deep-dive-into-query-execution-engine-of-spark-sql