# Thesis draft 2

- Interpreted vectorization vs code generation
    - Applies to modern database systems in general, not only Spark
    - Spark SQL (the default for Spark) uses code generation
- Interpreted vectorization
    - Uses a dynamic dispatch mechanism to choose the code to execute for a given input; processes data in batches, which enables SIMD vectorization
    - Advantages:
        - Easier to develop and scale
        - Better observability and readability
    - Frameworks:
        - Databricks Photon:
            - [Photon: A Fast Query Engine for Lakehouse Systems](https://www.databricks.com/wp-content/uploads/2022/07/Photon-A-Fast-Query-Engine-for-Lakehouse-Systems.pdf)
            - [Apache Spark and Photon receive SIGMOD awards](https://www.databricks.com/blog/2022/06/15/apache-spark-and-photon-receive-sigmod-awards.html)
            - [Databricks Sets Official Data Warehousing Performance Record](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)
            - [New Performance Improvements in Databricks SQL](https://www.databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html)
- Code generation:
    - Uses a compiler to produce code for the query
    - Generates the executable code at runtime
    - Advantages:
        - Better performance for complex expression trees, and unused columns can be pruned
    - Disadvantages:
        - Hard to build and debug
        - Hard to understand
    - Spark SQL:
        - https://www.youtube.com/watch?v=wVs1FZyKXMY
        - https://books.japila.pl/spark-sql-internals/whole-stage-code-generation/
        - https://www.slideshare.net/databricks/understanding-and-improving-code-generation
          ![](https://i.imgur.com/qwwouHt.png)
        - Volcano iterator model:
            - Each operator can be thought of as an iterator
            - Arbitrary operators can be composed
              ![](https://i.imgur.com/Oy6oSbg.png)
              ![](https://i.imgur.com/zkv782E.png)
        - Expression code generation:
            - No need to traverse the expression tree for every row
            - The generated code can itself be optimized -> improves performance
        - Whole-stage code generation:
            - Inspired by [Efficiently compiling efficient query plans for modern hardware](https://dl.acm.org/doi/10.14778/2002938.2002940)
            - Collapses the entire query into a single operator and generates one function for the entire query
            - Keeps data in CPU registers
              ![](https://i.imgur.com/hIAnLGq.png)
            - Problems:
                - Generated code can grow very large (up to 1 million lines)
                - Java limits a method to 64 KB of bytecode, and by default HotSpot skips JIT compilation for methods over 8 KB (8000 bytes) of bytecode
                - In those cases execution falls back to the Volcano iterator model
    - Spark execution plan: https://sparkbyexamples.com/spark/spark-execution-plan/

## Resources

- https://dl.acm.org/doi/pdf/10.1145/2723372.2742797
- https://www.databricks.com/session_na21/enabling-vectorized-engine-in-apache-spark
- https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-tungsten.html
- https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html
- https://www.waitingforcode.com/apache-spark-sql/why-code-generation-apache-spark-sql/read
- https://www.databricks.com/session/a-deep-dive-into-query-execution-engine-of-spark-sql
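
## Example sketch

To make the three execution styles from the notes concrete (Volcano iterator, interpreted vectorization, code generation), here is a minimal Python sketch that runs the same toy query, `sum(price * qty) where qty > 2`, in each style. Everything in it is an assumption made for illustration: the `(price, qty)` schema, the `ROWS` data, the class and function names, and the idea of dispatching on "batch has nulls". It is not how Spark or Photon are implemented; it only shows the shape of each approach.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[float, int]  # hypothetical schema: (price, qty)
ROWS: List[Row] = [(10.0, 1), (2.5, 4), (3.0, 3), (8.0, 2)]


# 1. Volcano iterator model: every operator exposes next() and pulls
#    one row at a time from its child.
class Scan:
    def __init__(self, rows: List[Row]) -> None:
        self._it: Iterator[Row] = iter(rows)

    def next(self) -> Optional[Row]:
        return next(self._it, None)


class Filter:
    def __init__(self, child, pred: Callable[[Row], bool]) -> None:
        self.child, self.pred = child, pred

    def next(self) -> Optional[Row]:
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None


class Project:
    def __init__(self, child, fn: Callable[[Row], float]) -> None:
        self.child, self.fn = child, fn

    def next(self) -> Optional[float]:
        row = self.child.next()
        return None if row is None else self.fn(row)


def volcano_sum(rows: List[Row]) -> float:
    plan = Project(Filter(Scan(rows), lambda r: r[1] > 2),
                   lambda r: r[0] * r[1])
    total = 0.0
    while (value := plan.next()) is not None:
        total += value
    return total


# 2. Interpreted vectorization: operators work on column batches, and a
#    dispatch table picks a kernel specialised for the batch at hand.
def gt_no_nulls(col: list, k: int) -> List[int]:
    return [i for i, v in enumerate(col) if v > k]


def gt_with_nulls(col: list, k: int) -> List[int]:
    return [i for i, v in enumerate(col) if v is not None and v > k]


GT_KERNELS = {False: gt_no_nulls, True: gt_with_nulls}  # hypothetical table


def vectorized_sum(rows: List[Row], batch_size: int = 2) -> float:
    total = 0.0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        price = [r[0] for r in batch]  # row batch -> column batch
        qty = [r[1] for r in batch]
        has_nulls = any(v is None for v in qty)
        selected = GT_KERNELS[has_nulls](qty, 2)  # indices of passing rows
        total += sum(price[i] * qty[i] for i in selected)
    return total


# 3. Code generation: build the source of one fused loop for the whole
#    query and compile it at runtime.
def generate_fused_sum(predicate: str, expression: str):
    source = (
        "def fused(rows):\n"
        "    total = 0.0\n"
        "    for price, qty in rows:\n"
        f"        if {predicate}:\n"
        f"            total += {expression}\n"
        "    return total\n"
    )
    namespace: dict = {}
    exec(source, namespace)  # runtime compilation of the generated text
    return namespace["fused"]


if __name__ == "__main__":
    fused = generate_fused_sum("qty > 2", "price * qty")
    print(volcano_sum(ROWS), vectorized_sum(ROWS), fused(ROWS))
```

All three functions compute the same result. The difference is where the per-row work happens: in chained `next()` calls for the Volcano model, in per-batch kernels chosen through a lookup for the vectorized version, and in a single generated loop with no operator boundaries for the code-generated version. The generated source is also the part that is hardest to read and debug, which mirrors the disadvantage noted above.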