
# Thesis draft

## Definition

- [Lakehouse](https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html)

### Interpreted Vectorization vs Code Generation for Execution Engine

- Interpreted vectorization:
    - Uses a dynamic dispatch mechanism to choose the code to execute for a given input, processes data in batches, and enables SIMD vectorization.
    - Advantages:
        - Easier to develop and scale
        - Better observability and readability
- Code generation:
    - Uses a compiler to produce code for each query.
    - Advantages:
        - Better performance for complex expression trees or for pruning unused columns

## Databricks Photon

### Papers and articles

- [Photon: A Fast Query Engine for Lakehouse Systems](https://www.databricks.com/wp-content/uploads/2022/07/Photon-A-Fast-Query-Engine-for-Lakehouse-Systems.pdf)
- [Apache Spark and Photon receive SIGMOD awards](https://www.databricks.com/blog/2022/06/15/apache-spark-and-photon-receive-sigmod-awards.html)
- [Databricks Sets Official Data Warehousing Performance Record](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)
- [New Performance Improvements in Databricks SQL](https://www.databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html)

### Notes

- Problem: Some organizations want to run data warehousing applications on the same datasets and don't want to manage multiple data systems -> need a lakehouse.
- Lakehouse: a data store that can do large-scale processing and interactive SQL queries (combining data warehouse and data lake systems).
- Photon:
    - C++ vectorized execution engine for Spark.
    - Compatible with the Spark programming interface.
    - Faster interactive queries and higher concurrency than Spark.
    - Speedup of 3x over the previous runtime (Spark), with a maximum speedup of 10x.
- Decision to build Photon:
    - Vectorized-interpreted model instead of code generation. Supports runtime adaptivity to batch-level data characteristics.
    - Implement Photon in a native language (C++) rather than on the JVM.
- Challenges:
    - Support raw, uncurated data
    - Support the existing Spark API: integrate with DBR (a fork of Spark). Queries can run partially in Photon and fall back to Spark SQL.
- Advantages:
    - Replacement for the existing Tungsten execution engine (which uses Catalyst optimizer and Cost Based Optimizer).
    - Reduce JVM bottleneck.

### Practice

- Set up Databricks (14-day trial) on Google Cloud ($300 free for new registration).
- https://2849910696242028.8.gcp.databricks.com/?o=2849910696242028#

## Facebook Velox

### Projects

- [velox](https://github.com/facebookincubator/velox)
- [gluten](https://github.com/oap-project/gluten)

### Papers and articles

- [Velox documentation](https://facebookincubator.github.io/velox/)
- [Velox: Meta's Unified Execution Engine](https://research.facebook.com/file/477542930588455/Velox-Metas-Unified-Execution-Engine-p1030-pedreira-cr2-1.pdf)
- [Introducing Velox: An open source unified execution engine](https://engineering.fb.com/2022/08/31/open-source/velox/)

### Notes

- Can be integrated with Presto and PyTorch (using TorchArrow).
- Interpreted vectorization
- Can be integrated into Spark using Gluten.
- Velox (Berkeley) is a completely different tool: serving machine learning predictions with Spark integration.

## Tungsten Execution Engine

- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-tungsten.html

## NVIDIA spark-rapids

### Papers and articles

- [spark-rapids](https://nvidia.github.io/spark-rapids/)

### Notes

- Not an execution engine
- Plugin for Spark to accelerate using GPU

## Questions and problems

- How to try Databricks Photon? The trial is only for 14 days.
- Need a very big dataset to benchmark
- Possible topic: Benchmark the tools and compare the cost and the performance

## Random

- https://medium.com/@wang.y.katherine/analyzing-100gb-datasets-with-spark-sql-on-a-laptop-b16de6a25221
- Other query engines (not for Spark):
    - Apache Impala
    - Apache Hive
    - Apache Sqoop
    - Apache HBase
    - Apache Drill
    - Apache Phoenix
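
## Sketch: interpreted vectorization vs code generation

To make the distinction from the Definition section concrete, here is a minimal toy sketch in plain Python. It is not how Photon, Velox, or Spark are implemented. The tuple-based expression format and the names `EXPR`, `KERNELS`, `eval_vectorized`, `to_source`, and `codegen` are all hypothetical, chosen only for illustration. The interpreted version walks the expression tree once per batch and dispatches each node to a kernel that processes a whole column. The code generation version turns the same tree into source code for one fused loop and compiles it.

```python
import operator

# Hypothetical expression format (an assumption for this sketch):
#   ("col", name) | ("lit", value) | (op, left, right)
EXPR = ("add", ("mul", ("col", "price"), ("col", "qty")), ("lit", 1))

# Batch-at-a-time kernels, selected by dynamic dispatch on the node type.
KERNELS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}
# Operator symbols used when emitting source code.
SYMBOLS = {"add": "+", "mul": "*", "sub": "-"}


def eval_vectorized(expr, batch, n):
    """Interpreted vectorization: walk the tree once per batch of n rows."""
    kind = expr[0]
    if kind == "col":
        return batch[expr[1]]
    if kind == "lit":
        return [expr[1]] * n
    left = eval_vectorized(expr[1], batch, n)
    right = eval_vectorized(expr[2], batch, n)
    kernel = KERNELS[kind]  # dispatch happens per batch, not per row
    # A real engine would run a SIMD-friendly loop over columnar buffers here.
    return [kernel(a, b) for a, b in zip(left, right)]


def to_source(expr):
    """Emit a Python expression string that evaluates the tree for row i."""
    kind = expr[0]
    if kind == "col":
        return f"batch[{expr[1]!r}][i]"
    if kind == "lit":
        return repr(expr[1])
    return f"({to_source(expr[1])} {SYMBOLS[kind]} {to_source(expr[2])})"


def codegen(expr):
    """Code generation: compile the whole tree into one fused loop."""
    source = (
        "def run(batch, n):\n"
        f"    return [{to_source(expr)} for i in range(n)]\n"
    )
    namespace = {}
    exec(source, namespace)  # stand-in for a query compiler
    return namespace["run"], source


if __name__ == "__main__":
    batch = {"price": [2, 3, 4], "qty": [10, 20, 30]}
    print(eval_vectorized(EXPR, batch, 3))
    run, source = codegen(EXPR)
    print(source)
    print(run(batch, 3))
```

The interpreted path keeps every intermediate column visible, which is in line with the observability advantage noted above. The generated path has no intermediate columns and no per-node dispatch, which is where the advantage for complex expression trees would come from.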