# Thesis draft
## Definition
- [Lakehouse](https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html)
### Interpreted Vectorization vs Code Generation for Execution Engine
- Interpreted vectorization:
  - Uses a dynamic dispatch mechanism to choose the code to execute for a given input, processes data in batches, and enables SIMD vectorization.
  - Advantages:
    - Easier to develop and scale
    - Better observability and readability
- Code generation:
  - Uses a compiler to produce specialized code for each query.
  - Advantages:
    - Better performance for complex expression trees or when pruning unused columns
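
To make the contrast above concrete, here is a minimal sketch of both approaches evaluating `(a + b) * 2`. The tuple encoding of the expression tree and every function name are made up for illustration; this is not how Spark, Photon, or Velox represent expressions.

```python
import operator

# Hypothetical expression tree for (a + b) * 2:
# (op, left, right), ("col", name), or ("lit", value).
EXPR = ("mul", ("add", ("col", "a"), ("col", "b")), ("lit", 2))

KERNELS = {"add": operator.add, "mul": operator.mul}


def eval_vectorized(expr, batch, n):
    """Interpreted vectorization: walk the tree, one dispatch per operator per batch."""
    kind = expr[0]
    if kind == "col":
        return batch[expr[1]]
    if kind == "lit":
        return [expr[1]] * n
    left = eval_vectorized(expr[1], batch, n)
    right = eval_vectorized(expr[2], batch, n)
    kernel = KERNELS[kind]  # dynamic dispatch, amortized over the whole batch
    # In a native engine, this tight loop is the part a compiler can vectorize.
    return [kernel(x, y) for x, y in zip(left, right)]


def to_source(expr):
    """Turn the tree into Python source text that reads from one row."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    symbol = {"add": "+", "mul": "*"}[kind]
    return f"({to_source(expr[1])} {symbol} {to_source(expr[2])})"


def compile_expr(expr):
    """Code generation: emit source for this one expression and compile it."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval")), source


batch = {"a": [1, 2, 3], "b": [10, 20, 30]}
vectorized = eval_vectorized(EXPR, batch, 3)

fn, source = compile_expr(EXPR)
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]
generated = [fn(row) for row in rows]
```

The interpreted version stays easy to debug because each operator is an ordinary function, while the generated version fuses the whole tree into one expression with no per-operator dispatch.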
## Databricks Photon
### Papers and articles
- [Photon: A Fast Query Engine for Lakehouse Systems](https://www.databricks.com/wp-content/uploads/2022/07/Photon-A-Fast-Query-Engine-for-Lakehouse-Systems.pdf)
- [Apache Spark and Photon receive SIGMOD awards](https://www.databricks.com/blog/2022/06/15/apache-spark-and-photon-receive-sigmod-awards.html)
- [Databricks Sets Official Data Warehousing Performance Record](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)
- [New Performance Improvements in Databricks SQL](https://www.databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html)
### Notes
- Problem: Some organizations want to run data warehousing applications on the same datasets they keep in the data lake and don't want to manage multiple data systems -> need a lakehouse.
- Lakehouse: a data store that can handle both large-scale processing and interactive SQL queries (combining data warehouse and data lake systems).
- Photon:
  - C++ vectorized execution engine for Spark.
  - Compatible with the Spark programming interface.
  - Faster interactive queries and higher concurrency than Spark.
  - Speedup of 3x over the previous runtime (Spark), with a maximum speedup of 10x.
- Decisions made when building Photon:
  - Vectorized-interpreted model instead of code generation. Supports runtime adaptivity to batch-level data characteristics.
  - Implemented in a native language (C++) rather than on the JVM.
- Challenges:
  - Support raw, uncurated data.
  - Support the existing Spark API: integrate with Databricks Runtime (DBR, a fork of Spark). Queries can run partially in Photon and fall back to Spark SQL.
- Advantages:
  - Replacement for the existing Tungsten execution engine (which runs the plans produced by the Catalyst optimizer and its cost-based optimizer).
  - Reduces JVM bottlenecks.
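
The runtime adaptivity noted above can be sketched as choosing a kernel per batch. The example assumes the relevant batch characteristic is whether a batch contains NULLs; `ColumnBatch` and all function names are hypothetical and are not Photon's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnBatch:
    values: List[int]
    nulls: List[bool]  # nulls[i] is True when values[i] is NULL

    @property
    def has_nulls(self) -> bool:
        # Computed per batch, so the kernel choice can change batch to batch.
        return any(self.nulls)


def add_no_nulls(left: ColumnBatch, right: ColumnBatch) -> ColumnBatch:
    # Fast path: no per-row NULL check inside the loop.
    values = [x + y for x, y in zip(left.values, right.values)]
    return ColumnBatch(values, [False] * len(values))


def add_with_nulls(left: ColumnBatch, right: ColumnBatch) -> ColumnBatch:
    # General path: the result is NULL wherever either input is NULL.
    nulls = [a or b for a, b in zip(left.nulls, right.nulls)]
    values = [0 if n else x + y
              for x, y, n in zip(left.values, right.values, nulls)]
    return ColumnBatch(values, nulls)


def choose_add_kernel(left: ColumnBatch, right: ColumnBatch):
    # The decision uses data seen at runtime, not a query-time estimate.
    if left.has_nulls or right.has_nulls:
        return add_with_nulls
    return add_no_nulls


def add(left: ColumnBatch, right: ColumnBatch) -> ColumnBatch:
    return choose_add_kernel(left, right)(left, right)


clean = ColumnBatch([1, 2], [False, False])
dirty = ColumnBatch([3, 4], [False, True])
fast_result = add(clean, clean)
general_result = add(clean, dirty)
```

A code-generating engine would have to compile both variants up front or recompile, whereas an interpreted engine only needs one extra branch per batch.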
### Practice
- Set up Databricks (14-day trial) on Google Cloud ($300 free credit for new registrations).
- https://2849910696242028.8.gcp.databricks.com/?o=2849910696242028#
## Facebook Velox
### Projects
- [velox](https://github.com/facebookincubator/velox)
- [gluten](https://github.com/oap-project/gluten)
### Papers and articles
- [Velox documentation](https://facebookincubator.github.io/velox/)
- [Velox: Meta's Unified Execution Engine](https://research.facebook.com/file/477542930588455/Velox-Metas-Unified-Execution-Engine-p1030-pedreira-cr2-1.pdf)
- [Introducing Velox: An open source unified execution engine](https://engineering.fb.com/2022/08/31/open-source/velox/)
### Notes
- Can be integrated with Presto and PyTorch (using TorchArrow).
- Interpreted vectorization.
- Can be integrated into Spark using Gluten.
- Velox (Berkeley) is a completely different tool: it serves machine learning predictions and integrates with Spark.
## Tungsten Execution Engine
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-tungsten.html
## NVIDIA spark-rapids
### Papers and articles
- [spark-rapids](https://nvidia.github.io/spark-rapids/)
### Notes
- Not an execution engine
- A plugin for Spark that accelerates workloads using GPUs
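
Because it is a plugin, enabling it is mostly a matter of Spark configuration. The sketch below assembles a `spark-submit` command using the `spark.plugins` and `spark.rapids.sql.enabled` settings from the spark-rapids documentation. The jar path and the `spark_submit_args` helper are placeholders introduced here; a real setup also needs GPU resource settings that are omitted.

```python
# Settings taken from the spark-rapids documentation.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
}


def spark_submit_args(app, jar, conf=RAPIDS_CONF):
    """Build a spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit", "--jars", jar]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# The jar path is a placeholder; use the jar that matches the Spark version.
args = spark_submit_args("job.py", "/path/to/rapids-4-spark.jar")
command = " ".join(args)
```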
## Questions and problems
- How to try Databricks Photon? The trial is only for 14 days.
- Need a very big dataset to benchmark
- Possible topic: Benchmark the tools and compare the cost and the performance
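
A minimal sketch of what such a benchmark harness could look like: it times a query callable several times, takes the median, and converts runtime into cost. The linear cost model (runtime multiplied by an hourly cluster price) is an assumption, and every name here is hypothetical.

```python
import statistics
import time


def benchmark(run_query, repeats=5):
    """Return the median wall-clock seconds of run_query over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def cost_usd(seconds, hourly_price_usd):
    """Cost of one run, assuming billing is linear in runtime."""
    return seconds / 3600.0 * hourly_price_usd


def compare(engines, repeats=5):
    """engines maps a name to (run_query, hourly_price_usd).

    Returns (name, median_seconds, cost_usd) tuples, fastest first.
    """
    rows = []
    for name, (run_query, price) in engines.items():
        seconds = benchmark(run_query, repeats)
        rows.append((name, seconds, cost_usd(seconds, price)))
    return sorted(rows, key=lambda row: row[1])


# Stand-in workloads; real runs would submit the same query to each engine.
results = compare({
    "engine_a": (lambda: sum(range(1000)), 2.0),
    "engine_b": (lambda: sum(range(2000)), 3.0),
}, repeats=3)
```

Reporting the median rather than the mean reduces the influence of warm-up runs and one-off stalls.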
## Random
- https://medium.com/@wang.y.katherine/analyzing-100gb-datasets-with-spark-sql-on-a-laptop-b16de6a25221
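- Other query engines and related tools (not for Spark):
  - Apache Impala
  - Apache Hive
  - Apache Sqoop (a data transfer tool rather than a query engine)
  - Apache HBase (a NoSQL store rather than a query engine)
  - Apache Drill
  - Apache Phoenix (SQL layer on top of HBase)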