# Data Engineering Best Practices

## Performance tips

### Use disk cache on Spark

When the disk cache is enabled, data that has to be fetched from a remote source is automatically added to the cache on the first read, so later scans of the same files are served from local storage on the worker nodes.

|Feature|Disk cache|Apache Spark cache|
|-|-|-|
|Stored as|Local files on a worker node.|In-memory blocks; depends on the storage level.|
|Applied to|Any Parquet table stored on S3, ABFS, and other file systems.|Any DataFrame or RDD.|
|Triggered|Automatically, on the first read (if the cache is enabled).|Manually, requires code changes.|
|Evaluated|Lazily.|Lazily.|
|Force cache|`CACHE SELECT` command.|`.cache()` plus any action to materialize the cache, or `.persist()`.|
|Availability|Can be enabled or disabled with configuration flags; enabled by default on certain node types.|Always available.|
|Evicted|Automatically in LRU fashion or on any file change; manually when restarting a cluster.|Automatically in LRU fashion; manually with `unpersist()`.|

1. Cache a subset of the data

   ```sql
   CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
   ```

2. Configure the disk cache (see the PySpark sketch after the parameter list below)

   ```ini
   spark.databricks.io.cache.maxDiskUsage 50g
   spark.databricks.io.cache.maxMetaDataCache 1g
   spark.databricks.io.cache.compression.enabled false
   ```

   - `spark.databricks.io.cache.maxDiskUsage`: disk space per node reserved for cached data, in bytes
   - `spark.databricks.io.cache.maxMetaDataCache`: disk space per node reserved for cached metadata, in bytes
   - `spark.databricks.io.cache.compression.enabled`: whether the cached data is stored in compressed format
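The two steps above can also be driven from a notebook. The following is a minimal PySpark sketch, assuming a Databricks cluster (where `spark.databricks.io.cache.enabled` is the flag that toggles the disk cache) and a hypothetical table `sales.transactions` with columns `order_id`, `amount`, and `order_date`. It enables the disk cache for the session, warms it with `CACHE SELECT`, and contrasts that with the explicit Spark cache.

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; getOrCreate() simply returns it.
spark = SparkSession.builder.getOrCreate()

# 1. Check and, if needed, enable the disk cache for the current session.
#    (On some node types it is already enabled by default.)
print(spark.conf.get("spark.databricks.io.cache.enabled", "unset"))
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# 2. Warm the disk cache for a subset of the data with CACHE SELECT.
#    The cached copies live as local files on the workers' disks.
#    `sales.transactions` and its columns are hypothetical.
spark.sql("""
    CACHE SELECT order_id, amount
    FROM sales.transactions
    WHERE order_date >= '2024-01-01'
""")

# 3. Any later scan of the same files is served from the disk cache
#    automatically, with no code changes.
df = spark.table("sales.transactions").filter("order_date >= '2024-01-01'")
df.groupBy("order_id").sum("amount").show()

# For comparison: the Apache Spark cache must be requested explicitly and
# only materializes once an action runs.
df.cache()       # mark the DataFrame for in-memory caching
df.count()       # action that actually populates the cache
df.unpersist()   # manual eviction
```

As the comparison table suggests, the disk cache pays off when the same remote Parquet or Delta files are scanned repeatedly, while `.cache()`/`.persist()` are better suited to reusing an already-transformed DataFrame within a job.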