# Data Engineering Best Practices
## Performance tips
### Use disk cache on Spark
When the disk cache is enabled, data that has to be fetched from a remote source is automatically added to the cache.
|Feature|disk cache|Apache Spark cache|
|-|-|-|
|Stored as|Local files on a worker node.|In-memory blocks, but it depends on storage level.|
|Applied to|Any Parquet table stored on S3, ABFS, and other file systems.|Any DataFrame or RDD.|
|Triggered|Automatically, on the first read (if cache is enabled).|Manually, requires code changes.|
|Evaluated|Lazily.|Lazily.|
|Force cache|`CACHE SELECT` command|`.cache` + any action to materialize the cache, and `.persist`.|
|Availability|Can be enabled or disabled with configuration flags, enabled by default on certain node types.|Always available.|
|Evicted|Automatically in LRU fashion or on any file change, manually when restarting a cluster.|Automatically in LRU fashion, manually with unpersist.|
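To make the "Triggered" and "Force cache" rows concrete, here is a minimal PySpark sketch: the Apache Spark cache must be requested in code and materialized by an action, while the disk cache kicks in automatically on the first remote read once it is enabled. The bucket path is a placeholder, and `spark.databricks.io.cache.enabled` is the Databricks flag assumed to control the disk cache.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Apache Spark cache: opt-in per DataFrame and lazily evaluated,
# so an action is needed to actually materialize the cached blocks.
events = spark.read.parquet("s3://my-bucket/events")  # placeholder path
events.persist(StorageLevel.MEMORY_AND_DISK)          # or events.cache()
events.count()                                        # action fills the cache
# ... run queries against the cached DataFrame ...
events.unpersist()                                    # manual eviction

# Disk cache: no per-DataFrame code. With the cache enabled, the first
# read of remote Parquet data is copied to local files on the workers.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.read.parquet("s3://my-bucket/events").count()
```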
1. Cache a subset of the data
```sql
CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
```
2. Configure the disk cache
```
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
```
- `spark.databricks.io.cache.maxDiskUsage`: disk space per node reserved for cached data in bytes
- `spark.databricks.io.cache.maxMetaDataCache`: disk space per node reserved for cached metadata in bytes
- `spark.databricks.io.cache.compression.enabled`: whether the cached data is stored in a compressed format
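Putting the two steps together, here is a minimal PySpark sketch, assuming a Databricks cluster where the sizing keys above were supplied in the cluster's Spark configuration; the table, columns, and filter in the `CACHE SELECT` call are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn the disk cache on (or off) for the current session; the sizing
# keys listed above are normally set in the cluster's Spark config
# rather than from a running notebook.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Pre-warm the cache with only the columns a downstream job will scan.
# Table, columns, and filter are placeholders.
spark.sql(
    "CACHE SELECT order_id, amount "
    "FROM sales.transactions "
    "WHERE order_date >= '2024-01-01'"
)
```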