[ML] Azure / Databricks
===
###### tags: `ML`, `Azure`, `Databricks`, `Platform`
<br>
[TOC]
<br>
<hr>
<br>
:::success
:memo: **Cynthia's resources**:
- [hackmd](https://hackmd.io/@Cynthia-Chuang/HJ__DD3M_)
- [slide](https://docs.google.com/presentation/d/1322sq5CflddMaTQQc8IZXJMnF4z44CNv5thqvsW15yM/edit?usp=sharing)
:::
<br>
## [Official site](https://azure.microsoft.com/zh-tw/services/databricks/)
:::info
:information_source: **Azure Databricks**
A fast, easy, and collaborative Apache Spark™-based analytics service
:::
### Get an optimized Apache Spark environment up and running quickly
![](https://i.imgur.com/7kOHMQd.png)
### Boost productivity with shared workspaces and common languages
![](https://i.imgur.com/x7mBDmT.png)
### Improve machine learning with big data
![](https://i.imgur.com/gHLpile.png)
### Get a high-performance modern data warehouse
![](https://i.imgur.com/q2IAI01.png)
### Key service capabilities
![](https://i.imgur.com/ksSdopd.png)
<br>
<hr>
<br>
## Hello, World
### [Portal](https://portal.azure.com/)
[All services](https://portal.azure.com/#allservices)
### Main page
![](https://i.imgur.com/GGvr7UE.png)
<br>
### [[Hello, World 1] Create a real-time report on Seattle safety data](https://docs.microsoft.com/zh-tw/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal)
:::info
:information_source: **Quickstart**: Run a Spark job on Azure Databricks Workspace using the Azure portal
:::
Blob data source & credentials:
![](https://i.imgur.com/ceCYMc7.png)
---
[WASB](https://stackoverflow.com/questions/60277545/): Windows Azure Storage Blob
![](https://i.imgur.com/4NY4jIl.png)
---
Read the blob data with pyspark and create a view (`source`):
![](https://i.imgur.com/I9GZAke.png)
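The read step boils down to building a WASB(S) URL and reading it with Spark. A minimal sketch of the path construction; the account, container, and relative-path values below are modeled on the Azure Open Datasets layout and should be treated as assumptions:

```python
# Hypothetical values modeled on the Azure Open Datasets Seattle safety quickstart
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Seattle"

# WASB(S) URL format: wasbs://<container>@<account>.blob.core.windows.net/<path>
wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (
    blob_container_name, blob_account_name, blob_relative_path)
print(wasbs_path)

# In the notebook, the data would then be read and registered as a view:
# df = spark.read.parquet(wasbs_path)
# df.createOrReplaceTempView('source')
```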
---
- `display(df)` renders the result as HTML
![](https://i.imgur.com/7BI0N6a.png)
- `df.show()` prints the result as plain text
![](https://i.imgur.com/kUr9deO.png)
---
The dataset does not contain the `311_All` subtype mentioned in the docs:
```python
display(df.select('dataSubtype').distinct())
```
![](https://i.imgur.com/JzzqTcw.png)
---
Query data from the created view `source`:
![](https://i.imgur.com/DQyqOsF.png)
---
Data transformation example 1:
```python
display(df.select('dataSubtype')
          .distinct()
          .withColumnRenamed('dataSubtype', 'Events'))
```

<br>
Data transformation example 2:
```python=
from pyspark.sql.functions import count

view1 = df.select('dataSubtype')
view2 = view1.groupBy('dataSubtype').agg(count('dataSubtype'))
display(view2)
view3 = view2.withColumnRenamed('dataSubtype', 'events') \
             .withColumnRenamed('count(dataSubtype)', 'count')
display(view3)
```

<br>
### [[Hello, World 2] Diamonds dataset: group by diamond color and compute the average price](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/visualizations/)
- ### Inspect directories & file locations
```python=
print("type(dbutils.fs.ls('/')):", type(dbutils.fs.ls('/')))
display(dbutils.fs.ls('/databricks-datasets'))
display(dbutils.fs.ls('dbfs:/databricks-datasets'))
```

<br>
- ### Side note: spotted genomics data ([Genomics guide](https://docs.microsoft.com/zh-tw/azure/databricks/applications/genomics/))
```python=
display(dbutils.fs.ls('dbfs:/databricks-datasets/genomics/'))
```

<br>
- ### Load CSV
- The `inferSchema` parameter (default: `inferSchema=None`)
```
This function will go through the input once to determine the input schema if
``inferSchema`` is enabled. To avoid going through the entire data once, disable
``inferSchema`` option or specify the schema explicitly using ``schema``.
inferSchema : str or bool, optional
    infers the input schema automatically from data.
    It requires one extra pass over the data.
    If None is set, it uses the default value, ``false``.
```
- In short: should Spark make an extra pass over the full data to infer the schema?
- Setting `inferSchema` to `True` or `False` makes no visible difference in the rendered `display(df)` table, but it does change the column types: with `True`, numeric columns are parsed (check with `df.printSchema()`); with `False`, every column stays a string.
<br>
- Load results
```python=
display(spark.read.csv(file_path,))
display(spark.read.csv(file_path, header=True))
```
![](https://i.imgur.com/KE0PHC0.png)
<br>
- ### Group by diamond color and compute the average price
```python=
from pyspark.sql.functions import avg, mean

df = spark.read.csv(file_path, header=True)
# select(...) is optional; without it, all columns are used
df2 = df \
    .select('color', 'price') \
    .groupBy('color') \
    .agg(avg('price'), mean('price')) \
    .orderBy(avg('price'), ascending=False)
display(df2)
df2.display()
```

<br>
- ### Switch the chart display



<br>
### [Hello, World 3] Understanding DBFS file path prefixes
:::warning
:warning: **Easy to confuse**: dataset paths vs. OS file paths
:::
- ### The mapping
| OS file path | dbutils dataset path | dbutils dataset path (short form) | accessible via dbutils? |
| --------- | ----------------- | ---------- | ---------- |
| /dbfs/FileStore/ | dbfs:/FileStore/ | /FileStore/ | ✔ |
| /dbfs/databricks/ | dbfs:/databricks/ | /databricks/ | ✗ |
| /dbfs/databricks-datasets/ | dbfs:/databricks-datasets/ | /databricks-datasets/ | ✔ |
| /dbfs/databricks-results/ | dbfs:/databricks-results/ | /databricks-results/ | ✗ |
- dbfs: Databricks File System
- dbutils: Databricks utilities
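The mapping in the table is mechanical enough to capture in a tiny helper. This is an illustrative function, not a Databricks API, that maps a dbutils-style path to its OS path under the `/dbfs` mount:

```python
def dbfs_to_local(path: str) -> str:
    """Map a dbutils-style dataset path to the OS path under /dbfs (illustrative)."""
    if path.startswith("dbfs:/"):
        # dbfs:/X -> /dbfs/X
        return "/dbfs/" + path[len("dbfs:/"):]
    if path.startswith("/"):
        # short form /X -> /dbfs/X
        return "/dbfs" + path
    raise ValueError("unrecognized DBFS path: " + path)

print(dbfs_to_local("dbfs:/databricks-datasets/README.md"))
print(dbfs_to_local("/FileStore/tables/train.csv"))
```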
- ### Databricks runs on Ubuntu 18.04
```python=
import subprocess
print(subprocess.check_output('lsb_release -a', shell=True).decode('utf8'))
```
```
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
```
- ### The OS root directory
```python=
import subprocess
print(subprocess.check_output('ls -ls /', shell=True).decode('utf8'))
```
```
total 100
4 drwxr-xr-x 2 root root 4096 Mar 25 07:21 bin
4 drwxr-xr-x 2 root root 4096 Apr 24 2018 boot
4 -r-xr-xr-x 1 root root 88 Jan 1 1970 BUILD
8 drwxr-xr-x 1 root root 4096 Apr 20 06:10 databricks
4 drwxrwxrwx 2 root root 4096 Apr 20 06:10 dbfs <--- dataset entry point
0 drwxr-xr-x 6 root root 520 Apr 20 06:09 dev
8 drwxr-xr-x 1 root root 4096 Apr 20 06:09 etc
8 drwxr-xr-x 1 root root 4096 Feb 17 08:34 home
4 drwxr-xr-x 10 root root 4096 Feb 17 08:32 lib
4 drwxr-xr-x 2 root root 4096 Jan 18 21:03 lib64
4 drwxr-xr-x 7 ubuntu ubuntu 4096 Apr 20 06:10 local_disk0
4 drwxr-xr-x 2 root root 4096 Jan 18 21:02 media
4 drwxr-xr-x 1 root root 4096 Apr 20 06:09 mnt
4 drwxr-xr-x 4 root root 4096 Feb 17 08:34 opt
0 dr-xr-xr-x 227 nobody nogroup 0 Apr 20 06:09 proc
8 drwxr-xr-x 1 root root 4096 Apr 20 06:13 root
0 drwxr-xr-x 12 root root 540 Apr 20 06:15 run
4 drwxr-xr-x 2 root root 4096 Mar 25 07:21 sbin
4 drwxr-xr-x 2 root root 4096 Jan 18 21:02 srv
0 dr-xr-xr-x 12 nobody nogroup 0 Apr 20 06:09 sys
8 drwxrwxrwt 1 root root 4096 Apr 20 06:45 tmp
4 drwxr-xr-x 10 root root 4096 Apr 20 06:09 usr
8 drwxr-xr-x 1 root root 4096 Apr 20 06:09 var
```
- ### The dataset root directory
```python=
display(dbutils.fs.ls('/'))
# or
display(dbutils.fs.ls('dbfs:/'))
```
```
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0),
FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0),
FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0)]
```

- ### Read test 1: using Python's `open()`
```python=
# Note the file path prefixes
display(dbutils.fs.ls('/databricks-datasets/README.md'))
display(dbutils.fs.ls('dbfs:/databricks-datasets/README.md'))
with open('/dbfs/databricks-datasets/README.md') as f:
print(''.join(f.readlines()))
```

- ### Read test 2: using pyspark
```python=
display(spark.read.text('/databricks-datasets/README.md'))
display(spark.read.text('dbfs:/databricks-datasets/README.md'))
#display(spark.read.text('/dbfs/databricks-datasets/README.md')) # Path does not exist
```

- ### List and delete files from a notebook
```python=
# List the files
%ls -ls /dbfs/FileStore/tables/
# Delete the unwanted files
%rm /dbfs/FileStore/tables/train-1.csv
%rm /dbfs/FileStore/tables/train-2.csv
%rm /dbfs/FileStore/tables/train-3.csv
# Confirm the deletion
%ls -ls /dbfs/FileStore/tables/
```
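Because DBFS is mounted at `/dbfs`, the same cleanup also works from plain Python with `os`. A local sketch, using a temp directory as a stand-in for `/dbfs/FileStore/tables/`:

```python
import os
import tempfile

# Temp dir as a local stand-in for /dbfs/FileStore/tables/
tables_dir = tempfile.mkdtemp()
for name in ("train-1.csv", "train-2.csv", "train.csv"):
    open(os.path.join(tables_dir, name), "w").close()

print(sorted(os.listdir(tables_dir)))               # list the files
os.remove(os.path.join(tables_dir, "train-1.csv"))  # delete an unwanted file
remaining = sorted(os.listdir(tables_dir))          # confirm the deletion
print(remaining)
```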

- ### Related documentation
- [Databricks 檔案系統 (DBFS)](https://docs.microsoft.com/zh-tw/azure/databricks/data/databricks-file-system)

- [Azure Databricks 資料集](https://docs.microsoft.com/zh-tw/azure/databricks/data/databricks-datasets)
- [FileStore](https://docs.microsoft.com/zh-tw/azure/databricks/data/filestore)
<br>
### [Hello, World 4] Upload your own dataset and access it with pyspark
- ### Upload the dataset yourself
![](https://i.imgur.com/4UtqS1h.png)
<br>
![](https://i.imgur.com/rLtEMNM.png)
- ### Or, select a previously uploaded file as the source

- ### [Create Table with UI]
- ### Read the dataframe from a notebook
- Docs
- [pyspark.sql.DataFrameReader.csv](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv)
- Method 1:
```python=
df = spark.read.csv('dbfs:/FileStore/tables/train.csv',
                    header=True, inferSchema=True, sep=',')
display(df)
```
- Method 2:
```python=
# File location and type
file_location = "/FileStore/tables/train.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
display(df)
```
- Method 3:
```sql
%sql
select * from train_csv
```

<br>
### [[Hello, World 5] Databricks in 5 minutes](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3620158951254961/3019048134666355/2171944558356615/latest.html)
<br>
<hr>
<br>
## [Azure Databricks documentation](https://docs.microsoft.com/zh-tw/azure/databricks/scenarios/what-is-azure-databricks)
:::warning
:warning: [**Account limitation**](https://docs.microsoft.com/zh-tw/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal)
This tutorial is not available with an Azure free trial subscription.
If you have a free account, go to your profile and change your subscription to pay-as-you-go.
:::
### [Azure Databricks workspace / Tutorials / Run an ETL job]
- [Tutorial: Extract, transform, and load data by using Azure Databricks](https://docs.microsoft.com/zh-tw/azure/databricks/scenarios/databricks-extract-load-sql-data-warehouse)
### [Azure Databricks workspace / How-to guides / User guide]
- [Notebooks](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/)
- [Visualizations](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/visualizations/) :+1:
- [The display function](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/visualizations/#display-function)
- [The displayHTML function](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/visualizations/#displayhtml-function)
:::info
:information_source: **Autocomplete**

We've improved autocomplete in our Python notebooks. To access autocomplete, press the Tab key. To display Python docstring hints, press Shift + Tab.
:::
### [Azure Databricks workspace / How-to guides / Machine learning and deep learning guide / Machine learning tutorials]
- [Machine learning and deep learning guide](https://docs.microsoft.com/zh-tw/azure/databricks/applications/machine-learning/)
- [10-minute tutorials: Get started with machine learning on Azure Databricks](https://docs.microsoft.com/zh-tw/azure/databricks/applications/machine-learning/tutorial/)
### [Azure Databricks workspace / How-to guides / Data guide]
- [Data sources](https://docs.microsoft.com/zh-tw/azure/databricks/data/data-sources/)
- [CSV files](https://docs.microsoft.com/zh-tw/azure/databricks/data/data-sources/read-csv)
- [Diamonds dataset](https://docs.microsoft.com/zh-tw/azure/databricks/data/databricks-datasets#databricks-datasets) (the diamonds dataset is not listed there?)
- Not accessible from the ML platform

- List all Databricks datasets (path | name | size)

- Print any dataset's README for more details

- [Parquet files](https://docs.microsoft.com/zh-tw/azure/databricks/data/data-sources/read-parquet)
<br>
<hr>
<br>
## References
- ### [DataFrame conversion: pyspark -> pandas](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.toPandas.html?highlight=topandas#pyspark-sql-dataframe-topandas)
- ### [PySpark documentation](https://spark.apache.org/docs/latest/api/python/)
- ### [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)