---
title: Day 4 Spark (Neo)
tags: 'Course notes'
description: None
---
# Spark (Neo)
[TOC]
## Preface
### Docker
* [Get Started - Mac Download](https://www.docker.com/get-started)
* Commands
```
docker container <subcommand>
  ls            list containers
  rm            remove a container
  --name NAME   assign a name to the container
```
Images: `jupyter/base-notebook`, `jupyter/pyspark-notebook`
```
docker run hello-world
```
```
docker run -p 5000:8888 --name base-nb -e GRANT_SUDO=yes --user root -e JUPYTER_ENABLE_LAB=yes -v ~/testing-env:/home/jovyan/work jupyter/base-notebook
```

Docker directly provides a virtual environment (e.g., no need for a dual-boot setup).

---
## Spark
> Distributed, accelerated computation
> Native language: Scala (recommended to learn) -> Java -> Spark
* Getting started: initialize the environment
```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("spark demo") \
    .getOrCreate()
```
* master - where to run and with how many cores (`local` = this machine, `yarn` = a cluster of machines, `standalone` = Spark's built-in cluster manager)
* config - extra settings (e.g., credentials)
* Worker - performs the computation
* Node manager - the "warehouse keeper": tells workers where the data is
* Resource manager - manages resources (compute resources, e.g., CPU)
```
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("106_165-9.csv") \
    .toDF("country","zone","village","tax_unit","total","mean","median","firstQ","thirdQ","std","var")
# inferSchema: lets Spark infer the column types; the more rigorous way is to define the schema yourself
# load: path to the file
```
```
df.select("country").distinct().show(100)
# A Spark advantage: it supports SQL-style syntax
```
**Data storage:** parquet files are recommended (columnar format: individual columns can be read without scanning the whole dataset)
**Packaging:** Maven (a Java build and dependency-management tool)
## Tags 2.0
Data: articles, domain features, RT, CHTdata, CTR