---
title: Day 4 Spark (Neo)
tags: 'Course notes'
description: None
---

# Spark (Neo)

[TOC]

## Preface

### Docker

* [Get Started - Mac Download](https://www.docker.com/get-started)
* Commands

```
docker container ...
    ls      - list containers
    rm      - remove a container
    --name  - assign a container name
```

Images used in class: `jupyter/base-notebook`, `jupyter/pyspark-notebook`

```
docker run hello-world
```

```
docker run -p 5000:8888 --name base-nb \
  -e GRANT_SUDO=yes --user root \
  -e JUPYTER_ENABLE_LAB=yes \
  -v ~/testing-env:/home/jovyan/work \
  jupyter/base-notebook
```

![](https://i.imgur.com/RvlNErz.png)

Docker directly provides a virtual environment (e.g., no need for a dual-boot setup).

---

## Spark

> Distributed, accelerated computation
> Native language: Scala (recommended to learn) → Java → Python (PySpark)

* Getting started: initialize the environment

```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("spark demo") \
    .getOrCreate()
```

* `master` - how and where to run (`local[n]` uses n cores on this machine; `yarn` runs on a multi-node cluster; `standalone` uses Spark's own cluster manager)
* `config` - settings (e.g., credentials)
* Worker - performs the computation
* Node Manager - the "warehouse clerk": tells workers where the data is
* Resource Manager - manages resources (compute resources, e.g., CPU)

```
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("106_165-9.csv") \
    .toDF("country", "zone", "village", "tax_unit", "total",
          "mean", "median", "firstQ", "thirdQ", "std", "var")
# inferSchema: lets Spark infer the schema from the data;
# the more rigorous approach is to define the schema yourself
# load: path to the file
```

```
df.select("country").distinct().show(100)
# A Spark advantage: SQL-style syntax is supported
```

**Storage:** Parquet is the recommended file format (columnar, so individual columns can be read without loading the whole dataset).

**Packaging:** Maven (a Java build and dependency management tool).

## Tagging 2.0

Data: articles

domain features

RT CHTdata CTR