Spark

View the pyspark help

    pyspark --help |& less

    |& : redirects standard error into standard output (so both go through the pipe)

Options

    --master MASTER_URL    (default: local) which cluster to connect to
    --deploy-mode          where the driver runs (client or cluster)
    --executor-memory MEM  (default: 1G) memory per executor
    --executor-cores NUM   (default: all cores; 1 when integrated with YARN) CPU cores per executor
    --driver-cores NUM     (default: 1) CPU cores for the driver

Install pandas and pyarrow

    pip3 install pandas pyarrow

    pyarrow (optimizes data exchange between Python and Spark)

Full PySpark (bundled in $SPARK_HOME/python)

*pip3 list does not show PySpark; it can be made visible (so the full version is found and the lightweight PySpark does not get installed by mistake)

Path: /usr/local/bin/pyspark/
*Because jupyter lab is launched through pyspark, the Session is built beforehand and jupyter applies all the environment settings, so pip3 list run inside jupyter does show the full version.

The full version must be started through a script, and the memory configuration is set inside that script (inconvenient to change), so the full version is suited to running on a cluster.

Lightweight PySpark (requires installing the PySpark package), used on edge nodes

*pip3 list does show PySpark

Lightweight path: /usr/local/lib/python3.8/dist-packages/pyspark/
*When jupyter lab is started with the lightweight version, no Spark Session is found, because spark is pulled in with import, so the Session must be created by hand.

With the lightweight version the memory configuration can be set when creating the Session yourself (easy to modify), so the lightweight version is more flexible and better suited to writing programs.

Environment variable for the lightweight PySpark

It must be loaded at login.

Add a login script under /etc/profile.d/

    nano spark330.sh

    Write:
    export PYARROW_IGNORE_TIMEZONE="1"

Source it under the login account

    . /etc/profile.d/spark330.sh

Check that it took effect

    echo $PYARROW_IGNORE_TIMEZONE
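The same variable can also be set from Python itself, before pyspark.pandas is imported; a small sketch:

```python
import os

# Set the flag only if the login script has not already set it.
os.environ.setdefault("PYARROW_IGNORE_TIMEZONE", "1")
print(os.environ["PYARROW_IGNORE_TIMEZONE"])
```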

Setup after installing jupyterlab

Create the jupyter spark launcher script

Edit the spark launcher script

Configure jupyter so everyone can connect

Find the config file path

Edit the config file

Run it to start: ./pysparklab.sh


Spark application Web UI (4040/tcp)

This is the Web UI of the running application (driver); the standalone cluster manager's own UI is on port 8080.

http://xxxx.example.com:4040

To see which host is running the program: Environment > spark.driver.host

Storage Memory only gets about half of the configured memory (if nothing is set, the default heap is 1G, so Storage Memory is roughly 500 MB).
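The rough "half" figure follows from Spark's unified-memory sizing; a back-of-the-envelope check, assuming the documented defaults (300 MB reserved, spark.memory.fraction = 0.6):

```python
# Estimate the UI's "Storage Memory" figure from Spark's defaults.
heap_mb = 1024        # default 1G heap when nothing is configured
reserved_mb = 300     # memory Spark always reserves for itself
fraction = 0.6        # spark.memory.fraction default
unified_mb = (heap_mb - reserved_mb) * fraction
print(round(unified_mb))  # roughly half of 1G
```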

Spark's RDD computation happens in memory.


If the directory is not on the PATH search list, the command must be prefixed with its path (./)


Spark cluster start commands

1. Switch to the hadoop account

    su - hadoop

2. Go to the spark directory

    cd $SPARK_HOME

3. Start/stop the Spark cluster (run on bdse75 only)

『Start Spark』

1. Start the master

    sbin/start-master.sh

2. Start everyone's workers

    sbin/start-workers.sh

『Stop Spark』

*Note: check the 8080 page and make sure nobody is still running jobs before stopping.

1. Stop everyone's workers

    sbin/stop-workers.sh

2. Stop the master

    sbin/stop-master.sh