Spark
View the pyspark help
pyspark --help |& less
|& : pipes standard error together with standard output (bash shorthand for 2>&1 |)
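The effect of `|&` can be seen with a small demo (plain shell, no Spark needed):

```shell
# Both streams reach the pipe: stderr ("err") is captured alongside stdout ("out").
{ echo out; echo err >&2; } |& sort
# prints: err, then out
```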
Common options
--master MASTER_URL   which cluster to connect to (default: local)
--deploy-mode DEPLOY_MODE   where the driver runs
--executor-memory MEM   executor memory (default: 1G)
--executor-cores NUM   executor CPU cores (default: all available; 1 when integrated with YARN)
--driver-cores NUM   driver CPU cores (default: 1)
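Put together, a launch combining these options might look like the following (the master URL and sizes are illustrative, not values from this environment):

```shell
# Illustrative invocation; adjust the master URL and resource sizes for your cluster.
pyspark \
  --master spark://master.example.com:7077 \
  --executor-memory 2G \
  --executor-cores 2 \
  --driver-cores 1
```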
Install pandas and pyarrow
pip3 install pandas pyarrow
pyarrow: used to optimize data exchange between Python and Spark
Full PySpark (bundled in $SPARK_HOME/python)
* pip3 list does not show this PySpark; it can be made visible, so that the full version is not missed and the lightweight pip package installed by mistake.
Path: /usr/local/bin/pyspark/
* Because jupyter lab is launched through pyspark, a SparkSession is built up front and jupyter gets all the environment settings, so pip3 list run inside jupyter does show the full version.
The full version must be started through a launch script, and the memory configuration lives in that script (inconvenient to change),
so the full version is better suited to running on a cluster.
Lightweight PySpark (install the PySpark package), used on edge nodes
* pip3 list does show PySpark
Lightweight path: /usr/local/lib/python3.8/dist-packages/pyspark/
* When jupyter lab is started with the lightweight version, no Spark session exists yet: Spark is brought in with import, so you must create the SparkSession yourself.
With the lightweight version, the memory configuration can be set when you create the session yourself (easy to modify),
so the lightweight version is more flexible and better suited to development.
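Creating the session yourself with the lightweight PySpark can be sketched as follows (the app name, master URL, and memory sizes are illustrative assumptions):

```python
# Minimal sketch: build a SparkSession manually with the lightweight PySpark.
# The app name, master URL, and memory settings below are assumptions; adjust as needed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("edge-node-dev")               # hypothetical app name
    .master("local[*]")                     # or a cluster URL such as spark://host:7077
    .config("spark.driver.memory", "2g")    # easy to change here, unlike the full version
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)
print(spark.version)
spark.stop()
```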
Environment variable for the lightweight PySpark
It must be loaded at login.
Add a login script under /etc/profile.d/
nano spark330.sh
Contents:
export PYARROW_IGNORE_TIMEZONE="1"
Load it into the current login session:
. /etc/profile.d/spark330.sh
Check that it took effect:
echo $PYARROW_IGNORE_TIMEZONE
Setup after installing jupyterlab
Create a spark shortcut script for jupyter
Edit the spark shortcut script
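The screenshots of the shortcut script are missing; a typical pysparklab.sh using Spark's standard driver-Python variables might look like this (the master URL and memory size are assumptions):

```shell
#!/bin/bash
# pysparklab.sh - sketch of a script that launches jupyter lab through pyspark.
# PYSPARK_DRIVER_PYTHON(_OPTS) are standard Spark variables; the values below are assumptions.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
pyspark --master spark://master.example.com:7077 --executor-memory 1G
```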
Configure jupyter so everyone can connect
Find the configuration file path
Edit the configuration file
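The screenshots of the configuration steps are missing; assuming a recent jupyterlab (where the config is generated with `jupyter lab --generate-config`), the relevant edits are roughly:

```python
# Excerpt from jupyter_lab_config.py (assumed recent jupyterlab with ServerApp options).
c.ServerApp.ip = '0.0.0.0'        # listen on all interfaces so everyone can connect
c.ServerApp.open_browser = False  # headless server: do not try to open a browser
c.ServerApp.port = 8888           # default port (assumption)
```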
Run to start: ./pysparklab.sh
Spark web UI (4040/tcp)
The cluster manager's web UI:
http://xxxx.example.com:4040
To see who is running the program: Environment > spark.driver.host
Storage Memory only gets about half of the configured memory (with nothing set, the default startup memory is 1G, so Storage Memory shows roughly 500 MB).
Spark's RDD computations are in-memory.
If the current directory is not on PATH, the script must be prefixed with ./
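The "about half" figure for Storage Memory can be checked against Spark's unified memory model (assuming the default spark.memory.fraction of 0.6 and the fixed 300 MB of reserved memory):

```python
# Rough check of the "about half" rule using Spark's unified memory model.
# Assumptions: default spark.memory.fraction = 0.6 and 300 MB reserved memory.
heap_mb = 1024                      # default 1G of memory
reserved_mb = 300                   # fixed reserved memory
fraction = 0.6                      # spark.memory.fraction default
unified_mb = (heap_mb - reserved_mb) * fraction
print(round(unified_mb, 1))         # ~434 MB, shown as "Storage Memory" in the UI
```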
Spark cluster start/stop commands
1. Switch to the hadoop account
su - hadoop
2. Go to the spark directory
cd $SPARK_HOME
3. Start/stop the Spark cluster (run only on bdse75)
Start Spark:
1. Start the master
sbin/start-master.sh
2. Start every node's workers
sbin/start-workers.sh
Stop Spark:
* Note: check the 8080 page and make sure nobody is still running jobs before stopping.
1. Stop every node's workers
sbin/stop-workers.sh
2. Stop the master
sbin/stop-master.sh