Spark

View the pyspark help

    pyspark --help |& less

    |& : redirects standard error into standard output (so both go through the pipe)

Options

    --master MASTER_URL    (default: local) which cluster to connect to
    --deploy-mode          where the driver runs (client or cluster)
    --executor-memory MEM  (default: 1G) memory per executor
    --executor-cores NUM   (default: all cores; 1 when integrated with YARN) CPU cores per executor
    --driver-cores NUM     (default: 1) CPU cores for the driver

Install pandas and pyarrow

    pip3 install pandas pyarrow

    pyarrow (optimizes data exchange between Python and Spark)

Full PySpark (bundled in $SPARK_HOME/python)

*pip3 list does not show PySpark; it can be made visible (so the full version is found and the lightweight PySpark does not get installed by mistake)

Path: /usr/local/bin/pyspark/
*Because jupyter lab is launched through pyspark, the Session is built beforehand and jupyter applies all the environment settings, so pip3 list run inside jupyter does show the full version.

The full version must be started through a script, and the memory configuration is set inside that script (inconvenient to change), so the full version is suited to running on a cluster.

Lightweight PySpark (requires installing the PySpark package), used on edge nodes

*pip3 list does show PySpark

Lightweight path: /usr/local/lib/python3.8/dist-packages/pyspark/
*When jupyter lab is started with the lightweight version, no Spark Session is found, because spark is pulled in with import, so the Session must be created by hand.

With the lightweight version the memory configuration can be set when creating the Session yourself (easy to modify), so the lightweight version is more flexible and better suited to writing programs.

Environment variable for the lightweight PySpark

It must be loaded at login.

Add a login script under /etc/profile.d/

    nano spark330.sh

    Write:
    export PYARROW_IGNORE_TIMEZONE="1"

Source it under the login account

    . /etc/profile.d/spark330.sh

Check that it took effect

    echo $PYARROW_IGNORE_TIMEZONE
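The same variable can also be set from Python itself, before pyspark.pandas is imported; a small sketch:

```python
import os

# Set the flag only if the login script has not already set it.
os.environ.setdefault("PYARROW_IGNORE_TIMEZONE", "1")
print(os.environ["PYARROW_IGNORE_TIMEZONE"])
```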

Setup after installing jupyterlab

Create the jupyter spark launcher script

Edit the spark launcher script

Configure jupyter so everyone can connect

Find the config file path

Edit the config file

Run it to start: ./pysparklab.sh


Spark application Web UI (4040/tcp)

This is the Web UI of the running application (driver); the standalone cluster manager's own UI is on port 8080.

http://xxxx.example.com:4040

To see which host is running the program: Environment > spark.driver.host

Storage Memory only gets about half of the configured memory (if nothing is set, the default heap is 1G, so Storage Memory is roughly 500 MB).
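The rough "half" figure follows from Spark's unified-memory sizing; a back-of-the-envelope check, assuming the documented defaults (300 MB reserved, spark.memory.fraction = 0.6):

```python
# Estimate the UI's "Storage Memory" figure from Spark's defaults.
heap_mb = 1024        # default 1G heap when nothing is configured
reserved_mb = 300     # memory Spark always reserves for itself
fraction = 0.6        # spark.memory.fraction default
unified_mb = (heap_mb - reserved_mb) * fraction
print(round(unified_mb))  # roughly half of 1G
```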

Spark's RDD computation happens in memory.


If the directory is not on the PATH search list, the command must be prefixed with its path (./)


Spark cluster start commands

1. Switch to the hadoop account

    su - hadoop

2. Go to the spark directory

    cd $SPARK_HOME

3. Start/stop the Spark cluster (run on bdse75 only)

『Start Spark』

1. Start the master

    sbin/start-master.sh

2. Start everyone's workers

    sbin/start-workers.sh

『Stop Spark』

*Note: check the 8080 page and make sure nobody is still running jobs before stopping.

1. Stop everyone's workers

    sbin/stop-workers.sh

2. Stop the master

    sbin/stop-master.sh