###### tags: `Hadoop`

# DIY Pseudo-Distributed Cluster Setup

* Hadoop's three modes:
  1. Standalone Mode
     - No background Java daemons are started in this mode; it is suited to development, testing, and debugging.
  2. Pseudo-Distributed Mode
     - All of Hadoop's Java daemons run on the local node, simulating a small-scale cluster.
  3. Fully-Distributed Mode
     - Hadoop's Java daemons run across several hosts; [see the setup guide here](https://hackmd.io/@JeffWen/hadoop).

| Property | Standalone | Pseudo-Distributed | Fully-Distributed |
|:---:|:--------:|:--------:|:--------:|
| fs.defaultFS | file:/// | hdfs:/// | hdfs:/// |
| dfs.replication | N/A | 1 | 3 |
| mapreduce.framework.name | N/A | yarn | yarn |
| yarn.resourcemanager.hostname | N/A | localhost | resourcemanager |
| yarn.nodemanager.aux-services | N/A | mapreduce_shuffle | mapreduce_shuffle |

### <Big>Outline</Big>
:::warning
**1. [Hadoop cluster base setup](#base)
2. [Spark and Jupyter application installation](#spark)
3. [Normal cluster startup/shutdown procedure](#normal)
4. [Pseudo-distributed cluster server image download](#server)
5. [SparkR setup guide](https://hackmd.io/@JeffWen/sparkR)**
:::

<h3 id="base">Hadoop Cluster Base Setup</h3>

0. Prerequisites
    1. A Linux machine (either Desktop or Server is fine)
    :information_source: [Ubuntu 18.04 LTS Desktop installation guide](https://hackmd.io/@JeffWen/Ski-zUu0H)
    :information_source: [Ubuntu 18.04 LTS Server installation guide](https://hackmd.io/@JeffWen/ByVQYR2M8)
    2. OpenSSH installed on the Linux machine
    :information_desk_person: [OpenSSH primer](https://zh.wikipedia.org/wiki/OpenSSH)

1. Disable IPv6 (**as root**)
    1. Check the network interfaces and listening sockets (switch to root first)
    ```bash=
    ip addr show
    lsof -nPi
    ```
    ![](https://i.imgur.com/mB24bEw.png)
    2. Edit the boot loader configuration: add `ipv6.disable=1` to the `GRUB_CMDLINE_LINUX` line, as shown in the screenshot
    ```bash
    nano /etc/default/grub
    ```
    ![](https://i.imgur.com/Ruf4tH9.png)
    3. Regenerate the boot configuration
    ```bash
    update-grub    # or: update-grub2
    ```
    4. Reboot
    ```bash
    reboot
    ```
    5. Confirm that IPv6 is now disabled
    ```bash=
    ip addr show
    lsof -nPi
    ```
2. Install pip (**as root**)
    1. Install the Python development packages
    ```bash=
    sudo apt update
    sudo apt install python3-dev
    ```
    2. Install pip
    ```bash=
    # Fetch the latest pip bootstrap script
    wget https://bootstrap.pypa.io/get-pip.py
    python3 get-pip.py
    ```
---
3. Create the hadoop account (**as root**)
    1. Add the hadoop user
    ```bash
    sudo adduser hadoop
    ```
    2. Verify that the account was created
    ```bash=
    grep 'hadoop' /etc/passwd
    grep 'hadoop' /etc/group
    grep 'hadoop' /etc/shadow
    ls -l /home
    ```
4. Install OpenJDK 8 (**as root**)
    1. Update the package lists
    ```bash
    apt update
    ```
    2. Install OpenJDK
    ```bash
    apt install openjdk-8-jdk
    ```
    3. Confirm the JDK and JRE versions
    ```bash=
    java -version
    javac -version
    ```
    4. Create a profile script for the OpenJDK environment variable
    ```bash
    nano /etc/profile.d/jdk.sh
    ```
    5. Set the OpenJDK environment variable
    ```bash
    export JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
    ```
    ![](https://i.imgur.com/NRulRia.png)
    6. Reload the profile script and confirm the setting took effect
    ```bash
    source /etc/profile.d/jdk.sh    # or: . /etc/profile.d/jdk.sh
    echo $JAVA_HOME                 # should print the JDK path set above
    ```
5. Set up passwordless login (**as hadoop**)
    1. Switch to the hadoop account
    ```bash
    su - hadoop
    ```
    2. Generate an SSH key pair
    ```bash
    ssh-keygen -t rsa
    ```
    3. Copy the freshly generated public key to the hadoop account
    ```bash
    ssh-copy-id hadoop@localhost
    ```
    4. Test the passwordless login (no password prompt means it worked)
    ```bash
    ssh hadoop@localhost
    ```
    :warning:**Remember to exit immediately after the test!!!**
---
6. Create the Linux hosts entries (**as root**)
    ```bash=
    nano /etc/hosts
    ```
    ![](https://i.imgur.com/oZlHEDr.png)
---
7. Download and install Hadoop (**as root**)
    1. Download
    ```bash=
    cd
    wget http://ftp.tc.edu.tw/pub/Apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
    ```
    :information_desk_person: If the mirror link goes stale, download it from the [official releases page~~](https://hadoop.apache.org/releases.html)
    2. Extract
    ```bash
    tar -tvf hadoop-3.2.1.tar.gz    # inspect the archive contents first
    tar -xvf hadoop-3.2.1.tar.gz -C /usr/local
    ```
    3. Rename
    ```bash
    mv /usr/local/hadoop-3.2.1 /usr/local/hadoop
    ```
    4. Change the owner of the directory and its files
    ```bash
    chown -R hadoop:hadoop /usr/local/hadoop
    ```
---
8. Set the hadoop user's environment variables (**as hadoop**)
    1. Edit .bashrc
    ```bash
    nano ~/.bashrc
    ```
    ![](https://i.imgur.com/FdzPaih.png)
    ```bash=
    # Set HADOOP_HOME
    export HADOOP_HOME=/usr/local/hadoop
    # Set HADOOP_MAPRED_HOME
    export HADOOP_MAPRED_HOME=${HADOOP_HOME}
    # Add Hadoop bin and sbin directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    ```
    2. Reload the configuration
    ```bash
    source ~/.bashrc    # or: . .bashrc
    ```
    3. Check the environment variables (a quick sanity check is sketched right below)
    ![](https://i.imgur.com/o9tQbvl.png)
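A minimal way to confirm the variables actually took effect in the current shell, assuming the `.bashrc` entries above; `hadoop version` only resolves if `PATH` now includes `$HADOOP_HOME/bin`:

```bash=
# Show the Hadoop-related variables exported by .bashrc
env | grep HADOOP
echo $PATH

# If PATH was extended correctly, the hadoop CLI resolves without a full path
which hadoop
hadoop version
```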
---
9. Edit the Hadoop runtime environment script (**as hadoop**)
    ```bash
    nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
    ```
    ![](https://i.imgur.com/SaJCOJS.png)
    ```bash=
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    ```
---
10. Edit Hadoop's core-site.xml (**as hadoop**)
    ```bash
    nano /usr/local/hadoop/etc/hadoop/core-site.xml
    ```
    ![](https://i.imgur.com/0TdOD3O.png)
    ```xml=
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data</value>
        <description>Temporary Directory.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bdse.example.org</value>
        <description>Use HDFS as file storage engine</description>
    </property>
    ```
    :information_source: Hadoop 3.2.0 and later provide a configuration syntax checker
    ```bash
    hadoop conftest
    ```
    ![](https://i.imgur.com/k8nk5y2.png)
---
11. Edit Hadoop's mapred-site.xml (**as hadoop**)
    ```bash
    nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
    ```
    ![](https://i.imgur.com/fbKcXe0.png)
    ```xml=
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>bdse.example.org:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>bdse.example.org:19888</value>
    </property>
    ```
---
12. Edit Hadoop's yarn-site.xml (**as hadoop**)
    ```bash
    nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
    ```
    ![](https://i.imgur.com/KWTS3L6.png)
    ```xml=
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>bdse.example.org</value>
    </property>
    <!-- Adjust the maximum vcores to your workload (default max=4)
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>2</value>
    </property>
    -->
    <!-- Adjust the maximum memory (MB) to your workload (default max=8192)
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    -->
    ```
---
13. Edit Hadoop's hdfs-site.xml (**as hadoop**)
    ```bash
    nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
    ```
    ![](https://i.imgur.com/oVJnLVh.png)
    ```xml=
    <property>
        <name>dfs.permissions.superusergroup</name>
        <value>hadoop</value>
        <description>The name of the group of super-users. The value should be a single group name.</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
    </property>
    ```
---
14. Create the Hadoop workers file (**as root**)
    ```bash
    nano /usr/local/hadoop/etc/hadoop/workers
    ```
    ![](https://i.imgur.com/DgAhgdI.png)
    :information_desk_person: The principal doubles as the janitor: a single node plays every role
---
15. Format the NameNode (**as hadoop**)
    ```bash
    hdfs namenode -format
    ```
---
16. Start HDFS (**as hadoop**)
    ```bash
    start-dfs.sh
    ```
    ![](https://i.imgur.com/QFJH83m.png)
---
17. Start YARN (**as hadoop**)
    ```bash
    start-yarn.sh
    ```
    ![](https://i.imgur.com/zMkms3i.png)
---
18. Start the History Server (**as hadoop**)
    ```bash=
    mapred --daemon start historyserver
    # mr-jobhistory-daemon.sh start historyserver (deprecated)
    ```
    ![](https://i.imgur.com/3qOJtzW.png)
---
19. Check the running Java daemons (a helper for spotting a missing daemon is sketched below)
    ```bash
    jps
    ```
    ![](https://i.imgur.com/mNEFV3N.png)
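In a healthy pseudo-distributed setup, `jps` should list the daemons below (PIDs will differ). A minimal sketch to check each one, assuming the steps above were all followed:

```bash=
# Expected daemons for this pseudo-distributed setup; drop
# JobHistoryServer from the list if you skipped step 18.
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager JobHistoryServer; do
    if jps | grep -qw "$daemon"; then
        echo "OK      $daemon"
    else
        echo "MISSING $daemon"
    fi
done
```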
20. Run a pi job to test MapReduce (**as hadoop**)
    ```bash
    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 30 100
    ```
    ![](https://i.imgur.com/AVmxAeq.png)
    ![](https://i.imgur.com/XgP9SCA.png)
    :information_desk_person: The hadoop user's HDFS directory is created automatically
    ![](https://i.imgur.com/U1AlSHk.png)

<Big>Congratulations, the first stage of the Hadoop base setup is done~~~</Big>

---

<h3 id="spark">Spark and Jupyter Application Installation</h3>

0. First make sure the cluster services are running
**:information_source: See the [cluster startup/shutdown procedure](#normal)**
---
1. Download and install Spark (**as root**)
    1. Download
    ```bash=
    cd
    wget http://ftp.tc.edu.tw/pub/Apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
    ```
    2. Extract
    ```bash
    tar -xvf spark-2.4.4-bin-hadoop2.7.tgz -C /usr/local
    ```
    3. Rename
    ```bash
    mv /usr/local/spark-2.4.4-bin-hadoop2.7 /usr/local/spark
    ```
    4. Change the owner of the Spark directory and its files
    ```bash
    chown -R hadoop:hadoop /usr/local/spark
    ```
---
2. Set the Spark environment variables (**as hadoop**)
    1. Edit .bashrc
    ```bash
    nano ~/.bashrc
    ```
    ![](https://i.imgur.com/yxskMpp.png)
    2. Reload the configuration
    ```bash
    source ~/.bashrc    # or: . .bashrc
    ```
    3. Check the environment variables
    ![](https://i.imgur.com/ML51cdF.png)
---
3. Edit the Spark runtime environment script (**as hadoop**)
    1. Create a spark-env script from the template
    ```bash
    cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
    ```
    2. Edit the spark-env script
    ```bash
    nano /usr/local/spark/conf/spark-env.sh
    ```
    ![](https://i.imgur.com/7y189i7.png)
    ```bash=
    #export SPARK_MASTER_IP="192.168.34.81"    # only needed for Spark Standalone
    export PYSPARK_PYTHON=python3
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    ```
---
4. Run a pi job to test Spark (**as hadoop**)
    ```bash=
    cd $SPARK_HOME
    # --class only applies to JVM applications, so it is omitted here:
    # the job being submitted is the Python pi.py example
    ./bin/spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --driver-memory 1g \
        --executor-memory 1g \
        --queue default \
        /usr/local/spark/examples/src/main/python/pi.py 100
    ```
    ![](https://i.imgur.com/joYDIIC.png)
    ![](https://i.imgur.com/Gt5qy10.png)
---
5. Use the PySpark shell (**as hadoop**)
    1. Use Spark's README as test data (the HDFS upload commands are sketched at the end of this note)
    ![](https://i.imgur.com/nlbyiVx.png)
    2. Open the PySpark shell
    ```bash=
    cd $SPARK_HOME
    ./bin/pyspark --master yarn --deploy-mode client
    ```
    ![](https://i.imgur.com/MWK0j2d.png)
    ![](https://i.imgur.com/TUbP6G7.png)
    3. Run some code
    ![](https://i.imgur.com/SlLsUzl.png)
---
6. Install the Jupyter family and the pyspark package (**as root**)
    1. Install the pyspark package
    ```bash=
    pip3 install pyspark
    ```
    2. Install the Jupyter packages
    ```bash=
    pip3 install jupyterlab
    ```
---
7. Enable remote access to Jupyter and set a password (**as a regular user**)
    1. Create the Jupyter configuration file
    ```bash
    jupyter notebook --generate-config
    ```
    2. Edit the configuration file
    ```bash
    nano .jupyter/jupyter_notebook_config.py
    ```
    3. Listen on all interfaces
    ```bash
    c.NotebookApp.ip = '0.0.0.0'
    ```
    ![](https://i.imgur.com/YlJI0fJ.png)
    4. Set a password
    ```bash
    jupyter notebook password
    ```
    5. Start Notebook or Lab
    ```bash
    jupyter notebook    # or: jupyter lab
    ```
    **You can then log in from a browser**, e.g. 192.168.34.81:8888

<Big>Congratulations, the application installation is complete~~~</Big>

---
:information_source: To change the port or log in with a hostname alias, see the examples below
![](https://i.imgur.com/uTYYOZW.png)
1. Change the port
![](https://i.imgur.com/853wgNR.png)
Edit the jupyter_notebook_config.py file
2. Log in from a Windows browser using an alias
![](https://i.imgur.com/UU3EMfy.png)
Edit the hosts file
![](https://i.imgur.com/bercgAR.png)
Add the FQDN and the alias
---
:warning: Spark reads from HDFS by default~~
![](https://i.imgur.com/3KSpq0s.png)
**This is because spark-env.sh sets the path to the Hadoop configuration**
![](https://i.imgur.com/h9mx202.png)
To read a local file instead, use
```python=
spark.read.csv("file:///PATH")    # PATH is the local file path
```
---

<h3 id="normal">Cluster Startup/Shutdown Procedure</h3>

1. Cluster startup
```bash=
start-dfs.sh
start-yarn.sh
mapred --daemon start historyserver
```
2. Cluster shutdown
```bash=
mapred --daemon stop historyserver
stop-yarn.sh
stop-dfs.sh
```
---

<h3 id="server">Pseudo-Distributed Cluster Server Image Download</h3>

**[Download link](https://github.com/JeffWen0105/wen/tree/master/iiiEduBdse/release/bdse12VM)**
1. The compressed files total 4.32 GB
2. Use VMware Player or a more capable hypervisor
3. Read README.txt before use
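---
As referenced in the PySpark shell step above, here is a minimal sketch (run as hadoop) of staging Spark's README into HDFS so the shell test has data to read. The `/user/hadoop` target directory is an assumption based on the usual HDFS home-directory convention:

```bash=
# Create the hadoop user's HDFS home (the pi job above may have done this
# already) and stage Spark's README there; /user/hadoop is assumed.
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /usr/local/spark/README.md /user/hadoop/
hdfs dfs -ls /user/hadoop

# In the PySpark shell, a bare path then resolves against HDFS because
# HADOOP_CONF_DIR is set; prefix file:/// to force a local read instead:
#   spark.read.text("README.md").count()
```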