---
title: Apache Hadoop Implementation Notes
tags: Apache, Hadoop
description: Apache Hadoop implementation notes
---

# Apache Hadoop Implementation Notes

Hadoop is a framework for distributed storage and processing of large data sets. A deployment consists of one or more clusters, each made up of one or more servers. Data in a cluster is stored in HDFS, the Hadoop Distributed File System, and data processing uses the MapReduce parallel computing model to speed up work on massive amounts of data.

## Setting up Hadoop in a Docker environment

This demo uses Ubuntu 18.04 Docker containers. We need one master host (named `master`) and three slave hosts (named `slave01`, `slave02`, and `slave03`).

First, create a bridge network inside Docker:

```
$ docker network create hadoop
```

Start and enter an Ubuntu 18.04 container (repeat with `--name slave01`, `--name slave02`, and `--name slave03` to create the three slave containers):

```
$ docker run -it --net hadoop --name master ubuntu:18.04 bash
```

**Do the following on both the master and the slave machines.**

Install Java 8:

```
$ apt-get update
$ apt-get install openjdk-8-jdk
$ java -version
```

Install openssh-server, and install sudo (the base image does not include it):

```
$ apt-get install openssh-server
$ service ssh start
$ apt-get -y install sudo
```

Add a user; all subsequent steps are performed as this user:

```
$ useradd -m hadoop -s /bin/bash
$ passwd hadoop
$ adduser hadoop sudo
$ su hadoop
```

Download Hadoop, extract it to /usr/local/, and set the ownership:

```
$ cd ~
$ sudo apt install wget
$ sudo wget https://dlcdn.apache.org/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
$ sudo tar -zxvf ./hadoop-2.10.1.tar.gz -C /usr/local
$ cd /usr/local/
$ sudo mv ./hadoop-2.10.1/ ./hadoop
$ sudo chown -R hadoop:hadoop ./hadoop
```

Edit the .bashrc file:

```
$ nano ~/.bashrc
```

Add the following at the top:

```
# set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

Save, then reload .bashrc:

```
$ source ~/.bashrc
```

**Do the following on the master machine.**

Set up passwordless SSH login: the master must be able to log in to every slave without a password.

```
$ sudo apt-get install ssh
$ mkdir ~/.ssh
$ cd ~/.ssh
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ scp ~/.ssh/id_rsa.pub hadoop@slave01:/home/hadoop/
$ scp ~/.ssh/id_rsa.pub hadoop@slave02:/home/hadoop/
$ scp ~/.ssh/id_rsa.pub hadoop@slave03:/home/hadoop/
```

Then, on each slave (as the hadoop user), append the copied key to the authorized keys with `mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys`.

Test that passwordless ssh login works:

```
$ ssh slave01
$ ssh slave02
$ ssh slave03
```

## Configuring the config files

**Configure core-site.xml:**

```
$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
```

Add the following inside `<configuration>`:

```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
</configuration>
```

**Configure hdfs-site.xml:**

```
$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
```

Add the following inside `<configuration>`:

```
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:50090</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
```

**Configure mapred-site.xml:**

By default, the /usr/local/hadoop/etc/hadoop/ directory contains a mapred-site.xml.template file. Copy it and name the copy mapred-site.xml; this file specifies the framework MapReduce runs on.

```
$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
```

Add the following inside `<configuration>`:

```
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
        <description>MapReduce JobHistory Server IPC host:port</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
        <description>MapReduce JobHistory Server Web UI host:port</description>
    </property>
</configuration>
```

**Configure yarn-site.xml:**

```
$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
```

Add the following inside `<configuration>`:

```
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
        <description>The http address of the RM web application. If only a host is provided as the value, the webapp will be served on a random port.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
        <description>The address of the scheduler interface.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
        <description>The address of the applications manager interface in the RM.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
        <description>The address of the RM admin interface.</description>
    </property>
</configuration>
```

So that Hadoop can recognize all of the current data nodes, edit /usr/local/hadoop/etc/hadoop/slaves and list the data nodes:

```
$ sudo nano /usr/local/hadoop/etc/hadoop/slaves
```

Remove the original localhost entry and add the data node hostnames (one per line):

```
master
slave01
slave02
slave03
```

Next, pack up the hadoop directory on the master and transfer it to all the slave hosts:

```
$ cd /usr/local
$ rm -r ./hadoop/tmp
$ sudo tar -zcf ./hadoop.tar.gz ./hadoop
$ scp ./hadoop.tar.gz slave01:/home/hadoop
$ scp ./hadoop.tar.gz slave02:/home/hadoop
$ scp ./hadoop.tar.gz slave03:/home/hadoop
```

After the transfer finishes, run the following on slave01, slave02, and slave03:

```
$ sudo tar -zxf ~/hadoop.tar.gz -C /usr/local
$ sudo chown -R hadoop:hadoop /usr/local/hadoop
```

The first time you run Hadoop, you must format the name node. Do this on the master, from the Hadoop installation directory:

```
$ cd /usr/local/hadoop
$ bin/hdfs namenode -format
```

Then start Hadoop:

```
$ sbin/start-all.sh
```

If no error messages appear, use jps to check which services are currently running:

```
$ jps
```

Check the node information:

```
$ bin/hdfs dfsadmin -report
```

Stop all services:

```
$ sbin/stop-all.sh
```

---

## References

* https://medium.com/@sleo1104/hadoop-3-2-0-%E5%AE%89%E8%A3%9D%E6%95%99%E5%AD%B8%E8%88%87%E4%BB%8B%E7%B4%B9-22aa183be33a
* https://www.cc.ntu.edu.tw/chinese/epaper/0036/20160321_3609.html
* https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html

## Thank you! :dash:

You can find me on

- GitHub: https://github.com/shaung08
- Email: a2369875@gmail.com
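
## Appendix: smoke-testing the cluster

As a quick end-to-end check (not part of the original walkthrough), you can run the WordCount example that ships with Hadoop. This is a sketch that assumes the standard 2.10.1 binary tarball layout (examples jar under `share/hadoop/mapreduce/`) and a running cluster; run it as the hadoop user on the master, from /usr/local/hadoop.

```
# Put some sample text into HDFS (the bundled config files work fine as input)
$ bin/hdfs dfs -mkdir -p /user/hadoop/input
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/hadoop/input
# Run the bundled WordCount MapReduce job on YARN
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /user/hadoop/input /user/hadoop/output
# Inspect the word counts produced by the reducer
$ bin/hdfs dfs -cat /user/hadoop/output/part-r-00000
```

If the job completes and the final `cat` prints word counts, HDFS, YARN, and MapReduce are all working together.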