李念臻
# Building a Hadoop/Spark Cloud Platform — Simulating a Cluster with VirtualBox

[Reference: Python+Spark2.0+Hadoop 機器學習與大數據分析](https://pythonsparkhadoop.blogspot.com/)
[Reference: 30天認識主流大數據框架:Hadoop + Spark + Flink](https://ithelp.ithome.com.tw/users/20138939/ironman/6415)

## Introduction

This tutorial builds a Hadoop multi-node cluster, uses it as Spark's data-processing engine, runs Python on top of it, and finally runs Spark from a Jupyter Notebook. The overall flow follows [Python+Spark2.0+Hadoop 機器學習與大數據分析].

### Version compatibility

The process involves a tangle of version dependencies — Java, Python, Hadoop, Spark, and so on. I redid several steps more than once because of them (hitting so many problems is what convinced me to write this tutorial). If you use different versions, check compatibility first. For example, the Spark documentation notes (excerpt from the 3.5.x docs):

![upload_a38e4c9a57a19087bbb785b53e677e72](https://hackmd.io/_uploads/r1-KnSskbx.png)
[Reference](https://spark.apache.org/docs/3.5.4/)
![image](https://hackmd.io/_uploads/Hk3QZAYyWl.png)
[Reference](https://spark.apache.org/downloads.html)

A community-maintained table of Hadoop/Java compatibility:
![image](https://hackmd.io/_uploads/SyvIbRK1be.png)
[Reference](https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions)

#### Checking the exact Java version you are about to install

* **Check openjdk-8-jdk**
![upload_e77e086b2ede2f2a2e789fea913913bf](https://hackmd.io/_uploads/BJIqpro1-l.png)

## Environment and versions

* OS: Windows 11
* VirtualBox: 7.0.26
* Guest OS: ubuntu-20.04.6 x64
* Hadoop: 3.3.6 [download link](https://archive.apache.org/dist//hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz)
* Scala: 3.3.6 [download link](https://github.com/scala/scala3/releases/download/3.3.6/scala3-3.3.6.tar.gz)
* Spark: 3.4.1 [download link](https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz)
* Anaconda: 2023.03-0-Linux-x86_64 [download link](https://repo.anaconda.com/archive/Anaconda3-2023.03-0-Linux-x86_64.sh), [other versions](https://repo.continuum.io/archive/index.html)

## Architecture

We first build one VM and set up single-node Hadoop, then clone it into a cluster with one master and three workers:
![image](https://hackmd.io/_uploads/B1WK17iJ-e.png)

## Building the virtual environment

### Install VirtualBox

Mostly you can just keep clicking Next. If you hit this warning:
![螢幕擷取畫面 2025-10-14 155539](https://hackmd.io/_uploads/ByFaOHoJWx.png)
(screenshot from the web, so the version differs from this article)
it means the Microsoft Visual C++ Redistributable must be installed first:
1. Search for the [latest supported Microsoft Visual C++ Redistributable downloads](https://learn.microsoft.com/zh-tw/cpp/windows/latest-supported-vc-redist?view=msvc-170#latest-supported-redistributable-version)
2. On the page shown below, download and install the x64 build (pick the build matching your OS; on Windows you can check System Info in Settings)
![螢幕擷取畫面 2025-10-14 160050](https://hackmd.io/_uploads/HyE4FSi1Wl.png)

### Create the VM

Click New to start (name it HadoopNV). Set the VM name, folder, and ISO as shown (once the ISO is selected, the OS fields usually auto-fill):
![upload_15ecf599838d3e2a76d0e5ea971272e5](https://hackmd.io/_uploads/BJVBqBiJbx.png)
On the next page, set the username and password:
![upload_bc152a46a713b984f1c9f33ce851b96f](https://hackmd.io/_uploads/r1Sq5rokZl.png)
Then configure the virtual disk as shown. The more you can allocate the better — be generous with CPU and memory too.
![upload_04c114f89f89b2210a171be76dacc379](https://hackmd.io/_uploads/S1PpcroJZx.png)

### VM setup

<!-- #### 設置最佳下載伺服器 避免因伺服器連線問題而無法安裝或更新 ![upload_935643853575c98ea43d2256f827e8d6](https://hackmd.io/_uploads/Sk9ZsHsybe.png) 1. 在設定中找到 software & updates 2. 點選 Ubuntu Software 3. 點選download from 下拉清單按鈕 4. 選擇其他 5. 選擇 select best server (如圖) 之後自行接續操作 --- -->

#### Shared clipboard

![upload_f3f3dbbee2e3f554204f4c16e5eec983](https://hackmd.io/_uploads/H1cmiSokZl.png)
Configure as shown.

---

#### Terminal

![upload_e4551ed13a912da889a9646ce34494ca](https://hackmd.io/_uploads/SJK4orjk-l.png)
Click Activities (top left) to bring up the search box, find terminal, and drag its icon to the dock so it is handy later.

**[Tip: terminal will not launch]**
If clicking the terminal icon (or any other way of launching it) does nothing, try this fix:
![upload_f0f47ae88d62012b5d9103e349a44f5e](https://hackmd.io/_uploads/rkMLjBikZe.png)
1. The simplest way is to open Settings → Region & Language, as shown above.
2. The default is US (as shown); switch to any other country's English locale.
3. Reboot.
[Reference](https://www.youtube.com/watch?v=ewFvNOP2oKc)

---

#### Guest Additions CD

![upload_bbfe86e30514f147668d00b2b1555647](https://hackmd.io/_uploads/ryEPjHs1Zl.png)
Find Insert Guest Additions CD image and run it (this runs the VirtualBox Guest Additions installer).
Alternatively, click the CD icon in the dock to open the window shown, then press Run Software at the top right:
![upload_f1dd6cff4b6508cc17c7fa945c52dd4e](https://hackmd.io/_uploads/B11tjBi1Zg.png)

---

**[Tip: the Guest Additions installer does nothing, or shows messages like these]**
![upload_5fef5112689df55f19aeebe8fb5637b8](https://hackmd.io/_uploads/BJdoiSoJ-g.png)
![upload_7618d3963f0605be36532b71c45abc04](https://hackmd.io/_uploads/rkx3jro1Zl.png)
and so on. The key message is: "This system is currently not set up to build kernel modules. Please install the gcc make perl packages." That means the packages needed to build kernel modules are missing:
1. Open a terminal
2. Run:
```
sudo apt update
sudo apt install build-essential dkms
```
These two lines refresh the package lists and install gcc, make, perl, and the files usually needed to compile kernel modules.
3. Re-run the VirtualBox Guest Additions installer.

---

## Install Hadoop (single node)

Run the following in the terminal:

* **Update apt-get:**
```
sudo apt-get update
```

---

**[Tip: "User is not in the sudoers file."]** If you don't hit this, skip past the divider to the Java install.
![upload_330e9e9b9891853568e9307c7d9b458d](https://hackmd.io/_uploads/BkmATBo1bg.png)
To fix it:
![upload_a5ed09dccce7e33f51ba54b5ba635dda](https://hackmd.io/_uploads/ByfeAHsk-g.png)
Hold Shift during boot to enter:
![upload_294af20ff7a4ad305bf5019eecc320ee](https://hackmd.io/_uploads/S1i-RHiybg.png)
Select:
![upload_4655b26c59488f418b1711171bd76c84](https://hackmd.io/_uploads/SkeVAHs1-e.png)
![upload_7b5a08c25d0c30f37acc8c4f81d6c16a](https://hackmd.io/_uploads/BynNCrjyZx.png)
Type as shown:
![upload_a0947ad25b22b66078b9251e02776959](https://hackmd.io/_uploads/rk18RSjJbe.png)
[Reference](https://www.youtube.com/watch?v=ZxOwFOtcaaA)

---

* **Install Java and find its install path:**
```
sudo apt install openjdk-8-jdk
update-alternatives --display java
```
![image](https://hackmd.io/_uploads/Sk17Vyq1Zx.png)
For Java 8 the install path is /usr/lib/jvm/java-8-openjdk-amd64 — we will need it shortly.

* **Set up SSH (generate an RSA key):**
```
sudo apt-get install ssh
sudo apt-get install rsync
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```
![螢幕擷取畫面 2025-11-06 150318](https://hackmd.io/_uploads/BJ5fi6KkWx.png)
Note: -P '' sets an empty passphrase, for passwordless SSH login.

* **Test SSH**
```
ssh localhost
```
![螢幕擷取畫面 2025-11-06 150329](https://hackmd.io/_uploads/H1Y-jTtyZg.png)

* **Exit SSH**
```
exit
```

* **Download Hadoop and move it into place**
```
wget https://archive.apache.org/dist//hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -zxvf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
ls /usr/local/hadoop
```
![image](https://hackmd.io/_uploads/HyHUyJqy-g.png)

* **Edit hadoop-env.sh:**
```
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```
![image](https://hackmd.io/_uploads/HyBB-1qJ-l.png)
Uncomment the line and add the Java path:
![image](https://hackmd.io/_uploads/H1CcZ151Wg.png)
Fill in whatever your own Java path is.

* **Edit ~/.bashrc:**
```
sudo nano ~/.bashrc
```
Append the following at the very bottom of the file:
```
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
#Hadoop Variables
```
![image](https://hackmd.io/_uploads/BJZsYWoJbl.png)
Apply the changes:
```
source ~/.bashrc
```

* **Verify the Hadoop install:**
```
hadoop classpath
```
![image](https://hackmd.io/_uploads/r1EYl-5k-g.png)
If this common command prints output instead of "command not found", the command is available.

* **Edit core-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
```
Insert the following inside <configuration></configuration>:
```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```
![image](https://hackmd.io/_uploads/rkWcFyckbl.png)

* **Edit hdfs-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
```
Insert the following inside <configuration></configuration>:
```
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
```
![image](https://hackmd.io/_uploads/SJ02ZeqyZe.png)
* dfs.replication=3 — the number of data replicas (think of it as how many nodes hold a copy behind the scenes); correct for our multi-node cluster, where each block gets 3 copies.
* dfs.namenode.name.dir — local path where the NameNode stores its metadata; in a cluster each NameNode needs its own directory.
* dfs.datanode.data.dir — local directory where each DataNode stores HDFS blocks; each DataNode needs a local disk path.

* **Edit yarn-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
```
Insert the following inside <configuration></configuration>:
```
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```
![image](https://hackmd.io/_uploads/ByXlMl5JZx.png)

* **Edit mapred-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
```
Insert the following inside <configuration></configuration>:
```
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```

* **Create the directories:**
```
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
```

* **Format the directories:**
Note: the two "user"s on the first line are the username you chose when creating the VM.
```
sudo chown user:user -R /usr/local/hadoop
hdfs namenode -format
```

* **Start Hadoop:**
```
start-dfs.sh
start-yarn.sh
jps
```
![image](https://hackmd.io/_uploads/rJ_V5-5kZe.png)
If the following all appear, it worked:
* SecondaryNameNode
* ResourceManager
* NodeManager
* DataNode
* NameNode

---

**[Tip: ERROR: Cannot set priority of datanode process]**
![image](https://hackmd.io/_uploads/HJGRc-5J-x.png)
The fix is to raise the limits:
```
sudo nano /etc/security/limits.conf
```
Append at the very bottom of the file:
```
@hadoop hard nice -15
@hadoop hard priority -15
```
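Stepping back from the tips: the four `*-site.xml` files edited above all share the same `<configuration>`/`<property>` shape, with one `<name>`/`<value>` pair per setting. As a sanity check before pasting into nano, here is a small Python sketch (a hypothetical helper, not part of Hadoop) that renders such name/value pairs into that XML:

```python
# Hypothetical helper: render Hadoop-style name/value settings into the
# <configuration> XML that core-site.xml / hdfs-site.xml expect.
import xml.etree.ElementTree as ET

def render_hadoop_conf(props: dict) -> str:
    root = ET.Element("configuration")
    for name, value in props.items():
        p = ET.SubElement(root, "property")
        ET.SubElement(p, "name").text = name
        ET.SubElement(p, "value").text = value
    return ET.tostring(root, encoding="unicode")

# The hdfs-site.xml settings used in this tutorial:
hdfs_site = render_hadoop_conf({
    "dfs.replication": "3",
    "dfs.namenode.name.dir": "file:/usr/local/hadoop/hadoop_data/hdfs/namenode",
    "dfs.datanode.data.dir": "file:/usr/local/hadoop/hadoop_data/hdfs/datanode",
})
print(hdfs_site)
```

The output is one flat `<configuration>…</configuration>` block; Hadoop ignores whitespace, so it is equivalent to the indented snippets above.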
[Reference for the tip above](https://issues.apache.org/jira/browse/HDFS-13397)

---

* **Check the single-node result in Firefox:**
http://localhost:8088
![image](https://hackmd.io/_uploads/Bk9DhZ5yWl.png)
http://localhost:9870
![image](https://hackmd.io/_uploads/BJHgTb9kWg.png)

## Install the Hadoop Multi-Node Cluster

A reminder before you start: at every step, be sure which machine you are modifying.

### Clone the VM as dataNV1

Follow the screenshots in order (top to bottom):
![upload_456f4afb8018c350b2cd86a7dda5e101](https://hackmd.io/_uploads/SySXJLok-x.png)
![image](https://hackmd.io/_uploads/H1seRbcJ-x.png)
![image](https://hackmd.io/_uploads/BJzBR-qybl.png)
Next:
![upload_8e9fb54864e3337ddfa0f2a668b1f115](https://hackmd.io/_uploads/ry-FkIi1Zx.png)
When done, open Settings and configure two network adapters:
![upload_38a795f4efd8a415694223b938514235](https://hackmd.io/_uploads/B13qkUsJWx.png)
![upload_57860bf55d755cda27286246fc4c51c1](https://hackmd.io/_uploads/rJCqkIok-x.png)

### Configure the dataNV1 VM

* **Network interfaces:**
```
sudo nano /etc/netplan/01-network-manager-all.yaml
```
The file opens like this:
![upload_360fdc89cfa0bcd96e55410456e066e6](https://hackmd.io/_uploads/r1k61IskWl.png)
Add the following, setting the internal IP to 192.168.56.101:
```
  ethernets:
    # NIC 1: DHCP - for external connectivity (enp0s3 gets its IP via DHCP)
    enp0s3:
      dhcp4: yes
    # NIC 2: static IP - for the internal / host-only network (enp0s8)
    enp0s8:
      dhcp4: no
      addresses: [192.168.56.101/24]
```
![image](https://hackmd.io/_uploads/S1hIlz9kbx.png)
Apply the settings:
```
sudo netplan apply
```
![image](https://hackmd.io/_uploads/B1eTrMqy-g.png)

* **Hostname and hosts:**
```
sudo nano /etc/hostname
```
Change as shown:
![image](https://hackmd.io/_uploads/Skh0ffckbe.png)
After a reboot you can see the hostname has changed:
![image](https://hackmd.io/_uploads/HJxgEzckWg.png)
```
sudo nano /etc/hosts
```
![upload_44bbb7a9c966a1834401de6d20c12ef2](https://hackmd.io/_uploads/rkiklUikWl.png)
Append:
```
192.168.56.100 masterNV
192.168.56.101 dataNV1
192.168.56.102 dataNV2
192.168.56.103 dataNV3
```
![image](https://hackmd.io/_uploads/r1v8Xz9k-g.png)

* **Edit core-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
```
Change the content inside <configuration></configuration> to:
```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://masterNV:9000</value>
</property>
```
![image](https://hackmd.io/_uploads/SkP8LGq1Wl.png)

* **Edit hdfs-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
```
Change the content inside <configuration></configuration> to:
```
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
```
![image](https://hackmd.io/_uploads/ryXa8fqkWl.png)

* **Edit yarn-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
```
Change the content inside <configuration></configuration> to:
```
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>masterNV:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>masterNV:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>masterNV:8050</value>
</property>
```
![image](https://hackmd.io/_uploads/ry6VvG51We.png)

* **Edit mapred-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
```
Change the content inside <configuration></configuration> to:
```
<property>
  <name>mapred.job.tracker</name>
  <value>masterNV:54311</value>
</property>
```
![image](https://hackmd.io/_uploads/HyDjPG5J-e.png)

### Clone dataNV1 into dataNV2, dataNV3, masterNV

As shown, in order:
![image](https://hackmd.io/_uploads/SyRquM5JZx.png)
![upload_8e9fb54864e3337ddfa0f2a668b1f115](https://hackmd.io/_uploads/ry-FkIi1Zx.png)
Repeat the cloning of dataNV1 to create dataNV2, dataNV3, and masterNV.
Memory allocation (if the host has only 16 GB): masterNV 4G, dataNV1 2G, dataNV2 2G, dataNV3 2G. (My current allocation: 8/4/4/4.) Allocate as much as you can.

### Set the IP and hostname on dataNV2, dataNV3, masterNV

Using dataNV2 as the example, repeat these steps on each machine with its own IP and hostname:

* **Network:**
```
sudo nano /etc/netplan/01-network-manager-all.yaml
```
Change the 101 in the IP to 102:
![image](https://hackmd.io/_uploads/HkxwY5MqyZg.png)
Apply the settings:
```
sudo netplan apply
```

* **Hostname:**
```
sudo nano /etc/hostname
```
Change as shown:
![image](https://hackmd.io/_uploads/S1QQif91bx.png)
Remember to repeat the IP and hostname changes on dataNV3 and masterNV.

### Configure the master

* **Edit hdfs-site.xml:**
```
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
```
Change the content inside <configuration></configuration> to:
```
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
```
![image](https://hackmd.io/_uploads/ryPa6G5yWx.png)

* **Edit the masters file:**
```
sudo nano /usr/local/hadoop/etc/hadoop/masters
```
In the new file, enter the master node's name, as shown:
![image](https://hackmd.io/_uploads/ryOnCf5y-g.png)

* **Edit the workers file:**
```
sudo nano /usr/local/hadoop/etc/hadoop/workers
```
In the new file, enter the remaining nodes' names, as shown:
![image](https://hackmd.io/_uploads/ByHQ1Qc1Zl.png)

### Connect from masterNV to dataNV1, 2, 3

First start the masterNV and dataNV1/2/3 VMs. In the master's terminal, run:
```
ssh dataNV1
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown user:user -R /usr/local/hadoop
exit
```
After connecting to dataNV1:
![image](https://hackmd.io/_uploads/Bkl1f0q1-e.png)
Disconnect from dataNV1 and move on to dataNV2:
```
ssh dataNV2
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown user:user -R /usr/local/hadoop
exit
```
![image](https://hackmd.io/_uploads/S1rwzC5Jbg.png)
Disconnect from dataNV2 and move on to dataNV3:
```
ssh dataNV3
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown user:user -R /usr/local/hadoop
exit
```
![image](https://hackmd.io/_uploads/SJsTf0cJZg.png)
Rebuild masterNV's hdfs directory:
```
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo chown -R user:user /usr/local/hadoop
hdfs namenode -format
```
![image](https://hackmd.io/_uploads/BkesFA5kbl.png)

* **Start the Hadoop Multi-Node Cluster**
```
start-dfs.sh
start-yarn.sh
```
![image](https://hackmd.io/_uploads/Syb85C5JZl.png)
Both the http://localhost: and http://masternv: addresses work:
![image](https://hackmd.io/_uploads/SkQyjA5kbx.png)
![image](https://hackmd.io/_uploads/Hycgi05J-g.png)
![image](https://hackmd.io/_uploads/SyTzoC9J-e.png)

## HDFS commands

With the Hadoop Multi-Node Cluster running:
> This part basically follows the commands in the reference book:
> [Reference: Python+Spark2.0+Hadoop, chapter 6](https://pythonsparkhadoop.blogspot.com/2016/09/6-hadoop-hdfs.html)

### Creating and listing HDFS directories

* **Create HDFS directories**
```
hadoop fs -mkdir /user
hadoop fs -mkdir /user/user
hadoop fs -mkdir /user/user/test
```
Note: the pattern is hadoop fs -mkdir /user/{account name}
![image](https://hackmd.io/_uploads/BkBE1ksJZl.png)

* **List HDFS directories**
```
hadoop fs -ls
hadoop fs -ls /
hadoop fs -ls /user
hadoop fs -ls /user/user
hadoop fs -ls -R /
```
![image](https://hackmd.io/_uploads/Hk4NgJjyWg.png)

* **Create a whole HDFS directory tree at once**
```
hadoop fs -mkdir -p /dir1/dir2/dir3
```
![image](https://hackmd.io/_uploads/HyNhxJj1Wg.png)

* **Copy local files into HDFS**
```
hadoop fs -copyFromLocal /usr/local/hadoop/README.txt /user/user/test
hadoop fs -copyFromLocal /usr/local/hadoop/README.txt /user/user/test/test1.txt
```
Result:
![image](https://hackmd.io/_uploads/B1EObkiyWe.png)
Explanation:
1. Copies the local /usr/local/hadoop/README.txt into the HDFS directory /user/user/test/.
2. Copies the file into /user/user/test/ and renames it test1.txt.
```
hadoop fs -cat /user/user/test/README.txt
hadoop fs -cat /user/user/test/README.txt|more
```
Result:
![image](https://hackmd.io/_uploads/S1w2GysJbl.png)
Explanation:
1. Prints the HDFS file /user/user/test/README.txt directly to the terminal.
2. Also prints the file, but paginated through more.
```
hadoop fs -copyFromLocal -f /usr/local/hadoop/README.txt /user/user/test
```
Result:
![image](https://hackmd.io/_uploads/BkG4m1s1Wl.png)
Explanation: force-overwrites a file that already exists in the HDFS directory.
```
hadoop fs -copyFromLocal /usr/local/hadoop/NOTICE.txt /usr/local/hadoop/LICENSE.txt /user/user/test
hadoop fs -copyFromLocal /usr/local/hadoop/etc /user/user/test
```
Result:
![image](https://hackmd.io/_uploads/Hk7SV1o1Wl.png)
![image](https://hackmd.io/_uploads/HJGwEJi1Ze.png)
Explanation:
1. Copies multiple local files (NOTICE.txt and LICENSE.txt) into the HDFS directory /user/user/test in one command.
2. Copies the entire local /usr/local/hadoop/etc directory into HDFS /user/user/test.
```
hadoop fs -put /usr/local/hadoop/README.txt /user/user/test/test2.txt
echo abc | hadoop fs -put - /user/user/test/echoin.txt
hadoop fs -cat /user/user/test/echoin.txt
```
Result:
![image](https://hackmd.io/_uploads/r1MeHJjk-x.png)
Explanation:
1. Uploads the local README.txt to the HDFS path /user/user/test/test2.txt.
2. Creates an HDFS file whose content is abc.
3. Prints the HDFS file echoin.txt.
```
ls /usr/local/hadoop | hadoop fs -put - /user/user/test/hadooplist.txt
hadoop fs -cat /user/user/test/hadooplist.txt
```
Result:
![image](https://hackmd.io/_uploads/SJnGUJsybg.png)
Explanation:
1. Takes the output of a local command (the local directory listing) and writes it straight into an HDFS file.
2. Views the copied output.

* **Copy HDFS files to the local machine**
```
mkdir test
cd test
hadoop fs -copyToLocal /user/user/test/hadooplist.txt
ll
```
![image](https://hackmd.io/_uploads/HkMBD1o1-l.png)
```
hadoop fs -get /user/user/test/README.txt localREADME.txt
```
![image](https://hackmd.io/_uploads/ryXCPyiJZl.png)
Explanation: copies the HDFS file /user/user/test/README.txt to the current directory, naming it localREADME.txt.

* **Copy and delete HDFS files**
```
hadoop fs -mkdir /user/user/test/temp
hadoop fs -cp /user/user/test/README.txt /user/user/test/temp
hadoop fs -ls /user/user/test/temp
hadoop fs -ls /user/user/test
hadoop fs -rm /user/user/test/test2.txt
hadoop fs -ls /user/user/test
hadoop fs -rm -R /user/user/test/etc
```
![image](https://hackmd.io/_uploads/S1tlUgsyWx.png)

* **Browse HDFS files in the web UI**
http://masternv:9870/ on a remote machine:
![image](https://hackmd.io/_uploads/r121dejJZl.png)
![image](https://hackmd.io/_uploads/BkdOugikbx.png)

## Installing and configuring Spark

### Install Scala

Ideally you would install via coursier, which makes managing Scala versions easier:
![image](https://hackmd.io/_uploads/S1V87bs1Wg.png)
but I initially gave the VM too few CPUs. If you want coursier: [see here](https://www.scala-lang.org/download/). With the 4 CPUs set at the start of this tutorial, we install directly with wget:
```
wget https://github.com/scala/scala3/releases/download/3.3.6/scala3-3.3.6.tar.gz
sudo tar -zxvf scala3-3.3.6.tar.gz
sudo mv scala3-3.3.6 /usr/local/scala
```

* **Edit ~/.bashrc**
```
sudo nano ~/.bashrc
```
Append at the bottom:
```
#Scala Variables
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#Scala Variables
```
![image](https://hackmd.io/_uploads/HyVzUboJbg.png)
```
source ~/.bashrc
```

### Install Spark
```
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
sudo tar -zxvf spark-3.4.1-bin-hadoop3.tgz
sudo mv spark-3.4.1-bin-hadoop3 /usr/local/spark/
```
* **Edit ~/.bashrc**
```
sudo nano ~/.bashrc
```
Append at the bottom:
```
#Spark Variables
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
#Spark Variables
```
![image](https://hackmd.io/_uploads/BJbuYbj1-g.png)
```
source ~/.bashrc
```

### Launch Scala
```
scala
```
Test and exit:
![image](https://hackmd.io/_uploads/rJG9nWoyWx.png)

### Launch PySpark
```
pyspark
```
![image](https://hackmd.io/_uploads/H1_89Wo1be.png)
Test and exit pyspark:
![image](https://hackmd.io/_uploads/SyFciZjybe.png)

* **Reduce PySpark's log output**
```
cd /usr/local/spark/conf
cp log4j2.properties.template log4j2.properties
sudo nano log4j2.properties
```
Change info:
![image](https://hackmd.io/_uploads/H1CaTWiJbe.png)
to warn:
![image](https://hackmd.io/_uploads/HJjeRWiJbe.png)

### Testing Spark with a Word Count
```
mkdir -p ~/wordcount/input
cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input
ll ~/wordcount/input
```
![image](https://hackmd.io/_uploads/ryYbJMoyWl.png)
If the Hadoop Multi-Node Cluster isn't running, remember to start it here.
```
hadoop fs -mkdir -p /user/user/wordcount/input
cd ~/wordcount/input
hadoop fs -copyFromLocal LICENSE.txt /user/user/wordcount/input
hadoop fs -ls /user/user/wordcount/input
```
![image](https://hackmd.io/_uploads/S1HggzskWe.png)
```
pyspark --master local[*]
sc.master
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
```
![image](https://hackmd.io/_uploads/HJ5dbGoJWl.png)
Switching to reading an HDFS file:
```
textFile=sc.textFile("hdfs://masterNV:9000/user/user/wordcount/input/LICENSE.txt")
textFile.count()
exit()
```
![image](https://hackmd.io/_uploads/SkcbMzikbg.png)

### Running PySpark on Hadoop YARN
```
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
sc.master
```
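An aside on the Word Count test: the `textFile.count()` calls only count lines. The classic word count the section is named after can be sketched as follows — the RDD version in the comments assumes the `sc` from a pyspark shell, while the plain-Python version lets you check the logic without a cluster:

```python
# Plain-Python version of the word count, for checking the logic locally.
def word_count(lines):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Equivalent RDD version in the pyspark shell (not run here; assumes `sc`):
#   textFile = sc.textFile("hdfs://masterNV:9000/user/user/wordcount/input/LICENSE.txt")
#   counts = textFile.flatMap(lambda l: l.split()) \
#                    .map(lambda w: (w, 1)) \
#                    .reduceByKey(lambda a, b: a + b)
#   counts.take(5)

print(word_count(["to be or", "not to be"]))
```

`flatMap`/`map`/`reduceByKey` spread the same tallying across the cluster's executors, which is the point of running it on YARN at all.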
Result of the YARN launch:
![image](https://hackmd.io/_uploads/B1cimGj1We.png)
```
textFile=sc.textFile("hdfs://masterNV:9000/user/user/wordcount/input/LICENSE.txt")
textFile.count()
```
![image](https://hackmd.io/_uploads/S1rm4zikZg.png)
View the PySparkShell app at http://localhost:8088:
![image](https://hackmd.io/_uploads/HylnVzoybl.png)

### Building a Spark standalone cluster

Remember to exit() from the previous section first.
```
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
sudo nano /usr/local/spark/conf/spark-env.sh
```
Append at the bottom:
```
export SPARK_MASTER_IP=masterNV
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=2048m
export SPARK_EXECUTOR_INSTANCES=4
```
1. The master's IP or hostname.
2. CPU cores per worker — don't give a worker all of the VM's cores; leave headroom for the OS.
3. Memory each worker node can allocate.
4. Number of executor instances — e.g. 3 workers with 2 cores each is 6 cores total, so 3–4 executors is reasonable.
![image](https://hackmd.io/_uploads/ByY-OMokZe.png)

* **Copy masterNV's Spark to dataNV1, 2, 3**
Using dataNV1 as the example:
```
ssh dataNV1
sudo mkdir /usr/local/spark
sudo chown user:user /usr/local/spark
exit
sudo scp -r /usr/local/spark user@dataNV1:/usr/local
```
![image](https://hackmd.io/_uploads/SyHWYzjJbe.png)
Repeat for dataNV2 and dataNV3, adjusting the hostname and account name in the commands.

* **Edit the workers file**
```
sudo nano /usr/local/spark/conf/workers
```
Add:
```
dataNV1
dataNV2
dataNV3
```
![image](https://hackmd.io/_uploads/HyjD9zo1be.png)
Start the Spark standalone cluster:
```
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-workers.sh
```
![image](https://hackmd.io/_uploads/BkAgsfsybl.png)
```
pyspark --master spark://masterNV:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
sc.master
```
![image](https://hackmd.io/_uploads/rJ0ojfsJ-l.png)
Test:
```
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
textFile=sc.textFile("hdfs://masterNV:9000/user/user/wordcount/input/LICENSE.txt")
textFile.count()
```
![image](https://hackmd.io/_uploads/SykN3MsJbl.png)
View at http://masternv:8080/:
![image](https://hackmd.io/_uploads/SydDnfj1Zg.png)
Stop:
```
/usr/local/spark/sbin/stop-all.sh
```
![image](https://hackmd.io/_uploads/Skxm6ziy-e.png)

## Running Spark in Jupyter Notebook

### Install Anaconda
```
wget https://repo.anaconda.com/archive/Anaconda3-2023.03-0-Linux-x86_64.sh
bash Anaconda3-2023.03-0-Linux-x86_64.sh -b
```
* **Edit ~/.bashrc**
```
sudo nano ~/.bashrc
```
Append at the bottom:
```
# Anaconda Variables
export PATH=/home/user/anaconda3/bin:$PATH
export ANACONDA_PATH=/home/user/anaconda3
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/jupyter
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python
# Anaconda Variables
```
![image](https://hackmd.io/_uploads/Syr29Nik-g.png)
```
source ~/.bashrc
```
To run Spark on the cluster, repeat this Anaconda install and setup on dataNV1, 2, and 3 as well.
Check the Python version:
```
python --version
```
![image](https://hackmd.io/_uploads/ByOKYmjyWl.png)

* **Launch Jupyter**
```
mkdir -p ~/pythonwork/ipynotebook
cd ~/pythonwork/ipynotebook
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```
Jupyter Notebook opens automatically. Create a new notebook:
![image](https://hackmd.io/_uploads/HkvlZEik-x.png)
Rename it:
![image](https://hackmd.io/_uploads/SyINbVjkbe.png)

**Run PySpark in the notebook**
```
sc.master
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
```
```
textFile=sc.textFile("hdfs://masterNV:9000/user/user/wordcount/input/LICENSE.txt")
textFile.count()
```
Remember to start the cluster first.
![image](https://hackmd.io/_uploads/H1qbQEsJZg.png)
Press Ctrl+C to quit.

* **Running in Hadoop YARN client mode**
Again from the working directory ~/pythonwork/ipynotebook:
```
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
```
![image](https://hackmd.io/_uploads/Sy6k6No1bl.png)
![image](https://hackmd.io/_uploads/HJd-pVo1Zg.png)

* **Running in Spark standalone mode**
```
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-workers.sh
```
```
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master spark://masterNV:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
```
![image](https://hackmd.io/_uploads/SybfUHo1bx.png)
![image](https://hackmd.io/_uploads/rkuQUSik-e.png)
![image](https://hackmd.io/_uploads/rk_ArBiybx.png)
![image](https://hackmd.io/_uploads/HyRuUSjJbx.png)

---

Done!
