
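# Introduction to Distributed File System
###### tags: `Spark` `Distributed System` `Hadoop` `MapReduce` `RDD`

## Introduction to Hadoop
### What is Hadoop?
Imagine a file larger than your PC's storage capacity: you simply cannot store it on your computer.
* Hadoop lets you store files too large for any single server, and store, process, and analyze thousands or tens of thousands of such files at once. This is why Hadoop comes up whenever big data is discussed.
* In short, Hadoop is a platform for storing and managing large amounts of data. It is free, open-source, community-driven software under the Apache Software Foundation, and is widely adopted across organizations and industries.

### Advantages of Hadoop
* Vast amounts of data
* Economic
* Efficient
* Scalable
* Reliable
* Comparison with traditional relational database management systems:
![](https://i.imgur.com/OHCbPUt.png)

## Hadoop MapReduce
MapReduce is a programming model whose logic is split into a Map phase and a Reduce phase.
* The Map phase filters and sorts the input and emits (key, value) pairs. In WordCount, each word becomes a key and the value is 1, meaning one occurrence.
* The Reduce phase aggregates. In WordCount, identical keys coming from the Map phase are the same word, so their values are summed to get the total count.
* Example: the WordCount flow
![](https://i.imgur.com/MTfSNLE.png)

# Introduction to Spark with Examples
## What is Spark?
* Apache Spark is an open-source cluster computing framework, originally developed at the AMPLab at the University of California, Berkeley.
* It provides a fast, general-purpose engine for large-scale data processing, with libraries for SQL, streaming, and advanced analytics.
* According to the Spark project, programs can run up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster on disk.
* Spark lets users load data into cluster memory and query it repeatedly, which suits machine learning algorithms well.

## Spark vs. Hadoop
Hadoop has the following drawbacks:
* Limited expressiveness: every job must be decomposed into Map and Reduce operations, and not every application can be solved with only these two.
* High I/O cost: all intermediate results are written to disk, so memory is poorly utilized.
* High latency: moving data between operations costs I/O. Map output is not passed directly to Reduce; it is first written to disk and then read back by the Reduce operation. A task cannot start until the previous one has finished, which makes complex, multi-stage parallel computation difficult.

## Advantages of Spark
* Fast: uses a DAG (Directed Acyclic Graph) execution engine to decompose and schedule tasks.

:::info
Supplement: Directed Acyclic Graph (DAG)
A DAG is an important special case of a graph for which very fast algorithms often exist. Because trees and DAGs have no cycles and a definite direction, it is easy to arrange an order of computation and work toward the answer step by step.
![](https://i.imgur.com/wxSbY5G.png)
:::

* Easy to use: supports several programming languages (Scala, Java, Python, and R) and offers the Spark Shell for interactive use, which speeds up application development.
* General-purpose: Spark provides a complete and powerful stack, including SQL queries (Spark SQL), stream processing (Spark Streaming), machine learning (MLlib), and graph algorithms (GraphX).
* Flexible deployment: runs in its own standalone cluster mode, or on Hadoop with YARN for resource scheduling and HDFS for storage.

## Spark Basics
### Spark execution architecture
![](https://i.imgur.com/URw5EAe.png)
* Master Node: allocates resources and schedules the whole Spark application. For example, after a program is submitted it builds the DAG, splits the DAG into stages, splits the stages into tasks, and assigns the tasks to Worker Nodes.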
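* Worker Node: a machine that runs tasks.
* Client Node: uses the `SparkContext` class to connect the application to the Spark cluster and control it.

```python=
# Linking Spark & Initializing Spark
from pyspark import SparkContext, SparkConf

appName = "wordcount"
master = "yarn"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
```

## Core Spark concept: RDD (Resilient Distributed Datasets)
Spark's core data abstraction is the RDD (Resilient Distributed Dataset). An RDD is essentially the familiar notion of a collection, except that the dataset can span multiple nodes (for example, several PCs).

#### RDDs have three properties:
1. **Immutable:** An RDD cannot be changed (much like a Java `String`). To update data, create a new RDD from an existing one. This may feel odd at first, but immutability is key for distributed systems: because no RDD is ever modified, data consistency can be guaranteed.
2. **Resilient:** Nodes failing unexpectedly is normal in a distributed environment. When that happens, Spark rebuilds the RDDs that were in use or being created.
3. **Distributed:** A dataset can span multiple nodes and is kept in each node's memory. The benefit is speed, but it also makes Spark memory-hungry, so take care with operations such as shuffle (the mechanism that redistributes or repartitions data across groups).

:::info
`reduceByKey()`, `groupByKey()`, `sortByKey()`, `cartesian()`, and similar operations involve a shuffle or comparably heavy data movement and may slow execution. Use them with care.
:::

## How RDDs work
> Official reference:
> https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html

RDD operations fall into two categories:
* Transformation: operates on one or more RDDs and produces a new RDD.
* Action: returns a result to the Driver (the machine running the Spark program), or performs some operation on the RDD's elements, without producing a new RDD.

![](https://i.imgur.com/iatuTiR.png)

An RDD is processed as follows:
* External data is read in to create an RDD, which is distributed across the worker nodes.
* The RDD goes through a series of transformations. Each one produces a different RDD that feeds the next operation.
* Finally an action is applied to the RDD. Only then is the computation actually carried out and the result output.

## Two ways to create an RDD
1. Copy the elements of a collection to form a distributed dataset that can be operated on in parallel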
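2. Reference an external dataset (for example, a dataset in a shared file system, HDFS, or HBase)

### 1. Parallelizing a collection
Copy the elements of a collection to form a distributed dataset that can be operated on in parallel.
```
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
```

### 2. Referencing an external dataset
Spark supports the following file types:
* Text files
* SequenceFiles
* Any other Hadoop InputFormat

Text-file RDDs can be created with the `textFile` method of `SparkContext`. It takes a URI for the file (a local path on the machine, or a URI such as hdfs://, s3a://). HTTP/HTTPS URLs are not supported.
```python=
distFile = sc.textFile("data.txt")

# Saving and Loading SequenceFiles
rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
rdd.saveAsSequenceFile("path/to/file")
sorted(sc.sequenceFile("path/to/file").collect())
# >>> [(1, u'a'), (2, u'aa'), (3, u'aaa')]
```

:::info
If you use a path on the local file system, the file must also be accessible at the same path on the worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
:::

:::warning
A third option is recommended: upload the file to HDFS, then reference it from HDFS.

**Hadoop commands**
> Apache Hadoop official documentation:
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

| Linux command | Hadoop command |
| -------- | -------- |
|mkdir directory|hadoop fs -mkdir directory|
|ls directory|hadoop fs -ls directory|
|cat word.txt|hadoop fs -cat word.txt|

---

| Hadoop command | Description |
|:------------------------------------- | ----------------------------- |
| hadoop fs -copyFromLocal [src] [dest] | Upload data from the local machine to HDFS |
| hadoop fs -put [src] [dest] | Similar to the above |
| hadoop fs -copyToLocal [src] [dest] | Download data from HDFS to the local machine |
| hadoop fs -get [src] [dest] | Similar to the above |
| hadoop fs -cp [src] [dest] | Copy data within HDFS |
:::

## RDD operations
### Transformation
| Transformation API | Description |
| ------------------ | --------------------------------------------------|
| map(func) | Passes each element through func and returns the results as a new dataset |
| flatMap(func) | Similar to map(), but each input can be mapped to 0 or more outputs |
| filter(func) | Selects the elements for which func returns true and returns a new dataset |
| cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs |
| groupByKey() | Applied to a dataset of (K, V) pairs, returns a new dataset of `(K, Iterable<V>)` pairs |
| reduceByKey(func) | Applied to a dataset of (K, V) pairs, returns a new dataset of (K, V) pairs in which the values for each key are aggregated using func |

### Transformation examples
Sample file `word.txt`:
```
Hadoop is good
Spark is fast
Spark is better
```
* **map()**: this transformation converts each element of an RDD one-to-one, via a function, into the corresponding element of a new RDD, so its output is still an RDD.
```
# Create an RDD named lines from word.txt
lines = sc.textFile("hdfs://[HDFS Server Name]:8020/user/[account]/word.txt")
# Use split(" ") to split each element of lines (one line of text) into words, producing another RDD named words
words = lines.map(lambda line: line.split(" "))
```
![](https://i.imgur.com/rP7iCIq.png)

---
* **flatMap()** (note the capital M): similar to map(), this transformation also applies a function to each element, but then flattens the results. Each input element maps to zero or more output elements, which together form a single new RDD.
```
# Create an RDD named lines from word.txt
lines = sc.textFile("hdfs://[HDFS Server Name]:8020/user/[account]/word.txt")
# Use split(" ") to split each element of lines (one line of text) into words, flatten them into separate RDD elements, and produce another RDD named words
words = lines.flatMap(lambda line: line.split(" "))
```
![](https://i.imgur.com/9OOLeWB.png)

---
* **filter()**: this transformation applies a condition and keeps only the elements that satisfy it. The output is still an RDD.
```
# Create an RDD named lines from word.txt
lines = sc.textFile("hdfs://[HDFS Server Name]:8020/user/[account]/word.txt")
# Keep only the lines that contain the word "Spark", producing an RDD named linesWithSpark
linesWithSpark = lines.filter(lambda line: "Spark" in line)
```
![](https://i.imgur.com/toZRfSu.png)

---
### Actions
| Action API | Description |
| ------------ | ---- |
| reduce(func) | Aggregates the elements of the dataset using func (which takes two arguments and returns one) |
| collect() | Returns all elements of the dataset as an array |
| count() | Returns the number of elements in the dataset |
| first() | Returns the first element of the dataset |
| foreach(func) | Runs func on each element of the dataset |

* **reduce()**: this action processes the elements of an RDD. **The result of reduce() is a plain value, not an RDD.**
```
# Create an RDD object named rdd1
rdd1 = sc.parallelize([1,2,3,4])
# Sum the data
rdd1.reduce(lambda a,b:a+b)
# Result: 10
```
What reduce does, conceptually (Spark actually reduces each partition in parallel and then merges the partial results, so func should be commutative and associative):
1. Apply the lambda expression to the first two elements of the iterable.
1. Pass the result of the first step and the next element of the iterable (the third) to the lambda function.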
1. Continue until every element of the iterable has been processed.

* **count()**: this action counts the number of elements in an RDD.
```
# Create an RDD object named rdd2
rdd2 = sc.parallelize([1,2,3,4])
# Count the elements in the RDD object
rdd2.count()
# Result: 4
```
* **collect()**: this action returns all elements of an RDD as a list.
```
# Create an RDD object named rdd3
rdd3 = sc.parallelize([1,2,3,4])
# Return all elements of the RDD object as a list
rdd3.collect()
# Result: [1, 2, 3, 4]
```

:::info
You can create your own word.txt, push it to HDFS, and run the program below to see the output.
```python=
from pyspark import SparkContext, SparkConf

master = "yarn"
conf = SparkConf().setMaster(master)
sc = SparkContext(conf=conf)

# Change your HDFS Server Name & account first
lines = sc.textFile("hdfs://[HDFS Server Name]:8020/user/[account]/word.txt")

map_result = lines.map(lambda line: line.split(" "))
print("map_result: %s" % map_result.collect())

flatMap_result = lines.flatMap(lambda line: line.split(" "))
print("flatMap_result: %s" % flatMap_result.collect())

filter_result = lines.filter(lambda line: "Spark" in line)
print("filter_result: %s" % filter_result.collect())

rdd1 = sc.parallelize([1, 2, 3, 4])
print("reduce_result: %s" % rdd1.reduce(lambda a, b: a + b))
print("count_result: %s" % rdd1.count())
print("collect_result: %s" % rdd1.collect())
```
:::

> **More material:**
> More APIs: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
> Source: http://debussy.im.nuu.edu.tw/sjchen/BigData-Spark/%E5%B7%A8%E9%87%8F%E8%B3%87%E6%96%99%E6%8A%80%E8%A1%93%E8%88%87%E6%87%89%E7%94%A8%E6%93%8D%E4%BD%9C%E8%AC%9B%E7%BE%A9-RDD%E9%81%8B%E4%BD%9C%E5%9F%BA%E7%A4%8E.html

## WordCount Examples with Hadoop and Spark
> sources: https://spark.apache.org/examples.html

### WordCount with Hadoop MapReduce (Java)
```java=
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts received for each word
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

### WordCount with Spark (Python)
```python=
inputFilePath = "hdfs://[HDFS Server Name]:8020/user/[account]/word.txt"
lines = sc.textFile(inputFilePath)  # Read the file to create an RDD named lines
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
print(counts.collect())  # Show the result
counts.saveAsTextFile("hdfs://...")  # Save the result back to "hdfs://..."
```
:::info
```
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
```
A brief walk-through of this key statement:
* **lines is an RDD whose elements are the lines of the file.**
* lines.**flatMap**(lambda line: line.split(" ")) scans each element of the RDD `lines` (each line of text), assigns it to the variable `line`, and evaluates the lambda expression `line: line.split(" ")`.
* lambda **line: line.split(" ")** is a lambda expression: the left side is the input variable, and the right side processes the input. Here it splits one line of text on spaces into a collection of words, so evaluating it over all lines yields many word collections.
* **lines.flatMap()** merges those word collections into one large RDD. On that RDD, map(lambda word: (word, 1)) scans each word, assigns it to the variable `word`, and evaluates the lambda expression `word: (word, 1)`.
* **lambda word : (word, 1)** takes the parameter `word` on the left as input and builds a key-value pair (key, value) on the right, where key is the word and value is 1 (the word occurred once).
* **reduceByKey(lambda a, b: a + b)** groups all (key, value) pairs by key, then uses the function lambda a, b: a + b to sum the values that share a key, and returns the summed (key, value). For example, (This, 1) and (This, 1) share the key This and sum to (This, 2), which is the count for that word.
* Diagram:
![](https://i.imgur.com/utN7nf7.png)
:::

## Launching a program and testing the environment
Create a file named calculate-pi.py:
```
$ touch calculate-pi.py
```
Contents of `calculate-pi.py`:
```python=
# Linking Spark & set Spark Manager
from pyspark import SparkContext, SparkConf
import random

appName = "calculate-pi"
master = "yarn"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

# Pi Estimation: sample random points in the unit square and
# count how many fall inside the quarter circle of radius 1
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 10**6
count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
```
In a terminal, use the spark-submit command to run calculate-pi.py on Spark:
```
# Run the program, write the result to calculate-pi.log, and print it
$ spark-submit calculate-pi.py > calculate-pi.log ; cat calculate-pi.log
```
:::info
PS: it is normal for the result to vary slightly between runs.
:::

## Spark Practice
:::info
Python is recommended.
[Python lambda tutorial](https://www.learncodewithmike.com/2019/12/python-lambda-functions.html)
:::

[Python
type casting](https://medium.com/ccclub/ccclub-python-for-beginners-tutorial-d26900b9280e)

## exam1
#### Task: find every paragraph in practice4-1.txt that contains “**shoe**” and display it on screen.
Write a Spark program in Python, save it in ~, and name it exam1.py.
```
$ cd ~ # Move to the ~ directory
$ touch exam1.py # Create the file exam1.py
```
Download practice4-1.txt and upload it to your HDFS folder:
```
$ wget https://pastebin.com/raw/P0FXCARK -O practice4-1.txt
$ hadoop fs -put practice4-1.txt
```
Your program must read the file from `hdfs://[HDFS Server Name]:8020/user/[account]/practice4-1.txt`.

:::warning
Fill in the HDFS Server Name and account from your password card.
:::

:::info
exam1 answer:

Cinderella obeyed, and the Fairy, touching it with her wand, turned it into a grand coach. Then she desired Cinderella to go to the trap, and bring her a rat. The girl obeyed, and a touch of the Fairy’s wand turned him into a very smart coachman. Two mice were turned into footmen; four grasshoppers into white horses. Next, the Fairy touched Cinderella’s rags, and they became rich satin robes, trimmed with point lace. Diamonds shone in her hair and on her neck and arms, and her kind godmother thought she had seldom seen so lovely a girl. Her old **shoe**s became a charming pair of glass slippers, which shone like diamonds.

However, the Prince’s search was rewarded by his finding the glass slipper, which he well knew belonged to the unknown Princess. He loved Cinderella so much that he now resolved to marry her; and as he felt sure that no one else could wear such a tiny **shoe** as hers was, he sent out a herald to proclaim that whichever lady in his kingdom could put on this glass slipper should be his wife.
:::
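:::warning
How to submit:
```
cp exam1.py ./handin/ # Copy exam1.py into the handin folder
```
To check whether your submission went through, look at result1.txt.
Example (not updated in real time, so you may need to wait a moment):
![](https://i.imgur.com/akeuFcD.png)
:::

## exam2
#### Task: there are 10000 coordinates on a plane. Find the smallest distance between any two of them, and use print() to output the result to log.log.
Write a Spark program in Python, save it in ~, and name it exam2.py.
```
$ cd ~ # Move to the ~ directory
$ touch exam2.py # Create the file exam2.py
```
Download the file and upload it to your HDFS folder:
```
$ wget https://pastebin.com/raw/KZJrgL6C -O practice4-2.txt
$ hadoop fs -put practice4-2.txt
```
Your program must read the data from `hdfs://[HDFS Server Name]:8020/user/[account]/practice4-2.txt`.

:::warning
Fill in the HDFS Server Name and account from your password card.
:::

:::warning
How to submit:
```
cp exam2.py ./handin/ # Copy exam2.py into the handin folder
```
To check whether your submission went through, look at result2.txt.
Example (not updated in real time, so you may need to wait a moment):
![](https://i.imgur.com/e7trL81.png)
:::

:::danger
tips:
```
$ time spark-submit exam2.py > exam2.log # time: shows how long the program takes to run
# To view the result, open another terminal (step7 of the Visual Studio Code remote editing section, then enter your password again)
$ cd ~
$ tail -f exam2.log # Keep monitoring the program's output
# Ctrl+C exits the monitor
# Note: the computation must be submitted to Spark and run in a distributed fashion, not locally. Running it locally will cost points!
```
Links to Python concepts you may need: lambda, type casting, functions
[Python lambda tutorial](https://www.learncodewithmike.com/2019/12/python-lambda-functions.html)
[Python type casting](https://medium.com/ccclub/ccclub-python-for-beginners-tutorial-d26900b9280e)
[Python functions](https://medium.com/ccclub/ccclub-python-for-beginners-tutorial-244862d98c18)
```
# Python exponent syntax: x**y  # x to the power of y
# Example:
print("5 squared = %d" % 5**2)
# Result: 5 squared = 25
# Example:
print("Square root of 4 = %s" % 4**0.5)
# Result: Square root of 4 = 2.0
```
:::

## Logging in to the server and editing remotely with Visual Studio Code
The `sftp` extension keeps your edits in sync with the remote server automatically and lets you open a terminal directly in the editor.
* step1: Click the Extensions button on the left side of VSCode, then search for and install `sftp`
![](https://i.imgur.com/q7td85P.png)
* step2: Create a folder in the editor to use as the working directory
![](https://i.imgur.com/sgIiAQp.png)
Create the folder:
![](https://i.imgur.com/p3S14Qr.png)
Select the new folder:
![](https://i.imgur.com/6tJN71v.png)
* step3: Press `Ctrl+Shift+P` to open the Command Palette, type SFTP, and choose `SFTP: Config`
![](https://i.imgur.com/kRBFkdE.png)
* step4: Enter the correct login information and save the file (`Ctrl+S`)
![](https://i.imgur.com/LnvaC8u.png)

:::info
**Fill in the host and port fields from the password card you received**
**Set the remotePath field to /home/[account]**
:::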
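:::info
Example password card:

|Account| Password | host:port |HDFS Server Name|
| -------- | -------- | -------- |---|
|AI_test|******| pdc7.csie.ncu.edu.tw:10022|hadoop-master-svc |

In this case, set host to "pdc7.csie.ncu.edu.tw", port to 10022, username to "AI_test", and remotePath to "/home/AI_test".
**Also replace the HDFS Server Name in the sample programs and exams with the one on your password card**
:::

* step5: Go to the SFTP view (at the bottom)
![](https://i.imgur.com/JkKzqyd.png)
* step6: Click My Server
![](https://i.imgur.com/Qeo2lKA.png)
Enter the correct password
![](https://i.imgur.com/xMsCB2c.png)
* step7: Right-click My Server to open a Linux terminal
![](https://i.imgur.com/MSEqZJt.png)
* step8: Type yes in the terminal
![](https://i.imgur.com/hTQvVX6.png)
Then enter your password again
![](https://i.imgur.com/mBeYAEI.png)
* step9: Right-click a file and choose Edit in Local to start writing code
![](https://i.imgur.com/6SY3uwW.png)
* step10: When you finish writing, save the file (`Ctrl+S`)

:::info
Note: the terminal is a Linux environment. The appendix at the bottom has a Linux command tutorial, and following it step by step should be enough. If you have a problem or a need, consult the relevant commands first, and ask a teaching assistant if you cannot resolve it yourself.
:::

## Appendix: basic Linux commands
:::info
A leading $ means the line is a Linux command to be typed in the terminal.
:::
* How do I use this command? Read the manual
> $ man ls # Use man to view the usage of ls

:::info
Some programs show a short help message with `<command> -h`, for example hadoop -h.
:::
* List the files in the current directory (folder)
> $ ls
* List the files in the current directory (folder), including **hidden files**, with detailed information
> $ ls -al
* Create a directory named myDir
> $ mkdir myDir
* Change into myDir
> $ cd myDir
* Create a file named myFile.txt
> $ touch myFile.txt

:::info
The file extension (for example .txt or .py) must be typed as part of the file name.
:::
* Print the contents of myFile.txt (**assuming myFile.txt contains hello world**)
> $ cat myFile.txt # Prints the contents of myFile.txt: hello world
* Copy myFile.txt to a directory
> $ cp myFile.txt /home/[user name] # Copy myFile.txt into the directory /home/[user name]
* Move myFile.txt to a directory (also used for renaming)
> $ mv myFile.txt /home/[user name] # Move myFile.txt into the directory /home/[user name]
> $ mv myFile.txt temp.txt # Rename myFile.txt to temp.txt
* Delete the file myFile.txt
> $ rm myFile.txt
* Delete the directory myDir and all files inside it
> $ rm -r myDir

:::danger
The -r option means recursive deletion. Use it with great care, because it deletes every file and directory inside the target directory.
:::
* pipe: `|` feeds the output of the preceding command into the following command: *stdout* `|` *stdin* (grep searches its input for matching lines)
> $ cat myFile.txt | grep hello # grep output: hello world
* Tab completion: press [tab]
> $ ls + [tab] # Shows: myFile.txt
* Log out
> $ exit
> $ logout
> $ Ctrl+D # Any of the three works

### Common symbols
* Home directory (`/home/<username>`)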
> ~
* This directory (the current working directory)
> .
* The parent directory
> ..
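
The same three symbols can also be resolved from Python, which may help when a script needs to build a path relative to your home or working directory. This is a minimal sketch using only the standard library. The variable names `home`, `here`, and `parent` are illustrative assumptions and are not part of the course material.

```python
import os

# "~" expands to the current user's home directory
home = os.path.expanduser("~")
# "." is the current working directory
here = os.path.abspath(".")
# ".." is the directory one level above the current one
parent = os.path.abspath("..")

print("home:   %s" % home)
print("here:   %s" % here)
print("parent: %s" % parent)
```

The printed paths depend on the account and on the directory from which the script is started.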