---
title: "Technical English: EMR Training: Running Jobs (3 of 11)"
---

# 3. EMR Training: Running Jobs (3 of 11)

As promised, we are now going to run a Hadoop job. At a high level, running a job has three steps (actually four, if you include the setup that we did in the previous section). The first step is to upload our Hadoop job jar, plus whatever data we want to process, to S3. Then we tell Amazon what we want to do with our job, which basically means: what kind of job we're running, what data we're processing for input, where we want the results to go, what kind of logging we want, and so on. Finally we run the job, wait for it to finish, monitor it, and look at the results.

So the first step is setting up that S3 bucket. We created the bucket in the previous section, but now we need to set up the four different subdirectories that contain the elements of a job. One of those elements is the Hadoop job jar: in this example we've got a bucket called aws-test-kk, in there is a job directory, and in that job directory we put the job jar. Then we've got our input data; I'm going to upload some data that our job is going to process. We also have the directory where we want to put our results, and the directory where we're going to tell Amazon to put the log files from the job. Again, we can use the AWS console to create all these directories inside the bucket and to handle uploading files, so let's go do that.
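The video does all of this in the console, but the same layout can be scripted. Here is a minimal sketch using Python and boto3 (not something the video shows), assuming the aws-test-kk bucket from the walkthrough already exists:

```python
import boto3

s3 = boto3.client("s3")
bucket = "aws-test-kk"  # the bucket created in the previous section

# The four subdirectories that hold the elements of a job.
# S3 has no real directories; a "folder" is just a zero-byte
# object whose key ends in "/".
for prefix in ("job/", "data/", "results/", "logs/"):
    s3.put_object(Bucket=bucket, Key=prefix)
```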
All right, we're going to walk through the steps required to get the data up into S3 that we need to be able to run our job. Once again we start at the top level of the AWS console, and I'm going to click on S3, because we're going to be pushing data up into an S3 bucket that we'd previously created. Over here we've got this aws-test-kk bucket. One of the first things I typically do here is create some folders. Inside this bucket I have a folder I'm going to call job, and this is where I'm going to put the job jar. I'm going to create another folder I'll call logs, and this is where I'm going to tell EMR to put the job log files at the end of the job. I'm also going to create a folder here called data, where I'm going to upload my input data, and another folder called results, which is what I'm going to use for the results of the job.

So I've got these folders created. Now I'm going to drill into the job directory, which is empty, and upload a job jar. Let's go find a job jar to upload; here it is, and you can see it's 7.8 megabytes, so it's going to take a little while. We can watch it over here; it's going pretty fast. While that is uploading I can actually start other uploads, but in this case it's going to finish fast enough. Okay, now let's go back and open up the data directory, because I'd like to upload some input data. Here I've got some Wikipedia data, a small sample of it that I previously prepared, and this shouldn't take very long because it's pretty small. All right, at this point I've got both the job jar uploaded and the input data uploaded. The second step is to do what Elastic MapReduce calls creating the job flow.
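Those two uploads look like this when scripted with boto3; the local file names are assumptions standing in for the jar and the Wikipedia sample mentioned in the video:

```python
import boto3

s3 = boto3.client("s3")
bucket = "aws-test-kk"

# Upload the Hadoop job jar (about 7.8 MB in the video) into job/.
s3.upload_file("wikipedia-ngrams-job.jar", bucket,
               "job/wikipedia-ngrams-job.jar")

# Upload the small Wikipedia input sample into data/.
s3.upload_file("enwiki-split.xml", bucket, "data/enwiki-split.xml")
```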
This job flow has a whole bunch of settings. Specifically, you have to give it a name; you have to tell it what kind of job it is; you need to specify the cluster (what type of servers, and how many); and you need to tell it what key pair to use to run the job, where to put the log files, and a few other things that you typically don't need to care about. So we're going to go set up a job flow.

Now I can actually create the job flow using the EMR interface. I'm going to click on Elastic MapReduce here; it's going to show me that I don't have any job flows, so I'm going to create a new job flow, and I'll call this "Wikipedia processing". I'm going to run my own application (as opposed to the samples that have been pre-created), and the job type here is going to be a custom jar. Now it's going to ask me where this jar is located, and here I need to put in the path, starting with the bucket in S3 where the job jar is located. I know I put this into aws-test-kk/job/, and it's called wikipedia-ngrams-job.jar.

Now I have to specify the arguments, and these are the arguments that actually go to the main method of the class that's specified in my job jar's manifest. Here I know I need to specify the input file that I'm going to be processing, and note that I'm using real HDFS paths: I'm specifying s3n as the protocol, because the input file is going to be coming from S3, and of course it's coming out of the aws-test-kk bucket, in the data subdirectory, as enwiki-split.xml.
I also have to tell my program where the output is going, so I'm going to say -outputdir. In this case it again needs to go into S3, because otherwise it's just going to disappear when the cluster terminates, so I'm going to put it into that same aws-test-kk bucket, in the results directory. I can also specify an additional parameter to my program that says I only want to use one reduce task, so I'll end up with a single output file.

I click the continue button, and now it's letting me pick the type and number of the servers that are going into my cluster. For my master I'm going to use an m1.small instance, and for my slaves I'm going to use two m1.small instances. I'm not going to use any task-only instances; we'll talk about that in the last module of the course, where you can use a task instance group and request spot pricing for it. There are good reasons for doing that, but we don't need it for this particular example. For keys, I'm using that aws-test key pair that I previously created. I don't need a virtual private cloud. For the logs that are being generated by the job, I want them to go into that aws-test-kk bucket, in the logs subdirectory. I'm not doing any special debugging logging; if I did want that I'd have to use SimpleDB, and we'll talk about that later. And I don't need to keep my cluster around once the job finishes, so keepalive is set to no.
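Pulling the whole configuration together: an equivalent job flow can be defined programmatically. This is a sketch using boto3's modern EMR API rather than the console workflow the video shows; the release label, the IAM role names, and the argument flag names (-input, -outputdir, -numreducers) are assumptions, since the narration doesn't spell them out:

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="Wikipedia processing",
    LogUri="s3://aws-test-kk/logs/",           # where EMR copies the job logs
    ReleaseLabel="emr-5.36.0",                 # illustrative; the video predates release labels
    Instances={
        "MasterInstanceType": "m1.small",      # the master node from the video
        "SlaveInstanceType": "m1.small",       # the two "slave" (core) nodes
        "InstanceCount": 3,                    # 1 master + 2 core, no task-only instances
        "Ec2KeyName": "aws-test",              # the key pair created earlier
        "KeepJobFlowAliveWhenNoSteps": False,  # keepalive set to "no"
    },
    Steps=[{
        "Name": "Wikipedia ngrams",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://aws-test-kk/job/wikipedia-ngrams-job.jar",
            # Arguments handed to the jar's main class; flag names are assumed.
            "Args": [
                "-input", "s3n://aws-test-kk/data/enwiki-split.xml",
                "-outputdir", "s3n://aws-test-kk/results",
                "-numreducers", "1",  # one reduce task -> a single output file
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",         # default EMR roles (required nowadays)
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # "j-..." ID, used below for monitoring
```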
I click continue, and it lets me decide whether I want to use any bootstrap actions. We'll talk about that again in the last module; it's a way to alter the configuration of my cluster, or do special setup on the servers in my cluster. I don't need any of that, so I click continue. It gives me one last chance to check over all the settings, and then I can create the job flow.

Once the job flow has been created I can go back over here, and it will show me that I've got this job that's starting up. At some point, typically after a couple of minutes, my cluster will be running, which means Elastic MapReduce has allocated the servers that I asked for, provisioned them with Hadoop, downloaded my job jar, and started up the job. So we're going to wait until that happens, and then we'll take a look at the job as it's running.

While your job is running, you can use the AWS console to monitor it and find out what state it's in: is it starting up, is it actually running the job, is it terminating, is it done. You can also see how long it's been running, you get an estimate of roughly how much it's going to cost you, and if need be you can terminate the job. So we've started our job; let's go take a look at it.
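That status check can also be done without the console. A sketch that polls the job flow with boto3, where the cluster ID placeholder stands in for the JobFlowId returned by run_job_flow:

```python
import time
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: the JobFlowId from run_job_flow

# Poll until the job flow reaches a terminal state.
while True:
    status = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]
    print(status["State"])  # STARTING, RUNNING, TERMINATING, TERMINATED, ...
    if status["State"] in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(30)

# If need be, you can also terminate the job from here:
# emr.terminate_job_flows(JobFlowIds=[cluster_id])
```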
Okay, now you see we're actually running the job: the status has changed to RUNNING, and here it shows "normalized instance hours". What that's saying is that as this cluster starts actually running, I'm being charged for three servers times up to an hour; if I actually had a job that ran longer than an hour, you'd see this jump up to six instance hours. This is one of the reasons why you want to avoid having a job that fails right away: you're still going to pay for the number of servers times at least one hour, even if your job only runs for 10 seconds.

Now, the actual runtime for this job isn't very long, so I expect this job will succeed pretty quickly. If I go down here and look at the steps, what you'll see is that I've got essentially one step in my flow, which is the single jar job that's running with the parameters I passed it, down here. Once the job finishes, the status will change to SHUTTING_DOWN; at that point the results of the run are being copied up to S3, which includes both the results and the log files. You can see the status just changed to shutting down. Note that the elapsed time doesn't actually start until the cluster is up and running the job; total elapsed time for this job was only four minutes.

Now that the job is finished, I can actually go take a look at the results. When I set up my job I specified my output directory; that was actually a parameter to the job I was running, so I told it where in S3 to put the output, using the s3n protocol. Because the Hadoop cluster goes away at the end of the job, all the drives that are used for HDFS are ephemeral, which means they disappear, so typically the only way to persist data is to write it to S3. Now, you can set up job flows that stay alive, meaning they don't terminate at the end of your job, and that's a great way to debug jobs when you're first getting started with them; we'll talk about that more later.
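As a sketch of that debugging pattern: with "KeepJobFlowAliveWhenNoSteps": True in the Instances settings, the cluster survives between runs and you can submit new steps to it instead of provisioning a fresh cluster each time. The retry step below is hypothetical, not something the video demonstrates:

```python
import boto3

emr = boto3.client("emr")

# Submit another step to a job flow that was created with
# "KeepJobFlowAliveWhenNoSteps": True, so it is still running.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder for the live job flow's ID
    Steps=[{
        "Name": "Wikipedia ngrams (retry)",
        "ActionOnFailure": "CANCEL_AND_WAIT",  # keep the cluster up on failure
        "HadoopJarStep": {
            "Jar": "s3://aws-test-kk/job/wikipedia-ngrams-job.jar",
            "Args": [
                "-input", "s3n://aws-test-kk/data/enwiki-split.xml",
                "-outputdir", "s3n://aws-test-kk/results-retry",
                "-numreducers", "1",
            ],
        },
    }],
)

# The billing clock runs until you terminate the cluster yourself:
# emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])
```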
A typical run, though, has everything going into S3. That means you get the results (typically because your program takes some output path that you specify, and you tell it to write into S3), and secondly, Elastic MapReduce copies all of the log files up to the location you specified in S3; in our case we used aws-test-kk/logs.

So let's go look at the job results now. If I go over to S3 and take a look in my aws-test-kk bucket, you see I've got my four directories there. If I look in the results directory, my job has created two subdirectories inside it, raw counts and sorted counts; sorted counts, I know, is the final output. Here you see the typical Hadoop _SUCCESS file, and also a part file from the reducer: part-r- followed by five zeros. I can download this file and open it up with BBEdit, my editor of choice, and it displays results that look like this. What this job does is generate bigram counts from text found in Wikipedia articles, so you can see that the most common bigram was a space, and it occurred 4,025 times in that snippet from Wikipedia that I uploaded into the data subdirectory of my bucket.
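Listing and downloading those results can also be scripted; a sketch with boto3, where the sorted-counts key is an assumption based on the directory names shown in the video:

```python
import boto3

s3 = boto3.client("s3")
bucket = "aws-test-kk"

# See what the job wrote under results/ (the two count subdirectories,
# the _SUCCESS marker, and the reducer's part file).
listing = s3.list_objects_v2(Bucket=bucket, Prefix="results/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Pull down the single reducer output for local inspection.
s3.download_file(bucket, "results/sorted-counts/part-r-00000",
                 "part-r-00000.txt")
```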
Now, if we go back up to the bucket level and look in logs, what you'll see is that it's created a directory named j- followed by a job ID, so this has information about the actual job. If I open it up, you'll see there's a bunch of subdirectories; most of this isn't that interesting (it's information being logged by the Hadoop system itself), but if you look in steps, I only had one step, so there's just a single subdirectory called 1. If I look inside of that, you'll see I've got a number of files in here; the interesting one is stdout, because this is where my logging goes. If I double-click it, since it's a regular text file, I just get a window popped up with its contents, and this is what my job was logging: it was giving me information about the job tracker, and I printed out some information about the input and output paths, the ngram size I was using, and so on.

As you can see, Elastic MapReduce automatically uploaded this information into the directory I'd specified as my logs directory when I was defining my job. In particular, if my job failed, this is where I'd start looking for clues as to what went wrong. Remember that there's a lag between when the job finishes and when this data gets uploaded; if your job does a lot of logging you can potentially have many gigabytes of log files, and those will take some time to upload. So if you have a large cluster generating lots of logging output, I typically wait at least five minutes, more like ten minutes, after my job is finished before I start looking for its logs.
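Fetching that stdout file directly is straightforward once you know the layout; a sketch assuming the logs/&lt;job flow ID&gt;/steps/1/ path shown in the video (newer EMR releases gzip these files, e.g. stdout.gz):

```python
import boto3

s3 = boto3.client("s3")
bucket = "aws-test-kk"
job_flow_id = "j-XXXXXXXXXXXXX"  # placeholder: the job flow's ID

# EMR copies step logs to <logs dir>/<job flow id>/steps/<step number>/;
# stdout holds the job's own logging, next to stderr and syslog.
key = f"logs/{job_flow_id}/steps/1/stdout"
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
print(body.decode("utf-8", errors="replace"))
```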
All right, to summarize: running jobs using Elastic MapReduce is very simple. You define the jobs using the AWS console; your input data and your job jar get loaded from S3; the results of your job, which include the log files, get pushed back up to S3; and you can use the AWS console to monitor the status of your job. In the next module we're going to look at the different options you have for the servers that you can use to create your Hadoop cluster.