---
title: "Technical English: EMR Training: Running Jobs (3 of 11)"
---
# 3. EMR Training: Running Jobs (3 of 11)
1
00:00:00,000 --> 00:00:02,259
as promised we are now going to run a
按照承諾,我們現在將運行
2
00:00:02,459 --> 00:00:05,200
Hadoop job so at a high level running a
Hadoop作業。從較高的層面來看,運行一個
3
00:00:05,400 --> 00:00:09,310
job has three steps, actually four if you
工作分三個步驟,實際上四個步驟
4
00:00:09,509 --> 00:00:11,559
include the setup that we did in the
包括我們在
5
00:00:11,759 --> 00:00:13,599
previous section so the first step here
上一節,所以這裡的第一步
6
00:00:13,798 --> 00:00:16,839
is we're going to upload our Hadoop job
是我們要上傳我們的Hadoop工作
7
00:00:17,039 --> 00:00:18,640
jar and whatever data we want to process
jar和我們要處理的任何資料
8
00:00:18,839 --> 00:00:24,310
to s3 then we tell Amazon what we want
到s3,然後我們告訴亞馬遜我們想要什麼
9
00:00:24,510 --> 00:00:26,230
to do with our job which basically means
對我們的作業做什麼,這基本上是指
10
00:00:26,429 --> 00:00:28,690
what kind of job we're running what data
我們正在執行什麼樣的工作
11
00:00:28,890 --> 00:00:31,359
we're processing for input where we want
作為輸入處理,我們希望
12
00:00:31,559 --> 00:00:32,979
the results to go what kind of logging
結果放到哪裡,想要什麼樣的日誌記錄
13
00:00:33,179 --> 00:00:35,890
we want etc and finally we run the job
我們想要等等,最後我們開始工作
Etc:etcetera
14
00:00:36,090 --> 00:00:38,198
we wait for it to finish we can monitor
我們等待它完成,我們可以監控
Monitor:to keep watch on something
15
00:00:38,399 --> 00:00:42,549
it and we can look at the results so the
它,我們可以看一下結果,所以
16
00:00:42,750 --> 00:00:44,288
first step is setting up that s3 bucket
第一步是設置s3存儲桶
17
00:00:44,488 --> 00:00:46,029
we created the bucket in the previous
我們在上一個創建了存儲桶
18
00:00:46,229 --> 00:00:49,239
section but now we need to set up the
部分,但現在我們需要設置
19
00:00:49,439 --> 00:00:51,669
four different sub directories that
四個不同的子目錄
Sub:small part
20
00:00:51,869 --> 00:00:53,198
contain the elements of a job
包含工作要素
21
00:00:53,399 --> 00:00:54,969
so one of those elements is the Hadoop
所以其中之一就是Hadoop
22
00:00:55,170 --> 00:00:57,159
job jar in this example that I'm going
作業jar。在這個例子中,我要
23
00:00:57,359 --> 00:00:59,320
to be doing we've got a bucket called
做的事情是:我們有一個存儲桶叫
24
00:00:59,520 --> 00:01:02,948
aws-test-kk in there is a job directory
aws-test-kk,裡面有一個job目錄
Directory:目錄
25
00:01:03,149 --> 00:01:05,200
and in that job directory we put the job
然後在該工作目錄中放置該工作
26
00:01:05,400 --> 00:01:08,528
jar then we've got our input data I'm
jar然後我們有輸入資料
27
00:01:08,728 --> 00:01:09,849
going to upload some data that our job
要上傳一些資料,我們的工作
28
00:01:10,049 --> 00:01:13,238
is going to process we have the
將要處理的
29
00:01:13,438 --> 00:01:14,500
directory where we want to put our
我們想要放置
30
00:01:14,700 --> 00:01:16,179
results and we have the directory where
結果,我們有目錄
31
00:01:16,379 --> 00:01:18,009
we want we're going to tell Amazon that
我們希望我們要告訴亞馬遜
32
00:01:18,209 --> 00:01:19,840
we want it to put the log files from the
我們希望它可以將日誌檔從
33
00:01:20,040 --> 00:01:22,988
job and again we can use the AWS console
工作,我們可以再次使用AWS控制台
34
00:01:23,188 --> 00:01:25,299
to create all these directories inside
在裡面創建所有這些目錄
35
00:01:25,500 --> 00:01:27,549
the bucket and to handle uploading files
存儲桶並處理上傳檔
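The console clicks that follow can also be scripted. Here is a minimal boto3 sketch of the same setup; boto3 is not used in the video, and the bucket and file names (aws-test-kk, wikipedia-ngrams-job.jar, enwiki-split.xml) are assumptions normalized from what is spoken:

```python
# Minimal boto3 sketch of the setup described above; names are assumptions.
import boto3

s3 = boto3.client("s3")
BUCKET = "aws-test-kk"  # assumed normalized form of the bucket in the video

# Console "folders" are just zero-byte keys ending in "/"
for folder in ("job/", "logs/", "data/", "results/"):
    s3.put_object(Bucket=BUCKET, Key=folder)

# Upload the job jar and the input data (file names assumed)
s3.upload_file("wikipedia-ngrams-job.jar", BUCKET, "job/wikipedia-ngrams-job.jar")
s3.upload_file("enwiki-split.xml", BUCKET, "data/enwiki-split.xml")
```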
36
00:01:27,750 --> 00:01:30,730
so let's go do that alright we're going
所以我們去做吧,我們要去
37
00:01:30,930 --> 00:01:32,200
to walk through the steps required to
完成所需的步驟
38
00:01:32,400 --> 00:01:34,869
get the data up into s3 that we need to
把資料上傳到s3,這是我們
39
00:01:35,069 --> 00:01:36,969
be able to run our job so once again we
運行作業所需要的。再一次,我們
40
00:01:37,170 --> 00:01:38,649
start at the top level of the AWS
從AWS的頂層開始
41
00:01:38,849 --> 00:01:40,988
console I'm going to click on s3 because
控制台,我要按一下s3,因為
42
00:01:41,188 --> 00:01:43,269
we're going to be pushing data up into
我們將把資料推入
43
00:01:43,469 --> 00:01:45,340
an s3 bucket that we'd previously
我們之前使用過的s3存儲桶
44
00:01:45,540 --> 00:01:47,859
created so over here we've got this AWS
在這裡創建了這個AWS
45
00:01:48,060 --> 00:01:50,500
test kk bucket now one of the first
test kk存儲桶。現在,首先
46
00:01:50,700 --> 00:01:52,058
things I typically do here is I create
我通常在這裡做的事情是我創建的
47
00:01:52,259 --> 00:01:54,819
some folders so inside this bucket I
一些資料夾,所以我在這個桶裡
48
00:01:55,019 --> 00:01:57,390
have a folder I'm going to call job and
有一個資料夾,我將它命名為job,
49
00:01:57,590 --> 00:02:00,219
this is where I'm going to put the job
這是我要安排工作的地方
50
00:02:00,420 --> 00:02:02,918
jar I'm going to create another folder
jar我要創建另一個資料夾
51
00:02:03,118 --> 00:02:05,649
and this one I'm going to call logs and
這個我要稱為日誌
52
00:02:05,849 --> 00:02:08,140
this is where I'm going to tell EMR to
這是我要告訴EMR
53
00:02:08,340 --> 00:02:10,990
put the job log files at the end of the
將作業日誌檔放在
54
00:02:11,189 --> 00:02:12,030
job
工作
55
00:02:12,229 --> 00:02:13,980
I'm also going to create a folder here
我還要在這裡創建一個資料夾
56
00:02:14,180 --> 00:02:16,349
called data where I'm going to upload my
名為data的資料夾,我要在那裡上傳我的
57
00:02:16,549 --> 00:02:19,050
input data and I'm going to create
輸入資料,我將創建
58
00:02:19,250 --> 00:02:21,840
another folder here called results which
這裡的另一個資料夾稱為results其中
59
00:02:22,039 --> 00:02:23,130
is what I'm going to use for the results
這就是我要用於結果的
60
00:02:23,330 --> 00:02:24,810
of the job so I've got these folders
工作,所以我有這些資料夾
61
00:02:25,009 --> 00:02:28,050
created so now I'm going to drill into
創建,所以現在我要深入研究
drill into:(here) to open/navigate into 進入查看
62
00:02:28,250 --> 00:02:30,719
the job directory which is empty and now
現在是空的作業目錄
63
00:02:30,919 --> 00:02:36,840
I'm going to upload a job jar so let us
我要上傳一個作業jar,那麼讓我們
64
00:02:37,039 --> 00:02:42,030
go find a job jar to upload here it is
去找一個作業jar上傳。就是這個
65
00:02:42,229 --> 00:02:43,200
and you can see it's seven point eight
你會看到七點八
66
00:02:43,400 --> 00:02:46,289
megabytes so it's going to take a little
百萬位元組,這將需要一點時間
7.8MB
67
00:02:46,489 --> 00:02:53,819
while we can watch it over here it's
同時我們可以在這裡看著它
68
00:02:54,019 --> 00:02:56,670
going pretty fast alright so while that
進行得很快。好了,趁它
69
00:02:56,870 --> 00:02:58,679
is uploading I can actually start other
正在上傳,我實際上可以開始其他
70
00:02:58,878 --> 00:03:00,060
uploads but in this case it's going to
上傳,但在這種情況下
71
00:03:00,259 --> 00:03:04,710
finish fast enough okay so now let's go
完成足夠快,好吧,現在開始
72
00:03:04,909 --> 00:03:08,399
back to here and let us open up the data
回到這裡,讓我們打開資料
73
00:03:08,598 --> 00:03:10,590
directory and now I'd like to upload
目錄,現在我要上傳
74
00:03:10,789 --> 00:03:16,039
some input data so here I've got some
一些輸入資料,所以在這裡我有一些
75
00:03:16,239 --> 00:03:18,480
Wikipedia data I previously prepared
我之前準備的維琪百科資料
76
00:03:18,680 --> 00:03:21,360
small sample of it and this shouldn't
它的小樣本,這不應該
77
00:03:21,560 --> 00:03:23,689
take very long because it's pretty small
花很長時間,因為它很小
78
00:03:23,889 --> 00:03:26,640
all right so at this point I've got both
好的,所以在這一點上,我已經
79
00:03:26,840 --> 00:03:29,129
the job jar uploaded and the input data
上載的作業jar和輸入資料
80
00:03:29,329 --> 00:03:31,770
uploaded now the second step is to do
現在上傳第二步是做
81
00:03:31,969 --> 00:03:34,920
what elastic MapReduce calls creating
Elastic MapReduce所謂的「創建
82
00:03:35,120 --> 00:03:38,009
the job flow. this job flow has a whole
作業流」。這個作業流有一大
83
00:03:38,209 --> 00:03:41,219
bunch of settings specifically you have
一堆具體的設置
Whole bunch:口語so much
84
00:03:41,419 --> 00:03:42,420
to give it a name you have to tell it
給它起個名字,你必須告訴它
85
00:03:42,620 --> 00:03:44,849
what kind of job it is you need to
你需要做什麼工作
86
00:03:45,049 --> 00:03:47,909
specify the cluster what type of servers
指定集群什麼類型的伺服器
Specify:to say clearly
87
00:03:48,109 --> 00:03:50,039
how many you need to tell it what key
多少台;你需要告訴它用哪個金鑰
88
00:03:50,239 --> 00:03:51,780
pair to use to run the job where to put
對來運行作業,要把
89
00:03:51,979 --> 00:03:53,759
the log files few other things that you
日誌檔放在哪裡,還有其他一些你
90
00:03:53,959 --> 00:03:56,159
typically don't need to care about. So
通常不需要在意
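Collected in one place, those settings might look like the following boto3 sketch (the video uses the console; boto3 and every name, path, and argument flag below are assumptions based on the narration):

```python
# Hedged boto3 sketch of the job flow definition walked through below;
# all names, instance types, and argument flags are assumptions.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="Wikipedia processing",
    LogUri="s3://aws-test-kk/logs/",
    Instances={
        "MasterInstanceType": "m1.small",
        "SlaveInstanceType": "m1.small",
        "InstanceCount": 3,            # 1 master + 2 slaves
        "Ec2KeyName": "aws-test-key",  # key pair created in the previous section
        "KeepJobFlowAliveWhenNoSteps": False,  # "keepalive is set to no"
    },
    Steps=[{
        "Name": "Wikipedia ngrams",
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": "s3://aws-test-kk/job/wikipedia-ngrams-job.jar",
            "Args": ["-inputfile", "s3n://aws-test-kk/data/enwiki-split.xml",
                     "-outputdir", "s3n://aws-test-kk/results",
                     "-numreducers", "1"],
        },
    }],
)
print(response["JobFlowId"])  # the "j-..." ID of the new job flow
```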
91
00:03:56,359 --> 00:03:57,420
we're going to go and we're going to set
我們要走了,我們要設定
92
00:03:57,620 --> 00:04:00,118
up a job flow and now I can go and
工作流,現在我可以去了
93
00:04:00,318 --> 00:04:03,569
actually create the job flow using the
實際使用創建作業流程
94
00:04:03,769 --> 00:04:06,000
EMR interface so I'm going to click on
EMR介面,所以我要點擊
95
00:04:06,199 --> 00:04:09,990
elastic MapReduce here it's going to
彈性MapReduce在這裡
96
00:04:10,189 --> 00:04:11,368
show me that I don't have any job flows
告訴我我沒有工作流
97
00:04:11,568 --> 00:04:15,980
so I'm going to create a new job flow
所以我要創建一個新的工作流程
98
00:04:16,180 --> 00:04:19,740
and I'll call this
我稱這個
99
00:04:19,939 --> 00:04:23,980
Wikipedia processing I'm going to run my
Wikipedia處理我要運行我的
100
00:04:24,180 --> 00:04:25,900
own application versus there are some
自己的應用程式與有一些
Application:軟體
Versus:比較
101
00:04:26,100 --> 00:04:27,370
samples that have been pre-created and
預先創建的樣本和
102
00:04:27,569 --> 00:04:30,090
the job type here that I'm running is
我正在運行的工作類型是
103
00:04:30,290 --> 00:04:36,730
going to be a custom jar now it's going
是自訂jar。現在它會
104
00:04:36,930 --> 00:04:38,860
to ask me where this jar is located and
問我這個jar位於哪裡,
105
00:04:39,060 --> 00:04:41,259
here I need to put in the path starting
在這裡我需要填入路徑,從
106
00:04:41,459 --> 00:04:43,569
with the bucket in s3 where the job is
在工作所在的s3中使用存儲桶
107
00:04:43,769 --> 00:04:45,160
the job jar is located
作業jar所在的位置
108
00:04:45,360 --> 00:04:49,620
so I know I put this into AWS test kk /
所以我知道我將其放入AWS測試kk /
109
00:04:49,819 --> 00:04:55,829
job / and it's called the wikipedia-
job/,它叫做wikipedia-
110
00:04:56,029 --> 00:05:01,210
ngrams-job.jar now I have to specify
ngrams-job.jar。現在我必須指定
111
00:05:01,410 --> 00:05:02,470
the arguments and these are the
參數,這些是
112
00:05:02,670 --> 00:05:04,329
arguments that are actually going to the
實際上要去的論點
113
00:05:04,529 --> 00:05:07,660
main method of the class that's been
該類的主要方法
114
00:05:07,860 --> 00:05:10,090
specified in my job jar's manifest so
在我的作業jar的manifest中指定的。所以
Manifest:jar檔中列出主類別等資訊的中繼資料檔
115
00:05:10,290 --> 00:05:12,850
here I know I need to specify the input
在這裡我知道我需要指定輸入
Specify:指定
116
00:05:13,050 --> 00:05:14,290
file that I'm going to be processing and
我將要處理的檔
117
00:05:14,490 --> 00:05:18,310
note that here I'm using real HDFS paths
請注意,這裡我使用的是真實的HDFS路徑
118
00:05:18,509 --> 00:05:21,460
so I'm specifying s3 as the protocol
所以我指定s3作為協議
Protocol:(here) the URI scheme, e.g. s3:// or s3n://
119
00:05:21,660 --> 00:05:23,259
because input file is going to be coming
因為輸入檔即將到來
120
00:05:23,459 --> 00:05:26,290
from s3 and of course it's coming out of
從s3開始,當然是從
121
00:05:26,490 --> 00:05:30,939
the AWS test KK bucket at the location
該位置的AWS測試KK存儲桶
122
00:05:31,139 --> 00:05:38,770
of the data subdir and enwiki-split.xml
data子目錄下的enwiki-split.xml檔
123
00:05:38,970 --> 00:05:42,278
I also have to tell my program where the
我還必須告訴我的程式
124
00:05:42,478 --> 00:05:43,778
output is going so I'm going to say -
輸出將繼續,所以我要說-
125
00:05:43,978 --> 00:05:46,810
outputdir in this case again it needs
outputdir。在這種情況下它同樣需要
126
00:05:47,009 --> 00:05:49,509
to go into s3 because otherwise it's
進入s3,因為否則
Otherwise:要不然
127
00:05:49,709 --> 00:05:50,860
just going to disappear when the cluster
當集群消失時
128
00:05:51,060 --> 00:05:54,069
terminates so I'm going to again put it
終止,所以我要再次放入
Terminates:to turn off or end
129
00:05:54,269 --> 00:06:00,040
into that same AWS test KK bucket in the
放入同一AWS測試KK存儲桶中
130
00:06:00,240 --> 00:06:05,468
results directory and I also can specify
結果目錄,我也可以指定
131
00:06:05,668 --> 00:06:07,300
an additional parameter my program that
我程式的另一個參數是
132
00:06:07,500 --> 00:06:09,040
says I only want to use one reduce task
說我只想用一個reduce任務
133
00:06:09,240 --> 00:06:11,100
so I'll end up with a single output file
所以我將得到一個輸出檔
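So the argument vector that reaches the main method of the manifest class would look something like this (a sketch; only the s3n:// paths and the single-reducer setting are stated in the video, the flag names are assumptions):

```python
# Hypothetical flag names; paths follow the narration.
args = [
    "-inputfile",   "s3n://aws-test-kk/data/enwiki-split.xml",  # input from S3
    "-outputdir",   "s3n://aws-test-kk/results",                # results back to S3
    "-numreducers", "1",                                        # one output file
]
```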
134
00:06:11,300 --> 00:06:14,650
so I click the continue button and now
所以我點擊繼續按鈕,現在
135
00:06:14,850 --> 00:06:16,960
it's letting me pick the type and the
讓我選擇類型和
136
00:06:17,160 --> 00:06:18,218
number of the servers that are going
即將運行的伺服器數量
137
00:06:18,418 --> 00:06:20,468
into my cluster so for my master I'm
進入我的集群。對於master節點,我
Master:(here) the cluster's master node
138
00:06:20,668 --> 00:06:22,980
going to use an m1 small instance here
將在這裡使用m1小實例
139
00:06:23,180 --> 00:06:25,660
for my slaves I'm going to use 2 m1
slave節點我要用2個m1
140
00:06:25,860 --> 00:06:28,120
small instances and I'm not going to use
小實例,我將不使用
Instances:(here) virtual servers
141
00:06:28,319 --> 00:06:30,790
any task-only instances we'll talk
任何task專用實例;我們會
142
00:06:30,990 --> 00:06:32,680
about that in the last module of the
關於這一點的最後一個模組
143
00:06:32,879 --> 00:06:33,040
course
課程
144
00:06:33,240 --> 00:06:35,499
where you can use a task instance
您可以在其中使用task實例
145
00:06:35,699 --> 00:06:37,778
group and request spot pricing for it
分組並要求現貨定價
146
00:06:37,978 --> 00:06:39,160
and there's good reasons for doing that
這是有充分理由的
147
00:06:39,360 --> 00:06:40,569
but we don't need to do that for this
但是我們不需要這個
148
00:06:40,769 --> 00:06:45,189
particular example and for keys I'm
這個例子。至於金鑰,我
149
00:06:45,389 --> 00:06:47,649
using that AWS test key that I
使用之前創建的那個AWS測試
150
00:06:47,848 --> 00:06:49,838
previously created for the key pair I
金鑰作為金鑰對。我
151
00:06:50,038 --> 00:06:52,240
don't need to have a virtual private
不需要虛擬私人
Virtual:虛擬的
152
00:06:52,439 --> 00:06:57,189
cloud for the logs that are being
雲(VPC)。對於作業
153
00:06:57,389 --> 00:06:59,379
generated by the job I want them to go
產生的日誌,我希望它們進入
Generated:something made by computer/machine
154
00:06:59,579 --> 00:07:01,838
into that aws test kk bucket in the
那個aws test kk存儲桶的
155
00:07:02,038 --> 00:07:05,490
logs subdirectory I'm not doing any
logs子目錄。我沒有啟用任何
156
00:07:05,689 --> 00:07:08,410
special debugging logging if I did want
如果需要的話,進行特殊的調試日誌記錄
157
00:07:08,610 --> 00:07:10,420
to do this I'd have to use SimpleDB
為此,我必須使用SimpleDB
158
00:07:10,620 --> 00:07:13,660
we'll talk about that later and I don't
我們稍後再談,我不會
159
00:07:13,860 --> 00:07:16,389
need to keep my cluster around once the
需要在作業完成後繼續保留集群
160
00:07:16,589 --> 00:07:21,228
job finishes so keepalive is set to no I
所以keepalive設為no。我
161
00:07:21,620 --> 00:07:25,389
click continue lets me decide whether I
按一下繼續讓我決定是否
162
00:07:25,589 --> 00:07:27,009
want to use any bootstrap
想要使用任何引導程式
163
00:07:27,209 --> 00:07:28,600
actions and we'll talk about that again
行動,我們將再次討論
164
00:07:28,800 --> 00:07:31,360
in the last module this is a way to sort
在最後一個模組中,這是一種排序方式
to sort of :有一點
165
00:07:31,560 --> 00:07:33,249
of alter the configuration of my cluster
更改集群的配置
Alter=change
166
00:07:33,449 --> 00:07:35,050
or do special setup with servers on my
或對集群上的伺服器做特殊設置
167
00:07:35,250 --> 00:07:36,759
cluster I don't need any of that
群集我不需要任何
168
00:07:36,959 --> 00:07:39,399
so I click continue it gives me one last
所以我點擊繼續,它給了我最後一個
169
00:07:39,598 --> 00:07:41,468
chance to check over all the settings
有機會檢查所有設置
170
00:07:41,668 --> 00:07:45,430
and then I can create the job flow once
然後我可以創建一次工作流程
171
00:07:45,629 --> 00:07:47,559
the job flow has been created I can go
作業流創建好後,我可以回
172
00:07:47,759 --> 00:07:49,149
back over here and it will show me that
回到這裡,它會告訴我
173
00:07:49,348 --> 00:07:52,209
I've got this job that's starting up and
我已經開始這項工作了,
174
00:07:52,408 --> 00:07:55,059
at some point typically a couple of
在某些時候通常是幾個
175
00:07:55,259 --> 00:07:56,769
minutes my cluster will be running which
我的集群將在幾分鐘內運行
176
00:07:56,968 --> 00:08:00,218
means elastic MapReduce has allocated
表示已分配彈性MapReduce
Allocated: 分配
177
00:08:00,418 --> 00:08:03,069
the servers that I asked for, provisioned
我請求的伺服器,為
178
00:08:03,269 --> 00:08:06,819
them with Hadoop downloaded my job jar
它們裝好Hadoop,下載了我的作業jar
179
00:08:07,019 --> 00:08:08,680
and started up the job and so we're
然後開始工作,所以我們
180
00:08:08,879 --> 00:08:10,240
going to wait until that happens and
等到那件事發生
181
00:08:10,439 --> 00:08:11,980
we're going to take a look at the job as
我們將看一下這份工作,
182
00:08:12,180 --> 00:08:15,338
it's running while your job is running
它正在運行,而您的作業正在運行
183
00:08:15,538 --> 00:08:18,129
you can use the AWS console to monitor
您可以使用AWS控制台進行監控
184
00:08:18,329 --> 00:08:20,230
it to find out what state it's in is it
找出它處於什麼狀態
185
00:08:20,430 --> 00:08:21,670
starting up is it actually running the job
正在啟動?還是正在運行作業?
186
00:08:21,870 --> 00:08:25,180
is it terminating is it done you can
它終止了嗎,你可以
187
00:08:25,379 --> 00:08:26,889
also see how long it's been running and
還可以看到它運行了多長時間
188
00:08:27,089 --> 00:08:28,930
you get an estimate of roughly how much
您估計大概有多少
Estimate(N)/(V) roughly(adv)
189
00:08:29,129 --> 00:08:30,670
it's going to cost you and if need be
這會花你的錢,如果需要的話
190
00:08:30,870 --> 00:08:32,859
you can terminate the job so we've
您可以終止工作,所以我們已經
191
00:08:33,059 --> 00:08:34,659
started our job let's go take a look at
開始我們的工作,讓我們來看一下
192
00:08:34,860 --> 00:08:37,120
it okay now you see we're actually
好吧,現在您看到我們實際上
193
00:08:37,320 --> 00:08:39,159
running the job so the status has
運行作業,因此狀態為
194
00:08:39,360 --> 00:08:42,370
changed to running and here it shows
更改為運行,它在這裡顯示
195
00:08:42,570 --> 00:08:44,740
that you have this normalized instance
你有這個規範化的實例
Normalized:(here) converted to a standard billing unit
196
00:08:44,940 --> 00:08:47,019
hours what that's saying is that as
幾個小時的意思是
197
00:08:47,220 --> 00:08:49,409
this cluster starts actually running I'm
這個集群實際上開始運行我
198
00:08:49,610 --> 00:08:53,709
being charged for three servers times up
被收取三台伺服器的費用
199
00:08:53,909 --> 00:08:57,339
to an hour and if I actually had a job
一個小時,如果我真的有工作
200
00:08:57,539 --> 00:08:58,779
that ran longer than an hour then you'd
跑了一個多小時,然後你會
201
00:08:58,980 --> 00:09:00,729
see this jumping up to six instance
看到這跳到六個實例
202
00:09:00,929 --> 00:09:02,919
hours this is one of the reasons why you
小時,這就是你為什麼的原因之一
203
00:09:03,120 --> 00:09:04,659
want to avoid having a job that fails
想要避免工作失敗
204
00:09:04,860 --> 00:09:06,128
right away because you're still going to
立即開始,因為您仍然要
205
00:09:06,328 --> 00:09:09,008
pay for the number of servers times at
支付伺服器數量乘以
206
00:09:09,208 --> 00:09:10,568
least one hour even if your job only
至少一小時,即使僅工作
207
00:09:10,769 --> 00:09:12,299
runs for 10 seconds
運行10秒
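For reference, a hedged sketch of polling the job state and of the billing floor just described (the cluster ID here is hypothetical; the video does this through the console):

```python
# Hedged sketch: poll the cluster state and compute the billing floor.
import math
import time
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXX"  # hypothetical job flow / cluster ID

while True:
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    print("state:", state)  # STARTING, RUNNING, TERMINATING, TERMINATED, ...
    if state in ("TERMINATED", "TERMINATED_WITH_ERRORS", "WAITING"):
        break
    time.sleep(30)

# Each instance is billed for at least one full hour, so a 10-second job
# on 3 m1.small servers still costs 3 normalized instance hours.
servers, elapsed_hours = 3, 10 / 3600
print("instance hours:", servers * max(1, math.ceil(elapsed_hours)))
```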
208
00:09:12,500 --> 00:09:14,438
now the actual runtime for this job
現在該作業的實際執行時間
209
00:09:14,639 --> 00:09:15,818
isn't very long so I expect pretty
不太長,所以我希望漂亮
210
00:09:16,019 --> 00:09:19,029
quickly this job will succeed if I go
如果我去,這項工作很快就會成功
211
00:09:19,230 --> 00:09:22,928
down here and look at these steps what
在這裡,看看這些步驟是什麼
212
00:09:23,129 --> 00:09:25,178
you'll see here is I've got essentially
你會看到,基本上我
Essentially:basically / in essence
213
00:09:25,379 --> 00:09:28,599
one step in my flow which is the single
我流程中的一個步驟,就是那個單一的
214
00:09:28,799 --> 00:09:31,568
jar job that's running with my
jar作業,使用我
215
00:09:31,769 --> 00:09:35,419
parameters that I passed it down here
我在這裡傳遞的參數
216
00:09:36,799 --> 00:09:39,159
once the job finishes the status will
工作完成後,狀態將
217
00:09:39,360 --> 00:09:42,149
change to shutting down at that point
更改為在此時關閉
218
00:09:42,350 --> 00:09:46,149
the results of the run are being copied
運行結果被複製
219
00:09:46,350 --> 00:09:48,339
up to s3 which includes both the results
最多s3,其中包括兩個結果
220
00:09:48,539 --> 00:09:53,438
and also the log files so you can see
還有日誌檔,這樣您就可以看到
221
00:09:53,639 --> 00:09:55,709
the status just changed to shutting down
狀態剛剛變為關閉
222
00:09:55,909 --> 00:09:58,299
my elapsed time and the elapsed time
我經過的時間和經過的時間
Elapsed:past
223
00:09:58,500 --> 00:10:00,998
actually doesn't start until the cluster
其實要等到集群
224
00:10:01,198 --> 00:10:03,818
is actually up and running the job so
真正啟動並跑起作業才開始計算,所以
225
00:10:04,019 --> 00:10:05,078
total elapsed time for this job was only
這項工作的總耗時僅為
226
00:10:05,278 --> 00:10:08,289
four minutes now the job is finished I
四分鐘現在工作完成了
227
00:10:08,490 --> 00:10:09,849
can actually go take a look at the
可以去看看
228
00:10:10,049 --> 00:10:13,358
results when I set up my job I specified
我指定的工作結果
229
00:10:13,558 --> 00:10:15,128
my output directory and that was
我的輸出目錄是
230
00:10:15,328 --> 00:10:17,498
actually a parameter to the job that I
實際上是我工作的一個參數
231
00:10:17,698 --> 00:10:20,078
was running so I told it where in s3 to
運行的作業的參數,我告訴它在s3的哪裡
232
00:10:20,278 --> 00:10:22,779
put it using the s3n protocol now
用s3n協定放置。現在
233
00:10:22,980 --> 00:10:25,448
because the Hadoop cluster goes away at
因為Hadoop集群在
234
00:10:25,649 --> 00:10:27,159
the end of the job it means all the
工作的結束意味著
235
00:10:27,360 --> 00:10:30,099
drives that are used for HDFS they're
用於HDFS的驅動器
236
00:10:30,299 --> 00:10:32,878
ephemeral which means they disappear so
短暫的意味著他們消失了
Ephemeral:短時間
237
00:10:33,078 --> 00:10:35,799
the only way to persist data typically
持久存儲資料的唯一方法
Persist:keep going
238
00:10:36,000 --> 00:10:38,589
is that you have to write it to s3
是你必須將其寫入s3
239
00:10:38,789 --> 00:10:42,428
now you can set up job flows where they
現在您可以在他們
240
00:10:42,629 --> 00:10:44,019
stay alive that means they don't
保持存活,也就是說它們不會
241
00:10:44,220 --> 00:10:45,428
terminate at the end of your job and
在工作結束時終止並
242
00:10:45,629 --> 00:10:48,428
that's a great way to debug jobs when
這是調試作業的好方法
243
00:10:48,629 --> 00:10:49,568
you're first getting started with them
剛開始使用它們時,
244
00:10:49,769 --> 00:10:52,178
and we'll talk about that more later but
我們稍後再討論
245
00:10:52,379 --> 00:10:55,448
a typical run has everything going into
一個典型的運行將一切
246
00:10:55,649 --> 00:10:58,358
s3 which means you get both the results
s3,這表示您會同時得到結果
247
00:10:58,558 --> 00:11:00,519
and typically that's because your
通常是因為
248
00:11:00,720 --> 00:11:00,909
program
程式
249
00:11:01,110 --> 00:11:03,128
is using some output path that you
正在使用您的某些輸出路徑
250
00:11:03,328 --> 00:11:05,198
specify and you tell it that you want to
指定並告訴您您想要
251
00:11:05,399 --> 00:11:08,529
write it into s3 and secondly elastic
將其寫入s3,其次是彈性
252
00:11:08,730 --> 00:11:09,849
MapReduce is going to copy all of the
MapReduce將複製所有
253
00:11:10,049 --> 00:11:11,469
log files up to the location you
將檔記錄到您所在的位置
254
00:11:11,669 --> 00:11:14,439
specified in s3 and in our case we used
在s3中指定,在本例中我們使用
255
00:11:14,639 --> 00:11:19,349
the bucket name aws test kk slash logs
存儲桶名稱aws test kk斜杠logs
256
00:11:19,549 --> 00:11:22,448
so let's go look at the job results now
所以現在我們來看一下工作結果
257
00:11:22,649 --> 00:11:29,219
if I go over to s3 and I take a look in
如果我轉到s3,然後看看
258
00:11:29,419 --> 00:11:34,939
my AWS test KK bucket
我的AWS測試KK存儲桶
259
00:11:35,419 --> 00:11:37,689
you see I've got my four directories
你看我有四個目錄
260
00:11:37,889 --> 00:11:41,870
there if I look in the results directory
如果我在結果目錄中查看
261
00:11:43,070 --> 00:11:46,120
my job has created two subdirectories
我的作業創建了兩個子目錄
262
00:11:46,320 --> 00:11:47,679
inside there raw counts and sorted
在裡面:raw counts和sorted
raw counts:原始計數
263
00:11:47,879 --> 00:11:49,990
counts sorted counts I know is the final
counts。我知道sorted counts是最終的
264
00:11:50,190 --> 00:11:54,818
output and here you see the typical
輸出,在這裡您可以看到典型的
265
00:11:55,019 --> 00:11:56,948
Hadoop success file and then also a part
Hadoop成功檔,然後也是一部分
266
00:11:57,149 --> 00:12:00,729
file from the reducer so part-r- and five zeros
來自reducer的檔,即part-r-00000
Reducer:(here) the reduce phase of a Hadoop job
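Fetching that single part file can also be scripted; a minimal boto3 sketch, with the key path assumed from the on-screen layout:

```python
# Minimal boto3 sketch; the key path is an assumption.
import boto3

s3 = boto3.client("s3")
s3.download_file("aws-test-kk",
                 "results/sorted_counts/part-r-00000",  # _SUCCESS sits alongside
                 "part-r-00000")
with open("part-r-00000") as f:
    for line in list(f)[:5]:  # bigram<TAB>count lines
        print(line.rstrip())
```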
267
00:12:00,929 --> 00:12:03,729
and I can actually download this file
我實際上可以下載該檔
268
00:12:03,929 --> 00:12:07,809
and I'll open it up with BBEdit
我將使用BBEdit打開它
269
00:12:08,009 --> 00:12:12,370
my editor of choice and that displays
我選擇的編輯器,顯示
270
00:12:12,570 --> 00:12:14,979
results that look like this now what this
結果現在看起來像這樣
271
00:12:15,179 --> 00:12:17,409
job does is it generates bigram counts
的工作是它會生成雙字母組計數
272
00:12:17,610 --> 00:12:20,289
from text found in Wikipedia stories so
從維琪百科故事中找到的文字
273
00:12:20,490 --> 00:12:21,969
you can see that the most common bigram
您可以看到最常見的bigram
274
00:12:22,169 --> 00:12:24,370
was a space and it occurred 4,025
是一個空格,它出現了4,025
Occurred:happen
275
00:12:24,570 --> 00:12:27,189
times in that snippet from Wikipedia
維琪百科中該片段的時間
Snippet:a little piece
276
00:12:27,389 --> 00:12:30,909
that I uploaded into the s3 data
我上傳到s3數據中
277
00:12:31,110 --> 00:12:35,979
subdirectory of my bucket
的data子目錄裡
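As a toy illustration of what a bigram count is (the real job is a Hadoop jar; whether it counts character or word pairs is not spelled out, so this sketch assumes character pairs):

```python
# Toy bigram count over a string, assuming character-level pairs.
from collections import Counter

text = "to be or not to be"
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
# Pairs containing spaces are common in prose, which is consistent with a
# space-heavy bigram topping the video's counts at 4,025 occurrences.
print(bigrams.most_common(3))
```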
278
00:12:36,179 --> 00:12:40,679
if we go back up to the bucket level I
如果我們回到存儲桶級別,我
279
00:12:40,879 --> 00:12:44,469
look in logs what you'll see here is it's
查看日誌,您會在這裡看到它的
280
00:12:44,669 --> 00:12:48,389
created a J - and then a job ID
創建了一個J-然後是一個工作ID
281
00:12:48,589 --> 00:12:50,828
directory so this has information about
目錄,因此它具有有關的資訊
282
00:12:51,028 --> 00:12:53,948
the actual job if I open up this one
實際的工作,如果我打開這個
283
00:12:54,149 --> 00:12:55,089
you'll see there's a bunch of sub
你會看到那裡有一堆子
284
00:12:55,289 --> 00:12:57,219
directories most of this isn't that
目錄大部分不是
285
00:12:57,419 --> 00:12:58,479
interesting it's information being
有趣的是信息
286
00:12:58,679 --> 00:13:00,279
logged by the hadoop system itself but
由hadoop系統本身記錄,但
287
00:13:00,480 --> 00:13:03,758
if you look in steps I only had one step
如果你看一步,我只有一步
288
00:13:03,958 --> 00:13:05,229
so it's just a single subdirectory
所以它只是一個子目錄
289
00:13:05,429 --> 00:13:08,198
called one if I look inside of that
名為1的子目錄。如果我看它裡面
290
00:13:08,399 --> 00:13:10,359
you'll see I've got a number of files in
您會看到我有很多文件
291
00:13:10,559 --> 00:13:12,939
here the interesting one is standard out
這裡,有趣的是standard out(標準輸出)
292
00:13:13,139 --> 00:13:14,779
this is where my logging goes
這是我的日誌輸出的去處
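Reading that stdout log straight from S3 can be scripted too; a hedged boto3 sketch, with the key layout taken from what is shown (logs/&lt;jobflow-id&gt;/steps/1/stdout) and a hypothetical job flow ID:

```python
# Hedged boto3 sketch of reading a step's stdout log from S3.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="aws-test-kk",
                    Key="logs/j-XXXXXXXXXXXX/steps/1/stdout")  # hypothetical ID
print(obj["Body"].read().decode("utf-8"))
```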
293
00:13:14,980 --> 00:13:16,490
so if I double-click at this because
所以如果我按兩下這個,因為
294
00:13:16,690 --> 00:13:19,490
it's a regular text file I'll just get a
這是一個普通的文字檔,我會得到一個
295
00:13:19,690 --> 00:13:21,409
window popped up with the results with
彈出帶有結果的視窗
296
00:13:21,610 --> 00:13:23,899
the contents of this and this is what my
這的內容,這就是我的
297
00:13:24,100 --> 00:13:26,659
job was logging it was giving me
工作正在記錄,這給了我
298
00:13:26,860 --> 00:13:28,339
information about the job tracker and I
有關我和工作跟蹤器的資訊
299
00:13:28,539 --> 00:13:29,959
printed out some information about input
列印出一些有關輸入的資訊
300
00:13:30,159 --> 00:13:32,839
and output paths and the ngram size I
和輸出路徑,以及我使用的ngram大小
301
00:13:33,039 --> 00:13:38,779
was using etc so as you can see elastic
使用等,如您所見,彈性
Etc:etcetera
302
00:13:38,980 --> 00:13:41,029
MapReduce automatically uploaded this
MapReduce自動上傳了這個
303
00:13:41,230 --> 00:13:43,189
information into the directory that I'd
資訊進入我想要的目錄
304
00:13:43,389 --> 00:13:45,289
specified as my logs directory when I
當我指定為我的日誌目錄
305
00:13:45,490 --> 00:13:48,099
was defining my job and if in particular
在定義我的工作,特別是
306
00:13:48,299 --> 00:13:50,479
my job failed then this is where I'd
我的工作失敗了,這就是我要去的地方
307
00:13:50,679 --> 00:13:52,519
start looking for clues as to what went
開始尋找發生了什麼的線索
308
00:13:52,720 --> 00:13:55,309
wrong and remember that there's a lag
錯了,請記住有一個滯後
Lag:late
309
00:13:55,509 --> 00:13:57,409
between when the job finishes and this
在工作完成到此之間
310
00:13:57,610 --> 00:13:59,629
data gets uploaded especially if your
資料會上傳,特別是如果您的
311
00:13:59,830 --> 00:14:02,000
jobs doing a lot of logging then you can
做很多日誌的工作,那麼你可以
312
00:14:02,200 --> 00:14:04,039
have potentially many many gigabytes of
可能有許多千百萬位元組的
313
00:14:04,240 --> 00:14:05,779
log files and those will take some time
日誌檔,這些檔需要一些時間
314
00:14:05,980 --> 00:14:07,429
to upload so if you have a large cluster
進行上傳,如果您的集群很大
315
00:14:07,629 --> 00:14:09,019
generating lots of logging output I
生成大量的日誌輸出I
316
00:14:09,220 --> 00:14:10,639
typically wait at least five minutes
通常等待至少五分鐘
317
00:14:10,840 --> 00:14:12,799
more like ten minutes after my job is
我的工作大概是十分鐘後
318
00:14:13,000 --> 00:14:15,469
finished before I start looking for logs
在開始查找日誌之前完成
319
00:14:15,669 --> 00:14:18,039
from the job alright to summarize
從工作上總結一下
320
00:14:18,240 --> 00:14:21,289
running jobs using elastic MapReduce is
使用彈性MapReduce運行作業是
321
00:14:21,490 --> 00:14:23,659
very simple you define the jobs using
很簡單,您使用定義工作
322
00:14:23,860 --> 00:14:26,240
the AWS console your input data
AWS控制台定義作業,您的輸入資料
323
00:14:26,440 --> 00:14:31,639
and your job jar gets loaded from s3 the
並且您的作業jar從s3中載入
324
00:14:31,840 --> 00:14:33,229
results of your job which include the
您的工作結果包括
325
00:14:33,429 --> 00:14:34,969
log files get pushed back up to s3 and
日誌檔被推回至s3和
326
00:14:35,169 --> 00:14:37,429
you can use the AWS console to monitor
您可以使用AWS控制台進行監控
327
00:14:37,629 --> 00:14:40,189
the status of your job now the next
您的工作狀態現在下一個
328
00:14:40,389 --> 00:14:41,809
module we're going to go and look at the
模組,我們將去看看
329
00:14:42,009 --> 00:14:44,120
different options you have for servers
伺服器有不同的選擇
330
00:14:44,320 --> 00:14:45,740
that you can use to create your Hadoop
您可以用來創建Hadoop的
331
00:14:45,940 --> 00:14:50,940
cluster