---
title: Big Data. 005. Hadoop-in-Docker lab.
tags: DAM, Big Data
---

[Link in MarkDown](https://hackmd.io/@JdaXaviQ/H1_Gy_IVs)

<div style="text-align: center; width: 50%">

![](https://i.imgur.com/rsIX7fx.png)

</div>

# DAM Big Data 005: Lab: building a Hadoop cluster with Docker.

The point of a Hadoop cluster is obviously to harness the combined computing power of several physical machines, but one way to test and practise with Hadoop without needing more than one physical machine is to exploit the versatility and efficiency Docker gives us when simulating several independent physical hosts interconnected over a network.

## Getting a Dockerised Hadoop distribution.

From the [Big-data-europe GitHub repository](https://github.com/big-data-europe/docker-hadoop) we can clone a turnkey version of Hadoop on Docker that is almost ready to use; we only need to replace the docker-compose.yml configuration file with the one you will find at this [link](https://gist.github.com/themonster2015/35cf4252893cdcc831e79e76deb95cdb).

Once the ports and volumes of our containers are configured, we are ready to start our Hadoop cluster.

```bash
portatil_host:$ cd docker-hadoop
portatil_host:$ docker-compose up -d
```

With the -d flag we tell docker-compose to run in the background and leave the terminal free for further commands. If you want more information while the system boots, drop the flag, but in that case you must keep the terminal open while you work.
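The relevant part of that docker-compose.yml is the namenode service, which maps the two host ports and a persistent data volume. As an orientation only, a sketch of roughly what that service declaration looks like (the authoritative file is the one in the gist above; the volume name, mount path and `hadoop.env` file here are assumptions based on the bde2020 images' conventions):

```yaml
version: "3"
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: namenode
    ports:
      - "9870:9870"   # HDFS web UI, browsable from the host
      - "9000:9000"   # HDFS client RPC port
    volumes:
      - hadoop_namenode:/hadoop/dfs/name   # persist HDFS metadata across restarts
    env_file:
      - ./hadoop.env                       # shared Hadoop configuration
volumes:
  hadoop_namenode:
```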
Now we can check the state of the running containers with the command

```bash
portatil_host:$ docker ps
CONTAINER ID   IMAGE                                                    COMMAND                  CREATED         STATUS                        PORTS                                                                                  NAMES
6205e2377a8a   bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8     "/entrypoint.sh /run…"   2 minutes ago   Up 2 minutes (healthy)        8188/tcp                                                                               historyserver
a325390a02fa   bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8          "/entrypoint.sh /run…"   2 minutes ago   Up 2 minutes (healthy)        0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp   namenode
3fd562e23a6b   bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8          "/entrypoint.sh /run…"   2 minutes ago   Up 2 minutes (healthy)        9864/tcp                                                                               datanode
3a09ec984528   bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8   "/entrypoint.sh /run…"   2 minutes ago   Up About a minute (healthy)   8088/tcp                                                                               resourcemanager
02aa7fba2ce4   bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8       "/entrypoint.sh /run…"   2 minutes ago   Up 2 minutes (healthy)        8042/tcp                                                                               nodemanager
```

As we can see, 5 containers have been started:

1. historyserver
2. namenode [HDFS]
3. datanode
4. resourcemanager [YARN]
5. nodemanager

Besides its ports on the internal network between containers, the 'namenode' container also exposes two host ports, 9870/tcp and 9000/tcp, for both IPv4 and IPv6. This container, or _'node'_, of the cluster will be our gateway into the Hadoop system. Port 9870 serves a web UI, so we can browse it with our favourite browser.

```bash
portatil_host:$ firefox localhost:9870
```

![](https://i.imgur.com/wUa7s13.png)

## Testing our cluster.

If the system has started correctly, we can open a terminal to interact with the 'namenode':

```bash
portatil_host:$ docker exec -it namenode bash
root@a325390a02fa:/#
```

Now we can start manipulating our cluster's file system.
First we create the '/input' directory at the root of the 'namenode' local file system:

```
root@a325390a02fa:/# mkdir input
root@a325390a02fa:/# echo "Hello World!" > input/f1.txt
root@a325390a02fa:/# echo "Hello DAM!" > input/f2.txt
root@a325390a02fa:/# cat /input/f1.txt
Hello World!
root@a325390a02fa:/# cat /input/f2.txt
Hello DAM!
```

The next step is to create the 'input' directory on the distributed file system HDFS and copy both files into it.

```
root@a325390a02fa:/# hadoop fs -mkdir -p input
root@a325390a02fa:/# hdfs dfs -put ./input/* input
2022-10-26 09:21:03,991 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:21:04,086 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
```

Now we download some example programs from this [Maven repository](https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/2.7.1/hadoop-mapreduce-examples-2.7.1-sources.jar).

:triangular_flag_on_post: :eye: We must download the examples on our host, not on the namenode.

To copy the examples to the namenode, we need to find out the container's id.
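Before running the job, it helps to know what the WordCount example actually computes: it splits every input line into whitespace-separated tokens and sums the occurrences of each one. A minimal Python sketch of that map/reduce logic (not Hadoop's actual Java code) applied to the two files we just uploaded:

```python
from collections import Counter

def word_count(lines):
    """Mimic org.apache.hadoop.examples.WordCount: split each line on
    whitespace (map phase) and sum the count of each distinct token
    (reduce phase)."""
    counts = Counter()
    for line in lines:
        for word in line.split():  # map: emit one token per word
            counts[word] += 1      # reduce: sum per distinct word
    return dict(counts)

# The contents of f1.txt and f2.txt:
print(word_count(["Hello World!", "Hello DAM!"]))
# {'Hello': 2, 'World!': 1, 'DAM!': 1}
```

Note that punctuation stays attached to the word, which is why the cluster's output later shows `World!` and `DAM!` as distinct tokens.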
```
portatil_host:$ docker container ls
CONTAINER ID   IMAGE                                                    COMMAND                  CREATED          STATUS                    PORTS                                                                                  NAMES
6205e2377a8a   bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8     "/entrypoint.sh /run…"   43 minutes ago   Up 43 minutes (healthy)   8188/tcp                                                                               historyserver
a325390a02fa   bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8          "/entrypoint.sh /run…"   43 minutes ago   Up 43 minutes (healthy)   0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp   namenode
3fd562e23a6b   bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8          "/entrypoint.sh /run…"   43 minutes ago   Up 43 minutes (healthy)   9864/tcp                                                                               datanode
3a09ec984528   bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8   "/entrypoint.sh /run…"   43 minutes ago   Up 43 minutes (healthy)   8088/tcp                                                                               resourcemanager
02aa7fba2ce4   bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8       "/entrypoint.sh /run…"   43 minutes ago   Up 43 minutes (healthy)   8042/tcp                                                                               nodemanager
```

In my case: __a325390a02fa__

```
portatil_host:$ docker cp ../hadoop-mapreduce-examples-2.7.1-sources.jar a325390a02fa:hadoop-mapreduce-examples-2.7.1-sources.jar
```

(`docker cp` also accepts the container name, so `namenode:` would have worked just as well as the id.)

And now the long-awaited moment: cross your fingers and run the example.
```
root@a325390a02fa:/# hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar org.apache.hadoop.examples.WordCount input output
2022-10-26 09:38:57,303 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.19.0.3:8032
2022-10-26 09:38:57,438 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.19.0.6:10200
2022-10-26 09:38:57,613 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1666774121349_0001
2022-10-26 09:38:57,696 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,793 INFO input.FileInputFormat: Total input files to process : 2
2022-10-26 09:38:57,819 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,835 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,840 INFO mapreduce.JobSubmitter: number of splits:2
2022-10-26 09:38:57,935 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,957 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1666774121349_0001
2022-10-26 09:38:57,958 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-10-26 09:38:58,100 INFO conf.Configuration: resource-types.xml not found
2022-10-26 09:38:58,100 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-10-26 09:38:58,486 INFO impl.YarnClientImpl: Submitted application application_1666774121349_0001
2022-10-26 09:38:58,526 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1666774121349_0001/
2022-10-26 09:38:58,526 INFO mapreduce.Job: Running job: job_1666774121349_0001
2022-10-26 09:39:05,652 INFO mapreduce.Job: Job job_1666774121349_0001 running in uber mode : false
2022-10-26 09:39:05,654 INFO mapreduce.Job:  map 0% reduce 0%
2022-10-26 09:39:10,707 INFO mapreduce.Job:  map 50% reduce 0%
2022-10-26 09:39:11,716 INFO mapreduce.Job:  map 100% reduce 0%
2022-10-26 09:39:13,726 INFO mapreduce.Job:  map 100% reduce 100%
2022-10-26 09:39:14,752 INFO mapreduce.Job: Job job_1666774121349_0001 completed successfully
2022-10-26 09:39:14,852 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=53
		FILE: Number of bytes written=687982
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=240
		HDFS: Number of bytes written=24
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Rack-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=14932
		Total time spent by all reduces in occupied slots (ms)=13480
		Total time spent by all map tasks (ms)=3733
		Total time spent by all reduce tasks (ms)=1685
		Total vcore-milliseconds taken by all map tasks=3733
		Total vcore-milliseconds taken by all reduce tasks=1685
		Total megabyte-milliseconds taken by all map tasks=15290368
		Total megabyte-milliseconds taken by all reduce tasks=13803520
	Map-Reduce Framework
		Map input records=2
		Map output records=4
		Map output bytes=40
		Map output materialized bytes=72
		Input split bytes=216
		Combine input records=4
		Combine output records=4
		Reduce input groups=3
		Reduce shuffle bytes=72
		Reduce input records=4
		Reduce output records=3
		Spilled Records=8
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=123
		CPU time spent (ms)=1150
		Physical memory (bytes) snapshot=790765568
		Virtual memory (bytes) snapshot=18661777408
		Total committed heap usage (bytes)=831520768
		Peak Map Physical memory (bytes)=297639936
		Peak Map Virtual memory (bytes)=5106143232
		Peak Reduce Physical memory (bytes)=195641344
		Peak Reduce Virtual memory (bytes)=8451166208
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=24
	File Output Format Counters
		Bytes Written=24
root@a325390a02fa:/#
```

If your output looks like the one above, everything worked correctly and we can inspect the results in the output folder.

```
root@a325390a02fa:/# hdfs dfs -cat output/part-r-00000
2022-10-26 09:44:53,906 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
DAM!	1
Hello	2
World!	1
```

Yes, I know it is not very spectacular, but we have managed to run a program on a Hadoop cluster :muscle:

Now we can shut down our cluster.

```
portatil_host:$ docker-compose down
Stopping historyserver   ... done
Stopping namenode        ... done
Stopping datanode        ... done
Stopping resourcemanager ... done
Stopping nodemanager     ... done
Removing historyserver   ... done
Removing namenode        ... done
Removing datanode        ... done
Removing resourcemanager ... done
Removing nodemanager     ... done
Removing network docker-hadoop_default
```
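WordCount's result file is a series of tab-separated `word<TAB>count` lines, as the `hdfs dfs -cat` output above shows. If you ever need those counts inside a script, a small parsing sketch (the sample string below simply mirrors that output; `parse_wordcount` is a hypothetical helper, not part of Hadoop):

```python
def parse_wordcount(text):
    """Parse WordCount's tab-separated output into a {word: count} dict."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue                       # skip blank lines
        word, count = line.split("\t")     # one word<TAB>count pair per line
        counts[word] = int(count)
    return counts

# The same content as output/part-r-00000 above:
sample = "DAM!\t1\nHello\t2\nWorld!\t1"
print(parse_wordcount(sample))
# {'DAM!': 1, 'Hello': 2, 'World!': 1}
```

In a real pipeline you would fetch the file first, e.g. with `hdfs dfs -get output/part-r-00000`, and feed its contents to this function.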