---
title: Big Data. 005. Hadoop Lab on Docker.
tags: DAM, Big Data
---
[Link in Markdown](https://hackmd.io/@JdaXaviQ/H1_Gy_IVs)
# DAM Big Data 005: Lab: deploying a Hadoop cluster with Docker.
Obviously, the goal of a Hadoop cluster is to harness the combined computing power of several physical machines. However, one way to test and practice with Hadoop without needing more than one physical machine is to take advantage of the versatility and efficiency Docker gives us for simulating several independent machines interconnected over a network.
## Getting a dockerized Hadoop distribution.
From the [Big-data-europe GitHub repository](https://github.com/big-data-europe/docker-hadoop) we can clone a 'turnkey' version of Hadoop on Docker that is almost ready to use; we only need to replace the docker-compose.yml configuration file with the one you will find at this [link](https://gist.github.com/themonster2015/35cf4252893cdcc831e79e76deb95cdb). Once the ports and volumes of our containers are configured, we are ready to start our Hadoop cluster.
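For orientation, a service entry in that docker-compose.yml looks roughly like the sketch below. The image tag, volume path and `hadoop.env` file follow the big-data-europe repository's conventions, but treat the exact values as an assumption and defer to the linked file:

```yaml
version: "3"
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: namenode
    ports:
      - "9870:9870"   # NameNode web UI
      - "9000:9000"   # HDFS filesystem RPC
    volumes:
      - hadoop_namenode:/hadoop/dfs/name   # persist HDFS metadata
    env_file:
      - ./hadoop.env                       # shared Hadoop configuration
volumes:
  hadoop_namenode:
```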
```bash
portatil_host:$ cd docker-hadoop
portatil_host:$ docker-compose up -d
```
With the -d flag we tell docker-compose to run in the background and leave the terminal free for further commands. If you want more information while the system is starting up, omit the flag, but then you must keep the terminal open while you work.
Now we can check the status of the running containers with the command
```bash
portatil_host:$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6205e2377a8a bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 2 minutes ago Up 2 minutes (healthy) 8188/tcp historyserver
a325390a02fa bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 2 minutes ago Up 2 minutes (healthy) 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp namenode
3fd562e23a6b bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 2 minutes ago Up 2 minutes (healthy) 9864/tcp datanode
3a09ec984528 bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 2 minutes ago Up About a minute (healthy) 8088/tcp resourcemanager
02aa7fba2ce4 bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 2 minutes ago Up 2 minutes (healthy) 8042/tcp nodemanager
```
As we can see, 5 containers have been started:
1. historyserver
2. namenode [HDFS]
3. datanode
4. resourcemanager [YARN]
5. nodemanager
The 'namenode' container, besides its ports on the internal container network, also exposes two ports on the host: 9870/tcp and 9000/tcp, on both IPv4 and IPv6. This container, or _'node'_, of the cluster will be our gateway into the Hadoop system. Port 9870 serves the NameNode web UI, so we can open it in our favourite browser.
```bash
portatil_host:$ firefox localhost:9870
```

## Testing our cluster.
If the system has started correctly, we can open a terminal to interact with the 'namenode':
```bash
portatil_host:$ docker exec -it namenode bash
root@a325390a02fa:/#
```
Now we can start manipulating our cluster's filesystems. First we will create the directory '/input' at the root of the namenode's local filesystem:
```bash
root@a325390a02fa:/# mkdir input
root@a325390a02fa:/# echo "Hello World!" > input/f1.txt
root@a325390a02fa:/# echo "Hello DAM!" > input/f2.txt
root@a325390a02fa:/# cat /input/f1.txt
Hello World!
root@a325390a02fa:/# cat /input/f2.txt
Hello DAM!
```
The next step is to create the 'input' directory on the HDFS distributed filesystem and copy both files into it. Note that relative HDFS paths like 'input' resolve under the current user's HDFS home directory (here, /user/root).
```bash
root@a325390a02fa:/# hadoop fs -mkdir -p input
root@a325390a02fa:/# hdfs dfs -put ./input/* input
2022-10-26 09:21:03,991 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:21:04,086 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
```
Next, we will download some example programs from the following [Maven repository](https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/2.7.1/hadoop-mapreduce-examples-2.7.1-sources.jar).
:triangular_flag_on_post: :eye: We must download the examples from our host, not from the namenode.
To copy the examples into the namenode, we need to find out the container's ID.
```bash
portatil_host:$ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6205e2377a8a bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 43 minutes ago Up 43 minutes (healthy) 8188/tcp historyserver
a325390a02fa bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 43 minutes ago Up 43 minutes (healthy) 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp namenode
3fd562e23a6b bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 43 minutes ago Up 43 minutes (healthy) 9864/tcp datanode
3a09ec984528 bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 43 minutes ago Up 43 minutes (healthy) 8088/tcp resourcemanager
02aa7fba2ce4 bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 43 minutes ago Up 43 minutes (healthy) 8042/tcp nodemanager
```
In my case: __a325390a02fa__ (since docker-compose also assigned the container the name 'namenode', that name would work just as well).
```bash
portatil_host:$ docker cp ../hadoop-mapreduce-examples-2.7.1-sources.jar a325390a02fa:hadoop-mapreduce-examples-2.7.1-sources.jar
```
And now the long-awaited moment: cross your fingers and run the example.
```bash
root@a325390a02fa:/# hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar org.apache.hadoop.examples.WordCount input output
2022-10-26 09:38:57,303 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.19.0.3:8032
2022-10-26 09:38:57,438 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.19.0.6:10200
2022-10-26 09:38:57,613 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1666774121349_0001
2022-10-26 09:38:57,696 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,793 INFO input.FileInputFormat: Total input files to process : 2
2022-10-26 09:38:57,819 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,835 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,840 INFO mapreduce.JobSubmitter: number of splits:2
2022-10-26 09:38:57,935 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-10-26 09:38:57,957 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1666774121349_0001
2022-10-26 09:38:57,958 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-10-26 09:38:58,100 INFO conf.Configuration: resource-types.xml not found
2022-10-26 09:38:58,100 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-10-26 09:38:58,486 INFO impl.YarnClientImpl: Submitted application application_1666774121349_0001
2022-10-26 09:38:58,526 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1666774121349_0001/
2022-10-26 09:38:58,526 INFO mapreduce.Job: Running job: job_1666774121349_0001
2022-10-26 09:39:05,652 INFO mapreduce.Job: Job job_1666774121349_0001 running in uber mode : false
2022-10-26 09:39:05,654 INFO mapreduce.Job: map 0% reduce 0%
2022-10-26 09:39:10,707 INFO mapreduce.Job: map 50% reduce 0%
2022-10-26 09:39:11,716 INFO mapreduce.Job: map 100% reduce 0%
2022-10-26 09:39:13,726 INFO mapreduce.Job: map 100% reduce 100%
2022-10-26 09:39:14,752 INFO mapreduce.Job: Job job_1666774121349_0001 completed successfully
2022-10-26 09:39:14,852 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=53
        FILE: Number of bytes written=687982
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=240
        HDFS: Number of bytes written=24
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=14932
        Total time spent by all reduces in occupied slots (ms)=13480
        Total time spent by all map tasks (ms)=3733
        Total time spent by all reduce tasks (ms)=1685
        Total vcore-milliseconds taken by all map tasks=3733
        Total vcore-milliseconds taken by all reduce tasks=1685
        Total megabyte-milliseconds taken by all map tasks=15290368
        Total megabyte-milliseconds taken by all reduce tasks=13803520
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=40
        Map output materialized bytes=72
        Input split bytes=216
        Combine input records=4
        Combine output records=4
        Reduce input groups=3
        Reduce shuffle bytes=72
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=123
        CPU time spent (ms)=1150
        Physical memory (bytes) snapshot=790765568
        Virtual memory (bytes) snapshot=18661777408
        Total committed heap usage (bytes)=831520768
        Peak Map Physical memory (bytes)=297639936
        Peak Map Virtual memory (bytes)=5106143232
        Peak Reduce Physical memory (bytes)=195641344
        Peak Reduce Virtual memory (bytes)=8451166208
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=24
    File Output Format Counters
        Bytes Written=24
root@a325390a02fa:/#
```
If your output looks like the one above, everything worked correctly and we can inspect the results in the 'output' directory.
```bash
root@a325390a02fa:/# hdfs dfs -cat output/part-r-00000
2022-10-26 09:44:53,906 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
DAM! 1
Hello 2
World! 1
```
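As a sanity check of what WordCount just did, its map/shuffle/reduce phases can be mimicked locally with a classic Unix pipeline: the "map" step emits one word per line, the "shuffle" is a sort that groups equal keys, and the "reduce" is a count per group. A minimal sketch that needs no Hadoop at all (the /tmp/wc-demo scratch directory is hypothetical):

```bash
#!/bin/sh
# Recreate the lab's two input files in a local scratch directory.
mkdir -p /tmp/wc-demo && cd /tmp/wc-demo
echo "Hello World!" > f1.txt
echo "Hello DAM!" > f2.txt

# map: one word per line | shuffle: sort groups identical keys | reduce: count
cat f1.txt f2.txt | tr -s ' ' '\n' | sort | uniq -c
```

The counts match the part-r-00000 output above (1 DAM!, 2 Hello, 1 World!), modulo uniq's column formatting; of course, the pipeline runs on one machine, while Hadoop distributes each phase across the cluster's nodes.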
Yes, I know it's not very spectacular, but we have managed to run a program on a Hadoop cluster :muscle:
Now we can shut down our cluster.
```bash
portatil_host:$ docker-compose down
Stopping historyserver ... done
Stopping namenode ... done
Stopping datanode ... done
Stopping resourcemanager ... done
Stopping nodemanager ... done
Removing historyserver ... done
Removing namenode ... done
Removing datanode ... done
Removing resourcemanager ... done
Removing nodemanager ... done
Removing network docker-hadoop_default
```