# Installing the Hadoop Stack using Docker

Follow the instructions below to install Hadoop and other tools on Docker in a few simple steps. Make sure you do all these steps on your **host OS**, not inside a VM: Docker acts like a VM in itself, so a VM inside a VM doesn't make sense.

1. Download Git from [this website](https://git-scm.com/downloads) and install it. Just click `Next` for all steps in the installation.
2. Download and install Docker Desktop from [this website](https://www.docker.com/products/docker-desktop/). You might have to restart your PC after the installation.
3. Start the Docker Desktop application (which starts the Docker daemon in the background).

From here, you have two options: get a prebuilt image or build the image yourself.

---

## Using a prebuilt image

1. Open a new terminal on your PC, and run the command for your CPU architecture to pull the appropriate image from Docker Hub:

```bash
# x86-64 (Intel/AMD CPUs)
docker pull silicoflare/hadoop:amd

# ARM (Apple Silicon Macs, M series)
docker pull silicoflare/hadoop:arm
```

---

## Building the image yourself

1. Open a new terminal on your PC, and run the following command to clone my installation repository. Make sure you run only the command for your CPU architecture:

```bash
# x86-64 (Intel/AMD CPUs)
git clone -b amd --single-branch https://github.com/silicoflare/docker-hadoop

# ARM (Apple Silicon Macs, M series)
git clone -b arm --single-branch https://github.com/silicoflare/docker-hadoop
```

2. Navigate into the newly created directory:

```bash
cd docker-hadoop
```

3. Start the Docker build of the image. This process can take anywhere from 15 to 30 minutes depending on your internet speed. You might also have to use `sudo` if permission errors arise.

```bash
docker build -t hadoop .
```

4. Wait for the build to finish.

---

## Using the image

Once you have built or pulled the image, it is time to create a container from it. The following command creates a container, maps the required ports, and opens a terminal inside the container. Make sure you replace `SRN` in the command with your SRN in caps.

```bash
docker run -it -p 9870:9870 -p 8088:8088 -p 9864:9864 --name SRN hadoop
```

(Ports 9870, 8088 and 9864 are the Hadoop 3.x web UIs for the HDFS NameNode, the YARN ResourceManager and the DataNode, respectively.)

Once the container is created and a shell like this shows up:

```
root@6aaa78189146:/#
```

type `init` and press Enter. This stops all running processes, formats the HDFS NameNode, and starts all processes again. After everything completes, type `jps` and check that around 7 processes are listed. This completes the installation of all tools required for the Big Data course. Just type `exit` to leave the container once you are done.

From then on, to reopen the container, just open Docker Desktop, open a new terminal, and run:

```bash
docker start -ai SRN
```

Once the container's shell opens, just type `restart` to restart all processes.

# Tips to Remember

- If you get permission errors while running Docker commands, prefix them with `sudo`.
- If you get an error that says `docker daemon is not running`, make sure Docker Desktop is started and try again.
- To copy files from the current directory into the root directory of the container:

```bash
docker cp ./filename SRN:/
```

- To copy files from the container to the current directory:

```bash
docker cp SRN:/path/to/file .
```

- If you get an error that says `port already allocated`, run `docker ps`, check which containers are running, and stop them.
- If you get an error that says `request returned Internal Server Error`, your Docker build was not successful. Run the Docker build command again.
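As a worked example that ties the two `docker cp` tips together, here is one way to move a file into the container and load it into HDFS. This is a sketch with placeholder names: `data.csv` is a hypothetical file in your current directory, `/input` is an arbitrary HDFS path, and `SRN` stands for your container name.

```bash
# On the host, while the container exists (created with docker run above):
docker cp ./data.csv SRN:/

# Inside the container's shell (docker start -ai SRN):
hdfs dfs -mkdir -p /input        # create a directory in HDFS
hdfs dfs -put /data.csv /input/  # upload the copied file into HDFS
hdfs dfs -ls /input              # confirm the file arrived
```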
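One more convenience, not part of the original steps: if the container is already running (for example, in another terminal) and you want a second shell inside it, `docker exec` opens one without restarting anything. This assumes `bash` is available in the container, which is true for typical Debian/Ubuntu-based Hadoop images.

```bash
# Open an additional interactive shell in the already-running container
docker exec -it SRN bash
```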
If you still have doubts after this, just contact me by email at [suraj.b.m555@gmail.com](mailto:suraj.b.m555@gmail.com), or add a comment on this page.

---

## Testing the tools

The latest versions of the following tools are all preinstalled in this image:

- hdfs
- pig
- hbase
- hive
- flume-ng
- sqoop
- zookeeper
- spark
- kafka
- postgresql
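If you want to verify the tools yourself, the version checks below are one quick way to do it from inside the container. This is a sketch that assumes each tool's standard launcher script is on the container's `PATH`; if a particular command is missing in this image, look in that tool's installation directory instead.

```bash
# Print version info for each tool (run inside the container's shell).
hdfs version
pig -version
hbase version
hive --version
flume-ng version
sqoop version
spark-submit --version
psql --version

# ZooKeeper and Kafka are easiest to check via jps: if the image starts
# them as daemons, look for QuorumPeerMain (ZooKeeper) and Kafka there.
```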