
Lab 1 - Installing HDP Sandbox

Course: Intro to Big Data - IU S23
Author: Firas Jolha

Agenda

Prerequisites

  • Familiarity with terminals, shell commands and fundamental knowledge in operating Systems (Linux)
  • Basic knowledge in virtualization and managing virtual machines
  • Basic knowledge in containerization and dealing with Docker images and containers

Objectives

  • Install a hypervisor or VM monitor such as VirtualBox
  • Install Hortonworks Data Platform
  • Access Hortonworks Data Platform via ssh
  • Access HDFS and learn how to transfer files

Introduction

Every business is now a data business, and data is an organization's most valuable asset. The Hortonworks Data Platform (HDP) is a security-rich, enterprise-ready, open-source Hadoop distribution based on a centralized architecture (YARN). Hortonworks Sandbox is a single-node cluster that runs as a Docker container, optionally inside a virtual machine. HDP is a complete system for handling the processing and storage of big data: an open architecture used to store and process complex, large-scale data. It is composed of numerous Apache Software Foundation (ASF) projects, including Apache Hadoop, and is built specifically to meet enterprise demands. Hortonworks was a standalone company until 2019, when it merged with Cloudera; Hortonworks is now a subsidiary of Cloudera, Inc.


Hortonworks merged with Cloudera in 2019

At the beginning of this course, we aim to refresh your knowledge of relational databases before dealing with big data. In practice, we will use the relational database management system PostgreSQL. Fortunately, HDP comes with a pre-installed PostgreSQL database server.

Hardware requirements

  • Memory dedicated to the cluster (Minimum: 4 GiB, Recommended: 8+ GiB). More is better.
  • CPU (Minimum: 4 Cores, Recommended: 6+ Cores)
  • Storage
    • 25-35 GiB
      • for HDP 2.5.0
    • 65-75 GiB
      • for HDP 2.6.5
    • 80-100 GiB
      • for HDP 3.0.1

HDP 2.5.0 is based on CentOS 6.8, which reached end of life in 2021, so updating packages via yum is effectively no longer possible. We recommend installing HDP 2.6.5 unless your machine has too few resources for it.

Exercises

Note: Take only full-screen, high-quality screenshots.

  1. Prepare a report week1 with the results of this lab. Create a new Google document with the following format.
  2. [50 points] Install HDP on your machine. Take a full screenshot of Ambari dashboard and put it in the report.
  3. [5 points] Access HDFS. Screenshot the content of the root folder in HDFS and attach it to the report.
  4. [5 points] Download the data file from here. Add it to HDFS root folder (/) via Ambari Files View and take a screenshot of the content of the HDFS root folder and attach it to the report. Delete the file from HDFS.
  5. [15 points] Send the same data file from the local machine (the host machine where you run the cluster) to the HDFS root folder (/) via scp or docker cp. Take a screenshot of the command and the output and add it to the report.
  6. [15 points] Access the shell of the cluster node via port 2222 in the terminal or 4200 in the browser and create a folder data in HDFS root folder (/). Move the file to the folder data. Send a copy of the folder to the local machine. Take screenshots of the commands and the output and add them to the report.
  7. [5 points] Reset the password of the Ambari admin user and log in to Ambari as admin. Display the roles of the users in the cluster. Screenshot and add it to the report.
  8. [5 points] Create a new user with your name and give it cluster manager (admin) role. Access it, screenshot and add to the report. You can check the roles of HDP from here.
  9. Submit the report week1 to Moodle before the deadline. Check the deadline on Moodle. The deadline is generally before the first lab in the next week.

Installing Hortonworks Sandbox

There are two common ways to install HDP Sandbox on your PC: (A) using a hypervisor such as VirtualBox, where the virtual machine pulls the Docker image and runs a container for your cluster, or (B) using Docker directly, where you manage resources via docker command-line options on Linux, or via the WSL backend on Windows, configured through the .wslconfig file.

Note: For M1- and M2-based macOS users, please install CentOS 7.5 or Ubuntu 18.04 in a virtual machine, then follow approach B (using Docker): install Docker inside the virtual machine and run the cluster containers there.


A motivating success story

A. Using a Hypervisor

If you are not familiar with Docker, you can follow this approach, where the cluster resources are configured via the hypervisor GUI. If you have limited resources, we recommend using Docker instead, so you do not consume resources running a guest virtual machine.

In this approach, you will run a virtual machine which in turn runs your cluster container, so the operating system of the virtual machine differs from the operating system of the container (the HDP Sandbox cluster). You can verify this by checking the contents of the /home directory or the OS version via cat /etc/redhat-release.
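A quick way to see the difference is to run the version check both in the VM shell and inside the container; this is only a sketch and assumes the container name sandbox-hdp used by the official deploy scripts:

```shell
# OS of the machine you are currently on (the VM, when run in the VM shell)
cat /etc/redhat-release 2>/dev/null || echo "not a RHEL/CentOS machine"

# OS inside the cluster container (skipped gracefully if docker is absent)
if command -v docker >/dev/null 2>&1; then
  docker exec sandbox-hdp cat /etc/redhat-release || true
fi
```

The two lines printed will typically show different CentOS releases, confirming that the VM and the cluster container are separate systems.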

1. Installing a Hypervisor

We recommend VirtualBox as a hypervisor since it is available for the most common operating systems (Linux, Windows, and macOS). Please follow the attached link in the following list to download your preferred hypervisor.

For installation instructions, you can use Google, but I share here a tutorial on installing VirtualBox on Windows 11. If you have an old version of the software, we recommend updating it to avoid issues when importing virtual machines. On my PC, I installed VirtualBox 7.0.4 in January 2023.

2. Installing HDP Sandbox

Hortonworks Data Platform (HDP) is a straightforward, pre-configured, learning environment that contains the latest developments from Apache Hadoop. The Sandbox comes packaged in a virtual environment that can run in the cloud or on your personal machine. The Sandbox allows you to learn and explore HDP on your own.

2.1. Downloading the Sandbox

You can find the download links for the Sandbox in .ova format for your chosen hypervisor. If you are using VirtualBox, download it from here; for VMware users, the download link is here. You can also download them from the official website, but that requires an account on the Cloudera website. I share below the download links for all available versions of HDP Sandbox.

2.2. Installing the Sandbox

I will show here the steps to install the HDP Sandbox on VirtualBox. First, make sure that VirtualBox is installed and ready to create VMs.


Oracle VM VirtualBox

Select File from the top toolbar, then choose Import Appliance... from the drop-down list, or press Ctrl+I. The following window appears, allowing you to specify the file to import the virtual appliance from. The virtual appliance has the .ova extension.


Import Appliance window

As shown in the figure below, select the path of the .ova file, then press Next.


Import Appliance window

In the next window, you may need to change some settings. Make sure that you set the CPU cores to 4 and the RAM size to 8192 MB.


Appliance settings window

And wait for the appliance to be imported as shown in the figure below.


Progress of importing the virtual appliance

If the Base Memory shows a value of 0 after importing the appliance (a VirtualBox bug), update the value as explained above before starting the virtual machine.

2.3. Running the VM

The first boot of HDP Sandbox takes a long time, so please take a break and wait until it finishes. During this time, the virtual machine builds the Docker image and then starts a container for your cluster, which you can access from the host machine.


Booting HDP Sandbox

After finishing the extraction process, the system will run as shown below.


Running HDP Sandbox

After the boot operation is done, you will see the following screen, which gives the address of the platform's splash webpage: http://localhost:1080 or http://127.0.0.1:1080 for HDP 2.6.5.


HDP Sandbox 2.6.5

Now you have finished the installation and are ready to access the cluster.

B. Using Docker

Here I will explain how to install HDP Sandbox using Docker on Windows 11. For Linux users, Docker can be installed via the CLI by following this tutorial.

1. Installing Docker Desktop on Windows

You can download Docker Desktop and install it by following the instructions on the official website. The installation is straightforward, and you can follow this tutorial.


Docker Desktop

Docker Desktop allows you to manage images and containers via GUI. It also allows you to run the cluster without the need to use CLI.

2. Installing HDP Sandbox

There are only two versions of HDP Sandbox on Docker Hub; HDP 2.6.5 is the one we need. First, download the installation scripts from the Cloudera website:

The zip archive mainly contains .sh scripts with the instructions to pull the HDP Sandbox image from Docker Hub and run the containers for your cluster.


HDP Sandbox on Docker

For Linux users, running .sh scripts is straightforward, whereas Windows 10/11 users can use the C:\Windows\System32\bash.exe program. If you have an older version of Windows, you can install Git Bash to run .sh scripts.

We need to run only the script docker-deploy-hdp265.sh from bash shell as follows:

$ bash ./docker-deploy-hdp265.sh

Pulling HDP Sandbox 2.6.5 image from Docker Hub

Running the installation script will take a long time since it pulls 15 GiB (HDP 2.6.5) from Docker Hub, so take a break. If you cannot see the download progress, you can open another terminal and run the pull command directly:

docker pull hortonworks/sandbox-hdp:2.6.5

After the installation is successfully done, you will see two images on Docker Desktop and also two running containers.


HDP Sandbox images on Docker Desktop

The HDP Sandbox cluster is now running on your PC.

You perform these steps only the first time; afterwards you can stop/restart the cluster from Docker Desktop.

Configuring the resources for Docker

Limiting the containers' resources is important to avoid freezing the host system due to lack of memory, since the Docker engine by default gives containers as much resources as they request. Linux users can use command-line options, whereas Windows users can configure the WSL backend (installed by default on Windows 10/11, but do not forget that you also need to install a Linux distribution such as Ubuntu; see the tutorial here) or the Hyper-V backend. Here I will explain how to configure container resources when using the WSL backend on Windows.

You need to download the .wslconfig file or create a new one with the following structure:

# NOTE: This should be stored in %UserProfile% folder which is your home folder c:/Users/<Username> in Windows
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Limits VM memory to use no more than 4 GB, this can be set as whole numbers using GB or MB
memory=4GB
# Sets the VM to use four virtual processors
processors=4
# Sets amount of swap storage space to 0, default is 25% of available RAM
swap=0

After every update of the .wslconfig file, you need to shut down WSL, wait a few seconds, then restart Docker Desktop. After that, you can start the containers to run the cluster. You can shut down WSL as follows:

wsl --shutdown

More info about WSL configuration can be found here.

Stopping/Restarting the cluster

If you are using a hypervisor, you stop the cluster by shutting down the virtual machine and start it again by booting the virtual machine. For Docker users, you stop the cluster by stopping the containers sandbox-hdp and sandbox-proxy, and start it from Docker Desktop by running those same containers.
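For Docker installs, the same stop/start cycle can be done from the CLI; a minimal sketch, assuming the container names sandbox-hdp and sandbox-proxy created by the official deploy script:

```shell
# Stop and start the cluster containers (data is persisted across restarts).
# '|| true' keeps this safe to re-run when the containers do not exist yet.
if command -v docker >/dev/null 2>&1; then
  docker stop sandbox-hdp sandbox-proxy || true
  docker start sandbox-hdp sandbox-proxy || true
else
  echo "docker not found; run this on the machine hosting the cluster"
fi
```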

Important: Do not delete the containers in Docker Desktop, otherwise the persisted data will be lost and you will need to reinstall the cluster using the installation scripts.

Access HDP Sandbox cluster

Whether you followed the first approach or the second one for the cluster installation, you end up here. The installed HDP Sandbox cluster is a single-node implementation of HDP. It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. You can access the splash webpage of the cluster via http://localhost:1080 for HDP 2.6.5 and http://localhost:8888 for HDP 2.5.0.


HDP Sandbox splash webpage

The Quick Links button takes you to a page of links where you can access some services of the cluster.


HDP Sandbox quick links webpage

To see all services of the cluster, access the Ambari service at http://localhost:8080, where you can monitor and manage them.


Ambari login page

You need to log in to access this service. You can use the credentials maria_dev/maria_dev (username/password). HDP Sandbox comes with four default users with different roles in the cluster, plus an Ambari Admin who can manage the other users in the cluster.


Ambari homepage

As you can see on the Ambari homepage, most services show alerts because they have not started yet or have run into problems. Wait until the services start before accessing them. If you allocated fewer resources than required, most services probably cannot run, so stop the services you do not need to let the required ones run.

Note: You can reset the password of Ambari Admin by running the command ambari-admin-password-reset via ssh as follows:

[root@sandbox-hdp ~]# ambari-admin-password-reset

Resetting Ambari Admin credentials


Overview of HDP services

Access via SSH

You can access the cluster via a web shell client (also called Shell-in-a-Box) by opening http://localhost:4200 in your browser.

For the very first login, the default credentials are root/hadoop, and you will be asked to reset the password. You need to choose a strong password to pass the password-reset step. For example, I use the password hdpIU2023.


Web Shell client for HDP Sandbox

You can also access the cluster via the ssh command in your preferred terminal. You need to ssh to port 2222 as the root user:

ssh root@localhost -p 2222

Access HDFS

You can access HDFS files by selecting Files view in Ambari homepage.


Ambari - Files View

The following screen shows the contents of HDFS on the cluster. The page lets you upload and download files and folders between the local file system and HDFS.


HDFS on HDP Sandbox cluster

You can also access HDFS via the CLI using the hdfs dfs command. For example, to list the contents of the root directory / in HDFS:

[root@sandbox-hdp ~]# hdfs dfs -ls /

Transfer data between HDFS and local file systems

The single node of the cluster runs CentOS with an ext4 local file system, while the distributed data in the cluster is stored in HDFS. You also have a local file system on the host machine. Data may need to be transferred among these file systems; for example, to process data in the cluster, you first need to store it in HDFS.

Transfer data into and out of the cluster node

You can copy data from the local file system of the host machine to the local file system of the cluster node using the docker cp command. For example, to copy the file C:\Users\Admin\Desktop\hello.txt on Windows to the root directory / of the node, run the following command on the host machine:

docker cp "C:\Users\Admin\Desktop\hello.txt"  "sandbox-hdp:/"

sandbox-hdp is the container name/id.

In the same way, you can copy the file /hello2.txt from the node's local file system to the host machine's local file system as follows (run the command on the host machine):

docker cp "sandbox-hdp:/hello2.txt" "C:\Users\Admin\Desktop"

Make sure that the containers of the cluster are running before running the commands.

Info: If you installed HDP using a hypervisor, you can use the scp command to copy files directly from the source machine (where you run the command) to the local file system of the cluster node.
For instance, to transfer the files in data folder from the host machine to the folder /data in the cluster node, we run the following command on the terminal.

scp -P 2222 data/* root@localhost:/data

Note: If you get a host-key verification error, open the file %USERPROFILE%/.ssh/known_hosts and remove the previous keys for that port (you can empty the file if none of the keys matter to you).

On the next connection, a fresh host key for the cluster node will be recorded.
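Instead of editing known_hosts by hand, the stale entry can be removed with ssh-keygen; note that OpenSSH records a host on a non-default port as [localhost]:2222:

```shell
# Drop the old host key recorded for the sandbox's SSH port.
# Guarded so it is a no-op when there is no known_hosts file yet.
if [ -f "$HOME/.ssh/known_hosts" ]; then
  ssh-keygen -R "[localhost]:2222" || true   # ok if the entry does not exist
else
  echo "no known_hosts file; nothing to clean"
fi
```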

Transfer data into and out of HDFS

You can move data between the local file system of the host machine and HDFS via Ambari service as explained in section Access HDFS.

To move data between HDFS and the local file system of the cluster node, use the hdfs dfs command with the appropriate options. The table below shows some common options.

  Option             Usage
  -ls <path>         List the contents of an HDFS directory
  -mkdir <path>      Create a directory in HDFS
  -put <src> <dst>   Copy a local file into HDFS
  -get <src> <dst>   Copy an HDFS file to the local file system
  -mv <src> <dst>    Move a file within HDFS
  -cp <src> <dst>    Copy a file within HDFS
  -rm <path>         Delete a file in HDFS
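As a sketch of such a transfer, run on the cluster node (the file name /root/hello.txt is only an illustration; the block skips itself gracefully where hdfs is not installed):

```shell
# Move a file from the node's local file system into HDFS and back out.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /data               # create a folder in HDFS
  hdfs dfs -put /root/hello.txt /data    # node's local FS -> HDFS
  hdfs dfs -get /data/hello.txt /tmp     # HDFS -> node's local FS
else
  echo "hdfs not found; run this inside the sandbox-hdp container"
fi
```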

Common issues

  • Docker does not read the parameters in .wslconfig file on Windows.
    • Possible reasons:
      • The file is sensitive to white space and character case. If some lines have issues, the WSL backend of Docker Desktop will ignore the whole file.
    • Possible solutions:
      • Make sure that your file follows the structure that I specified here and change only values according to your limited resources.
  • The virtual machine does not boot up.

    • Possible reasons:
      • You set less resources than minimum for the CPU or the RAM.
    • Possible solutions:
      • Increase the number of cores for CPU and/or increase the allocated RAM.
  • When you try to access the database by running the command psql -U postgres as [root@sandbox-hdp data]# psql -U postgres, you may get the following error:

    psql: FATAL: Peer authentication failed for user "postgres"

    • Possible reasons:
      • The database is configured to allow access only through the credentials of the user.
    • Possible solutions:
      • You can change this configuration to allow all users to access the database without entering credentials. Add the line local all all trust at the beginning of the file /var/lib/pgsql/data/pg_hba.conf, then restart the PostgreSQL service by running systemctl restart postgresql as root.
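The pg_hba.conf edit above can be scripted; the sketch below works on a temporary copy so it is safe to try anywhere (assumes GNU sed), and on the cluster node you would target /var/lib/pgsql/data/pg_hba.conf directly and then restart the service:

```shell
# Copy the real file if present, otherwise create stand-in content off-cluster.
cp /var/lib/pgsql/data/pg_hba.conf /tmp/pg_hba.conf 2>/dev/null \
  || printf 'local all postgres peer\n' > /tmp/pg_hba.conf

# Insert the trust rule at the top of the file (line 1).
sed -i '1i local all all trust' /tmp/pg_hba.conf
head -n 1 /tmp/pg_hba.conf   # -> local all all trust
```

On the cluster node, apply the same sed line to the real path and finish with systemctl restart postgresql.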

NB: If you have other issues, please contact the TA.

Self-check questions

  1. What are the key components of Hadoop core?
  2. What is Apache Ambari?
  3. How can we import a virtual appliance in VirtualBox?
  4. How can we set the password of Ambari Admin?
  5. What is the version of Hadoop distribution in the installed cluster?
  6. What services does HDP provide?
  7. How can you upload a file from local file system to HDFS?
  8. How can you download a file from HDFS to the local file system?
  9. How can you access the cluster via ssh?
  10. How can we access the cluster via shell-in-a-box?
  11. How can we access the database ambari in PostgreSQL server?
  12. What are the default credentials of root user in the cluster?
  13. How to transfer files from the host machine to the cluster node?

Appendix

Sandbox Version

To find information about the cluster Sandbox, execute the following command on the cluster container:

sandbox-version

Ambari Roles

Ambari 2.4 introduced Role-Based Access Control (RBAC) for the Ambari web interface. Ambari now includes additional cluster operation roles providing more granular control over the Ambari dashboard and the various Ambari views. Only the admin user has access to view or change these roles. You can learn more about the roles from here.

References

Learning the Ropes of the HDP Sandbox
Getting Started with HDP Sandbox
Hortonworks - Wikipedia
IBM Analytics - Data Sheet
PostgreSQL: The World's Most Advanced Open Source Relational Database
Moving data in HDFS
Hadoop HDFS Command Cheatsheet
HDP Cluster Roles