Course: Intro to Big Data - IU S23
Author: Firas Jolha
In this lab, we will install VirtualBox and the Hortonworks Data Platform (HDP) Sandbox, access the platform via ssh, explore HDFS, and learn how to transfer files.

Every business is now a data business. Data is the organization's future and most valuable asset. The Hortonworks Data Platform (HDP)
is a security-rich, enterprise-ready open-source Hadoop
distribution based on a centralized architecture (YARN). Hortonworks Sandbox
is a single-node cluster and can be run as a Docker container installed on a virtual machine. HDP is a complete system to handle the processing and storage of big data. It is an open architecture used to store and process complex and large-scale data. It is composed of numerous Apache Software Foundation (ASF) projects, including Apache Hadoop, and is built specifically to meet enterprise demands. Hortonworks was a standalone company until 2019, when it merged with Cloudera; Hortonworks is now a subsidiary of Cloudera, Inc.
Hortonworks merged with Cloudera in 2019
At the beginning of this course, we aim to refresh your knowledge of relational databases before dealing with big data. In practice, we will use the relational database management system PostgreSQL. Fortunately, HDP comes with a pre-installed PostgreSQL database server.
Make sure hardware virtualization is enabled on your CPU (you can check it, e.g., via `lscpu`). Sometimes it is disabled in the BIOS. HDP 2.5.0 is based on CentOS 6.8, which reached end of life by 2021, so updating packages via `yum` is practically impossible. We recommend installing HDP 2.6.5 unless you have limited resources.
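A quick sanity check on a Linux host (the exact output format varies by distribution):

```bash
# If virtualization is enabled, lscpu reports a "Virtualization:" line
# with VT-x (Intel) or AMD-V; if it is missing, check the BIOS settings.
lscpu | grep -i virtualization
```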
Exercises
Note: Take only full-screen and high-quality screenshots.
1. Prepare a report `week1` with the results of this lab: create a new Google document with the following format.
2. Copy a file from the host machine to the cluster node using `scp` or `docker cp`. Take a screenshot of the command and the output and add it to the report.
3. Access the cluster via the port `2222` in the terminal or `4200` in the browser and create a folder `data` in the HDFS root folder (`/`). Move the file to the folder `data`. Send a copy of the folder to the local machine. Take screenshots of the commands and the output and add them to the report.
4. Submit the report `week1` to Moodle before the deadline. Check the deadline on Moodle; it is generally before the first lab of the next week.

There are two common ways to install HDP Sandbox on your PC: (A) using a hypervisor such as VirtualBox, which will pull the Docker image and run a container for your cluster, or (B) using Docker directly, where you manage resources via `docker` command-line options on Linux, or via the WSL backend on Windows, where you can configure resources in the `.wslconfig` file.
Note: For M1- and M2-based macOS users, please install CentOS 7.5 or Ubuntu 18.04 in a virtual machine, then follow approach B (using Docker): install Docker on the virtual machine and run the cluster containers there.
If you are not familiar with Docker, you can follow this approach, where configuring the cluster resources can be done via the hypervisor GUI. If you have fewer resources, we recommend using Docker instead, so you do not spend resources on running the guest virtual machine.
In this approach, you will run a virtual machine which in turn runs your cluster container, so the operating system of the virtual machine will be different from the operating system of the container (the HDP Sandbox cluster). You can verify that by checking the contents of the `/home` directory or the version of the operating system via `cat /etc/redhat-release`.
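For instance, a quick comparison (assuming the container is named `sandbox-hdp`, as in the Docker approach described below):

```bash
cat /etc/redhat-release                           # OS release of the VM
docker exec sandbox-hdp cat /etc/redhat-release   # OS release of the container
```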
We recommend VirtualBox as a hypervisor since it is supported on the most common operating systems (Linux, Windows, and macOS). Please follow the attached link in the list below to download your preferred hypervisor.
For installation instructions you can use Google, but I share here a tutorial on installing VirtualBox on Windows 11. If you have an old version of the software, we recommend updating it in order to avoid issues when installing the virtual machines. On my PC, I installed VirtualBox 7.0.4 in January 2023.
Hortonworks Data Platform (HDP) is a straightforward, pre-configured learning environment that contains the latest developments from Apache Hadoop. The Sandbox comes packaged in a virtual environment that can run in the cloud or on your personal machine. The Sandbox allows you to learn and explore HDP on your own.
You can find the download links of the Sandbox in `.ova` format for the chosen hypervisor. If you are using VirtualBox, download it from here. For VMware users, the download link is here. You can also download them from the official website, but that requires an account on the Cloudera website. I share below the download links for all available versions of HDP Sandbox.
The download links of HDP Sandbox on VirtualBox:
The download links of HDP Sandbox on VMware:
I will show here the steps to install the HDP Sandbox on VirtualBox. First of all, you need to be sure that you have installed VirtualBox and it is ready to create VMs.
Oracle VM VirtualBox
Select `File` from the top toolbar, then choose `Import Appliance...` from the drop-down list or press `Ctrl+I`. The following window appears, which allows you to specify the file to import the virtual appliance from. Here you need to select the path of the virtual appliance. The virtual appliance has the extension `.ova`.
Import Appliance window
As shown in the figure below, select the path of the `.ova` file, then press Next.
Import Appliance window
In the next window, you may need to change some settings. Make sure that you set the CPU cores to 4 and the RAM size to 8192 MB.
Appliance settings window
And wait for the appliance to be imported as shown in the figure below.
Progress of importing the virtual appliance
If you get a value of 0 for the `Base Memory` after importing the appliance (a bug in VirtualBox), please update the value as explained above and then start the virtual machine.
The first boot of HDP Sandbox takes a long time; please take a break and wait till it finishes. During this time, the virtual machine is extracting the Docker image and then starting a container for your cluster, which you can access from the host machine.
Booting HDP Sandbox
After finishing the extraction process, the system will run as shown below.
Running HDP Sandbox
After the boot operation is done, you will see the following screen, which gives the address of the splash webpage of the platform: http://localhost:1080 or http://127.0.0.1:1080 for HDP 2.6.5.
HDP Sandbox 2.6.5
Now you have finished the installation and are ready to access the cluster.
Here I will explain how to install HDP Sandbox using Docker on Windows 11. For Linux users, installing Docker can be done via the CLI; you can follow this tutorial.
You can download Docker Desktop and install it by following the instructions on the official website. The installation instructions are easy, and you can follow this tutorial.
Docker Desktop
Docker Desktop allows you to manage images and containers via a GUI. It also allows you to run the cluster without needing the CLI.
There are only two versions of HDP Sandbox on Docker Hub; HDP 2.6.5 is the one we need. First, you need to download the installation scripts from the Cloudera website:
The zip folder mainly contains `.sh` scripts which include the instructions to pull the HDP Sandbox image from Docker Hub and run the containers for your cluster.
HDP Sandbox on Docker
For Linux users it is straightforward to run `.sh` scripts, whereas Windows 10/11 users can use the `C:\Windows\System32\bash.exe` program. If you have an older version of Windows, you can install Git BASH to run `.sh` scripts.
We need to run only the script `docker-deploy-hdp265.sh` from a `bash` shell, as follows:
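```bash
# Run from the folder where the scripts were extracted; the script pulls
# the HDP 2.6.5 image from Docker Hub and starts the cluster containers.
bash docker-deploy-hdp265.sh
```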
Pulling HDP Sandbox 2.6.5 image from Docker Hub
Running the installation script will take a lot of time since it pulls about 15 GiB (HDP 2.6.5) from Docker Hub, so take a break. If you cannot see the progress of the download, you can open another terminal and run the pull instruction yourself, as follows:
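For example, pulling the image in a second terminal shows the per-layer progress (assuming the image tag `hortonworks/sandbox-hdp:2.6.5` used by the official script):

```bash
docker pull hortonworks/sandbox-hdp:2.6.5
```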
After the installation is successfully done, you will see two images on Docker Desktop and also two running containers.
HDP Sandbox images on Docker Desktop
The HDP Sandbox cluster is now running on your PC
You do these steps only the first time; afterwards you can stop/restart the cluster from Docker Desktop.
Limiting the resources of the containers is important to avoid freezing the host system due to lack of memory; by default, the Docker engine gives containers as much of the host's resources as they request. Linux users can use command-line options, whereas Windows users can configure the WSL backend (installed by default in Windows 10/11, but do not forget that you need to install a Linux distribution such as Ubuntu; see the tutorial here) or the Hyper-V backend. Here I will explain how to configure the resources of the containers if you use the WSL backend on Windows.
You need to download the `.wslconfig` file or create a new one with the same structure, as follows:
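A minimal sketch of `%USERPROFILE%\.wslconfig` (the limits are example values; adjust them to your hardware):

```ini
[wsl2]
memory=10GB     # cap the WSL 2 VM (and hence Docker) at 10 GB of RAM
processors=4    # number of virtual processors available to WSL 2
swap=2GB        # size of the swap file
```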
After every update of the `.wslconfig` file, you need to shut down WSL, wait a few seconds, then restart Docker Desktop. After that, you can start the containers to run the cluster. You can shut down WSL as follows:
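```bash
# Run in PowerShell or CMD on the Windows host:
wsl --shutdown
```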
More info about WSL configuration can be found here.
You can stop the cluster by turning off the virtual machine if you are using a hypervisor. For Docker users, you only need to stop the containers `sandbox-hdp` and `sandbox-proxy`. Likewise, you can start the cluster in Docker Desktop by starting the containers `sandbox-hdp` and `sandbox-proxy`, whereas hypervisor users just need to start the virtual machine.
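From the CLI, the equivalent is (container names as created by the install script):

```bash
docker stop sandbox-proxy sandbox-hdp    # stop the cluster
docker start sandbox-hdp sandbox-proxy   # start it again later
```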
Important: Do not delete the containers in Docker Desktop, otherwise the persisted data will be gone and you need to reinstall the cluster using installation scripts.
Whether you followed the first approach or the second one for cluster installation, you end up here. The installed HDP Sandbox cluster is a single-node implementation of HDP. It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. You can access the splash webpage of the cluster via http://localhost:1080 for HDP 2.6.5 and http://localhost:8888 for HDP 2.5.0.
HDP Sandbox splash webpage
The Quick Links button takes you to a page of links where you can access some services of the cluster.
HDP Sandbox quick links webpage
In order to see all services of the cluster, you need to access the Ambari service at http://localhost:8080, where you can monitor and manage all services.
Ambari login page
You need to log in to access this service. You can use the credentials of the user `maria_dev`/`maria_dev` (username/password). HDP Sandbox comes with 4 default users with different roles in the cluster, and there is also the Ambari Admin, who can manage the other users in the cluster.
Ambari homepage
As you can see on the Ambari homepage, most services show alerts because they have not started yet or due to some problems. You need to wait until the services start, then you can access them. If you have allocated fewer resources than required, it is likely that most services cannot run, so you can stop the services which are not needed to let the required ones run.
Note: You can reset the password of the Ambari Admin by running the command `ambari-admin-password-reset` via `ssh` as follows:
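For example (connecting as `root` on port 2222, as described in the next section):

```bash
ssh root@localhost -p 2222
# then, inside the sandbox:
ambari-admin-password-reset
```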
Resetting Ambari Admin credentials
Overview of HDP services
You can access the cluster via a web shell client (called `shell-in-a-box`) by browsing to http://localhost:4200.
The very first time, the default credentials are `root`/`hadoop`, and you will be asked to reset the password. You need to set a strong password to pass the password-reset step. For example, I use the password `hdpIU2023`.
Web Shell client for HDP Sandbox
You can also access the cluster via the `ssh` command in your preferred terminal. You need to `ssh` on the port 2222 as the root user:
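```bash
ssh root@localhost -p 2222
```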
You can access HDFS files by selecting the `Files View` on the Ambari homepage.
Ambari - Files View
You can see in the following screen the contents of HDFS on the cluster. The page allows you to upload and download files/folders between the local file system and HDFS.
HDFS on HDP Sandbox cluster
You can also access HDFS via the CLI using the command `hdfs dfs`. For example, to list the contents of the root directory `/` in HDFS, you can write the following:
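```bash
hdfs dfs -ls /
```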
The single node of the cluster runs CentOS, which has an ext4 local file system, whereas the distributed data in the cluster is stored in HDFS. You also have a local file system on the host machine. So we have multiple file systems among which data may need to be transferred. For example, in order to process data in the cluster, you need to store it in HDFS.
You can move data from the local file system of the host machine to the local file system of the cluster node by using the command `docker cp`. As an example, to move the file `C:\Users\Admin\Desktop\hello.txt` on Windows to the root directory `/` of the node, we run the following command on the host machine:
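```bash
# Run on the Windows host; sandbox-hdp is the target container.
docker cp C:\Users\Admin\Desktop\hello.txt sandbox-hdp:/
```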
where `sandbox-hdp` is the container name/ID.
In the same way, you can move the file `/hello2.txt` from the local file system of the node to the local file system of the host machine as follows (run the command on the host machine):
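```bash
# Copies /hello2.txt from the container into the current directory.
docker cp sandbox-hdp:/hello2.txt .
```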
Make sure that the containers of the cluster are running before running the commands.
Info: If you installed HDP using a hypervisor, then you can use the command `scp` to copy files directly from the source machine (where you run the command) to the local file system of the cluster node. For instance, to transfer the files in the `data` folder from the host machine to the folder `/data` in the cluster node, we run the following command in the terminal:
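A sketch, assuming the sandbox ssh port 2222 is forwarded to localhost (as in the setup above):

```bash
# -P selects the ssh port; -r copies the folder recursively.
scp -P 2222 -r ./data root@localhost:/data
```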
Note: If you get issues as shown in the figure below, you need to open the file `%USERPROFILE%/.ssh/known_hosts` and remove the previous keys for that port (you can empty the file if none of the keys are important). `scp` will then record a new host key for the cluster node.
You can move data between the local file system of the host machine and HDFS via Ambari service as explained in section Access HDFS.
To move data between HDFS and the local file system of the cluster node, you can use the `hdfs dfs` command with the appropriate options. The examples below show the usage of some options.
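A short sketch of common options (the paths are illustrative):

```bash
hdfs dfs -put /root/hello.txt /data/    # local FS of the node -> HDFS
hdfs dfs -get /data/hello.txt /root/    # HDFS -> local FS of the node
hdfs dfs -cp  /data/hello.txt /tmp/     # copy within HDFS
hdfs dfs -mv  /data/hello.txt /tmp/     # move within HDFS
hdfs dfs -mkdir /data                   # create a directory in HDFS
hdfs dfs -rm  /tmp/hello.txt            # delete a file from HDFS
```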
Docker does not read the parameters in the `.wslconfig` file on Windows.
The virtual machine does not boot up.
When you try to access the database by running the command `psql -U postgres` (as `[root@sandbox-hdp data]# psql -U postgres`), you may get the following error:
psql: FATAL: Peer authentication failed for user "postgres"
Fix: add the line `local all all trust` at the beginning of the file `/var/lib/pgsql/data/pg_hba.conf`, then restart the PostgreSQL service by running the command `systemctl restart postgresql`
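One way to script the fix (run inside the sandbox as root; `sed -i '1i ...'` inserts the rule as the first line):

```bash
sed -i '1i local all all trust' /var/lib/pgsql/data/pg_hba.conf
systemctl restart postgresql
```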
as the root user.
NB: If you have other issues, please contact the TA.
Self-check questions:
- What is Apache Ambari?
- Who is the Ambari Admin?
- What is ssh?
- What is the database `ambari` in the PostgreSQL server?
- What is the password of the `root` user in the cluster?

To find information about the Sandbox cluster, execute the following command on the cluster container:
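For example, the sandbox images ship a small helper script for this:

```bash
# Prints the sandbox build information and component versions.
sandbox-version
```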
Ambari 2.4 introduced the notion of Role-Based Access Control (RBAC) for the Ambari web interface. Ambari now includes additional cluster operation roles providing more granular division of control over the Ambari Dashboard and the various Ambari Views. Only the admin user has access to view or change these roles. You can learn more about the roles here.
Learning the Ropes of the HDP Sandbox
Getting Started with HDP Sandbox
Hortonworks - Wikipedia
IBM Analytics - Data Sheet
PostgreSQL: The World's Most Advanced Open Source Relational Database
Moving data in HDFS
Hadoop HDFS Command Cheatsheet
HDP Cluster Roles