Course: Big Data - IU S23
Author: Firas Jolha
In this stage, we will build an ML model for the dataset that we have and perform hyperparameter tuning via grid search.
The dataset is about the departments and employees in a company as well as their salary categories. It consists of two .csv files.
The file emps.csv contains information about employees:
The file depts.csv contains information about departments:
I created these csv files from the tables provided in the link.
Before starting with Spark ML, make sure that you built Hive tables and tested them via EDA in the previous stage.
In this part of the project, we will build an ML model. Here I will explain two modes for performing the analysis in HDP: the first is used for deployment and the second for development. We suggest using both of them: perform the analysis in an interactive Zeppelin note, then run the code via spark-submit after changing some configurations in the Spark session.
Performing PDA should include:
- Building the model.
- Tuning the model parameters.
- Performing predictions.
I recommend using Hive tables whose schema was created in Hive (with partitioning, bucketing, or neither), since there is a common issue in reading the schema of tables generated by Sqoop; tables whose schema is generated by Hive itself should not, I hope, cause such issues.
In case you encounter similar issues, please contact your TA.
spark-submit
You can save the code in a file model.py and run it on Spark via the spark-submit tool. Note that you need to add some jars so that Spark SQL can properly read the Hive tables. Run the file via spark-submit as follows:
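A minimal sketch of the command, assuming a YARN cluster; the jar paths are placeholders and depend on your HDP installation (check the Hive client lib directory on your cluster):

```bash
# Sketch: the jar names/paths are placeholders -- point --jars at the Hive
# client jars shipped with your HDP installation.
spark-submit --master yarn \
    --jars /usr/hdp/current/hive-client/lib/hive-metastore.jar,/usr/hdp/current/hive-client/lib/hive-exec.jar \
    model.py
```

Inside model.py, the Spark session needs Hive support enabled so that it can see the metastore. A sketch, where the database name projectdb is an assumption; adjust the table names to your own schema:

```python
from pyspark.sql import SparkSession

# Enable Hive support so that Spark SQL can read the Hive metastore.
spark = SparkSession.builder \
    .appName("BDT Project") \
    .enableHiveSupport() \
    .getOrCreate()

# Assumption: the Hive tables live in a database called 'projectdb'.
depts = spark.read.table("projectdb.depts")
emps = spark.read.table("projectdb.emps")
```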
Now we can use depts and emps as input dataframes for our model.
Possible issues with Hive will be discussed with students and added later to this document.
Zeppelin
Note that Zeppelin uses python2 as the default interpreter. I do not recommend running the application in the cluster using the spark2 interpreter since it leads to Hive exceptions.
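You can read the same Hive tables from an interactive note. A sketch, assuming a pyspark paragraph where the session object spark is provided by the interpreter, and the same hypothetical projectdb database as above:

```python
# In a Zeppelin pyspark paragraph; 'spark' is provided by the interpreter.
depts = spark.read.table("projectdb.depts")
emps = spark.read.table("projectdb.emps")
emps.show(5)
```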
Now we can use depts and emps as input dataframes for our model.
You can perform EDA here via Spark SQL too, but it is optional.
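For instance, a quick aggregation over the employees table could look like this (a sketch; column names such as deptno and sal are assumptions based on the classic emp/dept schema):

```python
# Optional EDA: average salary and headcount per department (sketch).
emps.createOrReplaceTempView("emps_view")
spark.sql("""
    SELECT deptno, AVG(sal) AS avg_sal, COUNT(*) AS n_emps
    FROM emps_view
    GROUP BY deptno
    ORDER BY deptno
""").show()
```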
Here I will show a simple example of predicting the salaries of the employees via linear regression.
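A minimal sketch of such a pipeline, including grid search over the regularization parameters via cross-validation; the feature and label column names (deptno, sal) are assumptions, so replace them with the columns of your tables:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Assemble the numeric input columns into a single feature vector.
assembler = VectorAssembler(inputCols=["deptno"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="sal")
pipeline = Pipeline(stages=[assembler, lr])

train, test = emps.randomSplit([0.8, 0.2], seed=42)

# Grid of hyperparameters to search over.
grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1, 1.0]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

evaluator = RegressionEvaluator(labelCol="sal",
                                predictionCol="prediction",
                                metricName="rmse")

# 3-fold cross-validation picks the best parameter combination.
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)

model = cv.fit(train)
predictions = model.transform(test)
print("RMSE on the test set:", evaluator.evaluate(predictions))
```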
You can export the prediction results to a csv file as follows:
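A sketch of the export; the output path and selected columns are assumptions:

```python
# Write the predictions as a single CSV file with a header row.
predictions.select("sal", "prediction") \
    .coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("lr_predictions.csv")
```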
By default, paths in Zeppelin refer to HDFS, so the file will be stored in HDFS as shown below, but you can move it to the local file system and put it in the output folder of the project repository using hdfs dfs commands.
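For example (a sketch; the paths are placeholders matching the export above):

```bash
# Merge the part file(s) from HDFS into a single local CSV inside the
# project's output folder (paths are placeholders).
hdfs dfs -getmerge lr_predictions.csv output/lr_predictions.csv
```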
We used the coalesce function to reduce the dataframe to a single partition and get only one csv file; otherwise, you could get multiple files due to the multiple partitions of the dataframe.
For other options for mode, check the documentation.
For the project, you need to complete the PDA according to the criteria in the project description.