---
tags: DS200 - Big Data
---
# Lab 3 - Sqoop and Pig
## 1. Sqoop: export data from HDFS to RDBMS
**1. Create table Employee inside database EmployeeInfo and load the content from EmployeeNoheader.csv to this table. Then query the created database to show the info of employees with salary >1500**
```bash
mysql -u root -p
```
```sql
create database EmployeeInfo;
use EmployeeInfo;
create table Employee(id int, name varchar(20), salary float, department varchar(20), age int);
```
```bash
sqoop export \
--connect jdbc:mysql://localhost/EmployeeInfo \
--username root --password cloudera \
--table Employee \
--export-dir /user/cloudera/lab3/EmployeeNoheader.csv
```
```sql
select * from Employee where salary>1500;
```
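As an optional check, the same query can be run straight from the shell with `sqoop eval` instead of opening a MySQL session (a sketch, reusing the connection string and credentials above):

```bash
# Run an ad-hoc query against MySQL through Sqoop to verify the export landed
sqoop eval \
--connect jdbc:mysql://localhost/EmployeeInfo \
--username root --password cloudera \
--query 'select * from Employee where salary > 1500'
```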
## 2. Sqoop: import RDBMS table into HDFS with Where clause
**Use Sqoop import to get id, name, and salary of employees in Employee table with salary > 2000 and save results in Employee2000plus directory. Then show the imported results.**
```bash
sqoop import \
--connect jdbc:mysql://localhost/EmployeeInfo \
--username root --password cloudera \
--table Employee \
--columns "id,name,salary" \
--where 'salary>2000' \
-m 1 \
--target-dir /user/cloudera/lab3/employeeInfo/Employee2000plus
hdfs dfs -cat lab3/employeeInfo/Employee2000plus/part*
```
## 3. Pig: load data from HDFS into Pig Relation
**Load data from file EmployeeNoheader.csv to a relation named PigEmployee and show the result**
pig
```bash
PigEmployee =
load 'lab3/EmployeeNoheader.csv'
using PigStorage(',')
as (id:int, name:chararray, salary:float, department:chararray, age:int);
dump PigEmployee;
```
## 4. Pig: Filter operation on an existing Pig relation
**Create a relation named PigEmployee1500plus to store employees with salary > 1500**
pig
```bash
PigEmployee = load 'lab3/EmployeeNoheader.csv' using PigStorage(',') as (id:int, name:chararray, salary:float, department:chararray, age:int);
dump PigEmployee;
PigEmployee1500plus = filter PigEmployee by salary > 1500;
dump PigEmployee1500plus;
```
## 5. Pig-Hive: Store data from Pig relation into Hive table
**Store data in PigEmployee1500plus to a Hive table name HiveEmployee1500plus and show the result**
hive
```sql
create table HiveEmployee1500plus (id int, name string, salary float, department string, age int);
```
pig (grunt, started with `pig -useHCatalog` so HCatStorer is available)
```bash
store PigEmployee1500plus into 'HiveEmployee1500plus' using org.apache.hive.hcatalog.pig.HCatStorer();
```
hive
```sql
select * from HiveEmployee1500plus;
```
## 6. Pig: Schema handling on an existing Pig relation
**Load data in EmployeeNoheader.csv to a relation named PigEmployeeNoschema without declaring datatype. Then from this relation, create another relation named BonusSalary with all data is same as that in PigEmployeeNoschema but salary = salary + 500**
pig(grunt)
```bash
PigEmployeeNoschema = load 'lab3/EmployeeNoheader.csv' using PigStorage(',');
describe PigEmployeeNoschema;
-- no schema was declared, so fields are referenced by position ($0, $1, ...)
BonusSalary = foreach PigEmployeeNoschema generate $0, $1, $2+500, $3, $4;
describe BonusSalary;
dump BonusSalary;
```
## 7. Hive: Load data from HDFS into Hive Internal table. Using Hue to query
**Create an internal table name InternalEmployee with the schema**
| Field | Type |
| ---------- | ------ |
| ID | int |
| Name | string |
| Salary | float |
| Department | string |
| Age | int |
**a. Load data from file EmployeeInfo.csv in HDFS into table InternalEmployee**
**b. Login to HUE and query EmpName and EmpAge of all employees with salary ≥ 2000**
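Step a assumes EmployeeInfo.csv is already in HDFS (the `load data inpath` statement below moves it from there into the table). If it is still on local disk, it could be uploaded first with something like this (the local path is an assumption, matching the one used in section 8):

```bash
# Copy the CSV from the local filesystem into the lab3 HDFS directory
hdfs dfs -put /home/cloudera/Desktop/lab3/EmployeeInfo.csv lab3/EmployeeInfo.csv
```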
hive (database ***default***)
```sql
create table InternalEmployee (ID int, Name string, Salary float, Department string, Age int)
row format delimited fields terminated by ',' tblproperties('skip.header.line.count'='1');
load data inpath 'lab3/EmployeeInfo.csv' into table InternalEmployee;
select * from InternalEmployee;
```
HUE
```sql
SELECT Name, Age
FROM internalemployee
WHERE salary >= 2000;
```
## 8. Hive: Create and load data from HDFS into Hive Dynamic Partition table
**Create a dynamic partition table named EmployeePartition to store the data from EmployeeInfo.csv file. This table is partitioned by Department.
a. Load data to EmployeePartition table
b. Open HUE and query to show all the rows of the table**
hive
```sql
create database newemployeedb;
use newemployeedb;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table InternalEmployee (ID int, Name string, Salary float, Department string, Age int)
row format delimited fields terminated by ',' tblproperties('skip.header.line.count'='1');
load data local inpath '/home/cloudera/Desktop/lab3/EmployeeInfo.csv' into table InternalEmployee;
create table EmployeePartition (ID int, Name string, Salary float, Age int)
partitioned by (Department string)
row format delimited fields terminated by ',';
describe EmployeePartition;
-- the partition column (Department) must come last in the select list
insert into EmployeePartition partition(Department) select ID, Name, Salary, Age, Department from InternalEmployee;
```
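Step b is not shown above; in HUE it could be completed with queries along these lines (`show partitions` confirms that one partition per department was created):

```sql
select * from newemployeedb.employeepartition;
show partitions newemployeedb.employeepartition;
```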