## BDA Lab Assignment 4 on PIG

## Writer: Yushant Tyagi, Roll No 103

# **PIG, THE HANDLER OF BIG DATA**

---

**What Is PIG?**

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At present, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

* Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprising multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
* Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
* Extensibility. Users can create their own functions to do special-purpose processing.

**What Will You Learn in This Blog?**

Pig is something everyone interested in big data should know. As the name suggests, it takes in anything and spits out the relevant information.

![](https://i.imgur.com/yEypZSY.png)

So, our aim in this blog is to do a simple data analysis using PIG and fire some of its basic commands.

**Prerequisites**

If you plan to follow along, you need the following installed on your computer:

1. Hadoop version 3.x.x
2. An operating system, just kidding (;
3. Pig version 0.16.0. We use this version because version 0.17.0 is not yet compatible with Hadoop 3.x.x, and we would not want any errors spoiling the journey.

## Let's Start

First, start Hadoop by navigating to *HADOOP_HOME* and typing *start-all.sh* in your terminal. This starts the datanodes, namenodes, the resource manager, and the node manager. To check that everything came up, type *jps* in the terminal and look for those processes in the output.

The data we will be working on is employee data. First, create an Input directory in HDFS:

![](https://i.imgur.com/cTZF28K.png)

Confirm with the *hdfs dfs -ls /* command that the Input directory was created. You can also navigate to localhost:9870 in your favourite browser to check the same; you will see the directory named Input.

![](https://i.imgur.com/cpPavXS.png)

We created a CSV file named *employee.txt* and pushed it into the Input directory, so you can now see it in the browser inside that directory.

![](https://i.imgur.com/IEI1bVX.png)
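To recap the setup, here is a minimal sketch of the terminal session up to this point. It assumes the Hadoop *bin* and *sbin* directories are on your PATH, that your HDFS user can write to */Input*, and that *employee.txt* sits in your current local directory:

```sh
# Start all Hadoop daemons (namenodes, datanodes, resource manager, node manager)
start-all.sh

# Verify that the daemons are running
jps

# Create the Input directory in HDFS and push the employee data into it
hdfs dfs -mkdir /Input
hdfs dfs -put employee.txt /Input

# Confirm that the directory and the file exist
hdfs dfs -ls /
hdfs dfs -ls /Input
```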
## BASIC PIG COMMANDS

1. First things first, check the version of your PIG and ensure it is 0.16.0.

   ![](https://i.imgur.com/f5Z9oaW.png)

2. Start Pig by navigating to *PIG_HOME* and running the *pig* command.

3. If you run just *pig*, it opens in mapreduce mode by default.

   **Mapreduce Pig Mode**

   ![](https://i.imgur.com/utTDFaK.png)

   If you run *pig -x local*, it opens Pig in local mode.

   **Local Pig Mode**

   ![](https://i.imgur.com/N3NBXle.png)

4. Here we will work in mapreduce mode, since we need HDFS access to reach our employee.txt.

5. Run the Pig LOAD command and create an employee relation.

   **Load Data into PIG**

   ![](https://i.imgur.com/vdHpEu8.png)

6. Now we will group our data.

   **Note for Ma'am: I have switched my terminal to root mode, which is why the interface looks different from here on.**

   **Group By PIG**

   ![](https://i.imgur.com/PhGinDy.jpg)

7. Now we count the employees belonging to each department.

   ![](https://i.imgur.com/vTSoxb0.png)

8. DUMPing the data:

   ![](https://i.imgur.com/zo2ngNp.png)

9. Now you might see something like this:

   ![](https://i.imgur.com/v7cSrcB.png)

10. What you see on the screen is the processing done inside Pig, just like food being digested in a pig's stomach.

11. The final output looks like this:

    ![](https://i.imgur.com/eTZpD0L.png)

    ![](https://i.imgur.com/JsAwdhp.png)

12. The output file is saved in HDFS. This can be verified by navigating to localhost:9870.

    ![](https://i.imgur.com/9R1XLdZ.png)

## Conclusion

Thus, the simple data analysis we did shows how efficiently Pig runs over HDFS and how well it takes advantage of Hadoop for processing big data. For reference, the whole Pig Latin session is consolidated below.
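The following is a minimal sketch of the whole analysis as a single Pig Latin script. The schema (id, name, department) and the output path are assumptions made for illustration, since the actual columns of employee.txt appear only in the screenshots; adjust them to match your file.

```pig
-- Load the employee CSV from HDFS.
-- NOTE: this schema is assumed for illustration; match it to your data.
employee = LOAD '/Input/employee.txt'
           USING PigStorage(',')
           AS (id:int, name:chararray, department:chararray);

-- Group the records by department.
grouped = GROUP employee BY department;

-- Count the employees in each department.
dept_count = FOREACH grouped
             GENERATE group AS department, COUNT(employee) AS total;

-- Print the result to the console (this triggers the MapReduce job)...
DUMP dept_count;

-- ...and/or store it back into HDFS (hypothetical output path),
-- where it can be inspected at localhost:9870.
STORE dept_count INTO '/Output/employee_count' USING PigStorage(',');
```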