BIG DATA

Chapter 1: Introduction to Big Data

  1. What are the main characteristics of big data?
Volume
Velocity
Variety
Veracity
Value

  1. Why is distributed computing necessary for big data? [5]
  2. a) Explain, with an example, the distributed system in big data. What is the role of a data scientist?
  3. Why do we need a data analytics process?
  4. What are the current trends in big data analytics? What are the technical challenges and characteristics of big data? [10]
  5. How does big data differ from traditional data? List five distinct differences.
  6. What are the current trends in big data analytics? What are the technical challenges and characteristics of big data?
  7. What are the challenges of big data? Explain the data analytics process in terms of big data.
  8. How do distributed systems help to solve big data problems?
  9. What are the technical challenges and characteristics of big data? Who are data scientists? List their roles and skills.
  10. The data in big data warehouse is called hybrid data. Explain with suitable examples.

Chapter 2: Google File System

  1. Explain the GFS architecture. How do data and control messages flow in the GFS architecture? Explain with a suitable flow diagram. [10]
  2. How does GFS differ from other file systems? List five distinct differences.
  3. What is the main role of GFS Master during read and write processes?
  4. What are availability and fault tolerance in the Google File System?
  5. Why do we have large, fixed-size chunks in GFS? What can be the demerits of that design?
  6. Why is the single master not a bottleneck in a GFS cluster?
  7. How do data and control messages flow in the GFS architecture? Explain with a suitable flow diagram.
  8. How does GFS provide fault tolerance? How does it tolerate chunk server failures?
  9. Explain the control flow of a write mutation with a diagram. Explain the metadata stored by the GFS master.
  10. Explain the garbage collection implemented by GFS. Explain its purpose as opposed to implementing eager deletion for storage reallocation.
  11. Explain how the master implements garbage collection and detects stale replicas in GFS.
  12. Why do we have large, fixed-size chunks in GFS? What are the demerits of that design?
  13. With a diagram, explain the general architecture of the Google File System.
  14. Why do we have a single master in GFS and millions of chunk servers?
  15. A cluster contains 1500 machines, each with 500 GB of disk capacity. Calculate the approximate number of chunk servers, the number of blocks, and the total available size if the default chunk replication factor is 3 and 5, respectively.
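
A rough way to check the arithmetic in question 15 is sketched below. It is not a model answer: it assumes the 64 MB chunk size described for GFS, treats all 1500 machines as chunk servers (ignoring the master), and takes 1 GB = 1024 MB. The function name `gfs_capacity` is only illustrative.

```python
CHUNK_MB = 64     # assumption: 64 MB GFS chunk size
MACHINES = 1500   # from the question
DISK_GB = 500     # from the question


def gfs_capacity(replicas, machines=MACHINES, disk_gb=DISK_GB, chunk_mb=CHUNK_MB):
    """Estimate capacity figures for a given replication factor."""
    raw_gb = machines * disk_gb               # total raw disk space in GB
    chunk_slots = raw_gb * 1024 // chunk_mb   # chunk replicas that fit on disk
    return {
        "chunk_servers": machines,            # assumption: every machine stores chunks
        "raw_gb": raw_gb,
        "unique_chunks": chunk_slots // replicas,
        "usable_gb": raw_gb // replicas,
    }


for r in (3, 5):
    print(r, gfs_capacity(r))
```

Under these assumptions the raw capacity is 750,000 GB, which leaves about 250,000 GB (4,000,000 unique chunks) at a replication factor of 3 and about 150,000 GB (2,400,000 unique chunks) at a replication factor of 5.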

Chapter 3: MapReduce Framework

  1. MapReduce is the heart of the Hadoop ecosystem. Describe the workflow of MapReduce with suitable examples. [10]
  2. Briefly explain the data flow technique of the MapReduce framework. What are optimization and data locality in MapReduce?
  3. How is the MapReduce library designed to tolerate failures of different machines (map/reduce nodes) while executing a MapReduce job?
  4. How does MapReduce work? Explain each step with a suitable example.
  5. How does MapReduce work in a distributed fashion? Describe the parallel efficiency of MapReduce with a suitable block diagram.
  6. Write a MapReduce program to find word frequency.
  7. What is the combiner function in MapReduce? Explain its purpose with a suitable example.
  8. What is MapReduce? Explain the execution overview of MapReduce.
  9. How do you find the max and min occurrence of words in a given document? Explain.
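
For the word frequency and max/min occurrence questions above, the map, shuffle, and reduce steps can be imitated in plain Python. This is a single-process sketch for study purposes, not Hadoop code; the function names and the two sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


sample = ["Hi how are you", "How is your job"]  # illustrative input
counts = word_count(sample)
most_frequent = max(counts, key=counts.get)     # word with the highest count
lowest_count = min(counts.values())             # smallest count seen
print(counts, most_frequent, lowest_count)
```

The max and min occurrences are found with one extra pass over the reduced counts; in a real job this could be a second MapReduce stage or a single reducer.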

Chapter 4: NoSQL

  1. What are the differences between structured and unstructured data? Explain with suitable examples. [10]
  2. What are the sources of structured, semi-structured, and unstructured data in the real world?
  3. What are the differences between row-oriented and column-oriented databases? Why are HBase, Cassandra, and MongoDB called column-oriented NoSQL databases?
  4. Explain the CAP theorem and eventual consistency. Also, explain why some NoSQL databases like Cassandra sacrifice absolute consistency for absolute availability.
  5. Explain the term "NoSQL". Justify that, in a distributed scenario, normalization conflicts with data availability.
  6. What is the difference between structured and unstructured data? Explain eventual consistency and tunable consistency in the context of Cassandra.
  7. Explain the CAP theorem. Differentiate between an RDBMS and a NoSQL database.
  8. Explain the taxonomy of NoSQL databases. Explain the Cassandra database in brief.
  9. Using a MongoDB database:
    a) Create a collection named "posts" and insert the following record: title: MongoDB, description: MongoDB is a NoSQL database, by: Tom, Comments: We use MongoDB for unstructured data, likes: 100
    i) Now write a query to find the title of the post written by Tom. [3]
    ii) Write a mapReduce function to count the number of posts created by various users.
  10. What are the various components of the HBase database? Explain with a suitable block diagram. [10]
  11. Why is HBase called a column-oriented NoSQL database built on top of HDFS? What are the commands to STORE, SELECT, MODIFY, and DELETE records in an HBase table?
  12. What is Elasticsearch? What indexes are used in Elasticsearch? Explain with a suitable example. [10]
  13. What is Elasticsearch? Explain the various types of analyzers.

Chapter 5: Searching and Indexing Big Data

  1. Explain the Lucene architecture and its data indexing approach.
  2. What are the data indexing steps? Describe the components of a search application.
  3. What is Lucene? Describe the typical components involved in a search application.

Chapter 6: Hadoop

  1. Define HDFS. How does a client read data from HDFS? Explain with the help of a suitable block diagram. [10]
  2. Define DFS. How does a client write data to HDFS? Explain with the help of a suitable block diagram.
  3. a) Briefly explain the five daemons of Hadoop. What is the role of the Hadoop Distributed File System in Hadoop?
  4. How are Hadoop and GFS similar in terms of design architecture?
  5. For a Hadoop cluster with a 128 MB block size, how many mappers will Hadoop MapReduce create while running the map function on 1 GB of data? Justify with an explanation.
  6. Clock synchronization in DFS may be a big challenge. How can this clock synchronization problem be solved? [10]
  7. HBase, Cassandra, and MongoDB are called column-oriented NoSQL databases. How does a row-oriented database differ from a column-oriented database? Explain with suitable examples.
  8. Discuss the architecture of HBase in brief. Explain eventual consistency and tunable consistency in the context of Cassandra.
  9. What are the differences between row-oriented and column-oriented databases? Why are HBase, Cassandra, and MongoDB called column-oriented NoSQL databases?
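
For question 5 of this chapter (mappers for 1 GB of data with a 128 MB block size), the usual reasoning is one map task per input split. The sketch below assumes one split per block, a splittable input file, and 1 GB = 1024 MB; the helper name `num_map_tasks` is hypothetical and not part of any Hadoop API.

```python
import math


def num_map_tasks(file_size_mb, block_size_mb=128):
    """Estimate map tasks, assuming one input split per HDFS block."""
    return math.ceil(file_size_mb / block_size_mb)


print(num_map_tasks(1024))  # 1 GB of input
print(num_map_tasks(200))   # a 200 MB file; the last split is partially filled
```

Under these assumptions 1 GB of data yields 8 mappers, and a 200 MB file yields 2, because the final partial block still needs its own split.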

Short Notes

  a) Master-Slave architecture
  b) Zookeeper
  c) Client-Server architecture
  d) Hadoop MapReduce
  e) Application of big data analytics
  a) Sqoop and Flume
  b) Zookeeper
  c) Oozie
  d) Pig and Hive
  i) Elasticsearch
  ii) HBase architecture
  iii) Functional programming
  a) Shadow Master and Cluak services
  b) Analyzers available in Lucene
  c) Vertical and horizontal scalability
  a) Zookeeper and Oozie
  a) CAP theorem
  b) Role of a data scientist in big data
  c) Amazon cloud
  a) Combiner functions
  ii) Fault-tolerant systems
  iii) JSON
  iv) Unstructured data

Numerical

  1. How is a word count job performed on the following file? Explain with a flow chart.
    File.txt(file size: 200MB)
    Hi how are you
    How is your job
    How is your family
    How is your brother
    How is your sister
    What is the time now
    What is the strength of Hadoop

  2. For the following data, list the input and output of both the map and reduce functions for finding the maximum marks of each college.

    Student Name    College Name    Final Marks
    Ram             ABC             70
    Sita            ABC             80
    Hari            ABC             60
    Gita            XYZ             90
    Rita            XYZ             80
    Shyam           PQR             90
    Laxmi           PQR             70
    Gopal           PQR             60

    OR

    What is the combiner function in MapReduce? Explain its purpose with a suitable example. [10]
  1. How do you find max and min occurrence of the words in a given text document? Explain.
  2. Draw the output of MapReduce for the following lines: "big users big volume data cloud contributes bid data" "facebook has big users facebook operates big data".
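
For the maximum-marks question in this section (the student, college, and marks table), the map and reduce inputs and outputs can be traced with a small single-process sketch. It only imitates the MapReduce data flow in plain Python; the function names are illustrative assumptions, and the records are copied from the table.

```python
from collections import defaultdict

# Records from the question: (student, college, final marks)
records = [
    ("Ram", "ABC", 70), ("Sita", "ABC", 80), ("Hari", "ABC", 60),
    ("Gita", "XYZ", 90), ("Rita", "XYZ", 80),
    ("Shyam", "PQR", 90), ("Laxmi", "PQR", 70), ("Gopal", "PQR", 60),
]


def map_record(record):
    """Map: input is one row; output is a (college, marks) pair."""
    _student, college, marks = record
    return college, marks


def reduce_max(college, marks_list):
    """Reduce: input is (college, [marks...]); output is (college, max marks)."""
    return college, max(marks_list)


# Shuffle: group the mapped values by college
grouped = defaultdict(list)
for record in records:
    college, marks = map_record(record)
    grouped[college].append(marks)

max_marks = dict(reduce_max(c, m) for c, m in grouped.items())
print(dict(grouped))  # reducer input
print(max_marks)      # reducer output
```

The reducer input is ABC: [70, 80, 60], XYZ: [90, 80], PQR: [90, 70, 60], and the reducer output is ABC: 80, XYZ: 90, PQR: 90. Because taking a maximum is associative and commutative, the same reduce function could also serve as a combiner.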