MapReduce - HackMD

# MapReduce contributed by `ilkclord` ## Background In the past years , Google has to deal with many computation with a large scale of raw data , like crawled documents, web request logs, etc .The raw data are usually distibuted in diffrent machines ,so the computation needs to be done in a distributed system .MapReduce is the method for this problem .MapReduce aims to solve the issue like how to parallelize the computation, distribute the data, and handle failures . MapReduce will formed a libarary which solve all this issues . ## Introduction MapReduce is a method that execute a task by partition the task to many parts , and assign it to other machines to complete it .There are two steps in this method .The first step is **Map** , which distribute the samller task out ,and the machines execute the task .After all the task are done,we can do the last step ,**Reduce** , which merge the mutiple results into one results . ## Model In the map_reduce mdoel , there exists a master machine , which handle the all the task , and there are mutiple node machine which does the task , either **Map** or **Reduce** .The Mapreduce model has some features . ### Pure Funtion Since MapReduce is used in computing in distributed system , there may occurred side effects . Some of them are unexpected and dangerous .The simple way to avoid the side effect is to make the funtion pure . So the function has expected to be a pure one in MapReduce libarary . * **Map** The Map funtion that process the input data , the input data can be read by the file or in other way like reading in database and reading from virtual memory . * **Combine** The Combine funtion merge all the results , in a key / value pair .The result will be stored in an intermediate file and send to the Reduce task * **Reduce** The Reduce function has the similar purpose as Combine ,the only diffrence between them is that the Reduce stores the result into the final result in the Master machine . ### Input / Output type Since the Map fuction must be a pure function , the input and ouput type must be a key / value pair , so as the Reduce function . ### Result One important feactures of results is that it has to be sorted . The sorted result helps the merge proces faster and it will be less complicate to serch for a specific data . During the Map_reduce process , there will be lots of result being generate , but the only one that matter is the Master result .All the results will finally merge into the copies in the Master. ### Master machine A master machine has to handle all the state of the task and the id of each node machine . When the failure occured , Master machine has to handle it , and there are some way to recover from failure . * **Worker failure** To deal with worker failure ,we reset the worker into idle state and reexecute it * **Master failure** For more than one master machine program, we set a checkpoint and check it in period of time , when failure occured ,we set the state back to the checkpoint . For one Master machine program , we could only reexecute the program . * **Straggler** When we are reaching the end of compute , if there are in-progressed task , we call another back up execution and wheither the primary one or back up one are done , we mark the task completed. ### Locality In the partition of the task , we should also consider the locality ( for cloud computing ) , since the transfer of the data cost the network resources .The node machine will done the task which the partial task has the data stores on it . ### Skipping Bad Record There may have some bugs in the program , and some of them are caused by third party issue .We provide a optional mode to make fprward progress . The implement of MapReduce provides a signal for master machine to detect the failure cause by specific record . When finished a task , the node machine returns a signal that shows the number of failure during execution , if more than one failure occured on the record , the record is marked . And the record will be skipped during the next re-execution. ### Status The master must handle a status page of computation state , which indicates how many workers are in-progressed and completed .It is useful , since the master can decide whether using more resource for the program or not by the status page . In addition , the top-level status page can also shows which worker has failed , and which task they are holding . It is useful to diagnose the bug . ## Work Flow ```graphviz DIgraph G{ node[shape = box] Master -> node1_Map->node1_Combine->node12_Result Master -> node2_Map->node2_Combine->node12_Result Master -> node4_Map->node3_Combine->node34_Result Master -> node3_Map->node4_Combine->node34_Result node34_Result->node1234_Reduce node12_Result->node1234_Reduce->Master_Result rankdir = LR } ``` ## Usage Here are some problem that can be solved by using MapReduce method . ### **Distributed Grep** ### **Distributed Sort** ### Inverted Index ### Word Count To count the word in a document , we can read the file character by character , and count it when it forms a word .Keep doing the process , we can get the word-count pages of the document.But we can do this process faster ! Since the number of word can represent as a the form of ( word , count ) , MapReduce method is suitable for this problem .