---
title: MapReduce (MIT 6.824 Distributed System)
tags: Distributed Systems
description:
type: slide
---
# MapReduce (From MIT 6.824 Distributed System)
<!-- Put the link to this slide here so people can follow -->
http://nil.csail.mit.edu/6.824/2021/schedule.html
---
## TL;DR
> Users specify **a map function that processes a key/value pair** to generate a set of intermediate key/value pairs, and **a reduce function that merges all intermediate values** associated with the same intermediate key.
----
## Inspired by
> Abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.
From paper [MapReduce: Simplified Data Processing on Large Clusters](http://nil.csail.mit.edu/6.824/2021/papers/mapreduce.pdf)
----
## Context
Multi-hour computations on multi-terabyte data sets
> e.g. building a search index, sorting, or analyzing the structure of the web
- only practical with 1000s of computers
- applications written by people who are not distributed-systems experts
----
## Overall goal
- easy for non-specialist programmers
- programmer just defines Map and Reduce functions
- often fairly simple sequential code
- MR takes care of, and hides, all aspects of distribution!
----
## Pseudocode
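The paper's running example is word count; here is a minimal Go sketch of it (the `KeyValue` type and `Map`/`Reduce` signatures are my own, loosely following the 6.824 lab interface, not the exact library API):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// KeyValue is an intermediate key/value pair emitted by Map.
type KeyValue struct {
	Key   string
	Value string
}

// Map takes a document and emits ("word", "1") for every word,
// mirroring the word-count example in the MapReduce paper.
func Map(document string) []KeyValue {
	words := strings.FieldsFunc(document, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := []KeyValue{}
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce merges all values for one intermediate key; for word
// count it just returns how many times the word was emitted.
func Reduce(key string, values []string) string {
	return fmt.Sprintf("%d", len(values))
}

func main() {
	// Group intermediate pairs by key (the "shuffle" the MR library does).
	groups := map[string][]string{}
	for _, kv := range Map("the quick fox and the lazy dog") {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}
	fmt.Println(Reduce("the", groups["the"])) // "the" was emitted twice, so prints 2
}
```

The programmer writes only `Map` and `Reduce`; the grouping, partitioning, and distribution in between are the library's job.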

----
## Flow

----
## Overview

----
## Notes
- At Google, a machine served as both a GFS node and a map worker at the same time, so a map task could read its input from local disk and store its intermediate files on local disk.
- GFS splits data into 64 MB chunks and replicates each chunk on three different machines for fault tolerance.
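As a quick back-of-the-envelope check of those numbers (the 1 GB file size is just an example, not from the paper):

```go
package main

import "fmt"

// chunksFor returns how many 64 MB GFS chunks a file of sizeMB
// megabytes occupies, rounding up for a partial final chunk.
func chunksFor(sizeMB int) int {
	const chunkMB = 64
	return (sizeMB + chunkMB - 1) / chunkMB
}

func main() {
	const replicas = 3 // GFS stores three copies of every chunk
	fileMB := 1024     // example: a 1 GB input file
	fmt.Println(chunksFor(fileMB), chunksFor(fileMB)*replicas) // 16 chunks, 48 stored copies
}
```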
---
# Fault Tolerance
The master/coordinator re-runs failed map/reduce tasks.
- Can a map task run twice?
- Can a reduce task run twice?
Both answers are yes: Map and Reduce are required to be pure, deterministic functions, so running the same task again on the same input yields the same output.
----
## Details of worker crash recovery
Map worker crashes:
- coordinator notices the worker no longer responds to pings
- coordinator knows which Map tasks it ran on that worker
- those tasks' intermediate output is now lost and must be re-created
- coordinator tells other workers to re-run those tasks
- re-running can be skipped if the Reduces have already fetched the intermediate data
----
## Details of worker crash recovery
Reduce worker crashes:
- finished tasks are OK -- stored in GFS, with replicas
- coordinator re-starts the worker's unfinished tasks on other workers
----
## Other failures
- Can the coordinator fail?
  - It can; if it does, the whole job must be re-run (all map and reduce tasks).
- How to deal with slow workers (stragglers)?
  - When the job is nearly done, the coordinator starts backup executions of the remaining in-progress tasks on other workers; whichever copy finishes first wins.
---
# Why do we need it?
- It's simpler, easier, and smaller: application code shrinks because the MapReduce library handles fault tolerance, distribution, and parallelization.
- Performance: easy to speed up by scaling out.
---
# Consider
Can we do MapReduce as a streaming process?
Hugely influential (Hadoop, Spark).
Probably no longer in use at Google.
Replaced by Flume / FlumeJava (see paper by Chambers et al).
GFS replaced by Colossus (no good description), and BigTable.
<style>
.reveal {
font-size: 24px;
}
</style>