The MapReduce Framework
Ever wondered what MapReduce is? Neither had I, but I'm happy I learned about it!
What is MapReduce?
MapReduce is a distributed execution framework that takes away the complexity of distributed programming by exposing two processing steps that developers implement: Map and Reduce. It was conceived at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat, in their paper "MapReduce: Simplified Data Processing on Large Clusters".
Map: a function that processes a key/value pair to generate a set of intermediate key/value pairs.
Reduce: a function that merges all intermediate values associated with the same intermediate key.
Many real-world tasks are expressible in this framework, as shown in the paper.
The MapReduce abstraction functions, Map and Reduce, were inspired by the functional programming language Lisp.
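To make that lineage concrete, here is a minimal sketch of those two functional building blocks in Go; the generic helpers mapSlice and reduceSlice are my own illustration of the Lisp-style primitives, not part of any framework:
package main

import "fmt"

// mapSlice applies f to every element of xs, like Lisp's mapcar.
func mapSlice[T, U any](xs []T, f func(T) U) []U {
    out := make([]U, 0, len(xs))
    for _, x := range xs {
        out = append(out, f(x))
    }
    return out
}

// reduceSlice folds xs into a single value, like Lisp's reduce.
func reduceSlice[T, U any](xs []T, init U, f func(U, T) U) U {
    acc := init
    for _, x := range xs {
        acc = f(acc, x)
    }
    return acc
}

func main() {
    lengths := mapSlice([]string{"cat", "mat", "bat"}, func(s string) int { return len(s) })
    total := reduceSlice(lengths, 0, func(acc, n int) int { return acc + n })
    fmt.Println(lengths, total) // [3 3 3] 9
}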
Example of MapReduce
The best way to understand the MapReduce framework is through a simple example: counting the number of occurrences of each word in some content. Using the Go programming language, the Map and Reduce functions would look similar to this:
// EmitIntermediate and Emit are supplied by the MapReduce framework;
// strings and strconv come from the standard library.
func Map(key string, value string) {
    // key: document name
    // value: document contents
    for _, word := range strings.Fields(value) {
        EmitIntermediate(word, "1")
    }
}

func Reduce(key string, values []string) {
    // key: a word
    // values: a list of counts
    result := 0
    for _, val := range values {
        count, _ := strconv.Atoi(val)
        result += count
    }
    Emit(strconv.Itoa(result))
}
The Map function emits each word plus an associated count of occurrences (just "1" in this simple example). The Reduce function sums together all counts emitted for a particular word.
MapReduce starts by assuming there is an input, and that the input is split into files or chunks.
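As a rough, hypothetical illustration (splitInput is not part of the framework's API), chunking a document's contents might look like this; the real framework splits inputs into pieces of typically 16 to 64 MB, per the paper:
// splitInput breaks contents into chunks of at most chunkSize bytes.
// A naive stand-in: the real framework aligns splits with file-system
// blocks and avoids cutting records in half.
func splitInput(contents string, chunkSize int) []string {
    var chunks []string
    for len(contents) > chunkSize {
        chunks = append(chunks, contents[:chunkSize])
        contents = contents[chunkSize:]
    }
    return append(chunks, contents)
}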
We can then use the Map function, and the MapReduce framework will run it on each of the input files:
Map("Input1", "c a t") -> {'c':1, 'a':1, 't':1}
Map("Input1", "m a t") -> {'m':1, 'a':1, 't':1}
Map("Input3", "b a t") -> {'b':1, 'a':1, 't':1}
Notice the parallelism here: the Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
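The framework handles that distribution itself, but to get a feel for it on a single machine, here is a sketch that uses goroutines in place of worker machines; runMaps is a hypothetical helper, and mapF returns its word counts directly instead of calling EmitIntermediate:
import "sync"

// runMaps applies mapF to every split concurrently; each goroutine
// stands in for a worker machine executing one Map task.
func runMaps(splits map[string]string, mapF func(name, contents string) map[string]int) []map[string]int {
    results := make(chan map[string]int, len(splits))
    var wg sync.WaitGroup
    for name, contents := range splits {
        wg.Add(1)
        go func(name, contents string) {
            defer wg.Done()
            results <- mapF(name, contents)
        }(name, contents)
    }
    wg.Wait()
    close(results)
    var outputs []map[string]int
    for out := range results {
        outputs = append(outputs, out)
    }
    return outputs
}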
After running the Map function on each of our input files, we get what the paper refers to as the intermediate output:
{'c':1, 'a':1, 't':1} // input file 1 output
{'m':1, 'a':1, 't':1} // input file 2 output
{'b':1, 'a':1, 't':1} // input file 3 output
The MapReduce framework collects together all of the intermediate values emitted for each word (key) and hands them to a Reduce function. This collection step, commonly called the shuffle, requires network communication, as the intermediate outputs typically live on different machines.
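On a single machine we can stand in for that shuffle with a plain grouping step; groupByKey below is a hypothetical helper that mimics it in memory:
// groupByKey collects every count emitted under the same word,
// mimicking the shuffle that routes intermediate pairs to Reduce workers.
func groupByKey(outputs []map[string]int) map[string][]int {
    grouped := make(map[string][]int)
    for _, out := range outputs {
        for word, count := range out {
            grouped[word] = append(grouped[word], count)
        }
    }
    return grouped
}
With our three inputs, grouped["a"] would hold [1, 1, 1], which is exactly the argument the Reduce call for 'a' receives below.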
After the collection, here's what the Reduce function calls would look like:
Reduce('c', [1]) -> 1
Reduce('m', [1]) -> 1
Reduce('b', [1]) -> 1
Reduce('a', [1, 1, 1]) -> 3
Reduce('t', [1, 1, 1]) -> 3
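To tie the walkthrough together, here is a minimal, single-process sketch of the whole word-count job; it swaps the distributed machinery for plain loops, so only the Map/Reduce shape matches the real framework:
package main

import (
    "fmt"
    "strings"
)

func main() {
    splits := map[string]string{
        "Input1": "c a t",
        "Input2": "m a t",
        "Input3": "b a t",
    }

    // Map phase: emit (word, 1) for every word in every split,
    // grouping by word as we go (the shuffle, collapsed into the loop).
    intermediate := make(map[string][]int)
    for _, contents := range splits {
        for _, word := range strings.Fields(contents) {
            intermediate[word] = append(intermediate[word], 1)
        }
    }

    // Reduce phase: sum the counts for each word.
    // (Go map iteration order is random, so output order varies.)
    for word, counts := range intermediate {
        total := 0
        for _, c := range counts {
            total += c
        }
        fmt.Printf("%s: %d\n", word, total) // e.g. a: 3, t: 3, c: 1, ...
    }
}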
This is what a typical MapReduce job looks like at a high level for this example.
I highly recommend reading "MapReduce: Simplified Data Processing on Large Clusters" for more information.
Thanks for reading!