The MapReduce Framework

· 3 min read
Fernando Ramirez
Software Engineer @ Microsoft

Ever wonder what MapReduce is? Neither have I, but I'm happy I learned about it!

What is MapReduce?

MapReduce is a distributed execution framework that takes away the complexity of distributed programming by exposing two processing steps that developers implement: Map and Reduce. It was conceived at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat, who described it in their paper "MapReduce: Simplified Data Processing on Large Clusters".

Map: a function that processes a key/value pair to generate a set of intermediate key/value pairs.
Reduce: a function that merges all intermediate values associated with the same intermediate key.
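
In the paper's typing, map takes a (k1, v1) pair and returns a list of intermediate (k2, v2) pairs, and reduce takes (k2, list(v2)) and returns a list of values. As a minimal sketch, those shapes could be written as Go types like this (the KeyValue, MapFunc, and ReduceFunc names are mine, not the paper's):

// A single intermediate key/value pair.
type KeyValue struct {
	Key   string
	Value string
}

// map(k1, v1) -> list(k2, v2)
type MapFunc func(key, value string) []KeyValue

// reduce(k2, list(v2)) -> list(v2)
type ReduceFunc func(key string, values []string) []string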

info

Many real-world tasks are expressible in this framework, as shown in the paper.

info

The MapReduce abstraction's two functions, Map and Reduce, were inspired by the map and reduce primitives of the functional programming language Lisp.

Example of MapReduce

The best way to understand the MapReduce framework is through a simple example. In our example, we want to count the number of occurrences of each word in some content. Using the Go programming language, the Map and Reduce functions would look similar to this (EmitIntermediate and Emit are supplied by the framework):

func Map(key string, value string) {
	// key: document name
	// value: document contents
	// strings.Fields splits the contents into words; ranging over the
	// string directly would iterate over runes, not words.
	for _, word := range strings.Fields(value) {
		EmitIntermediate(word, "1")
	}
}

func Reduce(key string, values []string) {
	// key: a word
	// values: a list of counts
	result := 0
	for _, val := range values {
		count, _ := strconv.Atoi(val)
		result += count
	}
	Emit(strconv.Itoa(result))
}

The Map function emits each word plus an associated count of occurrences (just '1' in this simple example). The Reduce function sums together all counts emitted for a particular word.

MapReduce starts from the assumption that there is an input, and that this input is split into files or chunks.
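
As a rough, hypothetical illustration of that splitting (the splitInput helper and chunkSize value are mine; the real framework splits input files into pieces of typically 16 to 64 MB), it could look something like:

package main

import "fmt"

// splitInput breaks one input into fixed-size chunks; each chunk
// becomes the "value" handed to one Map invocation. Illustrative only:
// the real framework splits by byte range and deals with records that
// straddle chunk boundaries.
func splitInput(contents string, chunkSize int) []string {
	var chunks []string
	for len(contents) > chunkSize {
		chunks = append(chunks, contents[:chunkSize])
		contents = contents[chunkSize:]
	}
	return append(chunks, contents)
}

func main() {
	fmt.Println(splitInput("c a t m a t b a t", 6))
	// Prints three chunks: "c a t ", "m a t ", "b a t"
}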

We can then use the Map function; the MapReduce framework will run Map on each of the input files:

Map("Input1", "c a t") -> {'c':1, 'a':1, 't':1}
Map("Input1", "m a t") -> {'m':1, 'a':1, 't':1}
Map("Input3", "b a t") -> {'b':1, 'a':1, 't':1}
note

Notice the parallelism here: each Map invocation is independent of the others, so they can run at the same time.

info

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
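
As a loose stand-in for that distribution, here's a sketch that runs each Map call in its own goroutine, with one process standing in for many machines (the mapWords helper and KeyValue type are mine, and the pairs are returned instead of emitted so the snippet stays self-contained):

package main

import (
	"fmt"
	"strings"
	"sync"
)

// A single intermediate key/value pair.
type KeyValue struct {
	Key, Value string
}

// mapWords is the word-count Map function from above, returning its
// pairs instead of calling EmitIntermediate.
func mapWords(value string) []KeyValue {
	var pairs []KeyValue
	for _, word := range strings.Fields(value) {
		pairs = append(pairs, KeyValue{word, "1"})
	}
	return pairs
}

func main() {
	splits := []string{"c a t", "m a t", "b a t"}
	results := make([][]KeyValue, len(splits))

	// Each goroutine stands in for a Map task on a separate worker.
	var wg sync.WaitGroup
	for i, split := range splits {
		wg.Add(1)
		go func(i int, split string) {
			defer wg.Done()
			results[i] = mapWords(split)
		}(i, split)
	}
	wg.Wait()

	for i, pairs := range results {
		fmt.Printf("split %d -> %v\n", i, pairs)
	}
}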

After running the Map function on each of our input files, we get what the paper refers to as the intermediate output:

{'c':1, 'a':1, 't':1} // input file 1 output
{'m':1, 'a':1, 't':1} // input file 2 output
{'b':1, 'a':1, 't':1} // input file 3 output

The MapReduce framework then collects the intermediate pairs from all of the Map outputs, groups them by word (key), and hands each group to a Reduce call.

info

The collection requires network communication, as these intermediate outputs typically live on different machines.
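
A minimal, single-machine sketch of that collection step, often called the shuffle (the groupByKey helper is mine, and the KeyValue type is repeated so the snippet stands alone):

package main

import "fmt"

type KeyValue struct {
	Key, Value string
}

// groupByKey gathers every intermediate value under its key, producing
// the list of counts that a single Reduce call receives. In a real
// cluster this step ships data across the network to reduce workers.
func groupByKey(pairs []KeyValue) map[string][]string {
	grouped := make(map[string][]string)
	for _, kv := range pairs {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	return grouped
}

func main() {
	intermediate := []KeyValue{
		{"c", "1"}, {"a", "1"}, {"t", "1"},
		{"m", "1"}, {"a", "1"}, {"t", "1"},
		{"b", "1"}, {"a", "1"}, {"t", "1"},
	}
	fmt.Println(groupByKey(intermediate))
	// map[a:[1 1 1] b:[1] c:[1] m:[1] t:[1 1 1]]
}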

After the collection, here's what the Reduce calls would look like:

Reduce('c', [1]) -> 1
Reduce('m', [1]) -> 1
Reduce('b', [1]) -> 1
Reduce('a', [1, 1, 1]) -> 3
Reduce('t', [1, 1, 1]) -> 3

This is what a typical MapReduce job would look like, at a high level, for this example.
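
To tie it all together, here's a hedged, single-process simulation of the whole word-count job (the driver loop and helper names are mine; a real framework runs these phases across many machines and persists the output):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

type KeyValue struct {
	Key, Value string
}

// Map phase: emit one ("word", "1") pair per word.
func mapWords(value string) []KeyValue {
	var pairs []KeyValue
	for _, word := range strings.Fields(value) {
		pairs = append(pairs, KeyValue{word, "1"})
	}
	return pairs
}

// Reduce phase: sum the counts for one word.
func reduceCounts(values []string) int {
	total := 0
	for _, v := range values {
		n, _ := strconv.Atoi(v)
		total += n
	}
	return total
}

func main() {
	inputs := []string{"c a t", "m a t", "b a t"}

	// Map over every split, then shuffle: group values by key.
	grouped := make(map[string][]string)
	for _, input := range inputs {
		for _, kv := range mapWords(input) {
			grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
		}
	}

	// Reduce each group. (Go map iteration order is randomized.)
	for word, counts := range grouped {
		fmt.Printf("%s: %d\n", word, reduceCounts(counts))
	}
}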

I highly recommend reading "MapReduce: Simplified Data Processing on Large Clusters" for more information.

Thanks for reading!