MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.[10]

The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and the fault-tolerance features of the MapReduce framework come into play.

MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead.

While this process often appears inefficient compared to algorithms that are more sequential (because multiple instances of the reduction process must be run), MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[14]

The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data are still available.

Similarly, the shuffle step can sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs.
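For illustration, a minimal single-process sketch of the canonical word-count task is shown below; the function names and the in-memory driver are assumptions for this example rather than part of any particular MapReduce framework's API.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: a document name, value: the document text; emit (word, 1) per word.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: every count emitted for that word across all maps.
    yield key, sum(values)

# Toy driver standing in for the distributed runtime: group map output by key,
# then hand each group of values to the reduce function.
intermediate = defaultdict(list)
for k, v in map_fn("doc1", "to be or not to be"):
    intermediate[k].append(v)
counts = dict(pair for k, vs in intermediate.items() for pair in reduce_fn(k, vs))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```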

Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases.
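A common (though not mandated) way to make that connection is to hash-partition the intermediate keys so that all values for a given key reach the same Reduce process. The sketch below approximates that routing in a single process; a distributed runtime would move the resulting buckets over the network between machines.

```python
import hashlib

def partition(key, num_reducers):
    # Deterministically assign a key to a reducer (Python's built-in hash()
    # is randomized per process, so a stable digest is used instead).
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

def shuffle(mapped_pairs, num_reducers):
    # Route each intermediate (key, value) pair to its reducer's bucket,
    # grouping values by key so each reducer sees them together.
    buckets = [{} for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)].setdefault(key, []).append(value)
    return buckets
```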

As another example, imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age.

The MapReduce system would then line up the 96 Reduce processors, shuffling the key/value pairs by age (because the average is needed per age) and providing each reducer with its millions of corresponding input records.
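A single-machine sketch of this computation might look as follows; the record layout and the field names (age, contacts) are assumed for illustration, and the in-memory grouping stands in for the distributed shuffle.

```python
from collections import defaultdict

def map_person(record):
    # Emit the person's age as the key and their contact count as the value.
    yield record["age"], record["contacts"]

def reduce_average(age, contact_counts):
    # After the shuffle, all contact counts for one age arrive together.
    yield age, sum(contact_counts) / len(contact_counts)

people = [
    {"age": 25, "contacts": 150},
    {"age": 25, "contacts": 90},
    {"age": 40, "contacts": 60},
]

by_age = defaultdict(list)  # stands in for the shuffle
for person in people:
    for age, contacts in map_person(person):
        by_age[age].append(contacts)

for age, counts in sorted(by_age.items()):
    print(next(reduce_average(age, counts)))  # (25, 120.0) then (40, 60.0)
```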

The input reader reads data from stable storage (typically, a distributed file system) and generates key/value pairs.
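As an illustrative stand-in, the reader below scans a local text file and emits (byte offset, line) pairs, in the spirit of line-oriented input formats; an actual reader would instead process a split of a file held in a distributed file system.

```python
def read_lines(path):
    # Emit (byte offset, line) pairs for each line of a local text file.
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            yield offset, raw.decode("utf-8").rstrip("\r\n")
            offset += len(raw)
```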

From a basic requirements point of view, any MapReduce operation must involve the ability to arbitrarily regroup data being reduced.

Additional modules such as the Combiner function can help to reduce the amount of data written to disk and transmitted over the network.[22]
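For the word-count task, for example, a combiner can run a reduce-like aggregation on each map task's local output so that many (word, 1) pairs collapse into a single (word, n) pair before anything is written or shuffled; the sketch below assumes that task and relies on addition being associative and commutative.

```python
from collections import Counter

def combine(local_pairs):
    # Pre-aggregate one map task's (word, 1) pairs locally so that far
    # fewer pairs have to be written to disk and sent over the network.
    combined = Counter()
    for word, count in local_pairs:
        combined[word] += count
    return list(combined.items())

# combine([("be", 1), ("be", 1), ("to", 1)]) -> [('be', 2), ('to', 1)]
```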

When designing a MapReduce algorithm, the author needs to choose a good tradeoff[11] between the computation and the communication costs.

When tuning the performance of MapReduce, the complexity of the mapping, shuffling, sorting (grouping by key), and reducing steps has to be taken into account.

The amount of data produced by the mappers is a key parameter that shifts the bulk of the computation cost between mapping and reducing.[23]

For processes that complete quickly, and where the data fits into the main memory of a single machine or a small cluster, using a MapReduce framework is usually not effective.

Since these frameworks are designed to recover from the loss of whole nodes during the computation, they write interim results to distributed storage.

A task that completes in seconds can just be restarted in the case of an error, and the likelihood of at least one machine failing grows quickly with the cluster size.

MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, Singular Value Decomposition,[24] web access log stats, inverted index construction, document clustering, machine learning,[25] and statistical machine translation.

Development at Google has since moved on to technologies such as Percolator, FlumeJava[36] and MillWheel that offer streaming operation and updates instead of batch processing, to allow integrating "live" search results without rebuilding the complete index.[35]

David DeWitt and Michael Stonebraker, computer scientists specializing in parallel databases and shared-nothing architectures, have been critical of the breadth of problems that MapReduce can be used for.[39]

They challenged the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades.[39]

"[39] MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive,[40] HBase[41] and Bigtable[41][42] are addressing some of these problems.

Jorgensen asserts that DeWitt and Stonebraker's entire analysis is groundless, as MapReduce was never designed or intended to be used as a database.

DeWitt and Stonebraker subsequently published a detailed benchmark study in 2009 comparing the performance of Hadoop's MapReduce and RDBMS approaches on several specific problems.

In Ars Technica, an editor acknowledged Google's role in popularizing the MapReduce concept, but questioned whether Google's MapReduce patent was valid or novel.