Report I

MapReduce in HDFS (Hadoop Distributed File System)

MapReduce is a framework for the distributed computation of tasks using a large number of computers (called "nodes") that together form a cluster. Hadoop provides its own MapReduce framework, which works closely with HDFS.

Let's go deeper and look at what happens in each step.

On the Map step (the preprocessing step) the input data is collected. One of the computers (known as the master node) receives the input data of the problem, divides it into parts, and distributes those parts to the other computers (the slave nodes).

On the Reduce step all the preprocessed data is merged. The master node receives the responses from the worker nodes and combines them into the final result: the solution to the problem as it was originally formulated.
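The two steps above can be sketched in plain Python. This is an illustrative single-process simulation of the model, not Hadoop code; the function names `map_step` and `reduce_step` are my own:

```python
from collections import defaultdict

def map_step(chunks, map_fn):
    """Map step: the master has split the input into chunks; each 'slave'
    runs map_fn on its chunk and emits intermediate (key, value) pairs."""
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))
    return intermediate

def reduce_step(intermediate, reduce_fn):
    """Reduce step: intermediate pairs are grouped by key, then each
    group of values is merged into one result per key."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Toy usage: count occurrences of each character across two input chunks.
pairs = map_step(["aa", "b"], lambda line: [(ch, 1) for ch in line])
totals = reduce_step(pairs, lambda key, values: sum(values))
# totals == {"a": 2, "b": 1}
```

In a real cluster each chunk would be processed on a different node, but the contract of the two functions is the same.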

Let's look at where we can apply it.

For instance, consider input data consisting of (key, value) pairs such as:

  • city, temperature
  • student, grade
  • car, maximum speed
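For the (city, temperature) pairs above, a typical job would compute the maximum temperature observed per city. A minimal Python sketch of that job follows; the sample records are invented for illustration, and this runs in one process rather than on a cluster:

```python
from collections import defaultdict

# Hypothetical raw records: (city, temperature) pairs, which on a real
# cluster would be spread across many nodes.
records = [("Kyiv", 21), ("Lviv", 18), ("Kyiv", 25), ("Lviv", 17)]

# Map: each record is already a (key, value) pair, so map is the identity.
intermediate = [(city, temp) for city, temp in records]

# Shuffle: group the intermediate values by key.
groups = defaultdict(list)
for city, temp in intermediate:
    groups[city].append(temp)

# Reduce: take the maximum temperature per city.
max_temp = {city: max(temps) for city, temps in groups.items()}
# max_temp == {"Kyiv": 25, "Lviv": 18}
```

The (student, grade) and (car, maximum speed) examples work the same way, only the reduce function changes (e.g. average instead of maximum).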

MapReduce accomplishes this in parallel by dividing the work into independent tasks spread across many nodes (servers). This model would not scale to large clusters (hundreds or thousands of nodes) if the components shared data arbitrarily: the communication overhead required to keep the data on the nodes synchronized would be prohibitive. Instead, the data elements in MapReduce are immutable, meaning that they cannot be updated. For example, if input data is changed during a MapReduce job (e.g. modifying a student's grade or a car's maximum speed), the change is not reflected in the input files; instead, new output (key, value) pairs are generated, which Hadoop then forwards into the next phase of execution (1).

Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster (2).

Here is a typical example of MapReduce, the canonical word-count pseudocode:

  // Function used by slave nodes on the Map step
  void map(String name, String document):
      // Input data:
      // name - name of the document
      // document - contents of the document
      for each word w in document:
          EmitIntermediate(w, "1");

  // Function used by slave nodes on the Reduce step
  void reduce(String word, Iterator partialCounts):
      // Input data:
      // word - a word
      // partialCounts - grouped list of intermediate counts
      int result = 0;
      for each v in partialCounts:
          result += ParseInt(v);
      Emit(AsString(result));
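The same pseudocode can be translated into runnable Python. This is a single-process sketch for illustration only; a real Hadoop job would implement Mapper and Reducer classes against the Hadoop API, and the grouping shown inline here is what Hadoop's shuffle phase does for you:

```python
from collections import defaultdict

def map_fn(name, document):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    """Reduce step: sum the grouped partial counts for one word."""
    return (word, sum(partial_counts))

def word_count(documents):
    # Map phase over every (name, document) input pair,
    # grouping intermediate pairs by word (the shuffle).
    groups = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_fn(name, document):
            groups[word].append(count)
    # Reduce phase over every grouped word.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

counts = word_count({"doc1": "to be or not to be"})
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```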

I'm going to write a small example program called "WordCount" as a template for further work on our implementation of HDFS in the PACC project.