Introduction to Map-Reduce Programming model in IoT

 Introduction to Map-Reduce Programming model

  • MapReduce is a programming framework for distributed, parallel processing of large data sets across a cluster.
  • A MapReduce job consists of two distinct tasks: Map and Reduce.
  • Map job: a block of data is read and processed to produce key-value pairs as intermediate output.
  • The output of a Mapper, or map job (key-value pairs), is the input to the Reducer.
  • The Reducer receives key-value pairs from multiple map jobs.
  • The Reducer aggregates those intermediate key-value pairs into a smaller set of key-value pairs, which is the final output.

Parts of the Map-Reduce Programming model

PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.

Mapper –

Maps the input key/value pairs to a set of intermediate key/value pairs.

The first stage in Data Processing using MapReduce is the Mapper Class. 

Here, RecordReader processes each Input record and generates the respective key-value pair. 

The Mapper saves this intermediate data to the local disk.

Input Split -

It is the logical representation of data.

It represents the block of work handled by a single map task in the MapReduce program.

RecordReader -

It interacts with the Input Split and converts the obtained data into key-value pairs.

Reducer Class

The intermediate output generated by the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.

Driver Class 

The major component in a MapReduce job is a Driver Class. 

It is responsible for setting up a MapReduce Job to run in Hadoop. 

NameNode − Node that manages the Hadoop Distributed File System (HDFS).

DataNode − Node where data is presented in advance before any processing takes place.

MasterNode − Node where JobTracker runs and which accepts job requests from clients.

SlaveNode − Node where Map and Reduce program runs.

JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTracker.

TaskTracker − Tracks the task and reports the status to the JobTracker.

Job − An execution of a Mapper and Reducer across a dataset.

Task − An execution of a Mapper or a Reducer on a slice of data.

Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode. 

Advantages of MapReduce

The two biggest advantages of MapReduce are:

      1. Parallel Processing:

In MapReduce, the job is divided among multiple nodes, and each node works on its part of the job simultaneously.

So, MapReduce is based on the divide-and-conquer paradigm, which lets us process the data using different machines.

      2. Data Locality:

In MapReduce, data is distributed among multiple nodes, and each node processes the part of the data residing on it. This provides the following advantages:

·         It is very cost-effective to move the processing unit to the data.

·         The processing time is reduced, as all the nodes work on their part of the data in parallel.

·         Every node gets a part of the data to process, so there is little chance of a node getting overburdened.


·         MapReduce is built on the concept of key-value pairs, which provides a powerful paradigm for parallel data processing.

·         To process data in MapReduce, you must be able to map a given input and the expected output into the MapReduce paradigm; that is, both input and output need to be expressed as sets of key-value pairs.

·         A single key-value pair is also referred to as a record.

 A Map-Reduce job is divided into four simple phases: 1. Map phase, 2. Combine phase, 3. Shuffle phase, and 4. Reduce phase.

·         Each phase takes key-value pairs as input and emits one or more key-value pairs as output.


1. Map phase.
The map function operates on a single record at a time.

·         For each input key-value pair (LongWritable key, Text value), the MapReduce framework calls the map function with the key and value as arguments.

·         Here, the value is a complete line of text.

·         To count the words in a line, first split the line into words, using Java's StringTokenizer or a function of your own.

·         The output from this phase is a key-value pair in which the word is the key and the frequency 1 is the value. Each output pair is emitted using context.write(word, one). LongWritable, Text, and IntWritable are wrappers for the corresponding basic data types in Java.

·         These wrappers are provided by the MapReduce framework to handle the serialization and deserialization of key-value records.
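The map step described above can be sketched in plain Java. This is a minimal illustration with no Hadoop dependency: the class name WordCountMapSketch and the returned list are assumptions made for this example, standing in for a real Mapper and its context.write(word, one) calls, and String/Integer stand in for Text/IntWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

/** Plain-Java sketch of the word-count map step (hypothetical class, no Hadoop). */
final class WordCountMapSketch {

    /**
     * Emits one (word, 1) pair per token in the line.
     * The offset plays the role of the LongWritable key and is ignored by word count.
     */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line); // splits on whitespace
        while (tokens.hasMoreTokens()) {
            // In a Hadoop Mapper this would be context.write(word, one).
            emitted.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be"));
    }
}
```

Note that repeated words are emitted once per occurrence; adding the ones together is left to the later phases.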


2. Combine phase.
The combiner applies reducer logic early, to the output of a single map process.

·         The mapper's output is collected in an in-memory buffer.

·         The MapReduce framework sorts this buffer and runs the combiner on it, if you have provided one.

·         The combiner output is written to disk. For word count, the combiner can use the same code as the reducer.
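As a rough illustration of the combine step, the sketch below sums the counts within one mapper's output before anything is sent over the network. The class name CombinerSketch and the sample data are assumptions for this example; a TreeMap stands in for the framework's sorted buffer.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a word-count combiner (hypothetical class, no Hadoop). */
final class CombinerSketch {

    /** Sums the counts per word within a single mapper's output; TreeMap keeps keys sorted. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Sample output of one mapper for the line "to be or not to be".
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("to", 1), Map.entry("be", 1), Map.entry("or", 1),
            Map.entry("not", 1), Map.entry("to", 1), Map.entry("be", 1));
        // Six pairs shrink to four, so less data has to be shuffled.
        System.out.println(combine(mapOutput));
    }
}
```

Summing works as a combiner because addition gives the same result whether it is applied in one step or in several partial steps.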


3. Shuffle phase.
In the shuffle phase, MapReduce partitions the data and sends each partition to a reducer.

·         Each mapper sends a partition to each reducer.

·         Partitions are created by a Partitioner provided by the MapReduce framework.

·         For each key-value pair, the Partitioner decides which reducer it is sent to.

·         All the records for the same key are sent to a single reducer.
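One common way to decide the target reducer is to hash the key and take the remainder modulo the number of reducers. The sketch below is modeled on that hash-modulo idea, which is also the approach of Hadoop's default HashPartitioner; the class and method names here are hypothetical.

```java
/** Plain-Java sketch of hash partitioning (hypothetical class, no Hadoop). */
final class PartitionSketch {

    /**
     * Returns a reducer index in [0, numReducers).
     * The mask clears the sign bit so a negative hashCode cannot yield a negative index.
     */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 3; // assumed cluster setting for this example
        for (String word : new String[] {"to", "be", "or", "not", "to"}) {
            System.out.println(word + " -> reducer " + partitionFor(word, numReducers));
        }
    }
}
```

Because the result depends only on the key and the reducer count, every mapper sends a given word to the same reducer, which is what guarantees that all records for a key meet in one place.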


4. Reduce phase.
During initialization of the reduce phase, each reducer copies its input partition from the output of each mapper.

·         After copying all parts, the reducer first merges these parts and sorts all input records by key.

·         In the Reduce phase, a reduce function is executed only once for each key found in the sorted output.

·         MapReduce framework collects all the values of a key and creates a list of values.

·         The Reduce function is executed on this list of values and a corresponding key.


·         In the reduce function all the input frequencies for a word are added together to create a single result frequency.

·         This result frequency is the total frequency of the word across all input documents.


Notice that all the records for a key are sent to a single reducer, so only one reducer will output the frequency for a given word. The same word won’t be present in the output of the other reducers.
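The reduce phase can be sketched as two small steps: group the copied records by key in sorted order, then call the reduce function once per key with that key's list of values. The class ReduceSketch and the sample partition are assumptions for this example, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the word-count reduce phase (hypothetical class, no Hadoop). */
final class ReduceSketch {

    /** Merge step: collects the values of each key into a list; TreeMap iterates keys in sorted order. */
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> partition) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : partition) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce function: called once per key, adds up all frequencies for that word. */
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample records of one reducer's partition, copied from two mappers.
        List<Map.Entry<String, Integer>> partition = List.of(
            Map.entry("to", 2), Map.entry("be", 2), Map.entry("to", 1), Map.entry("be", 3));
        groupByKey(partition).forEach((word, counts) ->
            System.out.println(word + "\t" + reduce(word, counts)));
    }
}
```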


MapReduce Architecture explained in detail

  • One map task is created for each split, and it executes the map function for each record in the split.
  • It is beneficial to have multiple splits because the time taken to process one split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load balanced, since the splits are processed in parallel.
  • However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
  • For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
  • Map tasks write their output to a local disk on the respective node, not to HDFS.
  • The local disk is chosen over HDFS to avoid the replication that takes place on an HDFS store operation.
  • Map output is intermediate output, which is processed by reduce tasks to produce the final output.
  • Once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill.
  • If a node fails before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
  • Reduce tasks don't work on the concept of data locality. The output of every map task is fed to the reduce task. Map output is transferred to the machine where the reduce task is running.
  • On this machine, the output is merged and then passed to the user-defined reduce function.
  • Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes). So, writing the reduce output
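To make the split-size trade-off from the list above concrete, the arithmetic below estimates how many splits, and therefore map tasks, a file produces. It is a simplified sketch: the class name SplitCountSketch is hypothetical, and the plain ceiling division ignores details such as how Hadoop handles a small final remainder.

```java
/** Back-of-the-envelope estimate of map task count (hypothetical class, no Hadoop). */
final class SplitCountSketch {

    /** Ceiling division: a partial final chunk still needs its own split. */
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb; // a 1 GB input file, chosen for illustration

        // Block-sized splits: 16 map tasks.
        System.out.println("64 MB splits: " + numSplits(fileSize, 64 * mb));
        // Very small splits: 1024 map tasks, so task setup overhead starts to dominate.
        System.out.println("1 MB splits:  " + numSplits(fileSize, 1 * mb));
    }
}
```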

Hadoop divides the job into tasks. There are two types of tasks:

  1. Map tasks (Splits & Mapping)
  2. Reduce tasks (Shuffling, Reducing)

as mentioned above.

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:

  1. JobTracker: Acts like a master (responsible for the complete execution of a submitted job)
  2. Multiple TaskTrackers: Act like slaves, each of them performing part of the job

For every job submitted for execution in the system, there is one JobTracker that resides on the NameNode, and there are multiple TaskTrackers that reside on the DataNodes.

