MapReduce

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


What is MapReduce?

https://www.youtube.com/watch?v=b-IvmXoO0bU

Why MapReduce?

Prior to 2004, huge datasets were stored on individual servers. If a program ran a query whose data was spread across multiple servers, logically integrating those search results and analyzing the data was a nightmare, not to mention the massive effort and expense involved.

The threat of data loss, the challenge of data backup, and limited scalability caused these issues to snowball into a crisis of sorts.

To counter this, Google introduced MapReduce in December 2004. Analyses of datasets that had taken 8-10 days could now be done in about 10 minutes: queries ran on multiple servers simultaneously, search results were logically integrated, and the data could be analyzed in real time.

The USP of MapReduce is its fault tolerance and scalability.

 

MapReduce Analogy

MapReduce can be illustrated by vote counting after an election.


In Step 1: the ballot papers at each polling booth are handed to a teller. This pre-MapReduce step is called input splitting.

In Step 2: the tellers at all booths count their ballot papers in parallel. Because multiple tellers work on a single job, execution finishes faster. This is called the Map method.

In Step 3: the ballot counts from each booth are combined per assembly and parliamentary seat, producing the total count for each candidate. This is known as the Reduce method.

Thus Map and Reduce together complete the work faster than a single counter could.


Word count example:

The job is to produce a word count for a given paragraph.

The MapReduce process begins with an input phase, which produces the data on which the MapReduce process operates.

Splitting: converting the job into a number of tasks.

Mapping: the mapping phase generates key-value pairs. Since this task is about counting words, each sentence is split into words (using substring methods to extract words from lines). The mapping phase converts each word into a key and allots a default value of "1" to every key, that is, to each word in the sentence.

Shuffling: the shuffling phase sorts the data by key; the words are arranged in ascending order.

Reducing: the last phase is the reducing phase. Here the data is reduced by merging repeated keys: whenever a duplicate word appears, the value of its key is incremented. The words "dog" and "a" are repeated, so the reducer removes the duplicate keys and increases the occurrence count of the remaining key. This is how the MapReduce operation is performed.
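The four phases above can be sketched in plain Python. This is a toy, single-process illustration (the sample paragraph is made up); a real MapReduce framework distributes these same steps across many machines:

```python
from collections import defaultdict

paragraph = "a dog chased a cat and the dog barked"

# Splitting: break the job into tasks (one split per chunk of input;
# here the tiny input is a single split).
splits = [paragraph]

# Mapping: emit a (word, 1) key-value pair for every word.
mapped = []
for split in splits:
    for word in split.split():
        mapped.append((word, 1))

# Shuffling: sort the pairs so equal keys sit next to each other.
shuffled = sorted(mapped)

# Reducing: merge duplicate keys by summing their values.
counts = defaultdict(int)
for word, value in shuffled:
    counts[word] += value

print(dict(counts))
```

As in the example above, the repeated words "dog" and "a" each end up as a single key whose value is the number of occurrences.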

Map Execution Phases:

Map execution consists of five phases:

Map Phase: the assigned input split is read from HDFS, where a split is, by default, a file block. The input is parsed into records as key-value pairs, and the map function is applied to each record, returning zero or more new records. These intermediate outputs are stored as files on the local file system, sorted first by bucket (partition) number and then by key. At the end of the map phase, completion information is sent to the master node.

Partition Phase: each mapper must determine which reducer will receive each of its outputs. For any key, regardless of which mapper generated it, the destination partition is the same, so every pair for a given word goes to the same reducer. Note that the number of partitions equals the number of reducers.
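A common way to achieve this property, sketched here in Python, is to hash the key modulo the number of reducers (this mirrors the idea behind Hadoop's default hash partitioner; note that Python's built-in hash is randomized between runs, but it is stable within one process, which is enough for this illustration):

```python
def partition(key: str, num_reducers: int) -> int:
    # Hash the key and take it modulo the reducer count, so the same
    # key always lands in the same partition, no matter which mapper
    # emitted it.
    return hash(key) % num_reducers

# Every occurrence of "dog" goes to the same partition,
# regardless of which mapper produced the pair.
assert partition("dog", 4) == partition("dog", 4)
```

Because there is one partition per reducer, the function's range 0..num_reducers-1 directly names the destination reducer.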

Shuffle Phase: input data is fetched from all map tasks, for the portion corresponding to the reduce task's bucket.

Sort Phase: a merge sort of all map outputs is performed in a single run.

Reduce Phase: finally, a user-defined reduce function is applied to the merged run. The output is written to a file in HDFS.
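The sort and reduce phases can be sketched together: each map task's output arrives already sorted, the reducer merge-sorts those runs in a single pass, and a user-defined reduce function is applied to each group of equal keys. A single-process sketch using Python's heapq.merge and itertools.groupby (the sample runs are made up):

```python
from heapq import merge
from itertools import groupby

# Each map task produced a sorted run of (key, value) pairs
# for this reducer's partition.
run1 = [("a", 1), ("cat", 1), ("dog", 1)]
run2 = [("a", 1), ("dog", 1), ("the", 1)]

# Sort phase: merge the pre-sorted runs in a single pass.
merged = merge(run1, run2)

# Reduce phase: apply a user-defined reduce function per key.
def reduce_fn(key, values):
    return key, sum(values)

output = [reduce_fn(key, (v for _, v in group))
          for key, group in groupby(merged, key=lambda kv: kv[0])]

print(output)  # [('a', 2), ('cat', 1), ('dog', 2), ('the', 1)]
```

heapq.merge exploits the fact that each input run is already sorted, so the reducer never needs to re-sort all of the map output from scratch.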







