When do reduce tasks start in Hadoop?


Hadoop Problem Overview


In Hadoop, when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?

Hadoop Solutions


Solution 1 - Hadoop

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are still generating data, since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0-33% means it's doing the shuffle, 34-66% is sort, and 67-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they're waiting for the mappers to finish.

Reducers start shuffling based on a threshold: the percentage of mappers that have finished. You can change this parameter to make reducers start sooner or later.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which helps if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for mappers to finish. Another job that starts later and would actually use those reduce slots can't get them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In newer versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256).
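For example, here is a minimal driver sketch (using the Hadoop 2.x Java API; the class name, job name, and the 0.90 value are illustrative choices, not part of the original answer) that overrides the threshold for a single job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Don't launch reducers until 90% of the map tasks have completed.
        // On older Hadoop versions the key is "mapred.reduce.slowstart.completed.maps".
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.90f);

        Job job = Job.getInstance(conf, "slowstart example");
        // ... set mapper, reducer, and input/output paths as usual, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If your driver uses ToolRunner, the same setting can also be passed on the command line with -D mapreduce.job.reduce.slowstart.completedmaps=0.90, without recompiling.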

Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.

Solution 2 - Hadoop

The reduce phase can start long before a reducer is called. As soon as "a" mapper finishes its job, the generated data undergoes some sorting and shuffling (which includes calls to the combiner and the partitioner). The reducer "phase" kicks in the moment this post-mapper data processing starts. As that processing progresses, you will see the reducer percentage advance; however, none of the reducers have actually been invoked yet. Depending on the number of processors available/used, the nature of the data, and the number of expected reducers, you may want to change the parameter as described by @Donald-miner above.
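To make the partitioner step concrete: the partitioner decides which reducer receives each intermediate key. Below is a standalone sketch that mirrors the behavior of Hadoop's default HashPartitioner; the demo class and word list are hypothetical, written purely for illustration:

```java
// Standalone illustration of how Hadoop's default HashPartitioner assigns
// an intermediate key to one of numReduceTasks reducers.
public class PartitionDemo {
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"is", "was", "my", "and"}) {
            System.out.println(word + " -> reducer " + partitionFor(word, 2));
        }
    }
}
```

Every occurrence of the same word hashes to the same reducer, which is what lets one reducer see all the counts for a given key.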

Solution 3 - Hadoop

As far as I understand, the reduce phase starts with the map phase and keeps consuming records from the maps. However, since there is a sort and shuffle phase after the map phase, all the outputs have to be sorted and sent to the reducer. So logically you can imagine that the reduce phase starts only after the map phase, but in practice, for performance reasons, reducers are also initialized alongside the mappers.

Solution 4 - Hadoop

The percentage shown for the reduce phase is actually about the amount of data copied from the map outputs to the reducers' input directories. When does this copying start? It is governed by the configuration Donald showed above. Once all the data is copied to the reducers (i.e., 100% reduce), that's when the reducers start working, which is why the job might appear frozen at "100% reduce" if your reducer code is I/O or CPU intensive.

Solution 5 - Hadoop

Reduce starts only after all the mappers have finished their tasks. The reducer has to communicate with all the mappers, so it has to wait until the last mapper finishes its task. However, each mapper starts transferring data the moment it has completed its task.

Solution 6 - Hadoop

Consider a WordCount example in order to better understand how a MapReduce job works. Suppose we have a large file, say a novel, and our task is to find the number of times each word occurs in the file. Since the file is large, it might be divided into different blocks and replicated on different worker nodes. The word count job is composed of map and reduce tasks. The map task takes each block as input and produces intermediate key-value pairs. In this example, since we are counting the number of occurrences of words, a mapper processing a block would produce intermediate results of the form (word1,count1), (word2,count2), etc. The intermediate results of all the mappers are passed through a shuffle phase, which will reorder the intermediate results.

Assume that our map output from different mappers is of the following form:

Map 1: (is,24) (was,32) (and,12)

Map 2: (my,12) (is,23) (was,30)

The map outputs are sorted in such a manner that the same keys are given to the same reducer. Here that means the keys corresponding to "is", "was", etc. go to the same reducer. It is the reducer which produces the final output, which in this case would be: (and,12) (is,47) (my,12) (was,62)
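For reference, here is essentially the standard WordCount from the Hadoop MapReduce tutorial, lightly commented, showing the mapper that emits the intermediate pairs and the reducer that sums them per key:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word; also usable as a combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Registering the reducer class as the combiner, as done above, works here because summing counts is associative and commutative; it pre-aggregates on each mapper so less intermediate data crosses the network during the shuffle.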

Solution 7 - Hadoop

Reducer tasks start only after the completion of all the mappers.

But the data transfer happens after each map finishes. Actually, it is a pull operation.

That means each reducer keeps asking every map task whether it has some data to retrieve. If it finds that any mapper has completed its task, the reducer pulls that intermediate data.

The intermediate data from a mapper is stored on local disk, and the data transfer from mapper to reducer happens over the network (data locality is not preserved in the reduce phase).

Solution 8 - Hadoop

When the mappers finish their tasks, the reducers start their job of reducing the data; this is how a MapReduce job works.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Slayer (View Question on Stackoverflow)
Solution 1 - Hadoop: Donald Miner (View Answer on Stackoverflow)
Solution 2 - Hadoop: javadevg (View Answer on Stackoverflow)
Solution 3 - Hadoop: Animesh Raj Jha (View Answer on Stackoverflow)
Solution 4 - Hadoop: Maha (View Answer on Stackoverflow)
Solution 5 - Hadoop: Abhishek (View Answer on Stackoverflow)
Solution 6 - Hadoop: Annie (View Answer on Stackoverflow)
Solution 7 - Hadoop: Unmesha Sreeveni U.B (View Answer on Stackoverflow)
Solution 8 - Hadoop: syed (View Answer on Stackoverflow)