-
When reduce tasks start in MapReduce Streaming?
Pedro Sá da Costa 2013-01-15, 23:06
Hi,
I have read that in MapReduce the reduce tasks only start after a certain percentage (by default 90%) of the map tasks have finished. This means that the slowest maps can delay the start of the reduce tasks, and that the reduce tasks consume their input as a batch. In other words, there is no scenario in which reduce tasks consume data as the map tasks produce it. Does the same hold for Hadoop Streaming?
-- Best regards, P
-
Re: When reduce tasks start in MapReduce Streaming?
Jeff Bean 2013-01-16, 05:41
Hi Pedro,
Yes, Hadoop Streaming has the same property. The reduce method is not called until all the mappers are done, and the reducers are not scheduled until the fraction of completed maps reaches the threshold set by mapred.reduce.slowstart.completed.maps.
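To make the two conditions in this answer concrete, here is a toy model in Python. It is only a sketch: the function names are hypothetical and it is not Hadoop's scheduler code. It assumes the threshold is a fraction between 0 and 1. As far as I know, the value shipped in mapred-default.xml for Hadoop 1.x is 0.05 and some distributions raise it, so the 90% mentioned in the question is worth checking against the cluster's own configuration.

```python
def reducers_may_be_scheduled(completed_maps, total_maps, slowstart):
    """Toy model of the slowstart gate (hypothetical helper, not Hadoop code).

    Reducers become eligible for scheduling once the fraction of completed
    map tasks reaches the configured threshold.
    """
    return completed_maps >= slowstart * total_maps


def reduce_may_be_called(completed_maps, total_maps):
    """The user's reduce function runs only after every map task has finished."""
    return completed_maps == total_maps


if __name__ == "__main__":
    total = 10
    for done in (4, 5, 9, 10):
        print(
            done,
            reducers_may_be_scheduled(done, total, slowstart=0.5),
            reduce_may_be_called(done, total),
        )
```

Between the two points, reducers that have already been scheduled spend their time copying finished map output, so a lower threshold overlaps that copy phase with the remaining maps but does not make reduce run any earlier.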
-
Re: When reduce tasks start in MapReduce Streaming?
Pedro Sá da Costa 2013-01-16, 09:04
So why is it called Hadoop Streaming if it doesn't behave like a streaming application (the reducers don't receive data as the map tasks produce it)?
-
Re: When reduce tasks start in MapReduce Streaming?
Jeff Bean 2013-01-16, 09:20
It's called Hadoop Streaming because keys and values are streamed to the stdin of the script you specify for Hadoop Streaming, and the script's output is then captured from its stdout.
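As an illustration of that stdin/stdout contract, here is a minimal word-count sketch in Python. It assumes the default Streaming text format, one record per line with key and value separated by a tab, and it assumes the reducer input arrives sorted by key. The function names and the "map"/"reduce" command-line argument are made up for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical convention: run as `script.py map` or `script.py reduce`.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The script only ever sees lines on stdin and writes lines to stdout. When those lines arrive is decided by the framework, which is why the name says nothing about reducers receiving map output incrementally.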