john smith 2011-09-10, 09:06
Hi,
Some of the MR jobs I run don't need sorting of map-output in each partition. Is there some way I can disable it?
Any help?
Thanks jS
Arun C Murthy 2011-09-10, 18:48
Run a map-only job with #reduces set to 0.
Arun
Meng Mao 2011-09-10, 19:33
Is there a way to collate the possibly large number of map output files, though?
Owen O'Malley 2011-09-10, 20:33
On Sat, Sep 10, 2011 at 12:33 PM, Meng Mao <[EMAIL PROTECTED]> wrote:
> Is there a way to collate the possibly large number of map output files, though?
You can make fewer mappers by setting the mapred.min.split.size to define the smallest input that will be given to a mapper.
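To illustrate how a larger minimum split size cuts the number of mappers (and so the number of map output files), here is a rough sketch in plain Java. It assumes the split size follows a rule of the form max(minSize, min(maxSize, blockSize)), ignores details such as slop on the final split, and uses a hypothetical 1 GB file with 64 MB blocks. The class and method names are made up for this example and are not Hadoop APIs.

```java
// Illustrative only: not Hadoop code. Assumes the effective split size is
// max(minSize, min(maxSize, blockSize)) and ignores slop on the last split.
final class SplitCountSketch {

    // Effective split size given the configured minimum, maximum, and block size.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (one mapper per split) for a single file of the given length.
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 1024 * mb; // hypothetical 1 GB input file
        long blockSize = 64 * mb;    // hypothetical HDFS block size

        // A minimum of 1 byte leaves the block size in control of the split size.
        long defaultSize = splitSize(1, Long.MAX_VALUE, blockSize);
        // Raising the minimum to 256 MB forces larger splits.
        long raisedSize = splitSize(256 * mb, Long.MAX_VALUE, blockSize);

        System.out.println("min split 1 byte: " + countSplits(fileLength, defaultSize) + " mappers");
        System.out.println("min split 256 MB: " + countSplits(fileLength, raisedSize) + " mappers");
    }
}
```

Under these assumptions the mapper count drops from 16 to 4, so a map-only job would write 4 output files instead of 16.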
There isn't currently a way of getting a collated but unsorted list of key/value pairs. For most applications, the in-memory sort is fairly cheap relative to the shuffle and other parts of the processing.
-- Owen
Arun C Murthy 2011-09-11, 01:33
The point of a 'reduce phase' is to aggregate keys from different maps (i.e. all inputs).
I'm not sure what you are trying to do, but a use-case would help.
In any case, the only way to achieve what you are trying to do is to run two jobs, with the first being a map-only job (i.e. #reduces = 0).
Arun
On Sep 10, 2011, at 10:19 PM, john smith wrote:
> Hey,
>
> I have reduce phases too. But for each reduce, I don't need sorted input
> (map-output for that corresponding reduce task).
> Setting #red to 0 completely removes the reduce phase.
>
> Am I missing something?
>
> Thanks,
john smith 2011-09-11, 05:19
Hey,
I have reduce phases too. But for each reduce, I don't need sorted input (map-output for that corresponding reduce task). Setting #red to 0 completely removes the reduce phase.
Am I missing something?
Thanks,
john smith 2011-09-11, 07:43
Hi Arun,
Suppose I am doing a simple wordcount and the map-phase is over. After the shuffle, the inputs to the reducer in each partition arrive sorted by key. I want to disable this.
Take the same case of wc. I don't mind the order in which my reduce gets the keys of a single partition. I guess Hadoop does an external sort for this. I want to disable that.
Thanks, jS
Joey Echeverria 2011-09-11, 09:56
The sort is what's implementing the group by key function. You can't have one without the other in Hadoop. Are you trying to disable the sort because you think it's too slow?
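As a rough illustration of why the two are tied together, here is a minimal sketch in plain Java of grouping by sorting: once the pairs are sorted by key, all values for one key sit next to each other, so a single pass can hand each key and its values to a reduce step. This is not Hadoop's actual code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: shows sort-then-scan grouping, not Hadoop internals.
final class SortGroupSketch {

    // Sorts (key, value) pairs by key, then emits one group per run of equal keys.
    static List<Map.Entry<String, List<Integer>>> groupBySorting(
            List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.comparingByKey());

        List<Map.Entry<String, List<Integer>>> groups = new ArrayList<>();
        String currentKey = null;
        List<Integer> currentValues = null;
        for (Map.Entry<String, Integer> pair : sorted) {
            // A change of key marks the start of a new group.
            if (currentValues == null || !pair.getKey().equals(currentKey)) {
                currentKey = pair.getKey();
                currentValues = new ArrayList<>();
                groups.add(Map.entry(currentKey, currentValues));
            }
            currentValues.add(pair.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical wordcount map output for one partition.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("the", 1), Map.entry("cat", 1),
                Map.entry("the", 1), Map.entry("sat", 1));

        // The "reduce" step: sum the values of each group.
        for (Map.Entry<String, List<Integer>> group : groupBySorting(mapOutput)) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            System.out.println(group.getKey() + "\t" + sum);
        }
    }
}
```

Without the sort, the two "the" entries would not be adjacent, and the single pass above could not tell where one key's values end and the next begin.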
-Joey
-- Joseph Echeverria Cloudera, Inc. 443.305.9434