Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - Re: setNumReduceTasks(1)


Copy link to this message
-
Re: setNumReduceTasks(1)
Something Something 2010-01-30, 16:59
It is the latter, not the former.  Yes, but approximation is not allowed in
this case.  Not sure how splitting in 2 jobs would help but let me think
more about it.  Thanks for the help.

On Sat, Jan 30, 2010 at 1:35 AM, Mridul Muralidharan
<[EMAIL PROTECTED]>wrote:

>
> Top K is slightly more complicated (in comparison) to implement efficiently
> : you might want to look at other projects like pig to see how they do it
> (to compare and look at ideas).
>
> Just to get an understanding - your mappers generate <key, value>, and you
> want to pick top K based on value in reducer side ?
> Or can you have multiple key's coming in from various mappers and you need
> to aggregate it at reducer ?
>
>
> If former (that is key is unique), then a combiner to emit's top K per
> mapper, and then a single reducer which sorts and picks from the M * C * K
> tuples should do the trick (M == number of mappers, C == avg number of
> combiner invocations per mapper, K == number of output tuples required).
>
> If latter, you can try to do heuristics to approximate the value, but it
> always has a error margin (to efficiently do it : this is something I ask in
> interviews :) ) which you will need to take into account - or you can just
> split it into two jobs : aggregate in job 1, top K in job 2.
>
> Regards,
> Mridul
>
>
>
> Something Something wrote:
>
>> N could be up to 1000, and output from Map job could be about 5 Million.
>>  We
>> only want the top 1000 because rest of it could be just noise.  Thanks for
>> your help.
>>
>> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED]
>> >wrote:
>>
>>  How big is N?  How big is outcome of Map job?
>>>
>>> Alex.
>>>
>>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>  I am sorry, but I forgot to add one important piece of information.
>>>>
>>>> I don't want to write any random N rows to the table.  I want to write
>>>>
>>> the
>>>
>>>> *top* N rows - meaning - I want to write the "key" values of the Reducer
>>>>
>>> in
>>>
>>>> descending order.  Does this make sense?  Sorry for the confusion.
>>>>
>>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
>>>> [EMAIL PROTECTED]
>>>>
>>>>> wrote:
>>>>> A possible solution is to emit only N rows from each mapper and then
>>>>>
>>>> use
>>>
>>>> 1
>>>>
>>>>> reduce task [*] - if value of N is not very high.
>>>>> So you end up with utmost m * N rows on reducer instead of full
>>>>>
>>>> inputset
>>>
>>>> -
>>>>
>>>>> and so the limit can be done easier.
>>>>>
>>>>>
>>>>> If you ok with some sort of variance in the number of rows inserted
>>>>>
>>>> (and
>>>
>>>> if
>>>>
>>>>> value of N is very high), you can do more interesting things like N/m'
>>>>>
>>>> rows
>>>>
>>>>> per mapper - and multiple reducers (r) : with assumtion that each
>>>>>
>>>> reducer
>>>
>>>> will see atleast N/r rows - and so you can limit to N/r per reducer :
>>>>> ofcourse, there is a possible error that gets introduced here ...
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> [*] Assuming you just want simple limit - nothing else.
>>>>> Also note, each mapper might want to emit N rows instead of 'tweaks'
>>>>>
>>>> like
>>>
>>>> N/m rows, since it is possible that multiple mappers might have less
>>>>>
>>>> than
>>>
>>>> N/m rows to emit to begin with !
>>>>>
>>>>>
>>>>>
>>>>> Something Something wrote:
>>>>>
>>>>>  If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the
>>>>>> class
>>>>>> be instantiated only on one machine.. always?  I mean if I have a
>>>>>>
>>>>> cluster
>>>>
>>>>> of
>>>>>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class
>>>>>>
>>>>> guaranteed
>>>
>>>> to
>>>>>> be instantiated only on 1 machine?
>>>>>>
>>>>>> If answer is yes, then I will use static variable as a counter to see
>>>>>>
>>>>> how
>>>>
>>>>> may rows have been added to my HBase table so far.  In my use case, I
>>>>>>
>>>>> want
>>>>
>>>>> to write only N number of rows to a table.  Is there a better way to