Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> Re: setNumReduceTasks(1)


Copy link to this message
-
Re: setNumReduceTasks(1)
It is the latter, not the former.  Yes, but approximation is not allowed in
this case.  Not sure how splitting in 2 jobs would help but let me think
more about it.  Thanks for the help.

On Sat, Jan 30, 2010 at 1:35 AM, Mridul Muralidharan
<[EMAIL PROTECTED]>wrote:

>
> Top K is slightly more complicated (in comparison) to implement efficiently
> : you might want to look at other projects like pig to see how they do it
> (to compare and look at ideas).
>
> Just to get an understanding - your mappers generate <key, value>, and you
> want to pick top K based on value in reducer side ?
> Or can you have multiple key's coming in from various mappers and you need
> to aggregate it at reducer ?
>
>
> If former (that is key is unique), then a combiner to emit's top K per
> mapper, and then a single reducer which sorts and picks from the M * C * K
> tuples should do the trick (M == number of mappers, C == avg number of
> combiner invocations per mapper, K == number of output tuples required).
>
> If latter, you can try to do heuristics to approximate the value, but it
> always has a error margin (to efficiently do it : this is something I ask in
> interviews :) ) which you will need to take into account - or you can just
> split it into two jobs : aggregate in job 1, top K in job 2.
>
> Regards,
> Mridul
>
>
>
> Something Something wrote:
>
>> N could be up to 1000, and output from Map job could be about 5 Million.
>>  We
>> only want the top 1000 because rest of it could be just noise.  Thanks for
>> your help.
>>
>> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED]
>> >wrote:
>>
>>  How big is N?  How big is outcome of Map job?
>>>
>>> Alex.
>>>
>>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>  I am sorry, but I forgot to add one important piece of information.
>>>>
>>>> I don't want to write any random N rows to the table.  I want to write
>>>>
>>> the
>>>
>>>> *top* N rows - meaning - I want to write the "key" values of the Reducer
>>>>
>>> in
>>>
>>>> descending order.  Does this make sense?  Sorry for the confusion.
>>>>
>>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
>>>> [EMAIL PROTECTED]
>>>>
>>>>> wrote:
>>>>> A possible solution is to emit only N rows from each mapper and then
>>>>>
>>>> use
>>>
>>>> 1
>>>>
>>>>> reduce task [*] - if value of N is not very high.
>>>>> So you end up with utmost m * N rows on reducer instead of full
>>>>>
>>>> inputset
>>>
>>>> -
>>>>
>>>>> and so the limit can be done easier.
>>>>>
>>>>>
>>>>> If you ok with some sort of variance in the number of rows inserted
>>>>>
>>>> (and
>>>
>>>> if
>>>>
>>>>> value of N is very high), you can do more interesting things like N/m'
>>>>>
>>>> rows
>>>>
>>>>> per mapper - and multiple reducers (r) : with assumtion that each
>>>>>
>>>> reducer
>>>
>>>> will see atleast N/r rows - and so you can limit to N/r per reducer :
>>>>> ofcourse, there is a possible error that gets introduced here ...
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> [*] Assuming you just want simple limit - nothing else.
>>>>> Also note, each mapper might want to emit N rows instead of 'tweaks'
>>>>>
>>>> like
>>>
>>>> N/m rows, since it is possible that multiple mappers might have less
>>>>>
>>>> than
>>>
>>>> N/m rows to emit to begin with !
>>>>>
>>>>>
>>>>>
>>>>> Something Something wrote:
>>>>>
>>>>>  If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the
>>>>>> class
>>>>>> be instantiated only on one machine.. always?  I mean if I have a
>>>>>>
>>>>> cluster
>>>>
>>>>> of
>>>>>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class
>>>>>>
>>>>> guaranteed
>>>
>>>> to
>>>>>> be instantiated only on 1 machine?
>>>>>>
>>>>>> If answer is yes, then I will use static variable as a counter to see
>>>>>>
>>>>> how
>>>>
>>>>> may rows have been added to my HBase table so far.  In my use case, I
>>>>>>
>>>>> want
>>>>
>>>>> to write only N number of rows to a table.  Is there a better way to
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB