|
|
-
Re: setNumReduceTasks(1)
Jeff Zhang 2010-01-26, 07:27
Mridul,
What do you mean about "Counter's are not synchronized in 'real-time' " ? As I know, JT will aggregate Counters from TT, so I think the aggregated Counter in JT should be correct. On Tue, Jan 26, 2010 at 3:08 PM, Mridul Muralidharan <[EMAIL PROTECTED]>wrote:
> Jeff Zhang wrote: > >> *See my comments below* >> >> >> On Mon, Jan 25, 2010 at 3:22 PM, Something Something < >> [EMAIL PROTECTED]> wrote: >> >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the >>> class >>> be instantiated only on one machine.. always? I mean if I have a cluster >>> of >>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed >>> to >>> be instantiated only on 1 machine? >>> >>> *--Yes* >>> >> >> >> If answer is yes, then I will use static variable as a counter to see how >>> may rows have been added to my HBase table so far. In my use case, I >>> want >>> to write only N number of rows to a table. Is there a better way to do >>> this? Please let me know. Thanks. >>> >>> >> *--Maybe you can use Counter to track the number of rows you add to HBase, >> then you do not need to limit the reduce task as 1* >> >> >> > Counter's are not synchronized in 'real-time' : so you cant use that to > limit at addition time imo. > It is more for aggregation, not realtime messaging. > > - Mridul >
-- Best Regards
Jeff Zhang
-
Re: setNumReduceTasks(1)
Mridul Muralidharan 2010-01-27, 14:39
Jeff Zhang wrote: > Mridul, > > What do you mean about "Counter's are not synchronized in 'real-time' " ? > As I know, JT will aggregate Counters from TT, so I think the aggregated > Counter in JT should be correct.
Aggregate counters are guaranteed to be correct at end of a logical state - not necessarily in between. Consider cases of mapper/reducer task re-execution, caching at the task nodes (counters piggyback on heartbeat - and so every XX seconds), etc.
So trying to limit output based on counter would typically result in not optimal results.
Regards, Mridul
> > > On Tue, Jan 26, 2010 at 3:08 PM, Mridul Muralidharan > <[EMAIL PROTECTED]>wrote: > >> Jeff Zhang wrote: >> >>> *See my comments below* >>> >>> >>> On Mon, Jan 25, 2010 at 3:22 PM, Something Something < >>> [EMAIL PROTECTED]> wrote: >>> >>> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the >>>> class >>>> be instantiated only on one machine.. always? I mean if I have a cluster >>>> of >>>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed >>>> to >>>> be instantiated only on 1 machine? >>>> >>>> *--Yes* >>>> >>> >>> If answer is yes, then I will use static variable as a counter to see how >>>> may rows have been added to my HBase table so far. In my use case, I >>>> want >>>> to write only N number of rows to a table. Is there a better way to do >>>> this? Please let me know. Thanks. >>>> >>>> >>> *--Maybe you can use Counter to track the number of rows you add to HBase, >>> then you do not need to limit the reduce task as 1* >>> >>> >>> >> Counter's are not synchronized in 'real-time' : so you cant use that to >> limit at addition time imo. >> It is more for aggregation, not realtime messaging. >> >> - Mridul >> > > >
-
Re: setNumReduceTasks(1)
Alex Baranov 2010-01-29, 19:43
How big is N? How big is outcome of Map job?
Alex.
On Fri, Jan 29, 2010 at 7:36 PM, Something Something < [EMAIL PROTECTED]> wrote:
> I am sorry, but I forgot to add one important piece of information. > > I don't want to write any random N rows to the table. I want to write the > *top* N rows - meaning - I want to write the "key" values of the Reducer in > descending order. Does this make sense? Sorry for the confusion. > > On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan < > [EMAIL PROTECTED] > > wrote: > > > > > A possible solution is to emit only N rows from each mapper and then use > 1 > > reduce task [*] - if value of N is not very high. > > So you end up with utmost m * N rows on reducer instead of full inputset > - > > and so the limit can be done easier. > > > > > > If you ok with some sort of variance in the number of rows inserted (and > if > > value of N is very high), you can do more interesting things like N/m' > rows > > per mapper - and multiple reducers (r) : with assumtion that each reducer > > will see atleast N/r rows - and so you can limit to N/r per reducer : > > ofcourse, there is a possible error that gets introduced here ... > > > > > > Regards, > > Mridul > > > > [*] Assuming you just want simple limit - nothing else. > > Also note, each mapper might want to emit N rows instead of 'tweaks' like > > N/m rows, since it is possible that multiple mappers might have less than > > N/m rows to emit to begin with ! > > > > > > > > Something Something wrote: > > > >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the > >> class > >> be instantiated only on one machine.. always? I mean if I have a > cluster > >> of > >> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed > >> to > >> be instantiated only on 1 machine? > >> > >> If answer is yes, then I will use static variable as a counter to see > how > >> may rows have been added to my HBase table so far. In my use case, I > want > >> to write only N number of rows to a table. Is there a better way to do > >> this? Please let me know. Thanks. > >> > > > > >
-
Re: setNumReduceTasks(1)
Something Something 2010-01-30, 02:32
N could be up to 1000, and output from Map job could be about 5 Million. We only want the top 1000 because rest of it could be just noise. Thanks for your help.
On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED]>wrote:
> How big is N? How big is outcome of Map job? > > Alex. > > On Fri, Jan 29, 2010 at 7:36 PM, Something Something < > [EMAIL PROTECTED]> wrote: > > > I am sorry, but I forgot to add one important piece of information. > > > > I don't want to write any random N rows to the table. I want to write > the > > *top* N rows - meaning - I want to write the "key" values of the Reducer > in > > descending order. Does this make sense? Sorry for the confusion. > > > > On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan < > > [EMAIL PROTECTED] > > > wrote: > > > > > > > > A possible solution is to emit only N rows from each mapper and then > use > > 1 > > > reduce task [*] - if value of N is not very high. > > > So you end up with utmost m * N rows on reducer instead of full > inputset > > - > > > and so the limit can be done easier. > > > > > > > > > If you ok with some sort of variance in the number of rows inserted > (and > > if > > > value of N is very high), you can do more interesting things like N/m' > > rows > > > per mapper - and multiple reducers (r) : with assumtion that each > reducer > > > will see atleast N/r rows - and so you can limit to N/r per reducer : > > > ofcourse, there is a possible error that gets introduced here ... > > > > > > > > > Regards, > > > Mridul > > > > > > [*] Assuming you just want simple limit - nothing else. > > > Also note, each mapper might want to emit N rows instead of 'tweaks' > like > > > N/m rows, since it is possible that multiple mappers might have less > than > > > N/m rows to emit to begin with ! > > > > > > > > > > > > Something Something wrote: > > > > > >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the > > >> class > > >> be instantiated only on one machine.. always? I mean if I have a > > cluster > > >> of > > >> say 1 master, 10 workers & 3 zookeepers, is the Reducer class > guaranteed > > >> to > > >> be instantiated only on 1 machine? > > >> > > >> If answer is yes, then I will use static variable as a counter to see > > how > > >> may rows have been added to my HBase table so far. In my use case, I > > want > > >> to write only N number of rows to a table. Is there a better way to > do > > >> this? Please let me know. Thanks. > > >> > > > > > > > > >
-
Re: setNumReduceTasks(1)
Mridul Muralidharan 2010-01-30, 09:35
Top K is slightly more complicated (in comparison) to implement efficiently : you might want to look at other projects like pig to see how they do it (to compare and look at ideas).
Just to get an understanding - your mappers generate <key, value>, and you want to pick top K based on value in reducer side ? Or can you have multiple key's coming in from various mappers and you need to aggregate it at reducer ? If former (that is key is unique), then a combiner to emit's top K per mapper, and then a single reducer which sorts and picks from the M * C * K tuples should do the trick (M == number of mappers, C == avg number of combiner invocations per mapper, K == number of output tuples required).
If latter, you can try to do heuristics to approximate the value, but it always has a error margin (to efficiently do it : this is something I ask in interviews :) ) which you will need to take into account - or you can just split it into two jobs : aggregate in job 1, top K in job 2.
Regards, Mridul Something Something wrote: > N could be up to 1000, and output from Map job could be about 5 Million. We > only want the top 1000 because rest of it could be just noise. Thanks for > your help. > > On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED]>wrote: > >> How big is N? How big is outcome of Map job? >> >> Alex. >> >> On Fri, Jan 29, 2010 at 7:36 PM, Something Something < >> [EMAIL PROTECTED]> wrote: >> >>> I am sorry, but I forgot to add one important piece of information. >>> >>> I don't want to write any random N rows to the table. I want to write >> the >>> *top* N rows - meaning - I want to write the "key" values of the Reducer >> in >>> descending order. Does this make sense? Sorry for the confusion. >>> >>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan < >>> [EMAIL PROTECTED] >>>> wrote: >>>> A possible solution is to emit only N rows from each mapper and then >> use >>> 1 >>>> reduce task [*] - if value of N is not very high. >>>> So you end up with utmost m * N rows on reducer instead of full >> inputset >>> - >>>> and so the limit can be done easier. >>>> >>>> >>>> If you ok with some sort of variance in the number of rows inserted >> (and >>> if >>>> value of N is very high), you can do more interesting things like N/m' >>> rows >>>> per mapper - and multiple reducers (r) : with assumtion that each >> reducer >>>> will see atleast N/r rows - and so you can limit to N/r per reducer : >>>> ofcourse, there is a possible error that gets introduced here ... >>>> >>>> >>>> Regards, >>>> Mridul >>>> >>>> [*] Assuming you just want simple limit - nothing else. >>>> Also note, each mapper might want to emit N rows instead of 'tweaks' >> like >>>> N/m rows, since it is possible that multiple mappers might have less >> than >>>> N/m rows to emit to begin with ! >>>> >>>> >>>> >>>> Something Something wrote: >>>> >>>>> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the >>>>> class >>>>> be instantiated only on one machine.. always? I mean if I have a >>> cluster >>>>> of >>>>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class >> guaranteed >>>>> to >>>>> be instantiated only on 1 machine? >>>>> >>>>> If answer is yes, then I will use static variable as a counter to see >>> how >>>>> may rows have been added to my HBase table so far. In my use case, I >>> want >>>>> to write only N number of rows to a table. Is there a better way to >> do >>>>> this? Please let me know. Thanks. >>>>> >>>>
-
Re: setNumReduceTasks(1)
Something Something 2010-01-30, 16:59
It is the latter, not the former. Yes, but approximation is not allowed in this case. Not sure how splitting in 2 jobs would help but let me think more about it. Thanks for the help.
On Sat, Jan 30, 2010 at 1:35 AM, Mridul Muralidharan <[EMAIL PROTECTED]>wrote:
> > Top K is slightly more complicated (in comparison) to implement efficiently > : you might want to look at other projects like pig to see how they do it > (to compare and look at ideas). > > Just to get an understanding - your mappers generate <key, value>, and you > want to pick top K based on value in reducer side ? > Or can you have multiple key's coming in from various mappers and you need > to aggregate it at reducer ? > > > If former (that is key is unique), then a combiner to emit's top K per > mapper, and then a single reducer which sorts and picks from the M * C * K > tuples should do the trick (M == number of mappers, C == avg number of > combiner invocations per mapper, K == number of output tuples required). > > If latter, you can try to do heuristics to approximate the value, but it > always has a error margin (to efficiently do it : this is something I ask in > interviews :) ) which you will need to take into account - or you can just > split it into two jobs : aggregate in job 1, top K in job 2. > > Regards, > Mridul > > > > Something Something wrote: > >> N could be up to 1000, and output from Map job could be about 5 Million. >> We >> only want the top 1000 because rest of it could be just noise. Thanks for >> your help. >> >> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED] >> >wrote: >> >> How big is N? How big is outcome of Map job? >>> >>> Alex. >>> >>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something < >>> [EMAIL PROTECTED]> wrote: >>> >>> I am sorry, but I forgot to add one important piece of information. >>>> >>>> I don't want to write any random N rows to the table. I want to write >>>> >>> the >>> >>>> *top* N rows - meaning - I want to write the "key" values of the Reducer >>>> >>> in >>> >>>> descending order. Does this make sense? Sorry for the confusion. >>>> >>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan < >>>> [EMAIL PROTECTED] >>>> >>>>> wrote: >>>>> A possible solution is to emit only N rows from each mapper and then >>>>> >>>> use >>> >>>> 1 >>>> >>>>> reduce task [*] - if value of N is not very high. >>>>> So you end up with utmost m * N rows on reducer instead of full >>>>> >>>> inputset >>> >>>> - >>>> >>>>> and so the limit can be done easier. >>>>> >>>>> >>>>> If you ok with some sort of variance in the number of rows inserted >>>>> >>>> (and >>> >>>> if >>>> >>>>> value of N is very high), you can do more interesting things like N/m' >>>>> >>>> rows >>>> >>>>> per mapper - and multiple reducers (r) : with assumtion that each >>>>> >>>> reducer >>> >>>> will see atleast N/r rows - and so you can limit to N/r per reducer : >>>>> ofcourse, there is a possible error that gets introduced here ... >>>>> >>>>> >>>>> Regards, >>>>> Mridul >>>>> >>>>> [*] Assuming you just want simple limit - nothing else. >>>>> Also note, each mapper might want to emit N rows instead of 'tweaks' >>>>> >>>> like >>> >>>> N/m rows, since it is possible that multiple mappers might have less >>>>> >>>> than >>> >>>> N/m rows to emit to begin with ! >>>>> >>>>> >>>>> >>>>> Something Something wrote: >>>>> >>>>> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the >>>>>> class >>>>>> be instantiated only on one machine.. always? I mean if I have a >>>>>> >>>>> cluster >>>> >>>>> of >>>>>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class >>>>>> >>>>> guaranteed >>> >>>> to >>>>>> be instantiated only on 1 machine? >>>>>> >>>>>> If answer is yes, then I will use static variable as a counter to see >>>>>> >>>>> how >>>> >>>>> may rows have been added to my HBase table so far. In my use case, I >>>>>> >>>>> want >>>> >>>>> to write only N number of rows to a table. Is there a better way to
|
|