-
Can I number output results with a Counter?
Mark Kerzner 2011-05-20, 16:55
Hi, can I use a Counter to give each record across all reducers a consecutive number? Currently I am using a single Reducer, but that is an anti-pattern. I need to assign consecutive numbers to all output records in all reducers, and it does not matter how, as long as each gets its own number.
If it IS possible, how do multiple processes access those counters without creating race conditions?
Thank you,
Mark
-
Re: Can I number output results with a Counter?
Joey Echeverria 2011-05-20, 17:01
To make sure I understand you correctly: you need a globally unique, one-up counter for each output record?
If you have an upper bound on the number of records a single reducer can output, and you can afford to have gaps, you could multiply the task ID by that maximum number of records and count up by one from there.
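A minimal sketch of the task-ID scheme described above, in plain Java with no Hadoop dependency. The class name `RangeNumberer` and its parameters are hypothetical and chosen only for illustration; how the task ID is obtained inside a real reducer is noted in a comment as an assumption.

```java
// Sketch only: plain Java, no Hadoop dependency. RangeNumberer is a hypothetical name.
final class RangeNumberer {
    private final long limit; // one past the last ID this task may hand out
    private long next;        // next ID to hand out

    // taskId: the reducer's task number. Assumption: in Hadoop's newer API this can
    // likely be read via context.getTaskAttemptID().getTaskID().getId().
    // maxRecordsPerTask: the assumed upper bound on records per reducer.
    RangeNumberer(int taskId, long maxRecordsPerTask) {
        if (taskId < 0 || maxRecordsPerTask <= 0) {
            throw new IllegalArgumentException("taskId must be >= 0 and the bound > 0");
        }
        // multiplyExact/addExact throw on overflow instead of silently wrapping.
        this.next = Math.multiplyExact((long) taskId, maxRecordsPerTask);
        this.limit = Math.addExact(this.next, maxRecordsPerTask);
    }

    long nextId() {
        if (next >= limit) {
            throw new IllegalStateException("reducer exceeded its assumed record bound");
        }
        return next++;
    }

    public static void main(String[] args) {
        RangeNumberer reducer0 = new RangeNumberer(0, 1000);
        RangeNumberer reducer2 = new RangeNumberer(2, 1000);
        System.out.println(reducer0.nextId()); // 0
        System.out.println(reducer0.nextId()); // 1
        System.out.println(reducer2.nextId()); // 2000
    }
}
```

Numbers are unique without any coordination, but every reducer that emits fewer records than the bound leaves a gap at the end of its range.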
If that doesn't work for you, then you'll need some kind of central service for allocating numbers, which could become a bottleneck.
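As a rough illustration of the central-service idea, the sketch below uses a single in-process `AtomicLong` as a stand-in for the service and threads as stand-ins for reducers. All names (`CentralAllocator`, `simulate`) are hypothetical. A real deployment would sit behind a network call, which is where the bottleneck mentioned above would come from, since every record costs one round trip.

```java
import java.util.SortedSet;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: an in-process stand-in for a central number-allocation service.
final class CentralAllocator {
    private final AtomicLong next = new AtomicLong(0);

    // Atomically hands out the next number; no two callers receive the same value.
    long allocate() {
        return next.getAndIncrement();
    }

    // Simulates `workers` reducers, each requesting `perWorker` numbers,
    // and returns every number that was handed out.
    static SortedSet<Long> simulate(int workers, int perWorker) {
        CentralAllocator allocator = new CentralAllocator();
        ConcurrentSkipListSet<Long> seen = new ConcurrentSkipListSet<>();
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            threads[w] = new Thread(() -> {
                for (int i = 0; i < perWorker; i++) {
                    seen.add(allocator.allocate());
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        SortedSet<Long> ids = simulate(4, 1000);
        // 4000 distinct values covering 0..3999 means no duplicates and no gaps.
        System.out.println(ids.size() + " ids, from " + ids.first() + " to " + ids.last());
    }
}
```

In this simulation the numbers come out unique and gapless because every allocation is used. In a real job, a task that fails after allocating would presumably still leave gaps unless the service can reclaim numbers.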
-Joey
-- Joseph Echeverria Cloudera, Inc. 443.305.9434
-
Re: Can I number output results with a Counter?
Mark Kerzner 2011-05-20, 17:17
Joey,
You understood me perfectly well. I see your first suggestion, but I am not allowed to have gaps. A central service is something I may consider if the single reducer becomes a worse bottleneck than the service would be.
But what are counters for? They seem to be exactly that.
Mark
-
Re: Can I number output results with a Counter?
Joey Echeverria 2011-05-20, 17:34
Counters are a way to get status from your running job. They don't increment a global state. They locally save increments and periodically report those increments to the central counter. That means that the final count will be correct, but you can't use them to coordinate counts while your job is running.
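The behaviour described above can be modelled with a small sketch. This is not Hadoop's implementation: `Central` and `TaskCounter` are hypothetical names, and the explicit `flush()` stands in for the periodic report. In a real job the increment itself would typically be something like `context.getCounter(group, name).increment(1)`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a toy model of locally buffered counters, not Hadoop's code.
final class CounterSketch {

    // Stand-in for the job-wide counter total.
    static final class Central {
        private final AtomicLong total = new AtomicLong();

        void report(long delta) {
            total.addAndGet(delta);
        }

        long total() {
            return total.get();
        }
    }

    // Stand-in for the counter as seen inside one task.
    static final class TaskCounter {
        private final Central central;
        private long pending; // increments not yet reported

        TaskCounter(Central central) {
            this.central = central;
        }

        // Cheap and purely local: nothing is sent anywhere.
        void increment() {
            pending++;
        }

        // Models the periodic report of accumulated increments.
        void flush() {
            central.report(pending);
            pending = 0;
        }
    }

    public static void main(String[] args) {
        Central central = new Central();
        TaskCounter taskA = new TaskCounter(central);
        TaskCounter taskB = new TaskCounter(central);
        for (int i = 0; i < 3; i++) taskA.increment();
        for (int i = 0; i < 5; i++) taskB.increment();
        System.out.println(central.total()); // 0: neither task has reported yet
        taskA.flush();
        taskB.flush();
        System.out.println(central.total()); // 8: the final total is correct
    }
}
```

While the tasks are running, neither can see the other's pending increments, so the central value cannot serve as a shared sequence number.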
-Joey
-
Re: Can I number output results with a Counter?
Kai Voigt 2011-05-20, 17:39
Also, with speculative execution enabled, you might see a higher count than you expect while the same task is running multiple times in parallel. When a task attempt gets killed because another instance was quicker, its counters will be removed from the global count, though.
Kai
-- Kai Voigt [EMAIL PROTECTED]
-
Re: Can I number output results with a Counter?
Mark Kerzner 2011-05-20, 18:38
Thank you, Kai and Joey, for the explanation. That's what I thought about them, but I did not want to miss a "magical" replacement for a central service in the counters. No, there is no magic, just great reality.
Mark