|
Mark Kerzner
2009-10-28, 03:55
Aaron Kimball
2009-10-28, 04:27
Mark Kerzner
2009-10-28, 04:34
Michael Klatt
2009-10-28, 17:40
Mark Kerzner
2009-10-28, 17:49
Michael Klatt
2009-10-28, 17:57
Mark Kerzner
2009-10-28, 18:03
brien colwell
2009-10-28, 18:07
Mark Kerzner
2009-10-28, 18:20
Edward Capriolo
2009-10-28, 19:22
Mark Kerzner
2009-10-28, 19:29
|
-
How to give consecutive numbers to output records?Mark Kerzner 2009-10-28, 03:55
Hi,
I need to number all output records consecutively, like, 1,2,3... This is no problem with one reducer, making recordId an instance variable in the Reducer class, and setting conf.setNumReduceTasks(1) However, it is an architectural decision forced by processing need, where the reducer becomes a bottleneck. Can I have a global variable for all reducers, which would give each the next consecutive recordId? In the database scenario, this would be the unique autokey. How to do it in MapReduce? Thank you
-
Re: How to give consecutive numbers to output records?Aaron Kimball 2009-10-28, 04:27
There is no in-MapReduce mechanism for cross-task synchronization. You'll
need to use something like Zookeeper for this, or another external database. Note that this will greatly complicate your life. If I were you, I'd try to either redesign my pipeline elsewhere to eliminate this need, or maybe get really clever. For example, do your numbers need to be sequential, or just unique? If the latter, then take the byte offset into the reducer's current output file and combine that with the reducer id (e.g., <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're all building unique sequences. If the former... rethink your pipeline? :) - Aaron On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote: > Hi, > > I need to number all output records consecutively, like, 1,2,3... > > This is no problem with one reducer, making recordId an instance variable > in > the Reducer class, and setting conf.setNumReduceTasks(1) > > However, it is an architectural decision forced by processing need, where > the reducer becomes a bottleneck. Can I have a global variable for all > reducers, which would give each the next consecutive recordId? In the > database scenario, this would be the unique autokey. How to do it in > MapReduce? > > Thank you >
-
Re: How to give consecutive numbers to output records?Mark Kerzner 2009-10-28, 04:34
Aaron, although your notes are not a ready solution, but they are a great
help. Thank you, Mark On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote: > There is no in-MapReduce mechanism for cross-task synchronization. You'll > need to use something like Zookeeper for this, or another external > database. > Note that this will greatly complicate your life. > > If I were you, I'd try to either redesign my pipeline elsewhere to > eliminate > this need, or maybe get really clever. For example, do your numbers need to > be sequential, or just unique? > > If the latter, then take the byte offset into the reducer's current output > file and combine that with the reducer id (e.g., > <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're > all > building unique sequences. If the former... rethink your pipeline? :) > > - Aaron > > On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> > wrote: > > > Hi, > > > > I need to number all output records consecutively, like, 1,2,3... > > > > This is no problem with one reducer, making recordId an instance variable > > in > > the Reducer class, and setting conf.setNumReduceTasks(1) > > > > However, it is an architectural decision forced by processing need, where > > the reducer becomes a bottleneck. Can I have a global variable for all > > reducers, which would give each the next consecutive recordId? In the > > database scenario, this would be the unique autokey. How to do it in > > MapReduce? > > > > Thank you > > >
-
Re: How to give consecutive numbers to output records?Michael Klatt 2009-10-28, 17:40
I posted an approach to this using streaming, but if the environment variables are available in standard Java interface, this may work for you. http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html You'll have to be able to tolerate some small gaps in the ids. Michael Mark Kerzner wrote: > > > Aaron, although your notes are not a ready solution, but they are a great > help. > > Thank you, > Mark > > On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote: > >> There is no in-MapReduce mechanism for cross-task synchronization. You'll >> need to use something like Zookeeper for this, or another external >> database. >> Note that this will greatly complicate your life. >> >> If I were you, I'd try to either redesign my pipeline elsewhere to >> eliminate >> this need, or maybe get really clever. For example, do your numbers need to >> be sequential, or just unique? >> >> If the latter, then take the byte offset into the reducer's current output >> file and combine that with the reducer id (e.g., >> <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're >> all >> building unique sequences. If the former... rethink your pipeline? :) >> >> - Aaron >> >> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >> wrote: >> >> > Hi, >> > >> > I need to number all output records consecutively, like, 1,2,3... >> > >> > This is no problem with one reducer, making recordId an instance variable >> > in >> > the Reducer class, and setting conf.setNumReduceTasks(1) >> > >> > However, it is an architectural decision forced by processing need, where >> > the reducer becomes a bottleneck. Can I have a global variable for all >> > reducers, which would give each the next consecutive recordId? In the >> > database scenario, this would be the unique autokey. How to do it in >> > MapReduce? >> > >> > Thank you >> > >> > >
-
Re: How to give consecutive numbers to output records?Mark Kerzner 2009-10-28, 17:49
Michael,
environmental variables are available in Java, but the environment itself is not shared between instances. I read your code - you are solving exactly the same problem I am interested in - but I did not see how it works in distributed environment. By the way, it occurs to me that JavaSpaces, which is a different approach to distributed computing, trumpled by Hadoop, could be used here! Just run one instance with GigaSpaces at all times, and you got your self-increment for any number of jobs. It is perfect for concurrent processing and very fast. Thank you, Mark On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[EMAIL PROTECTED]>wrote: > > I posted an approach to this using streaming, but if the environment > variables are available in standard Java interface, this may work for you. > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html > > You'll have to be able to tolerate some small gaps in the ids. > > Michael > > > Mark Kerzner wrote: > >> >> >> Aaron, although your notes are not a ready solution, but they are a great >> help. >> >> Thank you, >> Mark >> >> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> >> wrote: >> >> There is no in-MapReduce mechanism for cross-task synchronization. You'll >>> need to use something like Zookeeper for this, or another external >>> database. >>> Note that this will greatly complicate your life. >>> >>> If I were you, I'd try to either redesign my pipeline elsewhere to >>> eliminate >>> this need, or maybe get really clever. For example, do your numbers need >>> to >>> be sequential, or just unique? >>> >>> If the latter, then take the byte offset into the reducer's current >>> output >>> file and combine that with the reducer id (e.g., >>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're >>> all >>> building unique sequences. If the former... rethink your pipeline? :) >>> >>> - Aaron >>> >>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >>> wrote: >>> >>> > Hi, >>> > >>> > I need to number all output records consecutively, like, 1,2,3... >>> > >>> > This is no problem with one reducer, making recordId an instance >>> variable >>> > in >>> > the Reducer class, and setting conf.setNumReduceTasks(1) >>> > >>> > However, it is an architectural decision forced by processing need, >>> where >>> > the reducer becomes a bottleneck. Can I have a global variable for all >>> > reducers, which would give each the next consecutive recordId? In the >>> > database scenario, this would be the unique autokey. How to do it in >>> > MapReduce? >>> > >>> > Thank you >>> > >>> >>> >> >>
-
Re: How to give consecutive numbers to output records?Michael Klatt 2009-10-28, 17:57
Hi Mark, Each mapper (or reducer) has an environment variable "mapred_map_tasks" (or "mapred_reduce_tasks") which will describe how many tasks the map or reduce job was split into. It also has a variable "mapred_task_id" which contains a unique identifier for the task. Using these two together, it's able to generate a sequence of numbers which won't collide with other mappers (or reducers). For example, if a job has 20 (map or reduce) tasks, then task #1 will generate the sequence (assuming the first id is zero): 0,20,40,60,80,100,120... and task #2 will generate 1,21,41,61,81,101,121... and so on. This approach works in both the mapper OR the reducer...you just have to look at different variables. Michael Mark Kerzner wrote: > Michael, > > environmental variables are available in Java, but the environment itself is > not shared between instances. I read your code - you are solving exactly the > same problem I am interested in - but I did not see how it works in > distributed environment. > > By the way, it occurs to me that JavaSpaces, which is a different approach > to distributed computing, trumpled by Hadoop, could be used here! Just run > one instance with GigaSpaces at all times, and you got your self-increment > for any number of jobs. It is perfect for concurrent processing and very > fast. > > Thank you, > Mark > > On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[EMAIL PROTECTED]>wrote: > >> I posted an approach to this using streaming, but if the environment >> variables are available in standard Java interface, this may work for you. >> >> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html >> >> You'll have to be able to tolerate some small gaps in the ids. >> >> Michael >> >> >> Mark Kerzner wrote: >> >>> >>> Aaron, although your notes are not a ready solution, but they are a great >>> help. >>> >>> Thank you, >>> Mark >>> >>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> >>> wrote: >>> >>> There is no in-MapReduce mechanism for cross-task synchronization. You'll >>>> need to use something like Zookeeper for this, or another external >>>> database. >>>> Note that this will greatly complicate your life. >>>> >>>> If I were you, I'd try to either redesign my pipeline elsewhere to >>>> eliminate >>>> this need, or maybe get really clever. For example, do your numbers need >>>> to >>>> be sequential, or just unique? >>>> >>>> If the latter, then take the byte offset into the reducer's current >>>> output >>>> file and combine that with the reducer id (e.g., >>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're >>>> all >>>> building unique sequences. If the former... rethink your pipeline? :) >>>> >>>> - Aaron >>>> >>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> I need to number all output records consecutively, like, 1,2,3... >>>>> >>>>> This is no problem with one reducer, making recordId an instance >>>> variable >>>>> in >>>>> the Reducer class, and setting conf.setNumReduceTasks(1) >>>>> >>>>> However, it is an architectural decision forced by processing need, >>>> where >>>>> the reducer becomes a bottleneck. Can I have a global variable for all >>>>> reducers, which would give each the next consecutive recordId? In the >>>>> database scenario, this would be the unique autokey. How to do it in >>>>> MapReduce? >>>>> >>>>> Thank you >>>>> >>>> >>> >
-
Re: How to give consecutive numbers to output records?Mark Kerzner 2009-10-28, 18:03
Oh, I see. Very smart.
In my case, I need consecutive numbers with no gaps, and I need them in the order of how Hadoop sorted the maps. So I don't see how I could apply this approach, but thank you - it is a great discussion, which was helpful to consider all issues around this, and which brought about the idea of JavaSpaces - I am going to check out this one. Mark On Wed, Oct 28, 2009 at 12:57 PM, Michael Klatt <[EMAIL PROTECTED]>wrote: > > Hi Mark, > > Each mapper (or reducer) has an environment variable "mapred_map_tasks" (or > "mapred_reduce_tasks") which will describe how many tasks the map or reduce > job was split into. It also has a variable > "mapred_task_id" which contains a unique identifier for the task. Using > these two together, it's able to generate a sequence of numbers which won't > collide with other mappers (or reducers). > > For example, if a job has 20 (map or reduce) tasks, then task #1 will > generate the sequence (assuming the first id is zero): > > 0,20,40,60,80,100,120... > > and task #2 will generate > > 1,21,41,61,81,101,121... > > and so on. This approach works in both the mapper OR the reducer...you > just have to look at different variables. > > Michael > > > Mark Kerzner wrote: > >> Michael, >> >> environmental variables are available in Java, but the environment itself >> is >> not shared between instances. I read your code - you are solving exactly >> the >> same problem I am interested in - but I did not see how it works in >> distributed environment. >> >> By the way, it occurs to me that JavaSpaces, which is a different approach >> to distributed computing, trumpled by Hadoop, could be used here! Just run >> one instance with GigaSpaces at all times, and you got your self-increment >> for any number of jobs. It is perfect for concurrent processing and very >> fast. >> >> Thank you, >> Mark >> >> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[EMAIL PROTECTED] >> >wrote: >> >> I posted an approach to this using streaming, but if the environment >>> variables are available in standard Java interface, this may work for >>> you. >>> >>> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html >>> >>> You'll have to be able to tolerate some small gaps in the ids. >>> >>> Michael >>> >>> >>> Mark Kerzner wrote: >>> >>> >>>> Aaron, although your notes are not a ready solution, but they are a >>>> great >>>> help. >>>> >>>> Thank you, >>>> Mark >>>> >>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> >>>> wrote: >>>> >>>> There is no in-MapReduce mechanism for cross-task synchronization. >>>> You'll >>>> >>>>> need to use something like Zookeeper for this, or another external >>>>> database. >>>>> Note that this will greatly complicate your life. >>>>> >>>>> If I were you, I'd try to either redesign my pipeline elsewhere to >>>>> eliminate >>>>> this need, or maybe get really clever. For example, do your numbers >>>>> need >>>>> to >>>>> be sequential, or just unique? >>>>> >>>>> If the latter, then take the byte offset into the reducer's current >>>>> output >>>>> file and combine that with the reducer id (e.g., >>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that >>>>> they're >>>>> all >>>>> building unique sequences. If the former... rethink your pipeline? :) >>>>> >>>>> - Aaron >>>>> >>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >>>>> wrote: >>>>> >>>>> Hi, >>>>>> >>>>>> I need to number all output records consecutively, like, 1,2,3... >>>>>> >>>>>> This is no problem with one reducer, making recordId an instance >>>>>> >>>>> variable >>>>> >>>>>> in >>>>>> the Reducer class, and setting conf.setNumReduceTasks(1) >>>>>> >>>>>> However, it is an architectural decision forced by processing need, >>>>>> >>>>> where >>>>> >>>>>> the reducer becomes a bottleneck. Can I have a global variable for all >>>>>> reducers, which would give each the next consecutive recordId? In the >>>>>> database scenario, this would be the unique autokey. How to do it in
-
Re: How to give consecutive numbers to output records?brien colwell 2009-10-28, 18:07
Another approach is to initialize each map task with an ID (using
JavaSpaces, something like Zookeeper, or some aspect of the input data) and then pack that with a map-local counter into a global ID. This makes assumptions like the number of map tasks less than 2^10 and the number of records per mapper will be less than 2^53. The packed global IDs are consecutive per map task. If globally consecutive is needed, a second stage can create a histogram of map task ID -> number of records and use it to transform the global IDs to globally consecutive . Mark Kerzner wrote: > Michael, > > environmental variables are available in Java, but the environment itself is > not shared between instances. I read your code - you are solving exactly the > same problem I am interested in - but I did not see how it works in > distributed environment. > > By the way, it occurs to me that JavaSpaces, which is a different approach > to distributed computing, trumpled by Hadoop, could be used here! Just run > one instance with GigaSpaces at all times, and you got your self-increment > for any number of jobs. It is perfect for concurrent processing and very > fast. > > Thank you, > Mark > > On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[EMAIL PROTECTED]>wrote: > > >> I posted an approach to this using streaming, but if the environment >> variables are available in standard Java interface, this may work for you. >> >> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html >> >> You'll have to be able to tolerate some small gaps in the ids. >> >> Michael >> >> >> Mark Kerzner wrote: >> >> >>> Aaron, although your notes are not a ready solution, but they are a great >>> help. >>> >>> Thank you, >>> Mark >>> >>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> >>> wrote: >>> >>> There is no in-MapReduce mechanism for cross-task synchronization. You'll >>> >>>> need to use something like Zookeeper for this, or another external >>>> database. >>>> Note that this will greatly complicate your life. >>>> >>>> If I were you, I'd try to either redesign my pipeline elsewhere to >>>> eliminate >>>> this need, or maybe get really clever. For example, do your numbers need >>>> to >>>> be sequential, or just unique? >>>> >>>> If the latter, then take the byte offset into the reducer's current >>>> output >>>> file and combine that with the reducer id (e.g., >>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're >>>> all >>>> building unique sequences. If the former... rethink your pipeline? :) >>>> >>>> - Aaron >>>> >>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >>>> wrote: >>>> >>>> >>>>> Hi, >>>>> >>>>> I need to number all output records consecutively, like, 1,2,3... >>>>> >>>>> This is no problem with one reducer, making recordId an instance >>>>> >>>> variable >>>> >>>>> in >>>>> the Reducer class, and setting conf.setNumReduceTasks(1) >>>>> >>>>> However, it is an architectural decision forced by processing need, >>>>> >>>> where >>>> >>>>> the reducer becomes a bottleneck. Can I have a global variable for all >>>>> reducers, which would give each the next consecutive recordId? In the >>>>> database scenario, this would be the unique autokey. How to do it in >>>>> MapReduce? >>>>> >>>>> Thank you >>>>> >>>>> >>>> >>> > >
-
Re: How to give consecutive numbers to output records?Mark Kerzner 2009-10-28, 18:20
Brien,
- I am on EC2, what would be the advantage of using Zookeeper over JavaSpaces? Either would have to be maintained by me, as they are not provided on EC2 directly; - pack that with a map-local counter into a global ID - you mean, just take the global counter and make the local instance counter equal to it? - 2^53 is quite sufficient for my purposes, but where is the number coming from? - Looking at your last point, I saw what I have previously missed: I need numbers consecutive within each reducer, and then I need them consecutive between reducers. I assume that reducers are sorted. For example, if my records are sorted 1,2,...6, then one reducer would get maps 1,2,3, and the other one - maps 4,5,6. If that's the case, I need to know how the reducers are sorted. Then I could simply run the second stage. Thank you, Mark On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[EMAIL PROTECTED]> wrote: > Another approach is to initialize each map task with an ID (using > JavaSpaces, something like Zookeeper, or some aspect of the input data) and > then pack that with a map-local counter into a global ID. This makes > assumptions like the number of map tasks less than 2^10 and the number of > records per mapper will be less than 2^53. The packed global IDs are > consecutive per map task. If globally consecutive is needed, a second stage > can create a histogram of map task ID -> number of records and use it to > transform the global IDs to globally consecutive . > > > > > Mark Kerzner wrote: > >> Michael, >> >> environmental variables are available in Java, but the environment itself >> is >> not shared between instances. I read your code - you are solving exactly >> the >> same problem I am interested in - but I did not see how it works in >> distributed environment. >> >> By the way, it occurs to me that JavaSpaces, which is a different approach >> to distributed computing, trumpled by Hadoop, could be used here! Just run >> one instance with GigaSpaces at all times, and you got your self-increment >> for any number of jobs. It is perfect for concurrent processing and very >> fast. >> >> Thank you, >> Mark >> >> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[EMAIL PROTECTED] >> >wrote: >> >> >> >>> I posted an approach to this using streaming, but if the environment >>> variables are available in standard Java interface, this may work for >>> you. >>> >>> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html >>> >>> You'll have to be able to tolerate some small gaps in the ids. >>> >>> Michael >>> >>> >>> Mark Kerzner wrote: >>> >>> >>> >>>> Aaron, although your notes are not a ready solution, but they are a >>>> great >>>> help. >>>> >>>> Thank you, >>>> Mark >>>> >>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> >>>> wrote: >>>> >>>> There is no in-MapReduce mechanism for cross-task synchronization. >>>> You'll >>>> >>>> >>>>> need to use something like Zookeeper for this, or another external >>>>> database. >>>>> Note that this will greatly complicate your life. >>>>> >>>>> If I were you, I'd try to either redesign my pipeline elsewhere to >>>>> eliminate >>>>> this need, or maybe get really clever. For example, do your numbers >>>>> need >>>>> to >>>>> be sequential, or just unique? >>>>> >>>>> If the latter, then take the byte offset into the reducer's current >>>>> output >>>>> file and combine that with the reducer id (e.g., >>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that >>>>> they're >>>>> all >>>>> building unique sequences. If the former... rethink your pipeline? :) >>>>> >>>>> - Aaron >>>>> >>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >>>>> wrote: >>>>> >>>>> >>>>> >>>>>> Hi, >>>>>> >>>>>> I need to number all output records consecutively, like, 1,2,3... >>>>>> >>>>>> This is no problem with one reducer, making recordId an instance >>>>>> >>>>>> >>>>> variable >>>>> >>>>>
-
Re: How to give consecutive numbers to output records?Edward Capriolo 2009-10-28, 19:22
On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
> Brien, > > > - I am on EC2, what would be the advantage of using Zookeeper over > JavaSpaces? Either would have to be maintained by me, as they are not > provided on EC2 directly; > - pack that with a map-local counter into a global ID - you mean, just > take the global counter and make the local instance counter equal to it? > - 2^53 is quite sufficient for my purposes, but where is the number > coming from? > - Looking at your last point, I saw what I have previously missed: I need > numbers consecutive within each reducer, and then I need them consecutive > between reducers. I assume that reducers are sorted. For example, if my > records are sorted 1,2,...6, then one reducer would get maps 1,2,3, and the > other one - maps 4,5,6. If that's the case, I need to know how the reducers > are sorted. Then I could simply run the second stage. > > Thank you, > Mark > > On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[EMAIL PROTECTED]> wrote: > >> Another approach is to initialize each map task with an ID (using >> JavaSpaces, something like Zookeeper, or some aspect of the input data) and >> then pack that with a map-local counter into a global ID. This makes >> assumptions like the number of map tasks less than 2^10 and the number of >> records per mapper will be less than 2^53. The packed global IDs are >> consecutive per map task. If globally consecutive is needed, a second stage >> can create a histogram of map task ID -> number of records and use it to >> transform the global IDs to globally consecutive . >> >> >> >> >> Mark Kerzner wrote: >> >>> Michael, >>> >>> environmental variables are available in Java, but the environment itself >>> is >>> not shared between instances. I read your code - you are solving exactly >>> the >>> same problem I am interested in - but I did not see how it works in >>> distributed environment. >>> >>> By the way, it occurs to me that JavaSpaces, which is a different approach >>> to distributed computing, trumpled by Hadoop, could be used here! Just run >>> one instance with GigaSpaces at all times, and you got your self-increment >>> for any number of jobs. It is perfect for concurrent processing and very >>> fast. >>> >>> Thank you, >>> Mark >>> >>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[EMAIL PROTECTED] >>> >wrote: >>> >>> >>> >>>> I posted an approach to this using streaming, but if the environment >>>> variables are available in standard Java interface, this may work for >>>> you. >>>> >>>> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html >>>> >>>> You'll have to be able to tolerate some small gaps in the ids. >>>> >>>> Michael >>>> >>>> >>>> Mark Kerzner wrote: >>>> >>>> >>>> >>>>> Aaron, although your notes are not a ready solution, but they are a >>>>> great >>>>> help. >>>>> >>>>> Thank you, >>>>> Mark >>>>> >>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> >>>>> wrote: >>>>> >>>>> There is no in-MapReduce mechanism for cross-task synchronization. >>>>> You'll >>>>> >>>>> >>>>>> need to use something like Zookeeper for this, or another external >>>>>> database. >>>>>> Note that this will greatly complicate your life. >>>>>> >>>>>> If I were you, I'd try to either redesign my pipeline elsewhere to >>>>>> eliminate >>>>>> this need, or maybe get really clever. For example, do your numbers >>>>>> need >>>>>> to >>>>>> be sequential, or just unique? >>>>>> >>>>>> If the latter, then take the byte offset into the reducer's current >>>>>> output >>>>>> file and combine that with the reducer id (e.g., >>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that >>>>>> they're >>>>>> all >>>>>> building unique sequences. If the former... rethink your pipeline? :) >>>>>> >>>>>> - Aaron >>>>>> >>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[EMAIL PROTECTED]> >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Hi, >>> My two cents here. It seems like what you want is an atomic auto-id in order with no gaps. In particular the "no gaps" and "in order" requirement really constrains you because now you can't use the map/reduce id s/hostname etc to generate distinct keys. If you have these requirements "in order" "no gaps". The best approach may just be to use one mapper/reducer. Take a step back and look at the big picture. No matter how much software you add, you asking for a single/serialized/locking "IDFACTORY". Doing this multinode zookeeper locking "IDFACTORY" is probably going to be more work and slower then just doing it in a single thread mapper or reducer.
-
Re: How to give consecutive numbers to output records?Mark Kerzner 2009-10-28, 19:29
I agree, Edward. One Reducer is what I have now, and it works. It is, or may
become, a bottleneck. When it does, I can go back to using multiple reducers, and add a second MR job to renumber. That way, I can stick with the situation until it becomes necessary to optimize, and not introduce any changes or new software - for, as you say, it might become a pain. Follow the rule (which I got from Scott Meyers' Effective C++) - avoid premature optimization. mark On Wed, Oct 28, 2009 at 2:22 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner <[EMAIL PROTECTED]> > wrote: > > Brien, > > > > > > - I am on EC2, what would be the advantage of using Zookeeper over > > JavaSpaces? Either would have to be maintained by me, as they are not > > provided on EC2 directly; > > - pack that with a map-local counter into a global ID - you mean, just > > take the global counter and make the local instance counter equal to > it? > > - 2^53 is quite sufficient for my purposes, but where is the number > > coming from? > > - Looking at your last point, I saw what I have previously missed: I > need > > numbers consecutive within each reducer, and then I need them > consecutive > > between reducers. I assume that reducers are sorted. For example, if my > > records are sorted 1,2,...6, then one reducer would get maps 1,2,3, and > the > > other one - maps 4,5,6. If that's the case, I need to know how the > reducers > > are sorted. Then I could simply run the second stage. > > > > Thank you, > > Mark > > > > On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[EMAIL PROTECTED]> > wrote: > > > >> Another approach is to initialize each map task with an ID (using > >> JavaSpaces, something like Zookeeper, or some aspect of the input data) > and > >> then pack that with a map-local counter into a global ID. This makes > >> assumptions like the number of map tasks less than 2^10 and the number > of > >> records per mapper will be less than 2^53. The packed global IDs are > >> consecutive per map task. If globally consecutive is needed, a second > stage > >> can create a histogram of map task ID -> number of records and use it to > >> transform the global IDs to globally consecutive . > >> > >> > >> > >> > >> Mark Kerzner wrote: > >> > >>> Michael, > >>> > >>> environmental variables are available in Java, but the environment > itself > >>> is > >>> not shared between instances. I read your code - you are solving > exactly > >>> the > >>> same problem I am interested in - but I did not see how it works in > >>> distributed environment. > >>> > >>> By the way, it occurs to me that JavaSpaces, which is a different > approach > >>> to distributed computing, trumpled by Hadoop, could be used here! Just > run > >>> one instance with GigaSpaces at all times, and you got your > self-increment > >>> for any number of jobs. It is perfect for concurrent processing and > very > >>> fast. > >>> > >>> Thank you, > >>> Mark > >>> > >>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt < > [EMAIL PROTECTED] > >>> >wrote: > >>> > >>> > >>> > >>>> I posted an approach to this using streaming, but if the environment > >>>> variables are available in standard Java interface, this may work for > >>>> you. > >>>> > >>>> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09079.html > >>>> > >>>> You'll have to be able to tolerate some small gaps in the ids. > >>>> > >>>> Michael > >>>> > >>>> > >>>> Mark Kerzner wrote: > >>>> > >>>> > >>>> > >>>>> Aaron, although your notes are not a ready solution, but they are a > >>>>> great > >>>>> help. > >>>>> > >>>>> Thank you, > >>>>> Mark > >>>>> > >>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[EMAIL PROTECTED]> > >>>>> wrote: > >>>>> > >>>>> There is no in-MapReduce mechanism for cross-task synchronization. > >>>>> You'll > >>>>> > >>>>> > >>>>>> need to use something like Zookeeper for this, or another external > >>>>>> database. > >>>>>> Note that this will greatly complicate your life. |