David Parks
2012-10-27, 03:07
Ted Dunning
2012-10-27, 05:22
omn izzy
2012-10-27, 21:16
Kev Kilroy
2012-10-27, 21:23
David Parks
2012-10-29, 01:15
Taeho Kang
2012-10-29, 01:32
Michael Katzenellenbogen
2012-10-29, 01:33
Ted Dunning
2012-10-29, 05:58
David Parks
2012-10-29, 06:54
Steve Loughran
2012-10-29, 10:15
-
Cluster wide atomic operations
David Parks 2012-10-27, 03:07
How can we manage cluster-wide atomic operations, such as maintaining an auto-increment counter?
Does Hadoop provide native support for these kinds of operations?
And in case the ultimate answer involves ZooKeeper, I'd love to work out doing this in AWS/EMR.
-
Re: Cluster wide atomic operations
Ted Dunning 2012-10-27, 05:22
This is better asked on the ZooKeeper lists.
The first answer is that global atomic operations are generally a bad idea.
The second answer is that if you can batch these operations up, then you can cut the evilness of global atomicity by a substantial factor.
Are you sure you need a global counter?
-
Re: Cluster wide atomic operations
omn izzy 2012-10-27, 21:16
test
-
Re: Cluster wide atomic operations
Kev Kilroy 2012-10-27, 21:23
z
-
RE: Cluster wide atomic operations
David Parks 2012-10-29, 01:15
I need a unique and permanent ID assigned to each new item encountered, with the constraint that it falls in a range of, let's say for simple discussion, one to one million.
I suppose I could assign a range of usable IDs to each reduce task (where IDs are assigned) and keep those organized somehow at the end of the job, but this seems clunky too.
Since this is on AWS, ZooKeeper is not a good option. I thought it was part of the Hadoop cluster (and thus easy to access), but I guess I was wrong there.
I would think that such a service would run most logically on the taskmaster server. I'm surprised this isn't a common issue. I guess I could launch a separate job that runs such a sequence service, but that is nontrivial in itself, with failure concerns.
Perhaps there's just a better way of thinking about this?
-
Re: Cluster wide atomic operations
Taeho Kang 2012-10-29, 01:32
Hello, David,
How about using something like Redis for that matter? http://redis.io
There are services like RedisToGo (https://redistogo.com/), which also run on AWS and are very easy to get started with. Sign up, make a few clicks, and you are set to go.
-
Re: Cluster wide atomic operations
Michael Katzenellenbogen 2012-10-29, 01:33
Twitter's Snowflake may provide you with some inspiration:
https://github.com/twitter/snowflake
-Michael
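For readers who have not looked at Snowflake, the idea can be sketched in a few lines of Java. This is a minimal illustration, not Snowflake's actual code: the class name `SnowflakeStyleIds` and the epoch passed to the constructor are hypothetical, and the bit layout (41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence) is assumed from Snowflake's published format. Each worker generates IDs locally with no round trip, at the cost of a 64-bit ID space, so it does not satisfy a one-to-one-million range.

```java
final class SnowflakeStyleIds {
    // Assumed layout: [41 bits millis since epoch][10 bits worker id][12 bits sequence]
    static final int WORKER_BITS = 10;
    static final int SEQUENCE_BITS = 12;
    static final long MAX_WORKER = (1L << WORKER_BITS) - 1;
    static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long epochMillis; // hypothetical custom epoch chosen by the caller
    private final long workerId;    // must be unique per generator, e.g. per task
    private long lastMillis = -1L;
    private long sequence = 0L;

    SnowflakeStyleIds(long epochMillis, long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER) {
            throw new IllegalArgumentException("workerId out of range: " + workerId);
        }
        this.epochMillis = epochMillis;
        this.workerId = workerId;
    }

    /** Packs the three fields into one long; pure function, easy to test. */
    static long compose(long millisSinceEpoch, long workerId, long sequence) {
        return (millisSinceEpoch << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    /** Returns an ID that is unique across workers and increasing within this worker. */
    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastMillis) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastMillis) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: spin until the clock advances.
                while (now <= lastMillis) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return compose(now - epochMillis, workerId, sequence);
    }

    public static void main(String[] args) {
        SnowflakeStyleIds gen = new SnowflakeStyleIds(0L, 7L);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId());
    }
}
```

The only coordination needed is assigning each generator a distinct worker ID up front.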
-
Re: Cluster wide atomic operations
Ted Dunning 2012-10-29, 05:58
On Sun, Oct 28, 2012 at 9:15 PM, David Parks <[EMAIL PROTECTED]> wrote:
> I need a unique & permanent ID assigned to new item encountered, which has
> a constraint that it is in the range of, let’s say for simple discussion,
> one to one million.

Having such a limited range may require that you have a central service to generate IDs. The use of a central service can be disastrous for throughput.

> I suppose I could assign a range of usable IDs to each reduce task
> (where ID’s are assigned) and keep those organized somehow at the end of
> the job, but this seems clunky too.

Yes. Much better.

> Since this is on AWS, zookeeper is not a good option. I thought it was
> part of the hadoop cluster (and thus easy to access), but guess I was wrong
> there.

No. This is specifically not part of Hadoop for performance reasons.

> I would think that such a service would run most logically on the
> taskmaster server. I’m surprised this isn’t a common issue. I guess I could
> launch a separate job that runs such a sequence service perhaps. But that’s
> non trivial its self with failure concerns.

The problem is that a serial number service is a major loss of performance in a parallel system. Unless you relax the idea considerably (by allowing blocks, or by having lots of bits like Snowflake), you wind up with a round trip per ID and a critical section on the ID generator. This is bad. Look up Amdahl's Law.

> Perhaps there’s just a better way of thinking of this?

Yes. Use lots of bits and be satisfied with uniqueness rather than perfect ordering and limited range. As the other respondent said, look up Snowflake.
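The "allowing blocks" relaxation mentioned above can be sketched as follows. This is an illustrative Java sketch under stated assumptions: `BlockIdAllocator` and `CentralCounter` are hypothetical names, and the central counter is simulated in-process with an `AtomicLong` where a real deployment would use some shared service. The point is only that each client pays one round trip per block instead of one per ID.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BlockIdAllocator {
    /** In-process stand-in for the central service (an assumption for illustration). */
    static final class CentralCounter {
        private final AtomicLong next;
        private final long maxId;
        private final AtomicLong roundTrips = new AtomicLong();

        CentralCounter(long firstId, long maxId) {
            this.next = new AtomicLong(firstId);
            this.maxId = maxId;
        }

        /** Reserves the block [start, start + size) and returns start. */
        long reserveBlock(int size) {
            roundTrips.incrementAndGet();
            long start = next.getAndAdd(size);
            if (start + size - 1 > maxId) {
                throw new IllegalStateException("ID range exhausted");
            }
            return start;
        }

        long roundTrips() {
            return roundTrips.get();
        }
    }

    private final CentralCounter central;
    private final int blockSize;
    private long nextLocal = 0;     // next ID to hand out from the current block
    private long endExclusive = 0;  // one past the last ID of the current block

    BlockIdAllocator(CentralCounter central, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.central = central;
        this.blockSize = blockSize;
    }

    /** Hands out IDs locally; contacts the central counter only when a block runs out. */
    long nextId() {
        if (nextLocal == endExclusive) {
            nextLocal = central.reserveBlock(blockSize);
            endExclusive = nextLocal + blockSize;
        }
        return nextLocal++;
    }

    public static void main(String[] args) {
        CentralCounter central = new CentralCounter(1L, 1_000_000L);
        BlockIdAllocator task = new BlockIdAllocator(central, 1000);
        for (int i = 0; i < 2500; i++) {
            task.nextId();
        }
        System.out.println("round trips: " + central.roundTrips());
    }
}
```

The trade-off is gaps: IDs left in a partially used block are never issued, which matters when the range is as tight as one million.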
-
RE: Cluster wide atomic operations
David Parks 2012-10-29, 06:54
That's a very helpful discussion. Thank you.
I'd like to go with assigning blocks of IDs for each reducer. Snowflake would require external changes that are a pain; I'd rather make my job fit our current constraints.
Is there a way to get an index number for each reducer such that I could identify which block of IDs to assign each one?
Thanks,
David
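One way to picture the per-reducer block idea is a pure function from reducer index to ID range. The sketch below is hypothetical (`ReducerIdRanges` and `rangeFor` are illustrative names, not Hadoop API). In the `org.apache.hadoop.mapreduce` API, the zero-based reducer index is usually available as `context.getTaskAttemptID().getTaskID().getId()` and the reducer count as `context.getNumReduceTasks()`; verify both against the Hadoop version in use.

```java
final class ReducerIdRanges {
    /**
     * Splits [minId, maxId] into numReducers contiguous, non-overlapping ranges
     * and returns {firstId, lastId} (both inclusive) for the given reducer.
     */
    static long[] rangeFor(int reducerIndex, int numReducers, long minId, long maxId) {
        if (numReducers <= 0 || reducerIndex < 0 || reducerIndex >= numReducers) {
            throw new IllegalArgumentException("bad reducer index or count");
        }
        long total = maxId - minId + 1;
        if (total < numReducers) {
            throw new IllegalArgumentException("fewer ids than reducers");
        }
        long base = total / numReducers;   // every reducer gets at least this many
        long extra = total % numReducers;  // the first `extra` reducers get one more
        long start = minId + reducerIndex * base + Math.min(reducerIndex, extra);
        long size = base + (reducerIndex < extra ? 1 : 0);
        return new long[] { start, start + size - 1 };
    }

    public static void main(String[] args) {
        int numReducers = 3;
        for (int i = 0; i < numReducers; i++) {
            long[] r = rangeFor(i, numReducers, 1L, 1_000_000L);
            System.out.println("reducer " + i + ": " + r[0] + " to " + r[1]);
        }
    }
}
```

Because the IDs must be permanent, a later job run would have to start from wherever the previous run's ranges ended rather than reuse them; that bookkeeping is outside this sketch.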
-
Re: Cluster wide atomic operations
Steve Loughran 2012-10-29, 10:15
On 29 October 2012 01:15, David Parks <[EMAIL PROTECTED]> wrote:
> I need a unique & permanent ID assigned to new item encountered, which has
> a constraint that it is in the range of, let’s say for simple discussion,
> one to one million.

I'd go for UUID generation, which you can do in parallel, though it doesn't meet your range requirements.
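For completeness, parallel UUID generation in Java needs nothing beyond the standard library. A minimal sketch, with the hypothetical wrapper class `UuidIds`; `java.util.UUID.randomUUID()` produces a random (version 4) 128-bit identifier with no coordination between tasks.

```java
import java.util.UUID;

final class UuidIds {
    /** Returns a random (version 4) UUID in its 36-character string form. */
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Each task can call this independently; no shared counter is involved.
        System.out.println(newId());
        System.out.println(newId());
    }
}
```

Uniqueness here is probabilistic rather than guaranteed, and the values carry no ordering, which is the trade-off against a bounded numeric range.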