Twitter's Snowflake may provide you with some inspiration:
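For context (not part of the thread): Snowflake-style IDs avoid any global counter by packing a millisecond timestamp, a per-node worker id, and a per-millisecond sequence into one 64-bit value. A minimal sketch, assuming Twitter's published 41/10/12-bit layout; the class name and bit widths here are illustrative, not the actual Snowflake code:

```java
// Sketch of a Snowflake-style ID generator (assumed layout: 41-bit
// millisecond timestamp, 10-bit worker id, 12-bit per-ms sequence).
public class SnowflakeSketch {
    private final long workerId;   // must be unique per node, 0..1023
    private long lastMillis = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF;      // 4096 IDs per ms per worker
            if (sequence == 0) {                     // sequence exhausted:
                while (now <= lastMillis) {          // spin until the next ms
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return (now << 22) | (workerId << 12) | sequence;
    }
}
```

Note the trade-off against the original question: Snowflake IDs are unique and roughly time-ordered with no coordination at all, but they are sparse 64-bit values, so they cannot satisfy a dense "one to one million" range constraint on their own.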
On Oct 28, 2012, at 9:16 PM, David Parks <[EMAIL PROTECTED]> wrote:
I need a unique & permanent ID assigned to each new item encountered, with
the constraint that it falls in a fixed range of, let's say for simple
discussion, one to one million.
I suppose I could assign a range of usable IDs to each reduce task (where
the IDs are assigned) and keep those ranges organized somehow at the end of
the job, but this seems clunky too.
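The range-per-task idea is actually coordination-free if each task derives its slice deterministically from its own task index, which every reduce task already knows. A hypothetical sketch (class and parameter names are mine, not from the thread):

```java
// Sketch: each reduce task owns a deterministic slice of [1, maxId],
// derived from its partition index -- no shared state needed at runtime.
public class RangeAllocator {
    private long next;        // next unassigned ID in this task's slice
    private final long end;   // exclusive upper bound of the slice

    // taskIndex would come from the task itself, e.g. the reducer's
    // partition number; numTasks and maxId are job configuration.
    public RangeAllocator(int taskIndex, int numTasks, long maxId) {
        long sliceSize = maxId / numTasks;
        this.next = (long) taskIndex * sliceSize + 1;   // IDs start at 1
        // Last task absorbs the remainder when maxId % numTasks != 0.
        this.end = (taskIndex == numTasks - 1) ? maxId + 1 : next + sliceSize;
    }

    public long nextId() {
        if (next >= end) {
            throw new IllegalStateException("task's ID range exhausted");
        }
        return next++;
    }
}
```

The "clunky" part the author mentions remains real: unused IDs in each slice are wasted, and a rerun of a failed task must reuse the same slice (which speculative execution complicates), so this only works when over-allocation within the one-million budget is acceptable.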
Since this is on AWS, ZooKeeper is not a good option. I thought it was part
of the Hadoop cluster (and thus easy to access), but I guess I was wrong.
I would think that such a service would most logically run on the
taskmaster server. I'm surprised this isn't a common issue. I guess I could
launch a separate job that runs such a sequence service, but that's
non-trivial itself given the failure concerns.
Perhaps there’s just a better way of thinking of this?
*From:* Ted Dunning [mailto:[EMAIL PROTECTED]]
*Sent:* Saturday, October 27, 2012 12:23 PM
*To:* [EMAIL PROTECTED]
*Subject:* Re: Cluster wide atomic operations
This is better asked on the Zookeeper lists.
The first answer is that global atomic operations are generally a bad idea.
The second answer is that if you can batch these operations up, then you
can cut the evilness of global atomicity by a substantial factor.
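Batching works because each worker performs one global operation per block of IDs instead of one per ID. A sketch of the idea, with an `AtomicLong` standing in for whatever shared counter is actually used (e.g. a ZooKeeper-backed counter); names and the batch size are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of batched allocation: reserve a block of IDs with a single
// global increment, then hand them out locally with no coordination.
public class BatchedIdSource {
    private static final int BATCH = 1000;   // one global op per 1000 IDs
    private final AtomicLong global;          // stands in for the shared counter
    private long next = 0, end = 0;           // current locally-owned block

    public BatchedIdSource(AtomicLong global) {
        this.global = global;
    }

    public long nextId() {
        if (next == end) {                    // local block exhausted:
            end = global.addAndGet(BATCH);    // one atomic op reserves BATCH IDs
            next = end - BATCH;
        }
        return next++;
    }
}
```

The cost is that a worker which dies mid-block leaks the rest of its reserved IDs, so this trades a small amount of ID-space waste for a 1000x reduction in contention on the global counter.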
Are you sure you need a global counter?
On Fri, Oct 26, 2012 at 11:07 PM, David Parks <[EMAIL PROTECTED]> wrote:
How can we manage cluster-wide atomic operations? Such as maintaining an
atomic counter? Does Hadoop provide native support for these kinds of
operations?
And in case the ultimate answer involves ZooKeeper, I'd love to work out
how to do this in AWS/EMR.