On Sun, Oct 28, 2012 at 9:15 PM, David Parks <[EMAIL PROTECTED]> wrote:
> I need a unique & permanent ID assigned to each new item encountered, which has
> a constraint that it is in the range of, let’s say for simple discussion,
> one to one million.
Having such a limited range may require that you have a central service to
generate IDs. The use of a central service can be disastrous for the
performance of a parallel job.
> ** I suppose I could assign a range of usable IDs to each reduce task
> (where ID’s are assigned) and keep those organized somehow at the end of
> the job, but this seems clunky too.
Yes. Much better.
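The block-allocation idea can be sketched as follows: each reduce task owns a disjoint range of IDs derived from its task number, so no coordination is needed at assignment time. This is a minimal illustration (the class and names are hypothetical, not part of Hadoop):

```java
// Hypothetical per-task ID block allocator.
// Reduce task i owns the disjoint range [i * blockSize, (i + 1) * blockSize),
// so tasks never need to coordinate while assigning IDs.
public class BlockIdAllocator {
    private final long blockEnd; // exclusive upper bound of this task's block
    private long next;           // next ID to hand out

    public BlockIdAllocator(int taskId, long blockSize) {
        this.next = (long) taskId * blockSize;
        this.blockEnd = this.next + blockSize;
    }

    public long nextId() {
        if (next >= blockEnd) {
            // With a hard one-million cap, a task that exhausts its block
            // would need a fallback (e.g. a spare-block pool).
            throw new IllegalStateException("task exhausted its ID block");
        }
        return next++;
    }
}
```

The cost of this scheme is fragmentation: a task that assigns few IDs still consumes a whole block, which matters when the total range is as tight as one million.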
> Since this is on AWS, zookeeper is not a good option. I thought it was
> part of the hadoop cluster (and thus easy to access), but guess I was wrong
No. This is specifically not part of Hadoop for performance reasons.
> ** I would think that such a service would run most logically on the
> taskmaster server. I’m surprised this isn’t a common issue. I guess I could
> launch a separate job that runs such a sequence service perhaps. But that’s
> non-trivial itself, with failure concerns.
The problem is that a serial number service is a major loss of performance
in a parallel system. Unless you relax the idea considerably (by allocating
blocks of IDs, or using lots of bits as Snowflake does), you wind up with a
round-trip per ID and a critical section on the ID generator.
This is bad.
Look up Amdahl's Law.
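To make the Amdahl's Law point concrete: if a fraction p of the job parallelizes and the rest (the ID-service critical section) is serial, the best possible speedup on n workers is 1 / ((1 - p) + p / n). A tiny helper shows how quickly that caps out:

```java
// Amdahl's Law: with fraction p of the work parallelizable over n workers,
// overall speedup is bounded by 1 / ((1 - p) + p / n).
public class Amdahl {
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }
}
```

Even if only 5% of the job is stuck serializing on the ID generator (p = 0.95), a thousand workers buy you a speedup of under 20x, not 1000x.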
> ** Perhaps there’s just a better way of thinking of this?
Yes. Use lots of bits and be satisfied with uniqueness rather than perfect
ordering and limited range.
As the other respondent said, look up Snowflake.
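For reference, a Snowflake-style ID packs a timestamp, a worker number, and a per-millisecond sequence into 64 bits, trading the small-range requirement for coordination-free uniqueness. A rough sketch using the commonly cited bit widths (41-bit timestamp, 10-bit worker, 12-bit sequence); this is an illustration, not the real implementation:

```java
// Sketch of a Snowflake-style 64-bit ID generator.
// Layout (high to low): 41-bit millisecond timestamp since a custom epoch,
// 10-bit worker ID, 12-bit sequence that resets each millisecond.
public class SnowflakeSketch {
    static final long EPOCH = 1288834974657L; // arbitrary custom epoch (ms)
    private final long workerId;              // 0..1023, unique per generator
    private long lastMillis = -1L;
    private long sequence = 0L;               // 0..4095 within one millisecond

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId(long nowMillis) {
        if (nowMillis == lastMillis) {
            sequence = (sequence + 1) & 0xFFF; // 12-bit wrap
        } else {
            sequence = 0;
            lastMillis = nowMillis;
        }
        return ((nowMillis - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```

The real implementation also blocks until the next millisecond when the 12-bit sequence overflows; that detail is omitted here. The key property is that each worker generates IDs locally with no round-trip, at the cost of 64-bit IDs rather than a compact one-to-one-million range.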