
MapReduce, mail # user - Cluster wide atomic operations


RE: Cluster wide atomic operations
David Parks 2012-10-29, 06:54
That's a very helpful discussion, thank you.

I'd like to go with assigning blocks of IDs to each reducer. Snowflake would require external changes that are a pain; I'd rather make my job fit our current constraints.

Is there a way to get an index number for each reducer, such that I can identify which block of IDs to assign to each one?
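
A minimal sketch of what I have in mind, assuming the reducer's partition number (available from its task ID in the newer mapreduce API) can serve as that index; BLOCK_SIZE is an arbitrary illustrative size that a real job would have to match to the worst-case number of new items per reducer:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: reducer number p claims the half-open ID range
// [p * BLOCK_SIZE, (p + 1) * BLOCK_SIZE), so no coordination is needed.
public class BlockIdReducer extends Reducer<Text, Text, Text, LongWritable> {
    private static final long BLOCK_SIZE = 10000L; // illustrative block size
    private long nextId;
    private long blockEnd;

    @Override
    protected void setup(Context context) {
        // Partition number of this reducer: 0 .. numReduceTasks - 1.
        // Task retries reuse the same number, so a rerun reuses the same block.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        nextId = partition * BLOCK_SIZE;
        blockEnd = nextId + BLOCK_SIZE; // exclusive upper bound
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (nextId >= blockEnd) {
            throw new IllegalStateException("ID block exhausted for this reducer");
        }
        context.write(key, new LongWritable(nextId++));
    }
}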

 

Thanks,
David

From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2012 12:58 PM
To: [EMAIL PROTECTED]
Subject: Re: Cluster wide atomic operations

On Sun, Oct 28, 2012 at 9:15 PM, David Parks <[EMAIL PROTECTED]> wrote:

I need a unique & permanent ID assigned to each new item encountered, with the constraint that it be in a limited range; let's say, for simple discussion, one to one million.

Having such a limited range may require a central service to generate IDs. The use of a central service can be disastrous for throughput.

I suppose I could assign a range of usable IDs to each reduce task (where IDs are assigned) and keep those organized somehow at the end of the job, but this seems clunky too.

Yes. Much better.

Since this is on AWS, ZooKeeper is not a good option. I thought it was part of the Hadoop cluster (and thus easy to access), but I guess I was wrong there.

No. This is specifically not part of Hadoop for performance reasons.

I would think that such a service would most logically run on the taskmaster server. I'm surprised this isn't a common issue. I guess I could launch a separate job that runs such a sequence service, but that's non-trivial itself, given the failure concerns.

The problem is that a serial-number service is a major loss of performance in a parallel system. Unless you relax the idea considerably (by allowing blocks, or by using lots of bits as Snowflake does), you wind up with a round trip per ID and a critical section on the ID generator. This is bad.

Look up Amdahl's Law.
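
(To put rough numbers on it: Amdahl's Law says that with parallel fraction p on N workers, speedup = 1 / ((1 - p) + p / N). So if even 1% of the total work serializes behind the ID generator, p = 0.99 and the speedup can never exceed 1 / 0.01 = 100x, no matter how many nodes you add. The 1% figure is only an illustration.)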

 

Perhaps there's just a better way of thinking about this?

Yes. Use lots of bits and be satisfied with uniqueness rather than perfect ordering and limited range.

As the other respondent said, look up Snowflake.
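
For reference, a minimal sketch of the Snowflake idea (not from this thread): pack a millisecond timestamp, a worker ID, and a per-worker sequence into one 64-bit long, so each worker mints unique, roughly time-ordered IDs with no coordination, provided every worker is assigned a distinct worker ID. The bit widths below follow Twitter's published 41/10/12 layout, but the epoch constant is arbitrary and the clock-skew handling is simplified for illustration.

// Sketch of a Snowflake-style 64-bit ID generator.
public class SnowflakeSketch {
    private static final long EPOCH = 1351500000000L; // arbitrary custom epoch (ms)
    private final long workerId;       // 0..1023, must be unique per node
    private long lastTimestamp = -1L;
    private long sequence = 0L;        // 0..4095 within one millisecond

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId & 0x3FFL;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFFL;
            if (sequence == 0) { // 4096 IDs issued this millisecond: spin to the next one
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        // A real implementation must also handle the clock moving backwards.
        lastTimestamp = now;
        // 41 bits of timestamp | 10 bits of worker ID | 12 bits of sequence.
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}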