-
Is there a way to ensure that different jobs have the same number of reducers
Steve Lewis 2011-06-30, 01:05
I am trying to run an application that generates the Cartesian product of two potentially large data sets. In reality I only need the Cartesian product of the values that share a particular integer key. I am considering a design where, in the first job, the mappers run through the values of set A, emitting that integer as the key and the item as the value. The reducers are simple identity reducers. In the second job the mappers run through set B, again emitting the integer key and the item as the value. The reducers then read the output of the first job to run through the values of A.

The design assumes that if both jobs use the same hashing partitioner and the same number of reducers, a specific reducer, say reducer 12, will receive the same keys in both jobs, and thus part-r-00012 from the first job is the only file reducer 12 will need to read. Can I guarantee (without restricting the number of reducers to a smaller number than the cluster will support) that this condition is met, namely that the keys in the second job hit the same reducer number as in the first job? What about restarts and failures?

BTW, is there any way to find out the size of a cluster?
-- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
-
Re: Is there a way to ensure that different jobs have the same number of reducers
Trevor Adams 2011-06-30, 01:11
Exact same bucket is possible; exact same machine (if that is what you had in mind) probably not. The partitioner breaks the data up for the reducers, so keys that map to the same partition will be handled by the same reducer. If you can partition the data such that the output of one reducer maps to a single bucket and is not split, then you can get all of that data going to one reducer. Doing it this way means there needs to be some transient property that carries over from the step 1 reducer through the step 2 mapper. Most cases, I would assume, do not have that property.
-Trevor
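
To make the "same bucket" point concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of the arithmetic used by Hadoop's default HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. It assumes the integer key is emitted as an IntWritable, whose hashCode() is the int value itself. The class and method names (PartitionSketch, partitionFor, partFileFor) and the sample key 1012 are hypothetical and only for illustration.

```java
import java.util.Locale;

public class PartitionSketch {

    // Mirrors the default HashPartitioner formula.
    // Assumption: the key is an IntWritable, so its hash code equals the int value.
    public static int partitionFor(int key, int numReduceTasks) {
        int hash = Integer.hashCode(key);
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is non-negative.
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Name of the reducer output file for that partition (part-r-NNNNN naming).
    public static String partFileFor(int key, int numReduceTasks) {
        return String.format(Locale.ROOT, "part-r-%05d", partitionFor(key, numReduceTasks));
    }

    public static void main(String[] args) {
        int key = 1012; // hypothetical sample key

        // Both jobs configured with 100 reducers: the key lands in the same bucket.
        System.out.println(partFileFor(key, 100)); // part-r-00012
        System.out.println(partFileFor(key, 100)); // part-r-00012

        // A job configured with 64 reducers puts the same key in a different bucket.
        System.out.println(partFileFor(key, 64));  // part-r-00052
    }
}
```

The bucket depends only on the key's hash and the reducer count, so the two jobs line up only if both are configured with the same partitioner and the same count, for example by calling Job.setNumReduceTasks with the same value in each job. Which machine runs a given partition is not determined by this formula.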