Is there a way to ensure that different jobs have the same number of reducers?
I am trying to run an application that generates the Cartesian product of two
potentially large data sets. In reality I only need the Cartesian product of
the values sharing a particular integer key. I am considering a design where,
in the first job, the mappers run through the values of set A, emitting that
integer as the key and the item as the value; the reducers are simple identity
reducers.
In the second job the mappers run through set B, emitting the same integer as
the key and the item as the value. The reducers read the output of the first
job in order to run through the values of A.
One issue: assuming the same hashing partitioner is used and both jobs have
the same number of reducers, a specific reducer, say reducer 12, will receive
the same keys in both jobs, and thus part-r-00012 from the first job is the
only file reducer 12 will need to read.
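To illustrate the assumption I am making, here is a small sketch of the partition computation as I understand Hadoop's default HashPartitioner to perform it (the class name PartitionSketch and the helper partitionFor are mine, not Hadoop's): with the same key and the same reducer count, the reducer index is fully deterministic.

```java
// Sketch modeled on Hadoop's default HashPartitioner.getPartition:
// mask off the sign bit of the key's hashCode, then mod by the reducer count.
// With the same key and the same number of reducers, the result is identical
// across jobs.
public class PartitionSketch {
    // Mimics HashPartitioner's computation (assumption, not Hadoop's actual class).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 20;
        // The same integer key lands on the same reducer index in any job
        // configured with this partitioner and the same reducer count.
        int jobA = partitionFor(Integer.valueOf(12), reducers);
        int jobB = partitionFor(Integer.valueOf(12), reducers);
        System.out.println(jobA == jobB); // prints true
    }
}
```

So the condition holds only as long as the reducer count is pinned to the same value in both jobs, which is exactly what I am unsure the framework guarantees.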
Can I guarantee (without restricting the number of reducers to a number
smaller than the cluster will support) that this condition is met - namely,
that the keys in the second job hit the same reducer number as in the first
job? What about restarts and failures?
BTW, is there any way to find out the size of a cluster?
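A sketch of one possible way, assuming the classic org.apache.hadoop.mapred API's JobClient.getClusterStatus() call is the right entry point (untested here, since it needs a live cluster):

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
JobClient client = new JobClient(conf);
ClusterStatus status = client.getClusterStatus();
int trackers = status.getTaskTrackers();      // number of live task trackers
int reduceSlots = status.getMaxReduceTasks(); // total reduce-slot capacity
```

Is that the intended way to size the reducer count against the cluster, or is there something better?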
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033