The approach you suggest is similar to what I am currently doing but it
requires you to size the partitions to the memory available on the reducer.
This is a non-trivial task and is not necessarily guaranteed to scale. It is
true that the simplest approach is to break one of the sets into
sufficiently small partitions to hold a partition in memory and then
generate the Cartesian product but it is
a hack and makes assumptions about partition size.
One elegant solution would involve an ability to restart one of the input
splitters and replay the input data from set A several times until the
mapper had generated all sets of the form <key,(ai,bj>
On Wed, Jun 22, 2011 at 5:13 PM, Jason <[EMAIL PROTECTED]> wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the data sets. At high
> level these are the steps:
> Suppose you decide to partition set A.
> Each partition represents a subset/range of the A keys and must be small
> enough to fit records in memory.
> Each partition gets sent to a separate reducer by the mapper and
> partitioner logic.
> The second data set B then is *duplicated* for each of the reducers again
> using some trivial logic in mapper and partitioner.
> This assumes that the reducers can process record from both A and B sets.
> Also all records from A preceed ones from B which is trivially done by sort
> When a reducer receives a record from A set, it stores it in memory.
> When a record from set B arrives, the cross product is computed with all A
> records already in memory and results are emitted.
> The job should scale in space as long as you have enough reducers assigned
> and will scale in time with more reducer machines.
> Sent from my iPhone
> On Jun 22, 2011, at 3:16 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> > Assume I have two data sources A and B
> > Assume I have an input format and can generate key values for both A and
> > I want an algorithm which will generate the cross product of all values
> in A having the key K and all values in B having the
> > key K.
> > Currently I use a mapper to generate key values for A and have the
> reducer get all values in B with key K and hold them in memory.
> > It works but might not scale.
> > Any bright ideas?
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033