If you have scaling problems, check out the Mahout project. It is all
about distributed, scalable linear algebra and more.
On Wed, Jun 22, 2011 at 5:13 PM, Jason <[EMAIL PROTECTED]> wrote:
> I remember I had a similar problem.
> The way I approached it was to partition one of the data sets. At a high level, these are the steps (a rough code sketch follows these steps):
> Suppose you decide to partition set A.
> Each partition represents a subset/range of the A keys and must be small enough for its records to fit in memory.
> Each partition gets sent to a separate reducer by the mapper and partitioner logic.
> The second data set, B, is then *duplicated* to each of the reducers, again using some trivial logic in the mapper and partitioner.
> This assumes that the reducers can process records from both the A and B sets. Also, all records from A must precede those from B, which is trivially arranged by the sort comparator.
> When a reducer receives a record from the A set, it stores it in memory.
> When a record from set B arrives, the cross product is computed with all the A records already in memory, and the results are emitted.
> The job should scale in space as long as enough reducers are assigned, and it will scale in time as more reducer machines are added.
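> Here is a rough sketch of the idea in Java against Hadoop's org.apache.hadoop.mapreduce API. The key layout "partition|joinKey|tag", the extractKey() stub, NUM_PARTITIONS, and the tab-separated record format are all illustrative assumptions of mine, not anything standard; it also assumes the join key contains neither '|' nor a tab:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Random;
>
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.WritableComparable;
> import org.apache.hadoop.io.WritableComparator;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Partitioner;
> import org.apache.hadoop.mapreduce.Reducer;
>
> public class CrossProductSketch {
>
>   static final int NUM_PARTITIONS = 10;  // assume the job runs 10 reducers
>
>   // Illustrative stub: assume the join key is the first tab-separated field.
>   static String extractKey(Text line) {
>     return line.toString().split("\t", 2)[0];
>   }
>
>   // A-side mapper: each A record goes to ONE partition, chosen at random,
>   // so each reducer only ever buffers its own slice of A.
>   public static class AMapper extends Mapper<Object, Text, Text, Text> {
>     private final Random rnd = new Random();
>     @Override
>     protected void map(Object offset, Text line, Context ctx)
>         throws IOException, InterruptedException {
>       int p = rnd.nextInt(NUM_PARTITIONS);
>       // tag 0 in the key sorts A before B; the "A\t" value prefix lets
>       // the reducer tell the two sides apart
>       ctx.write(new Text(p + "|" + extractKey(line) + "|0"),
>                 new Text("A\t" + line));
>     }
>   }
>
>   // B-side mapper: each B record is *duplicated* to every partition.
>   public static class BMapper extends Mapper<Object, Text, Text, Text> {
>     @Override
>     protected void map(Object offset, Text line, Context ctx)
>         throws IOException, InterruptedException {
>       for (int p = 0; p < NUM_PARTITIONS; p++) {
>         ctx.write(new Text(p + "|" + extractKey(line) + "|1"),
>                   new Text("B\t" + line));
>       }
>     }
>   }
>
>   // Route each record by the partition number embedded in its key.
>   public static class PartNumPartitioner extends Partitioner<Text, Text> {
>     @Override
>     public int getPartition(Text key, Text value, int numPartitions) {
>       return Integer.parseInt(key.toString().split("\\|")[0]) % numPartitions;
>     }
>   }
>
>   // Group on "partition|joinKey" only, so one reduce() call sees every A
>   // and B value for a join key, while the tag still sorts A ahead of B.
>   public static class GroupComparator extends WritableComparator {
>     protected GroupComparator() { super(Text.class, true); }
>     @Override
>     public int compare(WritableComparable a, WritableComparable b) {
>       String[] ka = a.toString().split("\\|");
>       String[] kb = b.toString().split("\\|");
>       int c = ka[0].compareTo(kb[0]);
>       return c != 0 ? c : ka[1].compareTo(kb[1]);
>     }
>   }
>
>   // Buffer this partition's slice of A, then stream B against it.
>   public static class CrossReducer extends Reducer<Text, Text, Text, Text> {
>     @Override
>     protected void reduce(Text key, Iterable<Text> values, Context ctx)
>         throws IOException, InterruptedException {
>       String joinKey = key.toString().split("\\|")[1];
>       List<String> aSlice = new ArrayList<String>();
>       for (Text v : values) {
>         String s = v.toString();
>         if (s.startsWith("A\t")) {
>           aSlice.add(s.substring(2));        // store A records in memory
>         } else {
>           String b = s.substring(2);         // a B record: emit the cross
>           for (String a : aSlice) {
>             ctx.write(new Text(joinKey), new Text(a + "\t" + b));
>           }
>         }
>       }
>     }
>   }
> }
>
> The driver (not shown) would wire this together with job.setPartitionerClass(PartNumPartitioner.class), job.setGroupingComparatorClass(GroupComparator.class), and MultipleInputs.addInputPath(...) to attach the two mappers to their inputs. The union of the reducers' outputs is the full per-key cross product, since every reducer sees all of B but only its own slice of A.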
> Sent from my iPhone
> On Jun 22, 2011, at 3:16 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>> Assume I have two data sources, A and B.
>> Assume I have an input format and can generate key/value pairs for both A and B.
>> I want an algorithm which will generate the cross product of all values in A having the key K and all values in B having the key K.
>> Currently I use a mapper to generate key/value pairs for A and have the reducer collect all of B's values with key K and hold them in memory (sketched below).
>> It works, but it might not scale.
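>> For concreteness, a sketch of my current reducer. The "A\t"/"B\t" source tags and the tab-separated layout are just illustrative, and it assumes a sort comparator already delivers B's values before A's:
>>
>> import java.io.IOException;
>> import java.util.ArrayList;
>> import java.util.List;
>>
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> // Current approach: buffer ALL of B's values for key K in memory,
>> // then cross each arriving A value against them. The buffer is the
>> // part that might not scale when B has many values for one key.
>> public class HoldBInMemoryReducer extends Reducer<Text, Text, Text, Text> {
>>   @Override
>>   protected void reduce(Text key, Iterable<Text> values, Context ctx)
>>       throws IOException, InterruptedException {
>>     List<String> bValues = new ArrayList<String>();
>>     for (Text v : values) {
>>       String s = v.toString();
>>       if (s.startsWith("B\t")) {
>>         bValues.add(s.substring(2));       // hold all of B's values for K
>>       } else {
>>         String a = s.substring(2);         // an A value: emit the cross
>>         for (String b : bValues) {
>>           ctx.write(key, new Text(a + "\t" + b));
>>         }
>>       }
>>     }
>>   }
>> }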
>> Any bright ideas?
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com