MapReduce user mailing list: Algorithm for cross product


Re: Algorithm for cross product
If you have scaling problems, check out the Mahout project. They are
all about distributed scalable linear algebra & more.
http://mahout.apache.org

Lance

On Wed, Jun 22, 2011 at 5:13 PM, Jason <[EMAIL PROTECTED]> wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the data sets. At a high level, these are the steps:
>
> Suppose you decide to partition set A.
>
> Each partition represents a subset/range of the A keys and must be small enough for its records to fit in memory.
>
> Each partition gets sent to a separate reducer by the mapper and partitioner logic.
>
> The second data set B is then *duplicated* to each of the reducers, again using some trivial logic in the mapper and partitioner.
>
> This assumes that the reducers can process records from both the A and B sets. Also, all records from A precede ones from B, which is trivially done by the sort comparator.
>
> When a reducer receives a record from the A set, it stores it in memory.
> When a record from set B arrives, the cross product is computed with all A records already in memory and results are emitted.
>
> The job should scale in space as long as you have enough reducers assigned and will scale in time with more reducer machines.
>
>
> Sent from my iPhone
>
> On Jun 22, 2011, at 3:16 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>
>> Assume I have two data sources A and B
>> Assume I have an input format and can generate key values for both A and B
>> I want an algorithm which will generate the cross product of all values in A having the key K and all values in B having the
>> key K.
>> Currently I use a mapper to generate key values for A and have the reducer get all values in B with key K and hold them in memory.
>> It works but might not scale.
>>
>> Any bright ideas?
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>>
>>
>

--
Lance Norskog
[EMAIL PROTECTED]
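
A minimal, untested sketch of the partition-A / replicate-B scheme Jason describes above, written against the org.apache.hadoop.mapreduce API. All names below (CrossProductSketch, AMapper, BMapper, PartitionOnly, CrossReducer, NUM_PARTITIONS) are made up for illustration, and records are treated as plain Text lines:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossProductSketch {

  // Illustrative constant; choose it so each A slice fits in a reducer's memory.
  public static final int NUM_PARTITIONS = 16;

  // Mapper over data set A: each record goes to exactly one partition.
  // The key "p,0" makes A records sort before B records ("p,1") at the reducer.
  public static class AMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      // Hash-partitioning for simplicity; a range partition of A's keys works the same way.
      int p = (record.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
      ctx.write(new Text(p + ",0"), record);
    }
  }

  // Mapper over data set B: each record is duplicated to every partition.
  // This replication is the price of the scheme: B is written NUM_PARTITIONS times.
  public static class BMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      for (int p = 0; p < NUM_PARTITIONS; p++) {
        ctx.write(new Text(p + ",1"), record);
      }
    }
  }

  // Route on the partition number only, so "p,0" and "p,1" land on the same reducer.
  public static class PartitionOnly extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
      int p = Integer.parseInt(key.toString().split(",")[0]);
      return p % numReduceTasks;
    }
  }

  // With the default sort, key "p,0" (the A slice) arrives just before "p,1" (the B stream).
  // Buffer the A slice in memory, then emit one output pair per (a, b) combination.
  public static class CrossReducer extends Reducer<Text, Text, Text, Text> {
    private final List<String> aSlice = new ArrayList<String>();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      if (key.toString().endsWith(",0")) {
        aSlice.clear();                  // a new A slice begins; drop the previous one
        for (Text a : values) {
          aSlice.add(a.toString());      // copy: Hadoop reuses the Text instance
        }
      } else {
        for (Text b : values) {          // stream B; nothing from B is buffered
          for (String a : aSlice) {
            ctx.write(new Text(a), b);
          }
        }
      }
    }
  }
}

To wire this up, the driver would attach AMapper to A's input path and BMapper to B's input path (for example via MultipleInputs), set PartitionOnly as the partitioner, and run with roughly NUM_PARTITIONS reduce tasks. For Steve's per-key variant, the buffered A slice could instead be a map from the join key K to its list of A values, so that each B record is crossed only with the A values stored under its own key.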