HDFS >> mail # user >> Cartesian product in hadoop


zheyi rong 2013-04-18, 09:47
Jagat Singh 2013-04-18, 09:58
Azuryy Yu 2013-04-18, 10:21
Ajay Srivastava 2013-04-18, 10:50
Re: Cartesian product in hadoop
Hi Ajay Srivastava,

Thank you for your reply.

Could you please explain a bit more about "Write a grouping comparator
that groups records on the first part of the key, i.e. Ki."?
I guess it is a crucial step, which could filter out some pairs before
passing them to the reducer.
Regards,
Zheyi Rong
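
For readers following the thread, here is a minimal sketch of the scheme quoted below, simulated in plain Python rather than with the Hadoop API. Everything in it is an assumption made for illustration: the function names, the use of the record index as Ki, and the integer tags. The `shuffle` function stands in for the framework: it sorts on the full composite key (Ki, tag) and then groups on Ki alone, which is the job of the grouping comparator. Note that the comparator by itself filters nothing; it only decides which records reach the same reduce call. In Hadoop's Java API a grouping comparator is registered with `Job.setGroupingComparatorClass`.

```python
from itertools import groupby

# Hypothetical dataset tags. DATASET1 sorts before DATASET2, so the single
# dataset1 record is the first value of each group.
DATASET1, DATASET2 = 0, 1


def map_dataset1(records):
    # Emit ((Ki, DATASET1), Ri). Assumption: Ki is simply the record index.
    for i, rec in enumerate(records):
        yield (i, DATASET1), rec


def map_dataset2(records, m):
    # Emit every dataset2 record m times, once per dataset1 key.
    for rec in records:
        for i in range(m):
            yield (i, DATASET2), rec


def shuffle(pairs):
    # Sort on the full composite key, then group on its first part only.
    # Grouping on Ki is what the grouping comparator does in the thread.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for ki, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield ki, [value for _, value in group]


def reduce_group(values, keep):
    # values[0] is the dataset1 record; the rest are all n dataset2 records.
    left, rights = values[0], values[1:]
    for right in rights:
        if keep(left, right):
            yield left, right


def filtered_cross(dataset1, dataset2, keep):
    pairs = list(map_dataset1(dataset1))
    pairs += list(map_dataset2(dataset2, len(dataset1)))
    out = []
    for _, values in shuffle(pairs):
        out.extend(reduce_group(values, keep))
    return out


print(filtered_cross([1, 2, 3], [2, 3, 4], lambda a, b: a == b))
# [(2, 2), (3, 3)]
```

The sketch also makes the cost visible: every dataset2 record is emitted m times, so m * n records pass through the shuffle before any filtering happens.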
On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <
[EMAIL PROTECTED]> wrote:

>  Hi Rong,
> You can use the following simple method.
>
>  Let's say dataset1 has m records, and when the mapper emits these
> records, the keys are K1, K2, …, Km, one for each record. Also add an
> identifier to each key to mark which dataset the record was emitted from.
> So if R1 is a record in dataset1, the mapper will emit the key (K1,
> DATASET1) and the value R1.
>
>  For dataset2, which has n records, emit m records for each record, with
> keys K1, K2, …, Km and the identifier DATASET2.
> So if R1' is a record from dataset2, emit m records with key (Ki,
> DATASET2) and value R1', where i runs from 1 to m.
>
>
>  Write a grouping comparator that groups records on the first part of the
> key, i.e. Ki.
>
>  In the reducer, each call to reduce will receive one record from
> dataset1 and n records from dataset2. Compute the Cartesian product, apply
> the filter, and then output the result.
>
>
>  Note: you may not know the keys (K1, K2, …, Km) beforehand. In that case,
> you need one more pass over dataset1 to identify the keys and store them
> for use with dataset2.
>
>
>  Regards,
> Ajay Srivastava
>
>
>  On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>
>  This is not suitable for his large dataset.
>
> On Apr 18, 2013 5:58 PM, "Jagat Singh" <[EMAIL PROTECTED]> wrote:
>
>>  Hi,
>>
>> Can you have a look at
>>
>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>
>>  Thanks
>>
>>
>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <[EMAIL PROTECTED]> wrote:
>>
>>> Dear all,
>>>
>>>  I am writing to ask for ideas on computing a Cartesian product in
>>> Hadoop.
>>> Specifically, I have two datasets, each of which contains 20 million
>>> lines.
>>> I want to compute the Cartesian product of these two datasets, comparing
>>> lines pairwise.
>>>
>>>  Most comparison results can be filtered out by a function (we do not
>>> output the whole result of this Cartesian product, only a small part).
>>>
>>>  I guess one good way is to pass one block from dataset1 and another
>>> block from dataset2 to a mapper, then let the mappers do the product
>>> in memory to avoid I/O.
>>>
>>>  Any suggestions?
>>> Thank you very much.
>>>
>>>  Regards,
>>> Zheyi Rong
>>>
>>
>>
>
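
The block-pairing idea from the original question can be sketched in the same way. This is a plain-Python illustration under stated assumptions, not Hadoop code: `split_blocks`, `block_pair_tasks`, `compare_block_pair` and `blockwise_cross` are hypothetical names, and each (block1, block2) pair stands in for the input of one mapper, which compares its two blocks in memory and emits only the pairs that pass the filter.

```python
from itertools import product


def split_blocks(lines, block_size):
    # Cut a dataset into consecutive blocks of at most block_size lines.
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def block_pair_tasks(dataset1, dataset2, block_size):
    # One task per (block of dataset1, block of dataset2) combination.
    blocks1 = split_blocks(dataset1, block_size)
    blocks2 = split_blocks(dataset2, block_size)
    return list(product(blocks1, blocks2))


def compare_block_pair(block1, block2, keep):
    # In-memory nested loop over one pair of blocks; only kept pairs are emitted.
    return [(a, b) for a in block1 for b in block2 if keep(a, b)]


def blockwise_cross(dataset1, dataset2, block_size, keep):
    out = []
    for block1, block2 in block_pair_tasks(dataset1, dataset2, block_size):
        out.extend(compare_block_pair(block1, block2, keep))
    return out


# Usage: pairs of numbers from 0..4 that sum to 4.
pairs = blockwise_cross(list(range(5)), list(range(5)), 2, lambda a, b: a + b == 4)
print(sorted(pairs))
# [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
```

With block size b, the number of tasks is ceil(m/b) * ceil(n/b), and each block is read once for every block of the other dataset, so a larger b means fewer repeated reads at the cost of more memory per task.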