Re: Effective way to cross two relations
I posted on this same topic a few weeks ago and got no response. It is
still an unresolved issue for me, so if anyone has any ideas, they would be
greatly appreciated.

Interestingly enough, I ran into issues at right around the same size that
you are dealing with (50k rows), so I am wondering whether it is an issue
with how Pig handles things. I'd recommend tuning some of the parameters
that I mention in my post (linked below), as that may help you complete the
job.

http://search-hadoop.com/m/kJghFzruCA1/nested+cross&subj=Moving+Cross+of+Large+Data+to+be+Nested
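
For a rough sense of scale at these sizes, here is a minimal Python sketch of the tuple counts involved. The helper name `pair_counts` is made up for illustration, and it only does arithmetic; it says nothing about how Pig actually executes the job.

```python
from itertools import combinations


def pair_counts(n):
    """Return (full self-cross size, unordered pairs excluding self) for n rows."""
    full_cross = n * n
    half_matrix = n * (n - 1) // 2
    return full_cross, half_matrix


# Sanity check on a tiny relation: keeping only id1 > id2 drops self-pairs
# and keeps each unordered pair exactly once.
ids = [1, 2, 3, 4]
kept = [(a, b) for a in ids for b in ids if a > b]
assert len(kept) == pair_counts(len(ids))[1] == len(list(combinations(ids, 2)))

for n in (50_000, 70_000):
    full, half = pair_counts(n)
    print(f"{n} rows: cross = {full:,} tuples, kept after filter = {half:,}")
```

At 50,000 rows the full cross is 2.5 billion tuples, and roughly 1.25 billion of them survive the filter.
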
On Thu, Apr 18, 2013 at 9:17 PM, KALLURI, RAJESH K (AG/1000) <
[EMAIL PROTECTED]> wrote:

> I have a relation of about 50000 tuples that I want to join to itself,
> either by using a cross operator or something similar. Then I would be
> doing pairwise computation on half of the matrix (avoiding self-comparisons
> and duplicate pairs).
>
> I was wondering what the most effective way to do this is. Below is some
> pseudo Pig Latin.
>
>
> -- About 50,000 - 70,000 entries
> a = LOAD 'part-r-00000.txt' USING PigStorage()
> AS (id:long,  x:int, y:int);
> -- Same as a , About 50,000 - 70,000 entries
> b = LOAD 'part-r-00000.txt' USING PigStorage()
> AS (id:long,  x:int, y:int);
>
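> -- CROSS pairs every row of a with every row of b; a join on id would
> -- only match each row with itself, and the filter would then drop it
> jnd = CROSS a, b;
> -- filter comparisons to self and duplicates from the matrix
> -- (a::id > b::id already excludes a::id == b::id)
> -- end up with 50000 X (50000-1)/2 entries
> filter_self = filter jnd by a::id > b::id;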
>
> raw = foreach filter_self generate a::id as id1, b::id as id2,
>     TOBAG(a::x, b::y) as z;
> -- group pairs for comparison
> grpd = group raw by (id1, id2);
> -- calculate similarity between id1 and id2 based on a udf
> prjctd = foreach grpd generate flatten(group), UDF(raw.z);