

Thread:
- Paltheru, Srikanth 2011-03-14, 21:40
- Thejas M Nair 2011-03-14, 23:18
- Paltheru, Srikanth 2011-03-14, 23:20
- Olga Natkovich 2011-03-15, 00:23
- Thejas M Nair 2011-03-15, 00:29
- Dmitriy Ryaboy 2011-03-15, 00:37
- Dmitriy Ryaboy 2011-03-15, 00:38
- Paltheru, Srikanth 2011-03-15, 00:41
Re: Problems with Join in pig
Replicated-join will only work if the rightmost relation in the join is small enough to fit in available memory, so it will not work with all data sets. But in this case, one of your relations has only one record, which should easily fit into memory.
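For reference, a replicated join in Pig Latin looks roughly like this (relation names, file paths, and schemas here are illustrative, not taken from the poster's script):

```pig
-- 'big' is the large relation; 'small' has very few rows (here, one).
big   = LOAD 'big_data.tsv'   USING PigStorage('\t') AS (key:chararray, val:int);
small = LOAD 'small_data.tsv' USING PigStorage('\t') AS (key:chararray, min_t:long, max_t:long);

-- USING 'replicated' loads the rightmost relation into memory on each
-- map task, avoiding a reduce phase entirely.
joined = JOIN big BY key, small BY key USING 'replicated';
```

The rightmost relation must fit in a mapper's heap, which is why this strategy only applies to small relations.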

The cogroup in your query might be running into some memory issue which might have been fixed in recent versions of pig.

-Thejas

On 3/14/11 4:41 PM, "Paltheru, Srikanth" <[EMAIL PROTECTED]> wrote:

I tried using replicated-join in Pig 0.5, but it does not work. The feature I am trying to use is supported in the 0.5 version as well. It just works for some datasets and doesn't for others.

From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 14, 2011 5:39 PM
To: [EMAIL PROTECTED]
Cc: Thejas M Nair; Paltheru, Srikanth
Subject: Re: Problems with Join in pig

Uh no, I am wrong. They are on Hadoop 20; 18 was Pig 0.4.

Yea Srikanth, you guys should just upgrade. 0.5 to 0.6 is relatively painless. The jump to 0.7-0.8 is harder, but worth it.

D

On Mon, Mar 14, 2011 at 5:37 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
If they are on 5 that means they have bigger problems. They are on Hadoop 18.

D

On Mon, Mar 14, 2011 at 5:29 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:
Fragment-replicate join will also produce an efficient query plan for this use case - http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Replicated+Joins . It is available in 0.5 as well.
-Thejas
On 3/14/11 3:20 PM, "Paltheru, Srikanth" <[EMAIL PROTECTED]> wrote:

I am using Pig 0.5. We don't have plans to upgrade it to a newer version. But the problem I have is that the script runs for some files (both larger and smaller than the ones mentioned) but not for this particular one. I get a "GC overhead limit" error.
Thanks
Sri
-----Original Message-----
From: Thejas M Nair [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 14, 2011 4:18 PM
To: [EMAIL PROTECTED]; Paltheru, Srikanth
Subject: Re: Problems with Join in pig

What version of pig are you using? There have been some memory utilization fixes in 0.8. For this use case, you can also use the new scalar feature in 0.8 - http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars . That query plan will be more efficient.
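The scalar feature works by referencing a field of a single-row relation directly inside a FOREACH, instead of joining or cogrouping it back in. A minimal sketch, assuming Pig 0.8+ and illustrative relation and field names:

```pig
-- Compute a global min/max once; 'bounds' has exactly one row.
hits     = LOAD 'hits.tsv' USING PigStorage('\t') AS (visid:chararray, hit_time:long);
all_hits = GROUP hits ALL;
bounds   = FOREACH all_hits GENERATE MIN(hits.hit_time) AS min_t,
                                     MAX(hits.hit_time) AS max_t;

-- Because 'bounds' is single-row, its fields can be used as scalars
-- in another FOREACH, with a cast to the expected type:
ranged = FOREACH hits GENERATE visid, hit_time,
                               (long)bounds.min_t, (long)bounds.max_t;
```

This avoids the JOIN or COGROUP step entirely, which is why the resulting query plan is more efficient.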

You might want to build a new version of pig from the svn 0.8 branch, because there have been some bug fixes after the release -

svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
cd branch-0.8
ant

-Thejas
On 3/14/11 1:40 PM, "Paltheru, Srikanth" <[EMAIL PROTECTED]>
wrote:

> The following pig script runs fine without the 2GB memory setting (see
> in yellow), but fails with that setting. I am not sure what's happening.
> It's a simple operation of joining one tuple (of 1 row) with the other tuple.
> Here is what I am trying to do:
>
>  1.  grouping all SELECT HIT TIME DATA into a single tuple by doing a GROUP ALL.
>  2.  getting the min and max of that set and putting it into MIN HIT DATA. This is a tuple with a single row.
>  3.  then grouping SELECT MAX VISIT TIME DATA by visid,
>  4.  then generating DUMMY_KEY for every row, along with MAX of start time.
>  5.  then trying to join the single tuple from 2 with all tuples generated in 4 to get a min time and a max time
>
> Code:
> Shell prompt:
> ## setting heap size to 2 GB
> PIG_OPTS="$PIG_OPTS -Dmapred.child.java.opts=-Xmx2048m"
> export PIG_OPTS
>
> Pig/Grunt
>
> RAW_DATA = LOAD
> '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgesp360/wdgesp360_2011-01-05*.tsv.gz'
> USING PigStorage('\t');
> FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY $6 <= 0;
> SELECT_CAST_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE 'DUMMYKEY' AS