RE: Problems with Join in pig
Hi Sri,

You guys should consider moving to the new version. This way you would get better-performing and more stable code, as well as better support, since more people would be using the same code as you.

Olga

-----Original Message-----
From: Paltheru, Srikanth [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 14, 2011 4:21 PM
To: Thejas M Nair; [EMAIL PROTECTED]
Subject: RE: Problems with Join in pig

I am using Pig version 0.5. We don't have plans to upgrade it to a newer version. But the problem I have is that the script runs for some files (both larger and smaller than the ones mentioned) but not for this particular one. I get a "GC overhead limit" error.
Thanks
Sri
-----Original Message-----
From: Thejas M Nair [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 14, 2011 4:18 PM
To: [EMAIL PROTECTED]; Paltheru, Srikanth
Subject: Re: Problems with Join in pig

What version of pig are you using? There have been some memory utilization fixes in 0.8. For this use case, you can also use the new scalar feature in 0.8 - http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars . That query plan will be more efficient.
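For example, applied to the aliases in your script below, a rough sketch of the scalar rewrite might look like this (untested, alias names taken from your script):

-- sketch only: with 0.8 scalars, the single-row MIN_HIT_DATA relation can be
-- referenced directly, so the DUMMY_KEY columns and the COGROUP go away
MIN_MAX_VISIT_HIT_TIME = FOREACH MAX_VISIT_TIME GENERATE
    visid_high, visid_low, MAX_VISIT_START_TIME,
    (int) MIN_HIT_DATA.MIN_HIT_TIME_GMT AS MIN_HIT_TIME_GMT,
    (int) MIN_HIT_DATA.MAX_HIT_TIME_GMT AS MAX_HIT_TIME_GMT;

This avoids shuffling the whole MAX_VISIT_TIME relation to a single DUMMY_KEY reducer, which is likely where your memory pressure is coming from.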

You might want to build a new version of pig from the svn 0.8 branch, because there have been some bug fixes after the release -

svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
cd branch-0.8
ant

-Thejas
On 3/14/11 1:40 PM, "Paltheru, Srikanth" <[EMAIL PROTECTED]> wrote:

> The following pig script runs fine without the 2GB memory setting (see
> below), but fails with the memory setting. I am not sure what's
> happening. It's a simple operation of joining one tuple (of 1 row) with the other tuple.
> Here is what I am trying to do:
>
>  1.  grouping all SELECT_HIT_TIME_DATA into a single tuple by doing a GROUP ALL.
>  2.  getting the min and max of that set and putting it into MIN_HIT_DATA. This is a tuple with a single row.
>  3.  then grouping SELECT_MAX_VISIT_TIME_DATA by visid,
>  4.  then generating DUMMY_KEY for every row, along with the MAX of start time.
>  5.  then trying to join the single tuple in 2 with all the tuples generated in 4 to get a min time and a max time.
>
> Code:
> Shell prompt:
> ## setting heap size to 2 GB
> PIG_OPTS="$PIG_OPTS -Dmapred.child.java.opts=-Xmx2048m"
> export PIG_OPTS
>
> Pig/Grunt
>
> RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgesp360/wdgesp360_2011-01-05*.tsv.gz' USING PigStorage('\t');
> FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY $6 <= 0;
> SELECT_CAST_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, (int)$0 AS hit_time_gmt, (long)$2 AS visid_high, (long)$3 AS visid_low, (chararray)$5 AS truncated_hit;
> SELECT_DATA = FILTER SELECT_CAST_DATA BY truncated_hit == 'N';
>
> --MIN AND MAX_HIT_TIME_GMT FOR THE FILE/SUITE
> SELECT_HIT_TIME_DATA = FOREACH SELECT_DATA GENERATE (int)hit_time_gmt;
> GROUPED_ALL_DATA = GROUP SELECT_HIT_TIME_DATA ALL PARALLEL 100;
> MIN_HIT_DATA = FOREACH GROUPED_ALL_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, MIN(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MIN_HIT_TIME_GMT, MAX(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MAX_HIT_TIME_GMT;
>
> ---MAX_VISIT_START_TIME BY VISITOR_ID
> SELECT_MAX_VISIT_TIME_DATA = FOREACH SELECT_DATA GENERATE visid_high, visid_low, visit_start_time_gmt;
> GROUP_BY_VISID_MAX_VISIT_TIME_DATA = GROUP SELECT_MAX_VISIT_TIME_DATA BY (visid_high, visid_low) PARALLEL 100;
> MAX_VISIT_TIME = FOREACH GROUP_BY_VISID_MAX_VISIT_TIME_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, FLATTEN(group.visid_high) AS visid_high, FLATTEN(group.visid_low) AS visid_low, MAX(SELECT_MAX_VISIT_TIME_DATA.visit_start_time_gmt) AS MAX_VISIT_START_TIME;
> JOINED_MAX_VISIT_TIME_DATA = COGROUP MAX_VISIT_TIME BY DUMMY_KEY OUTER, MIN_HIT_DATA BY DUMMY_KEY OUTER PARALLEL 100;
> MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE FLATTEN(MAX_VISIT_TIME.visid_high), FLATTEN(MAX_VISIT_TIME.visid_low), FLATTEN(MAX_VISIT_TIME.MAX_VISIT_START_TIME), FLATTEN(MIN_HIT_DATA.MIN_HIT_TIME_GMT