Paltheru, Srikanth 2011-03-14, 21:40
Thejas M Nair 2011-03-14, 23:18
Paltheru, Srikanth 2011-03-14, 23:20
RE: Problems with Join in pig
Olga Natkovich 2011-03-15, 00:23
You guys should consider moving to the new version. This way you would get better-performing and more stable code, as well as better support, since more people would be using the same code as you.
From: Paltheru, Srikanth [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 14, 2011 4:21 PM
To: Thejas M Nair; [EMAIL PROTECTED]
Subject: RE: Problems with Join in pig
I am using Pig version 0.5. We don't have plans to upgrade to a newer version. But the problem I have is that the script runs for some files (both larger and smaller than the ones mentioned) but not for this particular one. I get a "GC overhead limit" error.
From: Thejas M Nair [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 14, 2011 4:18 PM
To: [EMAIL PROTECTED]; Paltheru, Srikanth
Subject: Re: Problems with Join in pig
What version of Pig are you using? There have been some memory utilization fixes in 0.8. For this use case, you can also use the new scalar feature in
alars . That query plan will be more efficient.
You might want to build a new version of Pig from the svn 0.8 branch, because there have been some bug fixes after the release:
svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
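To illustrate the scalar suggestion above, here is a minimal sketch, not a tested fix. It assumes Pig 0.8 or later (scalars are not available in 0.5) and reuses the relation and field names from the quoted script below, where MIN_HIT_DATA holds exactly one row. The list of output fields is an assumption, because the final GENERATE in the original script is cut off.

```pig
-- Sketch only: needs Pig 0.8+; MIN_HIT_DATA must contain exactly one row.
-- MAX_VISIT_TIME and MIN_HIT_DATA are the relations built in the quoted script.
-- The output field list is assumed; the original GENERATE clause is truncated.
MIN_MAX_VISIT_HIT_TIME = FOREACH MAX_VISIT_TIME GENERATE
    visid_high,
    visid_low,
    MAX_VISIT_START_TIME,
    MIN_HIT_DATA.MIN_HIT_TIME_GMT AS MIN_HIT_TIME_GMT,  -- scalar reference
    MIN_HIT_DATA.MAX_HIT_TIME_GMT AS MAX_HIT_TIME_GMT;  -- scalar reference
```

With this form the DUMMY_KEY column and the COGROUP on it are no longer needed, so every row of MAX_VISIT_TIME no longer has to be sent to a single reducer under one key.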
On 3/14/11 1:40 PM, "Paltheru, Srikanth" <[EMAIL PROTECTED]> wrote:
> The following Pig script runs fine without the 2 GB memory setting (see
> in yellow), but fails with the memory setting. I am not sure what's
> happening. It's a simple operation of joining one tuple (of 1 row) with the other tuple.
> Here is what I am trying to do:
> 1. grouping all SELECT_HIT_TIME_DATA into a single tuple by doing a
> GROUP ALL.
> 2. getting the min and max of that set and putting it into MIN_HIT_DATA.
> This is a tuple with a single row.
> 3. then grouping SELECT_MAX_VISIT_TIME_DATA by visid,
> 4. then generating DUMMY_KEY for every row, along with MAX of start time.
> 5. then trying to join the single tuple in 2 with all tuples generated
> in 4 to get a min time and a max time
> Shell prompt:
> ## setting heap size to 2 GB
> PIG_OPTS="$PIG_OPTS -Dmapred.child.java.opts=-Xmx2048m"
> export PIG_OPTS
> RAW_DATA = LOAD
> *.tsv.gz' USING PigStorage('\t');
> FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY $6 <= 0;
> SELECT_CAST_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE
>     'DUMMYKEY' AS DUMMY_KEY, (int)$0 AS hit_time_gmt, (long)$2 AS visid_high,
>     (long)$3 AS visid_low, (chararray)$5 AS truncated_hit;
> SELECT_DATA = FILTER SELECT_CAST_DATA BY truncated_hit == 'N';
> --MIN AND MAX_HIT_TIME_GMT FOR THE FILE/SUITE
> SELECT_HIT_TIME_DATA = FOREACH SELECT_DATA GENERATE (int)hit_time_gmt;
> GROUPED_ALL_DATA = GROUP SELECT_HIT_TIME_DATA ALL PARALLEL 100;
> MIN_HIT_DATA = FOREACH GROUPED_ALL_DATA GENERATE
>     DUMMY_KEY, MIN(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MIN_HIT_TIME_GMT,
>     MAX(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MAX_HIT_TIME_GMT;
> ---MAX_VISIT_START_TIME BY VISITOR_ID
> SELECT_MAX_VISIT_TIME_DATA = FOREACH SELECT_DATA GENERATE
> GROUP_BY_VISID_MAX_VISIT_TIME_DATA = GROUP SELECT_MAX_VISIT_TIME_DATA
>     (visid_high,visid_low) PARALLEL 100;
> MAX_VISIT_TIME = FOREACH GROUP_BY_VISID_MAX_VISIT_TIME_DATA GENERATE
>     'DUMMYKEY' AS DUMMY_KEY, FLATTEN(group.visid_high) AS visid_high,
>     FLATTEN(group.visid_low) AS visid_low,
>     MAX(SELECT_MAX_VISIT_TIME_DATA.visit_start_time_gmt) AS MAX_VISIT_START_TIME;
> JOINED_MAX_VISIT_TIME_DATA = COGROUP MAX_VISIT_TIME BY DUMMY_KEY OUTER,
>     MIN_HIT_DATA BY DUMMY_KEY OUTER PARALLEL 100;
> MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE
Thejas M Nair 2011-03-15, 00:29
Dmitriy Ryaboy 2011-03-15, 00:37
Dmitriy Ryaboy 2011-03-15, 00:38
Paltheru, Srikanth 2011-03-15, 00:41
Thejas M Nair 2011-03-15, 00:54