Pig, mail # user - Cross product bug pig 0.10?


Re: Cross product bug pig 0.10?
Mehmet Tepedelenlioglu 2013-05-22, 00:34
Hi,

I found a fairly easy way to reproduce this error with the script below,
running on a cluster (distributed mode). The settings at the top are
artificially low so that the problem shows up with only a few input lines:

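-- Artificially low limits so that Pig picks many reducers for a tiny input
set pig.exec.reducers.bytes.per.reducer 32
set pig.exec.reducers.max 20
X = LOAD '$INPUT' USING PigStorage('$SEPARATOR');
-- Number of fields in each input line
Y = FOREACH X GENERATE COUNT_STAR(TOBAG($0 ..)) as count;
-- How many lines have each field count
GROUPED = GROUP Y BY count;
MAX = FOREACH GROUPED GENERATE group as tokennum, COUNT(Y) as count;
-- Keep only the most frequent field count
MAXG = GROUP MAX ALL;
MAXX = foreach MAXG generate FLATTEN(TOP(1,1,MAX));
MAXX = foreach MAXX generate $0 as tokennum;
-- Attach that single value to every input line
Z = CROSS MAXX, X;
STORE Z INTO '$OUT' USING PigStorage('$SEPARATOR');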

As input I used the line
1 1
repeated 13 times.
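
For anyone who wants to recreate that input, here is a minimal sketch in Python. It assumes the fields are separated by a single space; the real separator comes from the $SEPARATOR parameter, which is not shown in this message, and the helper name build_input is made up for illustration.

```python
# Assumption: fields are separated by a single space. The original script
# takes the separator from the $SEPARATOR parameter, whose value is not shown.
SEPARATOR = " "


def build_input(n_lines=13, sep=SEPARATOR):
    """Return n_lines copies of the two-field line '1<sep>1' (hypothetical helper)."""
    line = sep.join(["1", "1"])
    return "".join(line + "\n" for _ in range(n_lines))


# Build the 13-line test input and print it so it can be redirected to a file.
data = build_input()
print(data, end="")
```

Redirect the output to a file and copy it to the location passed as $INPUT.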

I think the only thing that matters is that Pig decides to use more than
one reducer. In my case these settings were enough for Pig to use 20
reducers. The job then reports:

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 2 records (12 bytes) in: "/tmp/mehmet/out"

But it should produce 13 records, since the CROSS just attaches the single
MAXX tuple to each input line.
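
To make the expected result concrete, here is a rough Python simulation of what the script should compute on that input. It is only a sketch of the intended semantics, not of how Pig executes the job, and it assumes a one-character separator (a space here); the function name expected_output is made up for illustration.

```python
from collections import Counter

# Assumption: a single space stands in for the unknown $SEPARATOR value.
SEPARATOR = " "


def expected_output(lines, sep=SEPARATOR):
    """Simulate the intended result of the Pig script on the given lines."""
    x = [tuple(line.split(sep)) for line in lines]   # X: one tuple per line
    y = [len(row) for row in x]                      # Y: field count per line
    max_rel = list(Counter(y).items())               # MAX: (tokennum, count)
    tokennum, _ = max(max_rel, key=lambda t: t[1])   # TOP(1, 1, MAX)
    maxx = [(str(tokennum),)]                        # MAXX: a single tuple
    z = [m + row for m in maxx for row in x]         # Z = CROSS MAXX, X
    return [sep.join(rec) for rec in z]


records = expected_output(["1 1"] * 13)
# Size on disk, counting one newline per record.
size = sum(len(r) + 1 for r in records)
print(len(records), size)
```

Under these assumptions every record is "2 1 1", so 13 records take 6 bytes each including the newline. That is 78 bytes in total, and the 2 records actually stored would account for 12 bytes.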

Two odd facts:

1. If you replace
Z = CROSS MAXX, X
with
Z = CROSS MAXX, X parallel 20

the problem goes away (perhaps CROSS does not pick up the correct number
of reducers when that number is calculated rather than set explicitly):

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 13 records (78 bytes) in: "/tmp/mehmet/out"

2. If you skip all the steps that compute MAXX and just load MAXX from a
file, the problem also goes away, which is strange: why should it matter
where MAXX came from?

I am using Hadoop 2.0.0-cdh4.2.0 and Pig version 0.10.0-cdh4.1.2.


Mehmet
On 5/21/13 1:41 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:

>Any chance you could replicate this for us? Ideally some dummy data and a
>script?
>
>
>2013/5/19 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> Recently I was taking the cross product between 2 bags of tuples one of
>> which has only one tuple, to append the one with one element to all the
>> others (I know this is not the best way to do this, it was done as a
>> prototype). There seems to be a bug with the cross product where not all
>> the tuples of the larger bag are replicated. All but one of the part files
>> are empty, and everything works just fine in the local mode (probably
>> because it uses only one reducer). Is anybody else aware of this issue?
>>
>> The version is:
>>
>> Apache Pig version 0.10.0-cdh4.1.2 (rexported)
>> compiled Nov 01 2012, 18:38:33
>>
>> Thanks,
>>
>> Mehmet