Re: Cross product bug pig 0.10?
Hi,

So I found a fairly easy way to replicate this error with the script below,
running on a cluster (distributed mode). The settings at the top are
artificial, chosen to reproduce the problem with only a few input lines:

set pig.exec.reducers.bytes.per.reducer 32;
set pig.exec.reducers.max 20;
X = LOAD '$INPUT' USING PigStorage('$SEPARATOR');
Y = FOREACH X GENERATE COUNT_STAR(TOBAG($0 ..)) AS count;
GROUPED = GROUP Y BY count;
MAX = FOREACH GROUPED GENERATE group AS tokennum, COUNT(Y) AS count;
MAXG = GROUP MAX ALL;
MAXX = FOREACH MAXG GENERATE FLATTEN(TOP(1, 1, MAX));
MAXX = FOREACH MAXX GENERATE $0 AS tokennum;
Z = CROSS MAXX, X;
STORE Z INTO '$OUT' USING PigStorage('$SEPARATOR');
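
For anyone trying to reproduce: the script takes $INPUT, $SEPARATOR, and
$OUT through Pig's parameter substitution, so an invocation along these
lines should work (the script name and the separator value here are just
placeholders):

pig -p INPUT=/user/mehmet/input2 -p SEPARATOR=' ' -p OUT=/tmp/mehmet/out cross_bug.pig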

As input I took the line:
1 1
repeated 13 times.

I think the only thing that matters is that Pig decides to use more than one
reducer. In my case these settings were enough for Pig to use 20 reducers.
This yields:

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 2 records (12 bytes) in: "/tmp/mehmet/out"

But it should be producing 13 records, since the script just appends the
single MAXX value to each input line.
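
If I am reading the script right, MAXX should end up holding the single
value 2 (COUNT_STAR sees two fields in "1 1"), so the expected output is 13
copies of the record

2 1 1

with the fields joined by $SEPARATOR.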

Two odd facts:

1. If you replace
Z = CROSS MAXX, X;
with
Z = CROSS MAXX, X PARALLEL 20;

the problem goes away. (Perhaps CROSS is not getting the calculated
number-of-reducers value correctly.) With PARALLEL 20 the run reports:

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 13 records (78 bytes) in: "/tmp/mehmet/out"
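
Presumably (I have not tested this) setting the default parallelism
explicitly at the top of the script would sidestep the reducer estimate in
the same way:

set default_parallel 20;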

2. If you skip all the steps that produce MAXX and instead load MAXX from a
file, the problem also goes away, which is strange: why should it matter
where MAXX came from?
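
Concretely, the variant I mean looks something like this ($TMPDIR is a
placeholder), with the cross product moved into a second script:

-- script 1: everything up to MAXX as above, then
STORE MAXX INTO '$TMPDIR/maxx' USING PigStorage('$SEPARATOR');

-- script 2: reload MAXX and take the cross product
X = LOAD '$INPUT' USING PigStorage('$SEPARATOR');
MAXX = LOAD '$TMPDIR/maxx' USING PigStorage('$SEPARATOR') AS (tokennum:long);
Z = CROSS MAXX, X;
STORE Z INTO '$OUT' USING PigStorage('$SEPARATOR');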
I am using Hadoop 2.0.0-cdh4.2.0 and Pig 0.10.0-cdh4.1.2.

Mehmet
On 5/21/13 1:41 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:

>Any chance you could replicate this for us? Ideally some dummy data and a
>script?
>
>
>2013/5/19 Mehmet Tepedelenlioglu <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> Recently I was taking the cross product between two bags of tuples, one
>> of which has only one tuple, to append the one with one element to all
>> the others (I know this is not the best way to do this; it was done as
>> a prototype). There seems to be a bug with the cross product where not
>> all the tuples of the larger bag are replicated. All but one of the
>> part files are empty, and everything works just fine in local mode
>> (probably because it uses only one reducer). Is anybody else aware of
>> this issue?
>>
>> The version is:
>>
>> Apache Pig version 0.10.0-cdh4.1.2 (rexported)
>> compiled Nov 01 2012, 18:38:33
>>
>> Thanks,
>>
>> Mehmet
>>
>>
>>