Pig >> mail # user >> Pig streaming and multiquery is buggy on local mode ?


Thomas Porez 2013-07-11, 13:08
Thomas Porez 2013-07-11, 14:18
Re: Pig streaming and multiquery is buggy on local mode ?
`ruby script.rb` does the same thing as `cat`, except that it also prints
each line to stderr:

# script.rb
ARGF.each_line do |line|
  $stderr.puts line
  $stdout.puts line
end

So I can see that the script correctly receives all the lines (every line
shows up on stderr), but Pig reads only one line from stdout.
It seems that multiquery in local mode does not handle the output of
streaming well.
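
To confirm which output directory is truncated without reading the part files by eye, a small check along these lines may help. This is only a sketch and not part of the original report: `check_outputs.rb`, `sorted_lines` and `check_outputs` are hypothetical names, and it assumes the input file and output directories are named as in bug.pig below. Lines are compared after sorting, because GROUP may reorder rows.

```ruby
# check_outputs.rb: hypothetical helper, not from the original thread.
# Compares the lines of the input file with the lines found in each
# Pig output directory, ignoring order.

# Return the sorted lines of all files matching a glob pattern.
def sorted_lines(pattern)
  Dir.glob(pattern).sort.flat_map { |path| File.readlines(path, chomp: true) }.sort
end

# Map each output directory to true (all input lines present) or false.
def check_outputs(input_path, output_dirs)
  expected = sorted_lines(input_path)
  output_dirs.to_h do |dir|
    [dir, sorted_lines(File.join(dir, "part*")) == expected]
  end
end

if __FILE__ == $PROGRAM_NAME
  # Assumed names, taken from bug.pig: 'myinput', 'output1', 'output2'.
  check_outputs("myinput", %w[output1 output2]).each do |dir, ok|
    puts "#{dir}: #{ok ? 'complete' : 'lines missing'}"
  end
end
```

Run it with `ruby check_outputs.rb` from the directory where the Pig script was run. If multiquery is really the cause, running Pig with the `-no_multiquery` (`-M`) option should make both directories come out complete; I have not verified that here.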
2013/7/11 Thomas Porez <[EMAIL PROTECTED]>

> It seems that the script I posted is not correct: some operators were
> swapped. The correct version is:
>
> # bug.pig
> MYINPUT = LOAD 'myinput';
>
> A = GROUP MYINPUT BY $0;
> B = FOREACH A GENERATE FLATTEN(MYINPUT);
> C = STREAM B THROUGH `ruby script.rb`;
>
> D = GROUP MYINPUT BY $0;
> E = FOREACH D GENERATE FLATTEN(MYINPUT);
> F = STREAM E THROUGH `ruby script.rb`;
>
> STORE C into 'output1';
> STORE F into 'output2';
>
> # I run the script using the following command:
> pig -x local bug.pig
>
> # And show the output
> cat output1/part*
> cat output2/part*
>
>
> 2013/7/11 Thomas Porez <[EMAIL PROTECTED]>
>
>> Today I noticed a strange behavior of Pig in local mode (streaming +
>> multiquery).
>> Here is a minimal script to reproduce the problem.
>>
>> Suppose an input file with multiple lines, for example:
>> # myInput
>> 1
>> 2
>> 3
>> 1
>> 2
>> 3
>>
>> The Pig script is:
>> # bug.pig
>> MyInput = LOAD 'myInput;
>>
>> A = myInput GROUP BY $ 0;
>> B = FOREACH A GENERATE FLATTEN (myInput);
>> C = B STREAM THROUGH `cat`;
>>
>> D = myInput GROUP BY $ 0;
>> E = FOREACH D GENERATE FLATTEN (myInput);
>> STREAM THROUGH E F = `cat`;
>>
>> STORE C into 'output1;
>> STORE F into 'output2;
>>
>> I run the script using the following command:
>> pig -x local bug.pig
>>
>> output1 and output2 should each contain a complete copy of my input file,
>> but this is not the case. Each contains only one line (the first line of
>> the file):
>> cat output1/part*
>> cat output2/part*
>>
>> For information, the corresponding Pig script seems to work properly on
>> Hadoop.
>> If I comment out one of the two STORE operations, it works as expected (I
>> think this is because no multiquery is run in that case).
>>
>
>