Pig >> mail # user >> Commands not working properly when stored in pig file


Re: Commands not working properly when stored in pig file

Hi, Mix:
" second map reduce started executing before first one got completed"
Interesting. Since you only LOAD evnt_dtl, without a DUMP or STORE on it,
Pig shouldn't run anything for it, and certainly not before the STORE command completes.

The script below works fine for me, so I think the root cause is something
else, unless your data is very large?
a = load 'words_and_numbers' as (f1:chararray, f2:chararray);
b = filter a by f1 is not null;
store (foreach (group b all) generate flatten($1)) into 'multipleload/tmp';
c = load 'multipleload/tmp/part-r-00000' as (f3:chararray, f4:chararray);
dump c;

Johnny

It's the multi-query execution optimization. Pig doesn't know that the second LOAD has to wait for the STORE, so it tries to run them in parallel. You have three options:

1. Name the relation you stored and use it instead of loading a new relation:

Data = LOAD '/....' as (,,,, );
NoNullData = FILTER Data by qe is not null;
exp = foreach (group NoNullData all) generate flatten($1);
STORE exp into 'exp/$inputDatePig';

evnt_dtl = FOREACH exp GENERATE $0 as cust ...

2. Use the EXEC keyword to tell Pig to finish the commands up to that point before running the rest:

Data = LOAD '/....' as (,,,, );
NoNullData = FILTER Data by qe is not null;
STORE (foreach (group NoNullData all) generate flatten($1)) into 'exp/$inputDatePig';
EXEC;
evnt_dtl = LOAD 'exp/$inputDatePig/part-r-00000' AS (cust,,,,,);

3. Disable multi-query execution:
$ pig -no_multiquery x.pig
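
For reference, option 2 applied to Johnny's test script might look roughly like the sketch below. The paths and field names are the ones from his script, not from the original poster's job, and the GROUP ... ALL step is dropped for brevity, so treat it as an illustration rather than a drop-in fix. It loads the output directory instead of a single part file, on the assumption that reading every part file the STORE wrote is what is wanted.

```pig
-- Sketch only: paths and field names come from Johnny's test script.
a = LOAD 'words_and_numbers' AS (f1:chararray, f2:chararray);
b = FILTER a BY f1 IS NOT NULL;
STORE b INTO 'multipleload/tmp';

-- Run everything above to completion before the statements below are planned.
EXEC;

-- Loading the directory reads all part files written by the STORE above.
c = LOAD 'multipleload/tmp' AS (f3:chararray, f4:chararray);
DUMP c;
```

Without the EXEC line, the second LOAD has no dependency on the STORE that Pig can see, which is the situation described above.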
- Marcos