Re: Fwd: Problem with using CROSS in PIG
I had the same problem. You can search the mailing list archives to find out more about it, but in a nutshell, this happens when Pig calculates the number of reducers itself. It goes away if you specify the number of reducers on the join step. Try it and tell us if that works.
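For example, something like this (an untested sketch using the aliases from the script below; the PARALLEL clause explicitly sets the number of reducers for that operator instead of letting Pig estimate it):

D = CROSS C, sequence_number PARALLEL 1;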
________________________________
 From: Simonffy Szilvia <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, August 1, 2013 11:31 PM
Subject: Fwd: Problem with using CROSS in PIG
 

Hi,

I wrote a Pig script, and I get inconsistent results when running the same script multiple times.

Pig version: 0.11.1
Hadoop version: 1.1.2 (4-node cluster)

pig script:
A = LOAD '/tmp/data' AS (request_datetime: chararray, portal_name: chararray, sku: chararray, product_name: chararray, duration: int);
B = FILTER A BY portal_name == 'portal1';
C = FILTER B BY sku == '4505865';

sequence_numbers = LOAD 'sequence_numbers' USING org.apache.hcatalog.pig.HCatLoader();
sequence_number = FILTER sequence_numbers BY key == '20071224_20071230';
sequence_number = FOREACH sequence_number GENERATE
    seq AS seq;
sequence_number = LIMIT sequence_number 1;

D = CROSS C, sequence_number;
E = FOREACH D GENERATE
    request_datetime AS request_datetime,
    portal_name AS portal_name,
    sku AS sku,
    product_name AS product_name,
    duration AS duration,
    seq AS seq;

STORE E INTO '/tmp/data/output/' using PigStorage();

Execution results after running the script five times:
1. Successfully stored 3 records
2. Successfully stored 5 records
3. Successfully stored 2 records
4. Successfully stored 3 records
5. Successfully stored 1 record

Can anybody tell me what is wrong?

P.S.: As a workaround I skipped CROSS and used JOIN instead:
D = JOIN C BY identifier, report_sequence_number BY identifier; -- where identifier is a constant number: 1
With this change the result is correct every time.
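For reference, a rough sketch of how such a constant join key can be added (the aliases C2 and S2 are illustrative, not from the original script; the literal 1 acts as the join key on both sides, so the JOIN reproduces the CROSS):

C2 = FOREACH C GENERATE *, 1 AS identifier;
S2 = FOREACH sequence_number GENERATE seq, 1 AS identifier;
D = JOIN C2 BY identifier, S2 BY identifier;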

data: /tmp/data/data.tsv
2013-03-14T10:07:14    portal1    4505865    Julsång (Cantique de Noël) (1997 Digital Remaster)    304
2013-03-14T22:55:49    portal1    4505865    Julsång (Cantique de Noël) (1997 Digital Remaster)    304
2013-03-19T09:11:03    portal1    4505865    Julsång (Cantique de Noël) (1997 Digital Remaster)    304
2013-03-19T09:23:49    portal1    4505865    Julsång (Cantique de Noël) (1997 Digital Remaster)    304
2013-03-19T09:23:49    portal1    4505865    Julsång (Cantique de Noël) (1997 Digital Remaster)    304
2013-03-17T13:36:15    portal1    4505865    Julsång (Cantique de Noël) (1997 Digital Remaster)    304
2013-03-01T09:07:34    portal1    310451    Heroes (Single Version)    215
2013-03-16T16:13:17    portal1    310451    Heroes (Single Version)    215
2013-03-18T23:19:17    portal1    310451    Heroes (Single Version)    215
2013-03-15T07:47:37    portal1    310451    Heroes (Single Version)    215
2013-03-19T13:48:03    portal1    310451    Heroes (Single Version)    215
2013-03-13T15:17:29    portal1    310451    Heroes (Single Version)    215
2013-03-14T14:34:40    portal1    310451    Heroes (Single Version)    215

data: /tmp/sequence_numbers/data.tsv
20071224_20071230    100
20071231_20080106    101
20080107_20080113    102
20080114_20080120    103
20080121_20080127    104
20080128_20080203    105
20080204_20080210    106
20080211_20080217    107
20080218_20080224    108
20080225_20080302    109

br,
Szilvi