Pig, mail # user - Join on custom LoadFunc not working correctly


Join on custom LoadFunc not working correctly
Pradeep Gollakota 2013-05-30, 19:12
Hey guys,

I have a custom Storage function that loads from the Accumulo database
(similar to HBase).
I have the following script that I'm trying to execute:

A = load 'accumulo://table_a'
         using org.apache.accumulo.pig.AccumuloStorage('cf:cq1 cf:cq2', '-loadKey')
         as (id: chararray, a: chararray, b: chararray);
B = load 'accumulo://table_b'
         using org.apache.accumulo.pig.AccumuloStorage('cf:cq1 cf:cq2', '-loadKey')
         as (id: chararray, a: chararray, b: chararray);
C = join A by a, B by b;
dump C;

When I execute this, dataset A is not loaded.
If I swap the join order instead:
C = join B by b, A by a;
A is loaded, but B is not.

My current workaround is to store A and B into temporary storage using
PigStorage() and load them again before doing the join. However, that adds
extra read/write phases that I'd like to avoid. In my implementation of the
AccumuloStorage() function, I set pig.noSplitCombination to true.
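
Roughly, the workaround looks like the sketch below. The temporary paths and
the A2/B2 aliases are placeholders for illustration, not the real ones, and
the schemas are simply repeated from the loads above. It assumes the store
and load run in a single script and that Pig orders each load after its
matching store; if it does not, the two halves can be run as separate scripts.

```
-- Hypothetical temporary locations; adjust to your HDFS layout.
store A into '/tmp/join_workaround/a' using PigStorage();
store B into '/tmp/join_workaround/b' using PigStorage();

-- Reload both datasets from plain files, then join those instead.
A2 = load '/tmp/join_workaround/a' using PigStorage()
         as (id: chararray, a: chararray, b: chararray);
B2 = load '/tmp/join_workaround/b' using PigStorage()
         as (id: chararray, a: chararray, b: chararray);
C = join A2 by a, B2 by b;
dump C;
```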

I'm not sure what the problem with my LoadFunc is and why it's not loading
both datasets correctly.

Any help would be appreciated.

Thanks
Pradeep