Pig >> mail # user >> Join question


Join question
Hi,
  I have a simple join question.
base = LOAD 'input1' USING PigStorage(',') AS (id1, field1, field2);
stats = LOAD 'input2' USING PigStorage(',') AS (id1, mean, median);
joined = JOIN base BY id1, stats BY id1;
final = FOREACH joined GENERATE base::id1, base::field1, base::field2,
        stats::mean, stats::median;
STORE final INTO 'output' USING PigStorage(',');

But something doesn't feel right.
The inputs are on the order of MBs, whereas the output is something like 100 GB.

I tried it on a sample, where base is 35 MB and stats is 10 MB,
and the output still explodes to GBs.
What am I missing?
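
One thing that could be checked here, as a rough sketch only: an inner join emits, for every key, (rows in base with that key) x (rows in stats with that key) output rows, so an id1 that repeats many times in both inputs multiplies the output. Whether id1 really repeats in these files is an assumption, not something known from the post. The aliases base_grp, base_cnt, stats_grp, stats_cnt, both, per_key, worst and top10 are made up for illustration; the loads are the same as in the script above.

```pig
-- Same inputs as in the original script.
base  = LOAD 'input1' USING PigStorage(',') AS (id1, field1, field2);
stats = LOAD 'input2' USING PigStorage(',') AS (id1, mean, median);

-- Number of rows per join key in each input.
base_grp  = GROUP base BY id1;
base_cnt  = FOREACH base_grp GENERATE group AS id1, COUNT_STAR(base) AS n_base;

stats_grp = GROUP stats BY id1;
stats_cnt = FOREACH stats_grp GENERATE group AS id1, COUNT_STAR(stats) AS n_stats;

-- For each key present in both inputs, the inner join
-- produces n_base * n_stats rows.
both    = JOIN base_cnt BY id1, stats_cnt BY id1;
per_key = FOREACH both GENERATE base_cnt::id1 AS id1,
                                base_cnt::n_base AS n_base,
                                stats_cnt::n_stats AS n_stats,
                                base_cnt::n_base * stats_cnt::n_stats AS n_out;

-- Show the keys that contribute the most joined rows.
worst = ORDER per_key BY n_out DESC;
top10 = LIMIT worst 10;
DUMP top10;
```

If the largest n_out values are 1, duplicate keys are not the cause and the problem lies elsewhere. If a few keys show very large n_base and n_stats, those keys alone could account for the blow-up.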