Question on how GroupBy and Join work in Pig
Hi everyone,

I have a question about how GROUP BY and JOIN behave in Pig:

Suppose I have two data files:

1. cust_info

2. premium_data
cust_info:

ID    name    region
2321  Austin  Pondicherry
2375  Martin  California
4286  Lisa    Chennai

premium_data:

ID    premium  start_year  end_year
2321  345      2009        2010
2375  845      2009        2011
4286  286      2010        2012
2321  213      2001        2004
3041  452      2010        2013
3041  423      2006        2009

===============================
Load premium_data, group it by ID, and sum the total premium per ID:

grunt> premium_data = load 'premium_data';
grunt> illustrate premium_data;
------------------------------------------------------------------------------------------
| premium_data     | ID:int     | premium:float    | start_year:int    | end_year:int    |
------------------------------------------------------------------------------------------
|                  | 4286       | 286              | 2010              | 2012            |
------------------------------------------------------------------------------------------

grunt> cust_info = load 'cust_info';
grunt> illustrate cust_info;
------------------------------------------------------------------------
| cust_info     | ID:int     | name:chararray    | region:chararray    |
------------------------------------------------------------------------
|               | 2375       | Martin            | California          |
------------------------------------------------------------------------

grunt> grouped_ID = group premium_data by ID;
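
For reference, the aggregation I ultimately want after this group is something along these lines (the total_premium alias is just for illustration):

grunt> total_premium = foreach grouped_ID generate group as ID, SUM(premium_data.premium) as total;
grunt> dump total_premium;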

When I give a schema inside my LOAD statement, I run into errors when using GROUP BY and JOIN. But if I don't give a schema, the fields are treated as bytearrays and everything works fine.

I don't think this is the usual behavior. Am I doing something wrong in the way I use JOIN and GROUP BY?
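
For completeness, the schema-declared version that fails for me looks roughly like this (the schemas match the illustrate output above; the joined alias is just for illustration):

grunt> premium_data = load 'premium_data' as (ID:int, premium:float, start_year:int, end_year:int);
grunt> cust_info = load 'cust_info' as (ID:int, name:chararray, region:chararray);
grunt> grouped_ID = group premium_data by ID;
grunt> joined = join premium_data by ID, cust_info by ID;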
Running illustrate on grouped_ID then throws the errors below:

grunt> illustrate grouped_ID;

2012-02-06 22:47:31,452 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-02-06 22:47:31,651 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-06 22:47:31,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-06 22:47:31,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-06 22:47:31,698 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-06 22:47:31,719 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-06 22:47:31,850 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-06 22:47:31,851 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-02-06 22:47:31,867 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-06 22:47:31,869 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-06 22:47:31,869 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-06 22:47:31,870 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-06 22:47:31,870 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-06 22:47:31,884 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=292
2012-02-06 22:47:31,885 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1

java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
     at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:81)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:117)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:273)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:205)
     at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
     at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
     at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
     at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
     at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
     at org.apache.pig.PigServer.getExamples(PigServer.java:1202)
     at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:700)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:597)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:308)
     at org.apache.pig.tools.grunt.GruntParser.parseStopO