Question on how GroupBy and Join works in Pig
Hi everyone,

I have a question about how GROUP BY and JOIN behave in Pig.

Suppose I have two data files:

1. cust_info
2. premium_data

cust_info:

ID    name    region
2321  Austin  Pondicherry
2375  Martin  California
4286  Lisa    Chennai

premium_data:

ID    premium  start_year  end_year
2321  345      2009        2010
2375  845      2009        2011
4286  286      2010        2012
2321  213      2001        2004
3041  452      2010        2013
3041  423      2006        2009

===============================
Goal: load premium_data, group it by ID, and sum the total premium for each ID.

grunt> premium_data = load 'premium_data';

grunt> illustrate premium_data;

------------------------------------------------------------------------------------------
| premium_data     | ID:int     | premium:float    | start_year:int    | end_year:int    |
------------------------------------------------------------------------------------------
|                  | 4286       | 286              | 2010              | 2012            |
------------------------------------------------------------------------------------------

grunt> cust_info = load 'cust_info';

grunt> illustrate cust_info;

------------------------------------------------------------------------

| cust_info     | ID:int     | name:chararray    | region:chararray    |

------------------------------------------------------------------------

|               | 2375       | Martin            | California          |

------------------------------------------------------------------------

grunt> grouped_ID = group premium_data by ID;
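
A minimal sketch of the pipeline described above, written as a Pig Latin script. It assumes the files are tab-delimited (the PigStorage default) with no header row, and it takes the field names and types from the illustrate output shown earlier. The aliases total_premium and joined are illustrative names, not part of the original session, and the sketch is not claimed to avoid the error discussed here.

```pig
-- Load both files with an explicit schema (types as shown by illustrate).
premium_data = LOAD 'premium_data'
    AS (ID:int, premium:float, start_year:int, end_year:int);
cust_info = LOAD 'cust_info'
    AS (ID:int, name:chararray, region:chararray);

-- Group the premium rows by customer ID.
grouped_ID = GROUP premium_data BY ID;

-- Sum the premium column inside each group (one output row per ID).
total_premium = FOREACH grouped_ID
    GENERATE group AS ID, SUM(premium_data.premium) AS total;

-- Inner join on ID: IDs present in only one input are dropped.
joined = JOIN cust_info BY ID, total_premium BY ID;

DUMP joined;
```

With the sample rows above, ID 3041 appears only in premium_data, so an inner join like this one would not include it in the result.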

When I specify a schema in my LOAD statement, I get errors when using GROUP BY and JOIN.

But if I don't specify a schema, my fields are treated as bytearrays and everything works fine.

I don't think this is the expected behavior. Am I doing something wrong in the way I use JOIN and GROUP BY? For example, running illustrate on grouped_ID throws the following error:

grunt> illustrate grouped_ID;

2012-02-06 22:47:31,452 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-02-06 22:47:31,651 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-06 22:47:31,680 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-06 22:47:31,680 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-06 22:47:31,698 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-06 22:47:31,719 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-06 22:47:31,850 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-06 22:47:31,851 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-02-06 22:47:31,867 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-06 22:47:31,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-06 22:47:31,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-06 22:47:31,870 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-06 22:47:31,870 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-06 22:47:31,884 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=292
2012-02-06 22:47:31,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1

java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
     at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:81)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:117)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:273)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:205)
     at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
     at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
     at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
     at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
     at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
     at org.apache.pig.PigServer.getExamples(PigServer.java:1202)
     at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:700)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:597)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:308)
     at org.apache.pig.tools.grunt.GruntParser.parseStopO