-
Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Rohini U 2012-03-21, 19:34
Hi,

I have a Pig script that does a simple GROUP followed by counting, and I get this error. My data is certainly not big enough to cause an out-of-memory error. Could this be caused by a bug? Has anyone come across this kind of error before?

I am using Pig 0.9.1 with Hadoop 0.20.205.

My script:

    rawItems = LOAD 'in' as (item1, item2, item3, type, count);

    grouped = GROUP rawItems BY (item1, item2, item3, type);

    counts = FOREACH grouped {
        selectedFields = FILTER rawItems BY type == 'EMPLOYER';
        GENERATE
            FLATTEN(group) as (item1, item2, item3, type),
            SUM(selectedFields.count) as count;
    };

Stack trace:

    2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

Thanks,
-Rohini
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-21, 19:42
Hi Rohini,

Can you provide some details: how big is the input dataset, how much data do the reducers receive from the mappers, and how many reducers are you using?

Thanks,
Prashant
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Dmitriy Ryaboy 2012-03-21, 23:29
You are not doing grouping followed by counting. You are doing grouping followed by filtering followed by counting. Try filtering before grouping.

D
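
A minimal sketch of the filter-before-group rewrite, applied to the single-filter script from the first message. The field types in the LOAD schema are assumptions (the posted script declares none), and `employers` is a hypothetical alias introduced only for illustration.

```pig
-- Assumed schema: the script in the thread declares no field types.
rawItems = LOAD 'in' AS (item1:chararray, item2:chararray, item3:chararray,
                         type:chararray, count:long);

-- Filter first, so only EMPLOYER rows are shuffled to the reducers.
employers = FILTER rawItems BY type == 'EMPLOYER';

grouped = GROUP employers BY (item1, item2, item3, type);

-- No nested block is needed any more: the FOREACH only projects the
-- group key and applies SUM to the grouped bag.
counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (item1, item2, item3, type),
    SUM(employers.count) AS count;
```

Because SUM is an algebraic function and the FOREACH no longer contains a nested FILTER, Pig can typically use the combiner here instead of materializing each group's full bag in the reducer. One behavioral difference to keep in mind: groups with no EMPLOYER rows disappear from the output, whereas the nested-FILTER version would emit them with a null sum.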
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Rohini U 2012-03-21, 23:45
My input data size is 9 GB and I am using 20 machines.

My grouping criteria have two cases, so I want 1) counts by the criteria I have grouped on, and 2) counts of the two individual cases within each group.

So my script in detail is:

    counts = FOREACH grouped {
        selectedFields1 = FILTER rawItems BY type == 'EMPLOYER';
        selectedFields2 = FILTER rawItems BY type == 'LOCATION';
        GENERATE
            FLATTEN(group) as (item1, item2, item3, type),
            SUM(selectedFields1.count) as selectFields1Count,
            SUM(selectedFields2.count) as selectFields2Count,
            COUNT(rawItems) as groupCriteriaCount;
    };

Is there a way to do this?

Regards,
-Rohini
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-22, 00:02
Please pull your FILTER out of the GROUP BY and do it earlier:
http://pig.apache.org/docs/r0.9.1/perf.html#filter

In this case, you could use a FILTER followed by a bincond to introduce a new field "employerOrLocation", then do a GROUP BY and include the new field in the GROUP BY clause.

Thanks,
Prashant
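
The suggestion above could look roughly like the following. This is one reading of it, not code posted in the thread: the typed LOAD schema, the aliases `relevant` and `tagged`, the output name `total`, and the choice to let `employerOrLocation` replace `type` in the grouping key are all assumptions.

```pig
-- Assumed schema: the script in the thread declares no field types.
rawItems = LOAD 'in' AS (item1:chararray, item2:chararray, item3:chararray,
                         type:chararray, count:long);

-- Keep only the two cases of interest before any grouping happens.
relevant = FILTER rawItems BY type == 'EMPLOYER' OR type == 'LOCATION';

-- The bincond (cond ? a : b) derives the new field from type.
tagged = FOREACH relevant GENERATE
    item1, item2, item3, count,
    (type == 'EMPLOYER' ? 'EMPLOYER' : 'LOCATION') AS employerOrLocation;

-- The new field becomes part of the grouping key.
grouped = GROUP tagged BY (item1, item2, item3, employerOrLocation);

counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (item1, item2, item3, employerOrLocation),
    SUM(tagged.count) AS total;
```

This produces one output row per (item1, item2, item3) and case, rather than a single row carrying both sums side by side.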
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Dmitriy Ryaboy 2012-03-22, 06:02
Prashant, mind filing a JIRA with this example? Technically, this is something we could do automatically.
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-22, 06:41
Sure, I can do that. Isn't this something that should be done already? Or does it not work if the filter operates on a field that is part of the group key?
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Rohini U 2012-03-22, 16:03
Thanks for the suggestion, Prashant. However, that will not work in my case.

If I filter before the group and include the new field in the group as you suggested, I get the individual counts broken down by the select-field criteria. However, I also want the totals that do not take the select fields into account. That is why I took the approach I described in my earlier emails.

Thanks,
Rohini
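
One way to get both the per-case sums and the overall count in a single pass is to turn each nested FILTER into a bincond column before grouping. This is a sketch under stated assumptions, not something proposed in the thread: the typed LOAD schema, the alias `withCases`, and the fields `employerCount` and `locationCount` are hypothetical.

```pig
-- Assumed schema: the script in the thread declares no field types.
rawItems = LOAD 'in' AS (item1:chararray, item2:chararray, item3:chararray,
                         type:chararray, count:long);

-- Each bincond contributes the row's count only when it matches the case,
-- and 0 otherwise, so no row has to be filtered out.
withCases = FOREACH rawItems GENERATE
    item1, item2, item3, type,
    (type == 'EMPLOYER' ? count : 0L) AS employerCount,
    (type == 'LOCATION' ? count : 0L) AS locationCount;

grouped = GROUP withCases BY (item1, item2, item3, type);

-- Plain (non-nested) FOREACH over algebraic functions only.
counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (item1, item2, item3, type),
    SUM(withCases.employerCount) AS selectFields1Count,
    SUM(withCases.locationCount) AS selectFields2Count,
    COUNT(withCases) AS groupCriteriaCount;
```

Since every row stays in the relation, the total count is preserved, and with no nested FILTER the combiner can typically be used. Note that a group with no matching rows gets a sum of 0 here, where the nested-FILTER version would yield null.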
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Rohini U 2012-03-22, 16:06
Has a JIRA been filed for this? I can send the example I am trying if that helps.

Thanks,
Rohini
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Dmitriy Ryaboy 2012-03-22, 19:25
So, as explained earlier, the reason you are running out of memory is that you are loading all records into memory when you want to do non-algebraic things to the results of grouping.

Can you come up with ways to achieve what you need without having to have the raw records at the reducer? One way has been suggested. It's reasonably straightforward to figure out the solution to your question given the advice already provided.

D
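For readers following along, one way to see the difference described above is to compare the plans Pig produces. The sketch below is an illustration and was not posted in the thread: the relation and field names (items, k, type, cnt) are made up, and it assumes the MapReduce execution engine. As far as the Pig performance documentation of this era describes it, the combiner is generally used for a non-nested FOREACH whose projections are the group key or algebraic functions, and a nested FILTER is not among the nested operations that qualify.

```pig
-- Hypothetical input; all names here are for illustration only.
items = LOAD 'in' AS (k:chararray, type:chararray, cnt:long);
grouped = GROUP items BY k;

-- Nested FILTER inside the FOREACH: the pattern discussed in this thread.
nested = FOREACH grouped {
    employers = FILTER items BY type == 'EMPLOYER';
    GENERATE group AS k, SUM(employers.cnt) AS employerCnt;
};

-- Only algebraic functions over plain projections of the grouped bag.
flat = FOREACH grouped GENERATE
    group AS k,
    COUNT(items) AS n,
    SUM(items.cnt) AS total;

-- Print both plans so they can be compared side by side.
EXPLAIN nested;
EXPLAIN flat;
```

If the combiner applies, the plan printed for flat should contain a non-empty combine plan section. For nested it is expected to be missing, which would mean whole bags are shipped to the reducer.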
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Dmitriy Ryaboy 2012-03-22, 19:26
It's done for some cases, but this one is different since the group key needs to change.

D

On Wed, Mar 21, 2012 at 11:41 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> Sure I can do that. Isn't this something that should be done already? Or
> does it not work if the filter is working on a field that is part of the
> group?
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-22, 19:44
Hi Rohini,

From your query it looks like you are already grouping by TYPE, so I am not sure why you would want the SUM of, say, the "EMPLOYER" type in "LOCATION" and vice versa. Your output is already broken down by TYPE.

Thanks,
Prashant
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Rohini U 2012-03-22, 20:12
Hi Prashant,

Here is my script in full.

raw = LOAD 'input' using MyCustomLoader();

searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType, clickCount);

groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;

counts = FOREACH groupedSearches {
    type1 = FILTER searches BY adType == 'type1';
    type2 = FILTER searches BY adType == 'type2';
    GENERATE
        FLATTEN(group) AS (day, searchType),
        COUNT(searches) AS numSearches,
        SUM(searches.clickCount) AS clickCountPerSearchType,
        SUM(type1.clickCount) AS type1ClickCount,
        SUM(type2.clickCount) AS type2ClickCount;
};

As you can see above, I am computing the counts by day and search type in clickCountPerSearchType, and for each of them I need the counts broken down by ad type.

Thanks for your help!

Thanks,
Rohini
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-22, 20:27
This makes more sense: the grouping and the filter are on different columns. I will open a JIRA soon.

What version of Pig and Hadoop are you using?

Thanks,
Prashant
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Jonathan Coveney 2012-03-22, 21:37
Rohini,

In the meantime, something like the following should work:

raw = LOAD 'input' using MyCustomLoader();

searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType, clickCount);

-- Carry the per-ad-type click counts as plain columns (0 for other ad types),
-- so no FILTER is needed after the GROUP.
searches_2 = FOREACH searches GENERATE *,
    (adType == 'type1' ? clickCount : 0) AS type1_clickCount,
    (adType == 'type2' ? clickCount : 0) AS type2_clickCount;

groupedSearches = GROUP searches_2 BY (day, searchType) PARALLEL 50;

-- The grouped bag is named after the relation that was grouped (searches_2).
counts = FOREACH groupedSearches {
    GENERATE
        FLATTEN(group) AS (day, searchType),
        COUNT(searches_2) AS numSearches,
        SUM(searches_2.clickCount) AS clickCountPerSearchType,
        SUM(searches_2.type1_clickCount) AS type1ClickCount,
        SUM(searches_2.type2_clickCount) AS type2ClickCount;
};

2012/3/22 Rohini U <[EMAIL PROTECTED]>
> Thanks Prashant,
> I am using Pig 0.9.1 and hadoop 0.20.205
>
> Thanks,
> Rohini
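Another option, sketched here under assumptions and not taken from the thread, is to aggregate in two steps: first per (day, searchType, adType), then roll those partial results up to (day, searchType). This follows the earlier advice to put the extra field into the GROUP BY while still producing the overall totals. The aliases byType, perType, byDay and totals are hypothetical; raw, searches and MyCustomLoader are the names used in Rohini's script.

```pig
raw = LOAD 'input' USING MyCustomLoader();

searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType, clickCount);

-- Step 1: one row per (day, searchType, adType), using only algebraic functions.
byType = GROUP searches BY (day, searchType, adType) PARALLEL 50;
perType = FOREACH byType GENERATE
    FLATTEN(group) AS (day, searchType, adType),
    COUNT(searches) AS n,
    SUM(searches.clickCount) AS clicks;

-- Step 2: roll the much smaller per-type rows up to (day, searchType) totals.
byDay = GROUP perType BY (day, searchType);
totals = FOREACH byDay GENERATE
    FLATTEN(group) AS (day, searchType),
    SUM(perType.n) AS numSearches,
    SUM(perType.clicks) AS clickCountPerSearchType;
```

Here perType holds one row per ad type and totals holds one row per (day, searchType). Combining the two back into a single wide row, as in the original output, is left out of the sketch.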
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-23, 02:10
Rohini,

Here is the JIRA: https://issues.apache.org/jira/browse/PIG-2610

Can you please post the stack trace as a comment on it?

Thanks,
Prashant
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceededPrashant Kommireddi 2012-03-23, 19:46
Rohini, it's fine even if you could reply with the stacktrace here. I can
add it to JIRA.

Thanks,
Prashant

On Thu, Mar 22, 2012 at 7:10 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> Rohini,
>
> Here is the JIRA. https://issues.apache.org/jira/browse/PIG-2610
>
> Can you please post the stacktrace as a comment to it?
>
> Thanks,
> Prashant
>
>
> On Thu, Mar 22, 2012 at 2:37 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
>> Rohini,
>>
>> In the meantime, something like the following should work:
>>
>> raw = LOAD 'input' using MyCustomLoader();
>>
>> searches = FOREACH raw GENERATE
>>     day, searchType,
>>     FLATTEN(impBag) AS (adType, clickCount)
>> ;
>>
>> searches_2 = foreach searches generate *,
>>     ( adType == 'type1' ? clickCount : 0 ) as type1_clickCount,
>>     ( adType == 'type2' ? clickCount : 0 ) as type2_clickCount;
>>
>> groupedSearches = GROUP searches_2 BY (day, searchType) PARALLEL 50;
>> counts = FOREACH groupedSearches{
>>     GENERATE
>>         FLATTEN(group) AS (day, searchType),
>>         COUNT(searches) numSearches,
>>         SUM(clickCount) AS clickCountPerSearchType,
>>         SUM(searches_2.type1_clickCount) AS type1ClickCount,
>>         SUM(searches_2.type2_clickCount) AS type2ClickCount;
>> }
>> ;
>>
>> 2012/3/22 Rohini U <[EMAIL PROTECTED]>
>>
>> > Thanks Prashant,
>> > I am using Pig 0.9.1 and hadoop 0.20.205
>> >
>> > Thanks,
>> > Rohini
>> >
>> > On Thu, Mar 22, 2012 at 1:27 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>> >
>> > > This makes more sense, grouping and filter are on different columns. I
>> > > will open a JIRA soon.
>> > >
>> > > What version of Pig and Hadoop are you using?
>> > >
>> > > Thanks,
>> > > Prashant
>> > >
>> > > On Thu, Mar 22, 2012 at 1:12 PM, Rohini U <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > Hi Prashant,
>> > > >
>> > > > Here is my script in full.
>> > > >
>> > > >
>> > > > raw = LOAD 'input' using MyCustomLoader();
>> > > >
>> > > > searches = FOREACH raw GENERATE
>> > > >     day, searchType,
>> > > >     FLATTEN(impBag) AS (adType, clickCount)
>> > > > ;
>> > > >
>> > > > groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
>> > > > counts = FOREACH groupedSearches{
>> > > >     type1 = FILTER searches BY adType == 'type1';
>> > > >     type2 = FILTER searches BY adType == 'type2';
>> > > >     GENERATE
>> > > >         FLATTEN(group) AS (day, searchType),
>> > > >         COUNT(searches) numSearches,
>> > > >         SUM(clickCount) AS clickCountPerSearchType,
>> > > >         SUM(type1.clickCount) AS type1ClickCount,
>> > > >         SUM(type2.clickCount) AS type2ClickCount;
>> > > > }
>> > > > ;
>> > > >
>> > > > As you can see above, I am counting the counts by the day and search type
>> > > > in clickCountPerSearchType and for each of them i need the counts broken by
>> > > > the ad type.
>> > > >
>> > > > Thanks for your help!
>> > > > Thanks,
>> > > > Rohini
>> > > >
>> > > >
>> > > > On Thu, Mar 22, 2012 at 12:44 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > Hi Rohini,
>> > > > >
>> > > > > From your query it looks like you are already grouping it by TYPE, so not
>> > > > > sure why you would want the SUM of, say "EMPLOYER" type in "LOCATION" and
>> > > > > vice-versa. Your output is already broken down by TYPE.
>> > > > >
>> > > > > Thanks,
>> > > > > Prashant
>> > > > >
>> > > > > On Thu, Mar 22, 2012 at 9:03 AM, Rohini U <[EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > Thanks for the suggestion Prashant. However, that will not work in my
>> > > > > > case.
>> > > > > >
>> > > > > > If I filter before the group and include the new field in group as you
>> > > > > > suggested, I get the individual counts broken by the select
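For readers following the thread: the workaround quoted above can be sketched with the relation names made consistent after the GROUP (the grouped bag is searches_2, so the COUNT and SUM calls refer to it). This is only a sketch under assumptions. MyCustomLoader is the poster's own loader and is not shown in the thread; the sketch assumes it exposes fields named day, searchType and impBag, and the chararray/long types are guesses, not something the thread states. The idea is to compute one conditional column per ad type before grouping, so the nested FILTER that appears in the stack trace is no longer needed.

```pig
-- Sketch only. MyCustomLoader, its field names and the declared types are assumptions.
raw = LOAD 'input' USING MyCustomLoader();

searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType:chararray, clickCount:long);

-- One conditional column per ad type, computed before the GROUP,
-- replaces the nested FILTER inside the FOREACH.
searches_2 = FOREACH searches GENERATE
    day, searchType, clickCount,
    (adType == 'type1' ? clickCount : 0L) AS type1_clickCount,
    (adType == 'type2' ? clickCount : 0L) AS type2_clickCount;

groupedSearches = GROUP searches_2 BY (day, searchType) PARALLEL 50;

-- After the GROUP the bag is named searches_2, so every aggregate refers to it.
counts = FOREACH groupedSearches GENERATE
    FLATTEN(group) AS (day, searchType),
    COUNT(searches_2) AS numSearches,
    SUM(searches_2.clickCount) AS clickCountPerSearchType,
    SUM(searches_2.type1_clickCount) AS type1ClickCount,
    SUM(searches_2.type2_clickCount) AS type2ClickCount;
```

Two things to check before relying on this. A group with no rows of a given ad type yields 0 here, whereas SUM over an empty filtered bag would yield null. Also, because every aggregate is now a plain COUNT or SUM over the grouped bag, Pig may be able to apply the combiner and hold less data per reducer; that is worth confirming with EXPLAIN rather than taking as given.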
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Rohini U 2012-03-23, 19:50
Prashant,

I just added the stack trace as a comment to the opened JIRA.

Thanks for the help.

-Rohini

On Fri, Mar 23, 2012 at 12:46 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> Rohini, it's fine even if you could reply with the stacktrace here. I can
> add it to JIRA.
>
> Thanks,
> Prashant
-
Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
Prashant Kommireddi 2012-03-23, 20:22
Thanks.

On Fri, Mar 23, 2012 at 12:50 PM, Rohini U <[EMAIL PROTECTED]> wrote:

> Prashant,
>
> I just added the stack trace as a comment to the opened JIRA.
>
> Thanks for the help.
>
> -Rohini