Pig >> mail # user >> Spilling issue - Optimize "GROUP BY"


Re: Spilling issue - Optimize "GROUP BY"
"and after trying it on several datanodes in the end it fails"
Is that with the default of 4 task attempts?

1. It would help if you provided the logs.
2. Do you use any "balancing" properties, for example
pig.exec.reducers.bytes.per.reducer?
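
For reference, a minimal sketch of how such properties are usually set at the top of a script. The values below are placeholders I picked for illustration, not recommendations, and the relation names are taken from the quoted script. Note that more reducers only help when the load is spread over many keys; all records of a single (Field_B, Field_C) key still go to one reducer.

```pig
-- Assumed values, for illustration only; tune them for your cluster.
-- Input bytes per reducer used when Pig estimates the reducer count.
SET pig.exec.reducers.bytes.per.reducer 500000000;
-- Upper bound on the estimated number of reducers.
SET pig.exec.reducers.max 200;

-- Alternatively, request a fixed reducer count for one operator.
groupedData = GROUP generatedData BY (Field_B, Field_C) PARALLEL 20;
```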

I suspect your data is unbalanced (skewed), so most records end up on a single reducer.
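
If that is the case, one thing that may help is making the aggregation combiner-friendly. A nested FILTER inside the FOREACH block generally keeps Pig from using the combiner, so the whole bag for each group is built on the reducer and can spill. A rough sketch of the usual rewrite, continuing from generatedData in the quoted script: the two conditions on Field_A and the aliases hit_1 and hit_2 are hypothetical stand-ins for the elided filter expressions.

```pig
-- Turn each filter condition into a 0/1 flag per record, before grouping.
-- (Field_A == 'x') and (Field_A == 'y') are hypothetical placeholders
-- for the real filter expressions, which are not shown in the thread.
flagged = FOREACH generatedData GENERATE
    Field_B,
    Field_C,
    ((Field_A == 'x') ? 1L : 0L) AS hit_1,
    ((Field_A == 'y') ? 1L : 0L) AS hit_2;

groupedFlags = GROUP flagged BY (Field_B, Field_C);

-- Only algebraic functions (SUM) and group projections are used here,
-- so Pig can apply the combiner and pre-aggregate on the map side.
result = FOREACH groupedFlags GENERATE
    group.Field_B,
    group.Field_C,
    SUM(flagged.hit_1),
    SUM(flagged.hit_2);
```

With this shape the hot reducer receives partial sums instead of every raw record of the group. Please check the counts against the original script on a small sample, since null handling in the real conditions may differ from this sketch.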
2014/1/10 Zebeljan, Nebojsa <[EMAIL PROTECTED]>

> Hi,
> I'm running into spilling issues with a "simple" Pig script. All map tasks
> and reducers succeed pretty fast except the last reducer!
> The last reducer always starts spilling after ~10 minutes, and after being
> retried on several datanodes it eventually fails.
>
> Do you have any idea how I could optimize the GROUP BY so I don't run
> into spilling issues?
>
> Thanks in advance!
>
> Below is the Pig script:
> ###
> dataImport = LOAD <some data>;
> generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
> groupedData = GROUP generatedData BY (Field_B, Field_C);
>
> result = FOREACH groupedData {
>     counter_1 = FILTER generatedData BY <some fields>;
>     counter_2 = FILTER generatedData BY <some fields>;
>     GENERATE
>         group.Field_B,
>         group.Field_C,
>         COUNT(counter_1),
>         COUNT(counter_2);
>     }
>
> STORE result INTO <some path> USING PigStorage();
> ###
>
> Regards,
> Nebo
>