Re: Spilling issue - Optimize "GROUP BY"
"and after trying it on several datanodes in the end it fails"
Is that with the default of 4 task attempts?

1. It's better to provide logs.
2. Do you use any "balancing" properties, for example
pig.exec.reducers.bytes.per.reducer?
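
For reference, a minimal sketch of how such properties can be set at the top of a Pig script. The values are illustrative assumptions, not recommendations from this thread. To my knowledge, Pig's default for pig.exec.reducers.bytes.per.reducer is 1000000000 (about 1 GB).

```pig
-- Illustrative values only; tune them for the actual input size.
-- A lower bytes-per-reducer value makes Pig estimate more reducers.
-- 268435456 bytes = 256 MB of input per reducer.
SET pig.exec.reducers.bytes.per.reducer 268435456;
-- Upper bound on the estimated number of reducers.
SET pig.exec.reducers.max 200;
-- Alternatively, fix the reducer count for all reduce-side operators:
-- SET default_parallel 40;
```

More reducers only help when the keys are spread fairly evenly: all records for one (Field_B, Field_C) pair still go to a single reducer.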

I suppose your data is unbalanced (skewed).
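
One way to check for skew, sketched under the assumption that the alias generatedData and the fields Field_B and Field_C are defined as in the script quoted below, is to count the records per grouping key and look at the largest groups. The other alias names are made up for illustration.

```pig
-- Count records per grouping key to look for hot keys.
-- generatedData is the alias from the quoted script; all other
-- alias names here are hypothetical.
keysOnly     = FOREACH generatedData GENERATE Field_B, Field_C;
byKey        = GROUP keysOnly BY (Field_B, Field_C);
keyCounts    = FOREACH byKey GENERATE
                   FLATTEN(group) AS (Field_B, Field_C),
                   COUNT_STAR(keysOnly) AS num_records;
sortedCounts = ORDER keyCounts BY num_records DESC;
topKeys      = LIMIT sortedCounts 20;
DUMP topKeys;
```

If one or two keys hold most of the records, that would be consistent with a single reducer running far longer than the rest.
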
2014/1/10 Zebeljan, Nebojsa <[EMAIL PROTECTED]>

> Hi,
> I'm running into spilling issues with a "simple" Pig script. All map tasks
> and reducers finish quickly except the last reducer!
> The last reducer always starts spilling after ~10 minutes, and after being
> retried on several datanodes it ultimately fails.
>
> Do you have any idea how I could optimize the GROUP BY so that I don't run
> into spilling issues?
>
> Thanks in advance!
>
> The Pig script is below:
> ###
> dataImport = LOAD <some data>;
> generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
> groupedData = GROUP generatedData BY (Field_B, Field_C);
>
> result = FOREACH groupedData {
>     counter_1 = FILTER generatedData BY <some fields>;
>     counter_2 = FILTER generatedData BY <some fields>;
>     GENERATE
>         group.Field_B,
>         group.Field_C,
>         COUNT(counter_1),
>         COUNT(counter_2);
>     }
>
> STORE result INTO <some path> USING PigStorage();
> ###
>
> Regards,
> Nebo
>
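
One possible way to restructure the quoted script, sketched under assumptions. The two FILTER conditions are elided in the original, so the comparisons on Field_A below are purely hypothetical stand-ins. The idea is to turn each condition into a 0/1 flag before the GROUP and then SUM the flags. As far as I understand Pig's combiner rules, a nested FILTER inside FOREACH prevents the combiner from being used, so every record of a group is collected into a bag on the reducer. A plain FOREACH that applies only algebraic functions such as SUM to the grouped bag can be partially aggregated on the map side.

```pig
-- Assumes dataImport is loaded as in the original script.
-- The two conditions are hypothetical placeholders for the elided filters.
flagged = FOREACH dataImport GENERATE
              Field_B,
              Field_C,
              ((Field_A == 'x') ? 1L : 0L) AS is_counter_1,
              ((Field_A == 'y') ? 1L : 0L) AS is_counter_2;

groupedData = GROUP flagged BY (Field_B, Field_C);

-- Only algebraic functions are applied to the bag, so partial sums
-- can be computed before the data reaches the reducer.
result = FOREACH groupedData GENERATE
             group.Field_B,
             group.Field_C,
             SUM(flagged.is_counter_1) AS counter_1,
             SUM(flagged.is_counter_2) AS counter_2;
```

The results can differ from the COUNT version when the fields involved are null, since COUNT skips tuples whose first field is null and a comparison against null yields a null flag. This does not remove the skew itself, but it should reduce how much data the hot reducer has to hold.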