Pig, mail # user - Spilling issue - Optimize "GROUP BY"


Re: Spilling issue - Optimize "GROUP BY"
Pradeep Gollakota 2014-01-10, 18:23
Did you mean to say "timeout" instead of "spill"? Spills don't cause task
failures (unless a spill fails). Default timeout for a task is 10 min. It
would be very helpful to have a stack trace to look at, at the very least.
On Fri, Jan 10, 2014 at 7:53 AM, Zebeljan, Nebojsa <
[EMAIL PROTECTED]> wrote:

> Hi Serega,
> Default task attempts = 4
> --> Yes, 4 task attempts
>
> Do you use any "balancing" properties, for example
> pig.exec.reducers.bytes.per.reducer
> --> No
>
> I suppose you have unbalanced data
> --> I guess so
>
> It's better to provide logs
> --> Unfortunately not possible any more: "May be cleaned up by Task
> Tracker, if older logs"
>
> Regards,
> Nebo
> ________________________________________
> From: Serega Sheypak [[EMAIL PROTECTED]]
> Sent: Friday, January 10, 2014 2:32 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Spilling issue - Optimize "GROUP BY"
>
> "and after trying it on several datanodes in the end it failes"
> Default task attempts = 4?
>
> 1. It's better to provide logs
> 2. Do you use any "balancing" properties, for example
> pig.exec.reducers.bytes.per.reducer ?
>
> I suppose you have unbalanced data
>
>
> 2014/1/10 Zebeljan, Nebojsa <[EMAIL PROTECTED]>
>
> > Hi,
> > I'm encountering spilling issues with a "simple" Pig script. All map
> > tasks and reducers succeed pretty fast, except the last reducer!
> > The last reducer always starts spilling after ~10 mins, and after
> > trying it on several datanodes it fails in the end.
> >
> > Do you have any idea how I could optimize the GROUP BY so I don't run
> > into spilling issues?
> >
> > Thanks in advance!
> >
> > Below the pig script:
> > ###
> > dataImport = LOAD <some data>;
> > generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
> > groupedData = GROUP generatedData BY (Field_B, Field_C);
> >
> > result = FOREACH groupedData {
> >     counter_1 = FILTER generatedData BY <some fields>;
> >     counter_2 = FILTER generatedData BY <some fields>;
> >     GENERATE
> >         group.Field_B,
> >         group.Field_C,
> >         COUNT(counter_1),
> >         COUNT(counter_2);
> >     }
> >
> > STORE result INTO <some path> USING PigStorage();
> > ###
> >
> > Regards,
> > Nebo
> >
>
>
>
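
[Editor's note] One likely cause of the single slow, spilling reducer: the nested FOREACH above applies FILTER inside the grouped block, which prevents Pig from using the combiner, so every record for a hot (Field_B, Field_C) key is shipped to one reducer and counted there. A common rewrite is to turn each filter condition into a 0/1 indicator column *before* grouping and SUM it afterwards; SUM is algebraic, so Pig can pre-aggregate in the combiner and shrink the shuffle. This is a sketch only: `<some data>`, `<some path>`, and the two `<condition>` placeholders stand in for the elided parts of the original script, and the field names are the ones used in the mail.

```pig
dataImport    = LOAD '<some data>';
generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;

-- Turn each FILTER into a 0/1 indicator column before grouping, using
-- Pig's bincond operator (cond ? 1 : 0).
flagged = FOREACH generatedData GENERATE
    Field_B,
    Field_C,
    ((<condition 1>) ? 1 : 0) AS is_counter_1,
    ((<condition 2>) ? 1 : 0) AS is_counter_2;

groupedData = GROUP flagged BY (Field_B, Field_C);

-- SUM over the indicator columns is algebraic, so Pig can run it in the
-- combiner on the map side and drastically reduce reducer input.
result = FOREACH groupedData GENERATE
    group.Field_B,
    group.Field_C,
    SUM(flagged.is_counter_1) AS counter_1,
    SUM(flagged.is_counter_2) AS counter_2;

STORE result INTO '<some path>' USING PigStorage();
```

If one key still dominates even after combining, the remaining options are about the data itself (e.g. pre-splitting the hot key or raising reducer parallelism), not the script shape.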