Pig >> mail # user >> Spilling issue - Optimize "GROUP BY"


Re: Spilling issue - Optimize "GROUP BY"
If it is indeed a balancing issue, you could load separately for counter 1 and counter 2, filter, group/count, and join. That way you ensure that the filtering is done before the grouping, so the combiner kicks in for the counts, and the join is done on unique keys you have already grouped on. The downside is 2 MR steps instead of 1.
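A minimal Pig sketch of that restructuring, assuming the field names and the <some ...> placeholders from the original script below (the relation names filtered_1/filtered_2, counts_1/counts_2 are made up for illustration):

###
dataImport    = LOAD <some data>;
generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;

-- Filter BEFORE grouping, once per counter, so the plain COUNT stays algebraic
-- and the combiner can pre-aggregate on the map side.
filtered_1 = FILTER generatedData BY <some fields>;
filtered_2 = FILTER generatedData BY <some fields>;

grouped_1 = GROUP filtered_1 BY (Field_B, Field_C);
counts_1  = FOREACH grouped_1 GENERATE
                FLATTEN(group) AS (Field_B, Field_C),
                COUNT(filtered_1) AS counter_1;

grouped_2 = GROUP filtered_2 BY (Field_B, Field_C);
counts_2  = FOREACH grouped_2 GENERATE
                FLATTEN(group) AS (Field_B, Field_C),
                COUNT(filtered_2) AS counter_2;

-- Join on the (Field_B, Field_C) keys, which are already unique per relation
-- after the GROUP/COUNT step.
result = JOIN counts_1 BY (Field_B, Field_C), counts_2 BY (Field_B, Field_C);

STORE result INTO <some path> USING PigStorage();
###

Note that a plain (inner) JOIN drops keys for which one of the filters matched no rows; the original nested-FOREACH form would have emitted a 0 count for those keys instead.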
On Jan 10, 2014, at 10:41 AM, Zebeljan, Nebojsa <[EMAIL PROTECTED]> wrote:

> Yes, you're right. It spills for over 600 sec (10 min) and then it fails.
>
> I don't want to increase the timeout, so I wonder if there is a way to optimize the Pig script or to add some arguments to tune the performance ...
> ________________________________________
> From: Pradeep Gollakota [[EMAIL PROTECTED]]
> Sent: Friday, January 10, 2014 7:23 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Spilling issue - Optimize "GROUP BY"
>
> Did you mean to say "timeout" instead of "spill"? Spills don't cause task
> failures (unless a spill fails). Default timeout for a task is 10 min. It
> would be very helpful to have a stack trace to look at, at the very least.
>
>
> On Fri, Jan 10, 2014 at 7:53 AM, Zebeljan, Nebojsa <
> [EMAIL PROTECTED]> wrote:
>
>> Hi Serega,
>> Default task attempts = 4
>> --> Yes, 4 task attempts
>>
>> Do you use any "balancing" properties, for example
>> pig.exec.reducers.bytes.per.reducer
>> --> No
>>
>> I suppose you have unbalanced data
>> --> I guess so
>>
>> It's better to provide logs
>> --> Unfortunately not possible any more ("May be cleaned up by Task
>> Tracker, if older logs")
>>
>> Regards,
>> Nebo
>> ________________________________________
>> From: Serega Sheypak [[EMAIL PROTECTED]]
>> Sent: Friday, January 10, 2014 2:32 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Spilling issue - Optimize "GROUP BY"
>>
>> "and after trying it on several datanodes in the end it failes"
>> Default task attempts = 4?
>>
>> 1. It's better to provide logs
>> 2. Do you use any "balancing" properties, for example
>> pig.exec.reducers.bytes.per.reducer ?
>>
>> I suppose you have unbalanced data
>>
>>
>> 2014/1/10 Zebeljan, Nebojsa <[EMAIL PROTECTED]>
>>
>>> Hi,
>>> I'm encountering spilling issues with a "simple" Pig script. All map
>>> tasks and reducers succeed pretty fast except the last reducer!
>>> The last reducer always starts spilling after ~10 mins, and after trying
>>> it on several datanodes it fails in the end.
>>>
>>> Do you have any idea how I could optimize the GROUP BY so that I don't
>>> run into spilling issues?
>>>
>>> Thanks in advance!
>>>
>>> Below the pig script:
>>> ###
>>> dataImport = LOAD <some data>;
>>> generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
>>> groupedData = GROUP generatedData BY (Field_B, Field_C);
>>>
>>> result = FOREACH groupedData {
>>>    counter_1 = FILTER generatedData BY <some fields>;
>>>    counter_2 = FILTER generatedData BY <some fields>;
>>>    GENERATE
>>>        group.Field_B,
>>>        group.Field_C,
>>>        COUNT(counter_1),
>>>        COUNT(counter_2);
>>>    }
>>>
>>> STORE result INTO <some path> USING PigStorage();
>>> ###
>>>
>>> Regards,
>>> Nebo
>>>
>>
>>
>>
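Regarding the pig.exec.reducers.bytes.per.reducer "balancing" property Serega asks about above: a minimal sketch of setting it, or of forcing more reducers on the heavy GROUP directly (not from the thread; the 256 MB target and the PARALLEL 20 value are only illustrations), would be:

###
-- Lower the bytes-per-reducer target so Pig allocates more reducers
-- (the value here is an arbitrary example, ~256 MB):
SET pig.exec.reducers.bytes.per.reducer 268435456;

-- Or request a reducer count explicitly on the expensive GROUP:
groupedData = GROUP generatedData BY (Field_B, Field_C) PARALLEL 20;
###

Note that if one (Field_B, Field_C) key carries most of the data, adding reducers will not relieve that single hot reducer.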