Pig, mail # dev - Some proposals for Pig performance optimization


Re: Some proposals for Pig performance optimization
Jie Li 2012-06-21, 21:22
Thanks Thejas for the comments! See my answers inline.

> 1. Order-by
> The comparison against Hive order-by is misleading. Hive does not do total
> ordering unless you use a single reducer.
> But yes, in the case of Pig, the sampling phase is unnecessary if you use a
> single reducer. A single reducer can make sense if the data you are sorting
> is small. I agree that it makes sense to remove the sampling phase in Pig in
> such cases.

Yes, the environment setup uses only 1 GB of data, so there is only one
reducer for the order-by. I've also updated the doc to note that Hive always
uses one reducer for the order-by.
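
To make the single-reducer point concrete, here is a minimal sketch in Python
rather than Pig's actual code. The function names, the sample size, and the
quantile-based boundaries are all assumptions for illustration; it only shows
why sampling is needed to pick range boundaries with several reducers and why
it can be skipped with one.

```python
import random
from bisect import bisect_right


def sample_boundaries(records, num_reducers, sample_size=100, seed=0):
    """Pick num_reducers - 1 split points from a random sample (hypothetical helper)."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def total_order_sort(records, num_reducers):
    """Return (sorted_records, sampled); sampled tells whether the sampling step ran."""
    if num_reducers <= 1 or not records:
        # A single reducer sees every record, so no boundaries are needed.
        return sorted(records), False
    boundaries = sample_boundaries(records, num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for r in records:
        # Route each record to the reducer that owns its key range.
        partitions[bisect_right(boundaries, r)].append(r)
    # Each "reducer" sorts its own range; concatenating them is globally ordered.
    out = []
    for p in partitions:
        out.extend(sorted(p))
    return out, True
```

With `num_reducers=1` the second return value is `False`, i.e. the sampling
pass over the input never runs.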

I'll also make sure Pig and Hive use the same number of maps/reduces where
possible and update the doc.

> 2. Lazy type conversion
> Can you add a note about how many records are there in input vs output ?
> In this example, we can improve by using the logical optimizer, so only
> necessary parts are typecast before the filter.
>

I've purposely filtered out all the input records. From the logical
plan, the filter is not pushed above the foreach, which can be a
separate issue that needs investigating. Therefore, each record is
fully deserialized and then thrown away.
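
A rough sketch of what lazy conversion would save in this case, written in
Python for brevity. `LazyRecord` and `filter_lazy` are hypothetical names, not
Pig classes, and tab-separated text input is an assumption. Only the field the
filter reads is typecast; the other fields stay as raw text unless something
asks for them.

```python
class LazyRecord:
    """Holds raw text fields and converts a field only on first access."""

    def __init__(self, raw_fields, converters):
        self._raw = raw_fields
        self._converters = converters
        self._cache = {}
        self.conversions = 0  # how many fields were actually typecast

    def get(self, index):
        if index not in self._cache:
            self._cache[index] = self._converters[index](self._raw[index])
            self.conversions += 1
        return self._cache[index]


def filter_lazy(lines, converters, predicate_field, predicate):
    """Filter tab-separated lines, converting only the field the predicate reads."""
    kept, conversions = [], 0
    for line in lines:
        rec = LazyRecord(line.split("\t"), converters)
        if predicate(rec.get(predicate_field)):
            kept.append(rec)
        conversions += rec.conversions
    return kept, conversions
```

For two three-field records that are both filtered out, this performs two
conversions, where eager deserialization would perform six.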

> One problem in Pig is that it uses Java objects like Integer, String etc.,
> which are final types. This means that we can't create a subclass that
> delays the conversion until it actually gets used.  The types are part of
> the udf interface. We should consider if we want to do something like this,
> when we add new udf interfaces.
>
> Some thoughts on serialization/deserialization improvements that I had
> written earlier - http://wiki.apache.org/pig/AvoidingSedes
>

Thanks for sharing these thoughts! I'll incorporate them into the doc
and discuss more details later.

Jie

> Thanks,
> Thejas
>
> On 6/21/12 11:14 AM, Jie Li wrote:
>>
>> Hello everyone,
>>
>> I compiled a list of possible optimizations for Pig's performance.
>>
>>
>> https://cwiki.apache.org/confluence/display/PIG/Pig+Performance+Optimization
>>
>> As I'm not yet very familiar with the codebase, I'm likely to
>> underestimate the complexity involved, so any input will be
>> appreciated.
>>
>> Thanks,
>> Jie