Thanks Thejas for the comments! See my answers inline.
> 1. Order-by
> The comparison against hive order-by is misleading. Hive does not do total
> ordering, unless you use a single reducer.
> But yes, in case of pig, the sampling phase is unnecessary, if you use a
> single reducer. A single reducer can make sense if the data you are sorting
> is small. I agree that it makes sense to remove the sampling phase in pig in
> such cases.
Yes, the environment setup uses only 1 GB of data, so there is only 1
reducer for the order-by. I've also updated the doc to note that Hive
always uses 1 reducer for the order-by.
I'll also make sure Pig and Hive use the same number of maps/reduces
where possible, and update the doc accordingly.
> 2. Lazy type conversion
> Can you add a note about how many records are there in input vs output ?
> In this example, we can improve by using the logical optimizer, so only
> necessary parts are typecast before the filter.
I've purposely filtered out all the input records, so the output is
empty. According to the logical plan, the filter is not pushed above
the foreach, which may be a separate issue that needs investigating.
As a result, each record is fully deserialized and then thrown away.
> One problem in pig is that it uses java objects like Integer, String etc
> which are final types. Which means that we can't create a subclass that
> delays the conversion until it actually gets used. The types are part of
> the udf interface. We should consider if we want to do something like this,
> when we add new udf interfaces.
> Some thoughts on serialization/deserialization improvements that i had
> written earlier - http://wiki.apache.org/pig/AvoidingSedes
Thanks for sharing these thoughts! I'll incorporate them into the doc
and discuss the details later.
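
To make the lazy conversion idea concrete, here is a minimal sketch in
plain Java. It is not Pig code: LazyInt and LazyConversionDemo are
hypothetical names I made up for illustration, and the sketch assumes a
field arrives as UTF-8 text bytes. Since Integer is final, the wrapper
holds the raw bytes and parses them only on first access.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper (not a Pig class): parses its bytes on first use. */
final class LazyInt {
    private final byte[] raw;  // serialized form as read from the input
    private Integer parsed;    // cached value, null until first access

    LazyInt(byte[] raw) {
        this.raw = raw;
    }

    /** Converts on the first call and returns the cached value afterwards. */
    int get() {
        if (parsed == null) {
            String text = new String(raw, StandardCharsets.UTF_8).trim();
            parsed = Integer.parseInt(text);
        }
        return parsed;
    }

    /** True once the bytes have actually been converted. */
    boolean isParsed() {
        return parsed != null;
    }
}

public class LazyConversionDemo {
    public static void main(String[] args) {
        String[] lines = {"1,10", "2,20", "3,30"};
        int conversions = 0;
        int sum = 0;
        for (String line : lines) {
            String[] fields = line.split(",");
            LazyInt a = new LazyInt(fields[0].getBytes(StandardCharsets.UTF_8));
            LazyInt b = new LazyInt(fields[1].getBytes(StandardCharsets.UTF_8));
            // The filter reads only the first field; it rejects every
            // record here, so the second field is never converted.
            if (a.get() > 5) {
                sum += b.get();
            }
            if (a.isParsed()) conversions++;
            if (b.isParsed()) conversions++;
        }
        System.out.println("sum=" + sum + ", conversions=" + conversions);
    }
}
```

With a filter that rejects every record, as in my test, only the field
the filter reads gets converted. Whether something like this can fit
behind the existing udf interface is exactly the open question Thejas
raised above.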
> On 6/21/12 11:14 AM, Jie Li wrote:
>> Hello everyone,
>> I compiled a list of possible optimizations for Pig's performance.
>> As I haven't been very familiar with the codebase, I'm likely to
>> underestimate the complexity involved, so any input will be