Re: COUNT(A.field1)
Yes, Zebra has a columnar storage format.
Regarding selective deserialization (i.e. deserializing only the columns that
are actually needed for the Pig query): as I understand it, elephant-bird has
a protocol-buffer-based loader that does lazy deserialization.
PigStorage also does something similar: when PigStorage is used to load
data, it returns fields as bytearrays, and Pig adds a type-casting FOREACH
after the load that converts only the fields required by the rest of the
query.
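A toy sketch of that lazy-deserialization idea (illustrative only, not Pig's actual code, and the function names here are made up): the loader hands back raw fields without casting, and only the fields the rest of the query touches ever get converted to a real type.

```python
# Toy model of lazy deserialization: load raw, cast only what's needed.

def load_line(line):
    """Split a tab-delimited record into raw string fields; no casting yet
    (stands in for PigStorage returning bytearrays)."""
    return line.rstrip("\n").split("\t")

def project_and_cast(raw_fields, wanted):
    """Cast only the fields the query actually uses; `wanted` maps a
    field index to a cast function (stands in for the FOREACH Pig adds)."""
    return {i: cast(raw_fields[i]) for i, cast in wanted.items()}

record = load_line("alice\t42\t3.14\n")
# A query that only needs field 1 as an int casts just that one field:
needed = project_and_cast(record, {1: int})
```

The other two fields stay as raw strings and never pay the casting cost, which is the whole point of the optimization.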


On 9/3/10 8:05 PM, "Renato Marroquín Mogrovejo" wrote:

> Thanks Dmitriy! Hey, a couple of final questions please.
> Which deserializers implement this selective deserialization?
> And the columnar storage used is Zebra?
> Thanks again for the great replies.
> Renato M.
> 2010/9/2 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> Pig has selective deserialization and columnar storage if the loader you
>> are using implements it. So that depends on what you are doing. Naturally,
>> if your data is not stored in a way that separates the columns, Pig can't
>> magically read them separately :).
>> You should try to always use combiners.
>> -D
>> On Thu, Sep 2, 2010 at 2:51 PM, Renato Marroquín Mogrovejo <
>> [EMAIL PROTECTED]> wrote:
>>> So in terms of performance it's the same whether I count just a single column or
>>> the whole data set, right?
>>> But is what Thejas said about the loader having optimizations (selective
>>> deserialization or columnar storage) something that Pig actually has, or
>>> is it something planned for the future?
>>> And hey, shouldn't using a combiner be something we try to avoid? I
>>> mean for the COUNT case a combiner is needed, but are there any other
>>> operations that get put into that combiner, like trying to reuse the
>>> computation being done?
>>> Thanks for the replies!
>>> Renato M.
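For the combiner question above, a minimal model of why a combiner helps COUNT (a sketch of the MapReduce pattern, not Pig internals; all names here are illustrative): each map task collapses its rows into one partial count per group, and the reducer just sums the partials instead of seeing every row.

```python
# Sketch: map-side partial counts (the combiner) plus a summing reducer.
from collections import Counter

def map_side_count(rows, key_fn):
    """Combiner stage: collapse one map task's rows into per-group counts."""
    partial = Counter()
    for row in rows:
        partial[key_fn(row)] += 1
    return partial

def reduce_counts(partials):
    """Reducer stage: sum the partial counts from every map task."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Two "map tasks", grouping by the first field:
p1 = map_side_count([("a", 1), ("a", 2), ("b", 3)], key_fn=lambda r: r[0])
p2 = map_side_count([("a", 4), ("b", 5)], key_fn=lambda r: r[0])
totals = reduce_counts([p1, p2])
```

Only two small Counters cross the shuffle instead of five rows, which is why COUNT is one of the operations Pig pushes into the combiner when it can.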
>>> 2010/8/29 Mridul Muralidharan <[EMAIL PROTECTED]>
>>>> The reason COUNT(a.field1) would have better performance is that Pig does
>>>> not 'know' what is required from a tuple in the case of COUNT(a).
>>>> In a custom mapred job we can optimize this away so that only the single
>>>> required field is projected out, but that is obviously not possible here
>>>> (COUNT is a UDF), so the entire tuple is deserialized from input.
>>>> Of course, the performance difference, as Dmitriy noted, would not be very
>>>> high.
>>>> Regards,
>>>> Mridul
>>>> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>>>>> Hi, this is also interesting and kinda confusing for me too.
>>>>> From the db world, the second one would have better performance, but Pig
>>>>> doesn't keep statistics on the data, so it has to read the whole file
>>>>> anyway. And since the count operation is mainly done on the map side,
>>>>> all attributes will be read anyway, but the ones that are not interesting
>>>>> to us will be discarded and not passed to the reducer part of the job.
>>>>> Besides, wouldn't the presence of null values affect performance? For
>>>>> example, if a2 had many null values, then fewer values would be passed
>>>>> too, right?
>>>>> Renato M.
>>>>> 2010/8/27 Mridul Muralidharan<[EMAIL PROTECTED]>
>>>>>> On second thoughts, that part is obvious - duh
>>>>>> - Mridul
>>>>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>>>>>> But it does for COUNT(A.a2) ?
>>>>>>> That is interesting, and somehow weird :)
>>>>>>> Thanks !
>>>>>>> Mridul
>>>>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2 and
>>>>>>>> a3, and will project all of them.
>>>>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>>>>> <[EMAIL PROTECTED]> wrote: