Pig >> mail # user >> COUNT(A.field1)


Re: COUNT(A.field1)
Yes, Zebra has a columnar storage format.
Regarding selective deserialization (i.e. deserializing only the columns that
are actually needed for the Pig query): as far as I understand, elephant-bird
has a protocol-buffer-based loader which does lazy deserialization.
PigStorage also does something similar. When PigStorage is used to load
data, it returns fields of type bytearray, and Pig adds a type-casting
foreach after the load, which does the type conversion only on the fields
that are required in the rest of the query.

-Thejas
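
A minimal sketch of the "load as raw bytes, cast only what is needed" idea
described above. This is a conceptual illustration in Python, not Pig's or
PigStorage's actual code; the function names `load_raw` and `cast_required`
and the sample record are assumptions made up for the example.

```python
# Conceptual sketch only; not how Pig or PigStorage is implemented.
from typing import Callable, Dict, List


def load_raw(line: bytes, delimiter: bytes = b"\t") -> List[bytes]:
    """Split a record into raw byte fields without converting any of them."""
    return line.rstrip(b"\n").split(delimiter)


def cast_required(
    raw: List[bytes], casts: Dict[int, Callable[[bytes], object]]
) -> Dict[int, object]:
    """Convert only the columns that the rest of the query refers to."""
    return {idx: cast(raw[idx]) for idx, cast in casts.items()}


# Hypothetical three-column record; only column 0 is used downstream.
record = b"42\tnot-a-number\t3.5\n"
raw = load_raw(record)
needed = cast_required(raw, {0: int})
print(needed)
```

Column 1 is never converted, so its contents cost nothing beyond the split.
The same reasoning is why projecting a single field before an aggregate can
save some work compared with handing over the whole tuple.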

On 9/3/10 8:05 PM, "Renato Marroquín Mogrovejo"
<[EMAIL PROTECTED]> wrote:

> Thanks Dmitriy! Hey, a couple of final questions please.
> Which are the deserializers that implement this selective deserialization?
> And the columnar storage used is Zebra?
> Thanks again for the great replies.
>
> Renato M.
>
> 2010/9/2 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>
>> Pig has selective deserialization and columnar storage if the loader you
>> are using implements it. So that depends on what you are doing. Naturally,
>> if your data is not stored in a way that separates the columns, Pig can't
>> magically read them separately :).
>>
>> You should try to always use combiners.
>>
>> -D
>>
>>
>> On Thu, Sep 2, 2010 at 2:51 PM, Renato Marroquín Mogrovejo <
>> [EMAIL PROTECTED]> wrote:
>>
>>> So in terms of performance, it is the same whether I count just a single
>>> column or the whole data set, right?
>>> But is what Thejas said about the loader having optimizations (selective
>>> deserialization or columnar storage) something that Pig actually has, or
>>> is it something planned for the future?
>>> And hey, shouldn't using a combiner be something we try to avoid? I
>>> mean, for the COUNT case a combiner is needed, but are there any other
>>> operations that are put into that combiner, like trying to reuse the
>>> computation being made?
>>> Thanks for the replies.
>>>
>>> Renato M.
>>>
>>>
>>> 2010/8/29 Mridul Muralidharan <[EMAIL PROTECTED]>
>>>
>>>
>>>>
>>>> The reason COUNT(a.field1) would perform better is that Pig does
>>>> not 'know' what is required from a tuple in the case of COUNT(a).
>>>> In a custom mapred job, we can optimize this so that only the single
>>>> required field is projected out, but that is obviously not possible here
>>>> (COUNT is a UDF), so the entire tuple is deserialized from the input.
>>>>
>>>> Of course, the performance difference, as Dmitriy noted, would not be very
>>>> high.
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>>>>
>>>>> Hi, this is also interesting and kinda confusing for me too.
>>>>> From the db world, the second one would have better performance, but
>>>>> Pig doesn't keep statistics on the data, so it has to read the whole file
>>>>> anyway. Since the count operation is mainly done on the map side, all
>>>>> attributes will be read anyway, but the ones that are not interesting for
>>>>> us will be discarded and not passed to the reduce part of the job.
>>>>> Besides, wouldn't the presence of null values affect the performance? For
>>>>> example, if a2 had many null values, then fewer values would be passed
>>>>> too, right?
>>>>>
>>>>> Renato M.
>>>>>
>>>>> 2010/8/27 Mridul Muralidharan<[EMAIL PROTECTED]>
>>>>>
>>>>>
>>>>>> On second thoughts, that part is obvious - duh
>>>>>>
>>>>>> - Mridul
>>>>>>
>>>>>>
>>>>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>>>>>
>>>>>>
>>>>>>> But it does for COUNT(A.a2) ?
>>>>>>> That is interesting, and somehow weird :)
>>>>>>>
>>>>>>> Thanks !
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>>>>>
>>>>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2 and
>>>>>>>> a3, and project all of them.
>>>>>>>>
>>>>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>>>>> <[EMAIL PROTECTED]> wrote: