Pig >> mail # user >> COUNT(A.field1)


Re: COUNT(A.field1)
Wow... thanks for all the discussion and insight, guys.

On Aug 29, 2010, at 10:01 AM, Mridul Muralidharan wrote:

>
>
> The reason COUNT(a.field1) would perform better is that Pig does not 'know' which fields are required from a tuple in the case of COUNT(a).
> In a custom mapred job, we could optimize so that only the single required field is projected out, but that is obviously not possible here (COUNT is a UDF), so the entire tuple is deserialized from input.
>
> Of course, the performance difference, as Dmitriy noted, would not be very high.
>
>
> Regards,
> Mridul
>
>
> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>> Hi, this is also interesting and kinda confusing for me too.
>> From the db world, the second one would have better performance, but Pig
>> doesn't keep statistics on the data, so it has to read the whole file
>> anyway. Since the count operation is mainly done on the map side, all
>> attributes will be read anyway, but the ones we are not interested in
>> will be dropped and not passed to the reducer part of the job.
>> Besides, wouldn't the presence of null values affect the performance? For
>> example, if a2 had many null values, then fewer values would be passed
>> too, right?
>>
>> Renato M.
>>
>> 2010/8/27 Mridul Muralidharan<[EMAIL PROTECTED]>
>>
>>>
>>> On second thoughts, that part is obvious - duh
>>>
>>> - Mridul
>>>
>>>
>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>>
>>>>
>>>> But it does for COUNT(A.a2) ?
>>>> That is interesting, and somehow weird :)
>>>>
>>>> Thanks !
>>>> Mridul
>>>>
>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>>
>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2 and
>>>>> a3, and will project all of them.
>>>>>
>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>> <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>   wrote:
>>>>>
>>>>>
>>>>>     I am not sure why the second option is better - in both cases, you are
>>>>>     shipping only the combined counts from map to reduce.
>>>>>     On the other hand, the first could be better since it means we need to
>>>>>     project only 'a1' - and none of the other fields.
>>>>>
>>>>>     Or did I miss something here?
>>>>>     I am not very familiar with what Pig does in this case right now.
>>>>>
>>>>>     Regards,
>>>>>     Mridul
>>>>>
>>>>>
>>>>>     On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>>>>>
>>>>>         Generally speaking, the second option will be more performant as it might
>>>>>         let you drop column a3 early. In most cases the magnitude of this is likely
>>>>>         to be very small as COUNT is an algebraic function, so most of the work is
>>>>>         done map-side anyway, and only partial, pre-aggregated counts are shipped
>>>>>         from mappers to reducers. However, if A is very wide, or a column store, or
>>>>>         has non-negligible deserialization cost that can be offset by only
>>>>>         deserializing a few fields -- the second option is better.
>>>>>
>>>>>         -D
>>>>>
>>>>>         On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes<[EMAIL PROTECTED]
>>>>>         <mailto:[EMAIL PROTECTED]>>    wrote:
>>>>>
>>>>>             Wondering about performance and count...
>>>>>             A =  load 'test.csv' as (a1,a2,a3);
>>>>>             B = GROUP A by a1;
>>>>>             -- which preferred?
>>>>>             C = FOREACH B GENERATE COUNT(A);
>>>>>             -- or would this only send a single field through the COUNT and be more performant?
>>>>>             C = FOREACH B GENERATE COUNT(A.a2);
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
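
Editor's sketch of the point made in the thread: COUNT is algebraic, so each mapper can pre-aggregate and ship only per-key partial counts, and projecting fields away earlier changes how much is deserialized, not the result. The following is a toy Python model of that idea, not Pig's actual implementation. The function names (map_side_counts, reduce_counts, project) and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter

def map_side_counts(rows, key_index=0):
    """Combiner-style partial counts for one mapper's input split."""
    partial = Counter()
    for row in rows:
        partial[row[key_index]] += 1
    return partial

def reduce_counts(partials):
    """Sum the partial counts shipped from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total

def project(rows, keep):
    """Keep only the listed field positions, like an early projection."""
    return [tuple(row[i] for i in keep) for row in rows]

# Two hypothetical input splits with fields (a1, a2, a3).
split_1 = [("x", 1, "aaa"), ("y", 2, "bbb"), ("x", 3, "ccc")]
split_2 = [("x", 4, "ddd"), ("y", 5, "eee")]
splits = (split_1, split_2)

# Count over whole tuples, loosely analogous to COUNT(A).
full = reduce_counts(map_side_counts(s) for s in splits)

# Count after dropping a3 first, loosely analogous to COUNT(A.a2).
slim = reduce_counts(map_side_counts(project(s, (0, 1))) for s in splits)

# Records crossing the map/reduce boundary: one (key, count) pair per key per split.
shipped = sum(len(map_side_counts(s)) for s in splits)
input_rows = sum(len(s) for s in splits)

print(dict(full), dict(slim), shipped, input_rows)
```

In this model both variants produce identical per-key counts and ship the same number of partial records, which matches the observation that the difference is usually small. Any saving from the narrower form would come from how much of each tuple has to be read and deserialized, which this sketch does not measure.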