Pig >> mail # user >> COUNT(A.field1)


Re: COUNT(A.field1)
Wow...thanks for all the discussion and insight guys.

On Aug 29, 2010, at 10:01 AM, Mridul Muralidharan wrote:

>
>
> The reason COUNT(a.field1) would perform better is that Pig does not 'know' what is required from a tuple in the case of COUNT(a).
> In a custom mapred job, we can optimize so that only the single required field is projected out, but that is obviously not possible here (COUNT is a UDF), so the entire tuple is deserialized from the input.
>
> Of course, the performance difference, as Dmitriy noted, would not be very high.
>
>
> Regards,
> Mridul
>
>
> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>> Hi, this is also interesting and kind of confusing for me too.
>> From the db world, the second one would have better performance, but Pig
>> doesn't keep statistics on the data, so it has to read the whole file
>> anyway. Since the count operation is mainly done on the map side, all
>> attributes will be read anyway, but the ones we are not interested in
>> will be dropped and not passed to the reducer part of the job.
>> Besides, wouldn't the presence of null values affect the performance? For
>> example, if a2 had many null values, then fewer values would be passed
>> too, right?
>>
>> Renato M.
>>
>> 2010/8/27 Mridul Muralidharan<[EMAIL PROTECTED]>
>>
>>>
>>> On second thoughts, that part is obvious - duh
>>>
>>> - Mridul
>>>
>>>
>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>>
>>>>
>>>> But it does for COUNT(A.a2) ?
>>>> That is interesting, and somehow weird :)
>>>>
>>>> Thanks !
>>>> Mridul
>>>>
>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>>
>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2 and
>>>>> a3, and project all of them.
>>>>>
>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>> <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>   wrote:
>>>>>
>>>>>
>>>>>     I am not sure why the second option is better - in both cases, you are
>>>>>     shipping only the combined counts from map to reduce.
>>>>>     On the other hand, the first could be better since it means we need to
>>>>>     project only 'a1' - and none of the other fields.
>>>>>
>>>>>     Or did I miss something here?
>>>>>     I am not very familiar with what Pig does in this case right now.
>>>>>
>>>>>     Regards,
>>>>>     Mridul
>>>>>
>>>>>
>>>>>     On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>>>>>
>>>>>         Generally speaking, the second option will be more performant as it might
>>>>>         let you drop column a3 early. In most cases the magnitude of this is likely
>>>>>         to be very small as COUNT is an algebraic function, so most of the work is
>>>>>         done map-side anyway, and only partial, pre-aggregated counts are shipped
>>>>>         from mappers to reducers. However, if A is very wide, or a column store, or
>>>>>         has non-negligible deserialization cost that can be offset by only
>>>>>         deserializing a few fields -- the second option is better.
>>>>>
>>>>>         -D
>>>>>
>>>>>         On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes<[EMAIL PROTECTED]
>>>>>         <mailto:[EMAIL PROTECTED]>>    wrote:
>>>>>
>>>>>             Wondering about performance and count...
>>>>>             A =  load 'test.csv' as (a1,a2,a3);
>>>>>             B = GROUP A by a1;
>>>>>             -- which preferred?
>>>>>             C = FOREACH B GENERATE COUNT(A);
>>>>>             -- or would this only send a single field through the COUNT
>>>>>             -- and be more performant?
>>>>>             C = FOREACH B GENERATE COUNT(A.a2);
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
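
A minimal sketch of the two forms discussed in this thread, with the projection made explicit so that the script does not rely on Pig working out which columns COUNT needs. The relation names A_slim, C_tuples and C_a2 are illustrative and do not come from the original posts, and whether the explicit projection helps measurably depends on the loader and on how wide the input is.

```pig
-- Sketch only: A_slim, C_tuples and C_a2 are hypothetical names.
A = LOAD 'test.csv' AS (a1, a2, a3);

-- Project explicitly so that a3 is dropped before the GROUP.
A_slim = FOREACH A GENERATE a1, a2;

B = GROUP A_slim BY a1;

-- COUNT over whole tuples versus COUNT over a single projected field.
C_tuples = FOREACH B GENERATE group, COUNT(A_slim);
C_a2     = FOREACH B GENERATE group, COUNT(A_slim.a2);
```

On the null question raised above: the Pig documentation describes COUNT as skipping tuples whose first field is null, so C_tuples and C_a2 may return different counts when a2 contains nulls. In releases that provide it, COUNT_STAR is the documented alternative that counts every tuple.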