Pig >> mail # user >> Please help with grouped count


Mark 2012-05-11, 16:47
Re: Please help with grouped count
Also, using your example, how could I limit the number of terms per country?

On 5/11/12 9:47 AM, Mark wrote:
> Thank you so much, that's pretty much what I was going for but with a
> slightly different output.
>
> Just to be clear... are these equivalent?
>
> b = foreach (group a by (country, search_term)) generate
>     flatten(group) as (country, search_term), COUNT(a) as ct;
>
>
> b = group a by (country, search_term);
> c = foreach b generate flatten(group) as (country, search_term),
>     COUNT(a) as ct;
>
> I'm guessing so... I didn't know you could combine/nest these statements.
>
>
> After experimenting with your example I'm pretty sure I understand
> everything that's going on. I can work with this format, but I was
> wondering how I would massage this into something like:
>
> (country1, topterm1, topterm2, topterm3, ...)
> (country2, topterm1, topterm2, topterm3, ...)
> (country3, topterm1, topterm2, topterm3, ...)
>
> Maybe it has to be something like this:
>
> (country1, (topterm1, topterm2, topterm3, ...))
>
> So one row per country with the first value being the country and the
> following values the top terms in order? Is this even possible with Pig?
>
> Thanks for the clarification.
>
>
> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>> a = load 'log' as (country:chararray, search_term:chararray);
>> b = foreach (group a by (country, search_term)) generate
>>     flatten(group) as (country, search_term), COUNT(a) as ct;
>> c = order b by country asc, ct desc;
>>
>> It sort of depends on what format you want the output in, though. Note:
>> if you know that the number of search terms is low, you could do this
>> in memory in one MapReduce job, but this version will be scalable.
>>
>> If this solution doesn't make sense, I can help explain it. It's
>> important to know what format you want the output in. This would give
>> you every country (in ascending alphabetical order), and then each
>> search term with its count, starting with the highest count.
>>
>> 2012/5/10 Mark<[EMAIL PROTECTED]>
>>
>>> We have logs in the following format
>>>
>>> us, foo
>>> us, foo
>>> fr, fizz
>>> us, bar
>>> fr, baz
>>> fr, fizz
>>> us, foo
>>> fr, fizz
>>>
>>> Where the first column is a country and the second column is a
>>> search term.
>>>
>>> How in the world can I output the country followed by the top terms,
>>> ordered from most to least frequent, i.e.:
>>>
>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>
>>> Thanks
>>>
>>>
>>>
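
For the question at the top of this message (limiting the number of terms per country), one possible approach is a nested foreach that sorts and limits each country's bag. This is a rough sketch, not something posted in the thread: it reuses the load statement and the relation b from the quoted example, while the aliases by_country, sorted, top_n and top_terms and the cutoff of 3 are made up for illustration.

```pig
-- Same load and per-(country, search_term) count as in the quoted example.
a = load 'log' as (country:chararray, search_term:chararray);
b = foreach (group a by (country, search_term)) generate
        flatten(group) as (country, search_term), COUNT(a) as ct;

-- Regroup the counts by country only, so each country gets one bag of
-- (country, search_term, ct) tuples.
by_country = group b by country;

-- Inside the nested block, sort that bag by count and keep the first few.
-- The cutoff 3 is an arbitrary, assumed value.
d = foreach by_country {
        sorted = order b by ct desc;
        top_n  = limit sorted 3;
        generate group as country, top_n.search_term as top_terms;
    };

dump d;
```

This should give one row per country with a bag of terms, roughly the (country1, (topterm1, topterm2, ...)) shape asked about above. Pig bags are formally unordered, so the order of terms inside the bag may not survive later operations on it.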