Pig, mail # user - Please help with grouped count


Mark 2012-05-11, 00:23
Jonathan Coveney 2012-05-11, 00:32
Mark 2012-05-11, 16:47
Re: Please help with grouped count
Mark 2012-05-11, 16:48
Also, using your example, how could I limit the number of terms per country?

On 5/11/12 9:47 AM, Mark wrote:
> Thank you so much, that's pretty much what I was going for but with a
> slightly different output.
>
> Just to be clear... are these equivalent?
>
> b = foreach (group a by (country, search_term)) generate
>     flatten(group) as (country, search_term), COUNT(a) as ct;
>
>
> b = group a by (country, search_term);
> c = foreach b generate
>     flatten(group) as (country, search_term), COUNT(a) as ct;
>
> I'm guessing so... I didn't know you could combine/nest these statements.
>
>
> After experimenting with your example I'm pretty sure I understand
> everything that's going on. I can work with this format, but I was
> wondering how I would massage this into something like:
>
> (country1, topterm1, topterm2, topterm3, ...)
> (country2, topterm1, topterm2, topterm3, ...)
> (country3, topterm1, topterm2, topterm3, ...)
>
> Maybe it has to be something like this:
>
> (country1, (topterm1, topterm2, topterm3, ...))
>
> So one row per country with the first value being the country and the
> following values the top terms in order? Is this even possible with Pig?
>
> Thanks for the clarification.
>
>
> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>> a = load 'log' as (country:chararray, search_term:chararray);
>> b = foreach (group a by (country, search_term)) generate
>>     flatten(group) as (country, search_term), COUNT(a) as ct;
>> c = order b by country asc, ct desc;
>>
>> It sort of depends what format you want the output in, though. Note: if
>> you know that the number of search terms is low, you could do this in
>> memory and do it in one m/r job, but this version will be scalable.
>>
>> If this solution doesn't make sense, I can help explain it. It's important
>> to know what format you want the output in. This would give you every
>> country (in ascending alphabetical order), and then the search term and
>> count starting with the highest.
>>
>> 2012/5/10 Mark<[EMAIL PROTECTED]>
>>
>>> We have logs in the following format
>>>
>>> us, foo
>>> us, foo
>>> fr, fizz
>>> us, bar
>>> fr, baz
>>> fr, fizz
>>> us, foo
>>> fr, fizz
>>>
>>> Where the first column is a country and the second column is a
>>> search term.
>>>
>>> How in the world can I output the country followed by the top terms in
>>> order of occurrence, i.e.:
>>>
>>> us, (foo, bar)     # Top term for 'us' is foo then bar then ...
>>> fr, (fizz, baz)    # Top term for 'fr' is fizz then baz then ...
>>>
>>> Thanks
>>>
>>>
>>>
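
For the per-country limit asked about at the top of this message, and the
one-row-per-country shape asked about in the quoted reply, one possible
approach is a nested foreach that sorts and trims each country's bag. This is
only a sketch, not something confirmed in the thread: it reuses the load
statement and the aliases a and b from the quoted script, while the aliases
by_country, sorted, top_terms, terms and the cutoff of 3 are made up for
illustration.

```pig
-- Same load as in the quoted script; the default loader splits fields on tabs,
-- so this assumes the log is tab-delimited.
a = load 'log' as (country:chararray, search_term:chararray);

-- One row per (country, search_term) with its count.
b = foreach (group a by (country, search_term)) generate
        flatten(group) as (country, search_term), COUNT(a) as ct;

-- Regroup by country, then sort and trim each country's bag.
by_country = group b by country;
c = foreach by_country {
    sorted = order b by ct desc;     -- most frequent term first
    top_terms = limit sorted 3;      -- 3 is an arbitrary example cutoff
    generate group as country, top_terms.search_term as terms;
};

dump c;
```

This should give one row per country holding a bag of at most 3 search terms,
which is close to the (country1, (topterm1, topterm2, ...)) shape. Bags are
formally unordered in Pig, so if the order of terms inside each row matters
downstream, it may be safer to keep ct in the bag (generate top_terms instead
of top_terms.search_term) and sort again where the data is consumed.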
Jonathan Coveney 2012-05-11, 17:49
Mark 2012-05-11, 18:05
Jonathan Coveney 2012-05-11, 21:18