Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Please help with grouped count


Copy link to this message
-
Re: Please help with grouped count
Mark 2012-05-11, 18:05
Not entirely following you on the Order UDF.. doesn one exist or are you
proposing I create this myself?

Thanks.

On 5/11/12 10:49 AM, Jonathan Coveney wrote:
> The nesting and the non-nested are indeed the same. To limit, you could do:
>
> b = group a by (country, search_term);
> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
> as ct;
> d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I
> forget the order of commands, check documentation
>
> the ORDER_UDF doesn't exist. It's important to remember that bags have no
> order, so if you want a TUPLE in term order, you need to take in a bag, and
> then get the elements, sort them, and put them in a Tuple. If you need help
> with that, less us know.
>
> 2012/5/11 Mark<[EMAIL PROTECTED]>
>
>> Also, using your example, how could I limit the number of terms per
>> country?
>>
>>
>> On 5/11/12 9:47 AM, Mark wrote:
>>
>>> Thank you so much, that's pretty much what I was going for but with a
>>> slightly different output.
>>>
>>> Just to be clear... are these equivalent?
>>>
>>> b = foreach (group a by (country, search_term)) generate flatten(group) as
>>> (country, search_term), COUNT(a) as ct;
>>>
>>>
>>> b = group a by (country, search_term);
>>> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
>>> as ct;
>>>
>>> I'm guessing so... I didn't know you could combine/nest these statements.
>>>
>>>
>>> After experimenting with your example I'm pretty sure I understand
>>> everything that's going on. I can work with this format but I was wondering
>>> how would I massage this into something like:
>>>
>>> (country1, top term1, topterm2, topterm3, ...)
>>> (country2, top term1, topterm2, topterm3, ...)
>>> (country3, top term1, topterm2, topterm3, ...)
>>>
>>> Maybe it has to be something like this:
>>>
>>> (country1, (top term1, topterm2, topterm3, ...))
>>>
>>> So one row per country with the first value being the country and the
>>> following values the top terms in order? Is this even possible with Pig?
>>>
>>> Thanks for the clarification.
>>>
>>>
>>> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>>>
>>>> a = load 'log' as (country:chararray, search_term:chararray);
>>>> b = foreach (group a by (country, search_term)) generate flatten(group)
>>>> as
>>>> (country, search_term), COUNT(a) as ct;
>>>> c = order b by country asc, ct desc;
>>>>
>>>> It sort of depends what format you want the output in, though. Note: if
>>>> you
>>>> know that the number of search terms is low you could do this in memory
>>>> and
>>>> do it in one m/r job, but this version will be scalable.
>>>>
>>>> If this solution doesn't make sense, I can help explain it. It's
>>>> important
>>>> to know what format you want the output in. This would give you every
>>>> country (in ascending alphabetical order), and then the search term and
>>>> count starting with the highest.
>>>>
>>>> 2012/5/10 Mark<[EMAIL PROTECTED]**>
>>>>
>>>>   We have logs in the following format
>>>>> us, foo
>>>>> us, foo
>>>>> fr, fizz
>>>>> us, bar
>>>>> fr, baz
>>>>> fr, fizz
>>>>> us, foo
>>>>> fr, fizz
>>>>>
>>>>> Where the first column is a country and the second column is a search
>>>>> term.
>>>>>
>>>>> How in the world can I output the country followed by the top terms in
>>>>> order of occurrence... ie:
>>>>>
>>>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>