Pig >> mail # user >> Simple word count in pig..


jamal sasha 2013-11-19, 23:45
inelu nagamallikarjuna 2013-11-20, 10:24
Re: Simple word count in pig..
Jamal,

You're going to want to use a FLATTEN and another group by. Consider:

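-- Unnest each bag so that every (id, token) pair becomes its own row.
flattened = foreach processed generate id, flatten(tokens) as token;
-- Group by (id, token) and count the rows in each group.
frequency = foreach (group flattened by (id, token)) generate
                flatten(group)   as (id, token),
                COUNT(flattened) as freq;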

Of course, this will spawn another map-reduce job. However, since COUNT is algebraic, Pig will make use of combiners, drastically reducing the amount of data sent to the reducers.
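
To make the "algebraic" point concrete, here is a rough Python sketch (not Pig, and not how Pig is implemented internally) of why a count can be computed in pieces: each map task counts its own (id, token) pairs, and the reducer only sums those partial counts. The function names and the way the pairs are split across two map tasks are assumptions made purely for illustration; the sample rows are the ones from the question.

```python
from collections import Counter

# Sample rows from the question, modelled as (id, bag_of_tokens).
processed = [
    ("foobar", ["foo", "foo", "foobar", "bar"]),
    ("foo", ["bar", "bar"]),
]

def flatten(rows):
    """Mimic FLATTEN: emit one (id, token) pair per token in each bag."""
    for row_id, tokens in rows:
        for token in tokens:
            yield (row_id, token)

def combine(pairs):
    """Mimic a combiner: count the pairs seen by a single map task."""
    return Counter(pairs)

def reduce_counts(partials):
    """Mimic the reducer: sum the partial counts from every map task."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

pairs = list(flatten(processed))

# Hypothetical split: deal the pairs out to two pretend map tasks so that
# the same (id, token) key shows up in both partial results.
splits = [pairs[0::2], pairs[1::2]]
partials = [combine(split) for split in splits]

# Only the small partial counters travel to the "reducer", not every pair.
frequency = reduce_counts(partials)

for (row_id, token), freq in sorted(frequency.items()):
    print(row_id, token, freq)
```

Because summing partial counts gives the same answer as counting everything in one place, the work can be pushed to the map side, which is the property the combiner relies on.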

--jacob
@thedatachef

On Nov 19, 2013, at 5:45 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have data already processed in the following form:
>
>
> ( id ,{ bag of words})
> So for example:
>
> (foobar, {(foo), (foo),(foobar),(bar)})
> (foo,{(bar),(bar)})
>
> and so on..
> describe processed gives me:
> processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
>
>
> Now I also want to count the number of times each word appears in this
> data and output it as
> foobar,foo,2
> foobar,foobar,1
> foobar,bar,1
> foo,bar,2
>
> and so on...
>
> How do I do this in Pig?