Re: Simple word count in pig..
Hi,

Please go through the following code:

Input Data:
-----------
DocName    Tokens
--------------
cricket    sachin,sehwag,dravid,dhoni
movie    amir,salman,hruthik,ranveer
cricket    sachin,ganguly,rohit,dhoni
cricket    sehwag,sachin,dravid,kohli
movie    salman,amir,sharukh

==================================================
Pig UDF:
--------
package com.pig.udf;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

/**
 * Takes a bag whose tuples each hold a bag of tokens and returns a string
 * of "token:count" pairs separated by spaces.
 */
public class WordBag extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Outer bag: one tuple per input row of the current group.
        DataBag rows = (DataBag) input.get(0);
        Map<String, Integer> wordcount = new HashMap<String, Integer>();
        for (Tuple row : rows) {
            // Each row holds the bag of tokens produced by TOKENIZE.
            DataBag tokens = (DataBag) row.get(0);
            if (tokens == null) {
                continue;
            }
            for (Tuple tokenTuple : tokens) {
                String token = (String) tokenTuple.get(0);
                Integer count = wordcount.get(token);
                wordcount.put(token, count == null ? 1 : count + 1);
            }
        }
        // Build the output; every "token:count" entry is preceded by a space.
        StringBuilder frequency = new StringBuilder();
        for (Map.Entry<String, Integer> entry : wordcount.entrySet()) {
            frequency.append(' ').append(entry.getKey())
                     .append(':').append(entry.getValue());
        }
        return frequency.toString();
    }
}

Build a jar from the above UDF and register it in the Pig script:

==================================================
Pig Script:
-----------
register /home/hadoopz/naga/bigdata/pig-0.10.0/pigscripts/wordbag.jar;
news = load '/pig/news' using PigStorage() as (doc:chararray, content:chararray);
words = foreach news generate doc, TOKENIZE(content, ',') as mywords;
describe words;
-- group the rows by document so the UDF receives all token bags of a document
grpwords = group words by doc;
wordcount = foreach grpwords generate group, com.pig.udf.WordBag(words.mywords);
dump wordcount;

==================================================
Output:
-------
docName    Tokens and their Frequency
----------------------------------
(movie, sharukh:1 salman:2 ranveer:1 hruthik:1 amir:2)
(cricket, sehwag:2 kohli:1 rohit:1 ganguly:1 sachin:3 dhoni:2 dravid:2)
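
If you need one row per (doc, token, count), as asked in the original question, the same counting can be emitted in that shape. Below is a minimal sketch in plain Java, assuming Java 8 or later and no Pig dependencies. The class name TokenCountSketch and its methods are hypothetical; they only illustrate the counting logic and are not a Pig UDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokenCountSketch {

    // Counts tokens per document. Each input row is {docName, commaSeparatedTokens}.
    static Map<String, Map<String, Integer>> countByDoc(List<String[]> rows) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String[] row : rows) {
            Map<String, Integer> perDoc =
                    counts.computeIfAbsent(row[0], k -> new TreeMap<>());
            for (String token : row[1].split(",")) {
                perDoc.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Flattens the nested counts into "doc,token,count" lines.
    static List<String> toRows(Map<String, Map<String, Integer>> counts) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> doc : counts.entrySet()) {
            for (Map.Entry<String, Integer> e : doc.getValue().entrySet()) {
                out.add(doc.getKey() + "," + e.getKey() + "," + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same sample data as in the input section above.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"cricket", "sachin,sehwag,dravid,dhoni"});
        rows.add(new String[] {"movie", "amir,salman,hruthik,ranveer"});
        rows.add(new String[] {"cricket", "sachin,ganguly,rohit,dhoni"});
        rows.add(new String[] {"cricket", "sehwag,sachin,dravid,kohli"});
        rows.add(new String[] {"movie", "salman,amir,sharukh"});
        for (String line : toRows(countByDoc(rows))) {
            System.out.println(line);
        }
    }
}
```

The counts agree with the UDF output above; for example, the sketch emits the rows cricket,sachin,3 and movie,amir,2. TreeMap is used only so the rows come out in a stable, sorted order.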

On Wed, Nov 20, 2013 at 5:15 AM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have data already processed in following form:
>
>
> ( id ,{ bag of words})
> So for example:
>
> (foobar, {(foo), (foo),(foobar),(bar)})
> (foo,{(bar),(bar)})
>
> and so on..
> describe processed gives me:
> processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
>
>
> Now what I want is.. also count the number of times a word appears in this
> data and output it as
> foobar, foo, 2
> foobar,foobar,1
> foobar,bar,1
> foo,bar,2
>
> and so on...
>
> How do I do this in pig?
>

--
Thanks and Regards
Nagamallikarjuna