Pig >> mail # user >> Beginner. Help needed in getting started


Mohit Singh 2012-08-30, 04:20
Re: Beginner. Help needed in getting started
Hi Mohit,

Assuming you are using Pig 0.9+, please check this link to learn how to
write user-defined functions (UDFs) in Python:
http://archive.cloudera.com/cdh4/cdh/4/pig/udf.html#python-udfs

You can handle your problem like this:

1. load data from text file
2. pass the data line by line through your UDF; the UDF should take a line
as input and output the line with an additional
time_information field ("morning", "afternoon", "evening", "night")
3. group them by id
4. for each grouped result, filter and calculate the sum of the cost
by time_information
5. write them to file
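
The steps above can be sketched in Python roughly as follows. This is only
an illustration under assumptions, not a tested Pig job: the names
time_bucket and summarize and the hour boundaries for each bucket are made
up for this example, and summarize is a local stand-in for the GROUP and
SUM that Pig would do in steps 3 and 4. As far as I know, Pig's Jython
engine supplies the outputSchema decorator itself; the fallback at the top
is there only so the file also runs under plain Python.

```python
from collections import defaultdict

# Assumption: inside Pig's Jython engine, outputSchema is already defined.
# This no-op fallback only lets the file run under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

BUCKETS = ("morning", "afternoon", "evening", "night")


@outputSchema("bucket:chararray")
def time_bucket(time_string):
    """Map 'YYYY:MM:DD:HH:MM:SS' to a bucket name.

    The hour boundaries below are hypothetical; pick your own.
    """
    hour = int(time_string.strip().split(":")[3])
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def summarize(lines):
    """Local stand-in for steps 3-4: group by id, sum the cost per bucket."""
    totals = defaultdict(lambda: dict.fromkeys(BUCKETS, 0.0))
    for line in lines:
        # Assumes exactly four comma-separated fields per line.
        cust_id, value, _garbage, time_string = [
            field.strip() for field in line.split(",")
        ]
        totals[cust_id][time_bucket(time_string)] += float(value)
    return dict(totals)


# The three sample rows from the original question.
sample = [
    "1, 200, grrrr,2012:12:2:13:00:00",
    "2, 12.22,jlfa,2012:12:4:15:00:00",
    "1, 2, ajf, 2012:12:22:13:56:00",
]
result = summarize(sample)
for cust_id in sorted(result):
    print(cust_id, [result[cust_id][b] for b in BUCKETS])
```

In a real Pig script only time_bucket would live in the UDF file, which you
would register with something like REGISTER 'udfs.py' USING jython AS myfuncs;
(the file name is hypothetical). The grouping and summing would then be done
in Pig Latin with GROUP and FOREACH rather than in Python.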

additional reference:
http://ofps.oreilly.com/titles/9781449302641/index.html

--
Thanks,
TianYi

Not a native English speaker; correct me if I made mistakes.

On Thu, Aug 30, 2012 at 2:20 PM, Mohit Singh <[EMAIL PROTECTED]> wrote:

> I am new to Hadoop and all its derivatives, and I am really getting
> intimidated by the abundance of information available.
>
> But one thing I have realized is that to start implementing or using Hadoop
> or distributed code, one has to basically change the way one thinks about a
> problem.
>
> I was wondering if someone can help me with the following.
>
> So, basically (like anyone else), I have raw data. I want to parse it,
> extract some information, then run some algorithm and save the results.
>
> Let's say I have a text file "foo.txt" where the data looks like:
>
>  id,$value,garbage_field,time_string\n
>   1, 200, grrrr,2012:12:2:13:00:00
>   2, 12.22,jlfa,2012:12:4:15:00:00
>   1, 2, ajf, 2012:12:22:13:56:00
>
> As you can see, the id can be repeated. Each row can be read as how much
> money a customer has spent. What I want to do is save the result in a file
> which contains how much money each customer has spent in the
> "morning", "afternoon", "evening", and "night". (You can define your own time
> buckets for what counts as morning and so on.) For example, here probably:
>
>      1, 0,202,0,0
> 1 is the id, 0 --> $0 spent in the morning, $202 in the afternoon, and $0 in
> the evening and at night.
>
> I already have Python code for this, but I have to implement it in Pig to
> get started. If anyone can write it or guide me through it, that is all I
> need to get started.
>
> Thanks
>