Pig, mail # user - Beginner. Help needed in getting started


Mohit Singh 2012-08-30, 04:20
Re: Beginner. Help needed in getting started
TianYi Zhu 2012-08-30, 04:51
Hi Mohit,

Assuming you are using Pig 0.9+, please check this link to learn how to
write user-defined functions (UDFs) in Python:
http://archive.cloudera.com/cdh4/cdh/4/pig/udf.html#python-udfs

For your problem, you can handle it like this:

1. Load the data from the text file.
2. Pass the data line by line through your UDF. The UDF should take a line
as input and output the line with an additional
time_information field ("morning", "afternoon", "evening").
3. Group the records by id.
4. For each grouped result, filter by time_information and calculate the
sum of the cost for each bucket.
5. Write the results to a file.
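
As a rough illustration of step 2, here is a minimal sketch of what such a Python UDF might look like. The file name udfs.py, the function names, and the hour boundaries for each bucket are all assumptions made for the example; the original question leaves the bucket definitions open. The totals_by_id helper is not part of Pig at all: it is only a plain-Python stand-in for steps 3 and 4 so the logic can be checked locally against the sample rows.

```python
# udfs.py: hypothetical file name for the UDF module
from collections import defaultdict

try:
    outputSchema  # normally provided by Pig when the script runs under Jython
except NameError:
    # Fallback so the file also runs under plain Python for local testing.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

BUCKETS = ("morning", "afternoon", "evening", "night")


@outputSchema("time_information:chararray")
def time_bucket(time_string):
    """Map 'YYYY:MM:DD:HH:MM:SS' to a bucket name (boundaries are assumed)."""
    hour = int(time_string.strip().split(":")[3])
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def totals_by_id(lines):
    """Local stand-in for steps 3 and 4: group by id, sum cost per bucket."""
    totals = defaultdict(lambda: dict.fromkeys(BUCKETS, 0.0))
    for line in lines:
        # Fields: id, value, garbage_field, time_string
        cust_id, value, _garbage, time_string = [f.strip() for f in line.split(",")]
        totals[cust_id][time_bucket(time_string)] += float(value)
    return dict(totals)


# The three sample rows from the original question.
sample = [
    "1, 200, grrrr,2012:12:2:13:00:00",
    "2, 12.22,jlfa,2012:12:4:15:00:00",
    "1, 2, ajf, 2012:12:22:13:56:00",
]
result = totals_by_id(sample)
```

In a Pig script, a file like this would typically be registered with something like REGISTER 'udfs.py' USING jython AS myfuncs; and then called as myfuncs.time_bucket(time_string) inside a FOREACH ... GENERATE. The grouping and summing would be done with Pig's own GROUP and SUM rather than with the Python helper above.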

Additional reference:
http://ofps.oreilly.com/titles/9781449302641/index.html

--
Thanks,
TianYi

Not a native English speaker; correct me if I made mistakes.

On Thu, Aug 30, 2012 at 2:20 PM, Mohit Singh <[EMAIL PROTECTED]> wrote:

> I am new to Hadoop and all its derivatives, and I am really getting
> intimidated by the abundance of information available.
>
> But one thing I have realized is that to start implementing or using Hadoop or
> distributed code, one basically has to change the way one thinks about a
> problem.
>
> I was wondering if someone can help me in the following.
>
> So, basically (like anyone else), I have raw data. I want to parse it,
> extract some information, then run some algorithm and save the results.
>
> Let's say I have a text file "foo.txt" where the data looks like:
>
>  id,$value,garbage_field,time_string\n
>   1, 200, grrrr,2012:12:2:13:00:00
>   2, 12.22,jlfa,2012:12:4:15:00:00
>   1, 2, ajf, 2012:12:22:13:56:00
>
> As you can see, the id can be repeated. Each row can be thought of as how much
> money a customer has spent. What I want to do is save the result in a file
> which contains how much money each customer has spent in the
> "morning", "afternoon", "evening", and "night" (you can define your own time
> buckets for what morning and the rest mean). For example, here the output
> would probably be:
>
>      1, 0,202,0,0
> Here 1 is the id, with $0 spent in the morning, $202 in the afternoon, and $0 in
> the evening and at night.
>
> Now, I have Python code for this, but I have to implement it in Pig to
> get started. If anyone can write it or guide me through it, that's all I need
> to get started.
>
> Thanks
>