Mohit Singh 2012-08-30, 04:20
Assuming you are using Pig 0.9+, please check this link and learn how to
write user-defined functions in Python:
For your problem, you can handle it like this:
1. Load the data from the text file.
2. Pass the data line by line through your UDF. Your UDF should take a line
as input and output the line with an additional
time_information field ("morning", "afternoon", "evening", "night").
3. Group the lines by id.
4. For each grouped result, filter by time bucket and calculate the sum of the cost.
5. Write the results to a file.
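
Step 2 could be sketched roughly like this. It is only a minimal sketch under some assumptions: the function names (add_time_bucket, bucket_for_hour, summarize) and the hour boundaries are made up for illustration, and the timestamp is assumed to be year:month:day:hour:minute:second as in your sample. The summarize function is not part of the UDF; it just imitates steps 3 and 4 in plain Python so the logic can be checked locally before wiring it into Pig.

```python
# Pig's Jython engine normally supplies the outputSchema decorator when the
# script is registered. This fallback is an assumption for local testing only;
# it keeps the file importable in plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Hypothetical bucket names and boundaries; adjust them to your own definition.
BUCKETS = ("morning", "afternoon", "evening", "night")


def bucket_for_hour(hour):
    """Map an hour (0-23) to a time bucket (assumed boundaries)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


@outputSchema("line:chararray")
def add_time_bucket(line):
    """Take one raw line and return it with the time bucket appended."""
    if line is None:
        return None
    fields = [f.strip() for f in line.split(",")]
    # Fourth field looks like 2012:12:2:13:00:00, so the hour is at index 3.
    hour = int(fields[3].split(":")[3])
    return ",".join(fields + [bucket_for_hour(hour)])


def summarize(lines):
    """Local stand-in for steps 3 and 4: total cost per id and bucket."""
    totals = {}
    for line in lines:
        cust_id, cost, _name, _ts, bucket = add_time_bucket(line).split(",")
        per_bucket = totals.setdefault(cust_id, dict.fromkeys(BUCKETS, 0.0))
        per_bucket[bucket] += float(cost)
    return totals


sample = [
    "1, 200, grrrr,2012:12:2:13:00:00",
    "2, 12.22,jlfa,2012:12:4:15:00:00",
    "1, 2, ajf, 2012:12:22:13:56:00",
]
result = summarize(sample)
```

In Pig, a file like this is typically registered with something like REGISTER 'timeudf.py' USING jython AS timeudf; (the file name and alias are hypothetical), and the grouping and summing would then be done with GROUP and FOREACH rather than in Python.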
I am not a native English speaker, so correct me if I made mistakes.
On Thu, Aug 30, 2012 at 2:20 PM, Mohit Singh <[EMAIL PROTECTED]> wrote:
> I am new to Hadoop and all its derivatives, and I am really getting
> intimidated by the abundance of information available.
> But one thing I have realized is that to start implementing/using Hadoop or
> distributed codes, one has to basically change the way they think about a
> I was wondering if someone can help me in the following.
> So, basically (like anyone else) I have raw data. I want to parse it,
> extract some information, then run some algorithm and save the results.
> Let's say I have a text file "foo.txt" where the data looks like:
> 1, 200, grrrr,2012:12:2:13:00:00
> 2, 12.22,jlfa,2012:12:4:15:00:00
> 1, 2, ajf, 2012:12:22:13:56:00
> As you can see, the id can be repeated. Each line can be read as how much
> money a customer has spent. What I want to do is save the result in a file
> which contains how much money each customer has spent in the
> "morning", "afternoon", "evening", "night" (You can define your own time
> buckets to define what morning and all is. For example here probably
> 1, 0,202,0,0
> 1 is the id, 0 --> $0 spent in the morning, 202 in the afternoon, 0 in the evening and
> Now I have Python code for it, but I have to implement this in Pig to
> get started. If anyone can just write it or guide me through this, that's all I need
> to get started.