Pig >> mail # user >> Beginner. Help needed in getting started


Beginner. Help needed in getting started
I am new to Hadoop and all its derivatives, and I am getting really
intimidated by the abundance of information available.

But one thing I have realized is that to start implementing or using Hadoop
or distributed code, one basically has to change the way one thinks about a
problem.

I was wondering if someone could help me with the following.

So, basically (like anyone else), I have raw data. I want to parse it,
extract some information, then run some algorithm and save the results.

Let's say I have a text file "foo.txt" where the data looks like:

 id,$value,garbage_field,time_string\n
  1, 200, grrrr,2012:12:2:13:00:00
  2, 12.22,jlfa,2012:12:4:15:00:00
  1, 2, ajf, 2012:12:22:13:56:00

As you can see, the id can be repeated. Each row is, say, how much money a
customer has spent. What I want to do is save the result in a file which
contains how much money each customer has spent in the "morning",
"afternoon", "evening" and "night". (You can define your own time buckets
for what morning and so on mean.) For example, here the output would
probably be:

     1, 0,202,0,0
1 is the id, 0 --> $0 spent in the morning, 202 in the afternoon, 0 in the
evening and 0 at night
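
To make the intended computation concrete, here is a minimal Python sketch
of the bucketing described above. It is only an illustration, not the actual
code: the bucket boundaries (morning 6-12, afternoon 12-17, evening 17-21,
night otherwise), the function names, and the assumption that the time
string is year:month:day:hour:minute:second are all hypothetical. It also
assumes the header line is not part of the input and that the garbage field
contains no commas.

```python
from collections import defaultdict

# Output column order for each customer.
BUCKETS = ("morning", "afternoon", "evening", "night")


def bucket_for_hour(hour):
    """Map an hour (0-23) to a bucket name. Boundaries are assumed."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def spend_per_bucket(lines):
    """Sum the value field per id and per time bucket."""
    totals = defaultdict(lambda: dict.fromkeys(BUCKETS, 0.0))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Fields: id, value, garbage_field, time_string
        cust_id, value, _garbage, time_string = [
            field.strip() for field in line.split(",")
        ]
        # Assumed layout: year:month:day:hour:minute:second
        hour = int(time_string.split(":")[3])
        totals[cust_id][bucket_for_hour(hour)] += float(value)
    return {cid: [b[name] for name in BUCKETS] for cid, b in totals.items()}


sample = [
    "1, 200, grrrr,2012:12:2:13:00:00",
    "2, 12.22,jlfa,2012:12:4:15:00:00",
    "1, 2, ajf, 2012:12:22:13:56:00",
]
result = spend_per_bucket(sample)
for cust_id, amounts in sorted(result.items()):
    print(cust_id, amounts)
```

With these assumed boundaries both rows for id 1 fall in the afternoon, so
id 1 ends up with 202 in that bucket and 0 in the others, matching the
example output line above.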

Now, I have Python code for it, but I have to implement this in Pig to get
started. If anyone can write this or guide me through it, that's all I need
to get started.

Thanks