Pig >> mail # user >> Simple .py custom loader for slightly-nested input?


Dan Brickley 2012-07-05, 20:21
RE: Simple .py custom loader for slightly-nested input?
I'm not sure what your desired "final output" is, but below is pseudocode for how I solved a similar problem with Pig and Python.

Use PigStorage with newline as the delimiter (or whatever you are using to denote a new line) in order to throw Pig a "fakie" and have it load each whole line as a single-field tuple.

tv_in = load '$tv_in_path' using PigStorage('\n') as (line:chararray);

Pass each line to a Python UDF:

tv_in2 = foreach tv_in generate udf.explode_tv(line);

That gets the whole line into the Python UDF so that you can do your custom parsing.

Since you don't know in advance how many item:minute pairs a line will contain, you are going to have to decide what you want to return.
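As a rough illustration of what such a UDF might look like, here is a minimal sketch of a udf.py, assuming the field layout from the original question (timestamp, userid, channelid, total_duration, then item:minute pairs). The schema string, the field types, and the null-on-malformed-line behaviour are my assumptions, not something stated in this thread. It also assumes the script is registered as a Jython UDF (along the lines of register 'udf.py' using jython as udf;), in which case Pig supplies the outputSchema decorator; the fallback definition is only there so the file can be run outside Pig.

```python
# Sketch of a udf.py for use as a Pig Jython UDF (assumed setup).
# Pig injects `outputSchema` when it registers the script; the fallback
# below exists only so the file can also be run and tested outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


# The schema string is an assumption about the desired output shape.
@outputSchema("session:tuple(timestamp:chararray, userid:chararray, "
              "channelid:chararray, total_duration:int, "
              "items:bag{item:tuple(itemid:chararray, minutes:int)})")
def explode_tv(line):
    """Parse one session line into (timestamp, userid, channelid, total, items)."""
    fields = line.split()
    if len(fields) < 4:
        return None  # malformed line; Pig receives a null
    timestamp, userid, channelid = fields[0], fields[1], fields[2]
    total_duration = int(fields[3])
    items = []
    for pair in fields[4:]:
        # Split on the last ':' so the minutes are always the final part.
        itemid, minutes = pair.rsplit(":", 1)
        items.append((itemid, int(minutes)))
    # A Python list of tuples is what Pig treats as a bag.
    return (timestamp, userid, channelid, total_duration, items)


if __name__ == "__main__":
    print(explode_tv("2012-03-01T00:01:23Z 0516050342 mchannela 20 "
                     "b00s123k0:19 b0dfgdfgk1:1"))
```

The userid is kept as a string here because the sample values have leading zeros that an integer conversion would drop.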

You could return a bag of item:minute pairs nested inside each session, something like: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemids:bag{iT:tuple(itemid, minutes)})}, or you could create a tuple for each item:minute pair: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemid, minutes)}.
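To make the second option concrete, here is a small plain-Python sketch that produces one flat tuple per item:minute pair. The function name flatten_session is hypothetical, and the field order is assumed from the original question. Returned from a Jython UDF, a list of tuples like this would be seen by Pig as a bag, which you could then FLATTEN in the foreach.

```python
def flatten_session(line):
    """Return one flat tuple per item:minute pair in a session line.

    Each tuple is (timestamp, userid, channelid, total_duration, itemid, minutes).
    """
    fields = line.split()
    timestamp, userid, channelid = fields[0], fields[1], fields[2]
    total_duration = int(fields[3])
    rows = []
    for pair in fields[4:]:
        # Split on the last ':' so the minutes are always the final part.
        itemid, minutes = pair.rsplit(":", 1)
        rows.append((timestamp, userid, channelid, total_duration,
                     itemid, int(minutes)))
    return rows


if __name__ == "__main__":
    sample = ("2012-03-01T00:01:23Z 0516050342 mchannela 20 "
              "b00s123k0:19 b0dfgdfgk1:1")
    for row in flatten_session(sample):
        print(row)
```

The flat shape repeats the session fields on every row, which costs some space but makes per-item grouping and joins simpler than digging into a nested bag.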

Hope this helps.

Will Duckworth  Senior Vice President, Software Engineering  | comScore, Inc.(NASDAQ:SCOR)
o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:[EMAIL PROTECTED]
-----Original Message-----
From: Dan Brickley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 05, 2012 4:21 PM
To: [EMAIL PROTECTED]
Subject: Simple .py custom loader for slightly-nested input?

Cutting this over from #hadoop-pig IRC:

hi Pig people. I have some TV viewing logs in a text format - example http://pastebin.com/raw.php?i=HS4zy2pP - ... unfortunately it has some nesting/list structure, so I can't see a way to read it with an 'out of the box' Pig loader. Is the conventional practice to write a custom loader? (Python? Java? anything?). The actual parsing is quite trivial but I'm unsure how to hook into Pig infrastructure. Ideally it would be a simple linked .py file, not messing around with complex java builds etc.

I found e.g. http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html
(for a Java loader). I hate to sound ungrateful but this is looking a bit heavy, compared to the simplicity of the task. Would a Python loader be simpler? (i.e. just a second .py script alongside my .pig script). I was surprised that I wasn't able to find an example of someone having done this.

Here's the target format, below. Each row is a TV-viewing session, with a channel and total time, followed by a space-separated list of item:minute pairs for a sequence of consecutively viewed items on that channel making up that total.

Thanks for any pointers. I don't mind coding, I just want to find the right framework to plug into...

cheers,

Dan

2012-03-01T00:00:29Z 1360015279 mychannela 0 asdfasdf:0
2012-03-01T00:04:23Z 0728509428 mychannelb 6 bsdf92c1:6
2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1

(fields: timestamp userid channelid total_duration ... then a sequence of {itemid}:{mins} for each item viewed in that session of viewing the channel. These will sum to the total_duration.)
Dan Brickley 2012-07-06, 14:13