Pig >> mail # user >> Simple .py custom loader for slightly-nested input?

Dan Brickley 2012-07-05, 20:21
RE: Simple .py custom loader for slightly-nested input?
I'm not sure what your desired "final output" is, but below is pseudocode showing how I solved a similar problem with Pig and Python.

Use PigStorage with a newline as the delimiter (or whatever you are using to denote a new line) in order to throw Pig a "fakie" and have it load the whole line as a single-field tuple. Records are already split on newlines, so the delimiter never occurs inside a line and nothing gets split further.

tv_in = load '$tv_in_path' using PigStorage('\n') as (line:chararray);

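Pass each line to a Python UDF (the script has to be registered first; the file name udf.py is an assumption here):

register 'udf.py' using jython as udf;
tv_in2 = foreach tv_in generate udf.explode_tv(line);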

That gets the whole line into the python UDF so that you can do your custom parsing.
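
As a rough sketch of that custom parsing step, something like the following would split one line into its fixed fields and its item:minute pairs. It is plain Python kept to syntax that Jython should also accept. The name parse_tv_line is made up for illustration, and the field layout is assumed from the sample in the original question.

```python
def parse_tv_line(line):
    """Split one log line into fixed fields plus a list of (itemid, minutes).

    Assumed layout: timestamp userid channelid total_duration item:mins ...
    """
    parts = line.split()
    if len(parts) < 4:
        return None  # malformed line: signal "no value" to the caller
    # userid stays a string so leading zeros (e.g. 0728509428) survive
    timestamp, userid, channelid = parts[0], parts[1], parts[2]
    total_duration = int(parts[3])
    pairs = []
    for token in parts[4:]:
        # rsplit on the last ':' in case an item id ever contains one
        itemid, minutes = token.rsplit(":", 1)
        pairs.append((itemid, int(minutes)))
    return (timestamp, userid, channelid, total_duration, pairs)
```

The timestamp contains colons of its own, which is why only the tokens after the fourth field are treated as item:minute pairs.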

Since you don't know the total number of item:minute pairs, you are going to have to decide what you want to return.

You could return a bag of item:minute pairs nested inside each record, something like R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemids:bag{iT:tuple(itemid, minutes)})}, or you could create one tuple for each item:minute pair: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemid, minutes)}.
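
To make the two return shapes concrete, here is a hedged sketch of both as Python UDFs. The function names explode_tv_nested and explode_tv_flat and the helper _split_fields are hypothetical, and the field types in the schema strings are assumptions. Pig's Jython engine is expected to supply the outputSchema decorator when the script is registered; the fallback at the top only exists so the file also runs as ordinary Python.

```python
# Fallback for running outside Pig (an assumption for standalone testing):
# it just records the schema string on the function.
if "outputSchema" not in globals():
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


def _split_fields(line):
    """Return ((timestamp, userid, channelid, total), [(itemid, minutes), ...])."""
    parts = line.split()
    head = (parts[0], parts[1], parts[2], int(parts[3]))
    pairs = []
    for token in parts[4:]:
        itemid, minutes = token.rsplit(":", 1)
        pairs.append((itemid, int(minutes)))
    return head, pairs


@outputSchema("R:bag{T:tuple(timestamp:chararray, userid:chararray, "
              "channelid:chararray, total_duration:int, "
              "itemids:bag{iT:tuple(itemid:chararray, minutes:int)})}")
def explode_tv_nested(line):
    # One tuple per session, carrying an inner bag (list) of item tuples.
    head, pairs = _split_fields(line)
    return [head + (pairs,)]


@outputSchema("R:bag{T:tuple(timestamp:chararray, userid:chararray, "
              "channelid:chararray, total_duration:int, "
              "itemid:chararray, minutes:int)}")
def explode_tv_flat(line):
    # One tuple per item:minute pair, repeating the session fields.
    head, pairs = _split_fields(line)
    return [head + pair for pair in pairs]
```

With the flat shape, wrapping the call in Pig's FLATTEN operator should give one output row per viewed item, which tends to be easier to group and aggregate afterwards.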

Hope this helps.

Will Duckworth  Senior Vice President, Software Engineering  | comScore, Inc.(NASDAQ:SCOR)
o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:[EMAIL PROTECTED]

-----Original Message-----
From: Dan Brickley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 05, 2012 4:21 PM
Subject: Simple .py custom loader for slightly-nested input?

Cutting this over from #hadoop-pig IRC:

hi Pig people. I have some TV viewing logs in a text format - example http://pastebin.com/raw.php?i=HS4zy2pP - ... unfortunately it has some nesting/list structure, so I can't see a way to read it with an 'out of the box' Pig loader. Is the conventional practice to write a custom loader? (Python? Java? anything?). The actual parsing is quite trivial but I'm unsure how to hook into Pig infrastructure. Ideally it would be a simple linked .py file, not messing around with complex java builds etc.

I found e.g. http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html
(for a Java loader). I hate to sound ungrateful, but this looks a bit heavy compared to the simplicity of the task. Would a Python loader be simpler (i.e. just a second .py script alongside my .pig script)? I was surprised that I wasn't able to find an example of someone having done this.

Here's the target format, below. Each row is a TV-viewing session, with a channel and total time, followed by a space-separated list of item:minute pairs for the sequence of consecutive items viewed on that channel, which together make up that total.

Thanks for any pointers. I don't mind coding, I just want to find the right framework to plug into...



2012-03-01T00:00:29Z 1360015279 mychannela 0 asdfasdf:0
2012-03-01T00:04:23Z 0728509428 mychannelb 6 bsdf92c1:6
2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1

(fields: timestamp userid channelid total_duration ... then a sequence of {itemid}:{mins} for each item viewed in that session of viewing the channel. These will sum to the total_duration.)
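
The sum property described above could be checked with a small helper along these lines. This is only a sketch: minutes_match_total is a made-up name, and it assumes every token after the fourth field has the form itemid:mins.

```python
def minutes_match_total(line):
    """Return True when the per-item minutes add up to total_duration."""
    parts = line.split()
    total_duration = int(parts[3])
    # Take the number after the last ':' in each item:mins token.
    item_minutes = [int(token.rsplit(":", 1)[1]) for token in parts[4:]]
    return sum(item_minutes) == total_duration
```

Rows that fail the check are probably worth logging or filtering before any further aggregation.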
Dan Brickley 2012-07-06, 14:13