Re: Load a file in a Python UDF
Hi Russell,

This might be a bit late, but here's an example of how you can load a file
in Python and pass the results back to Pig:
https://github.com/mortarcode/python-files

It's a Mortar project, but the Pig script (
https://github.com/mortarcode/python-files/blob/master/pigscripts/python-files.pig)
and Python UDF file (
https://github.com/mortarcode/python-files/blob/master/udfs/python/python-files.py)
should work fine without Mortar as long as you explicitly set the AWS key
parameters in the Pig script and have boto installed.
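
For a rough idea of the shape of such a UDF, here is a minimal sketch. It is
not the code from the repository above: the function names, the
tab-separated "word<TAB>count" file layout, and the output schema are all
assumptions made for illustration. It assumes the boto 2 style S3 API and
the pig_util module that Pig supplies to CPython UDFs.

```python
# Sketch only: names, file layout, and schema are hypothetical, not taken
# from the python-files repository.
try:
    # Supplied by Pig when the file is registered as a CPython UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback so the module can be imported and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


def fetch_from_s3(bucket_name, key_name, access_key, secret_key):
    """Download a small S3 object and return its contents as text."""
    import boto  # imported lazily so the parsing can be tested without boto
    conn = boto.connect_s3(access_key, secret_key)
    key = conn.get_bucket(bucket_name).get_key(key_name)
    if key is None:
        raise IOError("no such key: %s/%s" % (bucket_name, key_name))
    data = key.get_contents_as_string()
    return data.decode("utf-8") if isinstance(data, bytes) else data


def parse_lines(text):
    """Turn 'word<TAB>count' lines into a list of (word, count) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        rows.append((word, int(count)))
    return rows


@outputSchema("rows:bag{t:(word:chararray, count:int)}")
def load_file(bucket_name, key_name, access_key, secret_key,
              fetch=fetch_from_s3):
    """UDF body: fetch the file, parse it, and return a bag of tuples."""
    return parse_lines(fetch(bucket_name, key_name, access_key, secret_key))
```

The AWS keys are passed in as ordinary arguments here, which matches the
point about setting the key parameters explicitly in the Pig script. The
fetch argument exists only so the parsing can be exercised without S3.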

This example uses a small file. If you want to read a larger file, you'll
need to handle the boto/S3 issues with downloading large files, or have
Python read directly from HDFS. I've found S3 actually works pretty well
for small files like this, though. Reading larger files in Python doesn't
work very well because you have to worry about running out of memory when
passing everything back from Python to Java.
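
One possible way to ease that memory pressure, if you only need part of a
larger file, is to filter while reading and return a bounded number of rows
instead of the whole file. This is just a sketch under that assumption: the
function name, the tab-separated layout, and the max_rows cap are
hypothetical and are not part of the example project.

```python
def matching_rows(lines, wanted, max_rows=10000):
    """Stream over lines, keeping only rows whose first field is wanted.

    `lines` can be any iterable of strings (an open file, for example), so
    the full input never has to sit in memory at once.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in wanted:
            rows.append(tuple(fields))
            if len(rows) >= max_rows:
                break  # cap what gets passed back to Pig
    return rows
```

Only the filtered rows are handed back to Java, so the size of the result
is bounded by max_rows rather than by the size of the file.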

  Jeremy Karn / Lead Developer
MORTAR DATA / 519 277 4391 / www.mortardata.com
On Sun, Jul 20, 2014 at 5:14 PM, Russell Jurney <[EMAIL PROTECTED]>
wrote: