Pig, mail # user - Advanced HDFS operations from Python embedded scripts


Re: Advanced HDFS operations from Python embedded scripts
Clément MATHIEU 2013-01-18, 09:12
On 2013-01-17 23:11, Jakub Glapa wrote:

Hi Jakub,

> my pig script is going to produce a set of files that will be an input for
> a different process. The script would be running periodically so the number
> of files would be growing.
> I would like to implement an expiry mechanism where I could remove files
> that are older than x or the number of files has reached some threshold.
>
> I know a crazy way where in bash script you can call "hadoop fs -ls ...",
> parse the output and then execute "rmr" on matching entries.
>
> Is there a "human" way to do this from under python script? Pig.fs()

I had the same issue as you a few months ago. The public Pig scripting
API only exposes an FsShell object, which is too limited for any real
work. However, it is possible to get access to the Hadoop FileSystem
API from a Python script:
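
from org.apache.hadoop.fs import FileSystem
from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
from org.apache.pig.scripting import ScriptPigContext

def get_fs():
    """Return a org.apache.hadoop.fs.FileSystem instance."""
    # Pig scripting API exports a FsShell but not a FileSystem object.
    # Build a Hadoop Configuration from the properties of the running
    # Pig context, then ask Hadoop for the matching FileSystem.
    ctx   = ScriptPigContext.get()
    props = ctx.getPigContext().getProperties()
    conf  = ConfigurationUtil.toConfiguration(props)
    fs    = FileSystem.get(conf)
    return fs
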
Once you have a FileSystem object you can do whatever you want using
the standard Hadoop API.
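
For the expiry mechanism you describe, something along these lines might
work. It is only a minimal sketch, not something I have run against a
cluster: the function names select_expired and expire_files and their
parameters are made up for this example. The only Hadoop calls it relies on
are FileSystem.listStatus(), FileSystem.delete(), FileStatus.getPath() and
FileStatus.getModificationTime(). Modification times are in milliseconds
since the epoch.

```python
def select_expired(entries, now_ms, max_age_ms=None, max_files=None):
    """Pick the paths to delete from (path, mtime_ms) pairs.

    An entry expires when it is older than max_age_ms, or when it falls
    outside the max_files most recent entries. Either limit may be None.
    (Hypothetical helper, not part of Pig or Hadoop.)
    """
    # Newest first, so the entries beyond max_files are the oldest ones.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for rank, (path, mtime_ms) in enumerate(ordered):
        too_old = max_age_ms is not None and now_ms - mtime_ms > max_age_ms
        too_many = max_files is not None and rank >= max_files
        if too_old or too_many:
            expired.append(path)
    return expired


def expire_files(fs, dir_path, now_ms, max_age_ms=None, max_files=None):
    """Delete expired children of dir_path and return their paths.

    fs is a Hadoop FileSystem (e.g. from get_fs()) and dir_path is an
    org.apache.hadoop.fs.Path. (Hypothetical helper.)
    """
    entries = [(st.getPath(), st.getModificationTime())
               for st in fs.listStatus(dir_path)]
    expired = select_expired(entries, now_ms, max_age_ms, max_files)
    for path in expired:
        fs.delete(path, True)  # True = recursive, like "hadoop fs -rmr"
    return expired
```

From the embedded script you would call it with something like
expire_files(get_fs(), Path("/some/output/dir"), int(time.time() * 1000),
max_files=10), after importing Path from org.apache.hadoop.fs and the
standard time module. The directory here is a placeholder. Keeping the
selection separate from the deletion lets you log or check the list before
anything is removed.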
Hope this helps.

-- Clément