Re: Advanced HDFS operations from Python embedded scripts
Clément MATHIEU 2013-01-18, 09:12
On 2013-01-17 23:11, Jakub Glapa wrote:
> my pig script is going to produce a set of files that will be an
> input for a different process. The script would be running
> periodically so the number of files would be growing.
> I would like to implement an expiry mechanism where I could remove
> files that are older than x or when the number of files has reached
> some limit.
> I know a crazy way where in a bash script you can call "hadoop fs -ls",
> parse the output and then execute "rmr" on matching entries.
> Is there a "human" way to do this from under python script? Pig.fs()
I had the same issue as you a few months ago. The public Pig scripting
API only exposes an FsShell object, which is far too limited to do any
real work. However, it is possible to get access to the Hadoop FileSystem
API from a Python script:
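    from org.apache.hadoop.fs import FileSystem
    from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
    from org.apache.pig.scripting import ScriptPigContext

    def get_fs():  # get_fs is an illustrative name
        """Return an org.apache.hadoop.fs.FileSystem instance."""
        # The Pig scripting API exports an FsShell but not a FileSystem object.
        ctx = ScriptPigContext.get()
        props = ctx.getPigContext().getProperties()
        conf = ConfigurationUtil.toConfiguration(props)
        return FileSystem.get(conf)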
Once you have a FileSystem object you can do whatever you want using
the standard Hadoop API.
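For the expiry mechanism you describe, something along these lines might
work. This is only a rough sketch, not code I have run against a cluster:
the names select_expired and expire_files and their parameters are made up
for illustration, and it assumes fs is a FileSystem instance obtained as
above and directory is an org.apache.hadoop.fs.Path. It relies only on
FileSystem.listStatus(), FileStatus.getPath(),
FileStatus.getModificationTime() (milliseconds since the epoch) and
FileSystem.delete(path, recursive).

```python
import time

def select_expired(entries, max_age_ms, max_count, now_ms):
    """Pick the paths to delete from a list of (path, mtime_ms) pairs.

    An entry expires when it is older than max_age_ms, or when it is not
    among the max_count most recently modified entries.
    """
    newest_first = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for rank, (path, mtime_ms) in enumerate(newest_first):
        if now_ms - mtime_ms > max_age_ms or rank >= max_count:
            expired.append(path)
    return expired

def expire_files(fs, directory, max_age_ms, max_count):
    """Delete the expired children of directory and return their paths."""
    # listStatus may return nothing for a missing path, hence the "or []".
    statuses = fs.listStatus(directory) or []
    entries = [(s.getPath(), s.getModificationTime()) for s in statuses]
    now_ms = int(time.time() * 1000)
    expired = select_expired(entries, max_age_ms, max_count, now_ms)
    for path in expired:
        fs.delete(path, True)  # True = recursive, like "hadoop fs -rmr"
    return expired
```

The selection logic is kept separate from the HDFS calls so that it can be
checked without a cluster. From the embedded script you would call
expire_files(fs, Path("/some/output/dir"), max_age_ms, max_count), where
the directory is a placeholder for wherever your Pig script writes.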
Hope this helps.