Re: Advanced HDFS operations from Python embedded scripts
Jakub Glapa 2013-01-18, 10:33
That looks promising, thanks Clément!
On Fri, Jan 18, 2013 at 9:12 AM, Clément MATHIEU <[EMAIL PROTECTED]> wrote:
> On 2013-01-17 23:11, Jakub Glapa wrote:
> Hi Jakub,
>> My Pig script is going to produce a set of files that will be an input for
>> a different process. The script would be running periodically so the number
>> of files would be growing.
>> I would like to implement an expiry mechanism where I could remove files
>> that are older than x, or once the number of files has reached some threshold.
>> I know a crazy way where, in a bash script, you can call "hadoop fs -ls ...",
>> parse the output and then execute "rmr" on matching entries.
>> Is there a "human" way to do this from within a Python script? Pig.fs()
> I had the same issue as you a few months ago. The public Pig scripting API
> only exposes an FsShell object, which is too limited to do any real work.
> However, it is possible to get access to the Hadoop FileSystem API from a
> Python script:
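> from org.apache.hadoop.fs import FileSystem
> from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
> from org.apache.pig.scripting import ScriptPigContext
>
> def get_fs():
>     """Return an org.apache.hadoop.fs.FileSystem instance."""
>     # The Pig scripting API exports an FsShell but not a FileSystem object,
>     # so build one from the properties of the current Pig context.
>     ctx = ScriptPigContext.get()
>     props = ctx.getPigContext().getProperties()
>     conf = ConfigurationUtil.toConfiguration(props)
>     fs = FileSystem.get(conf)
>     return fs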
> Once you have a FileSystem object you can do whatever you want using the
> standard Hadoop API.
> Hope this helps.
> -- Clément
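
For the expiry mechanism asked about in the original question, something along these lines might work on top of the FileSystem object returned by get_fs(). This is only a sketch: the helper names select_expired and expire_files and their parameters are made up for illustration, and only listStatus, getPath, getModificationTime and delete are taken from the Hadoop API. It assumes the directory exists and holds only plain files.

```python
def select_expired(entries, now_ms, max_age_ms, max_files):
    """Return the paths to delete, given (path, mtime_ms) pairs.

    An entry is expired if it is older than max_age_ms, or if it falls
    outside the max_files most recently modified entries.
    (Hypothetical helper, not part of Pig or Hadoop.)
    """
    # Newest first, so everything at index >= max_files is over the limit.
    newest_first = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for index, (path, mtime_ms) in enumerate(newest_first):
        too_old = now_ms - mtime_ms > max_age_ms
        over_limit = index >= max_files
        if too_old or over_limit:
            expired.append(path)
    return expired


def expire_files(fs, directory, now_ms, max_age_ms, max_files):
    """Delete expired files directly under `directory` and return them.

    fs is assumed to be the object returned by get_fs(); directory is
    assumed to be an org.apache.hadoop.fs.Path built by the caller.
    """
    entries = [(status.getPath(), status.getModificationTime())
               for status in fs.listStatus(directory)]
    expired = select_expired(entries, now_ms, max_age_ms, max_files)
    for path in expired:
        # Second argument False: do not delete recursively.
        fs.delete(path, False)
    return expired
```

From the embedded script this could be called as expire_files(get_fs(), Path("/some/output/dir"), int(time.time() * 1000), 7 * 24 * 3600 * 1000, 100), where the path and both limits are placeholders. Keeping the selection logic in its own function means it can be checked with plain tuples, without a cluster.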