Pig >> mail # user >> Advanced HDFS operations from Python embedded scripts


Jakub Glapa 2013-01-17, 22:11
Re: Advanced HDFS operations from Python embedded scripts
On 2013-01-17 23:11, Jakub Glapa wrote:

Hi Jakub,

> my pig script is going to produce a set of files that will be an input for
> a different process. The script would be running periodically so the number
> of files would be growing.
> I would like to implement an expiry mechanism where I could remove files
> that are older than x or once the number of files has reached some
> threshold.
>
> I know a crazy way where in a bash script you can call "hadoop fs -ls ...",
> parse the output and then execute "rmr" on matching entries.
>
> Is there a "human" way to do this from a Python script? Pig.fs()

I had the same issue as you a few months ago. The public Pig scripting
API only exposes an FsShell object, which is too limited for any real
work. However, it is possible to get access to the Hadoop FileSystem
API from a Python script:
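
    from org.apache.hadoop.fs import FileSystem
    from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
    from org.apache.pig.scripting import ScriptPigContext

    def get_fs():
        """Return a org.apache.hadoop.fs.FileSystem instance."""
        # Pig scripting API exports a FsShell but not a FileSystem object.
        # Build a Hadoop Configuration from the running script's Pig
        # properties and ask Hadoop for the matching FileSystem.
        ctx   = ScriptPigContext.get()
        props = ctx.getPigContext().getProperties()
        conf  = ConfigurationUtil.toConfiguration(props)
        fs    = FileSystem.get(conf)
        return fs
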
Once you have a FileSystem object you can do whatever you want using
the standard Hadoop API.
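
For the expiry mechanism you describe, something along these lines might
work. It is only a sketch under a few assumptions: select_expired and
expire_files are names made up for this example, every direct child of
the directory is treated as one output to expire, and the delete is
recursive to mirror "rmr". The only Hadoop calls used are
FileSystem.listStatus, FileSystem.delete, FileStatus.getPath and
FileStatus.getModificationTime.

```python
import time


def select_expired(entries, max_age_ms, max_files, now_ms):
    """Return the paths to delete, given (path, mtime_ms) pairs.

    An entry is expired if it is older than max_age_ms, or if it is
    not among the max_files most recently modified entries.
    """
    newest_first = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for rank, (path, mtime_ms) in enumerate(newest_first):
        too_old = now_ms - mtime_ms > max_age_ms
        over_limit = rank >= max_files
        if too_old or over_limit:
            expired.append(path)
    return expired


def expire_files(fs, directory, max_age_ms, max_files, now_ms=None):
    """Delete expired children of `directory` and return their paths.

    `fs` is expected to be a Hadoop FileSystem and `directory` a
    Hadoop Path (both assumptions of this sketch).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    entries = [(status.getPath(), status.getModificationTime())
               for status in fs.listStatus(directory)]
    expired = select_expired(entries, max_age_ms, max_files, now_ms)
    for path in expired:
        # True means recursive, like "hadoop fs -rmr" (assumption:
        # each child is a whole output that should go away).
        fs.delete(path, True)
    return expired
```

With get_fs() from above and Path imported from org.apache.hadoop.fs,
a call could look like expire_files(get_fs(), Path("/some/output/dir"),
7 * 24 * 3600 * 1000, 50), where the directory name and both limits are
placeholders. Keeping the selection logic separate from the deletion
lets you check which paths would be removed before deleting anything.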
Hope this helps.

-- Clément
Jakub Glapa 2013-01-18, 10:33