Advanced HDFS operations from Python embedded scripts
Jakub Glapa 2013-01-17, 22:11
Hi, my Pig script is going to produce a set of files that will be the input for a different process. The script will run periodically, so the number of files will keep growing. I would like to implement an expiry mechanism where I can remove files that are older than x, or remove files once their number has reached some threshold.
I know a crazy way where, in a bash script, you call "hadoop fs -ls ...", parse the output and then execute "rmr" on the matching entries.
Is there a "human" way to do this from within a Python script? Pig.fs() doesn't come in handy because it doesn't return anything to the script, but maybe I'm missing something? How could I approach this other than by writing a Java program or using the shell? Python looks like a great idea but seems a bit limited, at least in version 0.10.1.
I appreciate any help!
-- regards, Jakub Glapa
Re: Advanced HDFS operations from Python embedded scripts
Clément MATHIEU 2013-01-18, 09:12
On 2013-01-17 23:11, Jakub Glapa wrote:
Hi Jakub,
> my pig script is going to produce a set of files that will be an input for
> a different process. The script would be running periodically so the number
> of files would be growing.
> I would like to implement an expiry mechanism where I could remove files
> that are older than x or the number of files has reached some threshold.
>
> I know a crazy way where in bash script you can call "hadoop fs -ls ...",
> parse the output and then execute "rmr" on matching entries.
>
> Is there a "human" way to do this from under python script? Pig.fs()
I had the same issue as you a few months ago. The public Pig scripting API only exposes an FsShell object, which is far too limited to do any real work. However, it is possible to get access to the Hadoop FileSystem API from a Python script:

    # Imports assumed for the Jython runtime embedded in Pig; adjust the
    # package names if your Pig/Hadoop version differs.
    from org.apache.hadoop.fs import FileSystem
    from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
    from org.apache.pig.scripting import ScriptPigContext

    def get_fs():
        """Return an org.apache.hadoop.fs.FileSystem instance."""
        # The Pig scripting API exports an FsShell but not a FileSystem object.
        ctx = ScriptPigContext.get()
        props = ctx.getPigContext().getProperties()
        conf = ConfigurationUtil.toConfiguration(props)
        fs = FileSystem.get(conf)
        return fs

Once you have a FileSystem object you can do whatever you want using the standard Hadoop API. Hope this helps.
-- Clément
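For the expiry mechanism asked about at the top of the thread, a minimal sketch could look like the following. It is not from the original posters: the function names `select_expired` and `expire` and their parameters are hypothetical, and the only Hadoop calls assumed are `FileSystem.listStatus(Path)`, `FileStatus.getPath()`, `FileStatus.getModificationTime()` (milliseconds since the epoch) and `FileSystem.delete(Path, boolean)`. The selection logic is kept in plain Python so it can be checked without a cluster.

```python
def select_expired(entries, now_ms, max_age_ms=None, max_files=None):
    """Return the paths to delete, given (path, mtime_ms) pairs.

    An entry is selected when it is older than max_age_ms, or when it
    falls outside the max_files newest entries. Both limits are optional.
    """
    # Sort newest first so that the index tells us how many newer entries exist.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for index, (path, mtime_ms) in enumerate(ordered):
        too_old = max_age_ms is not None and now_ms - mtime_ms > max_age_ms
        over_limit = max_files is not None and index >= max_files
        if too_old or over_limit:
            expired.append(path)
    return expired


def expire(fs, directory, now_ms, max_age_ms=None, max_files=None,
           recursive=False):
    """Delete expired children of `directory` and return their paths.

    `fs` is expected to behave like org.apache.hadoop.fs.FileSystem and
    `directory` like org.apache.hadoop.fs.Path (assumption: the caller
    builds both, e.g. fs from a helper such as get_fs()).
    """
    entries = [(status.getPath(), status.getModificationTime())
               for status in fs.listStatus(directory)]
    expired = select_expired(entries, now_ms, max_age_ms, max_files)
    for path in expired:
        # recursive=False refuses to remove non-empty directories.
        fs.delete(path, recursive)
    return expired
```

Inside an embedded Pig script this might be called as `expire(get_fs(), Path(output_dir), int(time.time() * 1000), max_files=10)`, where `output_dir` is whatever directory the script writes to and `Path` is imported from `org.apache.hadoop.fs`. Since Pig stores each output as a directory of part files, `recursive=True` would probably be needed in that case; it is left off by default because a recursive delete is hard to undo.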
Re: Advanced HDFS operations from Python embedded scripts
Jakub Glapa 2013-01-18, 10:33
That looks promising, thanks Clément!
-- regards, pozdrawiam, Jakub Glapa