Pig, mail # user - Advanced HDFS operations from Python embedded scripts


Re: Advanced HDFS operations from Python embedded scripts
Jakub Glapa 2013-01-18, 10:33
that looks promising, thanks Clement!

--
regards,
pozdrawiam,
Jakub Glapa
On Fri, Jan 18, 2013 at 9:12 AM, Clément MATHIEU <[EMAIL PROTECTED]> wrote:

> On 2013-01-17 23:11, Jakub Glapa wrote:
>
> Hi Jakub,
>
>
>> my pig script is going to produce a set of files that will be an input for
>> a different process. The script will run periodically, so the number of
>> files will keep growing.
>> I would like to implement an expiry mechanism where I could remove files
>> that are older than x, or once the number of files has reached some
>> threshold.
>>
>> I know a crazy way where, in a bash script, you call "hadoop fs -ls ...",
>> parse the output and then execute "rmr" on matching entries.
>>
>> Is there a "human" way to do this from within a Python script? Pig.fs()
>>
>
> I had the same issue as you a few months ago. The public Pig scripting API
> only exposes an FsShell object, which is too limited to do any real work.
> However, it is possible to get access to the Hadoop FileSystem API from a
> Python script:
>
>
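> from org.apache.hadoop.fs import FileSystem
> from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
> from org.apache.pig.scripting import ScriptPigContext
>
> def get_fs():
>     """Return an org.apache.hadoop.fs.FileSystem instance."""
>     # The Pig scripting API exports an FsShell but not a FileSystem object.
>     ctx   = ScriptPigContext.get()
>     props = ctx.getPigContext().getProperties()
>     conf  = ConfigurationUtil.toConfiguration(props)
>     fs    = FileSystem.get(conf)
>     return fs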
>
>
> Once you have a FileSystem object you can do whatever you want using the
> standard Hadoop API.
>
>
> Hope this helps.
>
> -- Clément
>
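
For the expiry mechanism asked about in the original question, a minimal sketch built on top of get_fs() might look like the following. The helper names select_expired and expire, their parameters, and the example directory are hypothetical and not part of Pig or Hadoop. The sketch assumes each direct child of the output directory is one unit to expire, and deletes it recursively, like "rmr". The selection logic is kept in plain Python so it can be checked without a cluster.

```python
import time


def select_expired(entries, max_age_ms, max_files, now_ms):
    """Pick the paths to delete from a list of (path, mtime_ms) pairs.

    An entry is expired when it is older than max_age_ms or when it is
    not among the max_files most recently modified entries.
    """
    newest_first = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for rank, (path, mtime_ms) in enumerate(newest_first):
        if rank >= max_files or now_ms - mtime_ms > max_age_ms:
            expired.append(path)
    return expired


def expire(fs, directory, max_age_ms, max_files):
    """Delete expired children of `directory` using a Hadoop FileSystem.

    Hypothetical helper: `fs` is the object returned by get_fs().
    """
    # Imported here because it only resolves under Jython with Hadoop
    # on the classpath, which is the case inside an embedded Pig script.
    from org.apache.hadoop.fs import Path

    statuses = fs.listStatus(Path(directory))
    entries = [(s.getPath(), s.getModificationTime()) for s in statuses]
    now_ms = int(time.time() * 1000)
    for path in select_expired(entries, max_age_ms, max_files, now_ms):
        # Assumption: recursive delete, since a Pig STORE output is
        # usually a directory of part files.
        fs.delete(path, True)


# Example call from an embedded script (the directory is made up):
# expire(get_fs(), "/data/out", 7 * 24 * 3600 * 1000, 50)
```

Keeping the decision about what to delete separate from the FileSystem calls makes it easy to dry-run the policy, for example by printing the result of select_expired before enabling the delete.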