Pig >> mail # user >> Don't process already processed files?


Re: Don't process already processed files?
It's probably less work to have a script control Pig execution, keep track
of what's been processed, and pass an input path to your Pig script
dynamically.  For example, you could create a control.py/rb/sh file that
keeps track of what's been processed (maybe in a simple file) and then
figures out the input path to pass to Pig during execution via a parameter:
pig --param inputpath="/some/dynamic/input/path/for/pig".

You'd then set up your cron job to run your control script instead of the
Pig script directly.
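
A rough sketch of what such a control.py could look like, not a tested
production script. The function names, the state-file layout (one processed
path per line), and the command-line arguments are all made up for
illustration. It assumes the `hadoop` and `pig` commands are on the PATH,
that file paths contain no whitespace, and that the loader in the Pig script
accepts a comma-separated list of paths. It uses the `-param` form of the
parameter flag from Pig's parameter-substitution documentation.

```python
#!/usr/bin/env python3
"""control.py: hypothetical wrapper that hands only unprocessed files to Pig."""
import subprocess
import sys
from pathlib import Path


def load_processed(state_file):
    """Return the set of paths recorded in the state file (empty if missing)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def parse_ls_output(text):
    """Extract file paths from `hadoop fs -ls` output, skipping directories."""
    files = []
    for line in text.splitlines():
        fields = line.split()
        # File entries start with '-' in the permissions column; 'd' is a directory.
        if len(fields) >= 8 and fields[0].startswith("-"):
            files.append(fields[-1])  # assumption: no whitespace in paths
    return files


def select_new(all_files, processed):
    """Return the files that have not been processed yet, in sorted order."""
    return sorted(set(all_files) - set(processed))


def build_pig_command(pig_script, new_files):
    """Build the pig invocation with the new files joined into one parameter."""
    return ["pig", "-param", "inputpath=" + ",".join(new_files), pig_script]


def mark_processed(state_file, files):
    """Append newly processed paths to the state file."""
    with open(state_file, "a") as handle:
        for name in files:
            handle.write(name + "\n")


def main(input_dir, pig_script, state_file):
    listing = subprocess.check_output(["hadoop", "fs", "-ls", input_dir], text=True)
    new_files = select_new(parse_ls_output(listing), load_processed(state_file))
    if not new_files:
        return  # nothing new since the last run
    # check_call raises if Pig fails, so files are only recorded after success.
    subprocess.check_call(build_pig_command(pig_script, new_files))
    mark_processed(state_file, new_files)


if __name__ == "__main__":
    main(*sys.argv[1:4])
```

The Pig script would then read its input with something like
LOAD '$inputpath'. Since several Pig scripts run against the same directory,
each one would need its own state file so that they track their progress
independently.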

On Wed, Mar 27, 2013 at 6:24 AM, John Farrelly <[EMAIL PROTECTED]> wrote:

> Hi there,
>
> In our system, we have multiple pig scripts that run against a particular
> HDFS directory.  The pig scripts can run at different times, and are
> scheduled to run regularly.  Is there a way to point a pig script at the
> same directory for multiple executions, but make sure that it only
> processes new files that it hasn't seen before?  I was thinking of using a
> custom PathFilter for my loader, but I thought I would ask to see if there
> is already a way to do this, rather than me reinventing the wheel (!).
>
> Thanks,
> John.
>

--
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: [EMAIL PROTECTED]