Re: Don't process already processed files?
It's probably less work to have a control script drive Pig execution: the
script keeps track of what has been processed and passes an input path to
your Pig script dynamically.  For example, you could create a control.py,
control.rb, or control.sh file that records what has been processed (maybe
in a simple file) and then works out the input path to pass to Pig at
execution time via a parameter: pig --param

You'd then set up your cron job to run your control script instead of the
Pig script directly.
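
The control-script idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested production script: the state file name, the HDFS directory, the Pig script name, and the parameter name "input" are all hypothetical, and the Pig script is assumed to do something like LOAD '$input'. It uses -param, the parameter-substitution flag described in the Pig documentation, and assumes the loader accepts a comma-separated list of paths.

```python
#!/usr/bin/env python3
"""Hypothetical control script: run Pig only on files not yet processed."""
import subprocess
import sys

STATE_FILE = "processed_files.txt"  # hypothetical local bookkeeping file
INPUT_DIR = "/data/incoming"        # hypothetical HDFS directory
PIG_SCRIPT = "process.pig"          # hypothetical script using LOAD '$input'


def load_processed(state_file):
    """Return the set of paths recorded by earlier runs (empty on first run)."""
    try:
        with open(state_file) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()


def list_input_files(input_dir):
    """List plain files in an HDFS directory by parsing `hadoop fs -ls`."""
    out = subprocess.check_output(["hadoop", "fs", "-ls", input_dir], text=True)
    # File entries start with '-' in the permissions column; the path is the
    # last column. Assumes paths contain no whitespace.
    return [line.split()[-1] for line in out.splitlines() if line.startswith("-")]


def select_new(all_files, processed):
    """Return, in sorted order, the files that no earlier run has recorded."""
    return sorted(f for f in all_files if f not in processed)


def build_pig_command(pig_script, new_files):
    """Build the pig invocation, passing the new files as one parameter."""
    return ["pig", "-param", "input=" + ",".join(new_files), pig_script]


def record_processed(state_file, files):
    """Append the given paths to the state file, one per line."""
    with open(state_file, "a") as f:
        for path in files:
            f.write(path + "\n")


def main():
    new_files = select_new(list_input_files(INPUT_DIR), load_processed(STATE_FILE))
    if not new_files:
        return 0  # nothing new; skip the Pig run entirely
    status = subprocess.call(build_pig_command(PIG_SCRIPT, new_files))
    if status == 0:
        # Record files only after Pig succeeds, so a failed run is retried.
        record_processed(STATE_FILE, new_files)
    return status


if __name__ == "__main__":
    sys.exit(main())
```

If several Pig scripts read the same directory on different schedules, each would need its own state file, since each has seen a different subset of the files.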
On Wed, Mar 27, 2013 at 6:24 AM, John Farrelly wrote:

> Hi there,
> In our system, we have multiple pig scripts that run against a particular
> HDFS directory.  The pig scripts can run at different times, and are
> scheduled to run regularly.  Is there a way to point a pig script at the
> same directory for multiple executions, but make sure that it only
> processes new files that it hasn't seen before?  I was thinking of using a
> custom PathFilter for my loader, but I thought I would ask to see if there
> is already a way to do this, rather than me reinventing the wheel (!).
> Thanks,
> John.

Mike Sukmanowsky