Pig >> mail # user >> Don't process already processed files?


Re: Don't process already processed files?
Yes, the state of what files have been processed needs to be tracked
outside of the script somehow. Two other approaches come to mind as well:

- Use the HDFS file system as a work queue. For example, move files from
/incoming to /processed after processing them.
- Put files in a time-partitioned directory and run your jobs for explicit
time intervals. This approach is more common.
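
A rough sketch of both ideas in Python, under assumptions of mine rather than anything Pig ships: the /data/events base path, the hourly YYYY/MM/DD/HH layout and the helper names are all hypothetical. The helpers only build paths and `hadoop fs -mv` argument lists, so nothing touches HDFS until you hand the commands to something like subprocess.check_call.

```python
from datetime import datetime, timedelta
import posixpath


def hourly_partitions(base, start, end):
    """Return one directory per hour in [start, end), e.g. base/2013/03/27/14.

    The YYYY/MM/DD/HH layout is an assumed convention, not a Pig requirement.
    """
    paths = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        paths.append(posixpath.join(base, current.strftime("%Y/%m/%d/%H")))
        current += timedelta(hours=1)
    return paths


def archive_commands(processed_files, processed_dir="/processed"):
    """Build `hadoop fs -mv` argv lists that move finished inputs out of the queue."""
    return [
        ["hadoop", "fs", "-mv", src,
         posixpath.join(processed_dir, posixpath.basename(src))]
        for src in processed_files
    ]


# Time-partitioned approach: the job for 06:00-08:00 reads exactly two directories.
inputs = hourly_partitions("/data/events",
                           datetime(2013, 3, 27, 6),
                           datetime(2013, 3, 27, 8))

# Work-queue approach: commands to run only after the Pig job has succeeded.
moves = archive_commands(["/incoming/a.log", "/incoming/b.log"])
```

With the work-queue variant, run the move commands only after Pig exits successfully, so a failed run leaves its inputs in /incoming to be retried. With the time-partitioned variant, the list of directories can be handed to the Pig script as a parameter.
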
On Wed, Mar 27, 2013 at 7:30 AM, John Farrelly <
[EMAIL PROTECTED]> wrote:

> Thanks Mike.  That's what I was thinking, but I was wondering if (hoping!)
> there was something already to do it :)
>
> Thanks,
> John.
>
> -----Original Message-----
> From: Mike Sukmanowsky [mailto:[EMAIL PROTECTED]]
> Sent: 27 March 2013 14:05
> To: [EMAIL PROTECTED]
> Subject: Re: Don't process already processed files?
>
> It's probably less work to have some kind of a script control Pig
> execution and keep track of what's been processed and pass in an input path
> to your Pig script dynamically.  For example, you could create a
> control.py/rb/sh file which would somehow keep track of what's been
> processed (maybe a simple file) and then figure out the input path to pass
> to pig during execution via a parameter: pig --param
> inputpath="/some/dynamic/input/path/for/pig".
>
> You'd then set up your cron job to run your control script instead of the
> Pig script directly.
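
A control script along those lines could look roughly like the following. It is only a sketch: the manifest file, the function names and the job.pig script name are made up for illustration, and the list of candidate files is assumed to come from wherever you list the input directory (for example by parsing `hadoop fs -ls` output). It uses the documented `-param` form of the flag.

```python
import os
import subprocess


def load_processed(manifest_path):
    """Return the set of input paths recorded by earlier runs."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def select_new(candidates, processed):
    """Keep only the paths that no earlier run has recorded."""
    return sorted(set(candidates) - processed)


def build_pig_command(script, inputs):
    """Pass the new inputs to Pig as a single comma-separated parameter."""
    return ["pig", "-param", "inputpath=" + ",".join(inputs), script]


def record_processed(manifest_path, inputs):
    """Append newly processed paths to the manifest, one per line."""
    with open(manifest_path, "a") as handle:
        for path in inputs:
            handle.write(path + "\n")


def run_once(script, manifest_path, candidates, runner=subprocess.check_call):
    """Run Pig on unseen inputs; update the manifest only if Pig succeeds."""
    new_inputs = select_new(candidates, load_processed(manifest_path))
    if not new_inputs:
        return []  # nothing new, so do not start Pig at all
    runner(build_pig_command(script, new_inputs))  # check_call raises on failure
    record_processed(manifest_path, new_inputs)
    return new_inputs
```

Since several Pig scripts read the same directory on different schedules, each script would need its own manifest file; a shared one would make the second script skip files that only the first has seen.
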
>
>
> On Wed, Mar 27, 2013 at 6:24 AM, John Farrelly <
> [EMAIL PROTECTED]> wrote:
>
> > Hi there,
> >
> > In our system, we have multiple pig scripts that run against a
> > particular HDFS directory.  The pig scripts can run at different
> > times, and are scheduled to run regularly.  Is there a way to point a
> > pig script at the same directory for multiple executions, but make
> > sure that it only processes new files that it hasn't seen before?  I
> > was thinking of using a custom PathFilter for my loader, but I thought
> > I would ask to see if there is already a way to do this, rather than me
> > reinventing the wheel (!).
> >
> > Thanks,
> > John.
> > ****************************************************************
> > This email and any files transmitted with are confidential and intended
> > solely for the use of the individual or entity to whom they are
> > addressed.  If you have received this email in error then please delete
> > it and notify the sender. Do not make a copy or forward it to anyone.
> > This footnote also confirms that this email message has been swept for
> > the presence of computer viruses.
> >
> > Adaptive Mobile Security Ltd, Ferry House, 48 Lower Mount Street,
> > Dublin 2, Ireland
> > Directors: B. Collins, G. Maclachlan (UK), N. Grierson (UK), J. Ennis
> > (UK), D. Summers (UK).
> > Registered in Ireland, Company No. 370343, VAT Reg.No.IE6390343O
> > ****************************************************************
> >
>
>
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: [EMAIL PROTECTED]
>
>
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*