Re: how to run jobs every 30 minutes?
edward choi 2010-12-16, 06:36
The first recommendation (gluing all my command line apps together) is what I
am going to go with for now.
The other options you mentioned are out of my league at the moment, since I am
quite new to the Java world, not to mention JRuby, Groovy, Jython, etc.
But when I get comfortable with the environment and start to look for more
options, I'll refer back to your message. Thanks for the advanced info :-)
2010/12/15 Chris K Wensel <[EMAIL PROTECTED]>
> I see it this way.
> You can glue a bunch of discrete command line apps together, ones that may or
> may not have dependencies between one another, in a new syntax. Which is darn
> nice if you already have a bunch of discrete, ready-to-run command line apps
> sitting around that need to be strung together and that can't be used as
> libraries and instantiated through their APIs.
> Or, you can string all your work together through the APIs with a
> Turing-complete language and run it all from a single command line interface
> (and hand that to cron, or some other tool).
> In this case you can use Java, or easier languages like JRuby, Groovy,
> Jython, Clojure, etc., which were designed for this purpose. (They don't run
> on the cluster; they only run on the Hadoop client side.)
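> For instance, a bare-bones driver along these lines (the job names, paths,
> and the two-job chain are made up for illustration, and the mapper/reducer
> setup is omitted):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
>   public class ChainDriver {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>
>       // first job: e.g. fetch/clean the raw input
>       Job first = new Job(conf, "first-step");
>       first.setJarByClass(ChainDriver.class);
>       FileInputFormat.addInputPath(first, new Path(args[0]));
>       FileOutputFormat.setOutputPath(first, new Path(args[1] + "/step1"));
>       if (!first.waitForCompletion(true))
>         System.exit(1);
>
>       // second job: consumes the first job's output, so it runs after it
>       Job second = new Job(conf, "second-step");
>       second.setJarByClass(ChainDriver.class);
>       FileInputFormat.addInputPath(second, new Path(args[1] + "/step1"));
>       FileOutputFormat.setOutputPath(second, new Path(args[1] + "/final"));
>       System.exit(second.waitForCompletion(true) ? 0 : 1);
>     }
>   }
>
> One class, one main(), and cron only ever has to invoke a single command.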
> Think Ant vs. Gradle (or any other build tool that uses a scripting language
> instead of a configuration file) if you want a concrete example.
> Cascading itself is a query API (and query planner). But it also gives you
> the ability to run discrete 'processes' in dependency order: either Cascading
> (Hadoop) Flows or Riffle-annotated process objects. They can all be
> intermingled and managed by the same dependency scheduler (Cascading has one,
> and Riffle has one).
> So you can run Flow -> Mahout -> Pig -> Mahout -> Flow -> shell ->
> whattheheckever from the same application.
> Cascading also has the ability to only run 'stale' processes. Think 'make'
> file. When re-running a job where only one file of many has changed, this is
> a big win.
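> A rough sketch of what that looks like (the Flow objects themselves are
> assumed to be built elsewhere via a FlowConnector; only the scheduling bit is
> shown here):
>
>   import cascading.cascade.Cascade;
>   import cascading.cascade.CascadeConnector;
>   import cascading.flow.Flow;
>
>   public class RunFlows {
>     // connect() works out the run order from the flows' source/sink taps;
>     // complete() then runs them, skipping any flow whose sink is already
>     // up to date -- the 'make'-like staleness check mentioned above
>     public static void run(Flow cleanFlow, Flow joinFlow) {
>       Cascade cascade = new CascadeConnector().connect(cleanFlow, joinFlow);
>       cascade.complete();
>     }
>   }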
> I personally like parameterizing my applications via the command line and
> letting my CLI options drive the workflows. For example, my testing,
> integration, and production environments are quite different, so it's very
> easy to drive specific runs of the jobs by changing a CLI arg. (args4j makes
> this darn simple.)
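> Something like this, say (the option names and defaults are invented, but the
> args4j annotation/parser pattern is the real one):
>
>   import org.kohsuke.args4j.CmdLineParser;
>   import org.kohsuke.args4j.Option;
>
>   public class JobOptions {
>     @Option(name = "-env", usage = "testing | integration | production")
>     String env = "testing";
>
>     @Option(name = "-input", usage = "input location (path or JDBC URL)")
>     String input = "hdfs:///data/in";
>
>     public static void main(String[] args) throws Exception {
>       JobOptions opts = new JobOptions();
>       new CmdLineParser(opts).parseArgument(args);  // populates the fields
>       // hand opts.env / opts.input to whatever builds and runs the workflow
>       System.out.println("running in " + opts.env + " against " + opts.input);
>     }
>   }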
> If I were chaining multiple CLI apps into a bigger production app, I suspect
> parameterizing that would be error-prone, especially if the input/output data
> points (JDBC vs. file) differ between contexts.
> You can find Riffle here: https://github.com/cwensel/riffle (it's
> Apache-licensed; contributions welcome).
> On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote:
> > Ed,
> > Actually Oozie is quite different from Cascading.
> > * Cascading allows you to write 'queries' using a Java API; they get
> > translated into MR jobs.
> > * Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a
> > DAG (workflow jobs) and has timer + data-dependency triggers (coordinator
> > jobs); see the sketch below.
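> > For example, a coordinator definition is roughly this shape (the name,
> > path, and dates are made up; check the Oozie docs for the exact schema):
> >
> >   <coordinator-app name="every-30-min" frequency="${coord:minutes(30)}"
> >                    start="2010-12-16T00:00Z" end="2011-12-16T00:00Z"
> >                    timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
> >     <action>
> >       <workflow>
> >         <!-- the workflow app (workflow.xml) this coordinator triggers -->
> >         <app-path>hdfs://namenode/apps/my-crawl-wf</app-path>
> >       </workflow>
> >     </action>
> >   </coordinator-app>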
> > Regards.
> > Alejandro
> > On Tue, Dec 14, 2010 at 1:26 PM, edward choi <[EMAIL PROTECTED]> wrote:
> >> Thanks for the tip. I took a look at it.
> >> Looks similar to Cascading I guess...?
> >> Anyway thanks for the info!!
> >> Ed
> >> 2010/12/8 Alejandro Abdelnur <[EMAIL PROTECTED]>
> >>> Or, if you want to do it in a reliable way, you could use an Oozie
> >>> coordinator job.
> >>> On Wed, Dec 8, 2010 at 1:53 PM, edward choi <[EMAIL PROTECTED]> wrote:
> >>>> My mistake. Come to think of it, you are right: I can just put an
> >>>> infinite loop inside the Hadoop application.
> >>>> Thanks for the reply.
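> >>>> Something like this is what I have in mind (a bare-bones sketch, with no
> >>>> error handling; runCrawlJob() is just a placeholder for the usual job
> >>>> setup and submission):
> >>>>
> >>>>   // runs on the client machine: submit the job, then sleep 30 minutes
> >>>>   public class CrawlLoop {
> >>>>     public static void main(String[] args) throws Exception {
> >>>>       while (true) {
> >>>>         runCrawlJob();                    // build, submit, wait for the MR job
> >>>>         Thread.sleep(30L * 60L * 1000L);  // 30 minutes between runs
> >>>>       }
> >>>>     }
> >>>>
> >>>>     static void runCrawlJob() {
> >>>>       // placeholder: configure and run the crawl Job here
> >>>>     }
> >>>>   }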
> >>>> 2010/12/7 Harsh J <[EMAIL PROTECTED]>
> >>>>> Hi,
> >>>>> On Tue, Dec 7, 2010 at 2:25 PM, edward choi <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi,
> >>>>>> I'm planning to crawl a certain web site every 30 minutes.
> >>>>>> How would I get it done in Hadoop?